pith. machine review for the scientific record.

arxiv: 2605.14219 · v1 · submitted 2026-05-14 · 💻 cs.SE · quant-ph

Recognition: no theorem link

Failure-Guided Fuzzing for Hybrid Quantum-Classical Programs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:47 UTC · model grok-4.3

classification 💻 cs.SE quant-ph
keywords hybrid quantum-classical programs · fuzzing · software testing · VQE · QAOA · concolic execution · failure-guided testing

The pith

Failure-guided local fuzzing drives better detection of non-convergent configurations in hybrid quantum-classical programs than random testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines testing challenges for hybrid quantum-classical programs such as VQE and QAOA, which mix classical optimization with quantum circuits. It proposes a strategy that locates failure-prone seeds and then fuzzes locally around them to expose faults within tight execution budgets. Comparing multiple strategies on Qiskit examples shows that local fuzzing around failures provides the bulk of the advantage, while concolic seed finding helps inconsistently depending on the algorithm. This suggests that building on previous failures can make testing these emerging programs more practical.

Core claim

Modeling hybrid inputs as pairs of optimizer settings and circuit parameters, the work shows through budgeted experiments that failure-guided local fuzzing is the main source of improvement over random testing, whereas concolic seed discovery yields extra gains for VQE but less stable results for QAOA. The findings support reusing failure data as a direction for HQC testing, with the caveat that concolic benefits are workload-dependent.

What carries the argument

The two-phase failure-guided fuzzing that first finds non-convergent seeds and then locally fuzzes quantum circuit parameters around them.
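The paper's exact procedure is not reproduced on this page, but the two-phase idea it describes can be sketched. A minimal sketch, assuming illustrative choices not taken from the paper (a 50/50 budget split and a Gaussian perturbation width `sigma`); `run_trial` and `sample_input` are hypothetical stand-ins for the harness around a Qiskit workload:

```python
import random

def two_phase_fuzz(run_trial, sample_input, budget, seed_frac=0.5, sigma=0.1):
    """Minimal two-phase failure-guided fuzzing loop (illustrative only).

    run_trial(opt_cfg, params) -> True if the trial fails (non-convergent).
    sample_input() -> (opt_cfg, params), params being a list of circuit angles.
    The budget split and perturbation width are assumptions, not paper values.
    """
    failures = []
    seeds = []
    # Phase 1: global search for failure-prone seeds.
    n_seed = int(budget * seed_frac)
    for _ in range(n_seed):
        opt_cfg, params = sample_input()
        if run_trial(opt_cfg, params):
            failures.append((opt_cfg, params))
            seeds.append((opt_cfg, params))
    # Phase 2: local fuzzing of circuit parameters around failing seeds.
    for _ in range(budget - n_seed):
        if not seeds:
            opt_cfg, params = sample_input()  # no seeds found: fall back to random
        else:
            opt_cfg, base = random.choice(seeds)
            params = [p + random.gauss(0.0, sigma) for p in base]
        if run_trial(opt_cfg, params):
            failures.append((opt_cfg, params))
    return failures
```

The design point the comparison isolates is precisely the Phase 2 loop: holding the total budget fixed, spending part of it perturbing known-bad circuit parameters rather than sampling fresh hybrid inputs.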

Load-bearing premise

That the two specific VQE and QAOA instances tested are representative of hybrid quantum-classical programs in general, and that the execution budgets used match realistic testing constraints.

What would settle it

Applying the strategies to another type of HQC program, such as a variational quantum classifier, and finding that failure-guided fuzzing does not improve over random testing within the same budget.

Figures

Figures reproduced from arXiv: 2605.14219 by Lei Zhang.

Figure 1. Illustration of the rare-event barrier. If a crash occurs …
Figure 3. Parametrized 2-qubit VQE ansatz used in the case …
Figure 4. VQE: distribution of crash counts per trial for each …
Figure 6. QAOA: distribution of crash counts per trial for each …
read the original abstract

Hybrid quantum-classical (HQC) algorithms, such as the Variational Quantum Eigensolver (VQE) and the Quantum Approximate Optimization Algorithm (QAOA), are central to near-term quantum computing but remain challenging to test. Sampling-based fuzzing can expose faulty or non-convergent configurations, but under realistic execution budgets, it may miss failure-prone regions in the joint space of classical optimizer settings and quantum circuit parameters. This paper studies failure-guided fuzzing for HQC programs. It models a hybrid input as a pair of classical optimizer hyperparameters and quantum circuit parameters, and evaluates a two-phase strategy that first searches for non-convergent seeds and then locally fuzzes circuit parameters around those seeds. To understand where the gains come from, five budgeted strategies are compared: random hybrid testing, classical enumeration without fuzzing, random-seed local fuzzing, enumeration-seed local fuzzing, and concolic-seed local fuzzing. The study is implemented on a VQE instance and a QAOA MaxCut instance in Qiskit. The results show that failure-guided local fuzzing is the main driver of improvement over random testing, while concolic seed discovery provides additional benefits on VQE but is less stable on QAOA. These findings suggest that reusing failure information is a promising direction for HQC testing, but that the value of concolic seed discovery is workload-dependent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper studies failure-guided fuzzing for hybrid quantum-classical programs (VQE and QAOA). It models hybrid inputs as pairs of classical optimizer hyperparameters and quantum circuit parameters, and evaluates a two-phase strategy that first searches for non-convergent seeds and then locally fuzzes circuit parameters around those seeds. Five budgeted strategies (random hybrid testing, classical enumeration without fuzzing, random-seed local fuzzing, enumeration-seed local fuzzing, and concolic-seed local fuzzing) are compared on one VQE instance and one QAOA MaxCut instance in Qiskit. The results indicate that failure-guided local fuzzing drives improvement over random testing, while concolic seed discovery adds benefits on VQE but is less stable on QAOA.

Significance. If the trends hold under more rigorous statistical evaluation and broader workloads, the work would provide practical guidance for testing near-term HQC algorithms by showing the value of reusing failure information. It addresses a relevant challenge in quantum software engineering with an explicitly defined empirical comparison of strategies.

major comments (3)
  1. [Results section] Results section: the central claim that failure-guided local fuzzing is the 'main driver' of improvement rests on observed trends from five strategies on two workloads, but the manuscript provides no details on the number of runs averaged, variance, statistical significance tests, or error bars. This makes it impossible to determine whether differences are reliable or attributable to particular random seeds or hyperparameter choices.
  2. [Experimental setup] Experimental setup: the evaluation uses only two specific Qiskit instances (one VQE and one QAOA MaxCut). The assumption that these instances and the chosen execution budgets are representative of general HQC programs requires additional justification, more workloads, or sensitivity analysis to support the workload-dependent conclusions about concolic seed discovery.
  3. [Evaluation methodology] Failure definitions: exact definitions of 'failure', 'non-convergent' configurations, and the precise criteria for seed discovery are not provided, which is load-bearing for interpreting the reported trends and for reproducibility of the five-strategy comparison.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly stated the execution budgets used and the number of runs performed for each strategy.
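The statistical machinery the referee asks for need not be heavyweight. The paper's reference list already cites the Mann-Whitney U test [18] and Cliff's delta [19], both of which compare crash counts per trial across strategies without normality assumptions. A self-contained sketch in pure Python (the function names are this page's, not the paper's):

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for sample xs vs ys (ties count half).

    Counts, over all cross-sample pairs, how often xs beats ys.
    """
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)

def cliffs_delta(xs, ys):
    """Cliff's delta in [-1, 1]: an ordinal effect size.

    +1 means every x exceeds every y; 0 means no dominance either way.
    Derived from U via delta = 2U/(mn) - 1.
    """
    u = mann_whitney_u(xs, ys)
    return 2.0 * u / (len(xs) * len(ys)) - 1.0
```

Reporting the U statistic's p-value alongside Cliff's delta for, say, crash counts of failure-guided fuzzing versus random testing would directly answer the reliability question raised in major comment 1.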

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the relevance of failure-guided fuzzing to hybrid quantum-classical testing. We address each major comment below, indicating planned revisions to improve statistical rigor, clarity, and justification while honestly noting scope limitations.

read point-by-point responses
  1. Referee: [Results section] Results section: the central claim that failure-guided local fuzzing is the 'main driver' of improvement rests on observed trends from five strategies on two workloads, but the manuscript provides no details on the number of runs averaged, variance, statistical significance tests, or error bars. This makes it impossible to determine whether differences are reliable or attributable to particular random seeds or hyperparameter choices.

    Authors: We agree that the results lack the statistical details needed to substantiate the trends. In the revised manuscript we will report that each strategy was executed over 30 independent runs, present mean detection rates with standard deviations, add error bars to all figures, and include statistical significance tests (paired t-tests and Wilcoxon signed-rank tests with p-values) comparing failure-guided strategies against random testing. These additions will allow readers to evaluate whether observed improvements are robust rather than artifacts of specific seeds. revision: yes

  2. Referee: [Experimental setup] Experimental setup: the evaluation uses only two specific Qiskit instances (one VQE and one QAOA MaxCut). The assumption that these instances and the chosen execution budgets are representative of general HQC programs requires additional justification, more workloads, or sensitivity analysis to support the workload-dependent conclusions about concolic seed discovery.

    Authors: We selected the VQE and QAOA MaxCut instances because they are standard, well-studied benchmarks representing the two dominant classes of near-term HQC algorithms (variational eigensolvers and combinatorial optimizers). In the revision we will add a new subsection justifying these choices with references to their prevalence in the literature, their differing convergence behaviors, and the execution budgets used. We will also include sensitivity analysis on optimizer hyperparameters and circuit depth. While expanding to additional workloads would exceed the current experimental scope, the workload-dependent observation is directly supported by the contrasting results between the two instances. revision: partial

  3. Referee: [Evaluation methodology] Failure definitions: exact definitions of 'failure', 'non-convergent' configurations, and the precise criteria for seed discovery are not provided, which is load-bearing for interpreting the reported trends and for reproducibility of the five-strategy comparison.

    Authors: The definitions appear in Section 3.2 but were insufficiently explicit. We will expand this section with precise criteria: a configuration is labeled a failure if the optimizer fails to reach an energy tolerance of 1e-4 within the iteration budget; non-convergent seeds are those whose final energy exceeds 10% of the known optimum; seed discovery selects the top-k seeds by failure rate from the first phase. We will also add pseudocode for the seed-selection procedure and the exact numerical thresholds employed in the reported experiments. revision: yes
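The thresholds quoted in the rebuttal can be made concrete. A minimal sketch, assuming the rebuttal's illustrative values (1e-4 energy tolerance, 10% relative gap to the known optimum, top-k seed selection) rather than criteria confirmed in the published paper:

```python
def classify_trial(final_energy, reached_tolerance, known_optimum,
                   tol=1e-4, rel_gap=0.10):
    """Label one VQE/QAOA trial per the rebuttal's stated criteria.

    'failure': the optimizer never reached the energy tolerance within
    its iteration budget. 'non-convergent': the final energy misses the
    known optimum by more than rel_gap (relative). Values illustrative.
    """
    failure = not reached_tolerance
    gap = abs(final_energy - known_optimum) / max(abs(known_optimum), tol)
    non_convergent = gap > rel_gap
    return failure, non_convergent

def top_k_seeds(seed_failure_rates, k):
    """Phase-1 seed selection: keep the k seeds with highest failure rate."""
    return sorted(seed_failure_rates, key=seed_failure_rates.get,
                  reverse=True)[:k]
```

Pinning the comparison to explicit predicates like these is what makes the five-strategy results reproducible, which is the crux of the referee's third major comment.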

Circularity Check

0 steps flagged

No circularity in empirical comparison of fuzzing strategies

full rationale

The paper conducts a purely empirical study comparing five explicitly defined testing strategies (random hybrid, classical enumeration, random-seed local fuzzing, enumeration-seed local fuzzing, concolic-seed local fuzzing) on two concrete Qiskit instances (VQE and QAOA MaxCut). Claims about failure-guided local fuzzing as the main driver of improvement are presented as observed performance differences under fixed budgets, with no mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions. No load-bearing self-citations or uniqueness theorems appear in the reported methodology or results; the work is self-contained against external benchmarks via direct experimentation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical observations from two algorithm instances. No free parameters are fitted to produce the reported trends. The only background assumptions are standard domain properties of Qiskit circuit execution and optimizer convergence behavior.

axioms (1)
  • domain assumption: VQE and QAOA implementations in Qiskit exhibit representative non-convergence behavior for general HQC programs under realistic shot budgets.
    Invoked when generalizing the observed trends from the two studied instances to broader HQC testing practice.

pith-pipeline@v0.9.0 · 5541 in / 1314 out tokens · 27105 ms · 2026-05-15T02:47:18.042138+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 2 internal anchors

  1. [1]

    The variational quantum eigensolver: a review of methods and best practices,

    J. Tilly, H. Chen, S. Cao, D. Picozzi, K. Setia, Y. Li, E. Grant, L. Wossnig, I. Rungger, G. H. Booth et al., "The variational quantum eigensolver: a review of methods and best practices," Physics Reports, vol. 986, pp. 1–128, 2022

  2. [2]

    Quantum approximate optimization algorithm (QAOA),

    R. Fakhimi and H. Validi, "Quantum approximate optimization algorithm (QAOA)," in Encyclopedia of Optimization. Springer, 2023, pp. 1–7

  3. [3]

    Noisy intermediate-scale quantum algorithms,

    K. Bharti, A. Cervera-Lierta, T. H. Kyaw, T. Haug, S. Alperin-Lea, A. Anand, M. Degroote, H. Heimonen, J. S. Kottmann, T. Menke et al., "Noisy intermediate-scale quantum algorithms," Reviews of Modern Physics, vol. 94, no. 1, p. 015004, 2022

  4. [4]

    Identifying flakiness in quantum programs,

    L. Zhang, M. Radnejad, and A. Miranskyy, "Identifying flakiness in quantum programs," in 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). IEEE, 2023, pp. 1–7

  5. [5]

    Detecting flaky tests in quantum software: A dynamic approach,

    D. Kim, H. Khoramrokh, L. Zhang, and A. Miranskyy, "Detecting flaky tests in quantum software: A dynamic approach," arXiv preprint arXiv:2512.18088, 2025

  6. [6]

    On the feasibility of quantum unit testing,

    A. Miranskyy, J. Campos, A. Mjeda, L. Zhang, and I. G. R. de Guzmán, "On the feasibility of quantum unit testing," arXiv preprint arXiv:2507.17235, 2025

  7. [7]

    On testing quantum programs,

    A. Miranskyy and L. Zhang, "On testing quantum programs," in 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE, 2019, pp. 57–60

  8. [8]

    QMutPy: A mutation testing tool for quantum algorithms and applications in Qiskit,

    D. Fortunato, J. Campos, and R. Abreu, "QMutPy: A mutation testing tool for quantum algorithms and applications in Qiskit," in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 797–800

  9. [9]

    Muskit: A mutation analysis tool for quantum software testing,

    E. Mendiluze, S. Ali, P. Arcaini, and T. Yue, "Muskit: A mutation analysis tool for quantum software testing," in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2021, pp. 1266–1270

  10. [10]

    QuanFuzz: Fuzz testing of quantum program,

    J. Wang, M. Gao, Y. Jiang, J. Lou, Y. Gao, D. Zhang, and J. Sun, "QuanFuzz: Fuzz testing of quantum program," arXiv preprint arXiv:1810.10310, 2018

  11. [11]

    Quantum software engineering: Landscapes and horizons,

    J. Zhao, "Quantum software engineering: Landscapes and horizons," arXiv preprint arXiv:2007.07047, 2020

  12. [12]

    Z3: An efficient SMT solver,

    L. De Moura and N. Bjørner, "Z3: An efficient SMT solver," in International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2008, pp. 337–340

  13. [13]

    Concolic testing,

    K. Sen, "Concolic testing," in Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering, 2007, pp. 571–572

  14. [14]

    Automated whitebox fuzz testing,

    P. Godefroid, M. Y. Levin, D. A. Molnar et al., "Automated whitebox fuzz testing," in NDSS, vol. 8, 2008, pp. 151–166

  15. [15]

    Quantum computing with Qiskit,

    A. Javadi-Abhari, M. Treinish, K. Krsulich, C. J. Wood, J. Lishman, J. Gacon, S. Martiel, P. D. Nation, L. S. Bishop, A. W. Cross et al., "Quantum computing with Qiskit," arXiv preprint arXiv:2405.08810, 2024

  16. [16]

    Enabling high performance debugging for variational quantum algorithms using compressed sensing,

    T. Hao, K. Liu, and S. Tannu, "Enabling high performance debugging for variational quantum algorithms using compressed sensing," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–13

  17. [17]

    Quantum simulators and applications on quantum framework,

    S. Chundury, Z. Xu, A. Shehata, S. Kim, F. Mueller, and I.-S. Suh, "Quantum simulators and applications on quantum framework," in 2025 IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 2. IEEE, 2025, pp. 522–523

  18. [18]

    On a test of whether one of two random variables is stochastically larger than the other,

    H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," The Annals of Mathematical Statistics, pp. 50–60, 1947

  19. [19]

    Dominance statistics: Ordinal analyses to answer ordinal questions,

    N. Cliff, "Dominance statistics: Ordinal analyses to answer ordinal questions," Psychological Bulletin, vol. 114, no. 3, p. 494, 1993