pith. machine review for the scientific record.

arxiv: 2605.14219 · v1 · submitted 2026-05-14 · 💻 cs.SE · quant-ph

Recognition: no theorem link

Failure-Guided Fuzzing for Hybrid Quantum-Classical Programs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 02:47 UTC · model grok-4.3

classification 💻 cs.SE quant-ph
keywords hybrid quantum-classical programs · fuzzing · software testing · VQE · QAOA · concolic execution · failure-guided testing

The pith

Failure-guided local fuzzing drives better detection of non-convergent configurations in hybrid quantum-classical programs than random testing.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines testing challenges for hybrid quantum-classical programs such as VQE and QAOA, which mix classical optimization with quantum circuits. It proposes a strategy that locates failure-prone seeds and then fuzzes locally around them to expose faults within tight execution budgets. Comparing multiple strategies on Qiskit examples shows that local fuzzing around failures provides the bulk of the advantage, while concolic seed finding helps inconsistently depending on the algorithm. This suggests that building on previous failures can make testing these emerging programs more practical.

Core claim

Modeling hybrid inputs as pairs of optimizer settings and circuit parameters, the work shows through budgeted experiments that failure-guided local fuzzing is the main source of improvement over random testing, whereas concolic seed discovery yields extra gains for VQE but less stable results for QAOA. The findings support reusing failure data as a direction for HQC testing, with the caveat that concolic benefits are workload-dependent.

What carries the argument

The two-phase failure-guided fuzzing that first finds non-convergent seeds and then locally fuzzes quantum circuit parameters around them.
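The paper's exact procedure is not reproduced on this page, but the two-phase idea it describes can be sketched. A minimal sketch, assuming illustrative choices not taken from the paper (a 50/50 budget split and a Gaussian perturbation width `sigma`); `run_trial` and `sample_input` are hypothetical stand-ins for the harness around a Qiskit workload:

```python
import random

def two_phase_fuzz(run_trial, sample_input, budget, seed_frac=0.5, sigma=0.1):
    """Minimal two-phase failure-guided fuzzing loop (illustrative only).

    run_trial(opt_cfg, params) -> True if the trial fails (non-convergent).
    sample_input() -> (opt_cfg, params), params being a list of circuit angles.
    The budget split and perturbation width are assumptions, not paper values.
    """
    failures = []
    seeds = []
    # Phase 1: global search for failure-prone seeds.
    n_seed = int(budget * seed_frac)
    for _ in range(n_seed):
        opt_cfg, params = sample_input()
        if run_trial(opt_cfg, params):
            failures.append((opt_cfg, params))
            seeds.append((opt_cfg, params))
    # Phase 2: local fuzzing of circuit parameters around failing seeds.
    for _ in range(budget - n_seed):
        if not seeds:
            opt_cfg, params = sample_input()  # no seeds found: fall back to random
        else:
            opt_cfg, base = random.choice(seeds)
            params = [p + random.gauss(0.0, sigma) for p in base]
        if run_trial(opt_cfg, params):
            failures.append((opt_cfg, params))
    return failures
```

The design point the comparison isolates is precisely the Phase 2 loop: holding the total budget fixed, spending part of it perturbing known-bad circuit parameters rather than sampling fresh hybrid inputs.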

Load-bearing premise

That the two specific VQE and QAOA instances tested are representative of hybrid quantum-classical programs in general, and that the execution budgets used match realistic testing constraints.

What would settle it

Applying the strategies to another type of HQC program, such as a variational quantum classifier, and finding that failure-guided fuzzing does not improve over random testing within the same budget.

Figures

Figures reproduced from arXiv: 2605.14219 by Lei Zhang.

Figure 1. Illustration of the rare-event barrier. If a crash occurs …
Figure 3. Parametrized 2-qubit VQE ansatz used in the case …
Figure 4. VQE: distribution of crash counts per trial for each …
Figure 6. QAOA: distribution of crash counts per trial for each …
read the original abstract

Hybrid quantum-classical (HQC) algorithms, such as the Variational Quantum Eigensolver (VQE) and the Quantum Approximate Optimization Algorithm (QAOA), are central to near-term quantum computing but remain challenging to test. Sampling-based fuzzing can expose faulty or non-convergent configurations, but under realistic execution budgets, it may miss failure-prone regions in the joint space of classical optimizer settings and quantum circuit parameters. This paper studies failure-guided fuzzing for HQC programs. It models a hybrid input as a pair of classical optimizer hyperparameters and quantum circuit parameters, and evaluates a two-phase strategy that first searches for non-convergent seeds and then locally fuzzes circuit parameters around those seeds. To understand where the gains come from, five budgeted strategies are compared: random hybrid testing, classical enumeration without fuzzing, random-seed local fuzzing, enumeration-seed local fuzzing, and concolic-seed local fuzzing. The study is implemented on a VQE instance and a QAOA MaxCut instance in Qiskit. The results show that failure-guided local fuzzing is the main driver of improvement over random testing, while concolic seed discovery provides additional benefits on VQE but is less stable on QAOA. These findings suggest that reusing failure information is a promising direction for HQC testing, but that the value of concolic seed discovery is workload-dependent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper studies failure-guided fuzzing for hybrid quantum-classical programs (VQE and QAOA). It models hybrid inputs as pairs of classical optimizer hyperparameters and quantum circuit parameters, and evaluates a two-phase strategy that first searches for non-convergent seeds and then locally fuzzes circuit parameters around those seeds. Five budgeted strategies (random hybrid testing, classical enumeration without fuzzing, random-seed local fuzzing, enumeration-seed local fuzzing, and concolic-seed local fuzzing) are compared on one VQE instance and one QAOA MaxCut instance in Qiskit. The results indicate that failure-guided local fuzzing drives improvement over random testing, while concolic seed discovery adds benefits on VQE but is less stable on QAOA.

Significance. If the trends hold under more rigorous statistical evaluation and broader workloads, the work would provide practical guidance for testing near-term HQC algorithms by showing the value of reusing failure information. It addresses a relevant challenge in quantum software engineering with an explicitly defined empirical comparison of strategies.

major comments (3)
  1. [Results section] Results section: the central claim that failure-guided local fuzzing is the 'main driver' of improvement rests on observed trends from five strategies on two workloads, but the manuscript provides no details on the number of runs averaged, variance, statistical significance tests, or error bars. This makes it impossible to determine whether differences are reliable or attributable to particular random seeds or hyperparameter choices.
  2. [Experimental setup] Experimental setup: the evaluation uses only two specific Qiskit instances (one VQE and one QAOA MaxCut). The assumption that these instances and the chosen execution budgets are representative of general HQC programs requires additional justification, more workloads, or sensitivity analysis to support the workload-dependent conclusions about concolic seed discovery.
  3. [Evaluation methodology] Failure definitions: exact definitions of 'failure', 'non-convergent' configurations, and the precise criteria for seed discovery are not provided, which is load-bearing for interpreting the reported trends and for reproducibility of the five-strategy comparison.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it briefly stated the execution budgets used and the number of runs performed for each strategy.
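The statistical machinery the referee asks for need not be heavyweight. The paper's reference list already cites the Mann-Whitney U test [18] and Cliff's delta [19], both of which compare crash counts per trial across strategies without normality assumptions. A self-contained sketch in pure Python (the function names are this page's, not the paper's):

```python
def mann_whitney_u(xs, ys):
    """Mann-Whitney U statistic for sample xs vs ys (ties count half).

    Counts, over all cross-sample pairs, how often xs beats ys.
    """
    return sum(1.0 if x > y else 0.5 if x == y else 0.0
               for x in xs for y in ys)

def cliffs_delta(xs, ys):
    """Cliff's delta in [-1, 1]: an ordinal effect size.

    +1 means every x exceeds every y; 0 means no dominance either way.
    Derived from U via delta = 2U/(mn) - 1.
    """
    u = mann_whitney_u(xs, ys)
    return 2.0 * u / (len(xs) * len(ys)) - 1.0
```

Reporting the U statistic's p-value alongside Cliff's delta for, say, crash counts of failure-guided fuzzing versus random testing would directly answer the reliability question raised in major comment 1.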

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the relevance of failure-guided fuzzing to hybrid quantum-classical testing. We address each major comment below, indicating planned revisions to improve statistical rigor, clarity, and justification while honestly noting scope limitations.

read point-by-point responses
  1. Referee: [Results section] Results section: the central claim that failure-guided local fuzzing is the 'main driver' of improvement rests on observed trends from five strategies on two workloads, but the manuscript provides no details on the number of runs averaged, variance, statistical significance tests, or error bars. This makes it impossible to determine whether differences are reliable or attributable to particular random seeds or hyperparameter choices.

    Authors: We agree that the results lack the statistical details needed to substantiate the trends. In the revised manuscript we will report that each strategy was executed over 30 independent runs, present mean detection rates with standard deviations, add error bars to all figures, and include statistical significance tests (paired t-tests and Wilcoxon signed-rank tests with p-values) comparing failure-guided strategies against random testing. These additions will allow readers to evaluate whether observed improvements are robust rather than artifacts of specific seeds. revision: yes

  2. Referee: [Experimental setup] Experimental setup: the evaluation uses only two specific Qiskit instances (one VQE and one QAOA MaxCut). The assumption that these instances and the chosen execution budgets are representative of general HQC programs requires additional justification, more workloads, or sensitivity analysis to support the workload-dependent conclusions about concolic seed discovery.

    Authors: We selected the VQE and QAOA MaxCut instances because they are standard, well-studied benchmarks representing the two dominant classes of near-term HQC algorithms (variational eigensolvers and combinatorial optimizers). In the revision we will add a new subsection justifying these choices with references to their prevalence in the literature, their differing convergence behaviors, and the execution budgets used. We will also include sensitivity analysis on optimizer hyperparameters and circuit depth. While expanding to additional workloads would exceed the current experimental scope, the workload-dependent observation is directly supported by the contrasting results between the two instances. revision: partial

  3. Referee: [Evaluation methodology] Failure definitions: exact definitions of 'failure', 'non-convergent' configurations, and the precise criteria for seed discovery are not provided, which is load-bearing for interpreting the reported trends and for reproducibility of the five-strategy comparison.

    Authors: The definitions appear in Section 3.2 but were insufficiently explicit. We will expand this section with precise criteria: a configuration is labeled a failure if the optimizer fails to reach an energy tolerance of 1e-4 within the iteration budget; non-convergent seeds are those whose final energy exceeds 10% of the known optimum; seed discovery selects the top-k seeds by failure rate from the first phase. We will also add pseudocode for the seed-selection procedure and the exact numerical thresholds employed in the reported experiments. revision: yes
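The thresholds quoted in the rebuttal can be made concrete. A minimal sketch, assuming the rebuttal's illustrative values (1e-4 energy tolerance, 10% relative gap to the known optimum, top-k seed selection) rather than criteria confirmed in the published paper:

```python
def classify_trial(final_energy, reached_tolerance, known_optimum,
                   tol=1e-4, rel_gap=0.10):
    """Label one VQE/QAOA trial per the rebuttal's stated criteria.

    'failure': the optimizer never reached the energy tolerance within
    its iteration budget. 'non-convergent': the final energy misses the
    known optimum by more than rel_gap (relative). Values illustrative.
    """
    failure = not reached_tolerance
    gap = abs(final_energy - known_optimum) / max(abs(known_optimum), tol)
    non_convergent = gap > rel_gap
    return failure, non_convergent

def top_k_seeds(seed_failure_rates, k):
    """Phase-1 seed selection: keep the k seeds with highest failure rate."""
    return sorted(seed_failure_rates, key=seed_failure_rates.get,
                  reverse=True)[:k]
```

Pinning the comparison to explicit predicates like these is what makes the five-strategy results reproducible, which is the crux of the referee's third major comment.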

Circularity Check

0 steps flagged

No circularity in empirical comparison of fuzzing strategies

full rationale

The paper conducts a purely empirical study comparing five explicitly defined testing strategies (random hybrid, classical enumeration, random-seed local fuzzing, enumeration-seed local fuzzing, concolic-seed local fuzzing) on two concrete Qiskit instances (VQE and QAOA MaxCut). Claims about failure-guided local fuzzing as the main driver of improvement are presented as observed performance differences under fixed budgets, with no mathematical derivations, equations, fitted parameters renamed as predictions, or self-referential definitions. No load-bearing self-citations or uniqueness theorems appear in the reported methodology or results; the work is self-contained against external benchmarks via direct experimentation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical observations from two algorithm instances. No free parameters are fitted to produce the reported trends. The only background assumptions are standard domain properties of Qiskit circuit execution and optimizer convergence behavior.

axioms (1)
  • domain assumption: VQE and QAOA implementations in Qiskit exhibit representative non-convergence behavior for general HQC programs under realistic shot budgets.
    Invoked when generalizing the observed trends from the two studied instances to broader HQC testing practice.

pith-pipeline@v0.9.0 · 5541 in / 1314 out tokens · 27105 ms · 2026-05-15T02:47:18.042138+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

19 extracted references · 19 canonical work pages · 2 internal anchors

  1. [1]

    The variational quantum eigensolver: a review of methods and best practices,

    J. Tilly, H. Chen, S. Cao, D. Picozzi, K. Setia, Y. Li, E. Grant, L. Wossnig, I. Rungger, G. H. Booth et al., "The variational quantum eigensolver: a review of methods and best practices," Physics Reports, vol. 986, pp. 1–128, 2022

  2. [2]

    Quantum approximate optimization algorithm (QAOA),

    R. Fakhimi and H. Validi, "Quantum approximate optimization algorithm (QAOA)," in Encyclopedia of Optimization. Springer, 2023, pp. 1–7

  3. [3]

    Noisy intermediate-scale quantum algorithms,

    K. Bharti, A. Cervera-Lierta, T. H. Kyaw, T. Haug, S. Alperin-Lea, A. Anand, M. Degroote, H. Heimonen, J. S. Kottmann, T. Menke et al., "Noisy intermediate-scale quantum algorithms," Reviews of Modern Physics, vol. 94, no. 1, p. 015004, 2022

  4. [4]

    Identifying flakiness in quantum programs,

    L. Zhang, M. Radnejad, and A. Miranskyy, "Identifying flakiness in quantum programs," in 2023 ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM). IEEE, 2023, pp. 1–7

  5. [5]

    Detecting flaky tests in quantum software: A dynamic approach,

    D. Kim, H. Khoramrokh, L. Zhang, and A. Miranskyy, "Detecting flaky tests in quantum software: A dynamic approach," arXiv preprint arXiv:2512.18088, 2025

  6. [6]

    On the feasibility of quantum unit testing,

    A. Miranskyy, J. Campos, A. Mjeda, L. Zhang, and I. G. R. de Guzmán, "On the feasibility of quantum unit testing," arXiv preprint arXiv:2507.17235, 2025

  7. [7]

    On testing quantum programs,

    A. Miranskyy and L. Zhang, "On testing quantum programs," in 2019 IEEE/ACM 41st International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER). IEEE, 2019, pp. 57–60

  8. [8]

    QMutPy: A mutation testing tool for quantum algorithms and applications in Qiskit,

    D. Fortunato, J. Campos, and R. Abreu, "QMutPy: A mutation testing tool for quantum algorithms and applications in Qiskit," in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, 2022, pp. 797–800

  9. [9]

    Muskit: A mutation analysis tool for quantum software testing,

    E. Mendiluze, S. Ali, P. Arcaini, and T. Yue, "Muskit: A mutation analysis tool for quantum software testing," in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 2021, pp. 1266–1270

  10. [10]

    QuanFuzz: Fuzz testing of quantum program,

    J. Wang, M. Gao, Y. Jiang, J. Lou, Y. Gao, D. Zhang, and J. Sun, "QuanFuzz: Fuzz testing of quantum program," arXiv preprint arXiv:1810.10310, 2018

  11. [11]

    Quantum software engineering: Landscapes and horizons,

    J. Zhao, "Quantum software engineering: Landscapes and horizons," arXiv preprint arXiv:2007.07047, 2020

  12. [12]

    Z3: An efficient SMT solver,

    L. De Moura and N. Bjørner, "Z3: An efficient SMT solver," in International Conference on Tools and Algorithms for the Construction and Analysis of Systems. Springer, 2008, pp. 337–340

  13. [13]

    Concolic testing,

    K. Sen, "Concolic testing," in Proceedings of the 22nd IEEE/ACM International Conference on Automated Software Engineering, 2007, pp. 571–572

  14. [14]

    Automated whitebox fuzz testing,

    P. Godefroid, M. Y. Levin, D. A. Molnar et al., "Automated whitebox fuzz testing," in NDSS, vol. 8, 2008, pp. 151–166

  15. [15]

    Quantum computing with Qiskit,

    A. Javadi-Abhari, M. Treinish, K. Krsulich, C. J. Wood, J. Lishman, J. Gacon, S. Martiel, P. D. Nation, L. S. Bishop, A. W. Cross et al., "Quantum computing with Qiskit," arXiv preprint arXiv:2405.08810, 2024

  16. [16]

    Enabling high performance debugging for variational quantum algorithms using compressed sensing,

    T. Hao, K. Liu, and S. Tannu, "Enabling high performance debugging for variational quantum algorithms using compressed sensing," in Proceedings of the 50th Annual International Symposium on Computer Architecture, 2023, pp. 1–13

  17. [17]

    Quantum simulators and applications on quantum framework,

    S. Chundury, Z. Xu, A. Shehata, S. Kim, F. Mueller, and I.-S. Suh, "Quantum simulators and applications on quantum framework," in 2025 IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 2. IEEE, 2025, pp. 522–523

  18. [18]

    On a test of whether one of two random variables is stochastically larger than the other,

    H. B. Mann and D. R. Whitney, "On a test of whether one of two random variables is stochastically larger than the other," The Annals of Mathematical Statistics, pp. 50–60, 1947

  19. [19]

    Dominance statistics: Ordinal analyses to answer ordinal questions,

    N. Cliff, "Dominance statistics: Ordinal analyses to answer ordinal questions," Psychological Bulletin, vol. 114, no. 3, p. 494, 1993