
arxiv: 2605.29872 · v1 · pith:ZNZAZWS7new · submitted 2026-05-28 · 🪐 quant-ph · cs.SE

Claim against Measurement: Statistical Artefacts in Quantum Error Mitigation Benchmarks

Pith reviewed 2026-06-29 07:06 UTC · model grok-4.3

classification 🪐 quant-ph cs.SE
keywords quantum error mitigation · statistical artefacts · benchmarks · zero-noise extrapolation · reproducibility · evaluation standards · hardware drift · NISQ devices

The pith

Evaluation practices in quantum error mitigation can make performance gains appear more robust than the data support.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reviews 81 recent studies on quantum error mitigation and finds that most omit inferential statistical tests, reporting only descriptive uncertainty instead. It then demonstrates the practical impact by sweeping parameters in zero-noise extrapolation across 132 configurations on hardware, where small choices about scale factors or extrapolation methods can reverse a result from statistically significant improvement to significant degradation. A separate 72-hour run on the same device shows that temporal drift alone can more than triple the measured effect size for an otherwise identical mitigation setup. These patterns indicate that current benchmark results may overstate the reliability of mitigation techniques. The authors therefore call for minimum standards that include explicit parameter logging, robustness checks, drift monitoring, and proper statistical testing.

Core claim

In a systematic review using an eight-criterion framework, only 15 of the 60 applicable papers (25%) employed inferential statistics, while 25 (42%) reported uncertainty descriptively without testing whether the claimed effects are statistically supported. Targeted experiments on zero-noise extrapolation then isolate two artefact sources. The first is parameter sensitivity: across a 132-configuration sweep, implicit choices change the conclusion. The second is a drift-induced illusion: in a 72-hour longitudinal study, the same configuration produces effect sizes that differ by more than a factor of three solely because of execution timing, and drift also reduces the effective number of independent observations.
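
To make the parameter-sensitivity point concrete, the following minimal Python sketch extrapolates the same three noise-scaled expectation values to zero noise with two common choices: a linear least-squares fit and a Richardson-style exact polynomial fit. The scale factors, the expectation values, and the function names are illustrative assumptions, not data or code from the paper.

```python
def linear_extrapolate(scales, values):
    """Least-squares straight line through the points, evaluated at scale 0."""
    n = len(scales)
    mean_x = sum(scales) / n
    mean_y = sum(values) / n
    sxx = sum((x - mean_x) ** 2 for x in scales)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(scales, values))
    slope = sxy / sxx
    return mean_y - slope * mean_x  # intercept = value at zero noise


def richardson_extrapolate(scales, values):
    """Exact interpolating polynomial (Lagrange form), evaluated at scale 0."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(scales, values)):
        weight = 1.0
        for j, xj in enumerate(scales):
            if j != i:
                weight *= (0.0 - xj) / (xi - xj)
        total += weight * yi
    return total


# Hypothetical noisy expectation values at noise scale factors 1, 2, 3.
scales = [1.0, 2.0, 3.0]
values = [0.80, 0.66, 0.56]

linear = linear_extrapolate(scales, values)
richardson = richardson_extrapolate(scales, values)
print(f"linear: {linear:.3f}, Richardson: {richardson:.3f}")
```

With these hypothetical inputs the linear fit gives about 0.913 and the Richardson fit gives 0.98. If the ideal value were, say, 0.95, one method would undershoot and the other overshoot, although both start from identical measurements.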

What carries the argument

An eight-criterion framework that scores papers on statistical rigour, reproducibility, and reporting quality, used to surface artefacts in parameter sweeps and longitudinal hardware runs for zero-noise extrapolation.

If this is right

  • Evaluations must document all scale factors, extrapolation methods, and calibration settings explicitly.
  • Robustness checks across parameter variations become necessary before claiming mitigation benefits.
  • Longitudinal runs are required to separate hardware drift from mitigation effects.
  • Inferential statistics with effect-size reporting replace descriptive uncertainty in published results.
  • New minimum reporting standards would allow direct comparison of mitigation techniques across studies.
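
As a rough illustration of what "inferential statistics with effect-size reporting" could look like in practice, the sketch below computes Cohen's d, Cliff's δ, and an exact two-sided permutation p-value for mitigated versus unmitigated errors. The data are invented, the function names are hypothetical, and the permutation test is one possible choice of test, not necessarily the one used in the paper.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance


def cohens_d(a, b):
    """Standardised mean difference using the pooled sample standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / sqrt(pooled_var)


def cliffs_delta(a, b):
    """Fraction of pairs with a > b minus fraction of pairs with a < b."""
    greater = sum(1 for x in a for y in b if x > y)
    less = sum(1 for x in a for y in b if x < y)
    return (greater - less) / (len(a) * len(b))


def permutation_p_value(a, b):
    """Exact two-sided permutation test on the difference in means."""
    pooled = list(a) + list(b)
    observed = abs(mean(a) - mean(b))
    extreme = total = 0
    for idx in combinations(range(len(pooled)), len(a)):
        chosen = set(idx)
        group_a = [pooled[i] for i in chosen]
        group_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        if abs(mean(group_a) - mean(group_b)) >= observed - 1e-12:
            extreme += 1
        total += 1
    return extreme / total


# Hypothetical absolute errors from four repetitions each.
unmitigated = [0.20, 0.22, 0.19, 0.21]
mitigated = [0.15, 0.17, 0.14, 0.16]

print(cohens_d(unmitigated, mitigated))
print(cliffs_delta(unmitigated, mitigated))
print(permutation_p_value(unmitigated, mitigated))
```

For this toy data set the groups do not overlap, so Cliff's δ is 1.0, Cohen's d is about 3.87, and the exact p-value is 2/70 (about 0.029). Enumerating all splits is only feasible for small samples; larger studies would sample permutations instead.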

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar review frameworks could be applied to other near-term quantum techniques that rely on post-processing.
  • Hardware vendors might incorporate automated drift logging into calibration routines to support cleaner benchmarks.
  • If artefacts are widespread, reported progress toward useful error mitigation on NISQ devices could be slower than current literature suggests.
  • Standardised test suites that include parameter and drift sweeps would make future claims easier to verify.

Load-bearing premise

The 81 papers form a representative sample of quantum error mitigation evaluation practice and the eight criteria capture the main dimensions of statistical validity.

What would settle it

A re-run of the 132-configuration and 72-hour experiments on multiple devices that applies mandatory inferential tests and finds that performance conclusions remain stable once parameter choices are fixed and drift is controlled.
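
One simple way to control for drift in such a re-run is to randomise the execution order, so that slow hardware drift is not systematically aligned with any one configuration. The sketch below shows how a shuffled schedule could be generated; the configuration labels and the function name are hypothetical, and this is not the authors' tooling.

```python
import random


def randomised_schedule(configurations, repetitions, seed=0):
    """Return every configuration `repetitions` times, in shuffled order.

    A fixed seed makes the schedule reproducible, so it can be logged
    alongside the results.
    """
    rng = random.Random(seed)
    schedule = [cfg for cfg in configurations for _ in range(repetitions)]
    rng.shuffle(schedule)
    return schedule


# Hypothetical configuration labels.
configs = ["unmitigated", "zne_linear", "zne_richardson"]
for slot, cfg in enumerate(randomised_schedule(configs, repetitions=4, seed=1)):
    print(slot, cfg)
```

Recording the seed and the resulting order with each run keeps the timing of every measurement auditable.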

Figures

Figures reproduced from arXiv: 2605.29872 by Dominik Köster, Wolfgang Mauerer.

Figure 1: Summary of the systematic review results across the eight criteria.
Figure 2: The full parameter space P of a QEM experiment, with regions of improvement (green), no improvement (grey), and worsening (red). Studies typically test a small subset of parameter configurations (orange). Our analysis shows C3 and C4 have the smallest compliance, yet matter most when parameter choices and timing can influence or even determine the outcome of an experiment. Before we demonstrate this experi…
Figure 3: Parameter space heatmap for 4 backends × 3 Trotter depths (separated by vertical lines), one-at-a-time sweep from defaults. Each cell is split: the left half shows Cohen’s d (colour scale left), the right half shows Cliff’s δ (colour scale right). Green = ZNE significantly reduces error, yellow = ZNE significantly increases error (α=0.05), white = not significant. runs, or a different measure. The number o…
Figure 4: Sensitivity analysis. (a) Cohen’s d vs. shot count; d scales with √n_shots for genuine effects. (b) Statistical power vs. repetition count (1,000 bootstrap resamples); 80% power requires n_reps ≥ 20 for moderate effects.
[Plot residue: x-axis "Shot count (configuration)" with configurations 1024 (Default), 3502 (Mid-Low), 5980 (Estimated), 8458 (Mid-High), 10936 (High); y-axis "Success rate"]
Figure 5: Shot estimation results from Desdentado et al. [66] for a five-qubit Grover circuit on IBM Brisbane. The proposed shot estimation yields the best target state probability, while higher shot counts drift to the noise floor. The paper proposes an algorithm to estimate the ideal shot count and tests five configurations for a five-qubit Grover circuit on IBM Brisbane. It finds an improvement of 0.37% for t…
Figure 6: Longitudinal drift study on IQM Euro-Q-Exa (147 time points across 72 hours in three independent sessions). Top:
Figure 7: Seven-day longitudinal drift study on IQM Euro-Q-Exa (163 scheduled hours, 2026-04-08 to 2026-04-15).
Figure 8: Twelve-hour drift session on IBM ibm_brussels (TC3, 25 time points). Ē(λ1) with 95% CI ribbon and linear trend (dashed); the dotted line marks the ideal value (E_ideal = 0.846). Despite a fixed hardware and parameter configuration, Ē(λ1) drifts across the session. 5) Quantify Temporal Stability by distributing HW experiments over time, or randomise execution order to deconfound drift from parameter e…
original abstract

QEM is widely regarded as a plausible bridge from NISQ devices to FTQC. Yet the empirical studies used to assess the effectiveness of QEM techniques on concrete problems have received comparatively little scrutiny with respect to the validity of their conclusions. We systematically review 81 recent QEM papers using an eight-criterion framework covering statistical rigour, reproducibility, and reporting quality. Among the applicable papers, only 15 (25%) use inferential methods, while 25 (42%) report uncertainty only descriptively, without testing whether the claimed effects are statistically supported. To demonstrate the consequences of these omissions, we use ZNE as a representative and widely used case study and identify two compounding sources of artefacts in current QEM benchmarks. First, we observe parameter sensitivity: in a 132-configuration sweep, implicitly assumed choices such as scale factors, extrapolation method, and hardware calibration are not merely incidental but active, with variations changing conclusions from statistically significant improvement to statistically significant degradation. Second, we identify a drift-induced effectiveness illusion: in a 72-hour longitudinal study on real hardware, temporal drift alone can make the same ZNE configuration exhibit an effect size more than three times as large, depending solely on when it is executed, and also drastically reduces the effective number of independent observations. These findings do not imply that QEM methods are intrinsically unsound; rather, they show that current evaluation practice can make mitigation performance appear more robust than the evidence warrants. We therefore propose minimum reporting standards for QEM evaluations, including explicit parameter documentation, robustness checks, longitudinal drift assessment, and inferential statistical testing with effect-size reporting.
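
The abstract's remark that drift reduces the effective number of independent observations can be illustrated with a standard approximation: if consecutive measurements have lag-1 autocorrelation ρ and behave like an AR(1) process, then n_eff ≈ n(1 − ρ)/(1 + ρ). The sketch below assumes that AR(1) model, which may differ from the estimator used in the paper; the data and function names are hypothetical.

```python
def lag1_autocorrelation(series):
    """Sample autocorrelation between consecutive measurements."""
    n = len(series)
    m = sum(series) / n
    numerator = sum((series[i] - m) * (series[i + 1] - m) for i in range(n - 1))
    denominator = sum((x - m) ** 2 for x in series)
    return numerator / denominator


def effective_sample_size(series):
    """AR(1) approximation: n_eff = n * (1 - rho) / (1 + rho).

    Negative rho is clipped to zero so n_eff never exceeds n
    (a conservative choice made for this sketch).
    """
    rho = max(lag1_autocorrelation(series), 0.0)
    return len(series) * (1.0 - rho) / (1.0 + rho)


# Hypothetical expectation values that drift steadily over ten time points.
drifting = [0.80 + 0.01 * i for i in range(10)]
print(lag1_autocorrelation(drifting))
print(effective_sample_size(drifting))
```

For the steadily drifting series above the estimated ρ is about 0.7, so ten measurements carry roughly the information of two independent ones.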

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript conducts a systematic review of 81 recent QEM papers via an eight-criterion framework on statistical rigour, reproducibility and reporting quality. It reports that only 15 of 60 applicable papers (25%) employ inferential statistics while 25 (42%) report uncertainty descriptively only. Using ZNE as a case study, a 132-configuration parameter sweep and a 72-hour longitudinal hardware experiment are presented to show that implicit choices (scale factors, extrapolation method, calibration) and temporal drift can reverse statistical conclusions or inflate effect sizes by more than 3×. The authors conclude that current evaluation practice can overstate QEM robustness and propose minimum reporting standards including explicit parameter documentation, robustness checks, drift assessment and inferential testing with effect sizes.

Significance. If the central claims hold, the work is significant for the QEM literature. The explicit counts (15/60, 25/60) from the review together with the quantified hardware results (132 configurations, >3× effect-size variation, reduced independent observations) supply concrete evidence that statistical artefacts can distort benchmark conclusions. The proposed reporting standards are directly motivated by these findings and could improve reproducibility across the field.
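
For readers who want a sense of how much sampling uncertainty sits behind a count such as 15/60, the sketch below computes a Wilson score interval for a binomial proportion. This is an editorial illustration under the assumption that the 60 applicable papers can be treated as a simple random sample; the paper itself is not stated to report such an interval.

```python
from math import sqrt


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


low, high = wilson_interval(15, 60)
print(f"15/60 = {15 / 60:.0%}, interval {low:.1%} to {high:.1%}")
```

Under that sampling assumption the 25% figure is compatible with a true share of roughly 16% to 37%.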

major comments (2)
  1. [Systematic review and framework definition] The eight-criterion framework and the claim that the 81-paper sample is representative of the QEM literature are load-bearing for generalizing the 25% inferential-statistics figure, yet the manuscript provides no external validation of the framework (e.g., against CONSORT or similar guidelines) and does not state the selection protocol; this assumption therefore requires explicit justification or sensitivity analysis.
  2. [ZNE case-study and discussion sections] The link between the artefacts identified in the ZNE sweeps (parameter sensitivity across 132 configurations and drift-induced >3× effect-size change) and the practices in the 81 reviewed papers remains inferential; the manuscript does not demonstrate that the specific artefacts are operative in the reviewed corpus, weakening the claim that current evaluation practice systematically inflates robustness.
minor comments (2)
  1. [Abstract] Clarify the exact denominator for the percentages (60 applicable papers) already in the abstract so that the 15/25 counts are unambiguous without cross-reference to the main text.
  2. [Longitudinal drift experiment] The 72-hour drift study reports effect-size variation but does not state the precise statistical test or multiple-comparison correction used; adding this detail would strengthen the inferential claim.
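
Minor comment 2 asks about multiple-comparison correction. One widely used option is the Holm-Bonferroni step-down procedure, sketched below. It is shown only as an example of what such a correction involves; the p-values are invented, and nothing here indicates which correction, if any, the authors applied.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Holm step-down procedure.

    Returns a list of booleans in the input order: True where the null
    hypothesis is rejected at family-wise error rate `alpha`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value is compared against alpha / (m - k + 1).
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one test fails, all larger p-values fail too
    return reject


# Hypothetical p-values from four configurations of a sweep.
p_values = [0.001, 0.04, 0.03, 0.005]
print(holm_bonferroni(p_values))
```

All four hypothetical p-values fall below 0.05 individually, but only two survive the correction. In a sweep of 132 configurations the smallest p-value would have to fall below 0.05/132 (about 0.00038) to be declared significant under this procedure.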

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below, indicating planned revisions where appropriate.

point-by-point responses
  1. Referee: [Systematic review and framework definition] The eight-criterion framework and the claim that the 81-paper sample is representative of the QEM literature are load-bearing for generalizing the 25% inferential-statistics figure, yet the manuscript provides no external validation of the framework (e.g., against CONSORT or similar guidelines) and does not state the selection protocol; this assumption therefore requires explicit justification or sensitivity analysis.

    Authors: We agree that the selection protocol and external validation were insufficiently detailed. The framework is derived from core statistical reporting principles, but we will add an explicit subsection in the revised manuscript describing the literature search protocol (databases, keywords, date range, and inclusion/exclusion criteria) along with a sensitivity analysis showing how the 25% figure changes under alternative sampling assumptions. We will also reference alignment with guidelines such as CONSORT to justify the criteria. revision: yes

  2. Referee: [ZNE case-study and discussion sections] The link between the artefacts identified in the ZNE sweeps (parameter sensitivity across 132 configurations and drift-induced >3 times effect-size change) and the practices in the 81 reviewed papers remains inferential; the manuscript does not demonstrate that the specific artefacts are operative in the reviewed corpus, weakening the claim that current evaluation practice systematically inflates robustness.

    Authors: The ZNE case study functions as an existence proof illustrating the consequences of the statistical omissions identified in the review, not as a direct claim that every reviewed paper exhibited these exact artefacts. The review establishes that 75% of papers lack inferential testing and robustness checks—the very practices needed to detect parameter sensitivity and drift. We will revise the discussion to cross-reference specific framework criteria (e.g., absence of parameter documentation and drift assessment) with the demonstrated artefacts, clarifying the illustrative role while preserving the argument that such gaps create risk of overstated robustness. revision: partial

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent review and direct experiments

full rationale

The paper's central claims—that only 25% of applicable QEM papers use inferential statistics and that ZNE benchmarks exhibit parameter sensitivity and drift artefacts—are supported by an explicit eight-criterion literature review of 81 papers plus new hardware measurements across 132 configurations and a 72-hour longitudinal study. These elements are not derived from or equivalent to any self-citation, fitted parameter, or prior ansatz by the same authors; the review criteria are stated outright, and the experimental outcomes (effect sizes, statistical significance) are measured directly on hardware rather than predicted from inputs. No self-definitional, fitted-input, or self-citation-load-bearing reductions appear in the derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim depends on the authors' eight-criterion framework being an adequate lens and on the ZNE experiments generalizing to other mitigation methods; no free parameters are fitted to produce the headline percentages.

axioms (2)
  • domain assumption The 81 papers constitute a representative sample of recent QEM evaluation practice.
    The review percentages rest on this selection.
  • ad hoc to paper The eight-criterion framework comprehensively measures statistical rigour, reproducibility, and reporting quality.
    Framework is introduced by the authors for this study.

pith-pipeline@v0.9.1-grok · 5820 in / 1148 out tokens · 33368 ms · 2026-06-29T07:06:59.431186+00:00 · methodology

Reference graph

Works this paper leans on

73 extracted references · 41 canonical work pages · 3 internal anchors

  1. [1]

    Quantum Computing in the NISQ era and beyond

    J. Preskill, “Quantum Computing in the NISQ era and beyond,” Quantum, vol. 2, p. 79, Aug. 2018, arXiv:1801.00862 [quant-ph]. [Online]. Available: http://arxiv.org/abs/1801.00862

  2. [2]

    Effects of imperfections on quantum algorithms: A software engineering perspective,

    F. Greiwe, T. Krüger, and W. Mauerer, “Effects of imperfections on quantum algorithms: A software engineering perspective,” in IEEE International Conference on Quantum Software (QSW). IEEE, 2023, pp. 31–42. [Online]. Available: https://doi.org/10.1109/QSW59989.2023.00014

  3. [3]

    Approximating under the influence of quantum noise and compute power,

    S. Thelen, H. Safi, and W. Mauerer, “Approximating under the influence of quantum noise and compute power,” in IEEE International Conference on Quantum Computing and Engineering (QCE). IEEE, 2024, pp. 274– [Online]. Available: https://doi.org/10.1109/QCE60285.2024.10291

  5. [5]

    Noisy intermediate-scale quantum algorithms,

    K. Bharti, A. Cervera-Lierta, T. H. Kyaw, T. Haug, S. Alperin-Lea et al., “Noisy intermediate-scale quantum algorithms,” Rev. Mod. Phys., vol. 94, p. 015004, Feb 2022. [Online]. Available: https://link.aps.org/doi/10.1103/RevModPhys.94.015004

  6. [6]

    Fault-tolerant quantum computation,

    J. Preskill, “Fault-tolerant quantum computation,” Dec. 1997, arXiv:quant-ph/9712048. [Online]. Available: http://arxiv.org/abs/quant-ph/9712048

  7. [7]

    Beyond NISQ: The Megaquop Machine,

    J. Preskill, “Beyond NISQ: The Megaquop Machine,” ACM Transactions on Quantum Computing, vol. 6, no. 3, pp. 18:1–18:7, Apr. 2025. [Online]. Available: https://dl.acm.org/doi/10.1145/3723153

  8. [8]

    Surface code compilation via edge-disjoint paths,

    M. Beverland, V. Kliuchnikov, and E. Schoute, “Surface code compilation via edge-disjoint paths,” PRX Quantum, vol. 3, no. 2, p. 020342, May 2022, arXiv:2110.11493 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2110.11493

  9. [9]

    Yoked surface codes,

    C. Gidney, M. Newman, P. Brooks, and C. Jones, “Yoked surface codes,” Dec. 2023, arXiv:2312.04522 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2312.04522

  10. [10]

    SAT strikes back: Parameter and path relations in quantum toolchains,

    L. Schmidbauer and W. Mauerer, “SAT strikes back: Parameter and path relations in quantum toolchains,” in Proceedings of the IEEE International Conference on Quantum Software (QSW). IEEE, 2025, pp. 1–12. [Online]. Available: https://doi.org/10.1109/QSW67625.2025.00021

  11. [11]

    Polynomial reduction methods and their impact on QAOA circuits,

    L. Schmidbauer, K. Wintersperger, E. Lobe, and W. Mauerer, “Polynomial reduction methods and their impact on QAOA circuits,” in IEEE International Conference on Quantum Software (QSW), 2024, pp. 35–45. [Online]. Available: https://doi.org/10.1109/QSW62656.2024.00018

  12. [13]

    Predict and conquer: Navigating algorithm trade-offs with quantum design automation,

    S. Thelen and W. Mauerer, “Predict and conquer: Navigating algorithm trade-offs with quantum design automation,” in IEEE International Conference on Quantum Computing and Engineering (QCE). Los Alamitos, CA, USA: IEEE Computer Society, 2025, pp. 591–602. [Online]. Available: https://doi.ieeecomputersociety.org/10.1109/QCE65121.2025.00071

  13. [14]

    The MQT handbook: A summary of design automation tools and software for quantum computing,

    R. Wille, L. Berent, T. Forster, J. Kunasaikaran, K. Mato et al., “The MQT handbook: A summary of design automation tools and software for quantum computing,” in 2024 IEEE International Conference on Quantum Software (QSW), 2024, pp. 1–8

  14. [15]

    Make some noise! measuring noise model quality in real-world quantum software,

    S. R. Maschek, J. Schwittalla, M. Franz, and W. Mauerer, “Make some noise! measuring noise model quality in real-world quantum software,” in Proceedings of the IEEE International Conference on Quantum Software (QSW). IEEE, 2025, pp. 1–11. [Online]. Available: https://doi.org/10.1109/QSW67625.2025.00010

  15. [16]

    Error mitigation for short-depth quantum circuits

    K. Temme, S. Bravyi, and J. M. Gambetta, “Error mitigation for short-depth quantum circuits,” Physical Review Letters, vol. 119, no. 18, p. 180509, Nov. 2017, arXiv:1612.02058 [quant-ph]. [Online]. Available: http://arxiv.org/abs/1612.02058

  16. [17]

    Efficient variational quantum simulator incorporating active error minimization,

    Y. Li and S. C. Benjamin, “Efficient variational quantum simulator incorporating active error minimization,” Physical Review X, vol. 7, p. 021050, 2017

  17. [18]

    Practical quantum error mitigation for near-future applications,

    S. Endo, S. C. Benjamin, and Y. Li, “Practical quantum error mitigation for near-future applications,” Physical Review X, vol. 8, p. 031027, 2018

  18. [19]

    Quantum error mitigation,

    Z. Cai, “Quantum error mitigation,” Reviews of Modern Physics, vol. 95, no. 4, 2023

  19. [20]

    Probabilistic error cancellation with sparse Pauli-Lindblad models on noisy quantum processors,

    E. v. d. Berg, Z. K. Minev, A. Kandala, and K. Temme, “Probabilistic error cancellation with sparse Pauli-Lindblad models on noisy quantum processors,” Nature Physics, vol. 19, no. 8, pp. 1116–1121, Aug. 2023, arXiv:2201.09866 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2201.09866

  20. [21]

    Error mitigation with Clifford quantum-circuit data,

    P. Czarnik, A. Arrasmith, P. J. Coles, and L. Cincio, “Error mitigation with Clifford quantum-circuit data,” Quantum, vol. 5, p. 592, Nov. 2021, arXiv:2005.10189 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2005.10189

  21. [22]

    Error mitigation extends the computational reach of a noisy quantum processor,

    A. Kandala, K. Temme, A. D. Córcoles, A. Mezzacapo, J. M. Chow et al., “Error mitigation extends the computational reach of a noisy quantum processor,” Nature, vol. 567, pp. 491–495, 2019

  22. [23]

    Cloud quantum computing of an atomic nucleus,

    E. F. Dumitrescu, A. J. McCaskey, G. Hagen, G. R. Jansen, T. D. Morris et al., “Cloud quantum computing of an atomic nucleus,” Physical Review Letters, vol. 120, no. 21, p. 210501, 2018

  23. [24]

    It’s quick to be square: Fast quadratisation for quantum toolchains,

    L. Schmidbauer, E. Lobe, I. Schaefer, and W. Mauerer, “It’s quick to be square: Fast quadratisation for quantum toolchains,” ACM Transactions on Quantum Computing, vol. 7, no. 2, p. 46, 2026. [Online]. Available: https://doi.org/10.1145/3800943

  24. [25]

    Ising formulations of many NP problems,

    A. Lucas, “Ising formulations of many NP problems,” Frontiers in Physics, vol. 2, 2014. [Online]. Available: http://dx.doi.org/10.3389/fphy.2014.00005

  25. [26]

    Out of the Loop: Structural Approximation of Optimisation Landscapes and non-Iterative Quantum Optimisation,

    T. Krüger and W. Mauerer, “Out of the Loop: Structural Approximation of Optimisation Landscapes and non-Iterative Quantum Optimisation,” Quantum, vol. 9, p. 1903, Nov. 2025. [Online]. Available: https://doi.org/10.22331/q-2025-11-06-1903

  26. [27]

    Path matters: Industrial data meet quantum optimization,

    L. Schmidbauer, C. A. Riofrío, F. Heinrich, V. Junk, U. Schwenk et al., “Path matters: Industrial data meet quantum optimization,” in IEEE International Conference on Quantum Computing and Engineering (QCE). IEEE, 2025, pp. 2101–2111. [Online]. Available: https://doi.org/10.1109/QCE65121.2025.00230

  27. [28]

    Quantum-inspired digital annealing for join ordering,

    M. Schönberger, I. Trummer, and W. Mauerer, “Quantum-inspired digital annealing for join ordering,” Proc. VLDB Endow., vol. 17, no. 3, p. 511–524, Nov. 2023. [Online]. Available: https://doi.org/10.14778/3632093.3632112

  28. [29]

    An introduction to quantum machine learning,

    M. Schuld, I. Sinayskiy, and F. Petruccione, “An introduction to quantum machine learning,” Contemporary Physics, vol. 56, no. 2, pp. 172–185, 2015

  29. [30]

    Hype or heuristic? quantum reinforcement learning for join order optimisation,

    M. Franz, T. Winker, S. Groppe, and W. Mauerer, “Hype or heuristic? quantum reinforcement learning for join order optimisation,” in IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 01, 2024, pp. 409–420

  30. [31]

    Machine Learning with Quantum Computers

    M. Schuld and F. Petruccione, Machine Learning with Quantum Computers, ser. Quantum Science and Technology. Springer Cham, 2021

  31. [32]

    Quantum Machine Learning: What Quantum Computing Means to Data Mining

    P. Wittek, Quantum Machine Learning: What Quantum Computing Means to Data Mining. Boston: Academic Press, 2014

  32. [33]

    Error-mitigated simulation of quantum many-body scars on quantum computers with pulse-level control,

    I.-C. Chen, B. Burdick, Y. Yao, P. P. Orth, and T. Iadecola, “Error-mitigated simulation of quantum many-body scars on quantum computers with pulse-level control,” Physical Review Research, vol. 4, no. 4, p. 043027, 2022

  33. [34]

    Evidence for the utility of quantum computing before fault tolerance,

    Y. Kim, A. Eddins, S. Anand, K. X. Wei, E. Van Den Berg et al., “Evidence for the utility of quantum computing before fault tolerance,” Nature, vol. 618, no. 7965, pp. 500–505, Jun. 2023. [Online]. Available: https://www.nature.com/articles/s41586-023-06096-3

  34. [35]

    Hypothesis Testing for Error Mitigation: How to Evaluate Error Mitigation,

    A. A. Saki, A. Katabarwa, S. Resch, and G. Umbrarescu, “Hypothesis Testing for Error Mitigation: How to Evaluate Error Mitigation,” Jan. 2023, arXiv:2301.02690 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2301.02690

  35. [36]

    Testing Platform-Independent Quantum Error Mitigation on Noisy Quantum Computers,

    V. Russo, A. Mari, N. Shammah, R. LaRose, and W. J. Zeng, “Testing Platform-Independent Quantum Error Mitigation on Noisy Quantum Computers,” IEEE Transactions on Quantum Engineering, vol. 4, pp. 1–18, 2023. [Online]. Available: https://ieeexplore.ieee.org/document/10219054/

  36. [37]

    Error mitigation, optimization, and extrapolation on a trapped ion testbed,

    O. G. Maupin, A. D. Burch, B. Ruzic, C. G. Yale, A. Russo et al., “Error mitigation, optimization, and extrapolation on a trapped ion testbed,” Physical Review A, vol. 110, no. 3, p. 032416, Sep. 2024, arXiv:2307.07027 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2307.07027

  37. [38]

    Fundamental limits of quantum error mitigation,

    R. Takagi, S. Endo, S. Minagawa, and M. Gu, “Fundamental limits of quantum error mitigation,” npj Quantum Information, vol. 8, no. 1, p. 114, Sep. 2022. [Online]. Available: https://www.nature.com/articles/s41534-022-00618-z

  38. [39]

    Exponentially tighter bounds on limitations of quantum error mitigation,

    Y. Quek, D. Stilck França, S. Khatri, J. J. Meyer, and J. Eisert, “Exponentially tighter bounds on limitations of quantum error mitigation,” Nature Physics, vol. 20, no. 10, pp. 1648–1658, Oct. 2024. [Online]. Available: https://www.nature.com/articles/s41567-024-02536-7

  39. [40]

    Optimization of Richardson extrapolation for quantum error mitigation,

    M. Krebsbach, B. Trauzettel, and A. Calzona, “Optimization of Richardson extrapolation for quantum error mitigation,” Physical Review A, vol. 106, no. 6, p. 062436, Dec. 2022. [Online]. Available: https://link.aps.org/doi/10.1103/PhysRevA.106.062436

  40. [41]

    A Methodological Analysis of Empirical Studies in Quantum Software Testing

    Y. Li, M. Shao, J. Zhao, and Q. Wang, “A methodological analysis of empirical studies in quantum software testing,” 2026, arXiv:2601.08367 [quant-ph]

  41. [42]

    Quantum software experiments: A reporting and laboratory package structure guidelines,

    E. Moguel, J. A. Parejo, A. Ruiz-Cortés, J. Garcia-Alonso, and J. M. Murillo, “Quantum software experiments: A reporting and laboratory package structure guidelines,” May 2024, arXiv:2405.04192 [cs]. [Online]. Available: http://arxiv.org/abs/2405.04192

  42. [43]

    Towards Redefining the Reproducibility in Quantum Computing: A Data Analysis Approach on NISQ Devices,

    P. Senapati, Z. Wang, W. Jiang, T. S. Humble, B. Fang et al., “Towards Redefining the Reproducibility in Quantum Computing: A Data Analysis Approach on NISQ Devices,” in 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 01, Sep. 2023, pp. 468–474. [Online]. Available: https://ieeexplore.ieee.org/document/10313593/

  43. [44]

    PQML: Enabling the Predictive Reproducibility on NISQ Machines for Quantum ML Applications,

    P. Senapati, S. Y.-C. Chen, B. Fang, T. M. Athawale, A. Li et al., “PQML: Enabling the Predictive Reproducibility on NISQ Machines for Quantum ML Applications,” in 2024 IEEE International Conference on Quantum Computing and Engineering (QCE), vol. 01, Sep. 2024, pp. 1413–1424. [Online]. Available: https://ieeexplore.ieee.org/document/10821454/

  44. [45]

    Detection of temporal fluctuation in superconducting qubits for quantum error mitigation,

    Y. Hirasaki, S. Daimon, T. Itoko, N. Kanazawa, and E. Saitoh, “Detection of temporal fluctuation in superconducting qubits for quantum error mitigation,” Applied Physics Letters, vol. 123, no. 18, p. 184002, Nov. 2023. [Online]. Available: https://doi.org/10.1063/5.0166739

  45. [46]

    Best practices for quantum error mitigation with digital zero-noise extrapolation,

    R. Majumdar, P. Rivero, F. Metz, A. Hasan, and D. S. Wang, “Best practices for quantum error mitigation with digital zero-noise extrapolation,” Jul. 2023, arXiv:2307.05203 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2307.05203

  46. [47]

    Digital zero noise extrapolation for quantum error mitigation,

    T. Giurgica-Tiron, Y. Hindy, R. LaRose, A. Mari, and W. J. Zeng, “Digital zero noise extrapolation for quantum error mitigation,” 2020 IEEE International Conference on Quantum Computing and Engineering (QCE), pp. 306–316, 2020

  47. [48]

    Improving Zero-noise Extrapolation for Quantum-gate Error Mitigation using a Noise-aware Folding Method,

    L. Hour, M. Go, and Y. Han, “Improving Zero-noise Extrapolation for Quantum-gate Error Mitigation using a Noise-aware Folding Method,” May 2024, arXiv:2401.12495 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2401.12495

  48. [49]

    1-2-3 reproducibility for quantum software experiments,

    W. Mauerer and S. Scherzinger, “1-2-3 reproducibility for quantum software experiments,” in IEEE International Conference on Software Analysis, Evolution and Reengineering, 2022, pp. 1247–1248

  49. [50]

    Statistical Power Analysis for the Behavioral Sciences, 2nd ed.

    J. Cohen, Statistical Power Analysis for the Behavioral Sciences, 2nd ed. Hillsdale, NJ: Lawrence Erlbaum Associates, 1988

  50. [51]

    Statistik: Der Weg zur Datenanalyse

    L. Fahrmeir, C. Heumann, R. Künstler, I. Pigeot, and G. Tutz, Statistik: Der Weg zur Datenanalyse. Berlin, Heidelberg: Springer, 2023. [Online]. Available: https://link.springer.com/10.1007/978-3-662-67526-7

  51. [52]

    Probability and statistics for engineering and the sciences,

    J. L. Devore, “Probability and statistics for engineering and the sciences,” 2008

  52. [53]

    The asa statement on p-values: context, process, and purpose,

    R. L. Wasserstein and N. A. Lazar, “The asa statement on p-values: context, process, and purpose,” pp. 129–133, 2016

  53. [54]

    New effect size rules of thumb,

    S. S. Sawilowsky, “New effect size rules of thumb,” Journal of Modern Applied Statistical Methods, vol. 8, no. 2, pp. 597–599, 2009

  54. [55]

    Quantum software engineering: Roadmap and challenges ahead,

    J. M. Murillo, J. Garcia-Alonso, E. Moguel, J. Barzen, F. Leymann et al., “Quantum software engineering: Roadmap and challenges ahead,” ACM Trans. Softw. Eng. Methodol., vol. 34, no. 5, May 2025. [Online]. Available: https://doi.org/10.1145/3712002

  55. [56]

    Challenges for Quantum Software Engineering: An Industrial Application Scenario Perspective

    C. Carbonelli, M. Felderer, M. Jung, E. Lobe, M. Lochau et al., Challenges for Quantum Software Engineering: An Industrial Application Scenario Perspective. Springer Nature Switzerland, 2024, pp. 311–335. [Online]. Available: http://dx.doi.org/10.1007/978-3-031-64136-7_12

  56. [57]

    Challenges and Opportunities in Quantum Software Architecture

    T. Yue, W. Mauerer, S. Ali, and D. Taibi, Challenges and Opportunities in Quantum Software Architecture. Springer Nature Switzerland, 2023, p. 1–. [Online]. Available: http://dx.doi.org/10.1007/978-3-031-36847-9_1

  58. [59]

    Reproducible builds for quantum computing,

    I. M. Veiga and E. Hänggi, “Reproducible builds for quantum computing,” 2025. [Online]. Available: https://arxiv.org/abs/2510.02251

  59. [60]

    Qef: Reproducible and exploratory quantum software experiments,

    V. Gierisch and W. Mauerer, “Qef: Reproducible and exploratory quantum software experiments,” Jan. 2026. [Online]. Available: https://arxiv.org/pdf/2511.04563

  60. [61]

    Guidelines for performing systematic literature reviews in software engineering,

    B. Kitchenham, S. Charters et al., “Guidelines for performing systematic literature reviews in software engineering,” 2007

  61. [62]

    Quantum computing with Qiskit,

    A. Javadi-Abhari, M. Treinish, K. Krsulich, C. J. Wood, J. Lishman et al., “Quantum computing with Qiskit,” 2024

  62. [63]

    Error Mitigation in the NISQ Era: Applying Measurement Error Mitigation Techniques to Enhance Quantum Circuit Performance,

    M. U. Khan, M. A. Kamran, W. R. Khan, M. M. Ibrahim, M. U. Ali et al., “Error Mitigation in the NISQ Era: Applying Measurement Error Mitigation Techniques to Enhance Quantum Circuit Performance,” Mathematics, vol. 12, no. 14, p. 2235, Jan. 2024. [Online]. Available: https://www.mdpi.com/2227-7390/12/14/2235

  63. [64]

    Retired QPUs

    IBM Quantum, “Retired QPUs.” [Online]. Available: https://quantum.cloud.ibm.com/docs/en/guides/processor-types

  64. [65]

    OpenQASM 3: A broader and deeper quantum assembly language,

    A. W. Cross, A. Javadi-Abhari, T. Alexander, N. d. Beaudrap, L. S. Bishop et al., “OpenQASM 3: A broader and deeper quantum assembly language,” ACM Transactions on Quantum Computing, vol. 3, no. 3, pp. 1–50, Sep. 2022, arXiv:2104.14722 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2104.14722

  65. [66]

    A Bayesian Approach for Characterizing and Mitigating Gate and Measurement Errors,

    M. Zheng, A. Li, T. Terlaky, and X. Yang, “A Bayesian Approach for Characterizing and Mitigating Gate and Measurement Errors,” ACM Transactions on Quantum Computing, vol. 4, no. 2, pp. 11:1–11:21, Feb. [Online]. Available: https://dl.acm.org/doi/10.1145/3563397

  67. [68]

    Processor types

    IBM Quantum, “Processor types.” [Online]. Available: https://eu-de.quantum.cloud.ibm.com/docs/en/guides/processor-types

  68. [69]

    Estimating the number of shots to improve results accuracy,

    E. Desdentado, M. Polo, and C. Calero, “Estimating the number of shots to improve results accuracy,” 2025, preprint. [Online]. Available: https://github.com/GreenTeamAlarcos/Estimating-The-Number-Of-Shots-To-Improve-Results-Accuracy

  69. [70]

    Simon’s algorithm in the NISQ cloud,

    R. Robertson, E. Doucet, E. Spicer, and S. Deffner, “Simon’s algorithm in the NISQ cloud,” Entropy, vol. 27, no. 7, p. 658, Jun. 2025, arXiv:2406.11771 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2406.11771

  70. [71]

    Comparative study of quantum error correction strategies for the heavy-hexagonal lattice,

    C. Benito, E. López, B. Peropadre, and A. Bermudez, “Comparative study of quantum error correction strategies for the heavy-hexagonal lattice,” Quantum, vol. 9, p. 1623, Feb. 2025, arXiv:2402.02185 [quant-ph]. [Online]. Available: http://arxiv.org/abs/2402.02185

  71. [72]

    Practical Fidelity Limits of Toffoli Gates in Superconducting Quantum Processors,

    M. AbuGhanem, “Practical Fidelity Limits of Toffoli Gates in Superconducting Quantum Processors,” Sep. 2025, arXiv:2509.05395 [quant-ph] version: 1. [Online]. Available: http://arxiv.org/abs/2509.05395

  72. [73]

    First European quantum computer for Germany: Euro-Q-Exa starts operation at LRZ - Leibniz-Rechenzentrum

    Leibniz Supercomputing Centre, “First European quantum computer for Germany: Euro-Q-Exa starts operation at LRZ - Leibniz-Rechenzentrum.” [Online]. Available: https://www.lrz.de/en/news/detail/first-european-quantum-computer-for-germany-euro-q-exa-starts-operation-at-lrz

  73. [74]

    Survey Sampling

    L. Kish, Survey Sampling. New York: John Wiley & Sons, 1965