pith. machine review for the scientific record.

arxiv: 2604.15870 · v1 · submitted 2026-04-17 · 💻 cs.SE · cs.DB

Recognition: unknown

QMutBench: A Dataset of Quantum Circuit Mutants

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 08:46 UTC · model grok-4.3

classification 💻 cs.SE cs.DB
keywords quantum software testing · mutation testing · quantum circuits · benchmarks · datasets · fault injection

The pith

QMutBench supplies over 700,000 quantum circuit mutants as standardized benchmarks for testing techniques.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents QMutBench as an online dataset of quantum circuit mutants to let developers evaluate how effectively their test cases detect faults in quantum programs. Existing testing techniques lack common faulty-program benchmarks, so it has been hard to measure or compare their quality. The resource supplies selection filters for original circuits, mutant survival rates, and mutation types so users can pull tailored sets for assessment or for building new mutation-based methods.

Core claim

QMutBench is a dataset containing over 700,000 quantum circuit mutants that represent different faults; it is accessible through an online interface that supports selection by original circuit, desired survival rate, and mutation characteristics such as faulty gate type.

What carries the argument

The online interface and filtering criteria that let users retrieve subsets of mutants to serve as fault benchmarks.

If this is right

  • Developers can now measure test-suite quality by counting how many mutants each suite detects.
  • Different testing techniques become directly comparable on identical mutant collections.
  • Researchers can create new testing methods guided by the mutation operators already present in the dataset.
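The mutation-score arithmetic behind the first two bullets can be sketched directly: a suite's quality is the fraction of mutants it kills, and two suites become comparable once they are scored on the same mutant set. The circuit and test names below are hypothetical, purely for illustration, and do not reflect the QMutBench API.

```python
# Minimal sketch of mutation-score computation. A mutant is "killed" when
# at least one test in the suite detects its fault; the score is the
# fraction of mutants killed. All names here are hypothetical.

def mutation_score(mutants, test_suite):
    """Fraction of mutants killed by at least one test in the suite."""
    if not mutants:
        return 0.0
    killed = sum(1 for m in mutants if any(test(m) for test in test_suite))
    return killed / len(mutants)

# Toy mutants: each records which gate type was corrupted.
mutants = [
    {"faulty_gate": "x"},
    {"faulty_gate": "h"},
    {"faulty_gate": "cx"},
    {"faulty_gate": "h"},
]

# Toy detectors: each returns True when it catches a given fault.
detects_h = lambda m: m["faulty_gate"] == "h"
detects_x = lambda m: m["faulty_gate"] == "x"

suite_a = [detects_h]               # kills the two "h" mutants
suite_b = [detects_h, detects_x]    # additionally kills the "x" mutant

print(mutation_score(mutants, suite_a))  # 0.5
print(mutation_score(mutants, suite_b))  # 0.75
```

Scoring both suites on the identical mutant list is what makes the comparison fair, which is the point of a shared benchmark.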

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Widespread adoption could create a de-facto standard for reporting test effectiveness in quantum software papers.
  • If the mutants prove unrepresentative of hardware noise, the dataset may need later calibration against real device error models.
  • The same generation and hosting approach could be reused for other quantum programming languages or circuit representations.

Load-bearing premise

The generated mutants represent faults that are both representative of real quantum hardware errors and useful for distinguishing effective test suites from ineffective ones.

What would settle it

Apply several published quantum testing techniques to the same mutant subsets and measure whether the fraction of mutants killed consistently ranks the techniques in the same order as independent real-hardware fault-injection experiments.
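The validation proposed above reduces to a rank-agreement check: score each technique by fraction of mutants killed, rank the techniques, and compare that ranking with one obtained from real-hardware fault injection. A minimal sketch, with entirely hypothetical technique names and numbers:

```python
# Sketch of the proposed settling experiment: do mutation scores rank
# testing techniques in the same order as hardware fault-injection
# detection rates? All values below are hypothetical.

def rank_order(scores):
    """Technique names sorted from best to worst score."""
    return sorted(scores, key=scores.get, reverse=True)

mutation_scores = {"techA": 0.91, "techB": 0.74, "techC": 0.58}
hardware_rates  = {"techA": 0.88, "techB": 0.69, "techC": 0.61}

consistent = rank_order(mutation_scores) == rank_order(hardware_rates)
print(consistent)  # True: both rank techA > techB > techC
```

A full study would use a rank-correlation statistic (e.g. Kendall's tau) over many technique pairs rather than exact order equality, but the exact-order check conveys the shape of the experiment.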

Figures

Figures reproduced from arXiv: 2604.15870 by Eñaut Mendiluze Usandizaga, Paolo Arcaini, Shaukat Ali, Thomas Laurent.

Figure 1. An example of a quantum circuit alongside examples of …
Figure 2. Distribution of mutants across survival rate ranges (the survival rate, defined in a previous study [9], measures the likelihood of a mutant's survival).
Figure 4. QMutBench online interface, with the "Generic Selection" panel as its first panel.
Figure 5. Structure of the downloaded folder: one subfolder for the original circuits and another for the mutated circuits; the "mutants" folder is further organised by the selected algorithm and qubit settings, and each file name encodes the operator used, the modified gate, and the position of the change, as done by some classic mutation analysis tools [14].
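The Figure 5 caption says each mutant file name encodes the operator, the modified gate, and the position of the change, but the exact naming scheme is not reproduced here. The parser below assumes a hypothetical `<operator>_<gate>_<position>.qasm` layout purely to illustrate how such metadata could be recovered from file names; the real scheme may differ.

```python
# Illustrative parser for mutant file names. The "<operator>_<gate>_<position>"
# pattern is an assumption for this sketch, not QMutBench's documented format.

def parse_mutant_filename(name):
    """Extract operator, gate, and position from a mutant file name."""
    stem = name.rsplit(".", 1)[0]          # drop the extension
    operator, gate, position = stem.split("_")
    return {"operator": operator, "gate": gate, "position": int(position)}

info = parse_mutant_filename("replace_cx_3.qasm")
print(info)  # {'operator': 'replace', 'gate': 'cx', 'position': 3}
```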
original abstract

Quantum software testing has attracted interest in recent years, prompting the development of various techniques to automate the testing of quantum software. These techniques generate test cases that must be assessed for their effectiveness in detecting faults. Such an assessment requires benchmarks of faulty programs. However, there is a lack of benchmarks containing faults. In this data showcase, we propose QMutBench, a dataset that contains over 700,000 quantum circuit mutants representing different faults. The dataset is accessible via an online interface with selection criteria, such as the original quantum circuit(s) from which mutants are generated, the desired survival rate of the selected mutants, and other mutation characteristics (e.g., the type of faulty quantum gate). QMutBench provides quantum software developers and testers with an accessible online dataset to obtain benchmarks of mutants necessary to assess either the quality of the test cases generated by their testing technique or to compare different testing techniques. It also enables the development of new mutation-guided quantum software testing techniques.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents QMutBench, a dataset of over 700,000 quantum circuit mutants generated from original circuits to represent faults in quantum software. It describes an online interface allowing selection of mutants by criteria including the source circuit, survival rate, and mutation characteristics such as faulty gate type. The authors position the resource as a benchmark to evaluate the fault-detection effectiveness of test suites produced by quantum testing techniques, to compare different techniques, and to support development of mutation-guided testing methods.

Significance. If the mutants are shown to be representative of realistic faults and capable of distinguishing effective from ineffective test suites, the dataset would address a clear gap in quantum software testing benchmarks and enable reproducible empirical evaluations. The provision of an online selection interface is a practical strength that supports usability for the community.

major comments (3)
  1. [Abstract and §3 (dataset generation)] The central utility claim—that the mutants serve as benchmarks to assess or compare test-suite quality—requires evidence that some mutants are killed by certain test suites but not others. No mutation-score experiments, survival-rate analysis, or comparison of detection rates across techniques are reported, leaving the discriminative power unverified.
  2. [Abstract and §4 (validation or realism)] The mutants are asserted to represent 'different faults,' yet no comparison is provided against real quantum hardware error models (e.g., depolarizing noise, T1/T2 relaxation, or gate-error distributions from IBM or Rigetti devices). Without such grounding, it is unclear whether the >700k mutants correspond to faults that occur in practice.
  3. [§2 (mutation operators)] The specific operators used to generate mutants from the original circuits are not enumerated or formally defined. This omission prevents assessment of whether the mutation set is comprehensive, non-redundant, or aligned with known quantum fault models.
minor comments (2)
  1. [§5] The online interface description would benefit from a screenshot or explicit list of all selectable fields to improve reproducibility for readers who cannot access the site immediately.
  2. [§3] Clarify the exact number of original circuits used as seeds and the distribution of mutant counts per seed circuit.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the positive assessment of QMutBench's potential utility and for the constructive major comments. We address each point below, indicating revisions where appropriate. As this is a data showcase paper, our focus is on releasing the dataset and interface rather than conducting full-scale empirical evaluations of testing techniques.

point-by-point responses
  1. Referee: Abstract and §3 (dataset generation): the central utility claim—that the mutants serve as benchmarks to assess or compare test-suite quality—requires evidence that some mutants are killed by certain test suites but not others. No mutation-score experiments, survival-rate analysis, or comparison of detection rates across techniques are reported, leaving the discriminative power unverified.

    Authors: We agree that the manuscript does not report mutation-score experiments or direct comparisons of test-suite detection rates across techniques. As a data showcase, the paper's contribution is the release of the >700k mutants and the online interface that already supports selection by precomputed survival rate (among other criteria). This allows users to obtain mutant sets with desired killability for their own evaluations. To address the concern, we will add a short subsection in §3 with aggregate statistics on survival-rate distributions across the source circuits and an example of how the interface can be used to select benchmark sets for technique comparison. Full cross-technique experiments remain outside the scope of this data paper. revision: partial
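The survival-rate filter the authors invoke above amounts to selecting a subset of the mutant catalogue whose precomputed rates fall inside a requested interval. A minimal sketch, with a hypothetical in-memory catalogue rather than the actual online interface:

```python
# Sketch of survival-rate filtering, mirroring the interface's selection
# criterion. The catalogue records and bounds are hypothetical.

def select_by_survival_rate(mutants, low, high):
    """Mutants whose precomputed survival rate lies in [low, high]."""
    return [m for m in mutants if low <= m["survival_rate"] <= high]

catalogue = [
    {"id": 1, "survival_rate": 0.05},   # easy to kill
    {"id": 2, "survival_rate": 0.40},
    {"id": 3, "survival_rate": 0.85},   # hard to kill
]

hard_to_kill = select_by_survival_rate(catalogue, 0.5, 1.0)
print([m["id"] for m in hard_to_kill])  # [3]
```

Requesting high-survival-rate mutants yields a deliberately challenging benchmark, which is how the filter supports comparisons between techniques of differing strength.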

  2. Referee: Abstract and §4 (validation or realism): the mutants are asserted to represent 'different faults,' yet no comparison is provided against real quantum hardware error models (e.g., depolarizing noise, T1/T2 relaxation, or gate-error distributions from IBM or Rigetti devices). Without such grounding, it is unclear whether the >700k mutants correspond to faults that occur in practice.

    Authors: The mutants are produced by applying syntactic mutation operators to quantum circuits drawn from established benchmarks; they are intended to represent programming-level faults rather than physical noise processes on specific hardware. We will revise the abstract and §4 to clarify this distinction and to note that the dataset does not claim to replicate hardware error distributions. A brief discussion of possible future extensions (e.g., weighting mutants by hardware error rates) will be added. No hardware-specific comparison data was collected for the current release. revision: yes

  3. Referee: §2 (mutation operators): the specific operators used to generate mutants from the original circuits are not enumerated or formally defined. This omission prevents assessment of whether the mutation set is comprehensive, non-redundant, or aligned with known quantum fault models.

    Authors: We thank the referee for pointing out this omission. Section 2 will be expanded to list and formally define every mutation operator (gate replacement, insertion, deletion, parameter perturbation, etc.), including the precise transformation rules and the source circuits to which they were applied. This addition will enable readers to evaluate coverage and alignment with quantum fault models. revision: yes
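The operator families the rebuttal names (gate replacement, insertion, deletion) can be sketched on a circuit modelled as a list of (gate, qubits) tuples. This mirrors gate-level mutation in general, not QMutBench's exact implementation, and the gate names are illustrative.

```python
# Sketch of three gate-level mutation operator families applied to a
# circuit represented as a list of (gate, qubits) tuples. Hypothetical
# model, not the tool's internal representation.

def replace_gate(circuit, pos, new_gate):
    """Swap the gate at `pos` for another, keeping its qubit operands."""
    mutant = list(circuit)
    mutant[pos] = (new_gate, mutant[pos][1])
    return mutant

def insert_gate(circuit, pos, gate, qubits):
    """Insert a new gate before position `pos`."""
    mutant = list(circuit)
    mutant.insert(pos, (gate, qubits))
    return mutant

def delete_gate(circuit, pos):
    """Remove the gate at position `pos`."""
    mutant = list(circuit)
    del mutant[pos]
    return mutant

original = [("h", [0]), ("cx", [0, 1])]     # a Bell-state preparation
m1 = replace_gate(original, 0, "x")          # [("x", [0]), ("cx", [0, 1])]
m2 = insert_gate(original, 1, "z", [1])      # extra Z between H and CX
m3 = delete_gate(original, 1)                # [("h", [0])]
```

Enumerating every operator at every applicable position across many seed circuits is what drives the mutant count into the hundreds of thousands.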

Circularity Check

0 steps flagged

No circularity: dataset release paper with no derivation or fitted results

full rationale

The paper is a data showcase describing the construction and online release of QMutBench, a collection of >700k mutants generated from quantum circuits via mutation operators. No equations, predictions, first-principles derivations, or parameter-fitting steps are present, so none of the enumerated circularity patterns (self-definitional, fitted-input-as-prediction, self-citation load-bearing, etc.) can apply. The central claim that the dataset enables assessment of test suites rests on an untested assumption about mutant realism, but this is an external-validity issue rather than a logical loop in which any result reduces to its own inputs by construction. The work is therefore self-contained as an artifact contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The dataset construction implicitly assumes that gate-level mutations produce representative faults for quantum programs and that survival-rate filtering yields useful benchmarks; these assumptions are not evidenced in the abstract.

pith-pipeline@v0.9.0 · 5473 in / 1041 out tokens · 25018 ms · 2026-05-10T08:46:01.693031+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

14 extracted references · 1 canonical work page

  1. [1]

    Quantum software engineering: Roadmap and challenges ahead,

    J. M. Murillo, J. Garcia-Alonso, E. Moguel, J. Barzen, F. Leymann, S. Ali, T. Yue, P. Arcaini, R. Pérez-Castillo, I. García-Rodríguez de Guzmán, M. Piattini, A. Ruiz-Cortés, A. Brogi, J. Zhao, A. Miranskyy, and M. Wimmer, "Quantum software engineering: Roadmap and challenges ahead," ACM Trans. Softw. Eng. Methodol., vol. 34, no. 5, May 2025

  2. [2]

    Testing and debugging quantum programs: The road to 2030,

    N. C. Leite Ramalho, H. Amario de Souza, and M. Lordello Chaim, "Testing and debugging quantum programs: The road to 2030," ACM Trans. Softw. Eng. Methodol., vol. 34, no. 5, May 2025

  3. [3]

    Quantum program testing through commuting pauli strings on IBM’s quantum computers,

    A. Muqeet, S. Ali, and P. Arcaini, "Quantum program testing through commuting Pauli strings on IBM's quantum computers," in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE '24. New York, NY, USA: Association for Computing Machinery, 2024, pp. 2130–2141

  4. [4]

    Assessing the effectiveness of input and output coverage criteria for testing quantum programs,

    S. Ali, P. Arcaini, X. Wang, and T. Yue, "Assessing the effectiveness of input and output coverage criteria for testing quantum programs," in 2021 IEEE 14th International Conference on Software Testing, Validation and Verification (ICST), 2021, pp. 13–23

  5. [5]

    Bugs4Q: A benchmark of real bugs for quantum programs,

    P. Zhao, J. Zhao, Z. Miao, and S. Lan, "Bugs4Q: A benchmark of real bugs for quantum programs," in 2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE), 2021, pp. 1373–1376

  6. [6]

    QBugs: A collection of reproducible bugs in quantum algorithms and a supporting infrastructure to enable controlled quantum software testing and debugging experiments,

    J. Campos and A. Souto, "QBugs: A collection of reproducible bugs in quantum algorithms and a supporting infrastructure to enable controlled quantum software testing and debugging experiments," in 2021 IEEE/ACM 2nd International Workshop on Quantum Software Engineering (Q-SE). Los Alamitos, CA, USA: IEEE Computer Society, Jun. 2021, pp. 28–32

  7. [7]

    Muskit: A mutation analysis tool for quantum software testing,

    E. Mendiluze Usandizaga, S. Ali, P. Arcaini, and T. Yue, "Muskit: A mutation analysis tool for quantum software testing," in Proceedings of the 36th IEEE/ACM International Conference on Automated Software Engineering, ser. ASE '21. IEEE Press, 2022, pp. 1266–1270

  8. [8]

    QMutPy: A mutation testing tool for quantum algorithms and applications in Qiskit,

    D. Fortunato, J. Campos, and R. Abreu, "QMutPy: A mutation testing tool for quantum algorithms and applications in Qiskit," in Proceedings of the 31st ACM SIGSOFT International Symposium on Software Testing and Analysis, ser. ISSTA 2022. New York, NY, USA: Association for Computing Machinery, 2022, pp. 797–800

  9. [9]

    Quantum circuit mutants: Empirical analysis and recommendations,

    E. Mendiluze Usandizaga, S. Ali, T. Yue, and P. Arcaini, "Quantum circuit mutants: Empirical analysis and recommendations," Empirical Software Engineering, vol. 30, no. 4, p. 100, Apr 2025

  10. [10]

    Open Quantum Assembly Language

    A. W. Cross, L. S. Bishop, J. A. Smolin, and J. M. Gambetta, "Open quantum assembly language," arXiv preprint arXiv:1707.03429, 2017

  11. [11]

    N. S. Yanofsky and M. A. Mannucci, Quantum Computing for Computer Scientists. Cambridge University Press, 2008

  12. [12]

    IBM quantum composer,

    IBM, “IBM quantum composer,” 2025

  13. [13]

    Chapter six - mutation testing advances: An analysis and survey,

    M. Papadakis, M. Kintis, J. Zhang, Y. Jia, Y. Le Traon, and M. Harman, "Chapter six - mutation testing advances: An analysis and survey," ser. Advances in Computers, A. M. Memon, Ed. Elsevier, 2019, vol. 112, pp. 275–378

  14. [14]

    Mujava: an automated class mutation system,

    Y .-S. Ma, J. Offutt, and Y . R. Kwon, “Mujava: an automated class mutation system,”Software Testing, Verification and Reliability, vol. 15, no. 2, pp. 97–133, 2005