pith. sign in

arxiv: 2605.27210 · v1 · pith:BYLK2BUZnew · submitted 2026-05-26 · 🪐 quant-ph · cs.AI

Qiskit QuantumKatas: Adapting Microsoft's Quantum Computing exercises for LLM evaluation

Pith reviewed 2026-06-29 16:33 UTC · model grok-4.3

classification 🪐 quant-ph cs.AI
keywords quantum computinglarge language modelsbenchmark evaluationQiskitQuantumKatasprompt engineeringalgorithm implementationGrover algorithm
0
0 comments X

The pith

Adapting QuantumKatas to Qiskit produces a benchmark that distinguishes LLM quantum capabilities with pass rates from 32.3% to 83.1%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors convert Microsoft's QuantumKatas exercises into Qiskit format to form a new evaluation set for testing LLMs on quantum computing tasks. This includes 350 problems across 26 categories with prompts, solutions, and automatic verification through simulation. Evaluation of 16 models under different prompting methods reveals that the benchmark can separate strong from weak performers and identifies where models succeed or fail, such as in algorithm implementation versus problem formulation. The work provides the adapted tasks and results as a resource for further study of LLM quantum skills.

Core claim

By adapting the QuantumKatas curriculum to Qiskit, the paper establishes a benchmark consisting of 350 tasks that can be used to assess LLMs' ability to generate correct quantum code, demonstrating differentiation in capabilities through empirical runs on multiple models and prompting strategies.

What carries the argument

The Qiskit-adapted QuantumKatas benchmark comprising 350 tasks with natural language prompts, canonical solutions, and deterministic test verification via classical circuit simulation.

If this is right

  • Models achieve higher success rates on implementing known algorithms like SimonsAlgorithm (82.1%) and BasicGates (81.6%) compared to problem encoding tasks like SolveSATWithGrover (34.4%).
  • Frontier models show an average 26.1 percentage point advantage over open-source models in best-configuration pass rates.
  • Chain-of-thought prompting improves results for some models but reduces them for others, ranking mid-pack overall at 56.3% mean pass rate.
  • Best configurations yield pass rates between 32.3% and 83.1% across the tested LLMs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers could use this benchmark to identify training data gaps in LLMs regarding quantum problem-solving approaches.
  • Extending the benchmark with noisy simulations might reveal how well models handle realistic quantum hardware constraints.
  • High-performing models on this test may be better suited for assisting in the development of new quantum algorithms beyond the curriculum.

Load-bearing premise

That the 350 tasks adapted from the original QuantumKatas curriculum, together with deterministic classical simulation checks, provide a faithful and unbiased measure of an LLM's actual quantum-computing capability rather than merely its ability to produce syntactically plausible code.

What would settle it

Observation that LLMs scoring high on the benchmark produce incorrect outputs when asked to solve quantum problems outside the 26 categories or when verification includes hardware noise models.

Figures

Figures reproduced from arXiv: 2605.27210 by Ismael Faro, Juan Cruz-Benito.

Figure 1
Figure 1. Figure 1: Model performance on the Qiskit QuantumKatas benchmark. Bars show pass rates for each model’s best [PITH_FULL_IMAGE:figures/full_fig_p009_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Effect of prompting strategies on model performance. Box plots show the distribution of pass rates across all [PITH_FULL_IMAGE:figures/full_fig_p011_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Pass rate heatmap showing performance of the top 15 models across prompting strategies (zero-shot to 5-shot [PITH_FULL_IMAGE:figures/full_fig_p012_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Task category difficulty analysis. Aggregate pass rates across all models’ best configurations. Categories [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Error type distribution across all models and configurations. Bars are colored by error category: red indicates [PITH_FULL_IMAGE:figures/full_fig_p016_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Model performance by pedagogical difficulty tier. Box plots show the distribution of pass rates across all [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Frontier vs. open-source model performance. Box plots show the distribution of pass rates within each [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗
read the original abstract

We adapt Microsoft's QuantumKatas -- a well-established quantum computing curriculum -- from Q# to Qiskit, the most widely-adopted quantum computing framework, and package it with an evaluation framework for systematic LLM assessment. The resulting benchmark comprises 350 tasks across 26 categories, spanning fundamental gates through advanced algorithms (Grover's, Simon's, Deutsch-Jozsa), error correction, key distribution, and quantum games. Each task includes a natural language prompt, canonical solution, and deterministic test verification via classical circuit simulation. By building on the QuantumKatas' proven pedagogical design rather than creating tasks from scratch, we inherit a principled difficulty progression and comprehensive concept coverage while contributing the framework adaptation, evaluation infrastructure, and empirical analysis. We evaluate 16 LLMs across 7 prompting configurations -- a total of 39,200 model runs -- to demonstrate the benchmark's utility. Three key findings emerge: (1) the benchmark effectively differentiates model capabilities, with best-configuration pass rates ranging from 32.3% to 83.1% and a 26.1 pp average gap between frontier and open-source models; (2) models perform well at implementing known algorithms (SimonsAlgorithm 82.1%, BasicGates 81.6%) but struggle with problem encoding (SolveSATWithGrover 34.4%, DistinguishUnitaries 40.0%); and (3) chain-of-thought prompting shows a modestly bimodal effect -- it is the best strategy for three models (two of them explicitly reasoning-tuned per vendor documentation) but degrades performance for the rest, leaving it mid-pack in aggregate (56.3% mean) behind few-shot-5 (57.8%). We release the benchmark, evaluation framework, and baseline results to support research on LLM capabilities in quantum computing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The manuscript adapts Microsoft's QuantumKatas curriculum from Q# to Qiskit, producing a benchmark of 350 tasks across 26 categories that span basic gates to advanced algorithms, error correction, and quantum games. Each task supplies a natural-language prompt, canonical solution, and deterministic verification via classical circuit simulation. The authors evaluate 16 LLMs under 7 prompting regimes (39,200 total runs), reporting clear model differentiation (best-configuration pass rates 32.3–83.1 %), category-specific patterns (strong on known algorithms such as SimonsAlgorithm at 82.1 %, weak on problem encoding such as SolveSATWithGrover at 34.4 %), and a modestly bimodal effect of chain-of-thought prompting. The benchmark, evaluation framework, and baseline results are released.

Significance. If the results hold, the work supplies a pedagogically grounded, non-synthetic benchmark that inherits a proven difficulty progression and uses deterministic verification, enabling reproducible assessment of LLM quantum-computing capabilities. The scale of the evaluation and the observed separation between frontier and open-source models, together with the public release of code and tasks, constitute a concrete resource for the community.

minor comments (3)
  1. [Abstract] Abstract: the reported 26.1 pp average gap between frontier and open-source models is stated without an accompanying definition of the two groups or the exact models assigned to each; this should be clarified in the main text or a table.
  2. The description of the 350 tasks would benefit from an explicit table listing the 26 categories together with task counts per category, to allow readers to assess coverage balance.
  3. The claim that chain-of-thought is 'mid-pack in aggregate (56.3 % mean)' should be accompanied by the exact ranking and scores of all seven prompting regimes for transparency.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, recognition of the benchmark's significance, and recommendation of minor revision. The evaluation of 16 LLMs across 39,200 runs on the adapted 350-task QuantumKatas benchmark is intended to provide a reproducible, pedagogically grounded resource, and we are glad this is acknowledged.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper reports empirical results from running 16 LLMs on 350 tasks (39,200 total invocations) adapted from Microsoft's external QuantumKatas curriculum, with correctness verified by deterministic classical circuit simulation. Pass rates, category gaps, and prompting effects are direct measurements against released tasks and simulators; no equations, fitted parameters, or self-referential definitions appear. The adaptation inherits an established pedagogical progression rather than deriving tasks or metrics from the present evaluation itself. No load-bearing self-citation chains, uniqueness theorems, or reductions of claimed quantities to inputs by construction are present. This is a standard benchmark release paper whose differentiation claims are internally consistent with the stated methodology.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central contribution rests on the assumption that the original QuantumKatas pedagogical design transfers intact after language adaptation and that classical circuit simulation provides a reliable oracle for correctness; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The original QuantumKatas tasks supply a principled difficulty progression and comprehensive concept coverage that remains valid after translation to Qiskit.
    The paper explicitly states it inherits this design rather than creating tasks from scratch.

pith-pipeline@v0.9.1-grok · 5864 in / 1424 out tokens · 53372 ms · 2026-06-29T16:33:09.401075+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 25 canonical work pages · 11 internal anchors

  1. [1]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

  2. [2]

    Program Synthesis with Large Language Models

    23 Qiskit QuantumKatas for LLM EvaluationA PREPRINT Jacob Austin, Augustus Odena, Maxwell Nye, Maarten Bosma, Henryk Michalewski, David Dohan, Ellen Jiang, Carrie Cai, Michael Terry, Quoc Le, et al. Program synthesis with large language models.arXiv preprint arXiv:2108.07732,

  3. [3]

    Ds-1000: A natural and reliable benchmark for data science code generation.arXiv preprint arXiv:2211.11501,

    Yuhang Lai, Chengxi Li, Yiming Wang, Tianyi Zhang, Ruiqi Zhong, Luke Zettlemoyer, Scott Wen-tau Yih, Daniel Fried, Sida Wang, and Tao Yu. Ds-1000: A natural and reliable benchmark for data science code generation.arXiv preprint arXiv:2211.11501,

  4. [4]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770,

  5. [5]

    Qiskit humaneval: An evaluation benchmark for quantum code generative models.arXiv preprint arXiv:2406.14712,

    Sanjay Vishwakarma, Francis Harkins, Siddharth Golecha, Vishal Sharathchandra Bajpe, Nicolas Dupuis, Luca Buratti, David Kremer, Ismael Faro, Ruchir Puri, and Juan Cruz-Benito. Qiskit humaneval: An evaluation benchmark for quantum code generative models.arXiv preprint arXiv:2406.14712,

  6. [6]

    Quantumbench: A benchmark for quantum problem solving.arXiv preprint arXiv:2511.00092,

    Shunya Minami, Tatsuya Ishigaki, Ikko Hamamura, Taku Mikuriya, Youmi Ma, Naoaki Okazaki, Hiroya Takamura, Yohichi Suzuki, and Tadashi Kadowaki. Quantumbench: A benchmark for quantum problem solving.arXiv preprint arXiv:2511.00092,

  7. [7]

    Quantum computing with Qiskit

    Accessed: 2025-01-14. Ali Javadi-Abhari, Matthew Treinish, Kevin Krsulich, Christopher J Wood, Jake Lishman, Julien Gacon, Simon Martiel, Paul D Nation, Lev S Bishop, Andrew W Cross, et al. Quantum computing with Qiskit.arXiv preprint arXiv:2405.08810,

  8. [8]

    Zenodo (2019)

    URLhttps://doi.org/10.5281/zenodo.2562111. Unitary Foundation. Quantum open source software survey 2024 results. https://unitary.foundation/posts/ 2025_survey_results/,

  9. [9]

    Minyang Tian, Luyu Gao, et al

    Accessed: 2025-01-22. Minyang Tian, Luyu Gao, et al. Scicode: A research coding benchmark curated by scientists.arXiv preprint arXiv:2407.13168,

  10. [10]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. Gpqa: A graduate-level google-proof q&a benchmark.arXiv preprint arXiv:2311.12022,

  11. [11]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset.arXiv preprint arXiv:2103.03874,

  12. [12]

    QuantumChem-200K: A large-scale open organic molecular dataset for quantum-chemistry property screening and language model benchmarking.arXiv preprint arXiv:2511.21747,

    Yinqi Zeng and Renjie Li. QuantumChem-200K: A large-scale open organic molecular dataset for quantum-chemistry property screening and language model benchmarking.arXiv preprint arXiv:2511.21747,

  13. [13]

    Nicolas Dupuis, Luca Buratti, Sanjay Vishwakarma, Aitana Viudes Forrat, David Kremer, Ismael Faro, Ruchir Puri, and Juan Cruz-Benito

    Accessed: 2025-01-14. Nicolas Dupuis, Luca Buratti, Sanjay Vishwakarma, Aitana Viudes Forrat, David Kremer, Ismael Faro, Ruchir Puri, and Juan Cruz-Benito. Qiskit code assistant: Training LLMs for generating quantum computing code. In2024 IEEE LLM Aided Design Workshop (LAD), pages 1–4,

  14. [14]

    Nicolas Dupuis, Adarsh Tiwari, Youssef Mroueh, David Kremer, Ismael Faro, and Juan Cruz-Benito

    doi:10.1109/LAD62341.2024.10691762. Nicolas Dupuis, Adarsh Tiwari, Youssef Mroueh, David Kremer, Ismael Faro, and Juan Cruz-Benito. Quantum verifiable rewards for post-training qiskit code assistant.arXiv preprint arXiv:2508.20907,

  15. [15]

    Qcoder bench- mark: Bridging language generation and quantum hardware through simulator-based feedback.arXiv preprint arXiv:2510.26101,

    24 Qiskit QuantumKatas for LLM EvaluationA PREPRINT Taku Mikuriya, Tatsuya Ishigaki, Masayuki Kawarada, Shunya Minami, Tadashi Kadowaki, Yohichi Suzuki, Soshun Naito, Shunya Takata, Takumi Kato, Tamotsu Basseda, Reo Yamada, and Hiroya Takamura. Qcoder bench- mark: Bridging language generation and quantum hardware through simulator-based feedback.arXiv pre...

  16. [16]

    Quanbench: Benchmarking quantum code generation with large language models.arXiv preprint arXiv:2510.16779, 2025a

    Xiaoyu Guo, Minggu Wang, and Jianjun Zhao. Quanbench: Benchmarking quantum code generation with large language models.arXiv preprint arXiv:2510.16779, 2025a. Accepted at ASE

  17. [17]

    QuanBench+: A Unified Multi-Framework Benchmark for LLM-Based Quantum Code Generation

    Slim et al. QuanBench+: A unified multi-framework benchmark for LLM-based quantum code generation.arXiv preprint arXiv:2604.08570,

  18. [18]

    A PennyLane-Centric Dataset to Enhance LLM-based Quantum Code Generation using RAG

    Abdul Basit et al. PennyLang: Pioneering LLM-based quantum code generation with a novel PennyLane-centric dataset. arXiv preprint arXiv:2503.02497, 2025a. Rui Yang, Ziruo Wang, Yuntian Gu, Tianyi Chen, Yitao Liang, and Tongyang Li. Qcircuitbench: A large-scale dataset for benchmarking quantum algorithm design.arXiv preprint arXiv:2410.07961,

  19. [19]

    Qhackbench: Benchmarking large language models for quantum code generation using pennylane hackathon challenges.arXiv preprint arXiv:2506.20008, 2025b

    Abdul Basit, Minghao Shao, Muhammad Haider Asif, Nouhaila Innan, Muhammad Kashif, Alberto Marchisio, and Muhammad Shafique. Qhackbench: Benchmarking large language models for quantum code generation using pennylane hackathon challenges.arXiv preprint arXiv:2506.20008, 2025b. Paz et al. StabilizerBench: A benchmark for AI-assisted quantum error correction ...

  20. [20]

    QC-Bench: What do language models know about quantum computing? https://openreview.net/forum?id=hrDlJGrPqc, 2026a

    Mohamed Afane, Kayla Laufer, Wenqi Wei, Ying Mao, Junaid Farooq, Ying Wang, and Juntao Chen. QC-Bench: What do language models know about quantum computing? https://openreview.net/forum?id=hrDlJGrPqc, 2026a. Afane et al. Quantum-Audit: Evaluating the reasoning limits of LLMs on quantum computing.arXiv preprint arXiv:2602.10092, 2026b. Cao et al. QCalEval:...

  21. [21]

    QuantumQA: Enhancing Scientific Reasoning via Physics-Consistent Dataset and Verification-Aware Reinforcement Learning

    Qu et al. QuantumQA: Enhancing scientific reasoning via physics-consistent dataset and verification-aware reinforce- ment learning.arXiv preprint arXiv:2604.18176,

  22. [22]

    QUASAR – quantum assembly code generation using tool-augmented LLMs via agentic RL.arXiv preprint arXiv:2510.00967,

    Cong Yu, Valter Uotila, Shilong Deng, Qingyuan Wu, Tuo Shi, Songlin Jiang, Lei You, and Bo Zhao. QUASAR – quantum assembly code generation using tool-augmented LLMs via agentic RL.arXiv preprint arXiv:2510.00967,

  23. [23]

    M2qcode: A model-driven framework for generating multi-platform quantum programs.arXiv preprint arXiv:2510.17110, 2025b

    Xiaoyu Guo, Shinobu Saito, and Jianjun Zhao. M2qcode: A model-driven framework for generating multi-platform quantum programs.arXiv preprint arXiv:2510.17110, 2025b. Accepted at ASE

  24. [24]

    PennyCoder: Efficient domain-specific LLMs for PennyLane-based quantum code generation

    Abdul Basit et al. PennyCoder: Efficient domain-specific LLMs for PennyLane-based quantum code generation. In 2025 IEEE International Conference on Quantum Computing and Engineering (QCE), 2025c. Shlomo Kashani. QuantumLLMInstruct: A 500k LLM instruction-tuning dataset with problem-solution pairs for quantum computing.arXiv preprint arXiv:2412.20956,

  25. [25]

    David Deutsch and Richard Jozsa

    Accessed: 2025-01-14. David Deutsch and Richard Jozsa. Rapid solution of problems by quantum computation.Proceedings of the Royal Society of London. Series A: Mathematical and Physical Sciences, 439(1907):553–558,

  26. [26]

    OpenAI GPT-5 System Card

    OpenAI. Gpt-5 system card.arXiv preprint arXiv:2601.03267, 2025a. Google DeepMind. Gemini 3 pro model card. https://storage.googleapis.com/deepmind-media/ Model-Cards/Gemini-3-Pro-Model-Card.pdf,

  27. [27]

    Mistral AI

    Accessed: 2025-01-22. Mistral AI. Mistral-large-3-675b-instruct. https://huggingface.co/mistralai/ Mistral-Large-3-675B-Instruct-2512, 2025a. Accessed: 2025-01-22. Meta AI. Llama 4: Multimodal intelligence. https://ai.meta.com/blog/ llama-4-multimodal-intelligence/,

  28. [28]

    Accessed: 2025-01-22. OpenAI. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025b. Google DeepMind. Gemma 4 language models. https://huggingface.co/google/gemma-4-31b-it ,

  29. [29]

    IBM Research

    Accessed: 2026-04-30. IBM Research. Granite 4.1 language models. https://huggingface.co/collections/ibm-granite/ granite-41-language-models,

  30. [30]

    Mistral AI

    Accessed: 2026-04-30. Mistral AI. Mistral-small-3.2-24b-instruct. https://huggingface.co/mistralai/Mistral-Small-3. 2-24B-Instruct-2506, 2025b. Accessed: 2025-01-22. Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed Chi, Quoc Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models.Advances ...