pith. machine review for the scientific record.

arxiv: 2603.02804 · v2 · submitted 2026-03-03 · 🪐 quant-ph · cs.DC · cs.ET

Recognition: 1 theorem link

· Lean Theorem

Fast and memory-efficient classical simulation of quantum machine learning via forward and backward gate fusion

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 17:09 UTC · model grok-4.3

classification 🪐 quant-ph · cs.DC · cs.ET
keywords quantum machine learning · classical simulation · gate fusion · gradient computation · GPU acceleration · variational quantum algorithms · hardware-efficient ansatz · memory optimization

The pith

Fusing consecutive gates in forward and backward passes cuts memory traffic and speeds classical simulation of quantum machine learning circuits by roughly 20 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes fusing multiple consecutive quantum gates along both the forward evaluation and backward gradient paths during classical simulation of quantum circuits. This reduces repeated global memory accesses on GPUs, which dominate runtime when computing gradients for deep variational circuits. The approach yields about 20 times higher throughput on hardware-efficient ansatz circuits of 12 qubits or more, with gains exceeding 30 times on bandwidth-limited consumer GPUs. When combined with gradient checkpointing, memory demand falls enough to train a 20-qubit model with 1,000 layers and 60,000 parameters on 1,000 samples in roughly 20 minutes per epoch. The result makes it feasible to run quantum machine learning experiments on datasets the size of MNIST or CIFAR-10 within hours rather than days on ordinary hardware.

Core claim

By fusing consecutive gates in the forward and backward paths, the simulation minimizes global memory accesses during circuit evaluation and gradient calculation. This produces approximately 20 times higher throughput for hardware-efficient ansatz circuits with 12 or more qubits, and more than 30 times on mid-range consumer GPUs. Together with checkpointing, it permits training a 20-qubit, 1,000-layer model containing 60,000 parameters on 1,000 samples in about 20 minutes per epoch, thereby enabling large-dataset experiments on classical machines.

What carries the argument

Forward-and-backward gate fusion, which merges sequences of consecutive gates before they are applied to state vectors or gradients so that intermediate results stay in fast on-chip memory instead of being written to and read from global memory.
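The mechanics can be sketched in a few lines of NumPy (the paper's implementation uses custom Triton GPU kernels; the gate choices and sizes below are illustrative, not the paper's benchmark configuration). Fusing m consecutive single-qubit gates on one qubit collapses them into a single 2×2 matrix, so the 2^n-element state vector is traversed once instead of m times:

```python
import numpy as np

def rx(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def rz(theta):
    p = np.exp(-1j * theta / 2)
    return np.array([[p, 0], [0, p.conj()]])

def apply_1q(state, gate, qubit, n):
    # One full pass over the 2**n-element state vector.
    psi = state.reshape([2] * n)
    psi = np.moveaxis(psi, qubit, 0)
    psi = np.tensordot(gate, psi, axes=([1], [0]))
    psi = np.moveaxis(psi, 0, qubit)
    return psi.reshape(-1)

n, qubit, m = 4, 1, 8
rng = np.random.default_rng(0)
state = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
state /= np.linalg.norm(state)
thetas = rng.uniform(0, 2 * np.pi, size=m)
gates = [rx(t) if i % 2 == 0 else rz(t) for i, t in enumerate(thetas)]

# Unfused: m separate passes over the state vector (m rounds of memory traffic).
seq = state.copy()
for g in gates:
    seq = apply_1q(seq, g, qubit, n)

# Fused: multiply the m 2x2 matrices first, then make a single pass.
fused = np.eye(2, dtype=complex)
for g in gates:
    fused = g @ fused
out = apply_1q(state, fused, qubit, n)

assert np.allclose(out, seq)
```

On a GPU, each avoided pass is a full read and write of the state vector through global memory, which is why the savings scale with the fusion width on bandwidth-limited hardware.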

If this is right

  • Throughput rises by a factor of about 20 for hardware-efficient ansatz circuits of 12 or more qubits.
  • Speedups exceed 30 times on GPUs whose performance is limited by memory bandwidth.
  • Memory footprint shrinks enough to fit and train 20-qubit circuits with 1,000 layers and 60,000 parameters on 1,000 samples in 20 minutes per epoch.
  • Training on datasets of tens of thousands of samples becomes practical within roughly 20 hours per epoch.
  • Verification of quantum machine learning algorithms and study of deep-circuit phenomena such as barren plateaus can be performed on realistic data volumes.
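The memory claim rests on the checkpointing tradeoff quoted alongside the paper's Figure 2: storing a checkpoint every b layers costs roughly ⌈l_var/α⌉·b recomputation buffers plus d/b checkpoints, minimized at b* = √(d/⌈l_var/α⌉). A minimal sketch of that estimate (the l_var and α values here are illustrative placeholders, not the paper's measured configuration):

```python
import math

def stored_vectors(d, l_var, alpha, b):
    # Paper's estimate (its eq. 7): ceil(l_var/alpha)*b recomputed states
    # per checkpoint block, plus d/b stored checkpoints across d layers.
    return math.ceil(l_var / alpha) * b + d / b

d, l_var, alpha = 1000, 3, 9        # illustrative: 1,000 layers, 3 variational
                                    # gates per layer, fusion width alpha = 9
k = math.ceil(l_var / alpha)        # = 1
b_opt = math.sqrt(d / k)            # optimal block size b* (its eq. 8)
m_min = 2 * math.sqrt(k * d)        # AM-GM lower bound on stored vectors

assert abs(stored_vectors(d, l_var, alpha, b_opt) - m_min) < 1e-9
# Away from b*, storage grows: b = 1 keeps d checkpoints, far above the bound.
assert stored_vectors(d, l_var, alpha, 1) > m_min
```

Under these placeholder values the 1,000-layer circuit needs on the order of 2√1000 ≈ 63 state vectors rather than one per gate, which is the mechanism behind the 20-qubit, 1,000-layer training claim.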

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion strategy could be applied to other variational quantum algorithms that rely on repeated circuit evaluations, such as quantum approximate optimization or quantum chemistry simulations.
  • Combining gate fusion with tensor-network or low-rank approximations might allow scaling beyond 20 qubits while keeping memory usage manageable.
  • The reduced per-epoch time could make systematic hyperparameter searches over circuit depth or ansatz structure feasible on single-GPU workstations.
  • Empirical checks of learning-theory predictions for very deep quantum circuits become experimentally accessible on standard hardware.

Load-bearing premise

Gate fusion works for typical quantum machine learning circuits without introducing numerical inaccuracies or extra overhead that would cancel the memory-access savings.
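One cheap way to probe the numerical half of this premise is to compare a long fused product of gate matrices against a double-precision reference. The rough NumPy check below (illustrative, not the paper's benchmark) suggests that fusing even a thousand 2×2 rotations keeps the error near single-precision roundoff:

```python
import numpy as np

def rx(theta, dtype):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]], dtype=dtype)

def rz(theta, dtype):
    p = np.exp(-1j * theta / 2)
    return np.array([[p, 0], [0, p.conj()]], dtype=dtype)

rng = np.random.default_rng(1)
thetas = rng.uniform(0, 2 * np.pi, size=1000)

def fuse_chain(dtype):
    # Multiply 1,000 alternating Rx/Rz matrices into a single fused gate.
    g = np.eye(2, dtype=dtype)
    for i, t in enumerate(thetas):
        g = (rx(t, dtype) if i % 2 == 0 else rz(t, dtype)) @ g
    return g

# Max entrywise deviation of the single-precision fused chain from a
# double-precision reference.
err = np.abs(fuse_chain(np.complex64) - fuse_chain(np.complex128)).max()
assert err < 1e-3
```

The overhead half of the premise (kernel-launch and fusion bookkeeping costs) cannot be checked this way and depends on the GPU implementation.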

What would settle it

Measuring wall-clock time and peak memory for the exact 20-qubit, 1,000-layer, 60,000-parameter training run on the same GPU hardware without any gate fusion and finding that it either exceeds available memory or takes longer than 20 minutes per epoch.

Figures

Figures reproduced from arXiv: 2603.02804 by Yoshiaki Kawase.

Figure 1
Figure 1. Conventional method and proposed method. In the PyTorch-native implementation of the adjoint method without gradient checkpointing, the input quantum states of all gates must be stored for the backward path, which determines the number of required state vectors. view at source ↗
Figure 2
Figure 2. Stored state vectors when combining the proposed method with gradient checkpointing. The estimated memory (bytes) is M_total(b) ≃ (⌈l_var/α⌉·b + d/b) × M_sv ≥ 2√(⌈l_var/α⌉·d) × M_sv, with equality at b* = √(d/⌈l_var/α⌉); for the proposed method (Triton fused), α corresponds to m. view at source ↗
Figure 3
Figure 3. A quantum circuit consisting of m consecutive Rx, Ry, and Rz gates, used to benchmark the effect of the proposed method. Circuits are applied to randomly initialized states |ψ₀⟩ rather than |0⟩⊗n to avoid artificial speedups caused by sparsity and to reflect the actual computational load and memory bandwidth utilization. view at source ↗
Figure 4
Figure 4. (a) Execution time per m-gate sequence and (b) total peak memory usage when applying m consecutive single-qubit gates on each qubit during the forward path (RTX 4090, RTX 5070, GH200). view at source ↗
Figure 5
Figure 5. (a) Execution time per fused gate and (b) total peak memory usage when applying m consecutive single-qubit gates on each qubit during the backward path (PyTorch native vs. Triton fused vs. Triton fused, memory-saving mode). view at source ↗
Figure 6
Figure 6. (a) Execution time per m-gate sequence, (b) total peak memory usage, and (c) execution-time ratio between Triton Fused and Triton Fused (mem save) when minimizing the sum of expectation values at the final quantum state after applying the fused single-qubit gates. view at source ↗
Figure 7
Figure 7. HEA and gate fusion method: the quantum circuit consists of single-qubit gates (Rx, Ry, Rz) and CZ gates. Dotted frames indicate the range of gate fusion; up to nine single-qubit gates, including gates acting on the same and adjacent qubits, and consecutive CZ gates are fused. view at source ↗
Figure 8
Figure 8. Throughputs for d = 1 of the HEA. view at source ↗
Figure 9
Figure 9. (a) Peak memory usage and (b) execution time when varying b for a 20-qubit, 1,000-layer (d = 1,000) HEA with a batch size of one. view at source ↗
Figure 10
Figure 10. Throughput improvement ratio of the proposed method relative to the PyTorch native implementation. view at source ↗
Figure 11
Figure 11. (a) Peak memory usage and (b) execution time when running a 100-layer HEA with 20 qubits and a batch size of 1, varying the checkpoint block size b. view at source ↗
read the original abstract

While real quantum devices have been increasingly used to conduct research focused on achieving quantum advantage or quantum utility in recent years, executing deep quantum circuits or performing quantum machine learning with large-scale data on current noisy intermediate-scale quantum devices remains challenging, making classical simulation essential for quantum machine learning research. However, such classical simulation often suffers from the cost of gradient calculations, requiring enormous memory or computational time. To address these problems, we propose a method to fuse multiple consecutive gates in each of the forward and backward paths to improve throughput by minimizing global memory accesses. As a result, we achieved approximately $20$ times throughput improvement for a Hardware-Efficient Ansatz with $12$ or more qubits, reaching over $30$ times improvement on a mid-range consumer GPU with limited memory bandwidth. By combining our proposed method with gradient checkpointing, we drastically reduced memory usage, making it possible to train a large-scale quantum machine learning model, a $20$-qubit, $1{,}000$-layer model with $60{,}000$ parameters, using $1{,}000$ samples in approximately $20$ minutes per epoch. This implies that we can train the model on large datasets, comprising tens of thousands of samples, like MNIST or CIFAR-10, within a realistic time frame (e.g., $20$ hours per epoch). Thus, our proposed method significantly accelerates such classical simulations, making a significant contribution to advancing research in quantum machine learning and variational quantum algorithms, such as verifying algorithms on large datasets or investigating learning theories of deep quantum circuits like barren plateaus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes fusing consecutive gates in the forward and backward passes of classical quantum circuit simulation to minimize global memory accesses, reporting ~20x throughput gains for Hardware-Efficient Ansatz circuits (12+ qubits) and >30x on bandwidth-limited GPUs. Combined with gradient checkpointing, the method enables training a 20-qubit, 1000-layer model (60k parameters) on 1000 samples in ~20 minutes per epoch, with implications for scaling to datasets like MNIST.

Significance. If the performance numbers hold under broader testing, the work could meaningfully expand the feasible scale of classical QML simulations, supporting verification of variational algorithms and studies of phenomena like barren plateaus on larger instances. The fusion-plus-checkpointing combination directly targets the memory-compute tradeoff in gradient evaluation.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experimental Results): the ~20x and >30x throughput claims are presented without specifying the baseline simulator implementation, exact gate counts, hardware specifications beyond 'mid-range consumer GPU', or statistical error bars on timings, preventing independent verification of the central performance result.
  2. [§3 and §5] §3 (Proposed Method) and §5 (Generalization discussion): the fusion opportunities are shown to exploit the regular layered structure of the Hardware-Efficient Ansatz; for ansatze with irregular gate sequences (e.g., QAOA or UCCSD), the number of fusible blocks may be smaller, so the reported speedups cannot be assumed to transfer without additional overhead analysis.
minor comments (2)
  1. [Tables in §4] Ensure all timing tables include the precise number of shots, batch size, and compiler flags used for the baseline to allow reproducibility.
  2. [Introduction] Add a short paragraph contrasting the proposed fusion with existing tensor-network or state-vector optimizations in the literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the clarity and scope of our work. We address each major point below and indicate planned revisions.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Results): the ~20x and >30x throughput claims are presented without specifying the baseline simulator implementation, exact gate counts, hardware specifications beyond 'mid-range consumer GPU', or statistical error bars on timings, preventing independent verification of the central performance result.

    Authors: We agree that these details are essential for verification. In the revised manuscript we will explicitly identify the baseline as our own unfused simulator implementation, report the precise gate counts for each tested circuit (e.g., 12-qubit HEA), provide the full GPU model name together with its memory bandwidth, and include standard deviations from repeated timing measurements. revision: yes

  2. Referee: [§3 and §5] §3 (Proposed Method) and §5 (Generalization discussion): the fusion opportunities are shown to exploit the regular layered structure of the Hardware-Efficient Ansatz; for ansatze with irregular gate sequences (e.g., QAOA or UCCSD), the number of fusible blocks may be smaller, so the reported speedups cannot be assumed to transfer without additional overhead analysis.

    Authors: The referee is correct that the largest gains occur with the regular layered structure of the Hardware-Efficient Ansatz. Our fusion procedure itself is agnostic to circuit regularity and can be applied to arbitrary gate sequences. We will expand the generalization discussion in §5 with a short overhead analysis for irregular ansatze (QAOA and UCCSD), showing that moderate speedups remain available even when fewer fusion opportunities exist. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance gains from implemented gate fusion

full rationale

The paper describes an algorithmic technique (forward/backward gate fusion plus checkpointing) and reports measured speedups and memory reductions on Hardware-Efficient Ansatz circuits. No mathematical derivation chain reduces a claimed result to its own inputs through construction, self-definition, or load-bearing self-citation. Throughput figures are direct experimental outcomes of reduced global memory traffic, not predictions fitted to the same data or renamed known results. The work is grounded in external benchmarks (GPU timings) and contains no uniqueness theorems or ansatze smuggled in via prior self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the contribution is described as an algorithmic optimization relying on standard quantum circuit simulation assumptions.

pith-pipeline@v0.9.0 · 5587 in / 1177 out tokens · 55632 ms · 2026-05-15T17:09:45.306920+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 4 internal anchors

  1. [1]

    Classification with Quantum Neural Networks on Near Term Processors

    E. Farhi, H. Neven, Classification with quantum neural networks on near term processors, arXiv preprint arXiv:1802.06002 (2018)

  2. [2]

    Quantum circuit learning

    K. Mitarai, M. Negoro, M. Kitagawa, K. Fujii, Quantum circuit learning, Physical Review A 98 (2018) 032309

  3. [3]

    Variational quantum algorithms

    M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio, et al., Variational quantum algorithms, Nature Reviews Physics 3 (2021) 625–644

  4. [4]

    A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, J. L. O'Brien, A variational eigenvalue solver on a photonic quantum processor, Nature Communications 5 (2014) 4213

  5. [5]

    A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, J. M. Gambetta, Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets, Nature 549 (2017) 242–246

  6. [6]

    A Quantum Approximate Optimization Algorithm

    E. Farhi, J. Goldstone, S. Gutmann, A quantum approximate optimization algorithm, arXiv preprint arXiv:1411.4028 (2014)

  7. [7]

    J. R. McClean, S. Boixo, V. N. Smelyanskiy, R. Babbush, H. Neven, Barren plateaus in quantum neural network training landscapes, Nature communications 9 (2018) 4812

  8. [8]

    S. Wang, E. Fontana, M. Cerezo, K. Sharma, A. Sone, L. Cincio, P. J. Coles, Noise-induced barren plateaus in variational quantum algorithms, Nature Communications 12 (2021) 6961

  9. [9]

    Cost function dependent barren plateaus in shallow parametrized quantum circuits

    M. Cerezo, A. Sone, T. Volkoff, L. Cincio, P. J. Coles, Cost function dependent barren plateaus in shallow parametrized quantum circuits, Nature communications 12 (2021) 1791

  10. [10]

    X. You, X. Wu, Exponentially many local minima in quantum neural networks, in: International Conference on Machine Learning, PMLR, 2021, pp. 12144–12155

  11. [11]

    Training variational quantum algorithms is NP-hard

    L. Bittel, M. Kliesch, Training variational quantum algorithms is NP-hard, Physical Review Letters 127 (2021) 120502

  12. [12]

    Leveraging secondary storage to simulate deep 54-qubit Sycamore circuits

    E. Pednault, J. A. Gunnels, G. Nannicini, L. Horesh, R. Wisnieff, Leveraging secondary storage to simulate deep 54-qubit sycamore circuits, arXiv preprint arXiv:1910.09534 (2019)

  13. [13]

    J. Tindall, M. Fishman, E. M. Stoudenmire, D. Sels, Efficient tensor network simulation of IBM's Eagle kicked Ising experiment, PRX Quantum 5 (2024) 010308

  14. [14]

    F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, R. Biswas, S. Boixo, F. G. Brandao, D. A. Buell, et al., Quantum supremacy using a programmable superconducting processor, Nature 574 (2019) 505–510

  15. [15]

    Google Quantum AI and Collaborators, Observation of constructive interference at the edge of quantum ergodicity, Nature 646 (2025) 825–830

  16. [16]

    Y. Kim, A. Eddins, S. Anand, K. X. Wei, E. Van Den Berg, S. Rosenblatt, H. Nayfeh, Y. Wu, M. Zaletel, K. Temme, et al., Evidence for the utility of quantum computing before fault tolerance, Nature 618 (2023) 500–505

  17. [17]

    Qulacs: a fast and versatile quantum circuit simulator for research purpose

    Y. Suzuki, Y. Kawase, Y. Masumura, Y. Hiraga, M. Nakadai, J. Chen, K. M. Nakanishi, K. Mitarai, R. Imai, S. Tamiya, et al., Qulacs: a fast and versatile quantum circuit simulator for research purpose, Quantum 5 (2021) 559

  18. [18]

    Quantum computing with Qiskit

    A. Javadi-Abhari, M. Treinish, K. Krsulich, C. J. Wood, J. Lishman, J. Gacon, S. Martiel, P. D. Nation, L. S. Bishop, A. W. Cross, B. R. Johnson, J. M. Gambetta, Quantum computing with Qiskit, 2024. doi:10.48550/arXiv.2405.08810, arXiv:2405.08810

  19. [19]

    Massively parallel quantum computer simulator, eleven years later

    H. De Raedt, F. Jin, D. Willsch, M. Willsch, N. Yoshioka, N. Ito, S. Yuan, K. Michielsen, Massively parallel quantum computer simulator, eleven years later, Computer Physics Communications 237 (2019) 47–61

  20. [20]

    C. Kim, E. Sohn, S. Kim, A. Sim, K. Wu, H. Tang, Y. Son, S. Kim, Scaleqsim: Highly scalable quantum circuit simulation framework for exascale hpc systems, Proceedings of the ACM on Measurement and Analysis of Computing Systems 9 (2025) 1–28

  21. [21]

    5 petabyte simulation of a 45-qubit quantum circuit

    T. Häner, D. S. Steiger, 5 petabyte simulation of a 45-qubit quantum circuit, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–10

  22. [22]

    qHiPSTER: The Quantum High Performance Software Testing Environment

    M. Smelyanskiy, N. P. Sawaya, A. Aspuru-Guzik, qhipster: The quantum high performance software testing environment, arXiv preprint arXiv:1601.07195 (2016)

  23. [23]

    D. Park, H. Kim, J. Kim, T. Kim, J. Lee, Snuqs: scaling quantum circuit simulation using storage devices, in: Proceedings of the 36th ACM International Conference on Supercomputing, 2022, pp. 1–13

  24. [24]

    Efficient calculation of gradients in classical simulations of variational quantum algorithms

    T. Jones, J. Gacon, Efficient calculation of gradients in classical simulations of variational quantum algorithms, arXiv preprint arXiv:2009.02823 (2020)

  25. [25]

    Yao.jl: Extensible, efficient framework for quantum algorithm design

    X.-Z. Luo, J.-G. Liu, P. Zhang, L. Wang, Yao.jl: Extensible, efficient framework for quantum algorithm design, Quantum 4 (2020) 341

  26. [26]

    Triton: an intermediate language and compiler for tiled neural network computations

    P. Tillet, H. T. Kung, D. Cox, Triton: an intermediate language and compiler for tiled neural network computations, in: Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2019, Association for Computing Machinery, New York, NY, USA, 2019, pp. 10–19. URL: https://doi.org/10.1145/3315508.3329973

  27. [27]

    PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation

    J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, et al., PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation

  28. [28]

    TorchQuantum

    TorchQuantum, 2024. URL: https://github.com/mit-han-lab/torchquantum