pith. machine review for the scientific record.

arxiv: 2603.02804 · v2 · submitted 2026-03-03 · 🪐 quant-ph · cs.DC · cs.ET

Recognition: 1 theorem link

· Lean Theorem

Fast and memory-efficient classical simulation of quantum machine learning via forward and backward gate fusion

Authors on Pith · no claims yet

Pith reviewed 2026-05-15 17:09 UTC · model grok-4.3

classification 🪐 quant-ph · cs.DC · cs.ET
keywords quantum machine learning · classical simulation · gate fusion · gradient computation · GPU acceleration · variational quantum algorithms · hardware-efficient ansatz · memory optimization

The pith

Fusing consecutive gates in forward and backward passes cuts memory traffic and speeds classical simulation of quantum machine learning circuits by roughly 20 times.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes fusing multiple consecutive quantum gates along both the forward evaluation and backward gradient paths during classical simulation of quantum circuits. This reduces repeated global memory accesses on GPUs, which dominate runtime when computing gradients for deep variational circuits. The approach yields about 20 times higher throughput on hardware-efficient ansatz circuits of 12 qubits or more, with gains exceeding 30 times on bandwidth-limited consumer GPUs. When combined with gradient checkpointing, memory demand falls enough to train a 20-qubit model with 1,000 layers and 60,000 parameters on 1,000 samples in roughly 20 minutes per epoch. The result makes it feasible to run quantum machine learning experiments on datasets the size of MNIST or CIFAR-10 within hours rather than days on ordinary hardware.

Core claim

By fusing consecutive gates in the forward and backward paths, the simulation minimizes global memory accesses during circuit evaluation and gradient calculation. This produces approximately 20 times higher throughput for hardware-efficient ansatz circuits with 12 or more qubits, and more than 30 times on mid-range consumer GPUs. Together with checkpointing, it permits training a 20-qubit, 1,000-layer model containing 60,000 parameters on 1,000 samples in about 20 minutes per epoch, thereby enabling large-dataset experiments on classical machines.

What carries the argument

Forward-and-backward gate fusion, which merges sequences of consecutive gates before they are applied to state vectors or gradients so that intermediate results stay in fast on-chip memory instead of being written to and read from global memory.
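The mechanics can be sketched in a few lines of NumPy (the paper's implementation uses custom Triton GPU kernels; the gate choices and sizes below are illustrative, not the paper's benchmark configuration). Fusing m consecutive single-qubit gates on one qubit collapses them into a single 2×2 matrix, so the 2^n-element state vector is traversed once instead of m times:

```python
import numpy as np

def rx(theta):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]])

def rz(theta):
    p = np.exp(-1j * theta / 2)
    return np.array([[p, 0], [0, p.conj()]])

def apply_1q(state, gate, qubit, n):
    # One full pass over the 2**n-element state vector.
    psi = state.reshape([2] * n)
    psi = np.moveaxis(psi, qubit, 0)
    psi = np.tensordot(gate, psi, axes=([1], [0]))
    psi = np.moveaxis(psi, 0, qubit)
    return psi.reshape(-1)

n, qubit, m = 4, 1, 8
rng = np.random.default_rng(0)
state = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
state /= np.linalg.norm(state)
thetas = rng.uniform(0, 2 * np.pi, size=m)
gates = [rx(t) if i % 2 == 0 else rz(t) for i, t in enumerate(thetas)]

# Unfused: m separate passes over the state vector (m rounds of memory traffic).
seq = state.copy()
for g in gates:
    seq = apply_1q(seq, g, qubit, n)

# Fused: multiply the m 2x2 matrices first, then make a single pass.
fused = np.eye(2, dtype=complex)
for g in gates:
    fused = g @ fused
out = apply_1q(state, fused, qubit, n)

assert np.allclose(out, seq)
```

On a GPU, each avoided pass is a full read and write of the state vector through global memory, which is why the savings scale with the fusion width on bandwidth-limited hardware.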

If this is right

  • Throughput rises by a factor of about 20 for hardware-efficient ansatz circuits of 12 or more qubits.
  • Speedups exceed 30 times on GPUs whose performance is limited by memory bandwidth.
  • Memory footprint shrinks enough to fit and train 20-qubit circuits with 1,000 layers and 60,000 parameters on 1,000 samples in 20 minutes per epoch.
  • Training on datasets of tens of thousands of samples becomes practical within roughly 20 hours per epoch.
  • Verification of quantum machine learning algorithms and study of deep-circuit phenomena such as barren plateaus can be performed on realistic data volumes.
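The memory claim rests on the checkpointing tradeoff quoted alongside the paper's Figure 2: storing a checkpoint every b layers costs roughly ⌈l_var/α⌉·b recomputation buffers plus d/b checkpoints, minimized at b* = √(d/⌈l_var/α⌉). A minimal sketch of that estimate (the l_var and α values here are illustrative placeholders, not the paper's measured configuration):

```python
import math

def stored_vectors(d, l_var, alpha, b):
    # Paper's estimate (its eq. 7): ceil(l_var/alpha)*b recomputed states
    # per checkpoint block, plus d/b stored checkpoints across d layers.
    return math.ceil(l_var / alpha) * b + d / b

d, l_var, alpha = 1000, 3, 9        # illustrative: 1,000 layers, 3 variational
                                    # gates per layer, fusion width alpha = 9
k = math.ceil(l_var / alpha)        # = 1
b_opt = math.sqrt(d / k)            # optimal block size b* (its eq. 8)
m_min = 2 * math.sqrt(k * d)        # AM-GM lower bound on stored vectors

assert abs(stored_vectors(d, l_var, alpha, b_opt) - m_min) < 1e-9
# Away from b*, storage grows: b = 1 keeps d checkpoints, far above the bound.
assert stored_vectors(d, l_var, alpha, 1) > m_min
```

Under these placeholder values the 1,000-layer circuit needs on the order of 2√1000 ≈ 63 state vectors rather than one per gate, which is the mechanism behind the 20-qubit, 1,000-layer training claim.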

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same fusion strategy could be applied to other variational quantum algorithms that rely on repeated circuit evaluations, such as quantum approximate optimization or quantum chemistry simulations.
  • Combining gate fusion with tensor-network or low-rank approximations might allow scaling beyond 20 qubits while keeping memory usage manageable.
  • The reduced per-epoch time could make systematic hyperparameter searches over circuit depth or ansatz structure feasible on single-GPU workstations.
  • Empirical checks of learning-theory predictions for very deep quantum circuits become experimentally accessible on standard hardware.

Load-bearing premise

Gate fusion works for typical quantum machine learning circuits without introducing numerical inaccuracies or extra overhead that would cancel the memory-access savings.
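One cheap way to probe the numerical half of this premise is to compare a long fused product of gate matrices against a double-precision reference. The rough NumPy check below (illustrative, not the paper's benchmark) suggests that fusing even a thousand 2×2 rotations keeps the error near single-precision roundoff:

```python
import numpy as np

def rx(theta, dtype):
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -1j * s], [-1j * s, c]], dtype=dtype)

def rz(theta, dtype):
    p = np.exp(-1j * theta / 2)
    return np.array([[p, 0], [0, p.conj()]], dtype=dtype)

rng = np.random.default_rng(1)
thetas = rng.uniform(0, 2 * np.pi, size=1000)

def fuse_chain(dtype):
    # Multiply 1,000 alternating Rx/Rz matrices into a single fused gate.
    g = np.eye(2, dtype=dtype)
    for i, t in enumerate(thetas):
        g = (rx(t, dtype) if i % 2 == 0 else rz(t, dtype)) @ g
    return g

# Max entrywise deviation of the single-precision fused chain from a
# double-precision reference.
err = np.abs(fuse_chain(np.complex64) - fuse_chain(np.complex128)).max()
assert err < 1e-3
```

The overhead half of the premise (kernel-launch and fusion bookkeeping costs) cannot be checked this way and depends on the GPU implementation.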

What would settle it

Measuring wall-clock time and peak memory for the exact 20-qubit, 1,000-layer, 60,000-parameter training run on the same GPU hardware without any gate fusion and finding that it either exceeds available memory or takes longer than 20 minutes per epoch.

Figures

Figures reproduced from arXiv: 2603.02804 by Yoshiaki Kawase.

Figure 1
Figure 1. Conventional method and proposed method. In the PyTorch-native implementation of the adjoint method without gradient checkpointing, the input quantum states of all gates must be stored for the backward path, which determines the number of required state vectors. view at source ↗
Figure 2
Figure 2. Stored state vectors when combining the proposed method with gradient checkpointing. The estimated memory (bytes) is M_total(b) ≃ (⌈l_var/α⌉·b + d/b) × M_sv ≥ 2√(⌈l_var/α⌉·d) × M_sv, with equality at b* = √(d/⌈l_var/α⌉); for the proposed method (Triton fused), α corresponds to m. view at source ↗
Figure 3
Figure 3. A quantum circuit consisting of m consecutive Rx, Ry, and Rz gates, used to benchmark the effect of the proposed method. Circuits are applied to randomly initialized states |ψ₀⟩ rather than |0⟩⊗n to avoid artificial speedups caused by sparsity and to reflect the actual computational load and memory bandwidth utilization. view at source ↗
Figure 4
Figure 4. (a) Execution time per m-gate sequence and (b) total peak memory usage when applying m consecutive single-qubit gates on each qubit during the forward path (RTX 4090, RTX 5070, GH200). view at source ↗
Figure 5
Figure 5. (a) Execution time per fused gate and (b) total peak memory usage when applying m consecutive single-qubit gates on each qubit during the backward path (PyTorch native vs. Triton fused vs. Triton fused, memory-saving mode). view at source ↗
Figure 6
Figure 6. (a) Execution time per m-gate sequence, (b) total peak memory usage, and (c) execution-time ratio between Triton Fused and Triton Fused (mem save) when minimizing the sum of expectation values at the final quantum state after applying the fused single-qubit gates. view at source ↗
Figure 7
Figure 7. HEA and gate fusion method: the quantum circuit consists of single-qubit gates (Rx, Ry, Rz) and CZ gates. Dotted frames indicate the range of gate fusion; up to nine single-qubit gates, including gates acting on the same and adjacent qubits, and consecutive CZ gates are fused. view at source ↗
Figure 8
Figure 8. Throughputs for d = 1 of the HEA. view at source ↗
Figure 9
Figure 9. (a) Peak memory usage and (b) execution time when varying b for a 20-qubit, 1,000-layer (d = 1,000) HEA with a batch size of one. view at source ↗
Figure 10
Figure 10. Throughput improvement ratio of the proposed method relative to the PyTorch native implementation. view at source ↗
Figure 11
Figure 11. (a) Peak memory usage and (b) execution time when running a 100-layer HEA with 20 qubits and a batch size of 1, varying the checkpoint block size b. view at source ↗
read the original abstract

While real quantum devices have been increasingly used to conduct research focused on achieving quantum advantage or quantum utility in recent years, executing deep quantum circuits or performing quantum machine learning with large-scale data on current noisy intermediate-scale quantum devices remains challenging, making classical simulation essential for quantum machine learning research. However, such classical simulation often suffers from the cost of gradient calculations, requiring enormous memory or computational time. To address these problems, we propose a method to fuse multiple consecutive gates in each of the forward and backward paths to improve throughput by minimizing global memory accesses. As a result, we achieved approximately $20$ times throughput improvement for a Hardware-Efficient Ansatz with $12$ or more qubits, reaching over $30$ times improvement on a mid-range consumer GPU with limited memory bandwidth. By combining our proposed method with gradient checkpointing, we drastically reduced memory usage, making it possible to train a large-scale quantum machine learning model, a $20$-qubit, $1{,}000$-layer model with $60{,}000$ parameters, using $1{,}000$ samples in approximately $20$ minutes per epoch. This implies that we can train the model on large datasets, comprising tens of thousands of samples, like MNIST or CIFAR-10, within a realistic time frame (e.g., $20$ hours per epoch). Thus, our proposed method significantly accelerates such classical simulations, making a significant contribution to advancing research in quantum machine learning and variational quantum algorithms, such as verifying algorithms on large datasets or investigating learning theories of deep quantum circuits like barren plateaus.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes fusing consecutive gates in the forward and backward passes of classical quantum circuit simulation to minimize global memory accesses, reporting ~20x throughput gains for Hardware-Efficient Ansatz circuits (12+ qubits) and >30x on bandwidth-limited GPUs. Combined with gradient checkpointing, the method enables training a 20-qubit, 1000-layer model (60k parameters) on 1000 samples in ~20 minutes per epoch, with implications for scaling to datasets like MNIST.

Significance. If the performance numbers hold under broader testing, the work could meaningfully expand the feasible scale of classical QML simulations, supporting verification of variational algorithms and studies of phenomena like barren plateaus on larger instances. The fusion-plus-checkpointing combination directly targets the memory-compute tradeoff in gradient evaluation.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experimental Results): the ~20x and >30x throughput claims are presented without specifying the baseline simulator implementation, exact gate counts, hardware specifications beyond 'mid-range consumer GPU', or statistical error bars on timings, preventing independent verification of the central performance result.
  2. [§3 and §5] §3 (Proposed Method) and §5 (Generalization discussion): the fusion opportunities are shown to exploit the regular layered structure of the Hardware-Efficient Ansatz; for ansatze with irregular gate sequences (e.g., QAOA or UCCSD), the number of fusible blocks may be smaller, so the reported speedups cannot be assumed to transfer without additional overhead analysis.
minor comments (2)
  1. [Tables in §4] Ensure all timing tables include the precise number of shots, batch size, and compiler flags used for the baseline to allow reproducibility.
  2. [Introduction] Add a short paragraph contrasting the proposed fusion with existing tensor-network or state-vector optimizations in the literature.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help improve the clarity and scope of our work. We address each major point below and indicate planned revisions.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experimental Results): the ~20x and >30x throughput claims are presented without specifying the baseline simulator implementation, exact gate counts, hardware specifications beyond 'mid-range consumer GPU', or statistical error bars on timings, preventing independent verification of the central performance result.

    Authors: We agree that these details are essential for verification. In the revised manuscript we will explicitly identify the baseline as our own unfused simulator implementation, report the precise gate counts for each tested circuit (e.g., 12-qubit HEA), provide the full GPU model name together with its memory bandwidth, and include standard deviations from repeated timing measurements. revision: yes

  2. Referee: [§3 and §5] §3 (Proposed Method) and §5 (Generalization discussion): the fusion opportunities are shown to exploit the regular layered structure of the Hardware-Efficient Ansatz; for ansatze with irregular gate sequences (e.g., QAOA or UCCSD), the number of fusible blocks may be smaller, so the reported speedups cannot be assumed to transfer without additional overhead analysis.

    Authors: The referee is correct that the largest gains occur with the regular layered structure of the Hardware-Efficient Ansatz. Our fusion procedure itself is agnostic to circuit regularity and can be applied to arbitrary gate sequences. We will expand the generalization discussion in §5 with a short overhead analysis for irregular ansatze (QAOA and UCCSD), showing that moderate speedups remain available even when fewer fusion opportunities exist. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical performance gains from implemented gate fusion

full rationale

The paper describes an algorithmic technique (forward/backward gate fusion plus checkpointing) and reports measured speedups and memory reductions on Hardware-Efficient Ansatz circuits. No mathematical derivation chain reduces a claimed result to its own inputs through construction, self-definition, or load-bearing self-citation. Throughput figures are direct experimental outcomes of reduced global memory traffic, not predictions fitted to the same data or renamed known results. The work is grounded in external benchmarks (GPU timings) and contains no uniqueness theorems or ansatze smuggled in via prior self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the contribution is described as an algorithmic optimization relying on standard quantum circuit simulation assumptions.

pith-pipeline@v0.9.0 · 5587 in / 1177 out tokens · 55632 ms · 2026-05-15T17:09:45.306920+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · 4 internal anchors

  1. [1]

    Classification with Quantum Neural Networks on Near Term Processors

    E. Farhi, H. Neven, Classification with quantum neural networks on near term processors, arXiv preprint arXiv:1802.06002 (2018)

  2. [2]

    Quantum circuit learning

    K. Mitarai, M. Negoro, M. Kitagawa, K. Fujii, Quantum circuit learning, Physical Review A 98 (2018) 032309

  3. [3]

    Variational quantum algorithms

    M. Cerezo, A. Arrasmith, R. Babbush, S. C. Benjamin, S. Endo, K. Fujii, J. R. McClean, K. Mitarai, X. Yuan, L. Cincio, et al., Variational quantum algorithms, Nature Reviews Physics 3 (2021) 625–644

  4. [4]

    A. Peruzzo, J. McClean, P. Shadbolt, M.-H. Yung, X.-Q. Zhou, P. J. Love, A. Aspuru-Guzik, J. L. O'Brien, A variational eigenvalue solver on a photonic quantum processor, Nature Communications 5 (2014) 4213

  5. [5]

    A. Kandala, A. Mezzacapo, K. Temme, M. Takita, M. Brink, J. M. Chow, J. M. Gambetta, Hardware-efficient variational quantum eigensolver for small molecules and quantum magnets, Nature 549 (2017) 242–246

  6. [6]

    A Quantum Approximate Optimization Algorithm

    E. Farhi, J. Goldstone, S. Gutmann, A quantum approximate optimization algorithm, arXiv preprint arXiv:1411.4028 (2014)

  7. [7]

    J. R. McClean, S. Boixo, V. N. Smelyanskiy, R. Babbush, H. Neven, Barren plateaus in quantum neural network training landscapes, Nature communications 9 (2018) 4812

  8. [8]

    S. Wang, E. Fontana, M. Cerezo, K. Sharma, A. Sone, L. Cincio, P. J. Coles, Noise-induced barren plateaus in variational quantum algorithms, Nature Communications 12 (2021) 6961

  9. [9]

    Cost function dependent barren plateaus in shallow parametrized quantum circuits

    M. Cerezo, A. Sone, T. Volkoff, L. Cincio, P. J. Coles, Cost function dependent barren plateaus in shallow parametrized quantum circuits, Nature communications 12 (2021) 1791

  10. [10]

    X. You, X. Wu, Exponentially many local minima in quantum neural networks, in: International Conference on Machine Learning, PMLR, 2021, pp. 12144–12155

  11. [11]

    Training variational quantum algorithms is NP-hard

    L. Bittel, M. Kliesch, Training variational quantum algorithms is NP-hard, Physical Review Letters 127 (2021) 120502

  12. [12]

    Leveraging secondary storage to simulate deep 54-qubit Sycamore circuits

    E. Pednault, J. A. Gunnels, G. Nannicini, L. Horesh, R. Wisnieff, Leveraging secondary storage to simulate deep 54-qubit sycamore circuits, arXiv preprint arXiv:1910.09534 (2019)

  13. [13]

    J. Tindall, M. Fishman, E. M. Stoudenmire, D. Sels, Efficient tensor network simulation of IBM's Eagle kicked Ising experiment, PRX Quantum 5 (2024) 010308

  14. [14]

    F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, R. Biswas, S. Boixo, F. G. Brandao, D. A. Buell, et al., Quantum supremacy using a programmable superconducting processor, Nature 574 (2019) 505–510

  15. [15]

    Google Quantum AI and Collaborators, Observation of constructive interference at the edge of quantum ergodicity, Nature 646 (2025) 825–830

  16. [16]

    Y. Kim, A. Eddins, S. Anand, K. X. Wei, E. Van Den Berg, S. Rosenblatt, H. Nayfeh, Y. Wu, M. Zaletel, K. Temme, et al., Evidence for the utility of quantum computing before fault tolerance, Nature 618 (2023) 500–505

  17. [17]

    Qulacs: a fast and versatile quantum circuit simulator for research purpose

    Y. Suzuki, Y. Kawase, Y. Masumura, Y. Hiraga, M. Nakadai, J. Chen, K. M. Nakanishi, K. Mitarai, R. Imai, S. Tamiya, et al., Qulacs: a fast and versatile quantum circuit simulator for research purpose, Quantum 5 (2021) 559

  18. [18]

    Quantum computing with Qiskit

    A. Javadi-Abhari, M. Treinish, K. Krsulich, C. J. Wood, J. Lishman, J. Gacon, S. Martiel, P. D. Nation, L. S. Bishop, A. W. Cross, B. R. Johnson, J. M. Gambetta, Quantum computing with Qiskit, 2024. doi:10.48550/arXiv.2405.08810, arXiv:2405.08810

  19. [19]

    Massively parallel quantum computer simulator, eleven years later

    H. De Raedt, F. Jin, D. Willsch, M. Willsch, N. Yoshioka, N. Ito, S. Yuan, K. Michielsen, Massively parallel quantum computer simulator, eleven years later, Computer Physics Communications 237 (2019) 47–61

  20. [20]

    C. Kim, E. Sohn, S. Kim, A. Sim, K. Wu, H. Tang, Y. Son, S. Kim, Scaleqsim: Highly scalable quantum circuit simulation framework for exascale hpc systems, Proceedings of the ACM on Measurement and Analysis of Computing Systems 9 (2025) 1–28

  21. [21]

    5 petabyte simulation of a 45-qubit quantum circuit

    T. Häner, D. S. Steiger, 5 petabyte simulation of a 45-qubit quantum circuit, in: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, 2017, pp. 1–10

  22. [22]

    qHiPSTER: The Quantum High Performance Software Testing Environment

    M. Smelyanskiy, N. P. Sawaya, A. Aspuru-Guzik, qhipster: The quantum high performance software testing environment, arXiv preprint arXiv:1601.07195 (2016)

  23. [23]

    D. Park, H. Kim, J. Kim, T. Kim, J. Lee, Snuqs: scaling quantum circuit simulation using storage devices, in: Proceedings of the 36th ACM International Conference on Supercomputing, 2022, pp. 1–13

  24. [24]

    Efficient calculation of gradients in classical simulations of variational quantum algorithms

    T. Jones, J. Gacon, Efficient calculation of gradients in classical simulations of variational quantum algorithms, arXiv preprint arXiv:2009.02823 (2020)

  25. [25]

    Yao.jl: Extensible, efficient framework for quantum algorithm design

    X.-Z. Luo, J.-G. Liu, P. Zhang, L. Wang, Yao.jl: Extensible, efficient framework for quantum algorithm design, Quantum 4 (2020) 341

  26. [26]

    Triton: an intermediate language and compiler for tiled neural network computations

    P. Tillet, H. T. Kung, D. Cox, Triton: an intermediate language and compiler for tiled neural network computations, in: Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, MAPL 2019, Association for Computing Machinery, New York, NY, USA, 2019, pp. 10–19. URL: https://doi.org/10.1145/3315508.3329973

  27. [27]

    PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation

    J. Ansel, E. Yang, H. He, N. Gimelshein, A. Jain, M. Voznesensky, B. Bao, P. Bell, D. Berard, E. Burovski, G. Chauhan, A. Chourdia, W. Constable, A. Desmaison, Z. DeVito, E. Ellison, W. Feng, J. Gong, M. Gschwind, B. Hirsh, S. Huang, K. Kalambarkar, L. Kirsch, M. Lazos, M. Lezcano, Y. Liang, J. Liang, Y. Lu, C. Luk, B. Maher, Y. Pan, C. Puhrsch, M. Reso, et al., PyTorch 2: Faster machine learning through dynamic Python bytecode transformation and graph compilation

  28. [28]

    TorchQuantum

    TorchQuantum, 2024. URL: https://github.com/mit-han-lab/torchquantum