arxiv: 2605.08792 · v2 · submitted 2026-05-09 · 💻 cs.PF · quant-ph

A Controlled Study of Memory Hierarchy Transitions in Quantum Circuit Simulation on Apple M4 Pro Unified Memory Architecture

Gyan Pratipat

Pith reviewed 2026-05-13 07:39 UTC · model grok-4.3

classification 💻 cs.PF quant-ph

keywords: quantum circuit simulation · state-vector simulation · unified memory architecture · memory bandwidth · access patterns · GPU speedup · M4 Pro · Roofline analysis

The pith

Peak streaming bandwidth fails to predict GPU speedups for quantum circuit simulation on unified memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

State-vector quantum circuit simulation is memory-bandwidth bound, yet on the Apple M4 Pro's unified memory the measured GPU-over-CPU speedups exceed the 1.85x ratio predicted by STREAM benchmarks. Tensordot implementations reach 3.1-4.1x, flat-index 3.5-5.9x, and direct-index 6-10x, with the excess growing as memory access patterns become less regular. A reproducible 4.46x timing jump appears at the 28-to-29 qubit transition for tensordot backends, while direct-index backends keep roughly 2x per-qubit scaling. Roofline analysis confirms every implementation sits far below the compute ridge, so differences trace to how each algorithm traverses the shared DRAM.

Core claim

On the M4 Pro, where CPU and GPU share identical LPDDR5X DRAM and STREAM measures 119.9 GB/s on the CPU versus 221.9 GB/s on the GPU, three classes of state-vector simulators produce speedups well above the resulting 1.85x bandwidth ratio: tensordot backends at 3.1-4.1x, flat-index at 3.5-5.9x, and direct-index at 6-10x. The excess grows with access irregularity. A 4.46x timing discontinuity occurs at the 28-to-29 qubit transition under thermally isolated conditions; tensordot backends show the full jump while direct-index backends retain roughly 2x per-qubit scaling.
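The arithmetic behind the core claim can be checked directly. The sketch below uses only the bandwidth figures and speedup ranges quoted in the abstract; the function and variable names are illustrative and do not come from the paper.

```python
# STREAM bandwidths quoted in the abstract (GB/s).
CPU_BW_GBS = 119.9
GPU_BW_GBS = 221.9

# GPU speedup ranges quoted in the abstract, per algorithm class.
SPEEDUP_RANGES = {
    "tensordot": (3.1, 4.1),
    "flat-index": (3.5, 5.9),
    "direct-index": (6.0, 10.0),
}


def stream_predicted_speedup(cpu_bw, gpu_bw):
    """Speedup expected if runtime scaled only with streaming bandwidth."""
    return gpu_bw / cpu_bw


def excess_over_stream(speedup_range, predicted):
    """Divide the observed (low, high) speedups by the STREAM prediction."""
    low, high = speedup_range
    return low / predicted, high / predicted


predicted = stream_predicted_speedup(CPU_BW_GBS, GPU_BW_GBS)
excess = {
    name: excess_over_stream(rng, predicted)
    for name, rng in SPEEDUP_RANGES.items()
}

print(f"STREAM-predicted speedup: {predicted:.2f}x")
for name, (low, high) in excess.items():
    print(f"{name}: {low:.2f}x to {high:.2f}x above the STREAM prediction")
```

The excess factors increase from tensordot to flat-index to direct-index, which is the ordering the summary describes as growing with access irregularity.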

What carries the argument

Comparison of tensordot, flat-index, and direct-index algorithm classes for state-vector simulation, each generating distinct non-contiguous DRAM access patterns on unified memory.
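To make two of the algorithm classes concrete, here is a minimal NumPy sketch of a single-qubit gate applied once through a tensor contraction and once through explicit index pairs. These are generic textbook formulations, not the paper's backends, whose code is not shown on this page. The qubit-ordering convention (qubit 0 is the most significant bit of the flat index) is an assumption, and the flat-index class is omitted because the text does not define it.

```python
import numpy as np


def apply_gate_tensordot(state, gate, target, n_qubits):
    """Contract a 2x2 gate with one axis of the rank-n state tensor."""
    psi = state.reshape([2] * n_qubits)
    # tensordot places the gate's output axis first; move it back to `target`.
    psi = np.tensordot(gate, psi, axes=([1], [target]))
    psi = np.moveaxis(psi, 0, target)
    return psi.reshape(-1)


def apply_gate_direct_index(state, gate, target, n_qubits):
    """Update amplitude pairs that differ only in the target qubit's bit."""
    out = np.empty_like(state)
    # Assumed convention: qubit 0 is the most significant bit of the index.
    stride = 1 << (n_qubits - 1 - target)
    for base in range(0, state.size, 2 * stride):
        lo = state[base:base + stride]               # target bit = 0
        hi = state[base + stride:base + 2 * stride]  # target bit = 1
        out[base:base + stride] = gate[0, 0] * lo + gate[0, 1] * hi
        out[base + stride:base + 2 * stride] = gate[1, 0] * lo + gate[1, 1] * hi
    return out


# Usage: both formulations should agree on a random 5-qubit state.
rng = np.random.default_rng(0)
n = 5
state = rng.normal(size=2**n) + 1j * rng.normal(size=2**n)
hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)
agree = all(
    np.allclose(
        apply_gate_tensordot(state, hadamard, t, n),
        apply_gate_direct_index(state, hadamard, t, n),
    )
    for t in range(n)
)
print("formulations agree:", agree)
```

In the direct-index form the distance between paired amplitudes is `stride`, so the memory access pattern changes with the target qubit even though the arithmetic per pair is identical.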

If this is right

STREAM bandwidth ratios cannot be used to forecast performance of quantum simulators whose memory accesses are non-contiguous.
Direct-index implementations maintain steadier scaling across the 28-to-29 qubit transition than tensordot implementations.
Arithmetic intensity at or below 0.38 FLOP/byte places all gate kernels firmly in the memory-bound regime on current hardware.
The performance gap between algorithm classes widens as qubit count and access irregularity increase.
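The memory-bound statement above can be read through the standard Roofline bound. The block below is a worked illustration using the intensity and bandwidth figures quoted in the abstract. The peak compute rate $P_{\text{peak}}$ is not given on this page and is left symbolic; the symbols $I$, $B$, and $I^{*}$ are notation introduced here.

```latex
% Roofline bound: attainable throughput for arithmetic intensity I (FLOP/byte)
% and sustained bandwidth B (GB/s).
P_{\text{attainable}} = \min\left(P_{\text{peak}},\; I \cdot B\right),
\qquad
I^{*} = \frac{P_{\text{peak}}}{B} \quad \text{(ridge point)}

% With I \le 0.38 FLOP/byte and the quoted STREAM bandwidths:
I \cdot B_{\text{GPU}} \le 0.38 \times 221.9 \approx 84.3\ \text{GFLOP/s},
\qquad
I \cdot B_{\text{CPU}} \le 0.38 \times 119.9 \approx 45.6\ \text{GFLOP/s}
```

A kernel is memory-bound whenever $I < I^{*}$, which holds here for any peak compute rate above roughly 84.3 GFLOP/s on the GPU side or 45.6 GFLOP/s on the CPU side.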

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Performance models for quantum simulation on unified-memory hardware must incorporate access-pattern irregularity as a first-class variable rather than relying on peak bandwidth alone.
Algorithm selection becomes more important than raw bandwidth when scaling state-vector simulations past 28 qubits on UMA platforms.
The same methodology could be applied to other UMA chips to test whether the observed excess speedups are specific to the M4 Pro memory hierarchy.

Load-bearing premise

Sharing identical physical LPDDR5X DRAM between CPU and GPU eliminates memory-technology and interconnect confounds so that observed timing differences can be attributed solely to access patterns and algorithm class.

What would settle it

Measure whether direct-index GPU speedup on the same circuits stays above 6x relative to CPU when the identical simulation code is run on a non-UMA platform with matched DRAM bandwidth but separate CPU and GPU memory pools.

Figures

Figures reproduced from arXiv: 2605.08792 by Gyan Pratipat.

**Figure 2.** QFT circuit wall-clock time vs. qubit count, thermally isolated (…

**Figure 3.** GHZ GPU speedup (CPU time ÷ GPU time) vs. STREAM-predicted 1.85×, thermally isolated (N = 5). Direct-index sustains ∼10× across all qubit counts; tensordot and flat-index exceed the STREAM prediction but fall short of direct-index.

**Figure 4.** QFT GPU speedup vs. STREAM-predicted 1.85×…

**Figure 5.** DRAM cliff magnitude (28→29 qubit step ratio) for QFT and GHZ circuits across four backends. Purple = QFT (N = 3); orange = GHZ (N = 5). Tensordot backends (C, F) cliff at 3.8–4.5× in both circuits; direct-index backends (J, K) remain near 2.1× in both, confirming the cliff is determined by algorithm access pattern, not circuit structure. … backend C and 2.12–2.18× for backend J appear at the same qubit boun…
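The "step ratio" plotted in Figure 5 is the time at n qubits divided by the time at n − 1 qubits. Because the state vector doubles with each added qubit, a ratio near 2× is the baseline and a much larger ratio marks a cliff. The sketch below computes step ratios and flags cliffs; the timings are synthetic values built to reproduce a 4.46× step, and the 3.0 threshold is an arbitrary illustrative choice, not a value from the paper.

```python
def step_ratios(times):
    """Map each qubit count n to t(n) / t(n - 1) where both are measured."""
    return {
        n: times[n] / times[n - 1]
        for n in sorted(times)
        if n - 1 in times
    }


def find_cliffs(times, threshold=3.0):
    """Return qubit counts whose step ratio exceeds the threshold."""
    return [n for n, ratio in step_ratios(times).items() if ratio > threshold]


# Synthetic timings in seconds (NOT measured data): 2x per qubit,
# except for a 4.46x step between 28 and 29 qubits.
synthetic_times = {26: 1.0, 27: 2.0, 28: 4.0, 29: 17.84, 30: 35.68}

ratios = step_ratios(synthetic_times)
cliffs = find_cliffs(synthetic_times)
for n, ratio in ratios.items():
    print(f"{n - 1} -> {n} qubits: {ratio:.2f}x")
print("cliff at qubit counts:", cliffs)
```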

Original abstract

State-vector quantum circuit simulation is memory-bandwidth bound, yet the interaction between memory hierarchy, access pattern, and hardware parallelism remains incompletely characterized. We address this using the Apple M4 Pro Unified Memory Architecture (UMA), where CPU and GPU share identical physical LPDDR5X DRAM ($\sim$224 GB/s STREAM bandwidth for both), eliminating memory-technology and interconnect confounds. Using a thermally isolated, multi-trial methodology across 11 simulation backends on GHZ and QFT circuits from 3 to 30 qubits, we make three central contributions. First, a Roofline analysis confirms all gate implementations have arithmetic intensity $\leq$0.38 FLOP/byte, well below the ridge point for any plausible peak compute on modern hardware, establishing structural memory-boundedness. Second, we identify a reproducible 4.46$\times$ timing discontinuity at the 28$\rightarrow$29 qubit transition, confirmed under thermally isolated conditions and cross-validated across GHZ and QFT circuits; tensordot backends exhibit the full discontinuity while direct-index backends maintain $\sim$2$\times$ per-qubit scaling throughout. Third, despite STREAM predicting only 1.85$\times$ GPU speedup (MLX CPU 119.9 GB/s vs. MLX GPU 221.9 GB/s), all three algorithm classes exceed this prediction: tensordot 3.1--4.1$\times$, flat-index 3.5--5.9$\times$, and direct-index 6--10$\times$, demonstrating that peak streaming bandwidth does not predict simulation speedup for non-contiguous memory access patterns, with the gap widening as access irregularity increases. These findings provide a hardware-characterization framework for quantum simulation workloads on UMA.
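For context on the qubit range in the abstract, an n-qubit state vector holds 2^n complex amplitudes, so its size doubles with every added qubit. The sketch below tabulates the footprint around the 28→29 transition. The numeric precision used by the paper's backends is not stated on this page, so both common complex widths are shown as assumptions, and the actual working set may be larger if a backend allocates temporaries.

```python
def state_vector_bytes(n_qubits, bytes_per_amplitude=8):
    """Size of a dense state vector: 2**n amplitudes of the given width."""
    return (1 << n_qubits) * bytes_per_amplitude


# Assumed precisions: complex64 (8 bytes) and complex128 (16 bytes).
PRECISIONS = {"complex64": 8, "complex128": 16}

footprints_gib = {
    (n, label): state_vector_bytes(n, width) / 2**30
    for n in (28, 29, 30)
    for label, width in PRECISIONS.items()
}

for (n, label), gib in footprints_gib.items():
    print(f"{n} qubits, {label}: {gib:.0f} GiB")
```

The table only bounds the working set from below; it does not by itself explain the discontinuity, which the paper attributes to how each algorithm class traverses memory.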

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The paper's main contribution is a measured 4.46x timing jump at the 28-to-29 qubit transition on M4 Pro plus GPU speedups that exceed the STREAM bandwidth ratio, especially for irregular access patterns.

The paper reports a reproducible 4.46x timing discontinuity when moving from 28 to 29 qubits on the Apple M4 Pro, along with GPU speedups across three algorithm classes that run well above the 1.85x ratio predicted by STREAM bandwidth alone. The direct-index backends reach 6-10x while tensordot stays in the 3.1-4.1x range, and the gap grows with access irregularity. This is measured on GHZ and QFT circuits from 3 to 30 qubits using 11 backends under thermally controlled, multi-trial conditions.

The UMA design is a clean choice here because CPU and GPU share the same LPDDR5X DRAM, which removes memory-technology differences as an explanation. The roofline analysis showing arithmetic intensity at or below 0.38 FLOP/byte supports the claim that these workloads are structurally memory-bound, so the observed excess speedups can reasonably be tied to how the different access patterns interact with the memory hierarchy and caches. The fact that direct-index methods avoid the full discontinuity while others do not lines up with cache-size effects at that qubit count.

The work is narrow by design: only one chip family and two circuit families. That keeps the claims focused but also limits how far the numbers travel. The abstract describes cross-validation, yet the absence of visible error bars, raw timing distributions, or full backend code in the provided material makes it harder to judge the precision of the 4.46x figure or rule out post-hoc selection. A reader would want those details before treating the discontinuity as a settled hardware characteristic.

This is useful for anyone tuning state-vector simulators on unified-memory hardware or studying bandwidth-bound irregular workloads. It supplies concrete numbers that can guide backend choices on Apple silicon.

The empirical observations are specific enough and the controlled setup is sound enough that the paper deserves a serious referee, even if revisions will be needed to strengthen the data presentation and expand the circuit set.

Referee Report

2 major / 2 minor

Summary. The paper claims that state-vector quantum circuit simulation is structurally memory-bandwidth bound on the Apple M4 Pro UMA (shared LPDDR5X DRAM), as shown by roofline analysis with arithmetic intensity ≤0.38 FLOP/byte. Using a thermally isolated multi-trial methodology on GHZ and QFT circuits (3–30 qubits) across 11 backends, it reports a reproducible 4.46× timing discontinuity at the 28→29 qubit transition (full in tensordot backends, ~2× scaling preserved in direct-index), and GPU-over-CPU speedups (tensordot 3.1–4.1×, flat-index 3.5–5.9×, direct-index 6–10×) that substantially exceed the 1.85× ratio predicted by STREAM bandwidth (CPU 119.9 GB/s vs. GPU 221.9 GB/s), with the excess widening as memory access irregularity increases.

Significance. If the results hold, the work is significant for demonstrating that peak streaming bandwidth fails to predict simulation performance under non-contiguous access patterns on UMA hardware, with the gap increasing for more irregular algorithms. The controlled isolation of access-pattern effects via identical physical DRAM, the roofline confirmation of memory-boundedness, and the direct observational reporting of the qubit-threshold discontinuity provide a useful hardware-characterization framework. Strengths include the parameter-free empirical approach, cross-validation across circuit types, and explicit attribution of excess speedup to access irregularity rather than memory-technology confounds.

major comments (2)

[Methodology and results] Methodology and results sections: the multi-trial, thermally isolated methodology is invoked to support the 4.46× discontinuity and all speedup ranges, yet no error bars, standard deviations, trial counts, or statistical significance tests are reported for the timing measurements or the 28→29 qubit transition; this weakens evaluation of reproducibility for the central discontinuity claim.
[Performance comparison] Performance comparison (speedup factors): the reported ranges (tensordot 3.1–4.1×, flat-index 3.5–5.9×, direct-index 6–10×) exceed the STREAM 1.85× prediction, but the text does not specify whether these are averaged across the full 3–30 qubit range, exclude the discontinuity, or are computed only on the post-29-qubit regime; clarification is needed because the widening gap with irregularity is load-bearing for the claim that STREAM bandwidth does not predict speedup.
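The statistical reporting requested in the first major comment could take a form like the sketch below: a per-configuration mean, sample standard deviation, and trial count, plus a first-order uncertainty on a timing ratio such as the 28→29 step. The trial values are synthetic and chosen only so the ratio lands near 4.46; the propagation formula assumes independent trials and small relative errors, which is an assumption of this sketch rather than anything stated in the paper.

```python
import math
import statistics


def summarize(trials):
    """Return (mean, sample standard deviation, trial count)."""
    return statistics.mean(trials), statistics.stdev(trials), len(trials)


def ratio_with_uncertainty(numerator_trials, denominator_trials):
    """Ratio of mean timings with a first-order propagated standard error."""
    m_num, s_num, n_num = summarize(numerator_trials)
    m_den, s_den, n_den = summarize(denominator_trials)
    ratio = m_num / m_den
    # Relative standard errors of the two means, combined in quadrature.
    rel_num = s_num / math.sqrt(n_num) / m_num
    rel_den = s_den / math.sqrt(n_den) / m_den
    return ratio, ratio * math.sqrt(rel_num**2 + rel_den**2)


# Synthetic wall-clock trials in seconds (NOT measured data).
t28_trials = [4.00, 4.02, 3.98]
t29_trials = [17.80, 17.90, 17.82]

ratio, ratio_err = ratio_with_uncertainty(t29_trials, t28_trials)
print(f"28 -> 29 step ratio: {ratio:.2f} +/- {ratio_err:.2f}")
```

With more trials, a bootstrap over the raw timings would avoid the small-error assumption, at the cost of a few more lines.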

minor comments (2)

[Abstract] The abstract states that 11 simulation backends were used but does not enumerate or categorize them; a short table or appendix listing the backends (tensordot, flat-index, direct-index and variants) would improve clarity.
[Roofline analysis] The arithmetic intensity bound ≤0.38 FLOP/byte is central to the roofline claim; a brief equation or derivation showing how this value is obtained for the gate implementations would aid readers without requiring external references.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and recommendation of minor revision. We address each major comment below.

Referee: [Methodology and results] Methodology and results sections: the multi-trial, thermally isolated methodology is invoked to support the 4.46× discontinuity and all speedup ranges, yet no error bars, standard deviations, trial counts, or statistical significance tests are reported for the timing measurements or the 28→29 qubit transition; this weakens evaluation of reproducibility for the central discontinuity claim.

Authors: We agree that the manuscript does not report error bars, standard deviations, trial counts, or statistical significance tests, which limits quantitative assessment of the discontinuity's reproducibility. The multi-trial, thermally isolated approach and cross-validation across GHZ and QFT circuits were intended to demonstrate consistency, but we acknowledge this falls short of full statistical reporting. In the revised manuscript we will add error bars to all timing figures, specify the number of trials, include standard deviations, and note the statistical significance of the 4.46× discontinuity.
revision: yes
Referee: [Performance comparison] Performance comparison (speedup factors): the reported ranges (tensordot 3.1–4.1×, flat-index 3.5–5.9×, direct-index 6–10×) exceed the STREAM 1.85× prediction, but the text does not specify whether these are averaged across the full 3–30 qubit range, exclude the discontinuity, or are computed only on the post-29-qubit regime; clarification is needed because the widening gap with irregularity is load-bearing for the claim that STREAM bandwidth does not predict speedup.

Authors: We thank the referee for identifying this ambiguity in the speedup reporting. The ranges reflect observed GPU-to-CPU ratios across the tested qubit counts, with the discontinuity treated separately and the excess speedup (widening with access irregularity) most evident in the memory-bound regime above 29 qubits. We will revise the performance comparison section to explicitly define the aggregation method, the qubit ranges used, and the relationship to access-pattern irregularity.
revision: yes
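One way to make the aggregation explicit, as the response promises, is to compute the speedup per qubit count and then report the minimum and maximum over a stated qubit window. The sketch below shows how the reported range depends on that window. The timings are synthetic and the helper names are illustrative; nothing here reflects how the paper actually aggregated its results.

```python
def speedup_by_qubit(cpu_times, gpu_times):
    """CPU time divided by GPU time at each qubit count present in both."""
    return {
        n: cpu_times[n] / gpu_times[n]
        for n in sorted(cpu_times)
        if n in gpu_times
    }


def speedup_range(cpu_times, gpu_times, qubit_min, qubit_max):
    """(min, max) speedup over the inclusive qubit window."""
    values = [
        s
        for n, s in speedup_by_qubit(cpu_times, gpu_times).items()
        if qubit_min <= n <= qubit_max
    ]
    if not values:
        raise ValueError("no measurements inside the requested qubit window")
    return min(values), max(values)


# Synthetic timings in seconds (NOT measured data).
cpu_times = {27: 8.0, 28: 16.0, 29: 70.0, 30: 140.0}
gpu_times = {27: 2.0, 28: 4.0, 29: 10.0, 30: 20.0}

below_cliff = speedup_range(cpu_times, gpu_times, 27, 28)
above_cliff = speedup_range(cpu_times, gpu_times, 29, 30)
full_window = speedup_range(cpu_times, gpu_times, 27, 30)
print("27-28 qubits:", below_cliff)
print("29-30 qubits:", above_cliff)
print("27-30 qubits:", full_window)
```

Stating the window alongside each range removes the ambiguity the referee points to, since the same data yields different ranges depending on whether the discontinuity is inside the window.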

Circularity Check

0 steps flagged

No significant circularity detected

The paper's claims rest entirely on direct empirical timing measurements, standard STREAM bandwidth benchmarks, and a confirmatory roofline analysis showing AI ≤ 0.38 FLOP/byte. No equations, fitted parameters, or predictions are derived from the data itself; observed speedups (3.1–10× vs. 1.85× STREAM ratio) are simple ratios of measured values. The 4.46× discontinuity and differential scaling across backends are reported observations, not self-referential derivations. No self-citations, uniqueness theorems, or ansatzes appear in the provided text, and the UMA setup is presented as an experimental control rather than a fitted assumption. The derivation chain is observational and self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Empirical performance study with no mathematical derivations. Relies on standard assumptions about benchmark validity and experimental controls rather than new postulates.

axioms (2)

domain assumption: The STREAM benchmark accurately measures the peak memory bandwidth available to both CPU and GPU workloads on the M4 Pro. Used to derive the 1.85x predicted GPU speedup.
domain assumption: Thermally isolated multi-trial runs eliminate thermal throttling and environmental variability as confounds. Invoked to establish reproducibility of the 4.46x discontinuity.
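The first assumption concerns what a STREAM-style measurement captures. As a rough illustration, the sketch below times a contiguous array copy in NumPy and converts it to GB/s, counting one read and one write per element. This is a single-process CPU sketch and is not the paper's measurement, which the abstract attributes to MLX on CPU and GPU; the array size and repeat count are arbitrary choices made here.

```python
import time

import numpy as np


def copy_bandwidth_gbs(n_elements=10_000_000, repeats=5):
    """Best-of-N bandwidth of a contiguous float64 copy, in GB/s."""
    src = np.ones(n_elements, dtype=np.float64)
    dst = np.empty_like(src)
    best_seconds = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        np.copyto(dst, src)  # streaming read of src, streaming write of dst
        best_seconds = min(best_seconds, time.perf_counter() - start)
    bytes_moved = 2 * src.nbytes  # one read plus one write per element
    return bytes_moved / best_seconds / 1e9


bandwidth = copy_bandwidth_gbs()
print(f"contiguous copy bandwidth: {bandwidth:.1f} GB/s")
```

Because every access in this kernel is contiguous, the figure it produces is an upper reference for streaming traffic; the paper's argument is that strided or irregular traversals need not scale with it.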

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Roofline analysis confirms all gate implementations have arithmetic intensity ≤0.38 FLOP/byte... demonstrating that peak streaming bandwidth does not predict simulation speedup for non-contiguous memory access patterns"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "reproducible 4.46× timing discontinuity at the 28→29 qubit transition... direct-index backends maintain ∼2× per-qubit scaling"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
