pith. sign in

arxiv: 2606.01852 · v1 · pith:NR2JSFJ3new · submitted 2026-06-01 · 💻 cs.DC · quant-ph

Parallelizing Large-Scale Tensor Network Contraction on Multiple GPUs

Pith reviewed 2026-06-28 12:51 UTC · model grok-4.3

classification 💻 cs.DC quant-ph
keywords tensor network contractionmulti-GPU parallelizationdistributed tensor contractionquantum circuit simulationslicingGEMM reorderingcommunication-aware scheduling
0
0 comments X

The pith

Distributing intermediate tensors across GPUs with explicit communication converts a fixed contraction path into a schedule that beats slicing by orders of magnitude.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that slicing, the standard way to parallelize tensor network contraction, incurs exponential redundant computation. Instead, the authors convert any fixed contraction path into a communication-efficient schedule by reordering modes for matrix multiplication and planning how to distribute tensor modes across devices. On eight H100 GPUs linked by NVLink this approach captures 87 to 101 percent of the available compute reduction, delivering 7 to 173 times the speedup of slicing alone. When the same workloads are scaled to 1024 GPUs over InfiniBand the advantage grows to between 42 and 67,869 times. The result matters for any domain that relies on exact contraction of large tensor networks, because it removes the exponential barrier that has limited prior parallel methods.

Core claim

We present a multi-GPU framework that distributes intermediate tensors across devices with explicit communication, converting a fixed contraction path into a communication-efficient schedule via GEMM-oriented mode reordering and communication-aware mode distribution planning. Within a single DGX H100 node (8 GPUs, NVLink), distribution delivers 7--173× extra speedup beyond embarrassingly parallel slicing, capturing nearly all of the available compute reduction (87--101%) because NVLink's high bandwidth keeps communication small relative to compute. Scaling the same four workloads to 1024 H100 GPUs over InfiniBand, the extra speedup beyond slicing ranges from 42× to 67,869×, demonstrating tha

What carries the argument

GEMM-oriented mode reordering combined with communication-aware mode distribution planning, which turns any fixed contraction path into a schedule that moves only the data required for each distributed matrix multiplication.

If this is right

  • On NVLink-connected GPUs the method captures 87-101% of the theoretical compute reduction available from avoiding slicing redundancy.
  • The same four workloads that are limited by slicing on 1024 GPUs become feasible when distribution is used instead.
  • Communication overhead remains small enough that the approach continues to scale when moving from a single node to a full InfiniBand cluster.
  • The framework applies directly to the contraction workloads that appear in quantum circuit simulation and combinatorial optimization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same mode-reordering and distribution logic could be applied to tensor networks that arise in classical machine-learning models whose contraction graphs are currently handled by slicing.
  • If the planning step can be made dynamic rather than static, the method might adapt to tensor networks whose optimal contraction paths change during execution.
  • The reported scaling behavior implies that further increases in GPU count will continue to favor communication-aware distribution over slicing as long as interconnect bandwidth grows with compute.
  • Because the schedule preserves the original contraction path, existing path-finding heuristics can be reused without modification.

Load-bearing premise

Any fixed contraction path can be turned into a communication-efficient schedule by reordering modes for matrix multiplication and planning their distribution while keeping communication volume small relative to the compute that is saved.

What would settle it

Run one of the four reported workloads on 1024 GPUs and measure whether total communication time plus any extra compute exceeds the reduction in redundant floating-point operations compared with slicing; if the net time is longer, the claimed speedups do not hold.

Figures

Figures reproduced from arXiv: 2606.01852 by Feng Pan, Hanfeng Gu, Paul Springer, Xipeng Li.

Figure 1
Figure 1. Figure 1: Theoretical complexity reduction from distributing intermediate tensors across GPUs for six workloads: (a) quantum circuit simulation (Zuchongzhi [PITH_FULL_IMAGE:figures/full_fig_p005_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: End-to-end workflow. The offline planner takes a fixed contraction [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 4
Figure 4. Figure 4: Slicing vs. distribution. (a) Slicing fixes indices to create independent [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 3
Figure 3. Figure 3: GEMM-oriented mode reordering on a two-step subtree. Dashed edges [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: DP redistribution-point selection for the Zuchongzhi n60m24 bench [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Projected full-contraction speedup vs. embarrassingly parallel slicing (dashed) on 1–1024 H100 GPUs, computed from measured per-slice runtime [PITH_FULL_IMAGE:figures/full_fig_p010_6.png] view at source ↗
read the original abstract

Exact tensor network contraction underpins quantum circuit simulation, quantum error correction, combinatorial optimization, and many-body dynamics. The dominant parallelization strategy, slicing, scales exponentially and incurs redundant computation. We present a multi-GPU framework that instead distributes intermediate tensors across devices with explicit communication, converting a fixed contraction path into a communication-efficient schedule via GEMM-oriented mode reordering and communication-aware mode distribution planning. Within a single DGX H100 node (8 GPUs, NVLink), distribution delivers $7$--$173\times$ extra speedup beyond embarrassingly parallel slicing, capturing nearly all of the available compute reduction (87--101%) because NVLink's high bandwidth keeps communication small relative to compute. Scaling the same four workloads to 1024 H100 GPUs over InfiniBand, the extra speedup beyond slicing ranges from $42\times$ to $67{,}869\times$, demonstrating that communication-aware distributed contraction far surpasses slicing-based scaling limits for frontier tensor networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The manuscript presents a multi-GPU framework for exact tensor network contraction that distributes intermediate tensors with explicit communication rather than relying on slicing. A fixed contraction path is converted into a communication-efficient schedule via GEMM-oriented mode reordering and communication-aware mode distribution planning. On four workloads within a single DGX H100 node (8 GPUs, NVLink), the approach yields 7--173× extra speedup beyond embarrassingly parallel slicing while capturing 87--101% of the available compute reduction; scaling the same workloads to 1024 H100 GPUs over InfiniBand produces extra speedups ranging from 42× to 67,869×.

Significance. If the empirical results hold under detailed verification, the work would be significant for distributed tensor computations in quantum simulation and related domains. It demonstrates that communication-aware distribution can largely eliminate the redundant compute of slicing while keeping communication overhead low relative to compute savings on both NVLink and InfiniBand, offering a practical path to larger-scale contractions than slicing-based methods allow.

major comments (1)
  1. [Abstract] Abstract: the quantitative claims (7--173× on-node, 87--101% capture, 42×--67,869× at 1024 GPUs) are stated without workload definitions, implementation details, error analysis, or verification steps. These elements are load-bearing for assessing support for the central empirical claim that the GEMM-oriented reordering plus distribution planning keeps communication cost small relative to avoided redundant compute.

Simulated Author's Rebuttal

1 responses · 0 unresolved

Thank you for the constructive feedback. We address the single major comment below and agree that enhancing the abstract will improve clarity without altering the manuscript's core contributions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the quantitative claims (7--173× on-node, 87--101% capture, 42×--67,869× at 1024 GPUs) are stated without workload definitions, implementation details, error analysis, or verification steps. These elements are load-bearing for assessing support for the central empirical claim that the GEMM-oriented reordering plus distribution planning keeps communication cost small relative to avoided redundant compute.

    Authors: We acknowledge the abstract's conciseness omits these details, which are present in the body. Workloads are defined in Section 4.1 (four quantum circuit simulation networks with explicit tensor dimensions and paths). Implementation (GEMM reordering and distribution planning) is in Section 3. Error analysis (multi-run timings) and verification (87-101% capture vs. theoretical reduction) appear in Section 5. To address the concern directly, we will revise the abstract to add brief workload descriptors (e.g., 'four 20-40 qubit quantum simulation workloads') and parenthetical references to the relevant sections. This change supports the empirical claims without misrepresenting results. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper's central claims consist of measured speedups (7-173× on-node, 42-67869× at 1024 GPUs) obtained by running four specific workloads on DGX H100 hardware under the proposed GEMM-oriented reordering and communication-aware distribution. These are direct empirical timings, not quantities derived from a fitted model, self-referential definition, or self-citation chain. The method description (mode reordering plus distribution planning) is presented as an algorithmic technique whose correctness and efficiency are verified by the reported measurements rather than presupposed by them. No load-bearing step reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; evaluation is limited by absence of full text.

pith-pipeline@v0.9.1-grok · 5696 in / 1145 out tokens · 29190 ms · 2026-06-28T12:51:27.850329+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Pushing the Classical Frontier of 1D Fermi-Hubbard Quench Dynamics Beyond Current Quantum Simulations

    quant-ph 2026-06 unverdicted novelty 5.0

    Symmetric TDVP on GPUs achieves converged 1D Fermi-Hubbard quench dynamics at chi~62000 up to t=7, certifying the high-entanglement regime and lowering the reported quantum advantage to ~36x.

Reference graph

Works this paper leans on

44 extracted references · 27 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    The density-matrix renormalization group in the age of matrix product states,

    U. Schollw ¨ock, “The density-matrix renormalization group in the age of matrix product states,”Annals of Physics, vol. 326, no. 1, pp. 96–192,

  2. [2]

    Available: https://doi.org/10.1016/j.aop.2010.09.012

    [Online]. Available: https://doi.org/10.1016/j.aop.2010.09.012

  3. [3]

    A practical introduction to tensor networks: Matrix product states and projected entangled pair states,

    R. Or ´us, “A practical introduction to tensor networks: Matrix product states and projected entangled pair states,”Annals of Physics, vol. 349, pp. 117–158, 2014. [Online]. Available: https: //doi.org/10.1016/j.aop.2014.06.013

  4. [4]

    Hyper-optimized tensor network contraction,

    J. Gray and S. Kourtis, “Hyper-optimized tensor network contraction,” Quantum, vol. 5, p. 410, 2021. [Online]. Available: https://doi.org/10. 22331/q-2021-03-15-410

  5. [5]

    Supervised learning with tensor networks,

    E. M. Stoudenmire and D. J. Schwab, “Supervised learning with tensor networks,” inAdvances in Neural Information Processing Systems 29, 2016, pp. 4799–4807

  6. [6]

    Tensorizing neural networks,

    A. Novikov, D. Podoprikhin, A. Osokin, and D. P. Vetrov, “Tensorizing neural networks,” inAdvances in Neural Information Processing Systems 28, 2015, pp. 442–450

  7. [7]

    Tensor networks and quantum error correction,

    A. J. Ferris and D. Poulin, “Tensor networks and quantum error correction,”Physical review letters, vol. 113, no. 3, p. 030501, 2014. [Online]. Available: https://doi.org/10.1103/PhysRevLett.113.030501

  8. [8]

    Dijkstra and Yoshitaka Tanimura

    J.-G. Liu, L. Wang, and P. Zhang, “Tropical tensor network for ground states of spin glasses,”Physical Review Letters, vol. 126, no. 9, p. 090506, 2021. [Online]. Available: https://doi.org/10.1103/PhysRevLett. 126.090506

  9. [9]

    The state of quantum computing applications in challenging many-body quantum dynamics,

    B. Fauseweh, “The state of quantum computing applications in challenging many-body quantum dynamics,”Nature Communications, vol. 15, p. 2123, 2024. [Online]. Available: https://doi.org/10.1038/ s41467-024-46402-9

  10. [10]

    Quantum supremacy using a programmable superconducting processor

    F. Arute, K. Arya, R. Babbush, D. Bacon, J. C. Bardin, R. Barends, R. Biswas, S. Boixo, F. G. Brandao, D. A. Buellet al., “Quantum supremacy using a programmable superconducting processor,”Nature, vol. 574, no. 7779, pp. 505–510, 2019. [Online]. Available: https://doi.org/10.1038/s41586-019-1666-5

  11. [11]

    Suppressing quantum errors by scaling a surface code logical qubit,

    Google Quantum AI, “Suppressing quantum errors by scaling a surface code logical qubit,”Nature, vol. 614, no. 7949, pp. 676–681, 2023. [Online]. Available: https://doi.org/10.1038/s41586-022-05434-1

  12. [12]

    Independent set enumeration in king’s graphs by tensor network contractions,

    K. Liang, “Independent set enumeration in king’s graphs by tensor network contractions,”arXiv preprint arXiv:2505.12776, 2025. [Online]. Available: https://arxiv.org/abs/2505.12776

  13. [13]

    Science354(6317), 1240–1241 (2016) https://doi.org/10.1126/science

    A. D. King, A. Nocera, M. Rams, J. Dziarmaga, R. Wiersema et al., “Beyond-classical computation in quantum simulation,”Science, pp. 199–204, 2025. [Online]. Available: https://doi.org/10.1126/science. ado6285

  14. [14]

    Simulating quantum computation by contract- ing tensor networks,

    I. L. Markov and Y . Shi, “Simulating quantum computation by contract- ing tensor networks,”SIAM Journal on Computing, vol. 38, no. 3, pp. 963–981, 2008

  15. [15]

    NVIDIA H100 Tensor Core GPU Architecture,

    NVIDIA Corporation, “NVIDIA H100 Tensor Core GPU Architecture,” https://resources.nvidia.com/en-us-tensor-core, 2022

  16. [16]

    cuTENSORMp: Multi-process tensor contraction library,

    ——, “cuTENSORMp: Multi-process tensor contraction library,” https: //docs.nvidia.com/cuda/cutensor/latest/user guide cutensorMp.html, 2024

  17. [17]

    NCCL: NVIDIA collective communications library,

    ——, “NCCL: NVIDIA collective communications library,” https:// developer.nvidia.com/nccl, 2024

  18. [18]

    Efficient parallelization of tensor network contraction for simulating quantum computation,

    C. Huang, F. Zhang, M. Newman, X. Ni, D. Ding, J. Cai, X. Gao, T. Wang, F. Wu, G. Zhang, H.-S. Ku, Z. Tian, J. Wu, H. Xu, H. Yu, B. Yuan, M. Szegedy, Y . Shi, H.-H. Zhao, C. Deng, and J. Chen, “Efficient parallelization of tensor network contraction for simulating quantum computation,”Nature Computational Science, vol. 1, pp. 578–587, 2021. [Online]. Avai...

  19. [19]

    Closing the “quantum supremacy

    Y . Liu, X. Liu, F. Li, Y . Yang, J. Song, P. Zhao, Z. Wang, D. Peng, H. Fu, D. Chen, W. Wu, H. Huang, and C. Guo, “Closing the “quantum supremacy” gap: Achieving real-time simulation of a random quantum circuit using a new Sunway supercomputer,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis...

  20. [20]

    Simulation of quantum circuits using the big-batch tensor network method,

    F. Pan and P. Zhang, “Simulation of quantum circuits using the big-batch tensor network method,”Physical Review Letters, vol. 128, no. 3, p. 030501, 2022. [Online]. Available: https: //doi.org/10.1103/PhysRevLett.128.030501

  21. [21]

    Solving the sampling problem of the Sycamore quantum circuits,

    F. Pan, K. Chen, and P. Zhang, “Solving the sampling problem of the Sycamore quantum circuits,”Physical Review Letters, vol. 129, no. 9, p. 090502, 2022. [Online]. Available: https://doi.org/10.1103/ PhysRevLett.129.090502

  22. [22]

    Leapfrogging Sycamore: Harnessing 1432 GPUs for 7x faster quantum random circuit sampling,

    X.-H. Zhao, H.-S. Zhong, F. Panet al., “Leapfrogging Sycamore: Harnessing 1432 GPUs for 7x faster quantum random circuit sampling,” National Science Review, vol. 12, no. 3, p. nwae317, 2025. [Online]. Available: https://doi.org/10.1093/nsr/nwae317

  23. [23]

    Hilfer fractional advection-diffusion equations with power-law initial condition; a Numerical study using variational iteration method

    R. Fu, Z. Su, H.-S. Zhong, X.-H. Zhao, J. Zhang, F. Panet al., “Surpassing Sycamore: Achieving energetic superiority through system- level circuit simulation,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. IEEE, 2024. [Online]. Available: https://doi.org/10.1109/SC41406.2024. 00085

  24. [24]

    SW-TNC: Reaching the most complex random quantum circuit via tensor network contraction,

    Y . Chen, Z. Sun, C. Qiu, Z. Li, Y . Liu, L. Gan, X. Duan, and G. Yang, “SW-TNC: Reaching the most complex random quantum circuit via tensor network contraction,”arXiv preprint arXiv:2504.09186, 2025. [Online]. Available: https://arxiv.org/abs/2504.09186

  25. [25]

    Hilfer fractional advection-diffusion equations with power-law initial condition; a Numerical study using variational iteration method

    M. Xu, S. Cao, X. Miao, U. A. Acar, and Z. Jia, “Atlas: Hierarchical partitioning for quantum circuit simulation on GPUs,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC24), 2024. [Online]. Available: https://doi.org/10.1109/SC41406.2024.00072

  26. [26]

    Multi-GPU quantum circuit simulation and the impact of network performance,

    W. M. Brown, A. Ramesh, T. Lubinski, T. Nguyen, and D. E. B. Neira, “Multi-GPU quantum circuit simulation and the impact of network performance,”arXiv preprint arXiv:2511.14664, 2025. [Online]. Available: https://arxiv.org/abs/2511.14664

  27. [27]

    Simulation of quantum computers: Review and acceleration opportunities,

    A. Cicero, M. A. Maleki, M. W. Azhar, A. F. Kockum, and P. Trancoso, “Simulation of quantum computers: Review and acceleration opportunities,”ACM Transactions on Quantum Computing, vol. 7, no. 1, p. 3, 2025. [Online]. Available: https://doi.org/10.1145/ 3701725

  28. [28]

    Strong quantum computational advantage using a superconducting quantum processor,

    Y . Wu, W.-S. Bao, S. Cao, F. Chen, Y . Chen, X. Chen, T.-H. Chung, H. Deng, Y . Du, D. Fanet al., “Strong quantum computational advantage using a superconducting quantum processor,”Physical Review Letters, vol. 127, no. 18, p. 180501, 2021. [Online]. Available: https://doi.org/10.1103/PhysRevLett.127.180501

  29. [29]

    The computational boundaries of quantum advantage,

    A. Zlokapa, F. Fuchs, L. Schaeffer, A. M. Dalzell, E. Lau, E. T. Hollandet al., “The computational boundaries of quantum advantage,” npj Quantum Information, vol. 9, p. 36, 2023. [Online]. Available: https://doi.org/10.1038/s41534-023-00744-7

  30. [30]

    Contracting arbitrary tensor networks: General approximate algorithm and applications in graphical models and quantum circuit simulations,

    F. Pan, P. Zhou, S. Li, and P. Zhang, “Contracting arbitrary tensor networks: General approximate algorithm and applications in graphical models and quantum circuit simulations,”Physical Review Letters, vol. 125, no. 6, p. 060503, 2020. [Online]. Available: https://doi.org/10.1103/PhysRevLett.125.060503

  31. [31]

    Efficient quantum circuit simulation by tensor network methods on modern GPUs,

    F. Pan, H. Gu, L. Kuang, B. Liu, and P. Zhang, “Efficient quantum circuit simulation by tensor network methods on modern GPUs,”ACM Transactions on Quantum Computing, vol. 5, no. 4, 2024. [Online]. Available: https://doi.org/10.1145/3696465

  32. [32]

    Efficient algorithms for maximum likelihood decoding of quantum error-correcting codes,

    S. Bravyi, M. Suchara, and A. Vargo, “Efficient algorithms for maximum likelihood decoding of quantum error-correcting codes,” Physical Review A, vol. 90, no. 3, p. 032326, 2014. [Online]. Available: https://doi.org/10.1103/PhysRevA.90.032326

  33. [33]

    Surface codes: Towards practical large-scale quantum computation,

    A. G. Fowler, M. Mariantoni, J. M. Martinis, and A. N. Cleland, “Surface codes: Towards practical large-scale quantum computation,” Physical Review A, vol. 86, no. 3, p. 032324, 2012. [Online]. Available: https://doi.org/10.1103/PhysRevA.86.032324

  34. [34]

    Tensor-network decoding beyond 2d,

    C. Piveteau, C. T. Chubb, and J. M. Renes, “Tensor-network decoding beyond 2d,”PRX Quantum, vol. 5, no. 4, p. 040303, 2024. [Online]. Available: https://doi.org/10.1103/PRXQuantum.5.040303

  35. [35]

    Learning high-accuracy error decoding for quantum processors,

    J. Bausch, M. S. Kesselring, A. Elben, V . Swaroop, B. Yao, A. Molleet al., “Learning high-accuracy error decoding for quantum processors,”Nature, vol. 635, pp. 834–840, 2024. [Online]. Available: https://doi.org/10.1038/s41586-024-08148-8

  36. [36]

    Nature , publisher=

    Google Quantum AIet al., “Quantum error correction below the surface code threshold,”Nature, vol. 638, pp. 920–926, 2025. [Online]. Available: https://doi.org/10.1038/s41586-024-08449-y

  37. [37]

    Generalized trotter’s formula and systematic approximants of exponential operators and inner derivations with applications to many-body problems,

    M. Suzuki, “Generalized trotter’s formula and systematic approximants of exponential operators and inner derivations with applications to many-body problems,”Communications in Mathematical Physics, vol. 51, no. 2, pp. 183–190, 1976. [Online]. Available: https: //doi.org/10.1007/BF01609348

  38. [38]

    doi:10.1103/PhysRevLett.93.040502 , url =

    G. Vidal, “Efficient simulation of one-dimensional quantum many-body systems,”Physical Review Letters, vol. 93, no. 4, p. 040502, 2004. [Online]. Available: https://doi.org/10.1103/PhysRevLett.93.040502

  39. [39]

    doi:10.1038/s41586-023-06096-3 , url =

    Y . Kim, A. Eddins, S. Anand, K. X. Wei, E. van den Berg, S. Rosenblatt, H. Nayfeh, Y . Wu, M. Zaletel, K. Temme, and A. Kandala, “Evidence for the utility of quantum computing before fault tolerance,”Nature, vol. 618, pp. 500–505, 2023. [Online]. Available: https://doi.org/10.1038/s41586-023-06096-3

  40. [40]

    Uncovering local integrability in quantum many- body dynamics,

    O. Shtanko, D. S. Wang, H. Zhang, N. Harle, A. Seif, R. Movassagh, and Z. Minev, “Uncovering local integrability in quantum many- body dynamics,”Nature Communications, 2025. [Online]. Available: https://doi.org/10.1038/s41467-025-57623-x

  41. [41]

    Quantum critical dynamics in a 5,000-qubit pro- grammable spin glass,

    A. D. Kinget al., “Quantum critical dynamics in a 5,000-qubit pro- grammable spin glass,”Nature, vol. 617, pp. 61–66, 2023

  42. [42]

    Confinement in a Z2 lattice gauge theory on a quantum computer,

    J. Mildenberger, Z. Jiang, W. Mruczkiewicz, J. C. Halimeh, and P. Hauke, “Confinement in a Z2 lattice gauge theory on a quantum computer,”Nature Physics, 2025. [Online]. Available: https://doi.org/10.1038/s41567-024-02723-6

  43. [43]

    cuTENSOR: A high-performance CUDA library for tensor primitives,

    NVIDIA Corporation, “cuTENSOR: A high-performance CUDA library for tensor primitives,” https://developer.nvidia.com/cutensor, 2024

  44. [44]

    High-performance tensor contraction without transposition,

    D. A. Matthews, “High-performance tensor contraction without transposition,”SIAM Journal on Scientific Computing, vol. 40, no. 1, pp. C1–C24, 2018. [Online]. Available: https://doi.org/10.1137/ 16M108968X