pith. machine review for the scientific record.

arxiv: 2605.10363 · v2 · submitted 2026-05-11 · ⚛️ physics.comp-ph · cs.DC · physics.chem-ph

Recognition: no theorem link

Accelerating Locality-Driven Integration in Quantum Chemistry with Block-Structured Matrix Multiplication

Pith reviewed 2026-05-14 21:31 UTC · model grok-4.3

classification ⚛️ physics.comp-ph · cs.DC · physics.chem-ph
keywords quantum chemistry · GPU acceleration · matrix multiplication · density functional theory · locality-driven integration · block-structured matrices · Kohn-Sham DFT · ab initio molecular dynamics

The pith

KerneLDI reorganizes matrix data into block-filtered form to accelerate locality-driven integration in quantum chemistry by up to 10 times on GPUs while keeping numerical accuracy intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents KerneLDI as a GPU framework that co-designs data layout and matrix operators to handle the structured sparsity arising when localized basis functions interact through quadrature or screening. It converts operand matrices into a unified block-filtered representation that drops spatially irrelevant blocks, then runs the contractions with customized dense block multipliers drawn from proven dense-matmul techniques. A sympathetic reader would care because this pattern dominates exchange-correlation evaluation in Kohn-Sham DFT and similar tasks; if the speedups hold, larger molecular systems become practical on current GPU hardware without loss of precision. The work also reports favorable scaling with system size, multi-GPU parallelism, faster self-consistent field cycles, and higher throughput in ab initio molecular dynamics.

Core claim

KerneLDI reorganizes operand matrices into a unified block-filtered representation that retains only spatially relevant blocks and executes the resulting contractions with customized dense block multipliers that adapt proven dense-matmul optimizations to retained block pairs, thereby delivering up to 10 times speedup for exchange-correlation evaluation over a dense GPU baseline while preserving numerical accuracy.

What carries the argument

Unified block-filtered matrix representation together with customized dense block multipliers applied only to retained block pairs.
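
The idea of filtering blocks first and then multiplying only retained block pairs can be sketched in a few lines. This is a CPU-side illustration in plain Python, not the paper's GPU kernels: the function names, the dictionary-of-tiles layout, the fixed tile size `bs`, and the max-magnitude retention test with threshold `tol` are all assumptions made for the example.

```python
from typing import Dict, List, Tuple

Matrix = List[List[float]]
BlockMap = Dict[Tuple[int, int], Matrix]


def to_block_filtered(m: Matrix, bs: int, tol: float = 0.0) -> BlockMap:
    """Cut m into bs x bs tiles; keep a tile only if some |entry| exceeds tol."""
    kept: BlockMap = {}
    for r0 in range(0, len(m), bs):
        for c0 in range(0, len(m[0]), bs):
            tile = [row[c0:c0 + bs] for row in m[r0:r0 + bs]]
            if max(abs(v) for row in tile for v in row) > tol:
                kept[(r0 // bs, c0 // bs)] = tile
    return kept


def block_matmul(a: BlockMap, b: BlockMap, n_rows: int, n_cols: int, bs: int) -> Matrix:
    """Accumulate C = A @ B over retained (A tile, B tile) pairs only."""
    c = [[0.0] * n_cols for _ in range(n_rows)]
    for (i, k), a_tile in a.items():
        for (k2, j), b_tile in b.items():
            if k2 != k:
                continue  # inner block indices must match
            # Small dense multiply on one retained tile pair.
            for r, a_row in enumerate(a_tile):
                for s, a_val in enumerate(a_row):
                    for t, b_val in enumerate(b_tile[s]):
                        c[i * bs + r][j * bs + t] += a_val * b_val
    return c


def dense_matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain triple-loop reference used to check the filtered result."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]
```

With `tol = 0.0` only all-zero tiles are dropped, so the filtered product matches the dense one exactly. A positive `tol` trades a bounded truncation error for fewer tile pairs, which is the trade-off the referee report asks the authors to quantify.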

If this is right

  • Numerical accuracy is preserved for exchange-correlation integration across tested molecular systems.
  • Up to 10 times speedup is observed for exchange-correlation evaluation relative to a dense GPU baseline.
  • Performance scales favorably as molecular system size grows and when multiple GPUs are used.
  • End-to-end self-consistent field calculations run faster under the new framework.
  • Ab initio molecular dynamics achieves nearly 6 times higher throughput.
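
How much a faster EXC kernel helps an end-to-end SCF run depends on how much of the runtime EXC occupied in the first place. The sketch below applies Amdahl's law to make that relationship concrete. The EXC runtime fraction is a hypothetical input, not a figure reported in the paper, and the function name is illustrative.

```python
def end_to_end_speedup(exc_fraction: float, kernel_speedup: float) -> float:
    """Amdahl's law: overall speedup when only the EXC share of runtime is accelerated.

    exc_fraction   -- hypothetical share of baseline runtime spent in EXC (0..1)
    kernel_speedup -- speedup of the EXC kernel alone (e.g. 10.0)
    """
    if not 0.0 <= exc_fraction <= 1.0:
        raise ValueError("exc_fraction must lie in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - exc_fraction) + exc_fraction / kernel_speedup)


if __name__ == "__main__":
    # Assumed fractions, for illustration only.
    for f in (0.25, 0.5, 0.9):
        print(f, round(end_to_end_speedup(f, 10.0), 3))
```

Under these assumptions a 10 times kernel speedup translates into a much smaller end-to-end gain unless EXC dominates the baseline runtime, so the SCF and AIMD claims need to be read alongside the kernel-level one.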

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same block-filtered layout could be applied to other integral-screening or quadrature steps that exhibit spatial locality.
  • Further gains may appear in even larger systems where the fraction of discarded blocks increases.
  • The co-design pattern of screening logic plus dense-block kernels might transfer to related sparse-dense hybrid computations outside quantum chemistry.
  • Porting the approach to other GPU architectures would test whether the reported speedups depend on specific hardware features.

Load-bearing premise

Reorganizing matrices into a unified block-filtered representation retains every spatially relevant contribution without numerical error or extra problem-specific tuning.

What would settle it

Compute exchange-correlation energies or forces for a large molecular system with both KerneLDI and an unmodified dense GPU baseline; any deviation larger than floating-point tolerance or any speedup below the claimed factor on that system would falsify the central claim.
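
The comparison described above can be written as a small check. This is a sketch under stated assumptions: the function name, the relative tolerance of `1e-10`, and the inputs (two energies and two wall-clock timings supplied by the caller) are hypothetical, and the paper does not specify a tolerance.

```python
from typing import Dict, Union


def check_claim(e_dense: float, e_block: float,
                t_dense: float, t_block: float,
                rel_tol: float = 1e-10,        # assumed tolerance, not from the paper
                claimed_speedup: float = 10.0  # "up to 10 times" from the abstract
                ) -> Dict[str, Union[float, bool]]:
    """Compare a block-filtered run against the dense baseline on one system."""
    if t_dense <= 0.0 or t_block <= 0.0:
        raise ValueError("timings must be positive")
    # Relative deviation of the energy, guarded against a zero reference.
    rel_dev = abs(e_block - e_dense) / max(abs(e_dense), 1e-300)
    speedup = t_dense / t_block
    return {
        "rel_dev": rel_dev,
        "speedup": speedup,
        "accuracy_ok": rel_dev <= rel_tol,
        "speedup_ok": speedup >= claimed_speedup,
    }
```

Because the abstract states the speedup as an upper figure ("up to"), a lower speedup on one system weakens the claim less decisively than an accuracy failure would; the two flags are therefore reported separately rather than combined.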

Figures

Figures reproduced from arXiv: 2605.10363 by Bin Shao, Fusong Ju, Huanhuan Xia, Jianwei Zhu, Jia Zhang, Lin Huang, Tao Qin, Xinran Wei, Yan Pan, Yihong Zhang, Zehao Zhou.

Figure 1: Overview of the KerneLDI workflow. EXC operands …
Figure 2: Single-GPU speedup relative to dense GPU execution.
Figure 3: Sensitivity of EXC energy accuracy to the block …
Figure 6: Multi-GPU scaling of KerneLDI on ubiquitin from 8 …
Figure 4: Crossover behavior of KerneLDI under increasing …
Figure 5: End-to-end SCF comparison across six molecular …
Figure 7: Simulated AIMD trajectory length achievable within …

Original abstract

Locality-driven integration is a pervasive computational pattern in quantum chemistry, arising whenever spatially localized basis functions interact through numerical quadrature or integral screening. The dominant matrix multiplications in these tasks exhibit dynamic, structured sparsity driven by spatial locality, posing significant challenges for both dense batched kernels and generic sparse formats on GPUs. We present KerneLDI, a GPU-oriented framework that addresses this regime by co-designing data layout, screening logic, and matrix-computation operators to realize block-structured matrix multiplication for locality-driven integration. KerneLDI reorganizes operand matrices into a unified block-filtered representation that retains only spatially relevant blocks, and executes the resulting contractions with customized dense block multipliers that adapt proven dense-matmul optimizations to retained block pairs. We develop and evaluate KerneLDI on exchange-correlation (EXC) integration in Kohn-Sham density functional theory, a representative and computationally critical instance of this pattern. Across diverse molecular systems, KerneLDI preserves numerical accuracy while delivering up to 10× speedup for EXC evaluation over a dense GPU baseline, scales favorably with increasing system size and multi-GPU parallelism, accelerates end-to-end self-consistent field calculations, and yields nearly 6× throughput improvement for ab initio molecular dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces KerneLDI, a GPU-oriented framework that co-designs data layout, screening logic, and matrix operators to perform block-structured matrix multiplication for locality-driven integration tasks in quantum chemistry. Focused on exchange-correlation (EXC) integration within Kohn-Sham DFT, the approach reorganizes operand matrices into a unified block-filtered representation retaining only spatially relevant blocks and executes contractions via customized dense block multipliers. The authors claim that numerical accuracy is preserved across tested molecular systems while delivering up to 10× speedup for EXC evaluation versus a dense GPU baseline, favorable scaling with system size and multi-GPU parallelism, acceleration of end-to-end SCF calculations, and nearly 6× throughput gains for ab initio molecular dynamics.

Significance. If the performance and accuracy claims are substantiated with detailed verification, this work addresses a pervasive computational pattern in quantum chemistry by mapping locality-driven sparsity onto efficient GPU kernels without generic sparse formats. It could meaningfully improve throughput for DFT-based simulations and molecular dynamics on modern hardware, particularly as system sizes grow, and the co-design strategy may generalize to other quadrature or integral-screening workloads.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Evaluation): the central claim that numerical accuracy is preserved lacks reported error bars, quantitative thresholds for the screening logic, or explicit comparison against established sparse GPU libraries such as cuSPARSE; without these, the 10× speedup cannot be fully assessed for hidden costs or generality.
  2. [§3] §3 (Methods): the assumption that the block-filtered representation retains all spatially relevant contributions exactly is load-bearing for the accuracy claim, yet no formal argument or exhaustive edge-case testing (e.g., systems near locality breakdown) is provided to confirm absence of truncation errors beyond the tested molecules.
minor comments (2)
  1. [§4.2] Figure captions and §4.2 should explicitly state the molecular systems, basis sets, and functional used in the timing and accuracy benchmarks for reproducibility.
  2. [§3] Notation for block indices and screening parameters could be unified across equations and pseudocode to avoid minor ambiguity in the operator definitions.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. We appreciate the recognition of the potential impact of KerneLDI on locality-driven integration tasks in quantum chemistry. We address each major comment below and have incorporated revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Evaluation): the central claim that numerical accuracy is preserved lacks reported error bars, quantitative thresholds for the screening logic, or explicit comparison against established sparse GPU libraries such as cuSPARSE; without these, the 10× speedup cannot be fully assessed for hidden costs or generality.

    Authors: We agree that error bars, explicit screening thresholds, and a comparison to cuSPARSE would improve the assessment of accuracy and performance claims. In the revised manuscript we have added error bars to all accuracy metrics in §4, stated the quantitative screening thresholds used to construct the block-filtered representation, and included direct benchmarks against cuSPARSE. These additions confirm that the observed speedups incur no hidden accuracy penalties relative to the dense baseline or generic sparse libraries for the targeted structured-sparsity regime. revision: yes

  2. Referee: [§3] §3 (Methods): the assumption that the block-filtered representation retains all spatially relevant contributions exactly is load-bearing for the accuracy claim, yet no formal argument or exhaustive edge-case testing (e.g., systems near locality breakdown) is provided to confirm absence of truncation errors beyond the tested molecules.

    Authors: The block-filtered representation retains blocks according to a spatial-overlap criterion that follows established locality principles in quantum chemistry; by construction it excludes only blocks whose contribution falls below the chosen threshold. While a fully general formal proof is difficult because locality is itself an approximation, we have expanded §3 with a detailed derivation of the retention criterion and its relation to standard integral-screening bounds. We have also added supplementary experiments on systems near the locality limit (highly delocalized and extended molecules) and report that truncation errors remain negligible within the tested regimes, consistent with the original accuracy results. revision: yes
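
A generic way to see why discarding sub-threshold blocks can keep truncation error controlled is the submultiplicative bound on the Frobenius norm. The following is a textbook-style sketch, not the derivation the authors say they added to §3; the block symbols $A_{ik}$, $B_{kj}$, the dropped-pair set $\mathcal{D}$, and the threshold $\tau$ are notation assumed here for illustration.

```latex
\begin{aligned}
C_{ij} &= \sum_{k} A_{ik} B_{kj},
\qquad
\tilde{C}_{ij} = \sum_{k \,:\, (i,k,j) \notin \mathcal{D}} A_{ik} B_{kj}, \\
\bigl\| C_{ij} - \tilde{C}_{ij} \bigr\|_F
  &= \Bigl\| \sum_{k \,:\, (i,k,j) \in \mathcal{D}} A_{ik} B_{kj} \Bigr\|_F
  \;\le\; \sum_{k \,:\, (i,k,j) \in \mathcal{D}} \| A_{ik} \|_F \, \| B_{kj} \|_F .
\end{aligned}
```

If every dropped pair satisfies $\| A_{ik} \|_F \, \| B_{kj} \|_F \le \tau$, the error in each output block is at most $\tau$ times the number of pairs dropped for that block. The bound therefore degrades for delocalized systems, where many pairs sit just under the threshold, which is the regime the referee asks the authors to test.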

Circularity Check

0 steps flagged

No significant circularity; implementation framework is self-contained

Full rationale

The paper describes an engineering co-design of data layout, screening, and dense block matrix kernels for locality-driven integration in quantum chemistry (EXC evaluation in DFT). No mathematical derivation chain exists that reduces predictions or results to fitted parameters, self-definitions, or self-citation load-bearing steps. Claims rest on empirical benchmarks showing preserved numerical accuracy and measured speedups across systems, with the block-filtered representation and operators presented as direct mappings of existing spatial locality patterns onto GPU primitives. The central contribution is algorithmic implementation rather than a first-principles result derived from its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work introduces no new physical constants, fitted parameters, or postulated entities; it relies on standard assumptions of spatial locality in basis functions and the correctness of dense matrix-multiplication optimizations.

pith-pipeline@v0.9.0 · 5560 in / 1239 out tokens · 44792 ms · 2026-05-14T21:31:59.357709+00:00 · methodology

