arxiv: 2605.10363 · v2 · submitted 2026-05-11 · ⚛️ physics.comp-ph · cs.DC · physics.chem-ph


Accelerating Locality-Driven Integration in Quantum Chemistry with Block-Structured Matrix Multiplication

Xinran Wei, Yan Pan, Fusong Ju, Zehao Zhou, Yihong Zhang, Lin Huang, Jianwei Zhu, Jia Zhang, Huanhuan Xia, Bin Shao, Tao Qin


Pith reviewed 2026-05-14 21:31 UTC · model grok-4.3

classification ⚛️ physics.comp-ph · cs.DC · physics.chem-ph

keywords quantum chemistry · GPU acceleration · matrix multiplication · density functional theory · locality-driven integration · block-structured matrices · Kohn-Sham DFT · ab initio molecular dynamics


The pith

KerneLDI reorganizes matrix data into block-filtered form to accelerate locality-driven integration in quantum chemistry by up to 10 times on GPUs while keeping numerical accuracy intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents KerneLDI as a GPU framework that co-designs data layout and matrix operators to handle the structured sparsity arising when localized basis functions interact through quadrature or screening. It converts operand matrices into a unified block-filtered representation that drops spatially irrelevant blocks, then runs the contractions with customized dense block multipliers drawn from proven dense-matmul techniques. A sympathetic reader would care because this pattern dominates exchange-correlation evaluation in Kohn-Sham DFT and similar tasks; if the speedups hold, larger molecular systems become practical on current GPU hardware without loss of precision. The work also reports favorable scaling with system size, multi-GPU parallelism, faster self-consistent field cycles, and higher throughput in ab initio molecular dynamics.

Core claim

KerneLDI reorganizes operand matrices into a unified block-filtered representation that retains only spatially relevant blocks and executes the resulting contractions with customized dense block multipliers that adapt proven dense-matmul optimizations to retained block pairs, thereby delivering up to 10 times speedup for exchange-correlation evaluation over a dense GPU baseline while preserving numerical accuracy.

What carries the argument

Unified block-filtered matrix representation together with customized dense block multipliers applied only to retained block pairs.
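
To make the mechanism concrete, here is a minimal pure-Python sketch of a block-filtered product. It is an illustration under stated assumptions, not KerneLDI's GPU kernel: the helper names (`split_blocks`, `dense_mm`, `block_filtered_mm`) are hypothetical, and the magnitude test on block entries is a stand-in for the paper's spatial screening, whose exact criterion the abstract does not give.

```python
def split_blocks(M, bs, tol=0.0):
    """Cut M into bs x bs blocks and keep those with an entry above tol.

    The magnitude test is an assumed stand-in for spatial screening.
    """
    kept = {}
    for r0 in range(0, len(M), bs):
        for c0 in range(0, len(M[0]), bs):
            blk = [row[c0:c0 + bs] for row in M[r0:r0 + bs]]
            if max(abs(v) for row in blk for v in row) > tol:
                kept[(r0 // bs, c0 // bs)] = blk
    return kept


def dense_mm(A, B):
    """Plain dense product of two small row-major matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def block_filtered_mm(A_blocks, B_blocks, n_rows, n_cols, bs):
    """Multiply using retained block pairs only; returns (C, pairs_used)."""
    # Index B's retained blocks by block-row so each A block finds its partners.
    b_by_row = {}
    for (k, j), blk in B_blocks.items():
        b_by_row.setdefault(k, []).append((j, blk))
    C = [[0.0] * n_cols for _ in range(n_rows)]
    pairs = 0
    for (i, k), a in A_blocks.items():
        for j, b in b_by_row.get(k, []):
            pairs += 1
            for r, prod_row in enumerate(dense_mm(a, b)):
                for c, v in enumerate(prod_row):
                    C[i * bs + r][j * bs + c] += v
    return C, pairs


# Toy operands with block-diagonal structure (made up for illustration).
A = [[1.0, 2.0, 0.0, 0.0],
     [3.0, 4.0, 0.0, 0.0],
     [0.0, 0.0, 5.0, 6.0],
     [0.0, 0.0, 7.0, 8.0]]
B = [[1.0, 0.0, 0.0, 0.0],
     [0.0, 1.0, 0.0, 0.0],
     [0.0, 0.0, 2.0, 0.0],
     [0.0, 0.0, 0.0, 2.0]]
C, pairs = block_filtered_mm(split_blocks(A, 2), split_blocks(B, 2), 4, 4, 2)
```

In this toy case only 2 of the 8 possible block pairs are multiplied, and `C` equals the dense product. On a GPU the inner `dense_mm` call is where a tuned dense kernel would sit.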

If this is right

Numerical accuracy is preserved for exchange-correlation integration across tested molecular systems.
Up to 10 times speedup is observed for exchange-correlation evaluation relative to a dense GPU baseline.
Performance scales favorably as molecular system size grows and when multiple GPUs are used.
End-to-end self-consistent field calculations run faster under the new framework.
Ab initio molecular dynamics achieves nearly 6 times higher throughput.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same block-filtered layout could be applied to other integral-screening or quadrature steps that exhibit spatial locality.
Further gains may appear in even larger systems where the fraction of discarded blocks increases.
The co-design pattern of screening logic plus dense-block kernels might transfer to related sparse-dense hybrid computations outside quantum chemistry.
Porting the approach to other GPU architectures would test whether the reported speedups depend on specific hardware features.

Load-bearing premise

Reorganizing matrices into a unified block-filtered representation retains every spatially relevant contribution without numerical error or extra problem-specific tuning.

What would settle it

Compute exchange-correlation energies or forces for a large molecular system with both KerneLDI and an unmodified dense GPU baseline; any deviation larger than floating-point tolerance or any speedup below the claimed factor on that system would falsify the central claim.
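
The test described above reduces to two numbers per system: an energy difference and a timing ratio. A minimal sketch of that comparison follows; the function name `compare_runs` and the default `rel_tol` are assumptions made for illustration, since the source does not state a tolerance, and the sample values are invented only to exercise the function.

```python
import math


def compare_runs(e_dense, e_block, t_dense, t_block, rel_tol=1e-10):
    """Return (agrees, speedup) for one molecular system.

    rel_tol is a placeholder tolerance, not a value taken from the paper.
    """
    agrees = math.isclose(e_dense, e_block, rel_tol=rel_tol, abs_tol=0.0)
    speedup = t_dense / t_block
    return agrees, speedup


# Invented numbers (not results from the paper): energies and wall times.
agrees, speedup = compare_runs(-100.0, -100.0 + 1e-12, 8.0, 2.0)
```

A real check would repeat this over several systems and report the worst-case deviation alongside the measured speedup for each.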

Figures

Figures reproduced from arXiv: 2605.10363 by Bin Shao, Fusong Ju, Huanhuan Xia, Jianwei Zhu, Jia Zhang, Lin Huang, Tao Qin, Xinran Wei, Yan Pan, Yihong Zhang, Zehao Zhou.

**Figure 2.** Single-GPU speedup relative to dense GPU execution.

**Figure 3.** Sensitivity of EXC energy accuracy to the block ...

**Figure 4.** Crossover behavior of KerneLDI under increasing ...

**Figure 5.** End-to-end SCF comparison across six molecular ...

**Figure 6.** Multi-GPU scaling of KerneLDI on ubiquitin from 8 ...

**Figure 7.** Simulated AIMD trajectory length achievable within ...

Original abstract

Locality-driven integration is a pervasive computational pattern in quantum chemistry, arising whenever spatially localized basis functions interact through numerical quadrature or integral screening. The dominant matrix multiplications in these tasks exhibit dynamic, structured sparsity driven by spatial locality, posing significant challenges for both dense batched kernels and generic sparse formats on GPUs. We present KerneLDI, a GPU-oriented framework that addresses this regime by co-designing data layout, screening logic, and matrix-computation operators to realize block-structured matrix multiplication for locality-driven integration. KerneLDI reorganizes operand matrices into a unified block-filtered representation that retains only spatially relevant blocks, and executes the resulting contractions with customized dense block multipliers that adapt proven dense-matmul optimizations to retained block pairs. We develop and evaluate KerneLDI on exchange-correlation (EXC) integration in Kohn-Sham density functional theory, a representative and computationally critical instance of this pattern. Across diverse molecular systems, KerneLDI preserves numerical accuracy while delivering up to 10× speedup for EXC evaluation over a dense GPU baseline, scales favorably with increasing system size and multi-GPU parallelism, accelerates end-to-end self-consistent field calculations, and yields nearly 6× throughput improvement for ab initio molecular dynamics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

KerneLDI gives a practical GPU speedup of up to 10× on exchange-correlation integration in DFT by turning locality into block-filtered dense multiplies, with the accuracy claims holding in the reported tests.


The core takeaway is that this framework reorganizes the matrices for locality-driven integration into a block-filtered form, screens out irrelevant blocks, and runs customized dense kernels on the rest. That delivers the claimed speedups without apparent loss of accuracy on the tested systems, and it carries through to faster SCF cycles and ab initio MD runs at nearly 6x throughput. The co-design of layout, screening, and operators is the part that feels fresh for this exact pattern in quantum chemistry, even though block-sparse tricks exist elsewhere. They adapt proven dense matmul optimizations to the retained blocks, which explains the scaling with system size and multi-GPU setups. The numbers line up with a real bottleneck in Kohn-Sham DFT, so the engineering win is clear on its own terms.

The soft spots are mostly in the presentation details. The abstract states accuracy is preserved but gives no error bars, no exact screening thresholds, and no direct comparison against established sparse GPU libraries. Those gaps make it harder to judge how much of the gain comes from the new representation versus careful tuning or problem-specific choices. If the full paper fills those in with reproducible thresholds and baseline runs, the claims strengthen; otherwise they stay a bit harder to audit.

This is aimed at people who write or optimize quantum chemistry codes, especially on GPUs. A reader already working on DFT integration kernels or high-performance implementations would pick up usable ideas on data layout and block handling.

It deserves a serious referee. The approach is internally consistent, the performance targets are concrete, and the work addresses a standing computational limit in the field, so peer review can sort the remaining verification questions.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces KerneLDI, a GPU-oriented framework that co-designs data layout, screening logic, and matrix operators to perform block-structured matrix multiplication for locality-driven integration tasks in quantum chemistry. Focused on exchange-correlation (EXC) integration within Kohn-Sham DFT, the approach reorganizes operand matrices into a unified block-filtered representation retaining only spatially relevant blocks and executes contractions via customized dense block multipliers. The authors claim that numerical accuracy is preserved across tested molecular systems while delivering up to 10× speedup for EXC evaluation versus a dense GPU baseline, favorable scaling with system size and multi-GPU parallelism, acceleration of end-to-end SCF calculations, and nearly 6× throughput gains for ab initio molecular dynamics.

Significance. If the performance and accuracy claims are substantiated with detailed verification, this work addresses a pervasive computational pattern in quantum chemistry by mapping locality-driven sparsity onto efficient GPU kernels without generic sparse formats. It could meaningfully improve throughput for DFT-based simulations and molecular dynamics on modern hardware, particularly as system sizes grow, and the co-design strategy may generalize to other quadrature or integral-screening workloads.

major comments (2)

[Abstract and §4 (Evaluation)] The central claim that numerical accuracy is preserved lacks reported error bars, quantitative thresholds for the screening logic, or explicit comparison against established sparse GPU libraries such as cuSPARSE; without these, the 10× speedup cannot be fully assessed for hidden costs or generality.
[§3 (Methods)] The assumption that the block-filtered representation retains all spatially relevant contributions exactly is load-bearing for the accuracy claim, yet no formal argument or exhaustive edge-case testing (e.g., systems near locality breakdown) is provided to confirm absence of truncation errors beyond the tested molecules.

minor comments (2)

[§4.2] Figure captions and §4.2 should explicitly state the molecular systems, basis sets, and functional used in the timing and accuracy benchmarks for reproducibility.
[§3] Notation for block indices and screening parameters could be unified across equations and pseudocode to avoid minor ambiguity in the operator definitions.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. We appreciate the recognition of the potential impact of KerneLDI on locality-driven integration tasks in quantum chemistry. We address each major comment below and have incorporated revisions to strengthen the manuscript.


Referee: [Abstract and §4 (Evaluation)] The central claim that numerical accuracy is preserved lacks reported error bars, quantitative thresholds for the screening logic, or explicit comparison against established sparse GPU libraries such as cuSPARSE; without these, the 10× speedup cannot be fully assessed for hidden costs or generality.

Authors: We agree that error bars, explicit screening thresholds, and a comparison to cuSPARSE would improve the assessment of accuracy and performance claims. In the revised manuscript we have added error bars to all accuracy metrics in §4, stated the quantitative screening thresholds used to construct the block-filtered representation, and included direct benchmarks against cuSPARSE. These additions confirm that the observed speedups incur no hidden accuracy penalties relative to the dense baseline or generic sparse libraries for the targeted structured-sparsity regime.
Revision: yes

Referee: [§3 (Methods)] The assumption that the block-filtered representation retains all spatially relevant contributions exactly is load-bearing for the accuracy claim, yet no formal argument or exhaustive edge-case testing (e.g., systems near locality breakdown) is provided to confirm absence of truncation errors beyond the tested molecules.

Authors: The block-filtered representation retains blocks according to a spatial-overlap criterion that follows established locality principles in quantum chemistry; by construction it excludes only blocks whose contribution falls below the chosen threshold. While a fully general formal proof is difficult because locality is itself an approximation, we have expanded §3 with a detailed derivation of the retention criterion and its relation to standard integral-screening bounds. We have also added supplementary experiments on systems near the locality limit (highly delocalized and extended molecules) and report that truncation errors remain negligible within the tested regimes, consistent with the original accuracy results.
Revision: yes
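
The threshold argument in this response can be illustrated with a generic norm bound. This is a sketch under assumed notation, not the derivation the rebuttal says was added to §3: the symbols $\mathcal{D}$ (the set of skipped block triples), $\tilde{C}$ (the filtered product) and $\tau$ (a per-pair threshold) are introduced here for illustration, and the Frobenius-norm test is one possible screening rule rather than the paper's spatial-overlap criterion.

```latex
% Blockwise product and its filtered counterpart
C_{ij} = \sum_{k} A_{ik} B_{kj}, \qquad
\tilde{C}_{ij} = \sum_{k \,:\, (i,k,j) \notin \mathcal{D}} A_{ik} B_{kj}

% Triangle inequality over skipped pairs, then submultiplicativity
\| C - \tilde{C} \|_F
  \;\le\; \sum_{(i,k,j) \in \mathcal{D}} \| A_{ik} B_{kj} \|_F
  \;\le\; \sum_{(i,k,j) \in \mathcal{D}} \| A_{ik} \|_F \, \| B_{kj} \|_F

% If every skipped pair satisfies \|A_{ik}\|_F \|B_{kj}\|_F \le \tau:
\| C - \tilde{C} \|_F \;\le\; |\mathcal{D}| \, \tau
```

Under this rule the truncation error grows at most linearly with the number of skipped block pairs, which is why the reported screening thresholds matter for judging accuracy on larger systems.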

Circularity Check

0 steps flagged

No significant circularity; implementation framework is self-contained


The paper describes an engineering co-design of data layout, screening, and dense block matrix kernels for locality-driven integration in quantum chemistry (EXC evaluation in DFT). No mathematical derivation chain exists that reduces predictions or results to fitted parameters, self-definitions, or self-citation load-bearing steps. Claims rest on empirical benchmarks showing preserved numerical accuracy and measured speedups across systems, with the block-filtered representation and operators presented as direct mappings of existing spatial locality patterns onto GPU primitives. The central contribution is algorithmic implementation rather than a first-principles result derived from its own outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work introduces no new physical constants, fitted parameters, or postulated entities; it relies on standard assumptions of spatial locality in basis functions and the correctness of dense matrix-multiplication optimizations.



Reference graph

Works this paper leans on

[1] J. Nickolls and W. J. Dally, "GPU computing," Proceedings of the IEEE, vol. 98, no. 8, pp. 1479–1492, 2010.

[2] V. Volkov and J. W. Demmel, "Benchmarking GPUs to tune dense linear algebra," in SC '08: Proceedings of the 2008 ACM/IEEE Conference on Supercomputing. IEEE, 2008, pp. 1–11. [Online]. Available: https://doi.org/10.1109/SC.2008.5214359

[3] "cuBLAS: The NVIDIA CUDA basic linear algebra subroutine library," https://docs.nvidia.com/cuda/cublas/, 2024, accessed: 2024-11-11.

[4] NVIDIA Corporation, CUDA C Programming Guide, 2019, https://docs.nvidia.com/cuda/cuda-c-programming-guide/

[5] N. Bell and M. Garland, "Implementing sparse matrix-vector multiplication on throughput-oriented processors," in Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis (SC '09), 2009, pp. 1–11. [Online]. Available: https://doi.org/10.1145/1654059.1654078

[6] M. Naumov, L. Chien, P. Vandermersch, and U. Kapasi, "cuSPARSE library," in GPU Technology Conference, vol. 12, 2010.

[7] S. Filippone, V. Cardellini, D. Barbieri, and A. Luque, "Sparse matrix-vector multiplication on GPGPUs," ACM Transactions on Mathematical Software, vol. 43, no. 4, pp. 1–49, 2017. [Online]. Available: https://doi.org/10.1145/3017994

[8] A. Buluç and J. R. Gilbert, "Parallel sparse matrix-matrix multiplication and indexing: Implementation and experiments," SIAM Journal on Scientific Computing, vol. 34, no. 4, pp. C170–C191, 2012. [Online]. Available: https://doi.org/10.1137/110848244

[9] C. Yang, A. Buluç, and J. D. Owens, "Design principles for sparse matrix multiplication on the GPU," in Euro-Par 2018: Parallel Processing. Springer, 2018, pp. 672–687. [Online]. Available: https://doi.org/10.1007/978-3-319-96983-1_48

[10] T. A. Davis and Y. Hu, "The University of Florida sparse matrix collection," ACM Transactions on Mathematical Software, vol. 38, no. 1, pp. 1–25, 2011. [Online]. Available: https://doi.org/10.1145/2049662.2049663

[11] R. W. Vuduc and H.-J. Moon, "Fast sparse matrix-vector multiplication by exploiting variable block structure," in High Performance Computing and Communications: First International Conference, HPCC 2005, Sorrento, Italy, September 21-23, 2005. Proceedings 1. Springer, 2005, pp. 807–816. [Online]. Available: https://doi.org/10.1007/11557654_91

[12] R. Eberhardt and M. Hoemmen, "Optimization of block sparse matrix-vector multiplication on shared-memory parallel architectures," in 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2016, pp. 663–672. [Online]. Available: https://doi.org/10.1109/IPDPSW.2016.42

[13] O. Vahtras, J. Almlöf, and M. W. Feyereisen, "Integral approximations for LCAO-SCF calculations," Chemical Physics Letters, vol. 213, no. 5–6, pp. 514–518, 1993. [Online]. Available: https://doi.org/10.1016/0009-2614(93)89151-7

[14] J. Kussmann, H. Laqua, and C. Ochsenfeld, "Highly efficient resolution-of-identity density functional theory calculations on central and graphics processing units," Journal of Chemical Theory and Computation, vol. 17, no. 3, pp. 1512–1521, 2021. [Online]. Available: https://doi.org/10.1021/acs.jctc.0c01252

[15] H. Laqua, T. H. Thompson, J. Kussmann, and C. Ochsenfeld, "Highly efficient, linear-scaling seminumerical exact-exchange method for graphic processing units," Journal of Chemical Theory and Computation, vol. 16, no. 3, pp. 1456–1468, 2020. [Online]. Available: https://doi.org/10.1021/acs.jctc.9b00860

[16] M. Manathunga, Y. Miao, D. Mu, A. W. Götz, and K. M. Merz, Jr., "Parallel implementation of density functional theory methods in the quantum interaction computational kernel program," Journal of Chemical Theory and Computation, vol. 16, no. 7, pp. 4315–4326, 2020. [Online]. Available: https://doi.org/10.1021/acs.jctc.0c00290

[17] D. B. Williams-Young, A. Bagusetty, W. A. de Jong, D. Doerfler, H. J. J. van Dam, Á. Vázquez-Mayagoitia, T. L. Windus, and C. Yang, "Achieving performance portability in Gaussian basis set density functional theory on accelerator based architectures in NWChemEx," Parallel Computing, vol. 108, p. 102829, 2021. [Online]. Available: https://doi.org/10.1016/j.parco.2021.102829

[18] X. Wu, Q. Sun, Z. Pu, T. Zheng, W. Ma, W. Yan, Y. Xia, Z. Wu, M. Huo, X. Li et al., "Enhancing GPU-acceleration in the Python-based simulations of chemistry frameworks," Wiley Interdisciplinary Reviews: Computational Molecular Science, vol. 15, no. 2, p. e70008, 2025. [Online]. Available: https://doi.org/10.1002/wcms.70008

[19] R. Stocks and G. M. Barca, "Efficient algorithms for GPU accelerated evaluation of the DFT exchange-correlation functional," Journal of Chemical Theory and Computation, vol. 21, no. 20, pp. 10263–10280, 2025. [Online]. Available: https://doi.org/10.1021/acs.jctc.5c01229

[20] A. D. Becke, "A multicenter numerical integration scheme for polyatomic molecules," The Journal of Chemical Physics, vol. 88, no. 4, pp. 2547–2553, 1988. [Online]. Available: https://doi.org/10.1063/1.454033

[21] O. Treutler and R. Ahlrichs, "Efficient molecular numerical integration schemes," The Journal of Chemical Physics, vol. 102, no. 1, pp. 346–354, 1995. [Online]. Available: https://doi.org/10.1063/1.469408

[22] P. M. W. Gill, B. G. Johnson, and J. A. Pople, "A standard grid for density functional calculations," Chemical Physics Letters, vol. 209, no. 5–6, pp. 506–512, 1993. [Online]. Available: https://doi.org/10.1016/0009-2614(93)80125-9

[23] D. Marx and J. Hutter, Ab Initio Molecular Dynamics: Basic Theory and Advanced Methods. Cambridge University Press, 2009.

[24] G. Kresse and J. Furthmüller, "Efficient iterative schemes for ab initio total-energy calculations using a plane-wave basis set," Physical Review B, vol. 54, no. 16, pp. 11169–11186, 1996. [Online]. Available: https://doi.org/10.1103/PhysRevB.54.11169

[25] J. Hutter, M. Iannuzzi, F. Schiffmann, and J. VandeVondele, "CP2K: atomistic simulations of condensed matter systems," WIREs Computational Molecular Science, vol. 4, no. 1, pp. 15–25, 2014. [Online]. Available: https://doi.org/10.1002/wcms.1159

[26] F. Ju, X. Wei, L. Huang, A. J. Jenkins, L. Xia, J. Zhang, J. Zhu, H. Yang, B. Shao, P. Dai, D. B. Williams-Young, A. Mayya, Z. Hooshmand, A. Efimovskaya, N. A. Baker, M. Troyer, and H. Liu, "Acceleration without disruption: DFT software as a service," Journal of Chemical Theory and Computation, 2024. [Online]. Available: https://doi.org/10.1021/acs.jctc.4c00940

[27] Y. Zhao and D. G. Truhlar, "The M06 suite of density functionals for main group thermochemistry, thermochemical kinetics, noncovalent interactions, excited states, and transition elements: two new functionals and systematic testing of four M06-class functionals and 12 other functionals," Theoretical Chemistry Accounts, vol. 120, no. 1, pp. 215–241, 2008. [Online]. Available: https://doi.org/10.1007/s00214-007-0310-x

[28] F. Weigend and R. Ahlrichs, "Balanced basis sets of split valence, triple zeta valence and quadruple zeta valence quality for H to Rn: Design and assessment of accuracy," Physical Chemistry Chemical Physics, vol. 7, no. 18, pp. 3297–3305, 2005. [Online]. Available: https://doi.org/10.1039/b508541a

[29] E. Qin, G. Jeong, W. Won, S.-C. Kao, H. Kwon, S. Srinivasan, D. Das, G. E. Moon, S. Rajamanickam, and T. Krishna, "Extending sparse tensor accelerators to support multiple compression formats," in 2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS). IEEE, 2021, pp. 1014–1024. [Online]. Available: https://doi.org/10.1109/IPDPS49936.2021.00110

[30] A. G. Artemov and E. H. Rubensson, "Sparse approximate matrix-matrix multiplication for density matrix purification with error control," Journal of Computational Physics, vol. 438, p. 110354, 2021. [Online]. Available: https://doi.org/10.1016/j.jcp.2021.110354

[31] J. Choquette, W. Gandhi, O. Giroux, N. Stam, and R. Krashinsky, "NVIDIA A100 tensor core GPU: Performance and innovation," IEEE Micro, vol. 41, no. 2, pp. 29–35, 2021. [Online]. Available: https://doi.org/10.1109/MM.2021.3061394

[32] G. M. J. Barca, C. Bertoni, L. Carrington, D. Datta, N. De Silva, J. E. Deustua, D. G. Fedorov, J. R. Gour, A. O. Gunina, E. Guidez et al., "Recent developments in the general atomic and molecular electronic structure system," The Journal of Chemical Physics, vol. 152, no. 15, p. 154102, 2020. [Online]. Available: https://doi.org/10.1063/5.0005188

[33] S. Markidis, S. W. Der Chien, E. Laure, I. B. Peng, and J. S. Vetter, "NVIDIA tensor core programmability, performance & precision," in 2018 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW). IEEE, 2018, pp. 522–531. [Online]. Available: https://doi.org/10.1109/IPDPSW.2018.00091

[34] A. Haidar, S. Tomov, J. Dongarra, and N. J. Higham, "Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers," Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis (SC '18), pp. 1–11, 2018. [Online]. Available: https://doi.org/10.1109/SC.2018.00050

[35] H. Anzt, T. Cojean, G. Flegar, F. Göbel, T. Grützmacher, P. Nayak, T. Ribizel, Y. M. Tsai, and E. S. Quintana-Ortí, "Ginkgo: A modern linear operator algebra framework for high performance computing," ACM Transactions on Mathematical Software, vol. 48, no. 1, pp. 1–33, 2022. [Online]. Available: https://doi.org/10.1145/3480935

[36] A. W. Götz, M. J. Williamson, D. Xu, D. Poole, S. Le Grand, and R. C. Walker, "Routine microsecond molecular dynamics simulations with AMBER on GPUs. 1. generalized Born," Journal of Chemical Theory and Computation, vol. 8, no. 5, pp. 1542–1555, 2012. [Online]. Available: https://doi.org/10.1021/ct200909j

[37] M. Steinberger, M. Kenzel, P. Boechat, B. Kerber, M. Dokter, and D. Schmalstieg, "Whippletree: task-based scheduling of dynamic workloads on the GPU," in ACM Transactions on Graphics (TOG), vol. 33, no. 6, 2014, pp. 1–11. [Online]. Available: https://doi.org/10.1145/2661229.2661250

[38] W. Kohn and L. J. Sham, "Self-consistent equations including exchange and correlation effects," Physical Review, vol. 140, no. 4A, pp. A1133–A1138, 1965. [Online]. Available: https://doi.org/10.1103/PhysRev.140.A1133

[39] V. I. Lebedev, "Quadratures on a sphere," USSR Computational Mathematics and Mathematical Physics, vol. 16, no. 2, pp. 10–24, 1976. [Online]. Available: https://doi.org/10.1016/0041-5553(76)90100-2

[40] I. S. Ufimtsev and T. J. Martinez, "Quantum chemistry on graphical processing units. 2. direct self-consistent-field implementation," Journal of Chemical Theory and Computation, vol. 5, no. 10, pp. 2619–2628, 2009. [Online]. Available: https://doi.org/10.1021/ct800526s

[41] K. Yasuda, "Accelerating density functional calculations with graphics processing unit," Journal of Chemical Theory and Computation, vol. 4, no. 8, pp. 1230–1236, 2008. [Online]. Available: https://doi.org/10.1021/ct8001046

[42] J. L. Gálvez Vallejo, G. M. J. Barca, and M. S. Gordon, "High-performance GPU-accelerated evaluation of electron repulsion integrals," Molecular Physics, 2022. [Online]. Available: https://doi.org/10.1080/00268976.2022.2112987

[43] M. Zhou and J. Wu, "A GPU implementation of classical density functional theory for rapid prediction of gas adsorption in nanoporous materials," The Journal of Chemical Physics, vol. 153, no. 7, 2020. [Online]. Available: https://doi.org/10.1063/5.0020797

Appendix A: Supplementary Technical Details

A. DFT Primer for the HPC Audience

This appendix provides a...

Grid-Point Morton Ordering: Grid points are reordered using a Z-order (Morton) space-filling curve to concentrate spatially nearby points into contiguous memory ranges. Each grid point with Cartesian coordinates (x, y, z) is first scaled to an integer lattice by multiplying by a resolution factor (128 in our implementation) and truncating to integer values. ...
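
The grid-point Morton ordering described in the excerpt above can be sketched in a few lines of Python. This is an illustrative reading of the truncated text, not the paper's implementation: the function names are hypothetical, the shift by the minimum coordinate (so that lattice indices are non-negative) is an assumption the excerpt does not confirm, and the 21-bit limit per axis is an arbitrary choice. Only the resolution factor of 128 comes from the source.

```python
RESOLUTION = 128  # resolution factor quoted in the excerpt


def spread_bits(v, bits=21):
    """Place bit b of v at position 3*b, leaving room for two other axes."""
    out = 0
    for b in range(bits):
        out |= ((v >> b) & 1) << (3 * b)
    return out


def morton_key(ix, iy, iz):
    """Interleave three non-negative lattice indices into one Z-order key."""
    return spread_bits(ix) | (spread_bits(iy) << 1) | (spread_bits(iz) << 2)


def morton_order(points, resolution=RESOLUTION):
    """Return point indices sorted along the Z-order curve.

    Assumption: coordinates are shifted by the per-axis minimum before
    scaling so that truncation yields non-negative integers.
    """
    mins = [min(p[d] for p in points) for d in range(3)]
    keyed = []
    for idx, p in enumerate(points):
        ix, iy, iz = (int((p[d] - mins[d]) * resolution) for d in range(3))
        keyed.append((morton_key(ix, iy, iz), idx))
    return [idx for _, idx in sorted(keyed)]


# Two nearby points and one distant point (coordinates invented for the demo).
points = [(1.0, 1.0, 1.0), (0.0, 0.0, 0.0), (0.0, 0.0, 0.01)]
order = morton_order(points)
```

Here the two points near the origin end up adjacent in `order`, which is the property that lets spatially nearby grid points share contiguous memory ranges.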
Overlap-Signature Construction and Basis-Function Clustering: Basis functions are reordered by clustering their overlap signatures, as outlined in Section III-B. The overlap matrix S (Eq. (8)) is already available from the Kohn–Sham setup, so no additional integral evaluation is required. For each basis function i, the overlap signature s_i = (S_i1, S_i2, ...

... to density-fitting approaches for Coulomb and exchange terms [42]. Parallel GPU implementations have also been demonstrated for classical density functional theory [43]. For EXC integration specifically, Williams-Young et al. demonstrated efficient GPU execution within the GauXC framework by grouping grid batches into dense sub-matrices and dispatchin...