pith. sign in

arxiv: 2606.09058 · v1 · pith:HH2543GSnew · submitted 2026-06-08 · ⚛️ physics.comp-ph

GPU Acceleration of Collinear and Noncollinear DFT Using a Numerical Atomic Orbital-Based DFT Code

Pith reviewed 2026-06-27 14:31 UTC · model grok-4.3

classification ⚛️ physics.comp-ph
keywords GPU accelerationdensity functional theorynumerical atomic orbitalsOpenMXcollinear DFTnoncollinear DFTcuBLASperformance benchmarks
0
0 comments X

The pith

GPU acceleration in OpenMX yields 2.02x and 2.60x speedups for collinear and noncollinear DFT on two nodes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper implements GPU acceleration for collinear and noncollinear density functional theory calculations in the numerical atomic orbital code OpenMX by offloading matrix multiplications, eigenvalue solves, and selected auxiliary steps to cuBLAS, cuSOLVER, and OpenACC. Benchmarks on the Pegasus supercomputer compare these GPU runs against CPU-only runs under identical settings for systems of several hundred atoms. The reported results show concrete speedups of 2.02 times for a 512-atom collinear case and 2.60 times for a 384-atom noncollinear case when using two nodes with one GPU each versus two nodes with 96 CPU cores. This matters to a sympathetic reader because it demonstrates that an existing NAO-based DFT implementation can obtain practical gains from GPUs for both standard and spin-noncollinear calculations without altering the underlying physics or basis set. The work therefore supplies evidence that GPU offloading can be added to such codes in a way that delivers measurable wall-time reductions for typical large-system runs.

Core claim

We implement GPU acceleration of collinear and noncollinear density functional theory (DFT) calculations in the numerical atomic orbitals (NAOs) code OpenMX by offloading matrix multiplications and eigenvalue solves (plus selected auxiliary steps) to cuBLAS/cuSOLVER and OpenACC. Benchmarks on the Pegasus supercomputer (per node: a 48-core Intel Xeon Platinum 8468 CPU and one NVIDIA H100 GPU) compare GPU-accelerated and CPU-only runs under identical settings. For a 512-atom collinear case on two nodes (two GPUs total), the GPU-accelerated calculation achieves a 2.02 times speedup over a CPU-only run on two nodes (96 CPU cores total); for a 384-atom noncollinear case on two nodes (two GPUs tot

What carries the argument

Offloading of matrix multiplications and eigenvalue solves to cuBLAS/cuSOLVER together with OpenACC directives in the OpenMX NAO-DFT implementation.

If this is right

  • GPU acceleration becomes available inside an existing NAO-based DFT code for both collinear and noncollinear spin treatments.
  • A single GPU per node can outperform a full node of 48 CPU cores by a factor of roughly two for the tested system sizes.
  • The same offloading strategy works for noncollinear calculations that are typically more expensive than collinear ones.
  • No change to the numerical atomic orbital basis or to the DFT Hamiltonian construction is required to obtain the measured gains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same offloading pattern could be ported to other NAO codes that already rely on dense linear algebra for the same steps.
  • Speedup ratios may change if the number of GPUs per node increases or if the CPU-only baseline uses fewer cores per node.
  • The approach could make routine noncollinear DFT studies of larger magnetic or spin-orbit systems more feasible on GPU-equipped clusters.
  • Direct comparison of the same input files on a wider range of GPU generations would test how portable the observed gains remain.

Load-bearing premise

The reported speedups assume that the chosen benchmark systems, node configurations, and identical settings on the Pegasus supercomputer produce representative performance numbers for typical user workloads.

What would settle it

Running the same OpenMX GPU-accelerated and CPU-only calculations on a different hardware platform with different CPU-GPU balance or on systems substantially larger or smaller than 384-512 atoms and measuring whether the 2.0-2.6x speedups persist.

Figures

Figures reproduced from arXiv: 2606.09058 by (2) Nuclear Science, (3) Institute for Solid State Physics, Engineering Center, Hiroyuki Kawai (1), Japan Atomic Energy Agency, Niigata University, Taisuke Ozaki (3), Takuya Sekikawa (2), The University of Tokyo), Yoshiaki \=Ono (1) ((1) Department of Physics.

Figure 1
Figure 1. Figure 1: (Color online) Computational flow of the Band_DFT_Col routine. [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: (Color online) Detailed flow of Part 2 in the Band_DFT_Col routine. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: (Color online) Computational flow of the Band_DFT_NonCol routine. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: (Color online) Detailed flow of Part 3 in the Band_DFT_NonCol routine. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: (Color online) Relationship between number of atoms and total DFT wall time for collinear DFT (Four [PITH_FULL_IMAGE:figures/full_fig_p010_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: (Color online) Relationship between number of atoms and diagonalization time for collinear DFT (Four [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: (Color online) Speed comparison between CPU-only and GPU-accelerated collinear DFT as the number of [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: (Color online) Speed comparison for diagonalization between CPU-only and GPU-accelerated collinear [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: (Color online) Fraction of host-device data transfer time in collinear DFT. (a) Variation with number of [PITH_FULL_IMAGE:figures/full_fig_p014_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: (Color online) Relationship between number of atoms and total noncollinear DFT wall time (Eight nodes). [PITH_FULL_IMAGE:figures/full_fig_p016_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: (Color online) Relationship between number of atoms and diagonalization time in noncollinear DFT (Eight [PITH_FULL_IMAGE:figures/full_fig_p017_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: (Color online) Comparison of performance for noncollinear DFT as the node count increases. (a) Com [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: (Color online) Comparison of diagonalization performance for noncollinear DFT as the node count in [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: (Color online) Fraction of host-device data transfer time in noncollinear DFT. (a) Variation with number of [PITH_FULL_IMAGE:figures/full_fig_p020_14.png] view at source ↗
read the original abstract

We implement GPU acceleration of collinear and noncollinear density functional theory (DFT) calculations in the numerical atomic orbitals (NAOs) code OpenMX by offloading matrix multiplications and eigenvalue solves (plus selected auxiliary steps) to cuBLAS/cuSOLVER and OpenACC. Benchmarks on the Pegasus supercomputer (per node: a 48-core Intel Xeon Platinum 8468 CPU and one NVIDIA H100 GPU) compare GPU-accelerated and CPU-only runs under identical settings. For a 512-atom collinear case on two nodes (two GPUs total), the GPU-accelerated calculation achieves a 2.02 times speedup over a CPU-only run on two nodes (96 CPU cores total); for a 384-atom noncollinear case on two nodes (two GPUs total), the speedup is 2.60 times over two CPU-only nodes (96 cores). These results demonstrate practical GPU-accelerated DFT in an NAO-based code for both collinear and noncollinear calculations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript describes an implementation of GPU acceleration for collinear and noncollinear DFT in the numerical atomic orbital code OpenMX. Selected linear-algebra kernels (matrix multiplications and eigenvalue solves) plus auxiliary steps are offloaded to cuBLAS/cuSOLVER with OpenACC directives. Direct wall-time benchmarks on the Pegasus supercomputer (48-core Xeon Platinum 8468 + one H100 GPU per node) report a 2.02× speedup for a 512-atom collinear system on two nodes (two GPUs vs. 96 CPU cores) and a 2.60× speedup for a 384-atom noncollinear system under identical settings.

Significance. If the reported timings are reproducible, the work supplies concrete evidence that standard GPU libraries can deliver practical acceleration in an established NAO-based DFT package for both collinear and noncollinear cases. The direct, controlled comparison on identical node configurations is a strength for an engineering implementation paper and could inform users running magnetism or spin-orbit calculations.

major comments (2)
  1. [Implementation and Results sections] The manuscript states concrete benchmark numbers but provides insufficient implementation detail on which exact routines were offloaded, how data movement between host and device was managed, and what fraction of runtime remains on the CPU. This information is load-bearing for the central performance claim because the reported 2.02× and 2.60× speedups cannot be assessed for generality or communication overhead without it.
  2. [Benchmark subsection] No validation tests (e.g., total-energy or force comparisons between GPU and CPU runs on the same systems) or overhead breakdown (cuBLAS time vs. data-transfer time) are presented. Without these, it is impossible to confirm that the acceleration preserves numerical accuracy or that the speedups are not dominated by unoffloaded portions of the code.
minor comments (1)
  1. [Abstract and Results] The abstract and text refer to “identical settings” but do not explicitly list the OpenMX input parameters (basis set, cutoff, k-points, SCF convergence) used for the 512-atom and 384-atom benchmarks; adding a short table would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and will revise the manuscript to incorporate the requested details and tests.

read point-by-point responses
  1. Referee: [Implementation and Results sections] The manuscript states concrete benchmark numbers but provides insufficient implementation detail on which exact routines were offloaded, how data movement between host and device was managed, and what fraction of runtime remains on the CPU. This information is load-bearing for the central performance claim because the reported 2.02× and 2.60× speedups cannot be assessed for generality or communication overhead without it.

    Authors: We agree that additional implementation specifics are needed for a complete assessment. In the revised manuscript we will expand the Implementation section to list the exact offloaded routines (e.g., cublasDgemm, cublasZgemm, and the specific cuSOLVER eigensolvers), describe the OpenACC data clauses and host-device transfer strategy used to minimize communication, and report profiling-derived fractions of runtime spent on the CPU versus the GPU for the benchmark systems. revision: yes

  2. Referee: [Benchmark subsection] No validation tests (e.g., total-energy or force comparisons between GPU and CPU runs on the same systems) or overhead breakdown (cuBLAS time vs. data-transfer time) are presented. Without these, it is impossible to confirm that the acceleration preserves numerical accuracy or that the speedups are not dominated by unoffloaded portions of the code.

    Authors: We acknowledge the absence of these elements in the submitted version. The revised manuscript will include a dedicated validation subsection reporting total-energy and force differences between CPU-only and GPU runs on the same systems (confirming agreement to machine precision) together with a timing breakdown separating cuBLAS/cuSOLVER kernel execution from data-transfer overhead. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper is an engineering report on GPU offloading of matrix operations and eigensolves in OpenMX, with claims consisting solely of measured wall-time ratios (2.02x collinear, 2.60x noncollinear) on Pegasus nodes. No derivation chain, equations, fitted parameters, or self-citation load-bearing steps exist; the results are direct empirical timings under identical settings and reduce to the benchmark data themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a software-implementation and benchmarking study; the central claim does not depend on scientific axioms, free parameters, or invented entities.

pith-pipeline@v0.9.1-grok · 5767 in / 991 out tokens · 23344 ms · 2026-06-27T14:31:12.394072+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

78 extracted references · 1 linked inside Pith

  1. [1]

    J. D. Owens, M. Houston, D. Luebke, S. Green, J. E. Stone, and J. C. Phillips, Proc. IEEE96, 879 (2008)

  2. [2]

    S. W. Keckler, inProceedings of the 2011 IEEE International Symposium on Workload Characterization (IISWC’11)(IEEE, Austin, TX, 2011), p. 1

  3. [3]

    Peši ´c, and R

    J. Peši ´c, and R. Gaji´c, Phys. Scr.T162014027 (2014)

  4. [4]

    Carabaño, F

    J. Carabaño, F. Dios, M. Daneshtalab, and M. Ebrahimi, inProceedings of the 2013 8th International Workshop on Reconfigurable and Communication-Centric Systems-on-Chip (ReCoSoC’13)(IEEE, Darmstadt, Germany, 2013), pp. 1–7. 22 GPU Acceleration of Collinear and Noncollinear DFTA Preprint

  5. [5]

    Mittal, and J

    S. Mittal, and J. S. Vetter, ACM Comput. Surv.47, 4, Article 69 (2015)

  6. [6]

    D. C. Jespersen, Scientific Programming,18, 193 (2010)

  7. [7]

    Ginjupalli, and G

    R. Ginjupalli, and G. Khanna, arXiv:1006.0663 [physics.comp-ph] (2010)

  8. [8]

    W. P. Huhn, B. Lange, V . Wen-zhe Yu, M. Yoon, and V . Blum, Comp. Phys. Comm.254, 107314 (2020)

  9. [9]

    Yasuda, J

    K. Yasuda, J. Comput. Chem.29, 334 (2007)

  10. [10]

    Yasuda, J

    K. Yasuda, J. Chem. Theory Comput.4, 1230 (2008)

  11. [11]

    K. A. Wilkinson, P. Sherwood, M. F. Guest, and K. J. Naidoo, J. Comput. Chem.32, 2313 (2011)

  12. [12]

    D. B. Williams-Young, W. A. de Jong, H. J. J. van Dam, and C. Yang, Front. Chem.10, 8 (2020)

  13. [13]

    Genovese, M

    L. Genovese, M. Ospici, T. Deutsch, J.-F. Méhaut, A. Neelov, and S. Goedecker, J. Chem. Phys.131, 034103 (2009)

  14. [14]

    Genovese, B

    L. Genovese, B. Videau, D. Caliste, J.-F. Méhaut, S. Goedecker, and T. Deutsch, inElectronic Structure Calcu- lations on Graphics Processing Units: From Quantum Chemistry to Condensed Matter Physics, edited by R. C. Walker and A. W. Götz (John Wiley & Sons, Ltd., West Sussex, 2016), Chap. 6, pp. 115-134

  15. [15]

    Wortmann, S

    D. Wortmann, S. Baroni, A. Degomme, P. Delugas, S. de Gironcoli, A. Ferretti, A. García, L. Genovese, P. Giannozzi, A. Kozhevnikov and N. Spallanzani,Second Release ofMaXSoftware: Report on Performance Achieved, Deliverable D2.2 (H2020 CoE MaX, Jülich, 2020)

  16. [16]

    Zhang, Z

    H. Zhang, Z. Deng, Y . Liu, T. Liu, M. Chen, S. Yin, and L. He, arXiv:2409.09399 [cond-mat.mtrl-sci] (2024)

  17. [17]

    Hohenberg, and W

    P. Hohenberg, and W. Kohn, Phys. Rev.136, 864 (1964)

  18. [18]

    Kohn, and L

    W. Kohn, and L. J. Sham, Phys. Rev.140, 1133 (1965)

  19. [19]

    Burke, J

    K. Burke, J. Chem. Phys.136, 150901 (2012)

  20. [20]

    R. O. Jones, Newsletter124, 1 (2014)

  21. [21]

    A. D. Becke, J. Chem. Phys.140, 18A301 (2014)

  22. [22]

    A. Jain, Y . Shin, and K.A. Persson, Nature Rev. Mater.1, 1 (2016)

  23. [23]

    M. C. Payne, M. P. Teter, D. C. Allan, T. A. Arias, and J. D. Joannopoulos, Rev. Modern Phys.64, 1046 (1992)

  24. [24]

    Kresse, and J

    G. Kresse, and J. Furthmüller, Phys. Rev. B54, 11169 (1996)

  25. [25]

    Giannozzi, S

    P. Giannozzi, S. Baroni, N. Bonini, M. Calandra, R. Car, C. Cavazzoni, D. Ceresoli, G. L. Chiarotti, M. Cococ- cioni, I. Dabo, A. Dal Corso, S. de Gironcoli, S. Fabris, G. Fratesi, R. Gebauer, U. Gerstmann, C. Gougoussis, A. Kokalj, M. Lazzeri, L. Martin-Samos, N. Marzari, F. Mauri, R. Mazzarello, S. Paolini, A. Pasquarello, L. Paulatto, C. Sbraccia, S. S...

  26. [26]

    Gonze, J.-M

    X. Gonze, J.-M. Beuken, R. Caracas, F. Detraux, M. Fuchs, G.-M. Rignanese, L. Sindic, M. Verstraete, G. Zerah, F. Jollet, M. Torrent, A. Roy, M. Mikami, Ph. Ghosez, J.-Y . Raty, and D. C. Allan, Comput. Mater. Sci.25, 478 (2002)

  27. [27]

    S. J. Clark, M. D. Segall, C. J. Pickard, P. J. Hasnip, M. I. J. Probert, K. Refson, M. C. Payne, and Z. Kristallogr. 220, 567 (2005)

  28. [28]

    M. J. Frisch, J. A. Pople, and J. S. Binkley, J. Chem. Phys.80, 3265 (1984)

  29. [29]

    T. H. Dunning, J. Chem. Phys.90, 1007 (1989)

  30. [30]

    Szabo and N

    A. Szabo and N. S. Ostlund,Modern Quantum Chemistry: Introduction to Advanced Electronic Structure Theory, 1st ed. (Dover, Mineola, NY , 1996)

  31. [31]

    A. K. Wilson, T. van Mourik, and T. H. Dunning, J. Mol. Struct.388, 339 (1996)

  32. [32]

    Weigend, and R

    F. Weigend, and R. Ahlrichs, Phys. Chem. Chem. Phys.7, 3297 (2005)

  33. [33]

    Valiev, E

    M. Valiev, E. J. Bylaska, N. Govind, K. Kowalski, T. P. Straatsma, H. J. J. van Dam, D. Wang, J. Nieplocha, E. Apra, T. L. Windus, and W. A. de Jong, Comput. Phys. Commun.181, 1477 (2010)

  34. [34]

    Hutter, M

    J. Hutter, M. Iannuzzi, F. Schiffmann, and J. VandeV ondele, WIREs Comput. Mol. Sci.4, 15 (2014)

  35. [35]

    te Velde, F

    G. te Velde, F. M. Bickelhaupt, E. J. Baerends, C. F. Guerra, S. J. A. van Gisbergen, J. G. Snijders, and T. Ziegler, J. Comput. Chem.22, 931 (2001)

  36. [36]

    van Lenthe, and E

    E. van Lenthe, and E. J. Baerends, J. Comput. Chem.24, 1142 (2003)

  37. [37]

    F. W. Averill, and D. E. Ellis, J. Chem. Phys59, 6412 (1973). 23 GPU Acceleration of Collinear and Noncollinear DFTA Preprint

  38. [38]

    Zunger, and A

    A. Zunger, and A. J. Freeman, Phys. Rev. B15, 4716 (1977)

  39. [39]

    Delley, and D

    B. Delley, and D. E. Ellis, J. Chem. Phys.76, 1949 (1982)

  40. [40]

    O. F. Sankey, and D. J. Niklewski, Phys. Rev. B40, 3979 (1989)

  41. [41]

    Delley, J

    B. Delley, J. Chem. Phys.92, 508 (1990)

  42. [42]

    A. P. Horsfield, Phys. Rev. B56, 6594 (1997)

  43. [43]

    Koepernik, and H

    K. Koepernik, and H. Eschrig, Phys. Rev. B59, 1743 (1999)

  44. [44]

    J. M. Soler, E. Artacho, J. D. Gale, A. García, J. Junquera, P. Ordejón, and D. Sánchez-Portal, J. Phys. Condens. Matter14, 2745 (2002)

  45. [45]

    Ozaki, Phys

    T. Ozaki, Phys. Rev. B.67, 155108 (2003)

  46. [46]

    Ozaki and H

    T. Ozaki and H. Kino, Phys. Rev. B69, 195113 (2004)

  47. [47]

    Ozaki and H

    T. Ozaki and H. Kino, Phys. Rev. B72, 045121 (2005)

  48. [48]

    V . Blum, R. Gehrke, F. Hanke, P. Havu, V . Havu, X. Ren, K. Reuter, and M. Scheffler, Comp. Phys. Comm.180, 2175 (2009)

  49. [49]

    I. Y . Zhang, X. Ren, P. Rinke, V . Blum, and M. Scheffler, New. J. Phys.15, 123033 (2013)

  50. [50]

    Sánchez-Portal, P

    D. Sánchez-Portal, P. Ordejón, E. Artacho, and J. M. Soler, Int. J. Quantum Chem.65, 453 (1997)

  51. [51]

    Motamarri and V

    P. Motamarri and V . Gavini, Phys. Rev. B90, 115127 (2014)

  52. [52]

    J.-H. Parq, E. Sevre, and S.-M. Lee, Int. J. of Comp. Appl.98, 20 (2014)

  53. [53]

    von Barth, and L

    U. von Barth, and L. Hedin, J. Phys. C: Solid State Phys.5, 1629 (1972)

  54. [54]

    Kübler, K.-H

    J. Kübler, K.-H. Höck, J. Sticht, and A. R. Williams, J. Phys. F: Met. Phys.18, 469 (1988)

  55. [55]

    Sticht, K.-H

    J. Sticht, K.-H. Höck, and J. Kübler, J. Phys.: Condens. Matter1, 8155 (1989)

  56. [56]

    T. Oda, A. Pasquarello, and R. Car, Phys. Rev. Lett.80, 3622 (1998)

  57. [57]

    L. S. Blackford, J. Choi, A. Cleary, E. D’Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, and R. C. Whaley,ScaLAPACK Users’ Guide(SIAM, Philadelphia, PA, 1997)

  58. [58]

    https://developer.nvidia.com/cublas/ (accessed 12 May 2025)

  59. [59]

    Auckenthaler, V

    T. Auckenthaler, V . Blum, H.-J. Bungartz, T. Huckle, R. Johanni, L. Krämer, B. Lang, H. Lederer, and P. R. Willems, Parallel comput.37, 783 (2011)

  60. [60]

    Marek, V

    A. Marek, V . Blum, R. Johanni, V . Havu, B. Lang, T. Auckenthaler, A. Heinecke, H.-J. Bungartz, and H. Lederer, J. Phys.-Condens. Matter26, 213201 (2014)

  61. [61]

    https://developer.nvidia.com/cusolver/ (accessed 12 May 2025)

  62. [62]

    B. S. Atdaev, V . F. Grin, E. A. Salkov, and V . G. Chalaya, Inorg. Mater.,23, 1835 (1987)

  63. [63]

    https://www.openmx-square.org/viewer/ (accessed 12 May 2025)

  64. [64]

    J. P. Perdew, K. Burke, and M. Ernzerhof, Phys. Rev. Lett.77, 3865 (1996)

  65. [65]

    H. Wang, Z. Duan, Z. Zhao, S. Wu, S. Zheng, Q. Li, X. Jiang, and S. Zhang, Proc. ACM SIGPLAN Symp. Principles Pract. Parallel Program.PPoPP ’25, 469 (2025)

  66. [66]

    NVIDIA Corporation,cuSOLVER Library API Reference Guide, Release 12.9 (2025)

  67. [67]

    Yamazaki, T

    I. Yamazaki, T. Dong, R. Solcà, S. Tomov, J. Dongarra, and T. Schulthess, Concurrency Computat. Pract. Exper. 25, e3152 (2013)

  68. [68]

    Haidar, P

    A. Haidar, P. Luszczek, and J. Dongarra, Proc. GPU Technol. Conf. 2012, 1 (2012)

  69. [69]

    Y . M. Tsai, P. Luszczek, and J. Dongarra, ICL-UTK Tech. Rep.1499(2021)

  70. [70]

    Tisseur and J

    F. Tisseur and J. Dongarra, LAPACK Working Note132(UT-CS-98-382) (1998)

  71. [71]

    Gates, K

    M. Gates, K. Akbudak, M. Al Farhan, A. Charara, J. Kurzak, D. Sukkari, A. YarKhan, and J. Dongarra, ICL Technical ReportICL-UT-19-07(2019)

  72. [72]

    Pichon, A

    G. Pichon, A. Haidar, M. Faverge, and J. Kurzak, inProceedings of the 2015 IEEE International Parallel and Distributed Processing Symposium (IPDPS’15)(IEEE, Hyderabad, India, 2015), pp. 1–8. 24 GPU Acceleration of Collinear and Noncollinear DFTA Preprint

  73. [73]

    Huang, H

    Z. Huang, H. Liu, and J. Liu, J. Eng. 2019,893(2019)

  74. [74]

    Haidar, A

    A. Haidar, A. Abdelfattah, M. Zounon, S. Tomov, and J. Dongarra, IEEE Trans. Parallel Distrib. Syst.29, 973 (2018)

  75. [75]

    Goto and R

    K. Goto and R. A. van de Geijn, ACM Trans. Math. Softw.34, 12 (2008)

  76. [76]

    https://docs.nvidia.com/cuda/cublasmp/index.html (accessed 12 May 2025)

  77. [77]

    P. Kus, A. Marek, and H. Lederer, inNumerical Mathematics and Advanced Applications ENUMATH 2017, edited by F. A. Radu, K. Kumar, I. Berre, J. M. Nordbotten, and I. S. Pop (Springer, Cham, 2019), pp. 123-131

  78. [78]

    V . Yu, J. Moussa, P. Kus, A. Marek, P. Messmer, M. Yoon, H. Lederer, and V . Blum, Comput. Phys. Commun. 262, 107808 (2021). 25