pith. sign in

arxiv: 2606.21695 · v1 · pith:SEVXJGXJnew · submitted 2026-06-19 · 💻 cs.DC · cs.NA· math.NA

Efficient Domain Decomposition for the Helmholtz Equation on GPUs

Pith reviewed 2026-06-26 12:59 UTC · model grok-4.3

classification 💻 cs.DC cs.NAmath.NA
keywords domain decompositionHelmholtz equationGPU computingWaveHoltzMINRESiterative solversCUDAsingle precision
0
0 comments X

The pith

Domain decomposition assigns each Helmholtz subdomain to a GPU thread block solved by WaveHoltz for 2x-25x speedup over MINRES on A100.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a domain decomposition approach for the Helmholtz equation on GPUs by assigning each subdomain to a single thread block so that all subdomain solves run concurrently inside one kernel launch. WaveHoltz serves as the inner solver because its fixed-point iteration needs little memory and performs no reductions. This layout removes device-level synchronizations, replaces most global memory traffic with shared memory and registers, and keeps subdomain data in L1 and L2 cache. The resulting implementation delivers 2x-25x speedup over MINRES on an A100, with the gap widening for larger subdomains, and an extra 2x-10x when the subdomain solves use single precision.

Core claim

A block-level domain decomposition scheme that maps every subdomain to one GPU thread block and solves it with the WaveHoltz fixed-point iteration allows all subdomain solves to execute concurrently in a single kernel launch, achieving 2x-25x speedup over MINRES on A100 GPUs with the advantage increasing with subdomain size and an additional 2x-10x from single-precision arithmetic that MINRES cannot exploit because of lost Krylov-vector orthogonality.

What carries the argument

Block-level domain decomposition scheme that assigns each subdomain to one thread block and solves it with the WaveHoltz fixed-point iteration inside a single kernel launch.

If this is right

  • The speedup margin over MINRES widens as subdomain size increases.
  • Single-precision execution supplies additional acceleration unavailable to MINRES because of orthogonality loss under reduced precision.
  • Subdomain data remains largely resident in L1 and L2 cache through shared-memory and register-level operations.
  • Two threading strategies (one degree of freedom per thread or multiple) cover both small and large subdomains.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same block-level mapping and cache-resident design could be tested on other indefinite systems such as Maxwell or elasticity equations on GPUs.
  • Energy use per solve may drop because global memory traffic is replaced by on-chip operations.
  • The absence of reductions inside the kernel suggests the method could scale to multi-GPU nodes with fewer inter-device synchronizations.

Load-bearing premise

WaveHoltz converges in a small fixed number of iterations for the chosen subdomains while single-precision arithmetic still delivers sufficient accuracy for the target applications.

What would settle it

A measurement showing that the number of WaveHoltz iterations required grows steadily with subdomain size or that single-precision results deviate from double-precision reference solutions beyond acceptable application error thresholds.

Figures

Figures reproduced from arXiv: 2606.21695 by Amit Rotem.

Figure 1
Figure 1. Figure 1: A partition of a rectangular domain Ω ⊂ R2 into several subdomains. The elements are outlined in black, and the subdomains are outlined in purple. Two representative subdomains Ωℓ and Ωr and their common interface Γℓ,r are highlighted. The solution visualized is the wave produced by an off-center point source in a Y-shaped wave guide with a 3× contrast in wavespeed relative to the surrounding media. The so… view at source ↗
Figure 2
Figure 2. Figure 2: The T DOFs per Thread algorithm with six threads and four DOFs/thread. Arrows indicate memory movement. When T = 1 (Section VI-A), the memory traffic through global is bypassed, and only the shared memory buffer is used. Since the DDM is a preconditioner, we can gain significant speedups simply by evaluating the preconditioner in a lower precision, for example as considered in [35], [36], [37]. In our nume… view at source ↗
Figure 3
Figure 3. Figure 3: The time to evaluate the DDM cycle (DD time) relative to the time it takes to evaluate [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: The throughput (millions of DOFs per second) of the various kernels for fixed degree and increasing problem size. [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Share of total bytes transferred at each level of the memory hierarchy [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: The time to evaluate the DDM cycle (DD time) relative to the time it takes to evaluate [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: The throughput (millions of DOFs per second) of the various kernels for fixed degree and increasing problem size. The top panel is [PITH_FULL_IMAGE:figures/full_fig_p009_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Runtime comparison of MINRES and WaveHoltz subdomain solvers [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗
read the original abstract

The Helmholtz equation governs wave propagation in acoustics, electromagnetics, and seismology, but its indefinite nature makes it difficult to solve with iterative methods. Domain decomposition methods are a natural fit for massively parallel architectures, yet mapping efficient Helmholtz solvers onto modern GPUs remains a challenge. We address both with two key contributions: (1) a block-level domain decomposition scheme, in which each subdomain is assigned to a single thread block and all solves run concurrently in a single kernel launch, and (2) WaveHoltz as the subdomain solver. WaveHoltz is a fixed-point iteration that is uniquely well-suited to the GPU execution model due to its minimal memory footprint and no reduction operations. Together, these eliminate device-level synchronizations and replace global memory traffic with shared memory and register-level operations, keeping subdomain data largely resident in L1 and L2 cache. We explore two threading strategies: one degree of freedom per thread for small subdomains, and multiple degrees of freedom per thread for larger ones. Benchmarks of our CUDA based implementation on a NVIDIA A100 show that WaveHoltz achieves 2x-25x speedup over MINRES, with the advantage growing with subdomain size. Crucially, evaluating the subdomain solver in single rather than double precision yields an additional 2x-10x speedup--a benefit largely unattainable by MINRES due to loss of Krylov vector orthogonality under reduced precision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a block-level domain decomposition method for the Helmholtz equation on GPUs, assigning each subdomain to one thread block with all solves in a single kernel launch, and using WaveHoltz fixed-point iteration as the subdomain solver. It reports 2x-25x speedups over MINRES on A100 GPUs (advantage growing with subdomain size) plus 2x-10x from single precision, enabled by minimal memory footprint, no reductions, and avoidance of device synchronizations.

Significance. If the iteration counts remain small and bounded and single-precision accuracy holds, the work would provide a practical GPU-native solver for indefinite Helmholtz problems in acoustics and seismology by keeping data in cache and eliminating global synchronizations. The architecture-specific design choices are a strength.

major comments (2)
  1. [Abstract] Abstract: the headline speedups (2x-25x over MINRES, growing with subdomain size) rest on WaveHoltz requiring only a small fixed number of iterations. No data, plots, or tables reporting observed iteration counts versus subdomain size, wave number k, or mesh resolution are referenced, so it is impossible to confirm the contraction rate remains independent of these parameters as required for the claimed scaling.
  2. [Abstract] Abstract (single-precision paragraph): the additional 2x-10x speedup from single precision is asserted to be unattainable by MINRES due to orthogonality loss, yet no error analysis, residual norms, or accuracy verification against double-precision reference solutions is supplied. For an indefinite operator this is load-bearing for the single-precision claim.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief statement of the test problems (domain, boundary conditions, range of k) used to obtain the reported timings.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We address the two major comments point by point below. We agree that the manuscript would benefit from explicit supporting data on iteration counts and single-precision accuracy and will incorporate these elements in the revision.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline speedups (2x-25x over MINRES, growing with subdomain size) rest on WaveHoltz requiring only a small fixed number of iterations. No data, plots, or tables reporting observed iteration counts versus subdomain size, wave number k, or mesh resolution are referenced, so it is impossible to confirm the contraction rate remains independent of these parameters as required for the claimed scaling.

    Authors: We agree that the abstract and main text do not reference explicit iteration-count data versus subdomain size, k, or mesh resolution. Although the paper emphasizes the fixed-point character of WaveHoltz, a dedicated table or plot confirming bounded iterations is absent. In the revised manuscript we will add a new table (or figure) in Section 4 reporting observed WaveHoltz iteration counts across the tested range of subdomain sizes, wave numbers, and resolutions to substantiate that the count remains small and largely independent of these parameters. revision: yes

  2. Referee: [Abstract] Abstract (single-precision paragraph): the additional 2x-10x speedup from single precision is asserted to be unattainable by MINRES due to orthogonality loss, yet no error analysis, residual norms, or accuracy verification against double-precision reference solutions is supplied. For an indefinite operator this is load-bearing for the single-precision claim.

    Authors: The manuscript states the single-precision advantage but does not supply direct accuracy verification or residual comparisons. We will add to the revision (likely as a short subsection or appendix) a verification study that reports relative solution errors and residual norms for single-precision WaveHoltz against double-precision reference solutions, confirming that accuracy remains acceptable for the acoustics and seismology applications targeted. revision: yes

Circularity Check

0 steps flagged

No circularity; claims rest on empirical benchmarks with no self-referential derivation

full rationale

The manuscript describes a block-level domain decomposition scheme paired with the WaveHoltz fixed-point iteration as subdomain solver, implemented in CUDA for A100 GPUs. All reported results are wall-clock timings and speedups (2x-25x over MINRES, plus single-precision gains) obtained from direct execution on test problems. No equations, uniqueness theorems, or parameter-fitting steps are invoked whose outputs are then relabeled as predictions; the suitability arguments cite only the algorithmic properties of fixed-point iteration (low memory, no reductions) that are independent of the measured timings. No self-citation chain or ansatz smuggling appears in the provided text.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit parameters, axioms, or invented entities; full paper would be required to audit any hidden modeling choices such as subdomain overlap or iteration tolerances.

pith-pipeline@v0.9.1-grok · 5774 in / 1102 out tokens · 36564 ms · 2026-06-26T12:59:20.540138+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

37 extracted references · 14 canonical work pages

  1. [1]

    Why it is difficult to solve Helmholtz problems with classical iterative methods,

    O. G. Ernst and M. J. Gander, “Why it is difficult to solve Helmholtz problems with classical iterative methods,” eng, inNumerical Analysis of Multiscale Problems, ser. Lecture Notes in Computational Science and Engineering, Berlin, Heidelberg: Springer Berlin Heidelberg, 2011, pp. 325–363,ISBN: 3642220606

  2. [2]

    Advances in iterative methods and pre- conditioners for the Helmholtz equation,

    Y . A. Erlangga, “Advances in iterative methods and pre- conditioners for the Helmholtz equation,” eng,Archives of computational methods in engineering, vol. 15, no. 1, pp. 37–66, 2008,ISSN: 1134-3060

  3. [3]

    On controllability methods for the Helmholtz equation,

    M. Grote and J. Tang, “On controllability methods for the Helmholtz equation,”Journal of Computational and Applied Mathematics, vol. 358, pp. 306–326, 2019

  4. [4]

    Parallel controllability methods for the Helmholtz equation,

    M. J. Grote, F. Nataf, J. H. Tang, and P.-H. Tournier, “Parallel controllability methods for the Helmholtz equation,”Computer Methods in Applied Mechanics and Engineering, vol. 362, p. 112 846, 2020

  5. [5]

    Waveholtz: It- erative solution of the Helmholtz equation via the wave equation,

    D. Appel ¨o, F. Garcia, and O. Runborg, “Waveholtz: It- erative solution of the Helmholtz equation via the wave equation,” eng,SIAM journal on scientific computing, vol. 42, no. 4, A1950–A1983, 2020,ISSN: 1064-8275

  6. [6]

    Convergence of the semi-discrete waveholtz iteration,

    A. Rotem, O. Runborg, and D. Appel ¨o, “Convergence of the semi-discrete waveholtz iteration,”Journal of Computational Physics, p. 114 882, 2026,ISSN: 0021- 9991.DOI: https://doi.org/10.1016/j.jcp.2026.114882 [Online]. Available: https : / / www. sciencedirect . com / science/article/pii/S0021999126002329

  7. [7]

    An optimal O(N) Helmholtz solver for complex geometry using WaveHoltz and overset grids,

    D. Appel ¨o, J. W. Banks, W. D. Henshaw, and D. W. Schwendeman, “An optimal O(N) Helmholtz solver for complex geometry using WaveHoltz and overset grids,” ArXiv, vol. abs/2504.03074, 2025. [Online]. Available: https://api.semanticscholar.org/CorpusID:277596478

  8. [8]

    M ´ethodes de d ´ecomposition de demains pour les probl `ems de propagation d‘ondes en r ´egime harmonique,

    B. Depr ´es, “M ´ethodes de d ´ecomposition de demains pour les probl `ems de propagation d‘ondes en r ´egime harmonique,”PhD thesis, 1991

  9. [9]

    L-sweeps: A scalable, parallel precondi- tioner for the high-frequency helmholtz equation,

    M. Taus, L. Zepeda-N ´u˜nez, R. J. Hewett, and L. Demanet, “L-sweeps: A scalable, parallel precondi- tioner for the high-frequency helmholtz equation,”Jour- nal of Computational Physics, vol. 420, p. 109 706, 2020,ISSN: 0021-9991.DOI: https : / / doi . org / 10 . 1016 / j . jcp . 2020 . 109706 [Online]. Available: https : / / www . sciencedirect . com / s...

  10. [10]

    A rapidly converging domain decom- position method for the helmholtz equation,

    C. C. Stolk, “A rapidly converging domain decom- position method for the helmholtz equation,”Jour- nal of Computational Physics, vol. 241, pp. 240–252, 2013,ISSN: 0021-9991.DOI: https : / / doi . org / 10 . 1016 / j . jcp . 2013 . 01 . 039 [Online]. Available: https : / / www . sciencedirect . com / science / article / pii / S0021999113000922

  11. [11]

    A non-overlapping domain decomposition method with perfectly matched layer transmission conditions for the helmholtz equation,

    A. Royer, C. Geuzaine, E. B ´echet, and A. Modave, “A non-overlapping domain decomposition method with perfectly matched layer transmission conditions for the helmholtz equation,”Computer Methods in Ap- plied Mechanics and Engineering, vol. 395, p. 115 006, 2022,ISSN: 0045-7825.DOI: https : / / doi . org / 10 . 1016 / j . cma . 2022 . 115006 [Online]. Ava...

  12. [12]

    M. J. Gander and H. Zhang, “A class of iterative solvers for the helmholtz equation: Factorizations, sweeping preconditioners, source transfer, single layer poten- tials, polarized traces, and optimized schwarz methods,” SIAM Review, vol. 61, no. 1, pp. 3–76, 2019.DOI: 10. 1137 / 16M109781X eprint: https : / / doi . org / 10 . 1137 / 16M109781X. [Online]....

  13. [13]

    Absorbing boundary conditions and per- fectly matched layers in wave propagation problems,

    F. Nataf, “Absorbing boundary conditions and per- fectly matched layers in wave propagation problems,” inDirect and Inverse Problems in Wave Propagation and Applications, I. Graham, U. Langer, J. Melenk, and M. Sini, Eds. Berlin, Boston: De Gruyter, 2013, pp. 219–232,ISBN: 9783110282283.DOI: doi:10.1515/ 9783110282283 . 219 [Online]. Available: https : / ...

  14. [14]

    Nu- merical experiments on a domain decomposition algo- rithm for nonlinear elliptic boundary value problems,

    T. Hagstrom, R. Tewarson, and A. Jazcilevich, “Nu- merical experiments on a domain decomposition algo- rithm for nonlinear elliptic boundary value problems,” Applied Mathematics Letters, vol. 1, no. 3, pp. 299–302, 1988,ISSN: 0893-9659.DOI: https : / / doi . org / 10 . 1016 / 0893 - 9659(88 ) 90097 - 3 [Online]. Available: https : / / www. sciencedirect ....

  15. [15]

    Optimized schwarz methods without overlap for the helmholtz equation,

    M. J. Gander, F. Magoul `es, and F. Nataf, “Optimized schwarz methods without overlap for the helmholtz equation,”SIAM Journal on Scientific Computing, vol. 24, no. 1, pp. 38–60, 2002.DOI: 10 . 1137 / S1064827501387012 eprint: https://doi.org/10.1137/ S1064827501387012. [Online]. Available: https://doi. org/10.1137/S1064827501387012

  16. [16]

    An optimized order 2 (oo2) method for the helmholtz equation,

    P. Chevalier and F. Nataf, “An optimized order 2 (oo2) method for the helmholtz equation,”Comptes Rendus de l’Acad´emie des Sciences - Series I - Mathematics, vol. 326, no. 6, pp. 769–774, 1998,ISSN: 0764-4442. DOI: https://doi.org/10.1016/S0764-4442(98)80047-5 [Online]. Available: https : / / www. sciencedirect . com / science/article/pii/S0764444298800475

  17. [17]

    A quasi- optimal non-overlapping domain decomposition algo- rithm for the helmholtz equation,

    Y . Boubendir, X. Antoine, and C. Geuzaine, “A quasi- optimal non-overlapping domain decomposition algo- rithm for the helmholtz equation,”Journal of Computa- tional Physics, vol. 231, no. 2, pp. 262–280, 2012,ISSN: 0021-9991.DOI: https://doi.org/10.1016/j.jcp.2011.08. 007 [Online]. Available: https://www.sciencedirect.com/ science/article/pii/S0021999111004797

  18. [18]

    Gmres with deflated restarting,

    R. B. Morgan, “Gmres with deflated restarting,”SIAM Journal on Scientific Computing, vol. 24, no. 1, pp. 20– 37, 2002.DOI: 10.1137/S1064827599364659

  19. [19]

    Recycling krylov subspaces for se- quences of linear systems,

    M. L. Parks, E. de Sturler, G. Mackey, D. D. Johnson, and S. Maiti, “Recycling krylov subspaces for se- quences of linear systems,”SIAM Journal on Scientific Computing, vol. 28, no. 5, pp. 1651–1674, 2006.DOI: 10.1137/040607277

  20. [20]

    Block iterative methods and recycling for improved scalability of linear solvers,

    P. Jolivet and P.-H. Tournier, “Block iterative methods and recycling for improved scalability of linear solvers,” inProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC ’16, Salt Lake City, Utah: IEEE Press, 2016,ISBN: 9781467388153

  21. [21]

    A coarse space for heterogeneous helmholtz problems based on the dirichlet-to-neumann operator,

    L. Conen, V . Dolean, R. Krause, and F. Nataf, “A coarse space for heterogeneous helmholtz problems based on the dirichlet-to-neumann operator,”Journal of Computational and Applied Mathematics, vol. 271, pp. 83–99, 2014,ISSN: 0377-0427.DOI: https :/ / doi . org/10.1016/j.cam.2014.03.031 [Online]. Available: https : / / www. sciencedirect . com / science /...

  22. [22]

    Iterative solution applied to the helmholtz equation: Complex deflation on unstructured grids,

    R. Aubry, S. Dey, and R. L ¨ohner, “Iterative solution applied to the helmholtz equation: Complex deflation on unstructured grids,”Computer Methods in Applied Mechanics and Engineering, vol. 241-244, pp. 155– 171, 2012,ISSN: 0045-7825.DOI: https : / / doi . org / 10 . 1016 / j . cma . 2012 . 06 . 007 [Online]. Available: https : / / www. sciencedirect . c...

  23. [23]

    Restricted overlapping balancing domain decomposition methods and restricted coarse problems for the helmholtz problem,

    J.-H. Kimn and M. Sarkis, “Restricted overlapping balancing domain decomposition methods and restricted coarse problems for the helmholtz problem,”Com- puter Methods in Applied Mechanics and Engineering, vol. 196, no. 8, pp. 1507–1514, 2007, Domain Decom- position Methods: recent advances and new challenges in engineering,ISSN: 0045-7825.DOI: https : / / ...

  24. [24]

    Scalable domain decomposition preconditioners for heterogeneous elliptic problems,

    P. Jolivet, F. Hecht, F. Nataf, and C. Prud’homme, “Scalable domain decomposition preconditioners for heterogeneous elliptic problems,” inProceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC ’13, Denver, Colorado: Association for Computing Ma- chinery, 2013,ISBN: 9781450323789.DOI: 10 . 1145 ...

  25. [25]

    S. G. Johnson,Notes on perfectly matched layers (pmls), 2021. arXiv: 2108.05348[cs.CE]

  26. [26]

    A perfectly matched layer for the absorption of electromagnetic waves,

    J.-P. Berenger, “A perfectly matched layer for the absorption of electromagnetic waves,”Journal of Com- putational Physics, vol. 114, no. 2, pp. 185–200, 1994,ISSN: 0021-9991.DOI: https : / / doi . org / 10 . 1006 / jcph . 1994 . 1159 [Online]. Available: https : / / www . sciencedirect . com / science / article / pii / S0021999184711594

  27. [27]

    Chapter 2: Opti- mized schwarz methods,

    V . Dolean, P. Jolivet, and F. Nataf, “Chapter 2: Opti- mized schwarz methods,” inAn Introduction to Domain Decomposition Methods. 2015, pp. 35–75.DOI: 10 . 1137/1.9781611974065.ch2 eprint: https://epubs.siam. org/doi/pdf/10.1137/1.9781611974065.ch2. [Online]. Available: https://epubs.siam.org/doi/abs/10.1137/1. 9781611974065.ch2

  28. [28]

    D. A. Kopriva,Implementing Spectral Methods for Partial Differential Equations: Algorithms for Scien- tists and Engineers(Scientific Computation). Springer Netherlands, 2009

  29. [29]

    Matrix-free finite-element computations on graphics processors with adaptively refined unstruc- tured meshes,

    K. Ljungkvist, “Matrix-free finite-element computations on graphics processors with adaptively refined unstruc- tured meshes,” inProceedings of the 25th High Perfor- mance Computing Symposium, ser. HPC ’17, Virginia Beach, Virginia: Society for Computer Simulation In- ternational, 2017,ISBN: 9781510838222

  30. [30]

    Gpu algorithms for efficient exascale discretizations,

    A. Abdelfattah et al., “Gpu algorithms for efficient exascale discretizations,”Parallel Computing, vol. 108, p. 102 841, 2021,ISSN: 0167-8191.DOI: https://doi. org/10.1016/j.parco.2021.102841 [Online]. Available: https : / / www. sciencedirect . com / science / article / pii / S0167819121000879

  31. [31]

    High performance sparse mul- tifrontal solvers on modern gpus,

    P. Ghysels and R. Synk, “High performance sparse mul- tifrontal solvers on modern gpus,”Parallel Computing, vol. 110, p. 102 897, 2022,ISSN: 0167-8191.DOI: https: / / doi . org / 10 . 1016 / j . parco . 2022 . 102897 [Online]. Available: https : / / www . sciencedirect . com / science / article/pii/S0167819122000059

  32. [32]

    NVIDIA,NVIDIA cuDSS (preview): GPU-accelerated direct sparse solver, https://developer.nvidia.com/cudss, Accessed: 2026-03-27, 2025

  33. [33]

    6. krylov subspace methods, part i,

    Y . Saad, “6. krylov subspace methods, part i,” in Iterative Methods for Sparse Linear Systems. 2003, pp. 151–216.DOI: 10 . 1137 / 1 . 9780898718003 . ch6 eprint: https : / / epubs . siam . org / doi / pdf / 10 . 1137 / 1 . 9780898718003.ch6. [Online]. Available: https://epubs. siam.org/doi/abs/10.1137/1.9780898718003.ch6

  34. [34]

    Demystifying the nvidia ampere architec- ture through microbenchmarking and instruction-level analysis,

    H. Abdelkhalik, Y . Arafa, N. Santhi, and A.-H. A. Badawy, “Demystifying the nvidia ampere architec- ture through microbenchmarking and instruction-level analysis,”2022 IEEE High Performance Extreme Com- puting Conference (HPEC), pp. 1–8, 2022. [Online]. Available: https : // api . semanticscholar. org /CorpusID : 251765382

  35. [35]

    Low/adaptive precision computation in preconditioned iterative solvers for ill- conditioned problems,

    M. Kawai and K. Nakajima, “Low/adaptive precision computation in preconditioned iterative solvers for ill- conditioned problems,” inInternational Conference on High Performance Computing in Asia-Pacific Region, ser. HPCAsia ’22, Virtual Event, Japan: Association for Computing Machinery, 2022, pp. 30–40,ISBN: 9781450384988.DOI: 10 . 1145 / 3492805 . 349281...

  36. [36]

    Mixed precision block- jacobi preconditioner: Algorithms, performance evalua- tion and feature analysis,

    N. Tian, S. Huang, and X. Xu, “Mixed precision block- jacobi preconditioner: Algorithms, performance evalua- tion and feature analysis,”CCF Transactions on High Performance Computing, vol. 7, no. 2, pp. 114–128, 2025

  37. [37]

    Y . Guo, E. de Sturler, and T. Warburton,An adaptive mixed precision and dynamically scaled preconditioned conjugate gradient algorithm, 2025. arXiv: 2505.04155 [math.NA]. [Online]. Available: https://arxiv.org/abs/ 2505.04155