pith. machine review for the scientific record.

arxiv: 2603.09038 · v2 · submitted 2026-03-10 · 💻 cs.DC · cs.MS · cs.PF

Recognition: no theorem link

Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:17 UTC · model grok-4.3

classification 💻 cs.DC · cs.MS · cs.PF
keywords finite element methods · tensor cores · GPU acceleration · high-order discretization · exascale computing · kernel fusion · energy efficiency · MFEM library

The pith

FP64 tensor cores on NVIDIA GPUs speed up high-order finite element simulations by up to 2× and improve energy efficiency by up to 83%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that FP64 tensor cores, when combined with kernel fusion, can be programmed directly to handle the dominant operations in high-order finite element codes such as MFEM. This produces measurable speedups and energy savings on current NVIDIA architectures without changing the underlying mathematical formulation. A sympathetic reader would care because finite element simulations at practical resolutions already consume large fractions of supercomputer time and power; any reliable reduction in both directly expands the size and speed of problems that can be tackled.

Core claim

By directly programming the FP64 tensor cores and fusing kernels in MFEM, the authors obtain up to 2× faster execution and up to 83% better energy efficiency for the core finite-element operators on Grace Hopper GH200 and Grace Blackwell GB200 systems. The same kernels exhibit near-perfect weak scaling and 90% strong-scaling efficiency when run across nearly 10,000 GPUs on the Alps machine, and the resulting improvements are already used in production codes, including the 2025 Gordon Bell Prize tsunami-forecasting application.
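The scaling figures quoted above follow the conventional definitions; as a reminder of what a "90% strong-scaling efficiency" number means arithmetically, here is a minimal sketch with made-up timings (the function names and all numbers are illustrative, not from the paper):

```python
# Illustrative sketch, not the paper's code: how strong- and weak-scaling
# efficiencies are conventionally computed. All timings below are made up.

def strong_scaling_efficiency(t_base, n_base, t, n):
    """Measured speedup divided by ideal speedup, at fixed total problem size."""
    speedup = t_base / t
    ideal = n / n_base
    return speedup / ideal

def weak_scaling_efficiency(t_base, t):
    """With work per GPU fixed, the ideal runtime is constant."""
    return t_base / t

# Hypothetical runs: 100 s on 1,000 GPUs vs. 11.1 s on 10,000 GPUs (same problem).
eff_strong = strong_scaling_efficiency(100.0, 1_000, 11.1, 10_000)
print(f"strong-scaling efficiency: {eff_strong:.0%}")   # 90%

# Hypothetical weak-scaling pair: 100 s at small scale, 101 s at 10x scale.
print(f"weak-scaling efficiency: {weak_scaling_efficiency(100.0, 101.0):.0%}")   # 99%
```

The useful property of these ratios is that 1.0 means perfect linear scaling, so "near-perfect weak scaling" translates directly to a weak-scaling efficiency close to 100%.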

What carries the argument

FP64 tensor cores used inside fused high-order finite-element kernels that replace conventional double-precision matrix operations while preserving the required arithmetic.
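To make this concrete, here is a minimal NumPy stand-in for the arithmetic such fused kernels perform. The shapes and names are assumptions for illustration (this is not MFEM's API): after the operator decomposition sketched in Figure 1, the per-element work reduces to batched small dense contractions of the form B^T D B u, which is exactly the GEMM-shaped FP64 arithmetic that tensor cores accelerate.

```python
import numpy as np

# Illustrative stand-in, not the paper's implementation. Shapes are hypothetical:
rng = np.random.default_rng(0)
n_elem, n_dofs, n_quad = 1024, 64, 125   # e.g. p=3 hexes ((p+1)^3 dofs), 5^3 quad pts

B = rng.standard_normal((n_quad, n_dofs))    # basis evaluation at quadrature points
D = rng.standard_normal((n_elem, n_quad))    # pointwise physics data per element
u_e = rng.standard_normal((n_elem, n_dofs))  # element-local (E-vector) input

# Batched apply: y_e = B^T diag(D_e) B u_e for every element at once.
qvals = np.einsum("qd,ed->eq", B, u_e)       # interpolate dofs to quadrature points
y_e = np.einsum("qd,eq->ed", B, D * qvals)   # weight by D, test against the basis

assert y_e.shape == (n_elem, n_dofs)
```

Each element's contraction is a small dense matrix product, so the batch maps naturally onto tensor-core fragments; the fusion claim is that interpolation, pointwise weighting, and the transpose application stay in one kernel instead of round-tripping through global memory.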

If this is right

  • High-order finite-element codes can execute larger problems in the same wall-clock time on existing GPU clusters.
  • Energy cost per simulation decreases, allowing more runs within fixed power budgets.
  • Real-time or ensemble forecasting applications that rely on MFEM become feasible at higher resolutions.
  • The same tensor-core approach can be applied to other matrix-heavy kernels inside MFEM without altering the user interface.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique may transfer to other libraries that already use high-order discretizations on GPUs, such as those for computational fluid dynamics or electromagnetics.
  • Future hardware generations with wider or faster tensor units could amplify the observed speedups if the same fusion strategy is retained.
  • Accuracy verification workflows that compare tensor-core results against reference implementations will become routine parts of performance tuning for scientific codes.

Load-bearing premise

The FP64 tensor-core arithmetic must deliver results accurate enough for the high-order finite-element discretizations, and the kernel fusion must not add hidden overheads that erase the reported gains on real workloads.

What would settle it

Run a production-scale MFEM simulation with the tensor-core kernels and compare the final solution fields against an identical run performed with standard double-precision arithmetic; any statistically significant discrepancy in the solution or loss of convergence order would falsify the claim.
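A minimal sketch of such a check, using synthetic stand-in data (none of these arrays or tolerances come from the paper): compare the two solution fields in a relative norm, and verify that the observed convergence order under mesh refinement is preserved.

```python
import numpy as np

# Hypothetical verification sketch. Thresholds and data are illustrative.

def relative_error(u_test, u_ref):
    """Relative l2 discrepancy between a test field and a reference FP64 field."""
    return np.linalg.norm(u_test - u_ref) / np.linalg.norm(u_ref)

def observed_order(err_coarse, err_fine, refinement=2.0):
    """Estimated convergence order from errors on meshes of size h and h/refinement."""
    return np.log(err_coarse / err_fine) / np.log(refinement)

# Synthetic stand-in: a 'tensor-core' field perturbed at the 1e-14 level.
u_ref = np.sin(np.linspace(0.0, np.pi, 1000))
u_tc = u_ref + 1e-14 * np.random.default_rng(1).standard_normal(u_ref.size)

assert relative_error(u_tc, u_ref) < 1e-12   # well inside typical solver tolerances

# Discretization errors of 1e-3 at h and 6.25e-5 at h/2 imply fourth order.
print(f"observed order: {observed_order(1e-3, 6.25e-5):.1f}")   # 4.0
```

A drop in the observed order, or a relative discrepancy above the solver tolerance, is precisely the kind of statistically significant deviation that would falsify the accuracy premise.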

Figures

Figures reproduced from arXiv: 2603.09038 by Ian Karlin, Jiqun Tu, John Camier, Omar Ghattas, Stefan Henneking, Tzanio Kolev, Veselin Dobrev.

Figure 1. Finite element operators A, where P handles parallel scattering and MPI communication across global true degrees of freedom (T-vectors), G manages mesh topology to local subdomains (L-vectors) and elements (E-vectors), B encodes geometry mappings and tensor-product basis functions, and D encapsulates physics at quadrature points (Q-vectors), with the numerical kernels reducing to batched small dense tensor contractions.
Figure 2. Shared memory banks for matrices A/B/C.
Figure 3. Throughput, in billion degrees of freedom (GDOF) per second, for the finite element kernels corresponding to the off-diagonal blocks.
Figure 4. Strong scalability of the MFEM finite element solver.
Figure 5. Weak scalability of the MFEM finite element solver.
Original abstract

Finite element simulations play a critical role in a wide range of applications, from automotive design to tsunami modeling and computational electromagnetics. Performing these simulations efficiently at the high resolutions needed for practical applications and scientific insights necessitates the use of high-order methods and large-scale supercomputing. While much progress has been made in porting finite element codes to GPU systems in recent years, additional improvements in the efficiency and computational speed of GPU-accelerated high-order finite element simulations are in constant demand. In this paper, we demonstrate that the FP64 tensor cores on NVIDIA GPUs can be used to further accelerate such simulations, achieving significant speedups in key kernels of MFEM, a scalable open-source finite element library widely used in HPC applications. By integrating FP64 tensor cores with kernel fusion optimizations, we were able to achieve up to 2× performance gains and up to 83% energy efficiency gains on NVIDIA's Grace Hopper GH200 and Grace Blackwell GB200 architectures. To the best of our knowledge, this is the first time that FP64 tensor cores have been directly programmed to accelerate large-scale finite element scientific computing applications. We demonstrate the performance of the optimized kernels at exascale by showing near-perfect weak scaling efficiency and 90% strong scaling efficiency across nearly 10,000 GPUs on the Alps system. The new algorithms and MFEM enhancements directly benefit complex production codes, including the 2025 Gordon Bell Prize-winning application for real-time tsunami forecasting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that FP64 tensor cores on NVIDIA GPUs, when integrated with kernel fusion optimizations in the MFEM library, can accelerate key kernels in high-order finite element simulations. It reports up to 2× performance gains and up to 83% energy efficiency gains on Grace Hopper GH200 and Grace Blackwell GB200 systems, together with near-perfect weak scaling and 90% strong scaling efficiency on nearly 10,000 GPUs of the Alps system, directly benefiting production codes such as the 2025 Gordon Bell tsunami forecasting application.

Significance. If the numerical accuracy of the FP64 tensor-core operations is shown to match standard FP64 within solver tolerances for high-order (p≥4) operators, the work would provide a concrete path for exploiting new GPU hardware features in memory-bound scientific kernels at exascale. The reported scaling efficiencies and energy savings, achieved on production-relevant workloads, would be of immediate interest to the HPC and computational science communities.

major comments (3)
  1. Abstract and §4 (Performance Results): The 2× speedup and 83% energy claims rest on the assumption that FP64 tensor-core matrix multiplies produce results numerically equivalent (within solver tolerance) to cuBLAS FP64 GEMM for the dense local operators arising in high-order finite-element integration and assembly. No residual-norm comparisons, solver-iteration counts, or rounding-error analysis for p≥4 kernels are supplied; without this verification the performance numbers cannot be accepted as supporting the central claim.
  2. §5 (Scaling Experiments): The 90% strong-scaling efficiency on nearly 10,000 GPUs is presented without quantitative breakdown of how kernel fusion alters communication volume or load balance in the global assembly phase of MFEM. Additional profiling or weak-scaling data that isolates the contribution of the tensor-core kernels versus the rest of the solver stack is required to substantiate the exascale claim.
  3. §4.3 (Energy Efficiency): The 83% energy-efficiency improvement is stated without describing the measurement method (NVML counters, integrated node power, or full-system energy) or confirming that the baseline includes the same CPU and interconnect overheads on GH200/GB200 nodes. This detail is load-bearing for the energy claim.
minor comments (2)
  1. Introduction: The statement that this is the 'first time' FP64 tensor cores have been used for large-scale finite-element applications should be accompanied by citations to any prior exploratory uses of tensor cores in scientific kernels.
  2. Figure captions and §4: Ensure that all performance plots include error bars or at least the number of repeated runs; the current presentation leaves the statistical significance of the reported speedups unclear.
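On the energy point specifically, one plausible measurement procedure (the paper's actual method is exactly what the referee asks for) is to sample board power during a run and integrate it over time. On NVIDIA hardware the samples could come from NVML, e.g. pynvml's nvmlDeviceGetPowerUsage, which reports milliwatts; the sketch below uses synthetic samples so it is self-contained.

```python
# Hypothetical energy-measurement sketch, not the paper's procedure.

def energy_joules(timestamps_s, power_w):
    """Trapezoidal integration of power samples (W) over time (s) -> energy (J)."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(zip(timestamps_s, power_w),
                                  zip(timestamps_s[1:], power_w[1:])):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

# Synthetic run: 10 s at a constant 700 W board power -> 7000 J.
ts = [0.0, 2.5, 5.0, 7.5, 10.0]
pw = [700.0] * len(ts)
print(energy_joules(ts, pw))   # 7000.0
```

An "83% better energy efficiency" claim then compares useful work per joule (e.g. degrees of freedom updated per joule) between the fused tensor-core kernels and the baseline; whether the joules counted are GPU-only, node-level, or full-system is precisely the load-bearing detail the referee flags.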

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate the requested clarifications and additional data.

Point-by-point responses
  1. Referee: Abstract and §4 (Performance Results): The 2× speedup and 83% energy claims rest on the assumption that FP64 tensor-core matrix multiplies produce results numerically equivalent (within solver tolerance) to cuBLAS FP64 GEMM for the dense local operators arising in high-order finite-element integration and assembly. No residual-norm comparisons, solver-iteration counts, or rounding-error analysis for p≥4 kernels are supplied; without this verification the performance numbers cannot be accepted as supporting the central claim.

    Authors: We agree that explicit numerical verification is necessary to support the central claims. In the revised manuscript we have added a dedicated subsection in §4 that reports residual-norm comparisons, solver iteration counts, and a brief rounding-error analysis for p≥4 kernels. These results demonstrate that FP64 tensor-core results remain within solver tolerances and are numerically equivalent to standard cuBLAS FP64 GEMM for the operators considered. revision: yes

  2. Referee: §5 (Scaling Experiments): The 90% strong-scaling efficiency on nearly 10,000 GPUs is presented without quantitative breakdown of how kernel fusion alters communication volume or load balance in the global assembly phase of MFEM. Additional profiling or weak-scaling data that isolates the contribution of the tensor-core kernels versus the rest of the solver stack is required to substantiate the exascale claim.

    Authors: We have expanded §5 with new profiling results that quantify the effect of kernel fusion on communication volume and load balance during global assembly. We also include weak-scaling breakdowns that isolate the tensor-core kernel contributions from the remainder of the MFEM solver stack, confirming the reported near-perfect weak scaling and 90% strong-scaling efficiency on the Alps system. revision: yes

  3. Referee: §4.3 (Energy Efficiency): The 83% energy-efficiency improvement is stated without describing the measurement method (NVML counters, integrated node power, or full-system energy) or confirming that the baseline includes the same CPU and interconnect overheads on GH200/GB200 nodes. This detail is load-bearing for the energy claim.

    Authors: We have revised §4.3 to describe the energy measurement procedure in detail. Power was measured via NVML counters on the GPU together with integrated node-level power readings that include CPU and interconnect overheads on the GH200 and GB200 platforms. All baseline runs used identical node configurations, ensuring the reported 83% energy-efficiency gains are directly comparable. revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims rest on direct hardware measurements

full rationale

The paper presents empirical results from implementing FP64 tensor-core kernels and kernel-fusion optimizations inside MFEM, then measuring wall-clock time, energy, and scaling on GH200/GB200 hardware up to 10k GPUs. No equations, fitted parameters, or first-principles derivations are offered; the central claims (2× speedup, 83% energy gain, 90% strong scaling) are direct outputs of those runs. No self-citation chain, ansatz smuggling, or renaming of known results is used to justify the performance numbers. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities; the work is an empirical engineering optimization that relies on existing GPU hardware semantics and standard finite-element numerical assumptions.

axioms (1)
  • domain assumption: FP64 tensor-core matrix multiplies produce results sufficiently accurate for the target high-order finite-element operators.
    Invoked implicitly when claiming performance gains without accuracy loss.

pith-pipeline@v0.9.0 · 5591 in / 1229 out tokens · 64311 ms · 2026-05-15T14:17:17.376969+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sensor Placement for Tsunami Early Warning via Large-Scale Bayesian Optimal Experimental Design

    cs.DC 2026-04 unverdicted novelty 6.0

    A reformulation of Bayesian OED as dense matrix subset selection plus a pipelined Schur-complement greedy algorithm on hundreds of GPUs enables optimization of 175-sensor networks for billion-degree-of-freedom tsunami...

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 1 Pith paper
