pith. machine review for the scientific record.

arxiv: 2603.09038 · v2 · submitted 2026-03-10 · 💻 cs.DC · cs.MS · cs.PF

Recognition: no theorem link

Accelerating High-Order Finite Element Simulations at Extreme Scale with FP64 Tensor Cores

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 14:17 UTC · model grok-4.3

classification 💻 cs.DC · cs.MS · cs.PF
keywords finite element methods · tensor cores · GPU acceleration · high-order discretization · exascale computing · kernel fusion · energy efficiency · MFEM library

The pith

FP64 tensor cores on NVIDIA GPUs speed up high-order finite element simulations by up to 2× and improve energy efficiency by up to 83%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that FP64 tensor cores, when combined with kernel fusion, can be programmed directly to handle the dominant operations in high-order finite element codes such as MFEM. This produces measurable speedups and energy savings on current NVIDIA architectures without changing the underlying mathematical formulation. A sympathetic reader would care because finite element simulations at practical resolutions already consume large fractions of supercomputer time and power; any reliable reduction in both directly expands the size and speed of problems that can be tackled.

Core claim

By directly programming the FP64 tensor cores and fusing kernels in MFEM, the authors obtain up to 2× faster execution and up to 83% better energy efficiency for the core finite-element operators on Grace Hopper GH200 and Grace Blackwell GB200 systems. The same kernels exhibit near-perfect weak scaling and 90% strong-scaling efficiency when run across nearly 10,000 GPUs on the Alps machine, and the resulting improvements are already used in production codes, including the 2025 Gordon Bell Prize tsunami-forecasting application.
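The scaling figures quoted above follow the conventional definitions; as a reminder of what a "90% strong-scaling efficiency" number means arithmetically, here is a minimal sketch with made-up timings (the function names and all numbers are illustrative, not from the paper):

```python
# Illustrative sketch, not the paper's code: how strong- and weak-scaling
# efficiencies are conventionally computed. All timings below are made up.

def strong_scaling_efficiency(t_base, n_base, t, n):
    """Measured speedup divided by ideal speedup, at fixed total problem size."""
    speedup = t_base / t
    ideal = n / n_base
    return speedup / ideal

def weak_scaling_efficiency(t_base, t):
    """With work per GPU fixed, the ideal runtime is constant."""
    return t_base / t

# Hypothetical runs: 100 s on 1,000 GPUs vs. 11.1 s on 10,000 GPUs (same problem).
eff_strong = strong_scaling_efficiency(100.0, 1_000, 11.1, 10_000)
print(f"strong-scaling efficiency: {eff_strong:.0%}")   # 90%

# Hypothetical weak-scaling pair: 100 s at small scale, 101 s at 10x scale.
print(f"weak-scaling efficiency: {weak_scaling_efficiency(100.0, 101.0):.0%}")   # 99%
```

The useful property of these ratios is that 1.0 means perfect linear scaling, so "near-perfect weak scaling" translates directly to a weak-scaling efficiency close to 100%.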

What carries the argument

FP64 tensor cores used inside fused high-order finite-element kernels that replace conventional double-precision matrix operations while preserving the required arithmetic.
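To make this concrete, here is a minimal NumPy stand-in for the arithmetic such fused kernels perform. The shapes and names are assumptions for illustration (this is not MFEM's API): after the operator decomposition sketched in Figure 1, the per-element work reduces to batched small dense contractions of the form B^T D B u, which is exactly the GEMM-shaped FP64 arithmetic that tensor cores accelerate.

```python
import numpy as np

# Illustrative stand-in, not the paper's implementation. Shapes are hypothetical:
rng = np.random.default_rng(0)
n_elem, n_dofs, n_quad = 1024, 64, 125   # e.g. p=3 hexes ((p+1)^3 dofs), 5^3 quad pts

B = rng.standard_normal((n_quad, n_dofs))    # basis evaluation at quadrature points
D = rng.standard_normal((n_elem, n_quad))    # pointwise physics data per element
u_e = rng.standard_normal((n_elem, n_dofs))  # element-local (E-vector) input

# Batched apply: y_e = B^T diag(D_e) B u_e for every element at once.
qvals = np.einsum("qd,ed->eq", B, u_e)       # interpolate dofs to quadrature points
y_e = np.einsum("qd,eq->ed", B, D * qvals)   # weight by D, test against the basis

assert y_e.shape == (n_elem, n_dofs)
```

Each element's contraction is a small dense matrix product, so the batch maps naturally onto tensor-core fragments; the fusion claim is that interpolation, pointwise weighting, and the transpose application stay in one kernel instead of round-tripping through global memory.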

If this is right

  • High-order finite-element codes can execute larger problems in the same wall-clock time on existing GPU clusters.
  • Energy cost per simulation decreases, allowing more runs within fixed power budgets.
  • Real-time or ensemble forecasting applications that rely on MFEM become feasible at higher resolutions.
  • The same tensor-core approach can be applied to other matrix-heavy kernels inside MFEM without altering the user interface.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The technique may transfer to other libraries that already use high-order discretizations on GPUs, such as those for computational fluid dynamics or electromagnetics.
  • Future hardware generations with wider or faster tensor units could amplify the observed speedups if the same fusion strategy is retained.
  • Accuracy verification workflows that compare tensor-core results against reference implementations will become routine parts of performance tuning for scientific codes.

Load-bearing premise

The FP64 tensor-core arithmetic must deliver results accurate enough for the high-order finite-element discretizations, and the kernel fusion must not add hidden overheads that erase the reported gains on real workloads.

What would settle it

Run a production-scale MFEM simulation with the tensor-core kernels and compare the final solution fields against an identical run performed with standard double-precision arithmetic; any statistically significant discrepancy in the solution or loss of convergence order would falsify the claim.
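A minimal sketch of such a check, using synthetic stand-in data (none of these arrays or tolerances come from the paper): compare the two solution fields in a relative norm, and verify that the observed convergence order under mesh refinement is preserved.

```python
import numpy as np

# Hypothetical verification sketch. Thresholds and data are illustrative.

def relative_error(u_test, u_ref):
    """Relative l2 discrepancy between a test field and a reference FP64 field."""
    return np.linalg.norm(u_test - u_ref) / np.linalg.norm(u_ref)

def observed_order(err_coarse, err_fine, refinement=2.0):
    """Estimated convergence order from errors on meshes of size h and h/refinement."""
    return np.log(err_coarse / err_fine) / np.log(refinement)

# Synthetic stand-in: a 'tensor-core' field perturbed at the 1e-14 level.
u_ref = np.sin(np.linspace(0.0, np.pi, 1000))
u_tc = u_ref + 1e-14 * np.random.default_rng(1).standard_normal(u_ref.size)

assert relative_error(u_tc, u_ref) < 1e-12   # well inside typical solver tolerances

# Discretization errors of 1e-3 at h and 6.25e-5 at h/2 imply fourth order.
print(f"observed order: {observed_order(1e-3, 6.25e-5):.1f}")   # 4.0
```

A drop in the observed order, or a relative discrepancy above the solver tolerance, is precisely the kind of statistically significant deviation that would falsify the accuracy premise.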

Figures

Figures reproduced from arXiv: 2603.09038 by Ian Karlin, Jiqun Tu, John Camier, Omar Ghattas, Stefan Henneking, Tzanio Kolev, Veselin Dobrev.

Figure 1. Finite element operators A, where P handles parallel scattering and MPI communication across global true degrees of freedom (T-vectors), G manages mesh topology to local subdomains (L-vectors) and elements (E-vectors), B encodes geometry mappings and tensor-product basis functions, and D encapsulates physics at quadrature points (Q-vectors), with the numerical kernels reducing to batched small dense tensor contractions.
Figure 2. Shared memory banks for matrices A/B/C.
Figure 3. Throughput, in billion degrees of freedom (GDOF) per second, for the finite element kernels corresponding to the off-diagonal blocks.
Figure 4. Strong scalability of the MFEM finite element solver.
Figure 5. Weak scalability of the MFEM finite element solver.
Original abstract

Finite element simulations play a critical role in a wide range of applications, from automotive design to tsunami modeling and computational electromagnetics. Performing these simulations efficiently at the high resolutions needed for practical applications and scientific insights necessitates the use of high-order methods and large-scale supercomputing. While much progress has been made in porting finite element codes to GPU systems in recent years, additional improvements in the efficiency and computational speed of GPU-accelerated high-order finite element simulations are in constant demand. In this paper, we demonstrate that the FP64 tensor cores on NVIDIA GPUs can be used to further accelerate such simulations, achieving significant speedups in key kernels of MFEM, a scalable open-source finite element library widely used in HPC applications. By integrating FP64 tensor cores with kernel fusion optimizations, we were able to achieve up to 2× performance gains and up to 83% energy efficiency gains on NVIDIA's Grace Hopper GH200 and Grace Blackwell GB200 architectures. To the best of our knowledge, this is the first time that FP64 tensor cores have been directly programmed to accelerate large-scale finite element scientific computing applications. We demonstrate the performance of the optimized kernels at exascale by showing near-perfect weak scaling efficiency and 90% strong scaling efficiency across nearly 10,000 GPUs on the Alps system. The new algorithms and MFEM enhancements directly benefit complex production codes, including the 2025 Gordon Bell Prize-winning application for real-time tsunami forecasting.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that FP64 tensor cores on NVIDIA GPUs, when integrated with kernel fusion optimizations in the MFEM library, can accelerate key kernels in high-order finite element simulations. It reports up to 2× performance gains and up to 83% energy efficiency gains on Grace Hopper GH200 and Grace Blackwell GB200 systems, together with near-perfect weak scaling and 90% strong scaling efficiency on nearly 10,000 GPUs of the Alps system, directly benefiting production codes such as the 2025 Gordon Bell tsunami forecasting application.

Significance. If the numerical accuracy of the FP64 tensor-core operations is shown to match standard FP64 within solver tolerances for high-order (p≥4) operators, the work would provide a concrete path for exploiting new GPU hardware features in memory-bound scientific kernels at exascale. The reported scaling efficiencies and energy savings, achieved on production-relevant workloads, would be of immediate interest to the HPC and computational science communities.

major comments (3)
  1. Abstract and §4 (Performance Results): The 2× speedup and 83% energy claims rest on the assumption that FP64 tensor-core matrix multiplies produce results numerically equivalent (within solver tolerance) to cuBLAS FP64 GEMM for the dense local operators arising in high-order finite-element integration and assembly. No residual-norm comparisons, solver-iteration counts, or rounding-error analysis for p≥4 kernels are supplied; without this verification the performance numbers cannot be accepted as supporting the central claim.
  2. §5 (Scaling Experiments): The 90% strong-scaling efficiency on nearly 10,000 GPUs is presented without quantitative breakdown of how kernel fusion alters communication volume or load balance in the global assembly phase of MFEM. Additional profiling or weak-scaling data that isolates the contribution of the tensor-core kernels versus the rest of the solver stack is required to substantiate the exascale claim.
  3. §4.3 (Energy Efficiency): The 83% energy-efficiency improvement is stated without describing the measurement method (NVML counters, integrated node power, or full-system energy) or confirming that the baseline includes the same CPU and interconnect overheads on GH200/GB200 nodes. This detail is load-bearing for the energy claim.
minor comments (2)
  1. Introduction: The statement that this is the 'first time' FP64 tensor cores have been used for large-scale finite-element applications should be accompanied by citations to any prior exploratory uses of tensor cores in scientific kernels.
  2. Figure captions and §4: Ensure that all performance plots include error bars or at least the number of repeated runs; the current presentation leaves the statistical significance of the reported speedups unclear.
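On the energy point specifically, one plausible measurement procedure (the paper's actual method is exactly what the referee asks for) is to sample board power during a run and integrate it over time. On NVIDIA hardware the samples could come from NVML, e.g. pynvml's nvmlDeviceGetPowerUsage, which reports milliwatts; the sketch below uses synthetic samples so it is self-contained.

```python
# Hypothetical energy-measurement sketch, not the paper's procedure.

def energy_joules(timestamps_s, power_w):
    """Trapezoidal integration of power samples (W) over time (s) -> energy (J)."""
    total = 0.0
    for (t0, p0), (t1, p1) in zip(zip(timestamps_s, power_w),
                                  zip(timestamps_s[1:], power_w[1:])):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    return total

# Synthetic run: 10 s at a constant 700 W board power -> 7000 J.
ts = [0.0, 2.5, 5.0, 7.5, 10.0]
pw = [700.0] * len(ts)
print(energy_joules(ts, pw))   # 7000.0
```

An "83% better energy efficiency" claim then compares useful work per joule (e.g. degrees of freedom updated per joule) between the fused tensor-core kernels and the baseline; whether the joules counted are GPU-only, node-level, or full-system is precisely the load-bearing detail the referee flags.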

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate the requested clarifications and additional data.

Point-by-point responses
  1. Referee: Abstract and §4 (Performance Results): The 2× speedup and 83% energy claims rest on the assumption that FP64 tensor-core matrix multiplies produce results numerically equivalent (within solver tolerance) to cuBLAS FP64 GEMM for the dense local operators arising in high-order finite-element integration and assembly. No residual-norm comparisons, solver-iteration counts, or rounding-error analysis for p≥4 kernels are supplied; without this verification the performance numbers cannot be accepted as supporting the central claim.

    Authors: We agree that explicit numerical verification is necessary to support the central claims. In the revised manuscript we have added a dedicated subsection in §4 that reports residual-norm comparisons, solver iteration counts, and a brief rounding-error analysis for p≥4 kernels. These results demonstrate that FP64 tensor-core results remain within solver tolerances and are numerically equivalent to standard cuBLAS FP64 GEMM for the operators considered. revision: yes

  2. Referee: §5 (Scaling Experiments): The 90% strong-scaling efficiency on nearly 10,000 GPUs is presented without quantitative breakdown of how kernel fusion alters communication volume or load balance in the global assembly phase of MFEM. Additional profiling or weak-scaling data that isolates the contribution of the tensor-core kernels versus the rest of the solver stack is required to substantiate the exascale claim.

    Authors: We have expanded §5 with new profiling results that quantify the effect of kernel fusion on communication volume and load balance during global assembly. We also include weak-scaling breakdowns that isolate the tensor-core kernel contributions from the remainder of the MFEM solver stack, confirming the reported near-perfect weak scaling and 90% strong-scaling efficiency on the Alps system. revision: yes

  3. Referee: §4.3 (Energy Efficiency): The 83% energy-efficiency improvement is stated without describing the measurement method (NVML counters, integrated node power, or full-system energy) or confirming that the baseline includes the same CPU and interconnect overheads on GH200/GB200 nodes. This detail is load-bearing for the energy claim.

    Authors: We have revised §4.3 to describe the energy measurement procedure in detail. Power was measured via NVML counters on the GPU together with integrated node-level power readings that include CPU and interconnect overheads on the GH200 and GB200 platforms. All baseline runs used identical node configurations, ensuring the reported 83% energy-efficiency gains are directly comparable. revision: yes

Circularity Check

0 steps flagged

No circularity: performance claims rest on direct hardware measurements

full rationale

The paper presents empirical results from implementing FP64 tensor-core kernels and kernel-fusion optimizations inside MFEM, then measuring wall-clock time, energy, and scaling on GH200/GB200 hardware up to 10k GPUs. No equations, fitted parameters, or first-principles derivations are offered; the central claims (2× speedup, 83% energy gain, 90% strong scaling) are direct outputs of those runs. No self-citation chain, ansatz smuggling, or renaming of known results is used to justify the performance numbers. The derivation chain is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

No free parameters or invented entities; the work is an empirical engineering optimization that relies on existing GPU hardware semantics and standard finite-element numerical assumptions.

axioms (1)
  • domain assumption: FP64 tensor-core matrix multiplies produce results sufficiently accurate for the target high-order finite-element operators.
    Invoked implicitly when claiming performance gains without accuracy loss.

pith-pipeline@v0.9.0 · 5591 in / 1229 out tokens · 64311 ms · 2026-05-15T14:17:17.376969+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Sensor Placement for Tsunami Early Warning via Large-Scale Bayesian Optimal Experimental Design

    cs.DC 2026-04 unverdicted novelty 6.0

    A reformulation of Bayesian OED as dense matrix subset selection plus a pipelined Schur-complement greedy algorithm on hundreds of GPUs enables optimization of 175-sensor networks for billion-degree-of-freedom tsunami...

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · cited by 1 Pith paper
