BF16 tensor cores on GPUs emulate FP32 SGEMM with superior performance, power efficiency, and numerical accuracy compared to native FP32, including a library implementation that handles denormals.
In Proceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region (SCA/HPCAsia ’26)
3 Pith papers cite this work.
citation-role summary: background (1)
citation-polarity summary: background (1)
years: 2026 (3)
citing papers explorer
- Exceeding the Numerical and Performance Characteristics of IEEE-754 SGEMM with BFloat16 Tensor Cores on GPUs for Scientific Computing
  BF16 tensor cores on GPUs emulate FP32 SGEMM with superior performance, power efficiency, and numerical accuracy compared to native FP32, including a library implementation that handles denormals.
- Double-Precision Matrix Multiplication Emulation via Ozaki-II Scheme with FP8 Quantization
  An adaptation of the Ozaki-II scheme allows DGEMM emulation on FP8 MMA units with significantly reduced computational cost compared to FP8-based Ozaki-I.
- Post-Moore Technologies for Plasma Simulation: A Community Roadmap
  No single post-Moore technology replaces current HPC for plasma simulations, but FPGA-class accelerators offer near-term kernel offload, non-von Neumann architectures medium-term operator acceleration, and quantum computing long-term potential for warm dense matter microphysics.