BF16 tensor cores on GPUs can emulate FP32 matrix multiplication (SGEMM) with better performance, power efficiency, and numerical accuracy than native FP32; the work includes a library implementation that handles denormals.
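To make the idea of emulating FP32 arithmetic with BF16 pieces concrete, here is a minimal sketch. It is not the cited work's algorithm or library: it assumes a simple three-way truncation split and runs on the CPU using only the Python standard library, with no tensor cores involved. The helper names `f32`, `bf16_trunc`, `split_bf16`, and `emulated_dot` are hypothetical, and the sketch ignores the scaling and denormal handling that a real implementation would need.

```python
import struct

def f32(x):
    """Round a Python float (double) to the nearest FP32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]

def bf16_trunc(x):
    """Truncate an FP32 value to BF16 precision by keeping its top 16 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFF0000))[0]

def split_bf16(x, terms=3):
    """Split x into `terms` BF16-representable pieces (assumed split scheme).

    With three pieces, the 24-bit FP32 significand is covered completely.
    """
    pieces = []
    residual = f32(x)
    for _ in range(terms):
        piece = bf16_trunc(residual)
        pieces.append(piece)
        # The subtraction only removes leading bits, so it is exact in FP32.
        residual = f32(residual - piece)
    return pieces

def emulated_dot(xs, ys, terms=3):
    """Dot product built from BF16 piece products with FP32 accumulation."""
    xp = [split_bf16(x, terms) for x in xs]
    yp = [split_bf16(y, terms) for y in ys]
    total = 0.0
    # Add the smallest cross terms first; terms with i + j >= terms are
    # dropped because they fall below FP32 precision.
    for i in reversed(range(terms)):
        for j in reversed(range(terms)):
            if i + j < terms:
                acc = 0.0
                for px, py in zip(xp, yp):
                    # A product of two BF16 values is exact in FP32.
                    acc = f32(acc + px[i] * py[j])
                total = f32(total + acc)
    return total
```

Calling `emulated_dot(xs, ys, terms=1)` corresponds to plain BF16 truncation of the inputs, while `terms=3` recovers roughly FP32-level accuracy in this sketch. A matrix product applies the same piecewise scheme to every row and column pair.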
Guaranteed DGEMM accuracy while using reduced precision tensor cores through extensions of the Ozaki scheme
3 Pith papers cite this work. Polarity classification is still indexing.
citation summary
years: 2026 (3)
roles: background (1)
polarities: background (1)
representative citing papers
An adaptation of the Ozaki-II scheme allows DGEMM emulation on FP8 MMA units with significantly reduced computational cost compared to FP8-based Ozaki-I.
No single post-Moore technology replaces current HPC for plasma simulations, but FPGA-class accelerators offer near-term kernel offload, non-von Neumann architectures medium-term operator acceleration, and quantum computing long-term potential for warm dense matter microphysics.
citing papers explorer
- Double-Precision Matrix Multiplication Emulation via Ozaki-II Scheme with FP8 Quantization
- Post-Moore Technologies for Plasma Simulation: A Community Roadmap