pith. sign in

arxiv: 2603.10634 · v2 · submitted 2026-03-11 · 💻 cs.DC

Double-Precision Matrix Multiplication Emulation via Ozaki-II Scheme with FP8 Quantization

Pith reviewed 2026-05-15 13:08 UTC · model grok-4.3

classification 💻 cs.DC
keywords DGEMM emulationOzaki-II schemeFP8 quantizationlow-precision MMAdouble-precision emulationmatrix multiplicationHPC applications
0
0 comments X

The pith

A novel adaptation allows the Ozaki-II scheme to emulate DGEMM using FP8 MMA units with reduced computational cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a technique to run the Ozaki-II emulation scheme for double-precision matrix multiplication on FP8 hardware units. Existing Ozaki-I works directly with FP8, but Ozaki-II required changes because its original form does not fit FP8 quantization. The new method lowers the number of matrix multiplications needed, offering a more efficient path to FP64 accuracy when INT8 performance is limited on recent chips. This matters for high-performance computing kernels that must deliver reliable results without native double-precision hardware.

Core claim

We introduce a novel technique to demonstrate DGEMM emulation based on the Ozaki-II scheme that operates on FP8 MMA units. Compared to the FP8-based Ozaki-I scheme, our method significantly reduces the computational cost and enables efficient FP64 emulation.

What carries the argument

Adapted Ozaki-II scheme for FP8 quantization that modifies the algorithmic structure to work with FP8 MMA units.

Load-bearing premise

The Ozaki-II algorithmic structure can be directly modified for FP8 quantization without introducing unacceptable rounding errors or needing extra corrections that erase the cost savings.

What would settle it

A direct comparison on benchmark matrices measuring operation count and numerical error for the new FP8 Ozaki-II method versus the existing FP8 Ozaki-I method; failure to show lower cost or acceptable accuracy would disprove the claim.

Figures

Figures reproduced from arXiv: 2603.10634 by Katsuhisa Ozaki, Toshiyuki Imamura, Yuki Uchino.

Figure 1
Figure 1. Figure 1: Predicted throughput heatmaps for DGEMM emulation us [PITH_FULL_IMAGE:figures/full_fig_p007_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Predicted throughput heatmaps for DGEMM emulation us [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: shows the accuracy of DGEMM on the RTX 4090 Laptop. Similar results were obtained on other platforms. For the INT8-based Ozaki-I baseline, we used the cuBLAS 1https://github.com/RIKEN-RCCS/GEMMul8 implementation. To control the dynamic range of the test matrices, we generated A ∈ R 128×k and B ∈ R k×128 as aij , bij ≈ (rand−0.5)·exp(randn·ϕ), where rand ∈ (0, 1] de￾notes a uniformly distributed random numb… view at source ↗
Figure 5
Figure 5. Figure 5: Throughput comparison on RTX 5080. faster than the FP8-based Ozaki-II by a factor of 0.9–2.3× across the evaluated parameter ranges [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Throughput comparison on B200. Across all tested (m, n, k) configurations on the RTX 5080, both the INT8-based and FP8-based Ozaki-II achieve higher throughput than that of native FP64 DGEMM. On the B200, for m = n = 2048, the INT8-based Ozaki-II outperforms native FP64 DGEMM for k > 4096, whereas the FP8-based Ozaki-II does so for k ≥ 16384. For m = n = 4096, the INT8-based Ozaki-II outperforms native FP6… view at source ↗
Figure 7
Figure 7. Figure 7: Time breakdown for RTX 5080 [PITH_FULL_IMAGE:figures/full_fig_p010_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Time breakdown for B200. the RTX 5080, which is consistent with reduced utilization efficiency of the available MMA units for small matrices. VI. CONCLUSION In this paper, we presented an FP8-based DGEMM emula￾tion method based on the Ozaki-II scheme. By combining a Karatsuba-based extension with a modular reduction technique for selected square moduli, we enabled Ozaki-II-style FP64- targeted emulation on… view at source ↗
read the original abstract

In this paper, we propose a method for emulating double-precision general matrix--matrix multiplication (DGEMM), a fundamental and performance-critical kernel in many high-performance computing applications. Ozaki-I and Ozaki-II are established DGEMM emulation schemes via low-precision matrix multiply-accumulate (MMA) units. For the Ozaki-I scheme, INT8-, FP8-, and FP16-based implementations have been proposed, all of which can be realized based on the same underlying algorithmic structure. In contrast, although INT8-based implementations of the Ozaki-II scheme have been reported, the original algorithm cannot be directly adapted to exploit FP8 MMA units. In several recent architectures, such as NVIDIA Blackwell Ultra and NVIDIA Rubin, INT8 performance has been reduced, making reliance on INT8 alone insufficient. Therefore, we introduce a novel technique to demonstrate DGEMM emulation based on the Ozaki-II scheme that operates on FP8 MMA units. Compared to the FP8-based Ozaki-I scheme, our method significantly reduces the computational cost and enables efficient FP64 emulation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a novel adaptation of the Ozaki-II scheme to enable DGEMM emulation on FP8 MMA units. It observes that the original Ozaki-II algorithm cannot be directly ported to FP8 (unlike Ozaki-I, which supports INT8/FP8/FP16 under the same structure) and introduces a new technique that is claimed to reduce computational cost relative to the FP8 Ozaki-I baseline while still delivering accurate FP64 results. The motivation is the reduced INT8 throughput on recent NVIDIA architectures such as Blackwell Ultra and Rubin.

Significance. If the claimed cost reduction and error control are rigorously established, the work would be a useful incremental advance for low-precision emulation kernels on FP8-dominant hardware. It directly addresses a practical limitation of prior Ozaki schemes and could improve performance portability for scientific codes that rely on DGEMM emulation.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (algorithm description): the central claim that the novel FP8 adaptation of Ozaki-II 'significantly reduces the computational cost' relative to FP8 Ozaki-I is not supported by any operation-count table, flop breakdown, or pseudocode. Without these, it is impossible to verify whether the added quantization or splitting steps required for FP8 dynamic-range handling erase the purported savings.
  2. [§4] §4 (error analysis): no forward-error bound, rounding-error analysis, or numerical verification is supplied for the FP8-quantized Ozaki-II variant. The manuscript must demonstrate that the adaptation controls accumulation error without extra corrective passes; otherwise the cost-reduction claim cannot be evaluated.
minor comments (1)
  1. [Abstract] The abstract and introduction repeatedly use 'Ozaki-I' and 'Ozaki-II' without a brief reminder of the original algorithmic difference; a one-sentence recap would improve readability for readers unfamiliar with the prior work.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below and have revised the manuscript to provide the requested details on operation counts and error analysis.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (algorithm description): the central claim that the novel FP8 adaptation of Ozaki-II 'significantly reduces the computational cost' relative to FP8 Ozaki-I is not supported by any operation-count table, flop breakdown, or pseudocode. Without these, it is impossible to verify whether the added quantization or splitting steps required for FP8 dynamic-range handling erase the purported savings.

    Authors: We agree that an explicit breakdown strengthens the claim. In the revised version we add a table in §3 that lists the exact number of FP8 MMA operations, memory accesses, and splitting steps for both the proposed Ozaki-II adaptation and the FP8 Ozaki-I baseline. Updated pseudocode and a short flop-count derivation are included to show that the FP8-specific quantization avoids the extra splitting overhead of Ozaki-I, confirming the cost reduction. revision: yes

  2. Referee: [§4] §4 (error analysis): no forward-error bound, rounding-error analysis, or numerical verification is supplied for the FP8-quantized Ozaki-II variant. The manuscript must demonstrate that the adaptation controls accumulation error without extra corrective passes; otherwise the cost-reduction claim cannot be evaluated.

    Authors: We accept that a formal error analysis is required. We expand §4 with a forward-error bound derivation for the FP8-quantized Ozaki-II scheme, a rounding-error analysis demonstrating that accumulation error remains controlled without additional corrective passes, and numerical verification results that compare observed errors against the derived bounds and against the FP8 Ozaki-I baseline. revision: yes

Circularity Check

0 steps flagged

No circularity: novel FP8 adaptation of Ozaki-II presented as independent algorithmic contribution

full rationale

The paper introduces a new technique for FP8-based Ozaki-II DGEMM emulation, explicitly contrasting it with prior Ozaki-I and INT8 Ozaki-II schemes. No equations, fitted parameters, or predictions are shown that reduce to inputs by construction. References to established Ozaki schemes serve as background rather than load-bearing self-citations that force the result; the central claim rests on the described novel modification, which is presented as original work without self-referential derivation or renaming of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review supplies no concrete equations, so the ledger is limited to background assumptions common to all floating-point emulation schemes.

axioms (1)
  • standard math Standard rounding and associativity properties of IEEE floating-point arithmetic hold for both FP8 and FP64 operations.
    Required for any low-precision emulation scheme to produce bounded error.

pith-pipeline@v0.9.0 · 5493 in / 1195 out tokens · 59440 ms · 2026-05-15T13:08:04.651765+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

29 extracted references · 29 canonical work pages

  1. [1]

    (2026) NVIDIA Vera Rubin NVL72

    NVIDIA Corporation. (2026) NVIDIA Vera Rubin NVL72. Retrieved 25 January, 2026. [Online]. Available: https://www.nvidia.com/en- us/data-center/vera-rubin-nvl72

  2. [2]

    (2024) NVIDIA H200 Tensor Core GPU

    ——. (2024) NVIDIA H200 Tensor Core GPU. Retrieved 25 January,

  3. [3]

    Available: https://resources.nvidia.com/en-us-hopper- architecture/hpc-datasheet-sc23

    [Online]. Available: https://resources.nvidia.com/en-us-hopper- architecture/hpc-datasheet-sc23

  4. [4]

    (2025) NVIDIA Blackwell Architecture Technical Brief v2.1

    ——. (2025) NVIDIA Blackwell Architecture Technical Brief v2.1. Retrieved 25 January, 2026. [Online]. Available: https://resources.nvidia.com/en-us-blackwell-architecture

  5. [5]

    (2020) Nvidia a100 tensor core gpu archi- tecture v1.0

    ——. (2020) Nvidia a100 tensor core gpu archi- tecture v1.0. Retrieved 5 December, 2025. [Online]. Available: https://images.nvidia.com/aem-dam/en-zz/Solutions/data- center/nvidia-ampere-architecture-whitepaper.pdf

  6. [6]

    (2023) Nvidia h100 tensor core gpu architecture v1.04

    ——. (2023) Nvidia h100 tensor core gpu architecture v1.04. Retrieved 5 December, 2025. [Online]. Available: https://resources.nvidia.com/en- us-hopper-architecture/nvidia-h100-tensor-c

  7. [7]

    (2025) Amd instinct mi300x apu

    Advanced Micro Devices, Inc. (2025) Amd instinct mi300x apu. Retrieved 5 December, 2025. [Online]. Avail- able: https://www.amd.com/content/dam/amd/en/documents/instinct- tech-docs/data-sheets/amd-instinct-mi300a-data-sheet.pdf

  8. [8]

    (2025) Amd instinct mi300x accelera- tor

    ——. (2025) Amd instinct mi300x accelera- tor. Retrieved 5 December, 2025. [Online]. Avail- able: https://www.amd.com/content/dam/amd/en/documents/instinct- tech-docs/data-sheets/amd-instinct-mi300x-data-sheet.pdf

  9. [9]

    (2025) Amd instinct mi325x accelera- tor

    ——. (2025) Amd instinct mi325x accelera- tor. Retrieved 5 December, 2025. [Online]. Avail- able: https://www.amd.com/content/dam/amd/en/documents/instinct- tech-docs/product-briefs/instinct-mi325x-datasheet.pdf

  10. [10]

    (2025) Amd instinct mi350x gpu

    ——. (2025) Amd instinct mi350x gpu. Re- trieved 5 December, 2025. [Online]. Avail- able: https://www.amd.com/content/dam/amd/en/documents/instinct- tech-docs/product-briefs/amd-instinct-mi350x-gpu-brochure.pdf

  11. [11]

    (2025) Amd instinct mi355x gpu

    ——. (2025) Amd instinct mi355x gpu. Re- trieved 5 December, 2025. [Online]. Avail- able: https://www.amd.com/content/dam/amd/en/documents/instinct- tech-docs/product-briefs/amd-instinct-mi355x-gpu-brochure.pdf

  12. [12]

    (2025) Intel Core Ultra 200H and 200U Series Processors, Datasheet, V olume 1 of

    Intel Corporation. (2025) Intel Core Ultra 200H and 200U Series Processors, Datasheet, V olume 1 of

  13. [13]

    [Online]

    Retrieved 25 February, 2026. [Online]. Available: https://edc.intel.com/content/www/jp/ja/design/products-and- solutions/processors-and-chipsets/core-ultra-200h-and-200u-series- processors-datasheet-volume-1-of-2/intel-neural-processing-unit-intel- npu/

  14. [14]

    (2026) Tpu v6e

    Google Cloud. (2026) Tpu v6e. Retrieved 25 February, 2026. [Online]. Available: https://docs.cloud.google.com/tpu/docs/v6e?hl=en

  15. [15]

    (2026) Tpu7x (ironwood)

    ——. (2026) Tpu7x (ironwood). Retrieved 25 February, 2026. [Online]. Available: https://docs.cloud.google.com/tpu/docs/tpu7x?hl=en

  16. [16]

    Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications,

    K. Ozaki, T. Ogita, S. Oishi, and S. M. Rump, “Error-free transformations of matrix multiplication by using fast routines of matrix multiplication and its applications,”Numerical Algorithms, vol. 59, no. 1, pp. 95–118, 2012. [Online]. Available: https://doi.org/10.1007/s11075- 011-9478-1

  17. [17]

    Ozaki, T

    ——, “Generalization of error-free transformation for matrix multiplication and its application,”Nonlinear Theory and Its Applications, IEICE, vol. 4, no. 1, pp. 2–11, 2013. [Online]. Available: https://doi.org/10.1587/nolta.4.2

  18. [18]

    Ozaki Scheme II: A GEMM- oriented emulation of floating-point matrix multiplication using an integer modular technique

    K. Ozaki, Y . Uchino, and T. Imamura, “Ozaki Scheme II: A GEMM-oriented emulation of floating-point matrix multiplication using an integer modular technique,” 2025. [Online]. Available: https://arxiv.org/abs/2504.08009

  19. [19]

    DGEMM on integer matrix multiplication unit,

    H. Ootomo, K. Ozaki, and R. Yokota, “DGEMM on integer matrix multiplication unit,”The International Journal of High Performance Computing Applications, vol. 38, no. 4, pp. 297–313, 2024. [Online]. Available: https://doi.org/10.1177/10943420241239588

  20. [20]

    2025), 462–476

    Y . Uchino, K. Ozaki, and T. Imamura, “Performance enhancement of the Ozaki Scheme on integer matrix multiplication unit,” The International Journal of High Performance Computing Applications, vol. 39, no. 3, pp. 462–476, 2025. [Online]. Available: https://doi.org/10.1177/10943420241313064

  21. [21]

    Proceedings of the SC '25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis , pages =

    ——, “High-performance and power-efficient emulation of matrix multiplication using INT8 matrix engines,” inProceedings of the SC ’25 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, ser. SC Workshops ’25. St. louis, MO, USA: Association for Computing Machinery, 2025, pp. 1824–1831. [Online]. Avail...

  22. [22]

    Mukunoki, K

    D. Mukunoki, K. Ozaki, T. Ogita, and T. Imamura, “DGEMM using Tensor Cores, and its accurate and reproducible versions,” inHigh Performance Computing, P. Sadayappan, B. L. Chamberlain, G. Juckeland, and H. Ltaief, Eds. Cham: Springer International Publishing, 2020, pp. 230–248. [Online]. Available: https://doi.org/10.1007/978-3-030-50743-5 12

  23. [23]

    DGEMM using FP64 Arithmetic Emulation and FP8 Tensor Cores with Ozaki Scheme,

    D. Mukunoki, “DGEMM using FP64 Arithmetic Emulation and FP8 Tensor Cores with Ozaki Scheme,” inProceedings of the Supercomputing Asia and International Conference on High Performance Computing in Asia Pacific Region Workshops, ser. SCA/HPCAsiaWS ’26. Osaka, Japan: Association for Computing Machinery, 2026, p. 303–311. [Online]. Available: https://doi.org/...

  24. [24]

    Emulation of complex matrix multiplication based on the Chinese Remainder Theorem,

    Y . Uchino, Q. Ma, T. Imamura, K. Ozaki, and P. L. Gutsche, “Emulation of complex matrix multiplication based on the Chinese Remainder Theorem,” 2025. [Online]. Available: https://arxiv.org/abs/2512.08321

  25. [25]

    Error estimation of floating-point summation and dot product,

    S. M. Rump, “Error estimation of floating-point summation and dot product,”BIT Numerical Mathematics, vol. 52, no. 1, pp. 201–220,

  26. [26]

    Available: https://doi.org/10.1007/s10543-011-0342-4

    [Online]. Available: https://doi.org/10.1007/s10543-011-0342-4

  27. [27]

    (2026) SAKURAONE: a managed high performance computing cluster

    SAKURA internet Inc. (2026) SAKURAONE: a managed high performance computing cluster. Retrieved 26 March, 2026. [Online]. Available: https://www.sakura.ad.jp/sakuraone/

  28. [28]

    Guaranteed dgemm accuracy while using reduced precision tensor cores through extensions of the ozaki scheme,

    A. Schwarz, A. Anders, C. Brower, H. Bayraktar, J. Gunnels, K. Clark, R. G. Xu, S. Rodriguez, S. Cayrols, P. Tabaszewski, and V . Podlozhnyuk, “Guaranteed dgemm accuracy while using reduced precision tensor cores through extensions of the ozaki scheme,” ser. SCA/HPCAsia ’26. Association for Computing Machinery, 2026, p. 91–101. [Online]. Available: https:...

  29. [29]

    Energy-Efficient Supercomputing Through Tensor Core-Accelerated Mixed-Precision Computing and Floating- Point Emulation,

    S. R. Bernabeu, “Energy-Efficient Supercomputing Through Tensor Core-Accelerated Mixed-Precision Computing and Floating- Point Emulation,” Oral presentation at NVIDIA GTC 2025, Mar. 2025. [Online]. Available: https://www.nvidia.com/en-us/on- demand/session/gtc25-s71487/