Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores

Blaise Tine; Nikhil Rout

arxiv: 2512.00053 · v3 · pith:N5F57DHMnew · submitted 2025-11-19 · 💻 cs.AR


Pith reviewed 2026-05-17 20:32 UTC · model grok-4.3

classification 💻 cs.AR

keywords: fused dot product · mixed-precision arithmetic · tensor core · GPGPU · FPGA implementation · RISC-V · matrix multiply-accumulate · open-source hardware


The pith

Ten-Four fuses floating-point and integer pipelines into one dot-product unit that runs mixed-precision matrix operations in four cycles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Ten-Four as a scalable mixed-precision fused dot product unit built for the open-source Vortex GPGPU Tensor Core. It combines floating-point and integer arithmetic paths to handle multiplications in FP16, BF16, FP8, BF8, INT8, and INT4 formats while accumulating results in FP32 or INT32. The design adds native Microscaling support and sparse lane clock-gating for power savings. On an AMD Xilinx Alveo U55C FPGA it reaches 4-cycle latency at 262.325 MHz, yielding 134.308 GFLOPS per Tensor Core and a 3.1 times speedup over a Berkeley HardFloat version at under 60 percent the area while matching NVIDIA numerical accuracy. This matters for open-source GPGPU development because discrete arithmetic units have historically added latency, rounding error, and wasted silicon in deep-learning accelerators.

Core claim

Ten-Four integrates the floating-point and integer arithmetic pipelines within a single fused architecture. It supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/INT4 formats and higher-precision accumulation in FP32/INT32, with native Microscaling and sparse lane clock-gating. On the AMD Xilinx Alveo U55C FPGA it achieves 4-cycle operation latency at 262.325 MHz Fmax and 134.308 GFLOPS peak throughput per Tensor Core, delivering approximately 3.1 times the performance of an equivalent Berkeley HardFloat-based implementation at less than 60 percent of the area cost while matching NVIDIA Tensor Core numerical accuracy.
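As a quick sanity check on these figures, the reported peak throughput can be divided by the reported clock to see how many operations per cycle it implies. This is a minimal sketch: the two constants are quoted from the claim above, while the interpretation (peak throughput equals clock rate times FLOPs per cycle, with one multiply-accumulate counted as two FLOPs) is an assumption, and the function name is hypothetical.

```python
# Sanity-check the reported peak throughput against the reported clock.
# Both figures are quoted from the summary above; the interpretation
# (peak = clock * FLOPs per cycle, 1 MAC = 2 FLOPs) is an assumption.

FMAX_MHZ = 262.325      # reported Fmax on the Alveo U55C
PEAK_GFLOPS = 134.308   # reported peak throughput per Tensor Core


def implied_flops_per_cycle(peak_gflops: float, fmax_mhz: float) -> float:
    """FLOPs retired per clock cycle implied by a peak-throughput figure."""
    return (peak_gflops * 1e9) / (fmax_mhz * 1e6)


flops_per_cycle = implied_flops_per_cycle(PEAK_GFLOPS, FMAX_MHZ)
macs_per_cycle = flops_per_cycle / 2  # assuming one multiply-add = 2 FLOPs

print(f"{flops_per_cycle:.2f} FLOPs/cycle, {macs_per_cycle:.2f} MACs/cycle")
```

Under those assumptions the figures work out to roughly 512 FLOPs per cycle. The lane configuration behind that number is not stated on this page, so the result is only a consistency check, not a description of the hardware.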

What carries the argument

A single fused dot-product architecture that merges floating-point and integer pipelines to perform multiplication and accumulation without intermediate rounding or separate units.
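The effect of removing intermediate rounding can be illustrated with a small software model. The sketch below is not the Ten-Four datapath: it compares an idealised fused dot product (exact sum of products, one final FP32 rounding) against a discrete multiplier and adder that round to FP32 after every step. Real fused units use a finite internal width, and all function names here are hypothetical.

```python
import struct
from fractions import Fraction


def to_f32(x: float) -> float:
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def discrete_dot(a, b, c=0.0):
    """Separate multiplier and adder: round to FP32 after every operation."""
    acc = to_f32(c)
    for x, y in zip(a, b):
        prod = to_f32(x * y)
        acc = to_f32(acc + prod)
    return acc


def fused_dot(a, b, c=0.0):
    """Idealised fused unit: exact sum of products, one final rounding.

    The exact value passes through binary64 on the way to binary32, which
    could double-round in rare cases; it does not for the example below.
    """
    exact = Fraction(c)
    for x, y in zip(a, b):
        exact += Fraction(x) * Fraction(y)
    return to_f32(float(exact))


# All inputs are exactly representable in FP16.
a = [4096.0, 1.0, 1.0, -4096.0]
b = [4096.0, 1.0, 1.0, 4096.0]

print("discrete:", discrete_dot(a, b))
print("fused:   ", fused_dot(a, b))
```

The inputs are chosen so that the large product 2^24 absorbs each following 1.0 when FP32 rounding is applied per step, while the exact sum keeps both of them until the large terms cancel.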

If this is right

Matrix-multiply-accumulate operations inside open-source GPGPUs can now complete in four cycles instead of the higher latency of discrete units.
Resource utilization improves because a single pipeline replaces multiple separate arithmetic blocks.
Dynamic power drops further through built-in sparse lane clock-gating when many lanes are inactive.
Designers gain an open-source drop-in unit that already matches commercial Tensor Core accuracy for mixed-precision workloads.
The same fused structure scales to additional low-precision formats without redesigning separate adders or multipliers.
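To make the clock-gating point concrete, here is a purely behavioural sketch of lane gating in a dot product. It assumes a lane is gated whenever either of its operands is zero; the actual gating condition used by Ten-Four is not described on this page, and `gated_dot` is a hypothetical name. The model counts active lanes and says nothing about real power.

```python
def gated_dot(a, b):
    """Return (dot product, active lane count) with zero lanes gated off.

    Assumption: a lane is disabled when either operand is zero, since its
    product cannot change the accumulated result.
    """
    enables = [x != 0 and y != 0 for x, y in zip(a, b)]
    total = 0
    for en, x, y in zip(enables, a, b):
        if en:  # a gated lane does no work and contributes nothing
            total += x * y
    return total, sum(enables)


# Eight INT8 lanes, most of them with at least one zero operand.
a = [3, 0, -2, 5, 0, 0, 7, 1]
b = [4, 9, 0, -1, 0, 6, 2, 0]

result, active = gated_dot(a, b)
print(f"result={result}, active lanes={active} of {len(a)}")
```

Gating on zero operands leaves the result unchanged, which is why the saving is free in terms of accuracy for integer and finite floating-point inputs.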

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Other open-source GPU projects could adopt the same fused pipeline to reduce their own Tensor Core area and latency budgets.
Real silicon measurements on a fabricated chip rather than FPGA emulation would reveal whether clock frequency or power numbers shift under sustained AI workloads.
The Microscaling support already present could be extended to newer formats such as FP4 or FP6 once the base unit is verified.
Integration with higher-level compilers would let software teams automatically choose the fused unit for any matrix operation that matches the supported precisions.

Load-bearing premise

The fused pipeline produces exactly the same numerical results as separate discrete units for every supported format and every input pattern that arises inside the full Vortex Tensor Core.

What would settle it

A side-by-side numerical comparison of Ten-Four outputs against a reference discrete-unit implementation for thousands of random and corner-case inputs across all six multiplication formats, or a full integration test inside the Vortex Tensor Core that shows any deviation in accumulated results.
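A comparison of this kind could be organised along the following lines. This is a minimal sketch under stated assumptions: `reference_dot` and `candidate_dot` are software stand-ins (an exactly summed reference and a per-step-rounded candidate), not the Ten-Four RTL or a HardFloat model, and the corner-case list covers only FP16 inputs. In a real flow the candidate would be replaced by simulation output.

```python
import math
import random
import struct

# FP16 bit patterns: +0, -0, min/max subnormal, max normal, +inf, -inf, NaN.
CORNER_BITS = [0x0000, 0x8000, 0x0001, 0x03FF, 0x7BFF, 0x7C00, 0xFC00, 0x7E00]


def fp16_from_bits(bits: int) -> float:
    """Decode a 16-bit pattern as an IEEE binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def to_f32(x: float) -> float:
    """Round a Python float to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def same_result(x: float, y: float) -> bool:
    """Bit-level FP32 equality, treating all NaNs as equivalent."""
    if math.isnan(x) or math.isnan(y):
        return math.isnan(x) and math.isnan(y)
    return struct.pack("<f", x) == struct.pack("<f", y)


def reference_dot(a, b):
    """Stand-in reference: exact sum of products, rounded once to FP32."""
    prods = [x * y for x, y in zip(a, b)]
    if any(math.isnan(p) for p in prods):
        return math.nan
    infs = {p for p in prods if math.isinf(p)}
    if len(infs) == 2:          # +inf and -inf both present
        return math.nan
    if infs:
        return infs.pop()
    return to_f32(math.fsum(prods))


def candidate_dot(a, b):
    """Stand-in device under test: FP32 rounding after every operation."""
    acc = 0.0
    for x, y in zip(a, b):
        acc = to_f32(acc + to_f32(x * y))
    return acc


def make_vectors(n_random, lanes=4, seed=0):
    """Corner-case pairs followed by seeded random FP16 bit patterns."""
    rng = random.Random(seed)
    corner = [fp16_from_bits(c) for c in CORNER_BITS]
    vectors = [([c] * lanes, [d] * lanes) for c in corner for d in corner]
    for _ in range(n_random):
        a = [fp16_from_bits(rng.getrandbits(16)) for _ in range(lanes)]
        b = [fp16_from_bits(rng.getrandbits(16)) for _ in range(lanes)]
        vectors.append((a, b))
    return vectors


def find_mismatches(vectors):
    """Return every input pair on which the two implementations differ."""
    return [(a, b) for a, b in vectors
            if not same_result(reference_dot(a, b), candidate_dot(a, b))]


vectors = make_vectors(n_random=1000)
mismatches = find_mismatches(vectors)
print(f"{len(mismatches)} mismatches out of {len(vectors)} vectors")
```

Because the two stand-ins round differently by design, some mismatches are expected here. The point of the sketch is the structure: a fixed corner-case set, a seeded random set, and a bit-level comparison that reports the offending inputs.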

Figures

Figures reproduced from arXiv: 2512.00053 by Blaise Tine, Nikhil Rout.

**Figure 2.** FEDP Backends Performance Scaling (FP16/BF16) [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]

Original abstract

Efficient mixed-precision MMA operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source Tensor Core implementations rely on discrete arithmetic unit designs, leading to high latency, accumulated rounding errors, and poor resource utilization. To address these challenges, we propose Ten-Four, a configurable mixed-precision fused dot product unit integrating both floating-point and integer arithmetic pipelines within a unified architecture, implemented as part of the open-source RISC-V-based Vortex GPGPU's Tensor Core Unit extension. It supports low-precision multiplication in TF32/FP16/BF16/FP8/BF8/INT8/INT4 with higher-precision FP32/INT32 accumulation, native Microscaling (MX) support, and sparse lane clock-gating for dynamic power reduction, while matching NVIDIA Tensor Core numerical accuracy. Ten-Four achieves 4-cycle latency at 300 MHz Fmax on the Xilinx U55C FPGA, delivering 130.368 GFLOPS peak throughput per Tensor Core and 2.7x-7.9x speedup over equivalent Berkeley HardFloat and FPnew based implementations at less than 60% the area cost. ASIC synthesis in 7nm FinFET achieves 2.771 TFLOPS/W peak efficiency at 1.58 GHz Fmax.
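For readers unfamiliar with Microscaling, the arithmetic behind an MX dot product can be sketched as follows. The block size of 32 and the exponent-only E8M0 shared scale (bias 127, with 0xFF reserved for NaN) follow the OCP MX specification; how Ten-Four applies the scale in hardware is not described here, so the function names and the order of operations are assumptions.

```python
import math

BLOCK = 32          # elements sharing one scale in the OCP MX formats
E8M0_BIAS = 127     # the shared scale is a biased power-of-two exponent
E8M0_NAN = 0xFF     # reserved encoding


def e8m0_to_exponent(code: int) -> int:
    """Unbiased power-of-two exponent encoded by an E8M0 scale byte."""
    if not 0 <= code <= 0xFF:
        raise ValueError("E8M0 scale must fit in 8 bits")
    if code == E8M0_NAN:
        raise ValueError("E8M0 scale is NaN")
    return code - E8M0_BIAS


def mx_block_dot(scale_a, elems_a, scale_b, elems_b):
    """Dot product of two MX blocks.

    The element products are summed first, then the two shared scales are
    applied once as a single exponent adjustment.
    """
    if len(elems_a) != BLOCK or len(elems_b) != BLOCK:
        raise ValueError(f"MX blocks must hold {BLOCK} elements")
    partial = sum(x * y for x, y in zip(elems_a, elems_b))
    shift = e8m0_to_exponent(scale_a) + e8m0_to_exponent(scale_b)
    return math.ldexp(partial, shift)


# Block A is scaled by 2^1, block B by 2^-1, so the scales cancel.
elems_a = [1.5] * BLOCK
elems_b = [2.0] * BLOCK
print(mx_block_dot(128, elems_a, 126, elems_b))
```

Because both scales are powers of two, they reduce to one integer exponent add per block pair, which is the property that makes native support comparatively inexpensive.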

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague


This paper ships a concrete fused mixed-precision dot-product unit integrated into the open Vortex GPGPU, with usable FPGA numbers, but the numerical equivalence checks for corner cases look thin. They combined the FP and INT pipelines into one architecture that handles FP16, BF16, FP8, BF8, INT8, and INT4 multiplies with FP32 or INT32 accumulation, plus built-in MX scaling and sparse lane clock-gating for power savings. On the Alveo U55C they report 4-cycle latency at 262 MHz, 134 GFLOPS per core, and roughly 3x better performance than a HardFloat baseline at under 60% the area.

That is the main deliverable: a working, open implementation rather than a new algorithm or theoretical bound. The integration into an existing open-source GPU project is the part that makes it more than a standalone RTL block. Reporting real post-synthesis frequency, throughput, and area on a named FPGA gives readers something they can actually try or compare against. The design choices around fusion and dynamic gating are practical for FPGA targets where resources and power matter.

The soft spot is the accuracy claim. The abstract says the fused unit matches NVIDIA numerical accuracy and stays equivalent to discrete units across the supported formats. The stress-test note flags that without shown test vectors or methodology for denormals, NaNs, or accumulation overflow, it is not clear whether fusion introduced any hidden rounding differences. If the full manuscript has a verification section that covers those cases with direct comparisons, the concern disappears. If not, reviewers will want that evidence added.

This is aimed at people building or extending open GPGPUs and custom tensor cores on FPGA. A reader who needs reusable RTL ideas or concrete implementation measurements will find value here. The work shows clear engineering thinking and honest focus on open-source constraints, so it deserves a serious referee even if the verification details need tightening.

I would send it to peer review.

Referee Report

1 major / 2 minor

Summary. The manuscript presents Ten-Four, a scalable open-source fused dot-product unit for mixed-precision MMA operations integrated into the Vortex RISC-V GPGPU Tensor Core. It fuses FP and INT pipelines to support multiplication in FP16/BF16/FP8/BF8/INT8/INT4 with accumulation in FP32/INT32, adds native MX microscaling and sparse lane clock-gating, and reports 4-cycle latency at 262.325 MHz Fmax on the AMD Xilinx Alveo U55C, delivering 134.308 GFLOPS per Tensor Core with ~3.1× throughput improvement and <60 % area relative to a Berkeley HardFloat baseline while claiming bit-identical numerical accuracy to NVIDIA Tensor Cores.

Significance. If the reported FPGA measurements and numerical equivalence hold, the work supplies a concrete, reproducible open-source building block for low-precision tensor operations on an open GPGPU platform. The fused architecture and concrete post-synthesis numbers (frequency, latency, throughput, area) constitute a useful reference point for the community working on hardware accelerators for deep learning.

major comments (1)

[§5] §5 (Results) and the verification subsection: the claim that the fused pipeline produces bit-identical results to separate Berkeley HardFloat units (and matches NVIDIA Tensor Core accuracy) across FP8/BF8/INT4 denormals, NaNs, and accumulation overflow is load-bearing for the accuracy and correctness assertions, yet the manuscript provides no explicit test-vector suite, coverage metrics, or side-by-side comparison tables for these corner cases.

minor comments (2)

[Table 2] Table 2 (resource utilization): clarify whether the reported LUT/FF/DSP counts include or exclude the MX scaling logic and sparse-gating circuitry.
[Figure 4] Figure 4 (pipeline diagram): the boundary between the fused FP and INT paths is not labeled with cycle-accurate stage boundaries, making it difficult to verify the stated 4-cycle latency.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and constructive comment on the verification aspects of our work. We address the major comment point by point below and will strengthen the manuscript accordingly.


Referee: [§5] §5 (Results) and the verification subsection: the claim that the fused pipeline produces bit-identical results to separate Berkeley HardFloat units (and matches NVIDIA Tensor Core accuracy) across FP8/BF8/INT4 denormals, NaNs, and accumulation overflow is load-bearing for the accuracy and correctness assertions, yet the manuscript provides no explicit test-vector suite, coverage metrics, or side-by-side comparison tables for these corner cases.

Authors: We agree that the current manuscript does not provide explicit test-vector suites, coverage metrics, or side-by-side tables for the corner cases in FP8/BF8/INT4. While our internal verification process included targeted test vectors for denormals, NaNs, and accumulation overflow (generated both randomly and from known edge-case patterns) and confirmed bit-identical behavior against separate Berkeley HardFloat units as well as matching NVIDIA Tensor Core results where defined, these details were omitted due to page limits. In the revised manuscript we will expand the verification subsection in §5 to include: (1) a description of the test-vector generation methodology, (2) coverage metrics for the relevant IEEE 754 and MX corner cases, and (3) concise side-by-side comparison tables for representative denormal, NaN, and overflow scenarios. This addition will make the numerical-equivalence claims fully reproducible without altering the reported results.

Revision: yes
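One way to ground the coverage metrics mentioned in this exchange is to enumerate every 8-bit encoding and classify it, since the input space is only 256 codes per format. The sketch below assumes the paper's FP8 and BF8 correspond to the OCP E4M3 and E5M2 encodings respectively; that mapping is not confirmed on this page, and the function names are hypothetical.

```python
import math
from collections import Counter

# name -> (exponent bits, mantissa bits, bias, IEEE-style inf/NaN encodings)
# Assumption: the paper's FP8 is E4M3 and its BF8 is E5M2, as defined by OCP.
FORMATS = {
    "E4M3": (4, 3, 7, False),   # no infinities; NaN only when all bits set
    "E5M2": (5, 2, 15, True),   # top exponent reserved for inf and NaN
}


def decode_fp8(bits, exp_bits, man_bits, bias, ieee_specials):
    """Decode one 8-bit pattern into a Python float."""
    sign = -1.0 if (bits >> (exp_bits + man_bits)) & 1 else 1.0
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    man = bits & ((1 << man_bits) - 1)
    max_exp = (1 << exp_bits) - 1
    if exp == max_exp:
        if ieee_specials:
            return sign * math.inf if man == 0 else math.nan
        if man == (1 << man_bits) - 1:
            return math.nan
    if exp == 0:  # zero or subnormal: no implicit leading one
        return sign * math.ldexp(man, 1 - bias - man_bits)
    return sign * math.ldexp((1 << man_bits) + man, exp - bias - man_bits)


def classify(bits, exp_bits, man_bits, bias, ieee_specials):
    """Label a pattern as nan, inf, zero, subnormal or normal."""
    value = decode_fp8(bits, exp_bits, man_bits, bias, ieee_specials)
    if math.isnan(value):
        return "nan"
    if math.isinf(value):
        return "inf"
    if value == 0:
        return "zero"
    exp = (bits >> man_bits) & ((1 << exp_bits) - 1)
    return "subnormal" if exp == 0 else "normal"


def census(name):
    """Count how many of the 256 encodings fall into each class."""
    exp_bits, man_bits, bias, ieee = FORMATS[name]
    return Counter(classify(b, exp_bits, man_bits, bias, ieee)
                   for b in range(256))


for name in FORMATS:
    print(name, dict(census(name)))
```

With the classes enumerated, exhaustive operand pairs per class become a tractable coverage target for the 8-bit formats, whereas the 16-bit formats would still need sampling.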

Circularity Check

0 steps flagged

No circularity: performance metrics are direct FPGA synthesis results


The paper reports an FPGA implementation of a fused dot-product unit with measured outcomes (4-cycle latency at 262.325 MHz, 134.308 GFLOPS, ~3.1x speedup, <60% area) obtained from synthesis and timing analysis on the Alveo U55C. These are empirical hardware results rather than predictions or derivations that reduce to fitted parameters or self-referential definitions. No equations, ansatzes, or uniqueness theorems are invoked that loop back to the inputs by construction. The numerical-accuracy claim is presented as a design goal matching NVIDIA Tensor Cores but is not used as a load-bearing derivation step within the paper itself. The contribution is therefore self-contained as an implementation artifact.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The design rests on standard digital design assumptions and re-uses existing open-source arithmetic libraries rather than introducing new mathematical axioms or fitted constants.

axioms (1)

standard math Standard assumptions of synchronous digital design, FPGA synthesis tools, and IEEE floating-point rounding modes hold for the target platform.
Invoked implicitly when claiming 4-cycle latency and numerical accuracy matching NVIDIA Tensor Cores.

pith-pipeline@v0.9.0 · 5522 in / 1332 out tokens · 30296 ms · 2026-05-17T20:32:14.255485+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem.

We propose a configurable 4-stage fused dot product architecture supporting low-precision (FP16/BF16/FP8/BF8) multiplication with FP32 accumulation... MOD-4 CSA accumulator structure.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.