Ten-Four: An Open-Source Fused Dot Product Unit for Mixed-Precision GPGPU Tensor Cores
Pith reviewed 2026-05-17 20:32 UTC · model grok-4.3
The pith
Ten-Four fuses floating-point and integer pipelines into one dot-product unit that runs mixed-precision matrix operations in four cycles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Ten-Four integrates the floating-point and integer arithmetic pipelines in a single fused architecture. It supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/INT4 with higher-precision accumulation in FP32/INT32, plus native Microscaling and sparse lane clock-gating. On the AMD Xilinx Alveo U55C FPGA it achieves 4-cycle operation latency at 262.325 MHz Fmax and 134.308 GFLOPS peak throughput per Tensor Core. It delivers approximately 3.1 times the performance of an equivalent Berkeley HardFloat-based implementation at less than 60 percent of the area cost, while matching NVIDIA Tensor Core numerical accuracy.
What carries the argument
A single fused dot-product architecture that merges floating-point and integer pipelines to perform multiplication and accumulation without intermediate rounding or separate units.
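The effect of skipping intermediate rounding can be illustrated with a small software model. This is a minimal sketch, not the Ten-Four datapath: `round_to_fp32`, `discrete_dot`, and `fused_dot` are hypothetical helper names, and exact rational arithmetic stands in for a wide internal accumulator. It shows only that rounding once at the end can give a different FP32 result than rounding after every multiply and add.

```python
import struct
from fractions import Fraction


def round_to_fp32(x):
    """Round a Python float (binary64) to the nearest binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def discrete_dot(a, b, c):
    """Chain of separate multiply and add units: round to FP32 after every step."""
    acc = round_to_fp32(c)
    for x, y in zip(a, b):
        prod = round_to_fp32(x * y)
        acc = round_to_fp32(acc + prod)
    return acc


def fused_dot(a, b, c):
    """Accumulate exactly, then round once at the end (assumed fused behaviour)."""
    exact = Fraction(c)
    for x, y in zip(a, b):
        exact += Fraction(x) * Fraction(y)
    # float() rounds to binary64 first; that is exact for the inputs below,
    # but in general it is a second rounding a real unit would not perform.
    return round_to_fp32(float(exact))


# 2**-24 is the smallest FP16 subnormal, so every input is a valid FP16 value.
a = [1.0, 2.0**-24, 2.0**-24]
b = [1.0, 1.0, 1.0]

discrete_result = discrete_dot(a, b, 0.0)  # each tiny addend is rounded away in turn
fused_result = fused_dot(a, b, 0.0)        # the two addends survive together
```

With these inputs the discrete chain returns 1.0, because each addition of 2^-24 to 1.0 is a tie that rounds back to 1.0, while the single final rounding returns 1 + 2^-23.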
If this is right
- Matrix-multiply-accumulate operations inside open-source GPGPUs can now complete in four cycles instead of the higher latency of discrete units.
- Resource utilization improves because a single pipeline replaces multiple separate arithmetic blocks.
- Dynamic power drops further through built-in sparse lane clock-gating when many lanes are inactive.
- Designers gain an open-source drop-in unit that already matches commercial Tensor Core accuracy for mixed-precision workloads.
- The same fused structure scales to additional low-precision formats without redesigning separate adders or multipliers.
Where Pith is reading between the lines
- Other open-source GPU projects could adopt the same fused pipeline to reduce their own Tensor Core area and latency budgets.
- Real silicon measurements on a fabricated chip rather than FPGA emulation would reveal whether clock frequency or power numbers shift under sustained AI workloads.
- The Microscaling support already present could be extended to newer formats such as FP4 or FP6 once the base unit is verified.
- Integration with higher-level compilers would let software teams automatically choose the fused unit for any matrix operation that matches the supported precisions.
Load-bearing premise
The fused pipeline produces exactly the same numerical results as separate discrete units for every supported format and every input pattern that arises inside the full Vortex Tensor Core.
What would settle it
A side-by-side numerical comparison of Ten-Four outputs against a reference discrete-unit implementation for thousands of random and corner-case inputs across all six multiplication formats, or a full integration test inside the Vortex Tensor Core that shows any deviation in accumulated results.
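A comparison of this kind could be organised roughly as below. This is a sketch under stated assumptions: `reference_dot` and `flushing_dot` are hypothetical stand-ins (the second carries a deliberate subnormal-flushing bug so the harness has something to find), only FP16 inputs are covered, and results are compared by FP32 bit pattern with all NaNs treated as equal. A real test would substitute the RTL simulation output and a trusted reference model.

```python
import itertools
import random
import struct

# FP16 bit patterns: +0, -0, min/max subnormal, min/max normal, +inf, -inf, quiet NaN.
FP16_CORNERS = [0x0000, 0x8000, 0x0001, 0x03FF, 0x0400, 0x7BFF, 0x7C00, 0xFC00, 0x7E00]


def fp16_from_bits(bits):
    """Decode a 16-bit pattern as an IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<H", bits))[0]


def fp32_key(x):
    """FP32 bit pattern of x, with every NaN mapped to one canonical pattern."""
    if x != x:
        return 0x7FC00000
    return struct.unpack("<I", struct.pack("<f", x))[0]


def reference_dot(a, b, c):
    """Hypothetical reference model: plain Python float accumulation."""
    acc = c
    for x, y in zip(a, b):
        acc += x * y
    return acc


def flushing_dot(a, b, c):
    """Hypothetical design under test with a deliberate bug: FP16 subnormals flush to zero."""
    def flush(v):
        return 0.0 if 0.0 < abs(v) < 2.0**-14 else v
    return reference_dot([flush(x) for x in a], [flush(y) for y in b], c)


def build_vectors(n_random=1000, lanes=4, seed=0):
    """All corner-case pairs as one-lane products, plus seeded random multi-lane vectors."""
    corners = [fp16_from_bits(p) for p in FP16_CORNERS]
    vectors = [((x,), (y,), 0.0) for x, y in itertools.product(corners, corners)]
    rng = random.Random(seed)
    for _ in range(n_random):
        a = tuple(fp16_from_bits(rng.getrandbits(16)) for _ in range(lanes))
        b = tuple(fp16_from_bits(rng.getrandbits(16)) for _ in range(lanes))
        vectors.append((a, b, 0.0))
    return vectors


def find_mismatches(dut, ref, vectors):
    """Return every input vector on which the two models disagree bit-for-bit."""
    return [v for v in vectors if fp32_key(dut(*v)) != fp32_key(ref(*v))]


vectors = build_vectors()
self_check = find_mismatches(reference_dot, reference_dot, vectors)  # expected to be empty
bug_hits = find_mismatches(flushing_dot, reference_dot, vectors)     # non-empty: bug is caught
```

Comparing bit patterns rather than float values keeps +0 and -0 distinct and lets NaN results count as a match, which ordinary `==` would not.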
Original abstract
Efficient mixed-precision MMA operations are critical for accelerating deep learning workloads on GPGPUs. However, existing open-source Tensor Core implementations rely on discrete arithmetic unit designs, leading to high latency, accumulated rounding errors, and poor resource utilization. To address these challenges, we propose Ten-Four, a configurable mixed-precision fused dot product unit integrating both floating-point and integer arithmetic pipelines within a unified architecture, implemented as part of the open-source RISC-V-based Vortex GPGPU's Tensor Core Unit extension. It supports low-precision multiplication in TF32/FP16/BF16/FP8/BF8/INT8/INT4 with higher-precision FP32/INT32 accumulation, native Microscaling (MX) support, and sparse lane clock-gating for dynamic power reduction, while matching NVIDIA Tensor Core numerical accuracy. Ten-Four achieves 4-cycle latency at 300 MHz Fmax on the Xilinx U55C FPGA, delivering 130.368 GFLOPS peak throughput per Tensor Core and 2.7x-7.9x speedup over equivalent Berkeley HardFloat and FPnew based implementations at less than 60% the area cost. ASIC synthesis in 7nm FinFET achieves 2.771 TFLOPS/W peak efficiency at 1.58 GHz Fmax.
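For readers unfamiliar with Microscaling, the sketch below shows how a dot product of two MX blocks can be evaluated. It assumes the OCP MX v1.0 convention of a shared 8-bit power-of-two scale per block (E8M0, bias 127, 0xFF reserved for NaN). The function name `mx_block_dot` is hypothetical, the elements are plain Python floats rather than decoded FP8 values, and nothing here is taken from the Ten-Four implementation. The point is only that both scales factor out of the sum, so scaling reduces to one exponent adjustment.

```python
import math

E8M0_BIAS = 127   # exponent bias of the MX shared scale (OCP MX v1.0)
E8M0_NAN = 0xFF   # scale encoding reserved for NaN


def mx_block_dot(scale_a, elems_a, scale_b, elems_b):
    """Dot product of two MX blocks given their E8M0 scale codes and element values."""
    if scale_a == E8M0_NAN or scale_b == E8M0_NAN:
        return math.nan
    # Sum of element products, with no scaling applied yet.
    partial = sum(x * y for x, y in zip(elems_a, elems_b))
    # Both scales are powers of two, so they combine into a single exponent offset.
    # (math.ldexp raises OverflowError if the result exceeds the binary64 range.)
    return math.ldexp(partial, (scale_a - E8M0_BIAS) + (scale_b - E8M0_BIAS))


# Two-element blocks for brevity; the concrete MX formats use 32 elements per block.
same_scale = mx_block_dot(128, [1.0, 2.0], 126, [3.0, 0.5])  # 2**1 * 2**-1 * 4.0
scaled_up = mx_block_dot(129, [1.0, 2.0], 127, [3.0, 0.5])   # 2**2 * 2**0 * 4.0
```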
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Ten-Four, a scalable open-source fused dot-product unit for mixed-precision MMA operations integrated into the Vortex RISC-V GPGPU Tensor Core. It fuses FP and INT pipelines to support multiplication in FP16/BF16/FP8/BF8/INT8/INT4 with accumulation in FP32/INT32, adds native MX microscaling and sparse lane clock-gating, and reports 4-cycle latency at 262.325 MHz Fmax on the AMD Xilinx Alveo U55C, delivering 134.308 GFLOPS per Tensor Core with ~3.1× throughput improvement and <60 % area relative to a Berkeley HardFloat baseline while claiming bit-identical numerical accuracy to NVIDIA Tensor Cores.
Significance. If the reported FPGA measurements and numerical equivalence hold, the work supplies a concrete, reproducible open-source building block for low-precision tensor operations on an open GPGPU platform. The fused architecture and concrete post-synthesis numbers (frequency, latency, throughput, area) constitute a useful reference point for the community working on hardware accelerators for deep learning.
major comments (1)
- [§5] §5 (Results) and the verification subsection: the claim that the fused pipeline produces bit-identical results to separate Berkeley HardFloat units (and matches NVIDIA Tensor Core accuracy) across FP8/BF8/INT4 denormals, NaNs, and accumulation overflow is load-bearing for the accuracy and correctness assertions, yet the manuscript provides no explicit test-vector suite, coverage metrics, or side-by-side comparison tables for these corner cases.
minor comments (2)
- [Table 2] Table 2 (resource utilization): clarify whether the reported LUT/FF/DSP counts include or exclude the MX scaling logic and sparse-gating circuitry.
- [Figure 4] Figure 4 (pipeline diagram): the boundary between the fused FP and INT paths is not labeled with cycle-accurate stage boundaries, making it difficult to verify the stated 4-cycle latency.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive comment on the verification aspects of our work. We address the major comment point by point below and will strengthen the manuscript accordingly.
- Referee: [§5] §5 (Results) and the verification subsection: the claim that the fused pipeline produces bit-identical results to separate Berkeley HardFloat units (and matches NVIDIA Tensor Core accuracy) across FP8/BF8/INT4 denormals, NaNs, and accumulation overflow is load-bearing for the accuracy and correctness assertions, yet the manuscript provides no explicit test-vector suite, coverage metrics, or side-by-side comparison tables for these corner cases.
  Authors: We agree that the current manuscript does not provide explicit test-vector suites, coverage metrics, or side-by-side tables for the corner cases in FP8/BF8/INT4. While our internal verification process included targeted test vectors for denormals, NaNs, and accumulation overflow (generated both randomly and from known edge-case patterns) and confirmed bit-identical behavior against separate Berkeley HardFloat units as well as matching NVIDIA Tensor Core results where defined, these details were omitted due to page limits. In the revised manuscript we will expand the verification subsection in §5 to include: (1) a description of the test-vector generation methodology, (2) coverage metrics for the relevant IEEE 754 and MX corner cases, and (3) concise side-by-side comparison tables for representative denormal, NaN, and overflow scenarios. This addition will make the numerical-equivalence claims fully reproducible without altering the reported results.
  revision: yes
Circularity Check
No circularity: performance metrics are direct FPGA synthesis results
The paper reports an FPGA implementation of a fused dot-product unit with measured outcomes (4-cycle latency at 262.325 MHz, 134.308 GFLOPS, ~3.1x speedup, <60% area) obtained from synthesis and timing analysis on the Alveo U55C. These are empirical hardware results rather than predictions or derivations that reduce to fitted parameters or self-referential definitions. No equations, ansatzes, or uniqueness theorems are invoked that loop back to the inputs by construction. The numerical-accuracy claim is presented as a design goal matching NVIDIA Tensor Cores but is not used as a load-bearing derivation step within the paper itself. The contribution is therefore self-contained as an implementation artifact.
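The quoted frequency and throughput can be cross-checked against each other with simple arithmetic. The sketch below divides peak throughput by clock frequency to get the implied floating-point operations per cycle; `implied_flops_per_cycle` is a hypothetical helper name. Reading the result as a lane count is an assumption, since the text above does not state the counting convention or the Tensor Core configuration.

```python
def implied_flops_per_cycle(gflops, fmax_mhz):
    """Operations per clock cycle implied by a peak throughput and a clock frequency."""
    return (gflops * 1e9) / (fmax_mhz * 1e6)


# Figures quoted above: 134.308 GFLOPS at 262.325 MHz.
per_cycle = implied_flops_per_cycle(134.308, 262.325)
```

The result is within a small fraction of 512 operations per cycle. If one multiply-accumulate is counted as two operations (an assumption), that would correspond to about 256 multiply-accumulates per cycle per Tensor Core.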
Axiom & Free-Parameter Ledger
axioms (1)
- standard math: Standard assumptions of synchronous digital design, FPGA synthesis tools, and IEEE floating-point rounding modes hold for the target platform.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction (unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  "We propose a configurable 4-stage fused dot product architecture supporting low-precision (FP16/BF16/FP8/BF8) multiplication with FP32 accumulation... MOD-4 CSA accumulator structure."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.