pith. machine review for the scientific record.

arxiv: 2605.04803 · v1 · submitted 2026-05-06 · 💻 cs.AR

Recognition: unknown

Not All Faults Are Equal: Transient-Fault Sensitivity Characterization of an Open-Source RISC-V Vector Cluster

Amirhossein Kiamarzi, Angelo Garofalo, Davide Rossi, Maoyuan Cai

Pith reviewed 2026-05-08 16:27 UTC · model grok-4.3

classification 💻 cs.AR
keywords transient faults · RISC-V · vector cluster · fault injection · matrix multiplication · silent data corruption · SEU · SET

The pith

Faulty data corruption dominates transient fault outcomes in an open-source RISC-V vector cluster for matrix workloads.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines transient fault sensitivity in the Spatz RISC-V vector cluster through large-scale injection experiments. It applies SET and SEU fault models to six MatMul and widening MatMul configurations and tracks how faults propagate into system behavior. Faulty data corruption emerges as the primary manifesting error across all cases. The work also compares error severity across floating-point precisions and pinpoints which hardware modules and bit positions drive the worst effects.

Core claim

Across 100,000 fault injections, faulty data corruption accounts for at least 86 percent of manifesting errors under SET and at least 91 percent under SEU. SET sensitivity concentrates in the vector execution path, while the TCDM contributes most to faulty data corruption. Among formats, FP8 produces the smallest average corruption spread and RMSE; widening reduces both metrics for FP16 but has limited effect on FP8. Corruptions hitting exponent bits cause the largest output deviations, especially in FP32 and BF16.
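
The percentages above are shares of manifesting errors, not of all injections, so injections with no observable effect drop out of the denominator. The sketch below illustrates that bookkeeping under stated assumptions: the function name `manifesting_shares`, the label names ("masked", "FD", "other"), and the example counts are hypothetical and are not data or terminology confirmed by the paper.

```python
from collections import Counter

def manifesting_shares(outcomes, masked_label="masked"):
    """Return each outcome's share among injections that manifested.

    `outcomes` holds one label per injection. Injections tagged with
    `masked_label` (an assumed name for "no observable effect") are
    excluded from the denominator.
    """
    counts = Counter(o for o in outcomes if o != masked_label)
    total = sum(counts.values())
    if total == 0:
        return {}  # nothing manifested, so no shares to report
    return {label: n / total for label, n in counts.items()}

# Made-up labels purely for illustration; not results from the paper.
outcomes = ["masked"] * 6 + ["FD"] * 3 + ["other"] * 1
shares = manifesting_shares(outcomes)
```

With these invented labels, 3 of the 4 manifesting injections are "FD", even though "FD" is only 3 of 10 injections overall; the two denominators can give very different impressions.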

What carries the argument

Large-scale SET and SEU fault-injection campaigns that classify outcomes into faulty data corruption, silent data corruption, and other categories, then quantify SDC severity via corrupted output count and RMSE across FP32/FP16/BF16/FP8.
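
The two severity metrics named here can be sketched for a single faulty run as follows. This is a minimal illustration, not the authors' implementation: it assumes an absolute (un-normalized) RMSE taken over every output element, and the function name `sdc_severity` and the tolerance parameter `tol` are hypothetical.

```python
import math

def sdc_severity(golden, faulty, tol=0.0):
    """Return (corrupted_count, rmse) for one faulty run.

    corrupted_count: outputs whose deviation from the golden value
    exceeds `tol` (NaN outputs also count as corrupted).
    rmse: absolute root-mean-square error over all outputs
    (an assumption; the paper may normalize differently).
    """
    if len(golden) != len(faulty):
        raise ValueError("golden and faulty outputs differ in length")
    if not golden:
        raise ValueError("no outputs to compare")
    corrupted = 0
    sq_err = 0.0
    for g, f in zip(golden, faulty):
        diff = f - g
        if not abs(diff) <= tol:  # written this way so NaN is flagged
            corrupted += 1
        sq_err += diff * diff
    return corrupted, math.sqrt(sq_err / len(golden))

# Invented values for illustration only.
golden = [1.0, 2.0, 3.0, 4.0]
faulty = [1.0, 2.0, 3.5, 2.0]
count, rmse = sdc_severity(golden, faulty)
```

Reporting the two numbers together separates how far an error spreads (the count) from how large it is (the RMSE), which is the distinction the precision comparison relies on.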

If this is right

  • Protection resources should focus first on the vector execution path and the TCDM rather than uniform coverage of the entire cluster.
  • Lower-precision formats such as FP8 can limit the spatial spread and magnitude of output errors in matrix workloads.
  • Widening MatMul operations offer a measurable reduction in corruption severity for FP16 but little benefit for FP8.
  • Exponent datapaths warrant selective hardening because they produce the largest RMSE deviations when struck.
  • Workload-specific fault-tolerance policies become practical once the dominant error mode and high-impact sites are known.
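
Why exponent hits dominate the deviation can be seen from the IEEE 754 binary32 layout (1 sign bit, 8 exponent bits, 23 mantissa bits). The sketch below flips a single bit of an FP32 value in software; it only illustrates the number format and is not the RTL injection flow used in the paper. The helper name `flip_bit_fp32` is hypothetical.

```python
import struct

def flip_bit_fp32(value, bit):
    """Flip one bit of an IEEE 754 binary32 value.

    Bit 0 is the least significant mantissa bit, bits 23-30 are the
    exponent, and bit 31 is the sign.
    """
    if not 0 <= bit <= 31:
        raise ValueError("bit index must be in 0..31")
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return flipped

x = 0.5
mantissa_hit = flip_bit_fp32(x, 0)   # lowest mantissa bit
exponent_hit = flip_bit_fp32(x, 30)  # highest exponent bit
sign_hit = flip_bit_fp32(x, 31)      # sign bit
```

For `x = 0.5`, the mantissa flip changes the value by 2^-24, while the exponent flip turns it into 2^127. BF16 keeps the same 8-bit exponent as FP32, which is consistent with both formats showing the largest deviations in the reported results.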

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Designers of vector-based AI accelerators could trade uniform error detection for targeted data-path integrity checks to improve efficiency.
  • The precision-dependent severity patterns suggest that mixed-precision scheduling might also serve as a lightweight fault-mitigation technique.
  • Repeating the study on other common kernels such as convolutions or reductions would test whether the same module and format rankings hold.
  • If the dominance of faulty data corruption persists on silicon, system-level recovery strategies could shift from restart to targeted data scrubbing.

Load-bearing premise

The simulation-based SET and SEU models and the selected injection sites inside the cluster accurately stand in for the physical transient faults that would occur in real silicon.

What would settle it

Fabricating the Spatz cluster and repeating the 100,000-injection campaigns on actual hardware to check whether faulty data corruption still accounts for at least 86 percent of manifesting errors.

Figures

Figures reproduced from arXiv: 2605.04803 by Amirhossein Kiamarzi, Angelo Garofalo, Davide Rossi, Maoyuan Cai.

Figure 1: Strobe Positions of Spatz.
Excerpt from the surrounding text: …perform a finer-grained analysis at functional-unit level for the vector co-processor system. To obtain soft-error sensitivity data, we use RTL fault injection with Synopsys VC Z01X [9]. It performs concurrent fault simulation by tracking divergences between a fault-free good machine (GM) and faulty machines (FMs), enabling efficient campaign-scale evaluation at RTL. We consider…
Figure 3: Module sensitivity to SETs. Axis: fault rate (%). Outcome classes: FS, FD. Workloads: FP32-MM, FP16-MM, BP16-MM, FP16-WMM, FP8-MM, FP8-WMM. Modules: 0: TCDM, 1: Snitch, 2: Snitch_I$, 3: VSLDU, 4: VFU, 5: VLSU, 6: Controller, 7: VRF.
Figure 4: Module sensitivity to SEUs.
Excerpt from the surrounding text (Section 4.1, Module-level sensitivity): This subsection presents the module-level fault sensitivity analysis of Spatz under the considered fault models and workloads. The evaluation covers six workload configurations: FP32 MatMul, FP16 MatMul, BP16 MatMul, FP16 Widening MatMul, FP8 MatMul, and FP8 Widening MatMul. We distinguish the results by fault type, considering SEU injectio…
Figure 5: RMSE versus average number of corrupted outputs.
Original abstract

We present a transient-fault sensitivity study of the open-source RISC-V vector cluster Spatz under SET and SEU fault models. Across 100,000 fault injections on six MatMul and Widening MatMul configurations, faulty data corruption (FD) is the dominant manifesting outcome for all evaluated workloads, accounting for at least 86% of manifesting errors in the SET campaigns and at least 91% in the SEU campaigns. At the module level, SET sensitivity is concentrated in the vector execution path, while TCDM is the major contributor to FD manifestations. We further quantify SDC severity across FP32, FP16, BP16, and FP8 by analyzing both the average number of corrupted outputs and their RMSE. FP8 shows the lowest output impact overall, while FP16 Widening MatMul reduces both corruption spread and RMSE compared with FP16 MatMul. By contrast, the effect of widening on FP8 is limited in our experiments. Finally, exponent-targeted corruptions induce the most severe SDC events, with the largest deviations observed in FP32 and BP16, motivating selective protection of the highest-impact datapaths and fault cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The manuscript presents an empirical transient-fault sensitivity characterization of the open-source Spatz RISC-V vector cluster. Using 100,000 SET and SEU fault injections across six MatMul and Widening MatMul kernel configurations, it reports that faulty data corruption (FD) is the dominant manifesting outcome (at least 86% of manifesting errors under SET and at least 91% under SEU). The study further localizes sensitivity at the module level (vector execution path for SET, TCDM for FD), quantifies SDC severity via corrupted-output count and RMSE across FP32/FP16/BF16/FP8 precisions, and identifies exponent-bit corruptions as producing the most severe SDCs, with FP8 showing the lowest overall impact.

Significance. If the empirical counts hold under the stated fault models, the work supplies concrete, large-scale injection data on vector-unit reliability that can directly inform selective hardening strategies in AI accelerators and embedded vector processors. The open-source target and breakdown by precision and corruption type are particular strengths; the observational nature of the central percentages avoids circularity or untested derivations.

major comments (1)
  1. Methodology section: the manuscript must explicitly define the error classification rules that map injection outcomes to FD versus other categories (e.g., crash, timeout, SDC), including any thresholds or post-injection checks; without these rules the reported 86%/91% dominance figures cannot be independently verified or reproduced.
minor comments (3)
  1. Abstract and results sections: the six MatMul configurations should be enumerated with their exact dimensions, data types, and widening factors so readers can assess representativeness.
  2. Results section on SDC severity: clarify whether the reported RMSE values are normalized or absolute and whether they are averaged only over manifesting SDCs or over all injections.
  3. Figure captions (where present): ensure that error-bar or confidence-interval information is stated for all percentage and RMSE plots derived from the 100k-injection campaigns.
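
The confidence-interval information requested in minor comment 3 could be produced for any reported percentage with a standard binomial interval. The sketch below uses the Wilson score interval as one common choice; nothing here indicates which interval, if any, the authors use, and the function name `wilson_interval` and the example counts are hypothetical.

```python
import math

def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion.

    z = 1.96 corresponds to roughly 95% confidence.
    Returns (lower, upper) bounds on the true proportion.
    """
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie in 0..trials")
    p = successes / trials
    z2 = z * z
    denom = 1.0 + z2 / trials
    centre = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    return centre - half, centre + half

# Invented counts: 860 FD outcomes out of 1,000 manifesting errors.
lo, hi = wilson_interval(860, 1000)
```

The interval width depends on the number of manifesting errors in each workload and module bucket, not on the 100,000 total, so sparsely hit modules would carry much wider error bars than the headline figures.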

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and recommendation of minor revision. We address the single major comment below.

Point-by-point responses
  1. Referee: Methodology section: the manuscript must explicitly define the error classification rules that map injection outcomes to FD versus other categories (e.g., crash, timeout, SDC), including any thresholds or post-injection checks; without these rules the reported 86%/91% dominance figures cannot be independently verified or reproduced.

    Authors: We agree that the error classification rules must be stated explicitly to support reproducibility. In the revised manuscript we will add a dedicated paragraph (or subsection) in the Methodology section that defines the precise mapping from injection outcomes to FD, crash, timeout, and SDC categories. The addition will enumerate the post-injection checks performed on the processor state and output buffers, the timeout detection mechanism, the criteria used to distinguish crashes from other silent failures, and any numerical thresholds applied. This change will make the 86%/91% figures independently verifiable.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

This is a purely empirical fault-injection study. The dominant claim (FD as the leading outcome at ≥86%/91%) is a direct observational count from 100k SET/SEU injections on six MatMul variants; no equations, fitted models, derivations, or self-citation chains are present that could reduce the reported percentages to prior results by construction. All module-level sensitivities and SDC severity metrics are likewise raw tallies under explicitly stated injection locations and fault models.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The study rests on standard domain assumptions from fault-injection research rather than new free parameters, axioms, or invented entities.

axioms (2)
  • domain assumption SET and SEU models accurately capture real transient faults in CMOS vector hardware.
    Invoked to justify the 100,000 injection campaigns and module-level sensitivity conclusions.
  • domain assumption The simulation platform faithfully reproduces hardware behavior under injected faults.
    Required for all FD, SDC, and RMSE measurements to be meaningful.

Reference graph

Works this paper leans on

12 extracted references · 1 canonical work pages

  1. [1] Udit Kumar Agarwal et al. 2023. Towards reliability assessment of systolic arrays against stuck-at faults. In 2023 53rd Annual IEEE/IFIP International Conference on Dependable Systems and Networks-Supplemental Volume (DSN-S). IEEE, 230–236

  2. [2] Alessandro Geist et al. 2023. NASA SpaceCube next-generation artificial-intelligence computing for STP-H9-SCENIC on ISS. (2023)

  3. [3] Nishant George et al. 2025. Silent Data Corruption in AI. Technical Report. Open Compute Project

  4. [4] Julian Hoefer et al. 2023. Sifi-ai: A fast and flexible rtl fault simulation framework tailored for ai models and accelerators. In Proceedings of the Great Lakes Symposium on VLSI 2023. 287–292

  5. [5] Leonidas Kosmidis et al. 2019. GPU4S: Embedded GPUs in space. In 2019 22nd Euromicro Conference on Digital System Design (DSD). IEEE, 399–405

  6. [6] Stefan Mach et al. 2020. FPnew: An open-source multiformat floating-point unit architecture for energy-proportional transprecision computing. IEEE Transactions on Very Large Scale Integration (VLSI) Systems 29, 4 (2020), 774–787

  7. [7] Matteo Perotti et al. 2025. Spatz: Clustering compact RISC-V-based vector units to maximize computing efficiency. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems (2025)

  8. [8] Seth Roffe et al. 2024. EdgeCortix SAKURA-I Machine-Learning, PCIe Accelerator SEE Proton Test. Technical Report. NASA Electronic Parts and Packaging (NEPP) Program

  9. [9] Synopsys. 2026. VC Z01X High Performance and Versatile Fault Simulation. http://synopsys.com/verification/simulation/vc-z01x.html

  10. [10] Rafael Billig Tonetto et al. 2026. ENFOR-SA: End-to-end Cross-layer Transient Fault Injector for Efficient and Accurate DNN Reliability Assessment on Systolic Arrays. arXiv preprint arXiv:2602.00909 (2026)

  11. [11] Abhishek Tyagi et al. 2024. Characterizing Soft-Error Resiliency in Arm’s Ethos-U55 Embedded Machine Learning Accelerator. In 2024 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS). IEEE, 96–108

  12. [12] Toon Vinck et al. 2025. Mitigating multiple single-event upsets during deep neural network inference using fault-aware training. Journal of Instrumentation 20, 02 (2025), C02044