pith. machine review for the scientific record.

arxiv: 2605.07417 · v1 · submitted 2026-05-08 · 💻 cs.AR · cs.LG

Recognition: no theorem link

Effective and Memory-Efficient Alternatives to ECC for Reliable Large-Scale DNNs

Jaan Raik, Luca Di Mauro, Marco Restifo, Marten Roots, Mohammad Hasan Ahmadilivani, Sven-Markus Loorits

Pith reviewed 2026-05-11 02:00 UTC · model grok-4.3

keywords DNN reliability · ECC alternatives · CNN protection · Vision Transformers · transient faults · memory efficiency · fault tolerance · hardware overhead

The pith

Selective bit protection methods outperform ECC in reliability for large CNNs and ViTs while using no extra memory and less hardware area.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern deep learning models in safety-critical settings are vulnerable to transient hardware faults that corrupt model parameters stored in memory. This paper proposes two alternatives to traditional error correction codes: MSET, which hardens only the most vulnerable bits in CNN and ViT parameters, and CEP, which applies fine-grained protection to all bits. Experiments show these approaches deliver better fault tolerance than SECDED ECC schemes, with no memory overhead and reduced area and delay costs. For vision transformers, protecting only the highest exponent bits in FP16 and FP32 formats proves sufficient. If correct, this would allow more reliable large-scale AI systems without the hardware penalties of current protection methods.

Core claim

MSET and CEP are lightweight alternatives to ECC that enhance the reliability of large CNNs and ViTs by selectively (MSET) or comprehensively (CEP) safeguarding parameter bits. They mostly outperform SECDED schemes with no memory overhead and considerably lower area and delay, and ViTs can be protected effectively by guarding only their highest exponent bits in FP16 and FP32. CEP further guarantees resilience at bit error rates up to one order of magnitude higher, with 3.5x lower area overhead and a 7x faster decoder than SECDED ECC.

What carries the argument

MSET for selective hardening of the most vulnerable bits identified via ranking and CEP for fine-grained protection across all parameter bits, both applied to floating-point representations of CNN and ViT weights to mitigate transient faults.
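To see why exponent bits matter so much in floating-point weights, the sketch below flips single bits in the IEEE 754 encoding of one value. It is an illustration only: the helper `flip_bit` and the weight value 0.5 are assumptions chosen for the example, not taken from the paper.

```python
import struct

# Hypothetical helper for illustration; not the paper's tooling.
def flip_bit(value: float, bit: int, fmt: str = "f") -> float:
    """Flip one bit of `value` encoded as FP32 (fmt="f") or FP16 (fmt="e").

    Bit 0 is the mantissa LSB; the top bit is the sign.
    """
    int_fmt = {"f": "I", "e": "H"}[fmt]  # unsigned integer of the same width
    (raw,) = struct.unpack("<" + int_fmt, struct.pack("<" + fmt, value))
    (out,) = struct.unpack("<" + fmt, struct.pack("<" + int_fmt, raw ^ (1 << bit)))
    return out

w = 0.5  # illustrative weight value (assumption)
fp32_top_exponent = flip_bit(w, 30)       # 2**127, about 1.7e38
fp32_low_mantissa = flip_bit(w, 0)        # 0.5 + 2**-24
fp16_top_exponent = flip_bit(w, 14, "e")  # 32768.0
```

A flip in the highest exponent bit turns a small weight into an enormous one, while a flip in the lowest mantissa bit is barely visible, which is consistent with the idea of hardening only a few bit positions.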

If this is right

  • Large CNNs and ViTs achieve higher reliability than with conventional SECDED ECC without increasing memory usage.
  • Vision transformers require protection only on the highest exponent bits in FP16 and FP32 representations for effective fault tolerance.
  • CEP enables DNN resilience at bit error rates up to ten times higher than SECDED while using 3.5 times less area and decoding seven times faster.
  • Hardware implementations incur considerably lower area and delay than ECC-based schemes.
  • Reliable operation of memory-intensive safety-critical DL workloads becomes feasible with reduced silicon costs.
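For context on the memory overhead these points compare against, a standard Hamming-based SECDED code over k data bits needs the smallest r with 2^r >= k + r + 1, plus one overall parity bit. The sketch below computes that count. The word sizes are illustrative assumptions, since this summary does not state which SECDED configuration the paper evaluates.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits for a Hamming SECDED code protecting `data_bits` bits."""
    r = 1
    # Hamming bound for single-error correction: 2**r >= k + r + 1.
    while 2 ** r < data_bits + r + 1:
        r += 1
    return r + 1  # one extra overall parity bit adds double-error detection

# Illustrative word sizes (assumption, not from the paper).
overhead = {k: secded_check_bits(k) / k for k in (16, 32, 64)}
for k, frac in overhead.items():
    print(f"{k}-bit word: {secded_check_bits(k)} check bits, {frac:.1%} extra memory")
```

Under these assumptions a 64-bit word carries 8 check bits (the familiar (72,64) layout), and the relative cost grows as the protected word gets shorter.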

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • These bit-protection approaches could integrate into model quantization or training pipelines to reduce overhead even further.
  • The techniques may apply to other neural network types or non-neural machine learning models that store parameters in memory.
  • Widespread use in data centers could reduce overall system power draw due to simpler protection hardware.
  • Scaling tests on models with billions of parameters would clarify whether the bit-ranking process remains efficient.

Load-bearing premise

The vulnerability rankings and fault-injection experiments used to identify critical bits and measure reliability gains accurately reflect real-world transient hardware fault behavior in deployed large-scale DNN systems.
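A minimal sketch of the kind of fault-injection step this premise refers to is given below. It assumes independent, uniformly distributed bit flips at a given bit error rate (BER) over FP32 weights; the paper's actual fault model is not described in this summary, and `inject_faults` is a hypothetical name.

```python
import random
import struct

# Hypothetical injector; assumes independent bit flips at rate `ber`.
def inject_faults(weights, ber, rng):
    """Return (faulty_weights, flip_count) for FP32-encoded weights."""
    faulty, flips = [], 0
    for w in weights:
        (raw,) = struct.unpack("<I", struct.pack("<f", w))
        for bit in range(32):
            if rng.random() < ber:  # each bit flips independently
                raw ^= 1 << bit
                flips += 1
        faulty.append(struct.unpack("<f", struct.pack("<I", raw))[0])
    return faulty, flips

rng = random.Random(0)                   # fixed seed for repeatability
weights = [0.5, -1.25, 0.125, 2.0]       # illustrative values (assumption)
faulty, flips = inject_faults(weights, 1e-2, rng)
```

Real transient faults can be spatially correlated (for example, multi-bit upsets in adjacent cells), which this independent-flip model does not capture. That gap is exactly what the premise above depends on.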

What would settle it

A physical hardware test with a large CNN or ViT under measured transient fault rates where the observed error rates after MSET or CEP protection are higher than or equal to those under SECDED ECC would disprove the performance advantage.

Figures

Figures reproduced from arXiv: 2605.07417 by Jaan Raik, Luca Di Mauro, Marco Restifo, Marten Roots, Mohammad Hasan Ahmadilivani, Sven-Markus Loorits.

Figure 1: An overall system view of the system for implementing the proposed …
Figure 4: Encoding and decoding the parameters in a memory block by the …
Figure 2: Probability Distribution Function (PDF) of accuracy for ViT-base …
Figure 6: Reliability analysis of ViTs and CNNs using different protection …
Original abstract

Modern Deep Learning (DL) workloads are increasingly deployed in safety-critical domains, such as automotive systems and hyperscale data centers, where transient hardware faults pose a serious threat to system reliability. These workloads are highly memory-intensive, and their correct functionality strongly depends on model parameters stored in memory, which are typically protected using Error Correction Codes (ECCs). In this work, we study ECC's impact on such models and propose two lightweight alternatives to ECCs that achieve superior reliability. The first approach, MSET, selectively hardens the most vulnerable bits in CNN and ViT parameters, while the second approach, CEP, provides fine-grained protection for all parameter bits. Experimental results demonstrate that both methods significantly enhance the reliability of large CNNs and ViTs, mostly outperforming conventional Single Error Correction, Double Error Detection (SECDED) ECC schemes, with no memory overhead and, in fact, with considerably lower area and delay characteristics when compared to SECDED. Experimental results indicate that ViTs can be effectively protected by merely protecting their highest exponent bits in FP16 and FP32 representations. Furthermore, applying the CEP technique can guarantee the resilience of DNNs at Bit Error Rates (BERs) up to one order of magnitude higher, with a 3.5x lower area overhead and 7x faster decoder compared to SECDED ECC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes two lightweight alternatives to conventional ECC for protecting large-scale DNNs (CNNs and ViTs) from transient hardware faults: MSET, which selectively hardens the most vulnerable parameter bits, and CEP, which offers fine-grained protection across all bits. Based on fault-injection experiments, it claims these methods deliver superior reliability to SECDED ECC with zero memory overhead, substantially lower area and delay, that ViTs require protection only for the highest exponent bits in FP16/FP32 formats, and that CEP tolerates up to 10x higher bit error rates with 3.5x lower area overhead and 7x faster decoding than SECDED.

Significance. If the experimental claims are substantiated with reproducible fault models and hardware validation, the work could provide practical low-overhead techniques for reliable DNN deployment in safety-critical systems, reducing reliance on memory-intensive ECC while maintaining or improving resilience.

major comments (3)
  1. [Abstract and Experimental Results] Abstract and Experimental Results section: no details are supplied on the fault model (single- vs. multi-bit flips, injection locations or rates, number of trials, statistical significance tests, or error bars), preventing verification that the reported gains over SECDED and the 10x BER tolerance actually hold.
  2. [Abstract and ViT Results] Vulnerability ranking and ViT protection claims (Abstract): the method for identifying critical bits in MSET and the assertion that protecting only highest exponent bits suffices for ViTs lack description of the ranking procedure, ablation studies, or sensitivity analysis, making it impossible to assess whether these reflect real transient fault behavior rather than artifacts of the chosen injection setup.
  3. [Abstract and Overhead Analysis] Hardware overhead claims (Abstract): the 3.5x lower area and 7x faster decoder for CEP versus SECDED are presented without implementation details (e.g., technology node, decoder architecture, or measurement methodology), which are load-bearing for the central 'no memory overhead, lower area/delay' advantage.
minor comments (1)
  1. [Abstract] The acronyms MSET and CEP are introduced without expansion on first use.
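The statistical reporting requested in major comment 1 can be as simple as a mean with a confidence interval over repeated injection trials. The sketch below uses a normal-approximation 95% interval; the accuracy values are made up for illustration and are not results from the paper.

```python
import math
import statistics

def mean_with_ci(samples, z=1.96):
    """Mean and half-width of a normal-approximation confidence interval."""
    mean = statistics.mean(samples)
    half_width = z * statistics.stdev(samples) / math.sqrt(len(samples))
    return mean, half_width

# Made-up per-trial accuracies under fault injection (illustrative only).
trial_accuracy = [0.8, 0.9, 1.0]
mean, half_width = mean_with_ci(trial_accuracy)
print(f"accuracy = {mean:.3f} +/- {half_width:.3f} over {len(trial_accuracy)} trials")
```

With so few trials a Student's t interval would be more appropriate than the fixed z = 1.96; the normal approximation is used here only to keep the example short.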

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, agreeing where additional clarity is warranted and outlining the revisions we will make to strengthen the manuscript.

  1. Referee: [Abstract and Experimental Results] Abstract and Experimental Results section: no details are supplied on the fault model (single- vs. multi-bit flips, injection locations or rates, number of trials, statistical significance tests, or error bars), preventing verification that the reported gains over SECDED and the 10x BER tolerance actually hold.

    Authors: We agree that the abstract is concise and omits these methodological specifics, which are important for independent verification. The body of the manuscript describes the experimental setup, but to directly address this concern we will revise the abstract to include a brief statement on the fault model and expand the Experimental Results section with an explicit subsection detailing single-bit transient fault injections at random parameter locations, the range of BERs tested, the number of trials performed, and the use of statistical measures including error bars. revision: yes

  2. Referee: [Abstract and ViT Results] Vulnerability ranking and ViT protection claims (Abstract): the method for identifying critical bits in MSET and the assertion that protecting only highest exponent bits suffices for ViTs lack description of the ranking procedure, ablation studies, or sensitivity analysis, making it impossible to assess whether these reflect real transient fault behavior rather than artifacts of the chosen injection setup.

    Authors: We acknowledge that the abstract does not elaborate on the bit-ranking procedure or provide supporting ablations. The MSET method identifies vulnerable bits via impact analysis on model accuracy, and our ViT results indicate that highest-exponent-bit protection is sufficient; however, we will add a dedicated subsection with the ranking methodology, ablation studies across bit groups, and sensitivity analysis in the revised manuscript to demonstrate that the claims are robust rather than setup-specific. revision: yes

  3. Referee: [Abstract and Overhead Analysis] Hardware overhead claims (Abstract): the 3.5x lower area and 7x faster decoder for CEP versus SECDED are presented without implementation details (e.g., technology node, decoder architecture, or measurement methodology), which are load-bearing for the central 'no memory overhead, lower area/delay' advantage.

    Authors: The referee correctly identifies that the abstract lacks these implementation specifics. The manuscript contains an overhead analysis, but we will revise the relevant section and abstract to explicitly state the technology node, decoder architectures, synthesis methodology, and measurement approach used to obtain the area and delay comparisons, thereby substantiating the reported advantages. revision: yes
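As a rough picture of what a bit-ranking procedure might look like, the sketch below scores each FP32 bit position by the mean absolute weight change that a flip at that position causes. This is a stand-in proxy and an assumption of this example: the rebuttal says MSET ranks bits by impact on model accuracy, which requires running the model and is not reproduced here. All names and weight values are hypothetical.

```python
import math
import struct

FP32_MAX = 3.4028234663852886e38  # largest finite FP32 value

def flip_bit_fp32(value: float, bit: int) -> float:
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    return struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))[0]

# Proxy ranking (assumption): weight perturbation, not accuracy impact.
def rank_bit_positions(weights):
    """Return FP32 bit positions sorted from most to least disruptive."""
    scores = {}
    for bit in range(32):
        total = 0.0
        for w in weights:
            delta = abs(flip_bit_fp32(w, bit) - w)
            # Count NaN or infinite results as a maximal perturbation.
            total += delta if math.isfinite(delta) else FP32_MAX
        scores[bit] = total / len(weights)
    return sorted(scores, key=scores.get, reverse=True)

weights = [0.5, -0.25, 0.125, 0.75]  # illustrative values (assumption)
ranking = rank_bit_positions(weights)
```

For these sub-unit weights the highest exponent bit (position 30) comes out on top, in line with the ViT observation quoted above. An ablation of the kind the referee asks for would check whether an accuracy-based ranking agrees with such a proxy.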

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent experiments

The paper contains no equations, derivations, or fitted parameters that reduce to their own inputs by construction. All load-bearing claims (outperformance of SECDED, ViT exponent-bit protection, area/delay gains, BER tolerance) are presented as direct outcomes of fault-injection campaigns and hardware synthesis measurements. These are external to any self-referential redefinition or self-citation chain; the experimental methodology is described as an independent validation step rather than a tautology. Minor self-citations may exist for context but do not carry the central results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Review performed on abstract only; no detailed mathematical derivations or parameter lists are available.

axioms (1)
  • domain assumption: Transient hardware faults are a primary threat to correct DNN operation in safety-critical deployments
    Stated as motivation in the opening sentences of the abstract.
invented entities (2)
  • MSET (no independent evidence)
    purpose: Selective hardening of the most vulnerable parameter bits
    New technique introduced in the paper.
  • CEP (no independent evidence)
    purpose: Fine-grained protection across all parameter bits
    New technique introduced in the paper.

pith-pipeline@v0.9.0 · 5552 in / 1357 out tokens · 39297 ms · 2026-05-11T02:00:54.633360+00:00 · methodology

