arxiv: 2605.07417 · v1 · submitted 2026-05-08 · 💻 cs.AR · cs.LG

Effective and Memory-Efficient Alternatives to ECC for Reliable Large-Scale DNNs

Jaan Raik, Luca Di Mauro, Marco Restifo, Marten Roots, Mohammad Hasan Ahmadilivani, Sven-Markus Loorits

Pith reviewed 2026-05-11 02:00 UTC · model grok-4.3

keywords DNN reliability · ECC alternatives · CNN protection · Vision Transformers · transient faults · memory efficiency · fault tolerance · hardware overhead

The pith

Selective bit protection methods outperform ECC in reliability for large CNNs and ViTs while using no extra memory and less hardware area.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Modern deep learning models in safety-critical settings are vulnerable to transient hardware faults that corrupt model parameters stored in memory. This paper proposes two alternatives to traditional error correction codes: MSET, which hardens only the most vulnerable bits in CNN and ViT parameters, and CEP, which applies fine-grained protection to all bits. Experiments show these approaches deliver better fault tolerance than SECDED ECC schemes, with no memory overhead and reduced area and delay costs. For vision transformers, protecting only the highest exponent bits in FP16 and FP32 formats proves sufficient. If correct, this would allow more reliable large-scale AI systems without the hardware penalties of current protection methods.

Core claim

MSET and CEP are lightweight alternatives to ECC that improve the reliability of large CNNs and ViTs by safeguarding critical parameter bits, selectively in MSET and comprehensively in CEP. They mostly outperform SECDED schemes with no memory overhead and considerably lower area and delay, and ViTs can be protected effectively by guarding only their highest exponent bits in FP16 and FP32. CEP further guarantees resilience at bit error rates up to one order of magnitude higher, with 3.5x lower area overhead and a 7x faster decoder than SECDED ECC.
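To put the "no memory overhead" comparison in context, the storage cost of a conventional SECDED code can be worked out from the Hamming bound. The sketch below is illustrative only: the function name `secded_check_bits` is hypothetical, and the abstract does not say which SECDED word size the paper evaluates, so the data widths in the loop are assumptions.

```python
def secded_check_bits(data_bits: int) -> int:
    """Check bits needed by a Hamming-based SECDED code for one data word.

    A single-error-correcting Hamming code needs the smallest r with
    2**r >= data_bits + r + 1; SECDED adds one overall parity bit.
    """
    if data_bits < 1:
        raise ValueError("data_bits must be positive")
    r = 1
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r + 1  # extra parity bit enables double-error detection


# Assumed word widths; the paper's actual configuration is not given here.
for width in (16, 32, 64):
    check = secded_check_bits(width)
    overhead = 100.0 * check / width
    print(f"{width}-bit word: {check} check bits, {overhead:.1f}% extra storage")
```

Under these assumptions the familiar (72,64) and (39,32) layouts fall out directly, which is the kind of per-word storage cost an in-place scheme would avoid.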

What carries the argument

MSET for selective hardening of the most vulnerable bits identified via ranking and CEP for fine-grained protection across all parameter bits, both applied to floating-point representations of CNN and ViT weights to mitigate transient faults.

If this is right

Large CNNs and ViTs achieve higher reliability than with conventional SECDED ECC without increasing memory usage.
Vision transformers require protection only on the highest exponent bits in FP16 and FP32 representations for effective fault tolerance.
CEP enables DNN resilience at bit error rates up to ten times higher than SECDED while using 3.5 times less area and decoding seven times faster.
Hardware implementations incur considerably lower area and delay characteristics than ECC-based schemes.
Reliable operation of memory-intensive safety-critical DL workloads becomes feasible with reduced silicon costs.
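The emphasis on the highest exponent bit can be illustrated with plain IEEE 754 arithmetic. The following is a minimal sketch, not the paper's method: the helper names (`fp32_bits`, `flip_bit_fp32`, `flip_bit_fp16`) and the sample weights are hypothetical, chosen only to show how much more damage one exponent-MSB flip does than a mantissa flip.

```python
import struct


def fp32_bits(value: float) -> int:
    """Return the 32-bit pattern of value rounded to FP32."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def flip_bit_fp32(value: float, bit: int) -> float:
    """Flip one bit (0 = mantissa LSB, 30 = exponent MSB, 31 = sign) in FP32."""
    raw = fp32_bits(value) ^ (1 << bit)
    return struct.unpack("<f", struct.pack("<I", raw))[0]


def flip_bit_fp16(value: float, bit: int) -> float:
    """Flip one bit (0 = mantissa LSB, 14 = exponent MSB, 15 = sign) in FP16."""
    raw = struct.unpack("<H", struct.pack("<e", value))[0] ^ (1 << bit)
    return struct.unpack("<e", struct.pack("<H", raw))[0]


weight = 0.5  # hypothetical weight; trained weights are often of small magnitude
print("FP32 mantissa LSB flip:", flip_bit_fp32(weight, 0))
print("FP32 exponent MSB flip:", flip_bit_fp32(weight, 30))
print("FP16 exponent MSB flip:", flip_bit_fp16(weight, 14))

# For any FP32 value with magnitude below 2.0 the exponent MSB (bit 30) is 0,
# so a set bit 30 in such a weight can only come from corruption.
samples = [0.5, -0.25, 1.0, -1.9, 0.001]
print(all((fp32_bits(w) >> 30) & 1 == 0 for w in samples))
```

The last check hints at why a single bit position can carry so much of the risk: when weights stay below 2.0 in magnitude, that bit holds no information in the fault-free case. Whether the paper exploits this property in the same way is not stated in the abstract.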

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These bit-protection approaches could integrate into model quantization or training pipelines to reduce overhead even further.
The techniques may apply to other neural network types or non-neural machine learning models that store parameters in memory.
Widespread use in data centers could reduce overall system power draw due to simpler protection hardware.
Scaling tests on models with billions of parameters would clarify whether the bit-ranking process remains efficient.

Load-bearing premise

The vulnerability rankings and fault-injection experiments used to identify critical bits and measure reliability gains accurately reflect real-world transient hardware fault behavior in deployed large-scale DNN systems.
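For readers unfamiliar with how such fault-injection experiments are typically set up, here is a minimal sketch of a bit-error-rate (BER) driven injector. It assumes independent, uniformly distributed bit flips over FP32 weights, which is a common simulation model but is not confirmed as the paper's setup; `inject_bit_flips` and its parameters are hypothetical names.

```python
import random
import struct


def inject_bit_flips(weights, ber, seed=0):
    """Flip each bit of each FP32 weight independently with probability ber.

    Returns the corrupted weights and the number of flipped bits.
    Assumption: faults are independent and uniform across bit positions.
    """
    rng = random.Random(seed)  # fixed seed keeps a trial reproducible
    corrupted = []
    flips = 0
    for w in weights:
        raw = struct.unpack("<I", struct.pack("<f", w))[0]
        for bit in range(32):
            if rng.random() < ber:
                raw ^= 1 << bit
                flips += 1
        corrupted.append(struct.unpack("<f", struct.pack("<I", raw))[0])
    return corrupted, flips


# Hypothetical toy "model": 10,000 identical weights, BER of 1e-4.
weights = [0.5] * 10_000
faulty, n_flips = inject_bit_flips(weights, ber=1e-4, seed=42)
changed = sum(1 for a, b in zip(weights, faulty) if a != b)
print(f"{n_flips} bit flips, {changed} weights changed")
```

A model like this cannot represent correlated or burst errors, which is exactly the gap the premise above points at.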

What would settle it

A physical hardware test with a large CNN or ViT under measured transient fault rates where the observed error rates after MSET or CEP protection are higher than or equal to those under SECDED ECC would disprove the performance advantage.

Figures

Figures reproduced from arXiv: 2605.07417 by Jaan Raik, Luca Di Mauro, Marco Restifo, Marten Roots, Mohammad Hasan Ahmadilivani, Sven-Markus Loorits.

**Figure 4.** Encoding and decoding the parameters in a memory block by the …

**Figure 2.** Probability Distribution Function (PDF) of accuracy for ViT-base …

**Figure 6.** Reliability analysis of ViTs and CNNs using different protection …

Original abstract

Modern Deep Learning (DL) workloads are increasingly deployed in safety-critical domains, such as automotive systems and hyperscale data centers, where transient hardware faults pose a serious threat to system reliability. These workloads are highly memory-intensive, and their correct functionality strongly depends on model parameters stored in memory, which are typically protected using Error Correction Codes (ECCs). In this work, we study ECC's impact on such models and propose two lightweight alternatives to ECCs that achieve superior reliability. The first approach, MSET, selectively hardens the most vulnerable bits in CNN and ViT parameters, while the second approach, CEP, provides fine-grained protection for all parameter bits. Experimental results demonstrate that both methods significantly enhance the reliability of large CNNs and ViTs, mostly outperforming conventional Single Error Correction, Double Error Detection (SECDED) ECC schemes, with no memory overhead and, in fact, with considerably lower area and delay characteristics when compared to SECDED. Experimental results indicate that ViTs can be effectively protected by merely protecting their highest exponent bits in FP16 and FP32 representations. Furthermore, applying the CEP technique can guarantee the resilience of DNNs at up to one order of magnitude higher BERs, with a 3.5x lower area overhead and a 7x faster decoder compared to SECDED ECC.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MSET and CEP offer selective and fine-grained bit protection for DNNs, and the paper claims better reliability than SECDED ECC with no memory overhead, but the fault-injection experiments are the part that needs the most checking.

The main takeaway is that this paper proposes MSET for selective hardening of vulnerable bits and CEP for fine-grained protection as alternatives to ECC in DNNs, with experiments showing better reliability for CNNs and ViTs at lower area and delay costs. They do a good job highlighting the memory overhead problem with traditional ECC and offering targeted fixes. The observation that protecting only the highest exponent bits suffices for ViTs in FP16 and FP32 is a nice practical insight that could reduce protection costs significantly. The experimental results claim substantial improvements, like handling higher bit error rates and faster decoders. If the full paper includes reproducible fault injection setups and clear baselines, that strengthens the case.

The soft spot is the reliance on fault-injection results without clear evidence that the model captures real-world transient faults in deployed systems. Things like correlated errors or specific hardware behaviors might not be represented, which could inflate the reported gains. Also, more comparison to existing selective protection techniques in the literature would help place the novelty.

This paper is for researchers and engineers focused on reliable AI hardware in critical applications. A reader interested in fault tolerance for neural networks would get concrete ideas to build on. It deserves serious peer review because the claims are specific and the topic has practical impact. I would send it to referees who can evaluate the experimental methodology.

Referee Report

3 major / 1 minor

Summary. The paper proposes two lightweight alternatives to conventional ECC for protecting large-scale DNNs (CNNs and ViTs) from transient hardware faults: MSET, which selectively hardens the most vulnerable parameter bits, and CEP, which offers fine-grained protection across all bits. Based on fault-injection experiments, it claims these methods deliver superior reliability to SECDED ECC with zero memory overhead, substantially lower area and delay, that ViTs require protection only for the highest exponent bits in FP16/FP32 formats, and that CEP tolerates up to 10x higher bit error rates with 3.5x lower area overhead and 7x faster decoding than SECDED.

Significance. If the experimental claims are substantiated with reproducible fault models and hardware validation, the work could provide practical low-overhead techniques for reliable DNN deployment in safety-critical systems, reducing reliance on memory-intensive ECC while maintaining or improving resilience.

major comments (3)

[Abstract and Experimental Results] Abstract and Experimental Results section: no details are supplied on the fault model (single- vs. multi-bit flips, injection locations or rates, number of trials, statistical significance tests, or error bars), preventing verification that the reported gains over SECDED and the 10x BER tolerance actually hold.
[Abstract and ViT Results] Vulnerability ranking and ViT protection claims (Abstract): the method for identifying critical bits in MSET and the assertion that protecting only highest exponent bits suffices for ViTs lack description of the ranking procedure, ablation studies, or sensitivity analysis, making it impossible to assess whether these reflect real transient fault behavior rather than artifacts of the chosen injection setup.
[Abstract and Overhead Analysis] Hardware overhead claims (Abstract): the 3.5x lower area and 7x faster decoder for CEP versus SECDED are presented without implementation details (e.g., technology node, decoder architecture, or measurement methodology), which are load-bearing for the central 'no memory overhead, lower area/delay' advantage.
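As an illustration of the error bars the first comment asks for, a fault-injection campaign can report a confidence interval on the observed failure fraction. The sketch below uses the Wilson score interval; the function name `wilson_interval`, the definition of a "failure" (for instance, a trial whose accuracy drops below some threshold), and the trial counts are all assumptions, not details from the paper.

```python
import math


def wilson_interval(failures: int, trials: int, z: float = 1.96):
    """Wilson score interval for a failure proportion (z = 1.96 gives ~95%)."""
    if trials <= 0 or not 0 <= failures <= trials:
        raise ValueError("need 0 <= failures <= trials and trials > 0")
    p = failures / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2.0 * trials)) / denom
    half = z * math.sqrt(p * (1.0 - p) / trials
                         + z * z / (4.0 * trials * trials)) / denom
    return max(0.0, center - half), min(1.0, center + half)


# Hypothetical campaigns: same observed failure rate, different trial counts.
for failures, trials in ((5, 100), (50, 1000), (500, 10000)):
    low, high = wilson_interval(failures, trials)
    print(f"{failures}/{trials}: [{low:.4f}, {high:.4f}]")
```

The interval narrows as the number of trials grows, which is why the trial count matters when judging whether a reported gap between two protection schemes is meaningful.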

minor comments (1)

[Abstract] The acronyms MSET and CEP are introduced without expansion on first use.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, agreeing where additional clarity is warranted and outlining the revisions we will make to strengthen the manuscript.

Referee: [Abstract and Experimental Results] Abstract and Experimental Results section: no details are supplied on the fault model (single- vs. multi-bit flips, injection locations or rates, number of trials, statistical significance tests, or error bars), preventing verification that the reported gains over SECDED and the 10x BER tolerance actually hold.

Authors: We agree that the abstract is concise and omits these methodological specifics, which are important for independent verification. The body of the manuscript describes the experimental setup, but to directly address this concern we will revise the abstract to include a brief statement on the fault model and expand the Experimental Results section with an explicit subsection detailing single-bit transient fault injections at random parameter locations, the range of BERs tested, the number of trials performed, and the use of statistical measures including error bars.

revision: yes

Referee: [Abstract and ViT Results] Vulnerability ranking and ViT protection claims (Abstract): the method for identifying critical bits in MSET and the assertion that protecting only highest exponent bits suffices for ViTs lack description of the ranking procedure, ablation studies, or sensitivity analysis, making it impossible to assess whether these reflect real transient fault behavior rather than artifacts of the chosen injection setup.

Authors: We acknowledge that the abstract does not elaborate on the bit-ranking procedure or provide supporting ablations. The MSET method identifies vulnerable bits via impact analysis on model accuracy, and our ViT results indicate that highest-exponent-bit protection is sufficient; however, we will add a dedicated subsection with the ranking methodology, ablation studies across bit groups, and sensitivity analysis in the revised manuscript to demonstrate that the claims are robust rather than setup-specific.

revision: yes

Referee: [Abstract and Overhead Analysis] Hardware overhead claims (Abstract): the 3.5x lower area and 7x faster decoder for CEP versus SECDED are presented without implementation details (e.g., technology node, decoder architecture, or measurement methodology), which are load-bearing for the central 'no memory overhead, lower area/delay' advantage.

Authors: The referee correctly identifies that the abstract lacks these implementation specifics. The manuscript contains an overhead analysis, but we will revise the relevant section and abstract to explicitly state the technology node, decoder architectures, synthesis methodology, and measurement approach used to obtain the area and delay comparisons, thereby substantiating the reported advantages.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent experiments

The paper contains no equations, derivations, or fitted parameters that reduce to their own inputs by construction. All load-bearing claims (outperformance of SECDED, ViT exponent-bit protection, area/delay gains, BER tolerance) are presented as direct outcomes of fault-injection campaigns and hardware synthesis measurements. These are external to any self-referential redefinition or self-citation chain; the experimental methodology is described as an independent validation step rather than a tautology. Minor self-citations may exist for context but do not carry the central results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Review performed on abstract only; no detailed mathematical derivations or parameter lists are available.

axioms (1)

Domain assumption: Transient hardware faults are a primary threat to correct DNN operation in safety-critical deployments.
Stated as motivation in the opening sentences of the abstract.

invented entities (2)

MSET · no independent evidence
purpose: Selective hardening of the most vulnerable parameter bits
New technique introduced in the paper.

CEP · no independent evidence
purpose: Fine-grained protection across all parameter bits
New technique introduced in the paper.
