Silent data corruptions at scale

Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, Sriram Sankar · 2021 · arXiv 2102.11245

14 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 3

citation-polarity summary

background 3

representative citing papers

Self-Verifying Measurement Records: Hash-Linked Evidence Graphs for Hardware Benchmarking

cs.CR · 2026-06-26 · unverdicted · novelty 7.0

The paper constructs hash-linked evidence graphs that bind hardware measurement quantities to their verification records, enabling offline auditing with probabilistic matrix checks and security measures against probe attacks on GPUs.

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

cs.DC · 2026-06-01 · unverdicted · novelty 7.0

A new fault-injection framework enables a systematic empirical study that produces 17 takeaways on error propagation in LLM inference and four software-only mitigation directions.

ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions

cs.AR · 2026-05-15 · unverdicted · novelty 7.0

ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.

FLARE: One-Shot PE-Level Fault Localization in Systolic Arrays via Algebraic Test Vectors

cs.AR · 2026-05-09 · unverdicted · novelty 7.0

FLARE uses pairwise coprime test vectors to create unique divisibility signatures that localize faulty rows in systolic arrays with one test pass and over 98% probability for 256x256 INT16 arrays.

Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon

cs.LG · 2026-04-23 · unverdicted · novelty 7.0

Kernel Contracts is a specification language that formalizes correctness requirements for ML kernels to ensure consistent results across heterogeneous silicon platforms.

Making Transaction Isolation Checking Practical

cs.DB · 2026-04-22 · unverdicted · novelty 7.0

Boomslang introduces a front-end/back-end pipeline with superpositions in its IR to enable general-purpose checking of arbitrary transaction isolation levels via SMT solving.

DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference

cs.AR · 2026-04-10 · unverdicted · novelty 7.0

DRIFT uses resilience analysis, targeted DVFS, and adaptive rollback ABFT to deliver 36% average energy savings or 1.7x speedup in diffusion model inference while preserving generation quality.

Zero knowledge verification for frontier AI training is possible

cs.AI · 2026-06-03 · unverdicted · novelty 6.0

Proposes zkVM-based protocol for verifiable frontier AI pre-training with committed specs, network observations, Merkle commitments, and FP precompiles, estimating 36-month POC at single-digit overhead.

The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance

cs.AR · 2026-05-05 · unverdicted · novelty 6.0

Large-scale GPU fault injection shows NaN/inf outcomes are only 1% of SDC, single-bit flips under 40%, and corruption addresses are periodic, supporting distribution-aware modeling.

LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training

cs.AR · 2026-04-12 · unverdicted · novelty 6.0

LLMs resist low-frequency permanent GPU faults but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.

Gemini: A Family of Highly Capable Multimodal Models

cs.CL · 2023-12-19 · conditional · novelty 6.0

Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.

Effective and Memory-Efficient Alternatives to ECC for Reliable Large-Scale DNNs

cs.AR · 2026-05-08 · unverdicted · novelty 5.0

MSET and CEP deliver higher reliability than SECDED ECC for CNNs and Vision Transformers with zero memory overhead and substantially lower area and delay.

Aging Aware Adaptive Voltage Scaling for Reliable and Efficient AI Accelerators

cs.AR · 2026-04-11 · unverdicted · novelty 5.0

An aging-aware adaptive voltage scaling framework for AI accelerators reduces predicted threshold voltage shifts by ~19% and aging degradation by up to 46% while saving 14% lifetime power by leveraging neural network resilience.

AIReSim: A Discrete Event Simulator for Large-scale AI Cluster Reliability Modeling

cs.DC · 2026-03-07 · unverdicted · novelty 4.0

AIReSim is a discrete event simulator for evaluating failure mitigation, recovery, and capacity planning decisions in large AI clusters.

citing papers explorer

Self-Verifying Measurement Records: Hash-Linked Evidence Graphs for Hardware Benchmarking

cs.CR · 2026-06-26 · unverdicted · none · ref 10

The paper constructs hash-linked evidence graphs that bind hardware measurement quantities to their verification records, enabling offline auditing with probabilistic matrix checks and security measures against probe attacks on GPUs.

Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference

cs.DC · 2026-06-01 · unverdicted · none · ref 21

A new fault-injection framework enables a systematic empirical study that produces 17 takeaways on error propagation in LLM inference and four software-only mitigation directions.

ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions

cs.AR · 2026-05-15 · unverdicted · none · ref 16

ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.

FLARE: One-Shot PE-Level Fault Localization in Systolic Arrays via Algebraic Test Vectors

cs.AR · 2026-05-09 · unverdicted · none · ref 6

FLARE uses pairwise coprime test vectors to create unique divisibility signatures that localize faulty rows in systolic arrays with one test pass and over 98% probability for 256x256 INT16 arrays.

Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon

cs.LG · 2026-04-23 · unverdicted · none · ref 11

Kernel Contracts is a specification language that formalizes correctness requirements for ML kernels to ensure consistent results across heterogeneous silicon platforms.

Making Transaction Isolation Checking Practical

cs.DB · 2026-04-22 · unverdicted · none · ref 42

Boomslang introduces a front-end/back-end pipeline with superpositions in its IR to enable general-purpose checking of arbitrary transaction isolation levels via SMT solving.

DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference

cs.AR · 2026-04-10 · unverdicted · none · ref 38

DRIFT uses resilience analysis, targeted DVFS, and adaptive rollback ABFT to deliver 36% average energy savings or 1.7x speedup in diffusion model inference while preserving generation quality.

Zero knowledge verification for frontier AI training is possible

cs.AI · 2026-06-03 · unverdicted · none · ref 10

Proposes zkVM-based protocol for verifiable frontier AI pre-training with committed specs, network observations, Merkle commitments, and FP precompiles, estimating 36-month POC at single-digit overhead.

The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance

cs.AR · 2026-05-05 · unverdicted · none · ref 3

Large-scale GPU fault injection shows NaN/inf outcomes are only 1% of SDC, single-bit flips under 40%, and corruption addresses are periodic, supporting distribution-aware modeling.

LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training

cs.AR · 2026-04-12 · unverdicted · none · ref 2

LLMs resist low-frequency permanent GPU faults but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.

Gemini: A Family of Highly Capable Multimodal Models

cs.CL · 2023-12-19 · conditional · none · ref 20

Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.

Effective and Memory-Efficient Alternatives to ECC for Reliable Large-Scale DNNs

cs.AR · 2026-05-08 · unverdicted · none · ref 5

MSET and CEP deliver higher reliability than SECDED ECC for CNNs and Vision Transformers with zero memory overhead and substantially lower area and delay.

Aging Aware Adaptive Voltage Scaling for Reliable and Efficient AI Accelerators

cs.AR · 2026-04-11 · unverdicted · none · ref 10

An aging-aware adaptive voltage scaling framework for AI accelerators reduces predicted threshold voltage shifts by ~19% and aging degradation by up to 46% while saving 14% lifetime power by leveraging neural network resilience.

AIReSim: A Discrete Event Simulator for Large-scale AI Cluster Reliability Modeling

cs.DC · 2026-03-07 · unverdicted · none · ref 6

AIReSim is a discrete event simulator for evaluating failure mitigation, recovery, and capacity planning decisions in large AI clusters.
