Silent data corruptions at scale
14 Pith papers cite this work.
citing papers explorer
- Self-Verifying Measurement Records: Hash-Linked Evidence Graphs for Hardware Benchmarking
  The paper constructs hash-linked evidence graphs that bind hardware measurement quantities to their verification records, enabling offline auditing with probabilistic matrix checks and security measures against probe attacks on GPUs.
- Not All Errors Are Equal: A Systematic Study of Error Propagation in Large Language Model Inference
  A new fault-injection framework enables a systematic empirical study that produces 17 takeaways on error propagation in LLM inference and four software-only mitigation directions.
- ITHICA: Intra-Thread Instruction Checking Approach for Defect-Induced Silent Data Corruptions
  ITHICA generates functional tests via intra-thread instruction duplication and comparison, detecting 39% more defective servers than baseline methods on over 3000 real CPUs while revealing new defect behaviors.
- FLARE: One-Shot PE-Level Fault Localization in Systolic Arrays via Algebraic Test Vectors
  FLARE uses pairwise coprime test vectors to create unique divisibility signatures that localize faulty rows in systolic arrays with one test pass and over 98% probability for 256x256 INT16 arrays.
- Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon
  Kernel Contracts is a specification language that formalizes correctness requirements for ML kernels to ensure consistent results across heterogeneous silicon platforms.
- Making Transaction Isolation Checking Practical
  Boomslang introduces a front-end/back-end pipeline with superpositions in its IR to enable general-purpose checking of arbitrary transaction isolation levels via SMT solving.
- DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference
  DRIFT uses resilience analysis, targeted DVFS, and adaptive rollback ABFT to deliver 36% average energy savings or 1.7x speedup in diffusion model inference while preserving generation quality.
- Zero knowledge verification for frontier AI training is possible
  Proposes a zkVM-based protocol for verifiable frontier AI pre-training with committed specs, network observations, Merkle commitments, and FP precompiles, estimating a 36-month POC at single-digit overhead.
- The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance
  Large-scale GPU fault injection shows that NaN/inf outcomes are only 1% of SDC, single-bit flips are under 40%, and corruption addresses are periodic, supporting distribution-aware modeling.
- LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training
  LLMs resist low-frequency permanent GPU faults, but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.
- Gemini: A Family of Highly Capable Multimodal Models
  Gemini Ultra reaches human-expert performance on MMLU for the first time and sets new state-of-the-art results on 30 of 32 benchmarks, including all 20 multimodal ones tested.
- Effective and Memory-Efficient Alternatives to ECC for Reliable Large-Scale DNNs
  MSET and CEP deliver higher reliability than SECDED ECC for CNNs and Vision Transformers with zero memory overhead and substantially lower area and delay.
- Aging Aware Adaptive Voltage Scaling for Reliable and Efficient AI Accelerators
  An aging-aware adaptive voltage scaling framework for AI accelerators reduces predicted threshold voltage shifts by ~19% and aging degradation by up to 46% while saving 14% lifetime power by leveraging neural network resilience.
- AIReSim: A Discrete Event Simulator for Large-scale AI Cluster Reliability Modeling
  AIReSim is a discrete event simulator for evaluating failure mitigation, recovery, and capacity planning decisions in large AI clusters.