Silent data corruptions at scale
8 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (8). Verdicts: unverdicted (8).
citing papers explorer
- FLARE: One-Shot PE-Level Fault Localization in Systolic Arrays via Algebraic Test Vectors
  FLARE uses pairwise coprime test vectors to create unique divisibility signatures that localize faulty rows in systolic arrays with one test pass and over 98% probability for 256x256 INT16 arrays.
- Kernel Contracts: A Specification Language for ML Kernel Correctness Across Heterogeneous Silicon
  Kernel Contracts is a specification language that formalizes correctness requirements for ML kernels to ensure consistent results across heterogeneous silicon platforms.
- Making Transaction Isolation Checking Practical
  Boomslang introduces a front-end/back-end pipeline with superpositions in its IR to enable general-purpose checking of arbitrary transaction isolation levels via SMT solving.
- DRIFT: Harnessing Inherent Fault Tolerance for Efficient and Reliable Diffusion Model Inference
  DRIFT uses resilience analysis, targeted DVFS, and adaptive rollback ABFT to deliver 36% average energy savings or 1.7x speedup in diffusion model inference while preserving generation quality.
- The Anatomy of Silent Data Corruption: GPU Error Pattern Study and Modeling Guidance
  Large-scale GPU fault injection shows NaN/inf outcomes are only 1% of SDC, single-bit flips under 40%, and corruption addresses are periodic, supporting distribution-aware modeling.
- LLM-PRISM: Characterizing Silent Data Corruption from Permanent GPU Faults in LLM Training
  LLMs resist low-frequency permanent GPU faults but certain datapaths and precision formats trigger catastrophic training divergence even at moderate fault rates.
- Effective and Memory-Efficient Alternatives to ECC for Reliable Large-Scale DNNs
  MSET and CEP deliver higher reliability than SECDED ECC for CNNs and Vision Transformers with zero memory overhead and substantially lower area and delay.
- Aging Aware Adaptive Voltage Scaling for Reliable and Efficient AI Accelerators
  An aging-aware adaptive voltage scaling framework for AI accelerators reduces predicted threshold voltage shifts by ~19% and aging degradation by up to 46% while saving 14% lifetime power by leveraging neural network resilience.