Scaling FP8 Training to Trillion-Token LLMs
5 Pith papers cite this work.
Citation roles: background (2)
Citation polarities: background (2)
Citing papers
- Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
  Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.
- Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation
  RankElastor mitigates embedding collapse via spectrum-robust token mixing and GLU-based P-FFNs, yielding better performance and scaling on industrial recommendation datasets.
- LoKA: Low-precision Kernel Applications for Recommendation Models At Scale
  LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.
- NVILA: Efficient Frontier Visual Language Models
  NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.
- From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs