Scaling FP8 Training to Trillion-Token LLMs

Maxim Fishman, Brian Chmiel, Ron Banner, Daniel Soudry · 2025 · arXiv 2409.12517

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

cs.LG · 2025-10-05 · unverdicted · novelty 7.0

Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.

Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation

cs.LG · 2026-05-22 · unverdicted · novelty 6.0

RankElastor mitigates embedding collapse via spectrum-robust token mixing and GLU-based P-FFNs, yielding better performance and scaling on industrial recommendation datasets.

LoKA: Low-precision Kernel Applications for Recommendation Models At Scale

cs.LG · 2026-05-11 · unverdicted · novelty 6.0 · 2 refs

LoKA enables practical FP8 use in numerically sensitive large recommendation models via online profiling of activations, reusable model modifications for stability, and dynamic kernel dispatching.

NVILA: Efficient Frontier Visual Language Models

cs.CV · 2024-12-05 · unverdicted · novelty 5.0

NVILA improves on VILA with a scale-then-compress visual token strategy and full-lifecycle efficiency optimizations, matching or exceeding leading VLMs on image and video benchmarks while reducing training cost 1.9-5.1x and latencies 1.2-2.8x.

From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs

cs.DC · 2026-05-10

citing papers explorer

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention · cs.LG · 2025-10-05 · unverdicted · none · ref 4
Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation · cs.LG · 2026-05-22 · unverdicted · none · ref 7
LoKA: Low-precision Kernel Applications for Recommendation Models At Scale · cs.LG · 2026-05-11 · unverdicted · none · ref 30 · 2 links
NVILA: Efficient Frontier Visual Language Models · cs.CV · 2024-12-05 · unverdicted · none · ref 86
From Detection to Recovery: Operational Analysis on LLM Pre-training with 504 GPUs · cs.DC · 2026-05-10 · unreviewed · ref 49