Methods of improving LLM training stability. arXiv preprint arXiv:2410.16682
3 Pith papers cite this work. Polarity classification is still indexing.
fields: cs.LG 3
verdicts: UNVERDICTED 3
roles: background 1
polarities: background 1
representative citing papers
Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
citing papers explorer
- A Unifying View of Attention Sinks: Two Algorithms, Two Solutions
Attention sinks reflect either adaptive nop or broadcast mechanisms, with distinct traces, synthetic diagnostics, and complementary interventions via gating plus registers.
- Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention
Low-precision Flash Attention fails due to similar low-rank attention representations combined with biased rounding errors that accumulate and corrupt weight updates; a minimal fix to reduce rounding bias stabilizes training.
- Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.