arxiv: 2510.04212 · v3 · submitted 2025-10-05 · 💻 cs.LG · cs.AI

Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

Pith reviewed 2026-05-18 09:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: low-precision training, flash attention, transformer instability, rounding errors, low-rank representations, training dynamics, error accumulation, attention mechanism

The pith

Low-precision Flash Attention training fails because similar low-rank attention representations combine with biased rounding errors to create a self-reinforcing cycle that corrupts weight updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that instabilities in low-precision transformer training using Flash Attention stem from two linked issues rather than random hardware glitches: attention heads develop nearly identical low-rank representations, and low-precision arithmetic introduces systematic rounding biases that accumulate over steps. A sympathetic reader cares because this account turns an opaque failure into a diagnosable mechanism, suggesting that targeted fixes could make efficient low-precision training reliable instead of requiring full-precision fallbacks. The authors back the account by tracing how the two factors reinforce each other into a vicious cycle and then showing that a minimal change to Flash Attention mitigates the bias and restores stable training.

Core claim

The failure is not a random artifact but caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. These factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics.

What carries the argument

The vicious cycle formed when similar low-rank attention representations meet biased rounding errors in low-precision arithmetic, which then amplifies errors in subsequent weight updates.
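
As a toy illustration of how rounding error can be systematically signed rather than zero-mean, the sketch below emulates bfloat16 round-to-nearest-even in pure Python and accumulates same-sign terms. The helper names `to_bf16` and `bf16_sum` are hypothetical, and accumulating directly in BF16 is an assumption made only for illustration; this is not the paper's specific analysis of Flash Attention.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even).

    Assumption: x is finite and within float32 range; NaN/inf are not handled.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Adding 0x7FFF plus the lowest kept bit implements round-to-nearest-even.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000  # keep sign, exponent, and the top 7 mantissa bits
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def bf16_sum(values):
    """Sum values while rounding the accumulator to bfloat16 after every add."""
    acc = 0.0
    for v in values:
        acc = to_bf16(acc + to_bf16(v))
    return acc


# 257 lies exactly halfway between the bfloat16 neighbours 256 and 258,
# and ties-to-even picks 256, so the running sum stops growing at 256.
print(bf16_sum([1.0] * 1000))   # 256.0, while the exact sum is 1000.0
print(bf16_sum([-1.0] * 1000))  # -256.0, while the exact sum is -1000.0
```

In both runs every lost increment pushes the error in the same direction, so the error grows with the number of terms instead of averaging out.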

If this is right

  • A minimal modification to Flash Attention that mitigates rounding bias is sufficient to stabilize low-precision training.
  • The same low-rank similarity and rounding bias mechanism explains why instabilities appear specifically with Flash Attention rather than with standard attention.
  • Correcting the rounding bias breaks the error-accumulation loop and prevents corruption of weight updates.
  • The identified cycle accounts for the catastrophic loss spikes seen in prior low-precision Flash Attention runs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same bias-accumulation pattern may appear in other fused attention kernels or low-precision linear layers once representations become correlated.
  • Hardware vendors could prioritize unbiased rounding modes in low-precision matrix units to reduce the need for software patches.
  • Monitoring the rank diversity of attention heads during training could serve as an early warning signal for impending instability.
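
One way such monitoring might be sketched is with the stable rank, the squared Frobenius norm divided by the squared spectral norm, which drops toward 1 as a matrix approaches rank 1. The choice of stable rank as the metric, the function names, and the power-iteration settings below are all assumptions for illustration; the paper does not prescribe this diagnostic.

```python
import math
import random


def spectral_norm(a, iters=50, seed=0):
    """Estimate the largest singular value of `a` (a list of rows).

    Uses power iteration on A^T A from a seeded random start vector.
    """
    rng = random.Random(seed)
    n_rows, n_cols = len(a), len(a[0])
    v = [rng.gauss(0.0, 1.0) for _ in range(n_cols)]
    sigma_sq = 0.0
    for _ in range(iters):
        norm_v = math.sqrt(sum(x * x for x in v))
        if norm_v == 0.0:
            return 0.0  # zero matrix (or a degenerate start vector)
        v = [x / norm_v for x in v]
        av = [sum(a[i][j] * v[j] for j in range(n_cols)) for i in range(n_rows)]
        v = [sum(a[i][j] * av[i] for i in range(n_rows)) for j in range(n_cols)]
        sigma_sq = math.sqrt(sum(x * x for x in v))  # ||A^T A v|| -> sigma_max^2
    return math.sqrt(sigma_sq)


def stable_rank(a):
    """Return ||A||_F^2 / ||A||_2^2, a smooth proxy for the rank of `a`."""
    fro_sq = sum(x * x for row in a for x in row)
    sigma = spectral_norm(a)
    return fro_sq / (sigma * sigma) if sigma > 0.0 else 0.0


rank_one = [[3.0, 4.0], [6.0, 8.0]]  # outer product of [1, 2] and [3, 4]
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
print(stable_rank(rank_one))  # close to 1
print(stable_rank(identity))  # close to 4
```

Logged per head over training steps, a value that falls toward 1 would flag the kind of low-rank collapse the summary describes. Which matrix to monitor and what threshold to alert on are left open here.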

Load-bearing premise

The observed low-rank similarity and biased rounding errors are the primary and sufficient drivers of the instability rather than symptoms of other unexamined factors in training dynamics or hardware.

What would settle it

Training a transformer with the authors' proposed minimal modification to Flash Attention in the same low-precision setting and checking whether the loss explosion disappears while keeping all other factors fixed.

Figures

Figures reproduced from arXiv: 2510.04212 by Haiquan Qiu, Quanming Yao.

Figure 1: Analysis in different sections. Our paper traces the causal chain of training failure (blue …
Figure 2: The failure case using BF16 and flash attention results in a sudden loss explosion, while the stable configuration converges.
Our investigation targets a well-documented and persistent failure: the catastrophic loss explosion that occurs when training Generative Pre-trained Transformer 2 (GPT-2) models with flash attention in BF16 precision (nanoGPT Issue 303, 2023; nanoGPT Issue 524, 2024; nanoGPT Iss…
Figure 3: W^Q of attention head 8 has the largest spectral norm. Subsequent analysis focuses on this head.
Numerical Errors in O are the Source of Failure. Building on the finding that the computation of δ is critical, we further isolate the source of error to the output matrix O_lp in the low-precision δ_lp = rowsum(dO ◦ O_lp). We conduct two key experiments. First, instead of using the low-precision O from the forward pa…
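
To make the quantity δ = rowsum(dO ◦ O) concrete, the sketch below computes it once with a full-precision O and once with O rounded to bfloat16, then prints δ_lp − δ_hp. The numbers are hand-picked toy values and the helpers `to_bf16` and `rowsum_delta` are hypothetical names; this is an illustration of the definition, not a reproduction of the paper's experiments.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a finite float to the nearest bfloat16 value (ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def rowsum_delta(d_o, o):
    """delta[t] = sum_j dO[t][j] * O[t][j], i.e. rowsum(dO ◦ O)."""
    return [sum(g * y for g, y in zip(g_row, o_row)) for g_row, o_row in zip(d_o, o)]


# Toy upstream gradient and output (assumed values, chosen to be exact in binary).
d_o = [[1.0, 2.0], [0.5, -1.0]]
o_hp = [[1.00390625, 3.0], [2.0, 1.01171875]]
o_lp = [[to_bf16(y) for y in row] for row in o_hp]  # low-precision copy of O

delta_hp = rowsum_delta(d_o, o_hp)
delta_lp = rowsum_delta(d_o, o_lp)
error = [lp - hp for lp, hp in zip(delta_lp, delta_hp)]
print(error)  # [-0.00390625, -0.00390625]
```

In row 0 an entry of O rounds down and meets a positive gradient; in row 1 an entry rounds up and meets a negative gradient. Both rows therefore end up with an error of the same sign, which is the kind of structure that prevents cancellation across tokens.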
Figure 4: PK, X, and (PK)[T]^⊤ X[T] at different batch indices and training steps. (c) and (f) show that (PK)[T]^⊤ X[T] for different tokens and training steps have some similar columns in input features 546 and 678.
… training steps 6610 and 6619, respectively. Because these rank-1 error components are structurally consistent, we can approximate the total gradient difference as dW^Q_hp − dW^Q_lp ≈ α Σ_{T=1}^{N} (δ_lp − δ_hp)[T] …
Figure 5: Analysis of δ = rowsum(dO ◦ O).
[Plot panels: "Visualization of P[T, :] and V[:, i]" (x-axis: Tokens; y-axes: P, V), (a) Most V[:, i] are negative; "Visualization of P[T, :] and O_error(t)" (x-axis: Tokens t; y-axes: P, O_error(t)), (b) Large error when P̄[T, i] = 1; remaining panels truncated …]
Figure 6: Analysis of P̄V.
… upstream gradient dO and the numerical error in the low-precision output, O_lp − O_hp. To dissect this, we focus on a token position T = 718 where the error component (δ_lp − δ_hp)[T] is positive. In …
Figure 7: Comparison of the stabilized FA and the original FA.
This modification prevents elements of P̄ from becoming exactly 1. If a row's maximum value, r_m, is positive and repeated, the normalization factor is adjusted to m = βr_m (with β > 1). This makes the new maximum in the exponent −(β − 1)r_m, which is strictly negative. If r_m is negative and repeated, we set m = 0, which also ensures the exponent's maximum …
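
The rule quoted in this excerpt can be sketched for a single row of scores as follows. This is a minimal reading of the excerpt, not the authors' kernel: the function names are hypothetical, β = 2.0 is an arbitrary assumed value, the tiling and online accumulation of Flash Attention are omitted, and the case of a repeated maximum equal to zero is left unchanged because the excerpt does not cover it.

```python
import math


def stabilized_row_max(scores, beta=2.0):
    """Pick the shift m so that a repeated row maximum cannot give exp(s - m) == 1."""
    r_m = max(scores)
    repeated = sum(1 for s in scores if s == r_m) > 1
    if repeated and r_m > 0:
        return beta * r_m  # largest exponent becomes -(beta - 1) * r_m < 0
    if repeated and r_m < 0:
        return 0.0         # largest exponent becomes r_m < 0
    return r_m             # unique maximum (or r_m == 0): standard shift


def softmax_with_shift(scores, m):
    """Return the unnormalized P_bar = exp(s - m) and the normalized softmax."""
    p_bar = [math.exp(s - m) for s in scores]
    total = sum(p_bar)
    return p_bar, [p / total for p in p_bar]


scores = [2.0, 2.0, 1.0]
m = stabilized_row_max(scores)             # 4.0 with beta = 2.0
p_bar, probs = softmax_with_shift(scores, m)
_, reference = softmax_with_shift(scores, max(scores))
print(max(p_bar) < 1.0)                    # True: no entry of P_bar equals 1
print(all(abs(a - b) < 1e-12 for a, b in zip(probs, reference)))  # True
```

Because the shift cancels in the normalization, the softmax output is mathematically unchanged; only the intermediate values of P̄ move away from exactly 1.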
Figure 8: Loss curves of two independent runs of GPT-2 training with flash attention in BF16
Figure 9: Spectral norm across layers and training steps.
Figure 10: Token difference visualization
Abstract

The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training with flash attention in low-precision settings leads to catastrophic loss explosion. Our in-depth analysis reveals that the failure is not a random artifact but caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to the flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem. Code is available at https://github.com/ucker/why-low-precision-training-fails.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to provide the first mechanistic explanation for catastrophic loss explosion when training transformers with Flash Attention in low-precision arithmetic. It attributes the failure to two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors in low-precision operations. These factors are said to form a vicious cycle that corrupts weight updates. The authors introduce a minimal modification to Flash Attention that mitigates the rounding bias and report that this change stabilizes training.

Significance. If the causal analysis is substantiated, the work supplies a concrete mechanistic account of a known practical failure mode in efficient transformer training together with a simple, deployable fix. The public code release is a strength that enables direct verification and extension.

major comments (2)
  1. [§4.2] §4.2 and the associated stabilization experiment: the paper shows that the rounding-bias mitigation restores stable training, yet does not report an intervention that selectively disrupts low-rank similarity in the attention keys/queries while leaving the rounding bias intact (or the converse). Without such a disambiguation, the mutual-reinforcement claim remains correlational rather than demonstrably causal.
  2. [§3.1] §3.1, the low-rank representation analysis: the reported cosine similarities and singular-value spectra are consistent with collapse, but the manuscript does not quantify the downstream effect on gradient magnitude or provide a bound showing that this similarity is sufficient to drive the observed loss explosion independent of other low-precision matmul dynamics.
minor comments (2)
  1. [Figure 3] Figure 3: axis labels and color legends are insufficiently descriptive; readers cannot immediately distinguish the low-precision versus high-precision curves without consulting the caption.
  2. [§2.2] The notation for the online softmax accumulation in FlashAttention (around Eq. (2)) re-uses symbols that were previously defined for the full-precision case; a short clarifying sentence would prevent confusion.
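
For readers unfamiliar with the online softmax accumulation mentioned in the last minor comment, a generic sketch follows. It uses the standard running-max and running-sum recurrence on a single query row with scalar values; the variable names `m`, `l`, and `o` and the block size are assumptions and do not reproduce the notation of the paper's Eq. (2).

```python
import math


def online_softmax_weighted_sum(scores, values, block=2):
    """Compute sum_i softmax(scores)_i * values_i one block at a time."""
    m = float("-inf")  # running maximum of the scores seen so far
    l = 0.0            # running sum of exp(score - m)
    o = 0.0            # running unnormalized output
    for start in range(0, len(scores), block):
        s_blk = scores[start:start + block]
        v_blk = values[start:start + block]
        m_new = max(m, max(s_blk))
        scale = math.exp(m - m_new)  # rescales old accumulators; 0.0 on the first block
        p = [math.exp(s - m_new) for s in s_blk]
        l = l * scale + sum(p)
        o = o * scale + sum(pi * vi for pi, vi in zip(p, v_blk))
        m = m_new
    return o / l


def direct_softmax_weighted_sum(scores, values):
    """Reference: ordinary softmax over the whole row, then the weighted sum."""
    top = max(scores)
    w = [math.exp(s - top) for s in scores]
    return sum(wi * vi for wi, vi in zip(w, values)) / sum(w)


scores = [0.5, 2.0, -1.0, 3.0, 1.5]
values = [1.0, -2.0, 0.5, 4.0, -1.0]
online = online_softmax_weighted_sum(scores, values)
direct = direct_softmax_weighted_sum(scores, values)
print(abs(online - direct) < 1e-12)  # True
```

The two results agree up to floating-point error because each rescaling by `exp(m - m_new)` re-expresses the earlier partial sums relative to the new maximum.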

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive suggestions. The comments correctly identify opportunities to strengthen the causal interpretation of our results. We address each point below and have revised the manuscript with additional discussion and supporting analysis.

Point-by-point responses
  1. Referee: [§4.2] §4.2 and the associated stabilization experiment: the paper shows that the rounding-bias mitigation restores stable training, yet does not report an intervention that selectively disrupts low-rank similarity in the attention keys/queries while leaving the rounding bias intact (or the converse). Without such a disambiguation, the mutual-reinforcement claim remains correlational rather than demonstrably causal.

    Authors: We agree that an orthogonal intervention isolating low-rank similarity from rounding bias would provide stronger causal evidence. Designing such an experiment without inadvertently altering rounding behavior or other low-precision dynamics has proven difficult in our setup, as the collapse emerges from the joint training process. Nevertheless, the fact that mitigating rounding bias alone stabilizes training—while low-rank similarity is still observed—indicates that the bias is necessary for the observed explosion. In the revised manuscript we have expanded §4.2 to explicitly characterize the current evidence as supporting a mutual-reinforcement mechanism while acknowledging its correlational character and outlining possible future disambiguation approaches. revision: partial

  2. Referee: [§3.1] §3.1, the low-rank representation analysis: the reported cosine similarities and singular-value spectra are consistent with collapse, but the manuscript does not quantify the downstream effect on gradient magnitude or provide a bound showing that this similarity is sufficient to drive the observed loss explosion independent of other low-precision matmul dynamics.

    Authors: We accept that explicit quantification of the effect on gradients would strengthen the section. The revised manuscript now includes additional plots in §3.1 that track the relationship between rising key/query cosine similarity and the growth of attention gradient norms across training steps in the unstable low-precision runs. Deriving a tight theoretical bound that isolates representation similarity from the full suite of low-precision matrix-multiplication effects is technically involved and lies outside the scope of the present study; we have added a concise discussion of this limitation together with the empirical support provided by the stabilization experiment. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the mechanistic analysis

The paper presents an empirical mechanistic explanation for training instability in low-precision Flash Attention, identifying low-rank attention representations and biased rounding errors as intertwined causes of a vicious cycle. Validation comes from observing these phenomena and testing a minimal rounding-bias mitigation that stabilizes training. No load-bearing steps reduce by construction to fitted parameters, self-definitions, or self-citation chains; the central claim rests on direct observation and intervention rather than renaming or importing uniqueness from prior author work. The derivation is self-contained against external benchmarks of empirical reproduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The analysis rests on standard assumptions about floating-point rounding behavior and attention matrix properties; no new free parameters, invented entities, or ad-hoc axioms are introduced in the abstract. Full details would be needed to audit any implicit modeling choices.

pith-pipeline@v0.9.0 · 5693 in / 1048 out tokens · 24072 ms · 2026-05-18T09:58:34.409469+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

35 extracted references · 35 canonical work pages · 13 internal anchors

  1. [1]

    Scalify: scale propagation for efficient low-precision llm training

    Paul Balança, Sam Hosegood, Carlo Luschi, and Andrew Fitzgibbon. Scalify: scale propagation for efficient low-precision llm training. arXiv preprint arXiv:2407.17353,

  2. [2]

    u-µp: The unit-scaled maximal update parametrization

    Charlie Blake, Constantin Eichenberg, Josef Dean, Lukas Balles, Luke Y Prince, Björn Deiseroth, Andres Felipe Cruz-Salinas, Carlo Luschi, Samuel Weinbach, and Douglas Orr. u-µp: The unit-scaled maximal update parametrization. arXiv preprint arXiv:2407.17465,

  3. [3]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901,

  4. [4]

    Scaling fp8 training to trillion-token llms

    Maxim Fishman, Brian Chmiel, Ron Banner, and Daniel Soudry. Scaling fp8 training to trillion-token llms. arXiv preprint arXiv:2409.12517,

  5. [5]

    Openwebtext corpus

    Aaron Gokaslan, Vanya Cohen, Ellie Pavlick, and Stefanie Tellex. Openwebtext corpus. http://Skylion007.github.io/OpenWebTextCorpus,

  6. [6]

    Is flash attention stable?

    Alicia Golden, Samuel Hsia, Fei Sun, Bilge Acun, Basil Hosmer, Yejin Lee, Zachary DeVito, Jeff Johnson, Gu-Yeon Wei, David Brooks, et al. Is flash attention stable? arXiv preprint arXiv:2405.02803,

  7. [7]

    Low-precision training of large language models: Methods, challenges, and opportunities

    Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Guoxia Wang, Dianhai Yu, Yonggang Wen, and Dacheng Tao. Low-precision training of large language models: Methods, challenges, and opportunities. arXiv preprint arXiv:2505.01043,

  8. [8]

    Query-key normalization for transformers

    Alex Henry, Prudhvi Raj Dachapally, Shubham Pawar, and Yuxuan Chen. Query-key normalization for transformers. arXiv preprint arXiv:2010.04245,

  9. [9]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556,

  10. [10]

    Spam: Spike-aware adam with momentum reset for stable llm training

    Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, and Shiwei Liu. Spam: Spike-aware adam with momentum reset for stable llm training. arXiv preprint arXiv:2501.06842,

  11. [11]

    A Study of BFLOAT16 for Deep Learning Training

    Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, et al. A study of bfloat16 for deep learning training. arXiv preprint arXiv:1905.12322,

  12. [12]

    Kimi K2: Open Agentic Intelligence

    Kimi-Team. Kimi k2: Open agentic intelligence. arXiv preprint arXiv:2507.20534,

  13. [13]

    DeepSeek-V3 Technical Report

    Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437,

  14. [14]

    Mixed Precision Training With 8-bit Floating Point

    Naveen Mellempudi, Sudarshan Srinivasan, Dipankar Das, and Bharat Kaul. Mixed precision training with 8-bit floating point. arXiv preprint arXiv:1905.12334,

  15. [15]

    Mixed Precision Training

    Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740,

  16. [16]

    FP8 Formats for Deep Learning

    Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, et al. Fp8 formats for deep learning. arXiv preprint arXiv:2209.05433,

  17. [17]

    A theory on adam instability in large-scale machine learning

    Igor Molybog, Peter Albert, Moya Chen, Zachary DeVito, David Esiobu, Naman Goyal, Punit Singh Koura, Sharan Narang, Andrew Poulton, Ruan Silva, et al. A theory on adam instability in large-scale machine learning. arXiv preprint arXiv:2304.09871,

  18. [19]

    nanoGPT Issue

    Accessed: 2025-09-07. nanoGPT Issue

  19. [20]

    8-bit numerical formats for deep neural networks

    Badreddine Noune, Philip Jones, Daniel Justus, Dominic Masters, and Carlo Luschi. 8-bit numerical formats for deep neural networks. arXiv preprint arXiv:2206.02915,

  20. [21]

    Fp8-lm: Training fp8 large language models

    Houwen Peng, Kan Wu, Yixuan Wei, Guoshuai Zhao, Yuxiang Yang, Ze Liu, Yifan Xiong, Ziyue Yang, Bolin Ni, Jingcheng Hu, et al. Fp8-lm: Training fp8 large language models. arXiv preprint arXiv:2310.18313,

  21. [22]

    Training and inference of large language models using 8-bit floating point

    Sergio P Perez, Yan Zhang, James Briggs, Charlie Blake, Josh Levy-Kramer, Paul Balanca, Carlo Luschi, Stephen Barlow, and Andrew William Fitzgibbon. Training and inference of large language models using 8-bit floating point. arXiv preprint arXiv:2309.17224,

  22. [23]

    Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free

    Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708,

  23. [24]

    Qwen3 Technical Report

    URL https://arxiv.org/abs/2505.09388. Alec Radford, Jeff Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever. Language models are unsupervised multitask learners

  24. [25]

    Scaling Language Models: Methods, Analysis & Insights from Training Gopher

    Jack W Rae, Sebastian Borgeaud, Trevor Cai, Katie Millican, Jordan Hoffmann, Francis Song, John Aslanides, Sarah Henderson, Roman Ring, Susannah Young, et al. Scaling language models: Methods, analysis & insights from training gopher. arXiv preprint arXiv:2112.11446,

  25. [26]

    Methods of improving llm training stability

    Oleg Rybakov, Mike Chrzanowski, Peter Dykas, Jinze Xue, and Ben Lanir. Methods of improving llm training stability. arXiv preprint arXiv:2410.16682,

  26. [27]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971,

  27. [28]

    Training llms with mxfp4

    Albert Tseng, Tao Yu, and Youngsuk Park. Training llms with mxfp4. arXiv preprint arXiv:2502.20586,

  28. [29]

    Optimizing large language model training using fp4 quantization

    Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zhengjun Zha, and Peng Cheng. Optimizing large language model training using fp4 quantization. arXiv preprint arXiv:2501.17116,

  29. [30]

    Stable and low-precision training for large-scale vision-language models

    Mitchell Wortsman, Tim Dettmers, Luke Zettlemoyer, Ari S. Morcos, Ali Farhadi, and Ludwig Schmidt. Stable and low-precision training for large-scale vision-language models. In Thirty-seventh Conference on Neural Information Processing Systems,

  30. [31]

    Efficient Streaming Language Models with Attention Sinks

    URL https://openreview.net/forum?id=sqqASmpA2R. Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453,

  31. [32]

    Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer

    Greg Yang, Edward J Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jianfeng Gao. Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466,

  32. [33]

    A spectral condition for feature learning

    Greg Yang, James B Simon, and Jeremy Bernstein. A spectral condition for feature learning. arXiv preprint arXiv:2310.17813,

  33. [34]

    Towards efficient pre-training: Exploring fp4 precision in large language models

    Jiecheng Zhou, Ding Tang, Rong Fu, Boni Hu, Haoran Xu, Yi Wang, Zhilin Pei, Zhongling Su, Liang Liu, Xingcheng Zhang, et al. Towards efficient pre-training: Exploring fp4 precision in large language models. arXiv preprint arXiv:2502.11458,

  34. [35]

    gradient spikes

    A RELATED WORK. A.1 MIXED-PRECISION BF16 TRAINING. Contemporary large language model (LLM) pretraining almost universally employs mixed-precision arithmetic. Early efforts by Micikevicius et al. (2017) demonstrated that FP16 training, using an FP32 master copy of weights and fixed loss scaling, could match FP32 accuracy for many models. However, t...

  35. [36]

    ... is a robust choice, as it ensures the maximum value in the exponent remains sufficiently negative. Algorithm 1: Stabilized Flash Attention by Mitigating Biased Rounding Error: Forward Pass. Require: Matrices Q, K, V ∈ R^{N×d}, block sizes B_c, B_r, β > 1. 1: Divide Q into T_r = ⌈N/B_r⌉ blocks Q_1, ..., Q_{T_r} of size B_r × d each, and divide K, V into T_c = ⌈N/B_c⌉ blocks K_1, ....