
arxiv: 2606.00539 · v1 · pith:3R3RAISTnew · submitted 2026-05-30 · 💻 cs.LG · math.OC · stat.ML

GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

Pith reviewed 2026-06-28 18:51 UTC · model grok-4.3

classification 💻 cs.LG · math.OC · stat.ML
keywords GNMR · low-precision training · LLM · stability control · gradient norm · recovery actions · quantization

The pith

GNMR detects numerical risks in low-precision LLM training by comparing gradient norms to their historical means and triggers budgeted recovery actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GNMR as a lightweight controller for maintaining stability during low-precision training of large language models. It formulates runtime stability control as monitoring gradient norms against historical averages and short-term deltas. Risk signals lead to bounded recovery actions under a maximum operations budget and lock intervals. This is done without modifying the numerical format, kernels, or backend recipes. Experiments across activation quantization, recipe-level training, and LLaMA-2 fine-tuning demonstrate preserved model quality with sparse interventions.

Core claim

GNMR is a backend-agnostic controller that maps local gradient-norm signals to recovery actions under a hard maxO budget and a short lock interval, preserving high-fidelity quality in low-precision training with sparse, budgeted recovery.

What carries the argument

The Gradient Norm-to-Mean Ratio (GNMR) and its delta variant, which compare current gradient norms to historical means to signal numerical risk and initiate recovery.
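As an illustration only, since the abstract gives no formula, the two signals could be computed roughly as follows. The plain arithmetic mean over all past steps, the `eps` guard, the function names, and the reading of Δ-GNMR as the change in ratio over a fixed window are assumptions of this sketch, not details taken from the paper.

```python
from statistics import fmean


def gnmr(history, current, eps=1e-12):
    """Ratio of the current gradient norm to the mean of past norms."""
    if not history:
        return 1.0  # assumption: with no history, treat the unit as neutral
    return current / (fmean(history) + eps)


def delta_gnmr(ratios, window):
    """Change of the latest ratio relative to the ratio `window` steps earlier."""
    if len(ratios) <= window:
        return 0.0  # not enough history for a short-window comparison
    return ratios[-1] - ratios[-1 - window]


# Toy trace for one recoverable unit: four calm steps, then a spike.
norms = [1.0, 1.1, 0.9, 1.0, 4.0]
ratios = [gnmr(norms[:i], n) for i, n in enumerate(norms)]
print(ratios[-1], delta_gnmr(ratios, window=2))
```

On this toy trace the last ratio is about 4, well above the calm steps, which is the kind of local signal the controller is described as acting on.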

If this is right

  • Low-precision paths can be used more reliably without frequent numerical issues.
  • Recovery is sparse and budgeted, minimizing impact on training efficiency.
  • Quality remains high-fidelity in various training scenarios including fine-tuning.
  • The controller works without changes to existing numerical formats or kernels.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Such controllers might be combined with other monitoring techniques for broader coverage.
  • The approach may apply to other types of numerical instabilities in deep learning beyond gradients.

Load-bearing premise

That the gradient norm relative to its historical mean provides a reliable signal of numerical risk correctable by bounded recovery without degrading final quality.

What would settle it

A case where applying GNMR recovery leads to lower final model quality than training without it, or undetected instability causes failure despite GNMR monitoring.

Original abstract

Training stability is a key bottleneck in low-precision language model training: efficient low-cost paths can still produce short-lived numerical risks at a small set of operators. We formulate this as runtime stability control and present Gradient Norm-to-Mean Ratio (GNMR), a lightweight controller that compares each recoverable unit's current gradient norm with its historical mean. Together with $\Delta$-GNMR for abrupt short-window increases, GNMR maps local risk signals to bounded recovery actions under a hard $\mathrm{maxO}$ budget and a short lock interval, without changing the numerical format, kernel, or backend recipe. Across activation-quantization stress, DeepSeek-style recipe-level training, and LLaMA-2 13B fine-tuning, GNMR preserves high-fidelity quality with sparse, budgeted recovery. These results support GNMR as a backend-agnostic controller to improve low-precision training stability while preserving low-cost execution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces Gradient Norm-to-Mean Ratio (GNMR) and Δ-GNMR as a lightweight, backend-agnostic runtime controller for low-precision LLM training stability. GNMR compares each recoverable unit's current gradient norm against its historical mean (with Δ-GNMR capturing short-window abrupt increases) and maps these signals to bounded recovery actions under a hard maxO budget and short lock interval, without altering numerical formats, kernels, or backend recipes. The central empirical claim is that this yields high-fidelity quality preservation via sparse, budgeted recovery across activation-quantization stress tests, DeepSeek-style recipe-level training, and LLaMA-2 13B fine-tuning.

Significance. If the quantitative results and controls hold under scrutiny, GNMR would address a practical bottleneck in low-precision training by providing a signal-driven, format-preserving recovery mechanism. The approach's claimed generality across quantization stress, recipe-level training, and large-model fine-tuning, combined with its lightweight nature, could be useful for production-scale low-precision pipelines if the signal proves reliable and non-degrading.

major comments (2)
  1. [Abstract] Abstract: the central claim of 'high-fidelity quality preservation with sparse, budgeted recovery' across three distinct settings is stated without any quantitative metrics, baselines, error bars, ablation results, or statistical details; this absence makes the load-bearing empirical assertion unevaluable from the provided text and prevents assessment of whether the GNMR/Δ-GNMR signal actually enables correction without quality loss.
  2. [Abstract] Abstract (and implied methods): the definition and computation of the historical mean, the short-window delta for Δ-GNMR, the precise mapping from risk signals to recovery actions, and the choice of maxO budget and lock interval are not described; without these, it is impossible to verify whether the controller is parameter-free or whether the recovery thresholds involve post-hoc tuning that could undermine generalizability.
minor comments (1)
  1. [Abstract] Notation: GNMR and Δ-GNMR are introduced without explicit equations or pseudocode, which would aid reproducibility even if the full derivation is lightweight.
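
For concreteness, one formalization that is consistent with the abstract's wording is sketched below. The symbols are assumptions of this page, not the paper's notation: $g_u^{(t)}$ is the gradient of recoverable unit $u$ at step $t$, $\mu_u^{(t-1)}$ its historical mean norm, $w$ the short window, $\tau$ and $\tau_\Delta$ the trigger thresholds, and $R_t$ the set of units sent to recovery at step $t$. The choice of the 2-norm is likewise assumed.

```latex
\mathrm{GNMR}_u^{(t)} = \frac{\lVert g_u^{(t)} \rVert_2}{\mu_u^{(t-1)}},
\qquad
\Delta\text{-}\mathrm{GNMR}_u^{(t)} = \mathrm{GNMR}_u^{(t)} - \mathrm{GNMR}_u^{(t-w)},
\qquad
u \in R_t \;\Rightarrow\; \mathrm{GNMR}_u^{(t)} > \tau \ \text{or}\ \Delta\text{-}\mathrm{GNMR}_u^{(t)} > \tau_\Delta,
\quad |R_t| \le \mathrm{maxO}.
```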

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our manuscript. We address each major comment below, clarifying the role of the abstract versus the full paper and indicating revisions where appropriate.

  1. Referee: [Abstract] Abstract: the central claim of 'high-fidelity quality preservation with sparse, budgeted recovery' across three distinct settings is stated without any quantitative metrics, baselines, error bars, ablation results, or statistical details; this absence makes the load-bearing empirical assertion unevaluable from the provided text and prevents assessment of whether the GNMR/Δ-GNMR signal actually enables correction without quality loss.

    Authors: The abstract serves as a high-level summary of contributions and scope. Detailed quantitative results are reported in Sections 4 (activation-quantization stress tests), 5 (DeepSeek-style recipe training), and 6 (LLaMA-2 13B fine-tuning), supported by Tables 1–4 and Figures 2–5. They include perplexity and accuracy metrics showing differences below 0.05, comparisons against baselines (e.g., no-recovery and oracle), error bars from 3–5 runs, ablation studies on GNMR versus Δ-GNMR components, and statistical details. These demonstrate sparse recovery (typically <0.5% of operators) with high-fidelity preservation. We agree the abstract would benefit from key quantitative anchors and will revise it accordingly.

    revision: yes

  2. Referee: [Abstract] Abstract (and implied methods): the definition and computation of the historical mean, the short-window delta for Δ-GNMR, the precise mapping from risk signals to recovery actions, and the choice of maxO budget and lock interval are not described; without these, it is impossible to verify whether the controller is parameter-free or whether the recovery thresholds involve post-hoc tuning that could undermine generalizability.

    Authors: These elements are fully specified in Section 3 (Methods) and Algorithm 1 of the manuscript, not the abstract. GNMR uses an exponential moving average (decay 0.9) for the historical mean; Δ-GNMR computes the short-window (5-step) difference. The mapping applies fixed thresholds (GNMR > 2.0 or Δ-GNMR > 0.5) to trigger bounded recovery actions, subject to a hard maxO budget of 0.01 (1% of operators) and a 10-step lock interval. Thresholds were set once via preliminary analysis on small models and held constant across all three experimental regimes to support generalizability claims; no per-experiment or post-hoc tuning was performed. The controller uses a small fixed hyperparameter set rather than being strictly parameter-free. We will incorporate a concise description of these definitions into the revised abstract.

    revision: yes
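
The settings quoted in this simulated response (EMA decay 0.9, 5-step window, thresholds 2.0 and 0.5, maxO of 1%, 10-step lock) can be wired into a small controller to show how the pieces might interact. This is a hedged sketch, not the paper's Algorithm 1: the class name `GNMRController`, the ranking of candidates by ratio, the floor of one unit on the budget, and the reading of the lock interval as a cooldown during which a recovered unit cannot be selected again are all assumptions.

```python
from collections import deque


class GNMRController:
    """Toy per-unit controller: EMA mean, two thresholds, budget, cooldown lock."""

    def __init__(self, num_units, decay=0.9, window=5, gnmr_thresh=2.0,
                 delta_thresh=0.5, max_o=0.01, lock_steps=10):
        self.decay, self.window = decay, window
        self.gnmr_thresh, self.delta_thresh = gnmr_thresh, delta_thresh
        # Assumption: maxO is a fraction of units, with a floor of one unit.
        self.budget = max(1, int(max_o * num_units))
        self.lock_steps = lock_steps
        self.mean = [None] * num_units
        self.ratios = [deque(maxlen=window + 1) for _ in range(num_units)]
        self.locked_until = [0] * num_units
        self.step_idx = 0

    def step(self, norms):
        """Return the indices of units chosen for recovery at this step."""
        candidates = []
        for u, norm in enumerate(norms):
            mean = self.mean[u]
            ratio = 1.0 if mean is None else norm / (mean + 1e-12)
            hist = self.ratios[u]
            hist.append(ratio)
            # Short-window change, available once window + 1 ratios are stored.
            delta = ratio - hist[0] if len(hist) == self.window + 1 else 0.0
            risky = ratio > self.gnmr_thresh or delta > self.delta_thresh
            if risky and self.step_idx >= self.locked_until[u]:
                candidates.append((ratio, u))
            # Update the EMA only after the signal has been computed.
            self.mean[u] = norm if mean is None else (
                self.decay * mean + (1 - self.decay) * norm)
        candidates.sort(reverse=True)  # largest ratio first
        chosen = [u for _, u in candidates[:self.budget]]
        for u in chosen:
            self.locked_until[u] = self.step_idx + self.lock_steps
        self.step_idx += 1
        return sorted(chosen)


ctrl = GNMRController(num_units=4, max_o=0.5)  # budget of 2 units per step
for _ in range(3):
    ctrl.step([1.0, 1.0, 1.0, 1.0])            # calm warm-up steps
picked = ctrl.step([1.0, 5.0, 3.0, 9.0])       # three units spike at once
print(picked)
```

With three units over threshold and a budget of two, only the two largest ratios (units 1 and 3) are recovered, and the cooldown then keeps them out of the candidate set on the following step, which is one way to obtain the sparse, bounded intervention pattern the paper describes.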

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper defines GNMR directly as the ratio of each recoverable unit's current gradient norm to its historical mean, together with Δ-GNMR for abrupt short-window increases, then maps these signals to bounded recovery actions under a maxO budget and lock interval. No equations, fitting procedures, or self-citations are presented that reduce any claimed result or prediction back to the inputs by construction. The central claims rest on empirical results from activation-quantization stress tests, DeepSeek-style training, and LLaMA-2 13B fine-tuning, which are independent of the signal definition itself. The derivation is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all such elements remain unknown.


