AdaGC: Enhancing LLM Pretraining Stability via Adaptive Gradient Clipping

Congliang Chen; Dianhai Yu; Guoxia Wang; JiaBin Yang; Jinle Zeng; Li Shen; Shuai Li; Yanjun Ma

arxiv: 2502.11034 · v3 · pith:XMKKDZROnew · submitted 2025-02-16 · 💻 cs.LG

Guoxia Wang, Shuai Li, Congliang Chen, Jinle Zeng, Jiabin Yang, Dianhai Yu, Yanjun Ma, Li Shen

classification 💻 cs.LG

keywords adagc · spikes · gradient · loss · accuracy · adaptive · average · cause

Loss spikes remain a persistent obstacle in large-scale language model pretraining. While previous research has attempted to identify the root cause of loss spikes by investigating individual factors, we observe that, in practice, such spikes are typically triggered by the confluence of heterogeneous factors. Empirically, loss spikes may arise from a combination of data outliers, hardware or transient computational faults, numerical precision issues, and hyperparameter settings. Regardless of the underlying cause, these spikes manifest as unstable optimizer updates, as abnormal gradients contaminate both first- and second-moment states. In this paper, we propose a principled gradient-centric remedy: AdaGC, an adaptive per-tensor gradient clipping scheme that mitigates such contamination by bounding gradient norms relative to a tensor-wise exponential moving average of their historical clipped values. AdaGC is optimizer-agnostic, introduces negligible memory overhead, and reduces communication costs compared to GlobalGC, particularly in hybrid-parallel distributed training. Experiments on Llama-2 7B, Mixtral 8x1B, and ERNIE 10B-A1.4B demonstrate that AdaGC robustly eliminates training instabilities, consistently reducing spike scores to zero for all models and improving downstream accuracy over GlobalGC by 1.32%, 1.27%, and 2.48%, respectively. Furthermore, AdaGC seamlessly integrates with optimizers such as Muon and Lion, consistently yielding higher average accuracy and zero spike scores. The code is available at https://github.com/PaddlePaddle/PaddleFleet (see Research/AdaGC).
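
The clipping rule described in the abstract, bounding each tensor's gradient norm relative to an exponential moving average (EMA) of its past clipped norms, can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `AdaptiveTensorClipper`, the parameters `rel_threshold` and `ema_decay`, their default values, and the handling of the first step are all hypothetical, since the abstract does not specify them.

```python
import math


class AdaptiveTensorClipper:
    """Per-tensor clipping against an EMA of past clipped gradient norms.

    Hypothetical sketch: names and default values are assumptions,
    not taken from the paper or its released code.
    """

    def __init__(self, rel_threshold=1.05, ema_decay=0.98):
        self.rel_threshold = rel_threshold  # allowed ratio of norm to EMA (assumed)
        self.ema_decay = ema_decay          # EMA smoothing factor (assumed)
        self.ema_norm = {}                  # one scalar per tensor name

    def clip(self, name, grad):
        """Return a clipped copy of `grad` (a flat list of floats)."""
        norm = math.sqrt(sum(g * g for g in grad))
        prev = self.ema_norm.get(name)

        if prev is None:
            # No history yet: accept the gradient and seed the EMA (assumed behavior).
            self.ema_norm[name] = norm
            return list(grad)

        limit = self.rel_threshold * prev
        if norm > limit:
            # Rescale so the tensor's norm equals the limit.
            scale = limit / norm
            clipped = [g * scale for g in grad]
            clipped_norm = norm * scale
        else:
            clipped = list(grad)
            clipped_norm = norm

        # Track the clipped norm, so a spike cannot inflate the history.
        self.ema_norm[name] = (
            self.ema_decay * prev + (1.0 - self.ema_decay) * clipped_norm
        )
        return clipped


# Each tensor keeps its own history, so a spike in one does not affect another.
clipper = AdaptiveTensorClipper(rel_threshold=2.0, ema_decay=0.5)
clipper.clip("layer0.weight", [3.0, 4.0])             # seeds the EMA
spiky = clipper.clip("layer0.weight", [30.0, 40.0])   # abnormal gradient, rescaled
```

The only state is one scalar per tensor, which is consistent with the abstract's claim of negligible memory overhead. Because each norm is computed per tensor, the sketch needs no reduction across all parameters, unlike a global-norm clip. A real training setup would also need a policy for the earliest steps, where the EMA has little history; the sketch simply accepts the first gradient unchanged.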

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SimReg: Achieving Higher Performance in the Pretraining via Embedding Similarity Regularization
cs.CL 2026-05 unverdicted novelty 5.0

SimReg regularization accelerates LLM pretraining convergence by over 30% and raises average zero-shot performance by over 1% across benchmarks.