GradientStabilizer: Fix the Norm, Not the Gradient

Cited on page 1 · 2025 · cs.LG · arXiv 2502.17055

2 Pith papers cite this work. Polarity classification is still indexing.


abstract

Training instability in modern deep learning systems is frequently triggered by rare but extreme gradient-norm spikes, which can induce oversized parameter updates, corrupt optimizer state, and lead to slow recovery or divergence. Widely used safeguards such as gradient clipping mitigate these failures but require threshold tuning and indiscriminately truncate large updates. We propose GradientStabilizer, a lightweight, drop-in gradient transform that preserves the instantaneous gradient direction while replacing the update magnitude with a statistically stabilized estimate derived from running gradient-norm statistics. We prove that the resulting stabilized magnitude is uniformly bounded on spike steps, independent of the spike size, and show how this boundedness controls optimizer state evolution in adaptive methods. Across LLM pre-training (FP16), quantization-aware pre-training (FP4), ImageNet classification, reinforcement learning, and time-series forecasting, GradientStabilizer consistently improves training stability, widens stable learning-rate regions, and reduces divergence relative to clipping-based baselines, even substantially reducing Adam's sensitivity to weight-decay strength. Code will be released soon.
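The abstract does not give the transform's formula, so the following is only a minimal sketch of the idea it describes: keep the gradient's direction and replace its magnitude with one computed from running gradient-norm statistics. The class name NormStabilizer, the Adam-style moving averages of the norm, and the defaults beta1, beta2, and eps are all assumptions made for illustration, not the paper's definition.

```python
import math


class NormStabilizer:
    """Hypothetical sketch: rescale a gradient using running norm statistics.

    Assumption: the stabilized magnitude is a bias-corrected moving average
    of the gradient norm divided by the root of a bias-corrected moving
    average of the squared norm. The paper's actual estimator may differ.
    """

    def __init__(self, beta1=0.9, beta2=0.999, eps=1e-8):
        self.beta1, self.beta2, self.eps = beta1, beta2, eps
        self.m = 0.0  # moving average of gradient norms
        self.v = 0.0  # moving average of squared gradient norms
        self.t = 0    # step counter for bias correction

    def __call__(self, grad):
        norm = math.sqrt(sum(g * g for g in grad))
        self.t += 1
        self.m = self.beta1 * self.m + (1 - self.beta1) * norm
        self.v = self.beta2 * self.v + (1 - self.beta2) * norm * norm
        m_hat = self.m / (1 - self.beta1 ** self.t)
        v_hat = self.v / (1 - self.beta2 ** self.t)
        magnitude = m_hat / (math.sqrt(v_hat) + self.eps)
        # Unit direction times stabilized magnitude: direction is unchanged.
        scale = magnitude / (norm + self.eps)
        return [g * scale for g in grad]


# Usage: 100 ordinary steps with norm 0.5, then a spike a million times larger.
stab = NormStabilizer()
for _ in range(100):
    stab([0.3, 0.4])
spiked = stab([3e5, 4e5])
spike_norm = math.sqrt(sum(g * g for g in spiked))
print(spike_norm)
```

In this sketch a spike inflates both running statistics at once, so their ratio, and with it the size of the returned gradient, does not grow with the size of the spike. The direction of the input gradient is unchanged.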

representative citing papers

Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

cs.LG · 2025-06-20 · conditional · novelty 6.0

SCALE matches Adam performance in LLM pretraining from 60M to 7B parameters by combining column-wise gradient normalization with last-layer-only momentum, using 35-45% of Adam's memory.

GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

cs.LG · 2026-05-30 · unverdicted · novelty 5.0

GNMR is a gradient-norm-based controller that maps local stability signals to budgeted recovery actions to stabilize low-precision LLM training while preserving quality.

citing papers explorer


Memory-Efficient LLM Pretraining via Minimalist Optimizer Design

cs.LG · 2025-06-20 · conditional · none · ref 4 · internal anchor

GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

cs.LG · 2026-05-30 · unverdicted · none · ref 20 · internal anchor
