Can Entry-Wise Clipping Give Spectral Control of Stochastic Gradients?

Brian Bullins; Cedar Site Bai; David F. Gleich; Zhe Zhang; Zitao Song

arxiv: 2605.27733 · v1 · pith:JAQ2J35Tnew · submitted 2026-05-26 · 💻 cs.LG

Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, David F. Gleich

Pith reviewed 2026-06-29 18:23 UTC · model grok-4.3

Classification: 💻 cs.LG

Keywords: stochastic gradients, gradient clipping, spectral normalization, heavy-tailed noise, convergence guarantees, Adam optimizer, language model pretraining

The pith

Entry-wise clipping can achieve spectral control of stochastic gradients by exploiting noise localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Training instabilities often arise from heavy-tailed stochastic gradient noise that survives mini-batch averaging in language models. The paper argues that this noise resembles entry-wise heavy-tailed contamination rather than vector-structured noise, and that a first-order perturbation analysis reveals a localization property under which simple per-entry operations control the spectrum of the gradient matrix. From this the authors derive a tractable shrinkage rule as a surrogate for the Bayes-optimal entry-wise estimator under a Gaussian signal prior. The resulting method carries an O(ε^{-4}) convergence guarantee under Cauchy-contaminated noise and, on NanoGPT pretraining, saves about 7% of training tokens with Adam and about 2% more when applied before Muon's spectral normalization.

Core claim

Real gradient noise appears to be similar to entry-wise heavy-tailed contamination, and a first-order perturbation analysis reveals a localization property of such noise, under which a simple entry-wise method achieves spectral control. Exploiting this, we derive a tractable surrogate for the Bayes-optimal entry-wise estimator under a Gaussian signal prior. We establish an O(ε^{-4}) convergence guarantee under Cauchy-contaminated noise.

What carries the argument

Localization property of entry-wise heavy-tailed noise under first-order perturbation analysis, which enables entry-wise clipping to achieve spectral control of the gradient matrix.
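To make the link between single entries and the spectrum concrete, here is a minimal NumPy sketch (not the authors' code). It relies only on two standard facts: the spectral norm of a matrix is at least its largest absolute entry and at most its Frobenius norm. The matrix size, the threshold `tau`, and the helper name `clip_entries` are illustrative assumptions, and hard clipping stands in for the paper's smooth shrinkage rule, which is not reproduced on this page.

```python
import numpy as np

def clip_entries(M, tau):
    """Hard entry-wise clipping to [-tau, tau] (illustrative stand-in)."""
    return np.clip(M, -tau, tau)

rng = np.random.default_rng(0)
m, n, tau = 64, 48, 3.0

# Toy noise model (assumption): light-tailed background plus one huge entry.
E = rng.standard_normal((m, n))
E[5, 7] = 1e3

spec_raw = np.linalg.norm(E, 2)                         # top singular value
spec_clipped = np.linalg.norm(clip_entries(E, tau), 2)  # after clipping

# Standard bounds: max|E_ij| <= ||E||_2 <= ||E||_F, and after clipping
# every entry is at most tau, so ||clip(E)||_F <= tau * sqrt(m * n).
print(f"raw spectral norm:     {spec_raw:.1f}")
print(f"clipped spectral norm: {spec_clipped:.1f}")
print(f"clipped upper bound:   {tau * np.sqrt(m * n):.1f}")
```

The single outlier alone forces the raw spectral norm above its magnitude, while the clipped matrix is bounded by a quantity that depends only on `tau` and the shape. Whether real gradient noise behaves like this toy model is exactly the paper's empirical premise.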

If this is right

Yields O(ε^{-4}) convergence under Cauchy-contaminated noise.
Smooth shrinkage improves Adam on NanoGPT pretraining and saves approximately 7% of training tokens.
Applying entry-wise clipping before spectral normalization adds approximately 2% further token savings on top of Muon.
Balances the structure-cost trade-off between vector-norm clipping and full spectral normalization.
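The idea of applying entry-wise clipping before spectral normalization can be sketched as below. This is an illustration under stated assumptions, not the paper's or Muon's implementation: Muon approximates the orthogonal polar factor with a Newton-Schulz iteration, whereas this sketch uses an exact SVD for clarity, and hard clipping with a hypothetical threshold `tau` stands in for the paper's shrinkage rule. The helper names are invented for the example.

```python
import numpy as np

def orthogonalize(G):
    """Polar factor U @ Vt of G: for full-rank G, all singular values become 1."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def pre_clipped_update(G, tau):
    """Hypothetical helper: clip entries to [-tau, tau], then orthogonalize."""
    return orthogonalize(np.clip(G, -tau, tau))

rng = np.random.default_rng(0)
G = rng.standard_normal((32, 16))   # toy full-rank gradient matrix (assumption)
G[3, 4] += 1e4                      # one heavy-tailed outlier entry

update = pre_clipped_update(G, tau=3.0)
singular_values = np.linalg.svd(update, compute_uv=False)

print("largest |entry| fed to the normalizer:",
      np.abs(np.clip(G, -3.0, 3.0)).max())
print("singular values of the update lie in:",
      (singular_values.min(), singular_values.max()))
```

The design point is ordering: the outlier is bounded before the singular vectors are computed, so it cannot dominate the directions that the normalization step preserves.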

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same entry-wise shrinkage could be inserted into other first-order optimizers that already use per-coordinate scaling.
If the localization property holds outside language-model training, the method may reduce instability in vision or reinforcement-learning settings.
Hybrid entry-wise plus spectral pipelines might become a default stabilization pattern once the noise model is verified at larger scale.
The convergence rate suggests the approach could be especially useful in regimes where noise tails are heavier than Gaussian.

Load-bearing premise

Real gradient noise appears to be similar to entry-wise heavy-tailed contamination.

What would settle it

An experiment in which entry-wise clipping leaves the spectral norm of the gradient matrix uncontrolled, or in which observed gradient noise fails to exhibit the predicted localization under perturbation.

Figures

Figures reproduced from arXiv: 2605.27733 by Brian Bullins, Cedar Site Bai, David F. Gleich, Zhe Zhang, Zitao Song.

**Figure 1.** Entry-wise sparse heavy-tailed noise reproduces the spectral spikes observed in real stochastic gradients. The columns correspond to four noise models $E$, and the rows correspond to two diagnostics. Real (a, e): minibatch gradient noise from a GPT-2 layer (blocks.10.attn.qkv w.v). Subspace low-rank (b, f): $E = \lambda \sum_{r=1}^{K} u_r v_r^\top$ with $\lambda = 100$, $K = 16$, and $u_r$, $v_r$ drawn uniformly from the unit sphere. Pure heavy-ta…

**Figure 2.** Three entry-wise operators on $x \ge 0$ with $\tau = 1$. We set $\beta = 1$ for smooth shrinkage in Equation (12). Here we focus on the $R_p(\Delta) \gg 1$ regime, in which localization is induced by the entry-wise contamination model of Definition 3.2. Our main result identifies a closed-form surrogate for the Bayes-optimal entry-wise estimator in Proposition 3.5 that is asymptotically faithful in the large-$|\tilde{G}_{ij}|$ region, wher…

**Figure 3.** Random Gaussian feature regression ($d = 32$, $n = 128$) with a fraction $\alpha$ of feature entries corrupted by Student-t noise ($\nu = 1$, scale 3.0). The y-axis is the final-loss speedup over no clipping. Smooth shrinkage (red) in both post-clipping (a) and pre-clipping (b) tracks hard clipping (blue) for small $\alpha$ but pulls ahead as the noise is dominated by heavy-tailed entries. Executing the update rule Equation (17) wi…

**Figure 4.** Validation loss vs. tokens on Modded-NanoGPT under hard clipping (…

**Figure 5.** Comparison of smooth shrinkage and hard coordinate-wise clipping at the tangent…

**Figure 6.** Post Clipping for SGD under Gaussian random feature models (…

**Figure 7.** Post Clipping for SGD under Gaussian random feature models (…

**Figure 8.** Pre Clipping for spectral GD under Gaussian random feature models (…

**Figure 9.** Pre Clipping for spectral GD under Gaussian random feature models (…

**Figure 10.** Learning-rate sweeps for post- and pre-clipping methods in terms of the final…

**Figure 11.** Localization ratio $R_p(E_{\mathrm{real}})$ of real stochastic noise plotted across layers at three training stages. Each row corresponds to a different singular direction of the signal $G$, from the leading direction (top row) to the 8th (bottom). The thick dashed line at $y = 1$ marks the Gaussian null baseline, at which the noise is delocalized and matches the random-matrix prediction. As training progresses, $R_p(E)$ gener…

**Figure 12.** Correlation between the top singular value and the largest entry of the stochastic…

**Figure 13.** Hill tail-index estimator for stochastic noise across different layers at different…
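Figure 13 refers to the Hill tail-index estimator. For readers unfamiliar with it, a minimal sketch of the textbook estimator follows. The choice of `k` (the number of upper order statistics) and the synthetic Cauchy and Gaussian samples are assumptions made for illustration; the paper's own estimation protocol is not shown on this page.

```python
import numpy as np

def hill_tail_index(samples, k):
    """Textbook Hill estimator of the tail index from the k largest |samples|."""
    x = np.sort(np.abs(np.asarray(samples, dtype=float)))[::-1]  # descending
    if not 0 < k < x.size:
        raise ValueError("k must satisfy 0 < k < len(samples)")
    # Mean log-excess of the top-k values over the (k+1)-th largest value.
    log_excess = np.log(x[:k]) - np.log(x[k])
    return 1.0 / log_excess.mean()

rng = np.random.default_rng(0)
cauchy = rng.standard_cauchy(200_000)   # power-law tail with index 1
gauss = rng.standard_normal(200_000)    # no power-law tail

alpha_cauchy = hill_tail_index(cauchy, k=2_000)
alpha_gauss = hill_tail_index(gauss, k=2_000)
print(f"Cauchy estimate:   {alpha_cauchy:.2f}")
print(f"Gaussian estimate: {alpha_gauss:.2f}")
```

A small estimate (near 1 for Cauchy) indicates a heavy tail, while light-tailed data produce a much larger value. The estimate is sensitive to `k`, so in practice it is usually inspected over a range of `k` rather than at a single value.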

Original abstract

Training instabilities such as loss spikes are frequently the result of stochastic gradient noise. Because of rare expressions in language training data, and multiple layer composition, the noise impact is heavy-tailed and survives mini-batch averaging. Existing remedies trade off structure against cost: vector-norm clipping ignores the matrix structure of weight updates, while spectral normalization (e.g., Muon (Jordan et al., 2024)) respects it at additional cost. We show that this trade-off can be balanced. Real gradient noise appears to be similar to entry-wise heavy-tailed contamination, and a first-order perturbation analysis reveals a localization property of such noise, under which a simple entry-wise method achieves spectral control. Exploiting this, we derive a tractable surrogate for the Bayes-optimal entry-wise estimator under a Gaussian signal prior. We establish $O(\epsilon^{-4})$ convergence guarantee under Cauchy-contaminated noise. Empirically, we find that smooth shrinkage improves Adam on NanoGPT pretraining, saving ${\sim}7\%$ of training tokens. We further find that applying the entry-wise clipping before spectral normalization yields a ${\sim}2\%$ token saving on top of Muon.
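The contrast the abstract draws between vector-norm clipping and entry-wise operations can be illustrated with a toy example. This sketch is not from the paper: the contamination model (one corrupted entry), the norm budget, and the threshold `tau` are assumptions chosen to make the effect visible. It measures how much each method distorts the entries that were never contaminated.

```python
import numpy as np

def clip_by_global_norm(G, max_norm):
    """Rescale the whole matrix so its Frobenius norm is at most max_norm."""
    norm = np.linalg.norm(G)
    return G * min(1.0, max_norm / norm) if norm > 0 else G

def clip_entrywise(G, tau):
    """Clamp each entry to [-tau, tau] independently."""
    return np.clip(G, -tau, tau)

rng = np.random.default_rng(0)
signal = rng.standard_normal((32, 32))
noisy = signal.copy()
noisy[0, 0] += 1e3                      # single contaminated entry (toy assumption)

budget = np.linalg.norm(signal)         # norm budget matched to the clean signal
by_norm = clip_by_global_norm(noisy, budget)
by_entry = clip_entrywise(noisy, tau=4.0)

mask = np.ones_like(signal, dtype=bool)
mask[0, 0] = False                      # compare only the uncontaminated entries
err_norm = np.linalg.norm((by_norm - signal)[mask])
err_entry = np.linalg.norm((by_entry - signal)[mask])
print(f"error on clean entries, norm clipping:       {err_norm:.2f}")
print(f"error on clean entries, entry-wise clipping: {err_entry:.2f}")
```

Global norm clipping shrinks every entry by the same factor, so one outlier suppresses the whole signal. Entry-wise clipping touches only entries that exceed `tau`.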

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Entry-wise clipping can control the spectral norm under entry-wise Cauchy noise via a localization property, but the first-order perturbation step needs checking and the reported gains are small.

The paper's core move is to model gradient noise as entry-wise heavy-tailed contamination and use a first-order perturbation argument to show that simple per-entry shrinkage keeps the spectral norm of the update matrix bounded. From there the authors build a tractable surrogate for the Bayes-optimal entry-wise estimator under a Gaussian signal prior and prove an O(ε^{-4}) rate under Cauchy noise. The experiments add that this shrinkage improves Adam on NanoGPT (roughly 7% fewer tokens) and gives a further 2% or so when applied before Muon's spectral normalization.

What is actually new is the explicit link between entry-wise clipping and spectral control through the localization property, rather than treating clipping as a separate heuristic. The derivation of the surrogate and the convergence guarantee are cleanly stated and go beyond standard norm-clipping analyses.

The soft spot is the localization claim itself. It rests on the first-order perturbation analysis; if higher-order terms matter or if the gradient entries are correlated across the matrix (as they often are in real layers), the spectral control may not follow. The modeling assumption that real gradient noise behaves like independent entry-wise Cauchy contamination is also doing a lot of work and receives limited direct validation. The reported token savings are positive but modest, so they do not yet show a large practical shift.
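The concern about higher-order terms can be probed numerically. The sketch below uses the standard first-order expansion of a simple top singular value, σ₁(G + E) ≈ σ₁(G) + u₁ᵀ E v₁, and compares it with the exact value as a single-entry perturbation grows. The matrix size, the perturbed position, and the magnitudes are arbitrary assumptions. This is not the paper's analysis, only a way to see where a first-order approximation stops being accurate.

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((40, 30))
U, s, Vt = np.linalg.svd(G, full_matrices=False)
u1, v1, sigma1 = U[:, 0], Vt[0], s[0]   # leading singular triplet of G

def first_order_gap(c, i=2, j=5):
    """|exact - first-order| for sigma_1 when entry (i, j) is perturbed by c."""
    E = np.zeros_like(G)
    E[i, j] = c
    exact = np.linalg.norm(G + E, 2)
    approx = sigma1 + u1 @ E @ v1       # equals sigma1 + c * u1[i] * v1[j]
    return abs(exact - approx)

for c in (1e-3, 1e-1, 1e1, 1e3):
    print(f"c = {c:g}: gap = {first_order_gap(c):.3e}")
```

For small `c` the gap is negligible. Once the perturbed entry is large enough to take over the leading singular direction, the first-order prediction no longer tracks the exact value, which is the regime heavy-tailed entries are most likely to reach.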

This is for researchers working on cheap, structure-aware stabilizers for large-model training. A reader who cares about the gap between vector-norm clipping and full spectral methods will find the framing useful. The work is coherent enough on its own terms to deserve a serious referee, though the perturbation analysis and noise model will need close attention in review.

Referee Report

1 major / 2 minor

Summary. The paper claims that real gradient noise resembles entry-wise heavy-tailed (Cauchy) contamination on gradient matrices, and that a first-order perturbation analysis reveals a localization property allowing simple entry-wise clipping to achieve spectral-norm control. It derives a tractable surrogate for the Bayes-optimal entry-wise estimator under a Gaussian signal prior, proves an O(ε^{-4}) convergence guarantee under Cauchy noise, and reports empirical gains: smooth shrinkage added to Adam saves ~7% of training tokens on NanoGPT pretraining, and entry-wise clipping applied before spectral normalization saves a further ~2% on top of Muon.

Significance. If the localization property is rigorously established and the perturbation analysis controls the spectral norm for matrix updates, the work supplies a low-cost method that respects matrix structure without the overhead of full spectral normalization. The explicit O(ε^{-4}) guarantee and the reproducible token-saving numbers on NanoGPT are concrete strengths that would make the result practically relevant for stabilizing large-model training.

major comments (1)

[Perturbation analysis and convergence section] The localization property asserted in the first-order perturbation analysis (the load-bearing step for the spectral-control claim) only approximates the leading term. The manuscript does not bound the remainder or demonstrate that the property survives when gradient entries are jointly distributed or when the Gaussian signal prior is misspecified; without such control the O(ε^{-4}) guarantee does not transfer to the matrix-structured updates used in the Muon comparison.

minor comments (2)

[Abstract and experimental section] The abstract states that 'smooth shrinkage improves Adam' but does not name the precise shrinkage function or the value of any hyper-parameter; the main text should give the explicit formula used in the NanoGPT runs.
[Notation and definitions] Notation for the entry-wise estimator and the Cauchy contamination model should be introduced once and used consistently; several symbols appear to be redefined between the theoretical and empirical sections.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the single major comment below.

Referee: [Perturbation analysis and convergence section] The localization property asserted in the first-order perturbation analysis (the load-bearing step for the spectral-control claim) only approximates the leading term. The manuscript does not bound the remainder or demonstrate that the property survives when gradient entries are jointly distributed or when the Gaussian signal prior is misspecified; without such control the O(ε^{-4}) guarantee does not transfer to the matrix-structured updates used in the Muon comparison.

Authors: We agree that the localization property is obtained from a first-order perturbation analysis that identifies the leading term under entry-wise independent heavy-tailed contamination; the manuscript provides neither an explicit remainder bound nor an extension to jointly distributed entries or a misspecified Gaussian prior. The O(ε^{-4}) guarantee is stated under the model in which the localization property holds. The Muon comparison applies entry-wise clipping as a practical preprocessing step whose benefit is reported empirically. In revision we will add a clarifying paragraph in the theoretical section that states the first-order character of the analysis, lists the independence and prior assumptions, and notes that the guarantee does not automatically extend beyond those assumptions.

Revision: partial

Circularity Check

0 steps flagged

No circularity: derivations rely on explicit assumptions and perturbation analysis rather than self-definition or fitted inputs

The paper's central steps consist of (1) a first-order perturbation analysis to identify a localization property under entry-wise Cauchy contamination, (2) derivation of a tractable surrogate estimator assuming a Gaussian signal prior, and (3) an O(ε^{-4}) convergence proof under the stated noise model. None of these reduce by construction to a fitted parameter or to a self-citation whose content is the target result itself. The comparison to Muon is external. Empirical token savings are reported separately from the theory. This is a standard non-circular derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; the central assumption is stated directly but cannot be audited for independence or additional free parameters.

axioms (1)

Domain assumption: Real gradient noise appears to be similar to entry-wise heavy-tailed contamination.
Invoked in the abstract as the basis for applying entry-wise methods and the perturbation analysis.

pith-pipeline@v0.9.1-grok · 5742 in / 1131 out tokens · 36139 ms · 2026-06-29T18:23:38.161663+00:00 · methodology

Reference graph

Works this paper leans on

48 extracted references · 11 canonical work pages · 6 internal anchors

[1] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. 2024. URL: https://kellerjordan.github.io/posts/muon/
[2] Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. “A tail-index analysis of stochastic gradient noise in deep neural networks”. In: International Conference on Machine Learning. PMLR. 2019, pp. 5827–5837.
[3] Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. “The heavy-tail phenomenon in SGD”. In: International Conference on Machine Learning. PMLR. 2021, pp. 3964–3975.
[4] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. “The Llama 3 herd of models”. In: arXiv preprint arXiv:2407.21783 (2024).
[5] Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. “Kimi K2: Open agentic intelligence”. In: arXiv preprint arXiv:2507.20534 (2025).
[6] Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie E Everett, Alexander A Alemi, Ben Adlam, John D Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, and Simon Kornblith. “Small-scale proxies for large-scale Transformer training instabilities”. In: The Twelfth Internationa... 2024.
[7] Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, and Shiwei Liu. “SPAM: Spike-aware Adam with momentum reset for stable LLM training”. In: The Thirteenth International Conference on Learning Representations. 2025.
[8] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. “On the difficulty of training recurrent neural networks”. In: International Conference on Machine Learning. PMLR. 2013, pp. 1310–1318.
[9] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. “Why are adaptive methods good for attention models?” In: Advances in Neural Information Processing Systems 33 (2020), pp. 15383–15393.
[10] David Carlson, Volkan Cevher, and Lawrence Carin. “Stochastic spectral descent for restricted Boltzmann machines”. In: Artificial Intelligence and Statistics. PMLR. 2015, pp. 111–119.
[11] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. “Spectral normalization for generative adversarial networks”. In: International Conference on Learning Representations. 2018.
[12] Kaiyue Wen, David Leo Wright Hall, Tengyu Ma, and Percy Liang. “Fantastic Pretraining Optimizers and Where to Find Them”. In: The Fourteenth International Conference on Learning Representations. 2026.
[13] Jeremy Bernstein and Laker Newhouse. “Old optimizer, new norm: An anthology”. In: arXiv preprint arXiv:2409.20325 (2024).
[14] Xiaowen Jiang, Andrei Semenov, and Sebastian U Stich. “Enhancing LLM Training via Spectral Clipping”. In: arXiv preprint arXiv:2603.14315 (2026).
[15] Laker Newhouse, R Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, and Phillip Isola. “Training transformers with enforced Lipschitz constants”. In: arXiv preprint arXiv:2507.13338 (2025).
[16] Ashok Cutkosky and Harsh Mehta. “High-probability bounds for non-convex stochastic optimization with heavy tails”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 4883–4895.
[17] Ta Duy Nguyen, Thien Hang Nguyen, Alina Ene, and Huy Nguyen. “Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tailed Noise”. In: Thirty-seventh Conference on Neural Information Processing Systems. 2023.
[18] Chongjie Si, Debing Zhang, and Wei Shen. “AdaMuon: Adaptive Muon optimizer”. In: arXiv preprint arXiv:2507.11005 (2025).
[19] Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu, Xinghao Chen, and Yunhe Wang. “ROOT: Robust Orthogonalized Optimizer for Neural Network Training”. In: arXiv preprint arXiv:2511.20626 (2025).
[20] Sandrine Péché. “The largest eigenvalue of small rank perturbations of Hermitian random matrices”. In: Probability Theory and Related Fields 134.1 (2006), pp. 127–173.
[21] Peter J Huber. “Robust estimation of a location parameter”. In: Breakthroughs in Statistics: Methodology and Distribution. Springer, 1992, pp. 492–518.
[22] Steven T Piantadosi. “Zipf’s word frequency law in natural language: A critical review and future directions”. In: Psychonomic Bulletin & Review 21.5 (2014), pp. 1112–1130.
[23] George Kingsley Zipf. Human behavior and the principle of least effort: An introduction to human ecology. Ravenio Books, 2016.
[24] Jan R Magnus. “On differentiating eigenvalues and eigenvectors”. In: Econometric Theory 1.2 (1985), pp. 179–191.
[25] Tosio Kato. Perturbation theory for linear operators. Vol. 132. Springer, 1966.
[26] Liam Madden, Emiliano Dall’Anese, and Stephen Becker. “High probability convergence bounds for non-convex stochastic gradient descent with sub-Weibull noise”. In: Journal of Machine Learning Research 25.241 (2024), pp. 1–36.
[27] Zijian Liu and Zhengyuan Zhou. “Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping”. In: The Thirteenth International Conference on Learning Representations. 2025.
[28] Noah Marshall, Ke Liang Xiao, Atish Agarwala, and Elliot Paquette. “To clip or not to clip: the dynamics of SGD with gradient clipping in high-dimensions”. In: International Conference on Learning Representations. Vol. 2025. 2025, pp. 27381–27417.
[29] Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the NanoGPT baseline. 2024. URL: https://github.com/KellerJordan/modded-nanogpt
[30] Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, et al. “The FineWeb datasets: Decanting the web for the finest text data at scale”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 30811–30849.
[31] Thomas Pethick, Wanyun Xie, Mete Erdogan, Kimon Antonakopoulos, Tony Silveti-Falls, and Volkan Cevher. “Generalized Gradient Norm Clipping & Non-Euclidean $(L_0, L_1)$-Smoothness”. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025.
[32] Maria-Eleni Sfyraki and Jun-Kun Wang. “Lions and Muons: Optimization via stochastic Frank-Wolfe”. In: arXiv preprint arXiv:2506.04192 (2025).
[33] Aleksandar Armacki, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. “High-probability convergence bounds for nonlinear stochastic gradient descent under heavy-tailed noise”. In: Proceedings of The 28th International Conference on Artificial Intelligence and Statistics. 2025.
[34] Yurii E Nesterov. “Minimization methods for nonsmooth convex and quasiconvex functions”. In: Matekon 29.3 (1984), pp. 519–531.
[35] Yurii Nesterov et al. Lectures on convex optimization. Vol. 137. Springer, 2018.
[36] Zijian Liu, Jiawei Zhang, and Zhengyuan Zhou. “Breaking the lower bound with (little) structure: Acceleration in non-convex stochastic optimization with heavy-tailed noise”. In: The Thirty Sixth Annual Conference on Learning Theory. PMLR. 2023, pp. 2266–2290.
[37] Florian Hübler, Ilyas Fatkhullin, and Niao He. “From gradient clipping to normalization for heavy tailed SGD”. In: arXiv preprint arXiv:2410.13849 (2024).
[38] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. “signSGD: Compressed optimisation for non-convex problems”. In: International Conference on Machine Learning. PMLR. 2018, pp. 560–569.
[39] Xinyu Luo, Cedar Site Bai, Bolian Li, Petros Drineas, Ruqi Zhang, and Brian Bullins. “Stacey: Promoting Stochastic Steepest Descent via Accelerated ℓp-Smooth Nonconvex Optimization”. In: Forty-second International Conference on Machine Learning. 2025.
[40] Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. “On the convergence analysis of Muon”. In: arXiv preprint arXiv:2505.23737 (2025).
[41] Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, and David F Gleich. “Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization”. In: Forty-third International Conference on Machine Learning. 2026.
[42] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014).
[43] Nicholas G Polson and James G Scott. “Shrink globally, act locally: Sparse Bayesian regularization and prediction”. In: Bayesian Statistics 9.501-538 (2010), p. 105.
[44] Anthony O’Hagan. “On outlier rejection phenomena in Bayes inference”. In: Journal of the Royal Statistical Society Series B: Statistical Methodology 41.3 (1979), pp. 358–367.
[45] LR Pericchi and AFM Smith. “Exact and approximate posterior moments for a normal location parameter”. In: Journal of the Royal Statistical Society Series B: Statistical Methodology 54.3 (1992), pp. 793–804.
[46] Ilya Loshchilov and Frank Hutter. “Decoupled weight decay regularization”. In: International Conference on Learning Representations. 2019.
[47] Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. “MiniCPM: Unveiling the potential of small language models with scalable training strategies”. In: First Conference on Language Modeling. 2024.
[48] …1) satisfies them with $C_1 = C_2 = \nu + 1$ and $T_0 = \sqrt{\nu}$, and symmetric stable laws satisfy them analogously [43]. The intuition for the above lemma is that for observation $|y|$… We denote $(a, v) \in \mathbb{R} \times \mathbb{R}^N$ as the local perturbation at $(\lambda_*, w_*)$. The Jacobian of $F(\lambda, w)$ at this point can be written as $L(a, v) = \begin{pmatrix} -w_* & A - \lambda_* I \\ 0 & w_*^\top \end{pmatrix} \begin{pmatrix} a \\ v \end{pmatrix}$. We claim $L(a, v) = 0$ has only the trivial solution. If $L(a, v) = 0$, then $v \perp w_*$ and $a w_* = (A - \lambda_* I) v$. Taking the inner product of the second equation with $w_*$ and using symmetry, $a = w_*^\top (A - \lambda_* I) v = $ ...

[1] [1]

Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein.Muon: An optimizer for hidden layers in neural networks. 2024. url:https://kellerjordan.github.io/posts/muon/(cit. on pp. 1, 4, 17, 43)

2024

[2] [2]

A tail-index analysis of stochastic gradient noise in deep neural networks

Umut Simsekli, Levent Sagun, and Mert Gurbuzbalaban. “A tail-index analysis of stochastic gradient noise in deep neural networks”. In:International Conference on Machine Learning. PMLR. 2019, pp. 5827–5837 (cit. on pp. 1, 5, 17)

2019

[3] [3]

The heavy-tail phenomenon in SGD

Mert Gurbuzbalaban, Umut Simsekli, and Lingjiong Zhu. “The heavy-tail phenomenon in SGD”. In:International Conference on Machine Learning. PMLR. 2021, pp. 3964– 3975 (cit. on p. 1)

2021

[4] [4]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. “The llama 3 herd of models”. In:arXiv preprint arXiv:2407.21783(2024) (cit. on p. 1)

work page internal anchor Pith review Pith/arXiv arXiv 2024

[5] [5]

Kimi K2: Open Agentic Intelligence

Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. “Kimi k2: Open agentic intelligence”. In:arXiv preprint arXiv:2507.20534(2025) (cit. on pp. 1, 43)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] Mitchell Wortsman, Peter J Liu, Lechao Xiao, Katie E Everett, Alexander A Alemi, Ben Adlam, John D Co-Reyes, Izzeddin Gur, Abhishek Kumar, Roman Novak, Jeffrey Pennington, Jascha Sohl-Dickstein, Kelvin Xu, Jaehoon Lee, Justin Gilmer, and Simon Kornblith. “Small-scale proxies for large-scale Transformer training instabilities”. In: The Twelfth Internationa... 2024

[7] Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, and Shiwei Liu. “SPAM: Spike-aware Adam with momentum reset for stable LLM training”. In: The Thirteenth International Conference on Learning Representations. 2025 (cit. on p. 1)

[8] Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio. “On the difficulty of training recurrent neural networks”. In: International Conference on Machine Learning. PMLR. 2013, pp. 1310–1318 (cit. on pp. 1, 4, 17)

[9] Jingzhao Zhang, Sai Praneeth Karimireddy, Andreas Veit, Seungyeon Kim, Sashank Reddi, Sanjiv Kumar, and Suvrit Sra. “Why are adaptive methods good for attention models?” In: Advances in Neural Information Processing Systems 33 (2020), pp. 15383–15393 (cit. on pp. 1, 4, 5, 9, 10, 17)

[10] David Carlson, Volkan Cevher, and Lawrence Carin. “Stochastic spectral descent for restricted Boltzmann machines”. In: Artificial Intelligence and Statistics. PMLR. 2015, pp. 111–119 (cit. on pp. 1, 17)

[11] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. “Spectral normalization for generative adversarial networks”. In: International Conference on Learning Representations. 2018 (cit. on p. 1)

[12] Kaiyue Wen, David Leo Wright Hall, Tengyu Ma, and Percy Liang. “Fantastic Pre-training Optimizers and Where to Find Them”. In: The Fourteenth International Conference on Learning Representations. 2026 (cit. on p. 1)

[13] Jeremy Bernstein and Laker Newhouse. “Old optimizer, new norm: An anthology”. In: arXiv preprint arXiv:2409.20325 (2024) (cit. on pp. 1, 17)

[14] Xiaowen Jiang, Andrei Semenov, and Sebastian U Stich. “Enhancing LLM Training via Spectral Clipping”. In: arXiv preprint arXiv:2603.14315 (2026) (cit. on pp. 1, 4, 5, 17, 43)

[15] Laker Newhouse, R Preston Hess, Franz Cesista, Andrii Zahorodnii, Jeremy Bernstein, and Phillip Isola. “Training transformers with enforced Lipschitz constants”. In: arXiv preprint arXiv:2507.13338 (2025) (cit. on pp. 4, 17)

[16] Ashok Cutkosky and Harsh Mehta. “High-probability bounds for non-convex stochastic optimization with heavy tails”. In: Advances in Neural Information Processing Systems 34 (2021), pp. 4883–4895 (cit. on pp. 4, 17)

[17] Ta Duy Nguyen, Thien Hang Nguyen, Alina Ene, and Huy Nguyen. “Improved Convergence in High Probability of Clipped Gradient Methods with Heavy Tailed Noise”. In: Thirty-seventh Conference on Neural Information Processing Systems. 2023 (cit. on pp. 4, 10, 17)

[18] Chongjie Si, Debing Zhang, and Wei Shen. “AdaMuon: Adaptive Muon optimizer”. In: arXiv preprint arXiv:2507.11005 (2025) (cit. on p. 4)

[19] Wei He, Kai Han, Hang Zhou, Hanting Chen, Zhicheng Liu, Xinghao Chen, and Yunhe Wang. “ROOT: Robust Orthogonalized Optimizer for Neural Network Training”. In: arXiv preprint arXiv:2511.20626 (2025) (cit. on p. 4)

[20] Sandrine Péché. “The largest eigenvalue of small rank perturbations of Hermitian random matrices”. In: Probability Theory and Related Fields 134.1 (2006), pp. 127–173 (cit. on p. 5)

[21] Peter J Huber. “Robust estimation of a location parameter”. In: Breakthroughs in statistics: Methodology and distribution. Springer, 1992, pp. 492–518 (cit. on p. 5)

[22] Steven T Piantadosi. “Zipf’s word frequency law in natural language: A critical review and future directions”. In: Psychonomic Bulletin & Review 21.5 (2014), pp. 1112–1130 (cit. on p. 5)

[23] George Kingsley Zipf. Human behavior and the principle of least effort: An introduction to human ecology. Ravenio books, 2016 (cit. on p. 5)

[24] Jan R Magnus. “On differentiating eigenvalues and eigenvectors”. In: Econometric Theory 1.2 (1985), pp. 179–191 (cit. on pp. 5, 20)

[25] Tosio Kato. Perturbation theory for linear operators. Vol. 132. Springer, 1966 (cit. on pp. 5, 20)

[26] Liam Madden, Emiliano Dall’Anese, and Stephen Becker. “High probability convergence bounds for non-convex stochastic gradient descent with sub-Weibull noise”. In: Journal of Machine Learning Research 25.241 (2024), pp. 1–36 (cit. on p. 10)

[27] Zijian Liu and Zhengyuan Zhou. “Nonconvex stochastic optimization under heavy-tailed noises: Optimal convergence without gradient clipping”. In: The Thirteenth International Conference on Learning Representations. 2025 (cit. on pp. 10, 17)

[28] Noah Marshall, Ke Liang Xiao, Atish Agarwala, and Elliot Paquette. “To clip or not to clip: the dynamics of SGD with gradient clipping in high-dimensions”. In: International Conference on Learning Representations. Vol. 2025. 2025, pp. 27381–27417 (cit. on p. 11)

[29] Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the NanoGPT baseline. 2024. url: https://github.com/KellerJordan/modded-nanogpt (cit. on pp. 11, 43)

[30] Guilherme Penedo, Hynek Kydlíček, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, Thomas Wolf, et al. “The FineWeb datasets: Decanting the web for the finest text data at scale”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 30811–30849 (cit. on pp. 11, 43)

[31] Thomas Pethick, Wanyun Xie, Mete Erdogan, Kimon Antonakopoulos, Tony Silveti-Falls, and Volkan Cevher. “Generalized Gradient Norm Clipping & Non-Euclidean $(L_0, L_1)$-Smoothness”. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems. 2025 (cit. on p. 17)

[32] Maria-Eleni Sfyraki and Jun-Kun Wang. “Lions and Muons: Optimization via stochastic Frank-Wolfe”. In: arXiv preprint arXiv:2506.04192 (2025) (cit. on p. 17)

[33] Aleksandar Armacki, Pranay Sharma, Gauri Joshi, Dragana Bajovic, Dusan Jakovetic, and Soummya Kar. “High-probability convergence bounds for nonlinear stochastic gradient descent under heavy-tailed noise”. In: Proceedings of The 28th International Conference on Artificial Intelligence and Statistics. 2025 (cit. on p. 17)

[34] Yurii E Nesterov. “Minimization methods for nonsmooth convex and quasiconvex functions”. In: Matekon 29.3 (1984), pp. 519–531 (cit. on p. 17)

[35] Yurii Nesterov et al. Lectures on convex optimization. Vol. 137. Springer, 2018 (cit. on p. 17)

[36] Zijian Liu, Jiawei Zhang, and Zhengyuan Zhou. “Breaking the lower bound with (little) structure: Acceleration in non-convex stochastic optimization with heavy-tailed noise”. In: The Thirty Sixth Annual Conference on Learning Theory. PMLR. 2023, pp. 2266–2290 (cit. on p. 17)

[37] Florian Hübler, Ilyas Fatkhullin, and Niao He. “From gradient clipping to normalization for heavy tailed SGD”. In: arXiv preprint arXiv:2410.13849 (2024) (cit. on p. 17)

[38] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. “signSGD: Compressed optimisation for non-convex problems”. In: International Conference on Machine Learning. PMLR. 2018, pp. 560–569 (cit. on p. 17)

[39] Xinyu Luo, Cedar Site Bai, Bolian Li, Petros Drineas, Ruqi Zhang, and Brian Bullins. “Stacey: Promoting Stochastic Steepest Descent via Accelerated ℓp-Smooth Nonconvex Optimization”. In: Forty-second International Conference on Machine Learning. 2025 (cit. on p. 17)

[40] Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. “On the convergence analysis of Muon”. In: arXiv preprint arXiv:2505.23737 (2025) (cit. on p. 17)

[41] Zitao Song, Cedar Site Bai, Zhe Zhang, Brian Bullins, and David F Gleich. “Decoupling Variance and Scale-Invariant Updates in Adaptive Gradient Descent for Unified Vector and Matrix Optimization”. In: Forty-third International Conference on Machine Learning. 2026 (cit. on p. 17)

[42] Diederik P Kingma and Jimmy Ba. “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980 (2014) (cit. on p. 17)

[43] Nicholas G Polson and James G Scott. “Shrink globally, act locally: Sparse Bayesian regularization and prediction”. In: Bayesian Statistics 9.501-538 (2010), p. 105 (cit. on p. 24)

[44] Anthony O’Hagan. “On outlier rejection phenomena in Bayes inference”. In: Journal of the Royal Statistical Society Series B: Statistical Methodology 41.3 (1979), pp. 358–367 (cit. on p. 24)

[45] LR Pericchi and AFM Smith. “Exact and approximate posterior moments for a normal location parameter”. In: Journal of the Royal Statistical Society Series B: Statistical Methodology 54.3 (1992), pp. 793–804 (cit. on p. 24)

[46] Ilya Loshchilov and Frank Hutter. “Decoupled weight decay regularization”. In: International Conference on Learning Representations. 2019 (cit. on p. 43)

[47] Shengding Hu, Yuge Tu, Xu Han, Chaoqun He, Ganqu Cui, Xiang Long, Zhi Zheng, Yewei Fang, Yuxiang Huang, Weilin Zhao, et al. “MiniCPM: Unveiling the potential of small language models with scalable training strategies”. In: First Conference on Language Modeling. 2024 (cit. on p. 43)

Appendix Contents

1 Introduction

2 Preliminaries

2.1 Clipping Metho...

1) satisfies them with $C_1 = C_2 = \nu + 1$ and $T_0 = \sqrt{\nu}$, and symmetric stable laws satisfy them analogously [43]. The intuition for the above lemma is that for observation $|y|$
