Scale-Invariant Neural Network Optimization: Norm Geometry and Heavy-Tailed Noise

Jiayu Zhang; Tianyi Lin

arxiv: 2605.18528 · v1 · pith:FPEVKMNQnew · submitted 2026-05-18 · 🧮 math.OC · cs.LG

Pith reviewed 2026-05-20 08:45 UTC · model grok-4.3

keywords: scale-invariant optimization · heavy-tailed noise · nonconvex stochastic optimization · spectral norm · oracle complexity · Scion method · neural network training · Hessian Lipschitz

The pith

Any scale-invariant first-order method using the spectral norm requires Ω(min{m,n} ε^{-(3p-2)/(p-1)}) oracle calls to reach an ε-stationary point under p-moment heavy-tailed noise when max{m,n}/(min{m,n})^2 is large enough.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies stochastic optimization of matrix-valued problems that arise in scale-invariant neural network layers, where the goal is to reach an approximate stationary point despite noise that obeys only a p-th moment bound rather than sub-Gaussian tails. It establishes a dimension-dependent lower bound: when the matrix is sufficiently unbalanced, any scale-invariant first-order algorithm using the spectral norm must pay a cost linear in the smaller matrix dimension and polynomial in 1/ε. The authors then show that a batched Scion method matches this lower bound, and they propose a transported Scion variant that improves the exponent when the Hessian is Lipschitz continuous.

Core claim

In nonconvex smooth stochastic optimization over R^{m×n} equipped with general norms, when max{m,n}/(min{m,n})^2 is large enough, every scale-invariant first-order method that uses the spectral norm must perform Ω(min{m,n} ε^{-(3p-2)/(p-1)}) calls to a stochastic oracle to produce an ε-stationary point under p-th-moment heavy-tailed noise. A batched Scion method attains the matching O(min{m,n} ε^{-(3p-2)/(p-1)}) upper bound; under the additional assumption that the Hessian is Lipschitz, a transported Scion method further reduces the complexity to O(min{m,n} ε^{-(5p-3)/(2p-2)}).
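
To make the two exponents concrete, the sketch below evaluates them for a given tail index p. The function names are illustrative and do not come from the paper; exact rational arithmetic is used only to avoid rounding.

```python
from fractions import Fraction


def first_order_exponent(p):
    """Exponent a in eps^{-a} for the matching bounds: (3p - 2) / (p - 1)."""
    p = Fraction(p)
    if p <= 1:
        raise ValueError("the noise model requires p > 1")
    return (3 * p - 2) / (p - 1)


def hessian_lipschitz_exponent(p):
    """Exponent a in eps^{-a} for the transported bound: (5p - 3) / (2p - 2)."""
    p = Fraction(p)
    if p <= 1:
        raise ValueError("the noise model requires p > 1")
    return (5 * p - 3) / (2 * p - 2)


if __name__ == "__main__":
    for p in (Fraction(3, 2), Fraction(2)):
        print(p, first_order_exponent(p), hessian_lipschitz_exponent(p))
```

At p = 2 (bounded variance) the exponents evaluate to 4 and 7/2, the familiar ε^{-4} and ε^{-3.5} rates from the bounded-variance literature; both exponents grow without bound as p approaches 1.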

What carries the argument

The Scion method, a normalized update rule that respects input-output matrix norm geometry while using batching or transport to control variance from heavy-tailed gradients.
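
A minimal sketch of the kind of normalized update described above, assuming spectral-norm geometry: average a mini-batch of stochastic gradients, then step along the spectral-norm linear minimization oracle (LMO) direction -UV^T obtained from a thin SVD. The function names, the plain averaging, and the fixed step size are illustrative assumptions. This is not the paper's batched or transported Scion algorithm, whose details this page does not spell out.

```python
import numpy as np


def spectral_lmo(grad):
    """Return -U V^T, which minimizes <grad, D> over the spectral-norm unit ball."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return -(u @ vt)


def batched_normalized_step(x, grad_samples, step_size):
    """One illustrative update: average the batch, then move along the LMO direction."""
    mean_grad = np.mean(grad_samples, axis=0)  # batching reduces the noise in the direction
    return x + step_size * spectral_lmo(mean_grad)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = np.zeros((4, 3))
    batch = rng.standard_normal((8, 4, 3))  # hypothetical stochastic gradients
    print(batched_normalized_step(x, batch, step_size=0.1))
```

Because the direction depends only on the singular vectors of the averaged gradient, rescaling the gradient leaves the step unchanged, which is one sense in which such normalized updates are scale-invariant.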

If this is right

The lower and upper bounds are tight, so the exponent (3p-2)/(p-1) is optimal for first-order scale-invariant methods under heavy tails.
Higher-order smoothness via Hessian Lipschitzness yields a strictly better exponent through the transported Scion construction.
The results apply to any matrix problem whose aspect ratio satisfies the stated dimension condition.
Practical heuristics can be layered on the transported Scion method while preserving its theoretical rate.
The dimension factor min{m,n} is unavoidable and grows with the smaller matrix side.
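
The size of the improvement claimed for the transported variant can be read off from the two stated exponents. The line of algebra below is our own arithmetic on those exponents, not a formula quoted from the paper:

```latex
\[
\frac{3p-2}{p-1} - \frac{5p-3}{2p-2}
  = \frac{2(3p-2) - (5p-3)}{2(p-1)}
  = \frac{p-1}{2(p-1)}
  = \frac{1}{2},
  \qquad p > 1 .
\]
```

So, under the stated bounds, Hessian Lipschitzness saves a factor of ε^{-1/2} in oracle calls regardless of the tail index p.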

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Optimizers that ignore the tail index p will pay a worse rate than necessary when real gradients exhibit heavy tails.
The transported variant may be worth testing on models whose weight matrices have extreme aspect ratios, such as wide embedding layers.
Whether these complexity improvements translate into faster wall-clock training or better generalization remains to be checked empirically.
Similar norm-geometry arguments could be applied to other structured parameter spaces common in modern architectures.

Load-bearing premise

The stochastic gradient noise satisfies a p-th moment bound for some p greater than 1.
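
As an illustration of what a p-th moment bound allows, consider a Pareto-distributed noise magnitude with unit scale and tail index alpha. This distribution is our own example, since the page states the noise model only as a p-th moment bound. Its p-th moment is alpha / (alpha - p) when p < alpha and infinite otherwise, so the variance can be infinite while a lower moment stays bounded.

```python
import math


def pareto_moment(alpha, p):
    """E[X^p] for X ~ Pareto(scale=1, shape=alpha); infinite when p >= alpha."""
    if alpha <= 0 or p <= 0:
        raise ValueError("alpha and p must be positive")
    if p >= alpha:
        return math.inf
    return alpha / (alpha - p)


if __name__ == "__main__":
    alpha = 1.8  # hypothetical tail index
    print("p = 1.5:", pareto_moment(alpha, 1.5))  # finite
    print("p = 2.0:", pareto_moment(alpha, 2.0))  # infinite second moment
```

With this hypothetical tail index, a bounded-variance analysis (p = 2) does not apply, while the heavy-tailed model with p = 1.5 still does.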

What would settle it

An explicit scale-invariant first-order algorithm with spectral norm that reaches an ε-stationary point in o(min{m,n} ε^{-(3p-2)/(p-1)}) oracle calls for sufficiently unbalanced dimensions under the same p-moment noise model would falsify the lower bound.

Original abstract

A growing lesson from neural network optimization is that optimizer design should respect how the model is parametrized. Scale-invariant methods become important because their normalized layerwise updates can not only support hyperparameter transfer across model sizes but exploit input-output matrix norm geometry. At the same time, stochastic gradient noises in deep learning are often far from sub-Gaussian and may exhibit heavy tails. These crucial observations have shaped recent algorithmic principles for training neural networks, yet their joint theoretical consequences remain underexplored. In particular, it is unclear what dimension dependence is unavoidable for scale-invariant methods with general input-output matrix norm, and whether higher-order smoothness can accelerate training under heavy-tailed noise. We study these questions through nonconvex smooth stochastic optimization over $\mathbb{R}^{m\times n}$ with general norms, where the goal is to achieve an $\epsilon$-stationary point under $p^{\mathrm{th}}$-moment heavy-tailed noise. Our first contribution is a dimension-dependent lower bound: when $\frac{\max\{m,n\}}{(\min\{m,n\})^2}$ is large enough, any scale-invariant first-order method with spectral norm requires $\Omega(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$ oracle calls. We prove that a batched Scion method with spectral norm achieves the matching upper bound of $O(\min\{m, n\}\epsilon^{-\frac{3p-2}{p-1}})$. To exploit higher-order smoothness, we propose a transported Scion method and improve the bound to $O(\min\{m, n\}\epsilon^{-\frac{5p-3}{2p-2}})$ when the norm is spectral and the Hessian is Lipschitz. Finally, we incorporate practical heuristics into our transported method and evaluate it across multiple architectures and model sizes, demonstrating its flexibility and compatibility in training neural networks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper derives a dimension-dependent lower bound for scale-invariant first-order methods under p-moment heavy-tailed noise and matches it with batched and transported Scion algorithms.

The main thing to know is that this work pins down unavoidable dimension dependence for scale-invariant optimizers when the noise has only finite p-th moments, and it supplies algorithms that hit the bound in the spectral-norm case. The lower bound activates once max{m,n}/(min{m,n})^2 gets large, forcing Ω(min{m,n} ε^{-(3p-2)/(p-1)}) oracle calls for any such method. The batched Scion matches that rate, and the transported version tightens it under Hessian Lipschitzness to O(min{m,n} ε^{-(5p-3)/(2p-2)}). That combination of lower and upper bounds on the same setting is the concrete advance. It directly engages the practical facts that neural-net layers are often scale-invariant and that real gradients show heavier tails than sub-Gaussian models assume. The analysis stays within standard stochastic nonconvex assumptions without obvious circularity or invented parameters. The empirical part folds in heuristics and checks a few architectures, which at least shows the transported method is implementable.

The soft spots are modest but real. The tight results are stated for the spectral norm even though the setup claims generality; extending the lower bound to other norms would strengthen the claim. The p-moment assumption is explicit and necessary for the exponents, yet real training noise can be more variable than a fixed p. The practical experiments are described at a high level, so the size of any gain over existing scale-invariant baselines is not yet clear from the abstract alone.

Readers working on theoretical guarantees for deep-learning optimizers will get the most out of it, especially those who care about complexity under realistic noise. The paper is coherent on its own terms and supplies verifiable claims, so it is worth a serious referee even if revisions are needed on the norm generality and the empirical detail.

Referee Report

2 major / 2 minor

Summary. The paper studies nonconvex stochastic optimization over matrices in R^{m×n} equipped with general norms, focusing on scale-invariant first-order methods under p-th-moment heavy-tailed noise. It derives a dimension-dependent lower bound of Ω(min{m,n} ε^{-(3p-2)/(p-1)}) on the oracle complexity of any scale-invariant method with the spectral norm when max{m,n}/(min{m,n})^2 is sufficiently large. A batched Scion method is shown to achieve the matching O(min{m,n} ε^{-(3p-2)/(p-1)}) upper bound, while a transported Scion variant improves the rate to O(min{m,n} ε^{-(5p-3)/(2p-2)}) under the additional assumption of Hessian Lipschitzness. The work concludes with practical heuristics and experiments demonstrating applicability to neural network training across architectures and scales.

Significance. If the matching lower and upper bounds hold under the stated assumptions, the results clarify unavoidable dimension dependence and complexity for scale-invariant methods in the presence of heavy-tailed noise, which is a realistic model for deep learning gradients. The improvement via the transported method under higher-order smoothness, combined with empirical validation, offers concrete guidance for optimizer design that respects parametrization and norm geometry. The explicit p-moment noise model and dimension condition make the claims falsifiable and relevant to the field.

major comments (2)

[Abstract and lower-bound section] The lower bound in the abstract (and presumably §4) is stated for spectral norm, yet the problem setting is introduced with general input-output matrix norms; the manuscript should clarify whether the Ω(min{m,n} ε^{-(3p-2)/(p-1)}) rate extends to other norms or if spectral norm is necessary for the hardness construction, as this affects the generality of the central claim.
[Transported Scion analysis and experiments] The transported Scion improvement to O(min{m,n} ε^{-(5p-3)/(2p-2)}) relies on Hessian Lipschitzness (abstract); the paper must specify how this assumption is verified or relaxed in the neural-network experiments, since violation could invalidate the faster rate and undermine the practical significance of the higher-order variant.

minor comments (2)

[Abstract and setting] Notation for the matrix dimensions m,n and the ratio max{m,n}/(min{m,n})^2 should be introduced with a precise threshold value for 'large enough' to make the lower-bound statement self-contained.
[Preliminaries] The definition of scale-invariance for the methods (used in both lower and upper bounds) would benefit from an explicit equation or property list early in the manuscript to avoid ambiguity when comparing to prior work.
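
To get a feel for the dimension condition raised in the first minor comment, the sketch below computes max{m,n}/(min{m,n})^2 for a weight matrix of a given shape. The function name and the example shapes are hypothetical, and no threshold is hard-coded because the abstract does not say how large the ratio must be.

```python
def aspect_condition_ratio(m, n):
    """Return max{m, n} / (min{m, n})^2 for an m-by-n weight matrix."""
    if m <= 0 or n <= 0:
        raise ValueError("matrix dimensions must be positive")
    small, large = min(m, n), max(m, n)
    return large / small**2


if __name__ == "__main__":
    # Hypothetical layer shapes: a square block and two very unbalanced ones.
    for shape in [(4096, 4096), (50_000, 64), (1_000_000, 16)]:
        print(shape, aspect_condition_ratio(*shape))
```

The ratio is tiny for square matrices and only becomes large when one side is far larger than the square of the other, so the lower bound concerns strongly unbalanced layers.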

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive feedback. The comments highlight important points regarding the scope of our theoretical results and the connection to experiments. We address each major comment below and have incorporated revisions to improve clarity.

Referee: [Abstract and lower-bound section] The lower bound in the abstract (and presumably §4) is stated for spectral norm, yet the problem setting is introduced with general input-output matrix norms; the manuscript should clarify whether the Ω(min{m,n} ε^{-(3p-2)/(p-1)}) rate extends to other norms or if spectral norm is necessary for the hardness construction, as this affects the generality of the central claim.

Authors: We agree that additional clarification is warranted. The lower bound construction in Section 4 relies on specific properties of the spectral norm (in particular, its behavior under scale-invariant updates and the choice of hard instances that exploit the operator norm geometry). The result does not directly extend to arbitrary input-output norms, for which the dimension dependence may be milder or require a different hardness argument. Our matching upper bound for the batched Scion method holds for general norms, while the lower bound is stated specifically for the spectral norm. We will revise the abstract and add a short paragraph at the end of Section 4 to make this distinction explicit, thereby strengthening the precision of the central claim without altering its substance.
revision: yes

Referee: [Transported Scion analysis and experiments] The transported Scion improvement to O(min{m,n} ε^{-(5p-3)/(2p-2)}) relies on Hessian Lipschitzness (abstract); the paper must specify how this assumption is verified or relaxed in the neural-network experiments, since violation could invalidate the faster rate and undermine the practical significance of the higher-order variant.

Authors: The faster rate for the transported Scion method is derived under the additional assumption of Hessian Lipschitz continuity, which is stated clearly in the abstract and analysis. In the neural-network experiments we apply practical heuristics inspired by the transported update (e.g., approximate transport maps and adaptive batching) rather than enforcing the Hessian-Lipschitz condition, which is generally unverifiable at scale. We will expand the experimental section to explicitly note that the O(min{m,n} ε^{-(5p-3)/(2p-2)}) guarantee is theoretical, while the heuristics are motivated by the analysis and are evaluated empirically for their practical benefits even when the higher-order assumption may hold only approximately. This revision clarifies the theory-practice gap without changing the reported results.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper derives dimension-dependent lower and upper bounds on oracle complexity for scale-invariant first-order methods under p-th-moment heavy-tailed noise directly from the problem setting (spectral norm, matrix dimensions m and n, and the explicit noise moment assumption). The matching upper bound for the batched Scion method and the improved bound for the transported Scion method follow from standard nonconvex stochastic optimization analysis, without reducing to fitted parameters, self-definitional constructions, or load-bearing self-citations. The Hessian Lipschitz condition for the transported variant is an additional independent assumption that does not loop back to the core claims. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The results rest on standard domain assumptions about noise moments and smoothness; new algorithmic entities are introduced without independent evidence beyond the theoretical analysis.

axioms (2)

domain assumption: Stochastic gradient noise has a bounded p-th moment for some p > 1.
Invoked to model heavy-tailed noise and derive the specific complexity exponents.
domain assumption: The objective is nonconvex and sufficiently smooth (Lipschitz gradient, plus Lipschitz Hessian for the transported variant).
Standard assumption for nonconvex stochastic optimization analysis in the abstract setting.

invented entities (2)

Batched Scion method (no independent evidence)
purpose: Achieves the matching upper bound for scale-invariant first-order optimization with spectral norm.
New algorithm proposed to match the lower bound.
Transported Scion method (no independent evidence)
purpose: Exploits higher-order smoothness to improve the convergence rate under Hessian Lipschitzness.
Improved variant for the case with Lipschitz Hessian.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: scale-invariant first-order method with spectral norm requires Ω(min{m,n}ε^{-(3p-2)/(p-1)}) oracle calls

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: transported Scion method ... O(min{m,n}ε^{-(5p-3)/(2p-2)}) under Hessian Lipschitzness

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.



H. From gradient clipping to normalization for heavy tailed. AISTATS , Pages =. 2025 , Organization =

work page 2025

[36] [36]

and Liu, X

Sun, T. and Liu, X. and Yuan, K. , Journal =. Revisiting gradient normalization and clipping for nonconvex. 2025 , Publisher =

work page 2025

[37] [37]

ICLR , Year =

Nonconvex stochastic optimization under heavy-tailed Noises: Optimal convergence without gradient clipping , Author =. ICLR , Year =

work page

[38] [38]

and Yaroslav, K

Chezhegov, S. and Yaroslav, K. and Semenov, A. and Beznosikov, A. and Gasnikov, A. and Horv. Clipping improves. ICML , Pages =. 2025 , Organization =

work page 2025

[39] [39]

Sign-Based Optimizers Are Effective Under Heavy-Tailed Noise

Sign-based optimizers are effective under heavy-tailed noise , Author =. ArXiv Preprint: 2602.07425 , Year =

work page internal anchor Pith review Pith/arXiv arXiv

[40] [40]

and AlRashed, S

Shulgin, E. and AlRashed, S. and Richt. Beyond the ideal: Analyzing the inexact. AISTATS , Year =

work page

[41] [41]

Kim, G. Y. and Oh, M-h. , Booktitle =. Convergence of. 2026 , Url =

work page 2026

[42] [42]

and Wang, J-K

Sfyraki, M-E. and Wang, J-K. , Journal =. Lions and

work page

[43] [43]

Mathematical Programming , Volume =

Lower bounds for non-convex stochastic optimization , Author =. Mathematical Programming , Volume =. 2023 , Publisher =

work page 2023

[44] [44]

and Mehta, H

Cutkosky, A. and Mehta, H. , Booktitle =. Momentum improves normalized. 2020 , Organization =

work page 2020

[45] [45]

and Grosse, R

Martens, J. and Grosse, R. , Booktitle =. Optimizing neural networks with. 2015 , Organization =

work page 2015

[46] [46]

and Martens, J

Grosse, R. and Martens, J. , Booktitle =. A. 2016 , Organization =

work page 2016

[47] [47]

ICML , Pages =

Shampoo: Preconditioned stochastic tensor optimization , Author =. ICML , Pages =. 2018 , Organization =

work page 2018

[48] [48]

and Ren, Y

Goldfarb, D. and Ren, Y. and Bahamou, A. , Booktitle =. Practical quasi-

work page

[49] [49]

NeurIPS , Pages =

Tensor normal training for deep learning models , Author =. NeurIPS , Pages =

work page

[50] [50]

Duvvuri, S. S. and Devvrit, F. and Anil, R. and Hsieh, C-J. and Dhillon, I. S. , Booktitle =. Combining axes preconditioners through. 2024 , Url =

work page 2024

[51] [51]

and Zhang, Z

Zhao, J. and Zhang, Z. and Chen, B. and Wang, Z. and Anandkumar, A. and Tian, Y. , Booktitle =. GaLore: Memory-efficient. 2024 , Organization =

work page 2024

[52] [52]

and Shapira, I

Morwani, D. and Shapira, I. and Vyas, N. and Malach, E. and Kakade, S. M. and Janson, L. , Booktitle =. A new perspective on. 2025 , Url =

work page 2025

[53] [53]

and Morwani, D

Vyas, N. and Morwani, D. and Zhao, R. and Shapira, I. and Brandfonbrener, D. and Janson, L. and Kakade, S. M. , Booktitle =. 2025 , Url =

work page 2025

[54] [54]

and Liu, Y

Yuan, H. and Liu, Y. and Wu, S. and Xun, Z. and Gu, Q. , Booktitle =. 2025 , Organization =

work page 2025

[55] [55]

and Liu, Y

An, K. and Liu, Y. and Pan, R. and Ren, Y. and Ma, S. and Goldfarb, D. and Zhang, T. , Booktitle =. 2025 , Url =

work page 2025

[56] [56]

and Liu, L

Li, Z. and Liu, L. and Liang, C. and Chen, W. and Zhao, T. , Journal =. Nor

work page

[57] [57]

and Shulgin, E

Riabinin, A. and Shulgin, E. and Gruntkowska, K. and Richt. Gluon: Making. ICML Workshop on High-dimensional Learning Dynamics , Year =

work page

[58] [58]

Dion: Distributed Orthonormalized Updates

Dion: Distributed orthonormalized updates , Author =. ArXiv Preprint: 2504.05295 , Year =

work page arXiv

[59] [59]

and Amsel, N

Ahn, K. and Amsel, N. and Langford, J. , Journal =. Dion2: A simple method to shrink matrix in

work page

[60] [60]

ArXiv Preprint: 2505.21799 , Year =

Polargrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective , Author =. ArXiv Preprint: 2505.21799 , Year =

work page arXiv

[61] [61]

Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training

Low-rank orthogonalization for large-scale matrix optimization with applications to foundation model training , Author =. ArXiv Preprint: 2509.11983 , Year =

work page internal anchor Pith review Pith/arXiv arXiv

[62] [62]

and Luo, Y

Huang, F. and Luo, Y. and Chen, S. , Journal =. Limuon: Light and fast

work page

[63] [63]

and Joshi, A

Page, S. and Joshi, A. and Sonawane, S. S. , Journal =. Muon

work page

[64] [64]

and Yan, W

Xu, C. and Yan, W. and Zhang, Y-J. A. , Journal =

work page

[65] [65]

and Xie, Z

Gu, Y. and Xie, Z. , Journal =

work page

[66] [66]

and Zazo, J

Gong, W. and Zazo, J. and Luo, Q. and Wang, P. and Hensman, J. and Ma, C. , Journal =

work page

[67] [67]

and Liu, Y

Zhang, M. and Liu, Y. and Schaeffer, H. , Journal =. Adam improves

work page

[68] [68]

and Su, W

Du, Z. and Su, W. , Journal =. The

work page

[69] [69]

and Persson, D

Amsel, N. and Persson, D. and Musco, C. and Gower, R. M. , Booktitle =. The. 2026 , Url =

work page 2026

[70] [70]

and Amsel, N

Zhang, J. and Amsel, N. and Chen, B. and Dao, T. , Year =. Gram

work page

[71] [71]

and Simsekli, U

Gurbuzbalaban, M. and Simsekli, U. and Zhu, L. , Booktitle =. The heavy-tail phenomenon in. 2021 , Organization =

work page 2021

[72] [72]

and Milligan, A

Kunstner, F. and Milligan, A. and Yadav, R. and Schmidt, M. and Bietti, A. , Booktitle =. Heavy-tailed class imbalance and why

work page

[73] [73]

and Bach, F

Kunstner, F. and Bach, F. , Booktitle =. Scaling laws for gradient descent and sign descent for linear bigram models under. 2025 , Url =

work page 2025

[74] [74]

and Fang, A

Li, J. and Fang, A. and Smyrnis, G. and Ivgi, M. and Jordan, M. and Gadre, S. and Bansal, H. and Guha, E. and Keh, S. and Arora, K. and others , Booktitle =. Data

work page

[75] [75]

, Year =

Karpath, A. , Year =. nanochat: The best

work page

[76] [76]

and Yang, Y

Diao, S. and Yang, Y. and Fu, Y. and Dong, X. and Su, D. and Kliegl, M. and Chen, Z. and Belcak, P. and Suhara, Y. and Yin, H. and others , Journal =. Nemotron-

work page

[77] [77]

, Booktitle =

Dozat, T. , Booktitle =. Incorporating. 2016 , Url =

work page 2016

[78] [78]

ArXiv Preprint: 2404.00498 , Year =

94\ Author =. ArXiv Preprint: 2404.00498 , Year =

work page arXiv

[79] [79]

2009 , Month = apr, Url =

Learning multiple layers of features from tiny images , Author =. 2009 , Month = apr, Url =

work page 2009

[80] [80]

NeurIPS , Pages =

The road less scheduled , Author =. NeurIPS , Pages =

work page