pith. machine review for the scientific record.

arxiv: 2605.09552 · v1 · submitted 2026-05-10 · 🧮 math.OC · cs.LG · stat.ML

Recognition: 2 theorem links · Lean Theorem

Phases of Muon: When Muon Eclipses SignSGD

Atish Agarwala, Courtney Paquette, Elliot Paquette, Guangyuan Wang, Lucas Benigni, Noah Marshall

Pith reviewed 2026-05-12 04:01 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG · stat.ML
keywords Muon · SignSGD · SignSVD · power law covariance · spectral optimization · phase diagram · least squares · stochastic gradient descent

The pith

Power-law covariance splits Muon and SignSGD into three performance phases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies stochastic spectral optimizers such as Muon on a high-dimensional matrix least-squares problem and compares them with SignSGD. It derives deterministic dynamics showing that SignSVD, which Muon approximates, applies square-root preconditioning to the data covariance at large batch sizes, while at small batch sizes the smaller eigenmodes revert to SGD-like behavior. SignSGD applies no preconditioning for generic covariance and therefore has different optimal learning rates and convergence characteristics. When both the data covariance and the target follow power laws with exponents alpha and beta, the plane of these exponents divides into three regions: one where SignSGD is uniformly better, one where SignSVD is uniformly better, and one where the methods trade off. This partition gives a concrete way to predict which optimizer converges faster once the spectral exponents of a problem are known.
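
To make the comparison concrete, here is a minimal sketch of the two stochastic updates being contrasted, written for a generic matrix least-squares loss. The loss normalization, batch construction, and initialization are illustrative assumptions, and the exact SVD below stands in for the Newton–Schulz iterations that practical Muon uses.

```python
import numpy as np

def batch_gradient(W, W_star, X):
    # Stochastic gradient of a matrix least-squares loss
    #   L(W) = ||X (W - W_star)||_F^2 / (2B),
    # where X is a (B, N_in) batch and W, W_star are (N_in, N_out).
    B = X.shape[0]
    return X.T @ (X @ (W - W_star)) / B

def signsvd_step(W, G, lr):
    # Idealized Muon / SignSVD step: replace the gradient by its matrix sign,
    # i.e. keep the singular vectors of G and set every singular value to 1.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * (U @ Vt)

def signsgd_step(W, G, lr):
    # SignSGD step (proxy for Adam): elementwise sign of the gradient.
    return W - lr * np.sign(G)
```

The only difference between the two steps is which "sign" is taken, a spectral one for SignSVD versus an entrywise one for SignSGD; the paper's comparison is about how that distinction interacts with the covariance spectrum.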

Core claim

We analyze stochastic spectral optimizers, including Muon approximated by SignSVD, and SignSGD as a proxy for Adam, on a high-dimensional matrix-valued least squares problem. For large batch size, SignSVD performs square-root preconditioning with respect to the data covariance spectrum, while for small batch size smaller eigenmodes behave like SGD. SignSGD performs no preconditioning and has no transition, leading to different optimal learning rates and convergence characteristics. An analysis of a power law covariance model with data exponent alpha and target exponent beta shows there are three phases in the (alpha, beta) plane: one where SignSGD is uniformly favored, one where SignSVD is uniformly favored, and a third where the two methods exhibit a trade-off in performance.

What carries the argument

The power-law covariance model with data exponent alpha and target exponent beta, which partitions the (alpha, beta) plane into three regions of relative performance between SignSVD and SignSGD.
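
For readers who want the exponents pinned down, one common way to parameterize such a model is sketched below. This is an illustrative convention, not the paper's exact normalization or definition of the target spectrum.

```latex
% One standard power-law parameterization (illustrative, not the paper's exact normalization):
% eigenvalues of the data covariance decay with exponent alpha,
% and the target's weight on eigenmode k decays with exponent beta.
\[
  \lambda_k(\Sigma) \;\propto\; k^{-\alpha},
  \qquad
  \bigl\| P_{v_k} W^{\star} \bigr\|^2 \;\propto\; k^{-\beta},
  \qquad k = 1, \dots, N,
\]
\[
  \text{with } \alpha > 0 \ \text{ and } \ \alpha + \beta > 1
  \quad \text{(the regime quoted in the Figure 4 caption).}
\]
```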

Load-bearing premise

The high-dimensional matrix-valued least squares problem with power-law spectra is a faithful proxy for the learning dynamics of these optimizers in practical deep neural network training.

What would settle it

Simulating SignSVD and SignSGD on the matrix least-squares problem for a grid of alpha and beta values and finding that the measured convergence rates or final losses do not fall into the three predicted phases would falsify the claim.
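
A hedged sketch of that experiment: sweep a small (alpha, beta) grid, train both optimizers on synthetic power-law data, and record how long each takes to cut the initial risk by a fixed factor. Dimensions, batch size, learning rate, and the stopping rule below are placeholder choices rather than the paper's t_{4ϵ} protocol.

```python
import numpy as np

def time_to_threshold(optimizer, alpha, beta, N=64, B=64, lr=0.02,
                      rel_eps=0.05, max_steps=1500, seed=0):
    """Steps needed to reduce the population risk to rel_eps times its initial value."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, N + 1, dtype=float)
    lam = k ** (-alpha)                                   # power-law covariance spectrum
    W_star = rng.standard_normal((N, N)) * (k ** (-beta / 2))[:, None]  # power-law target
    W = np.zeros((N, N))

    def risk(W):
        # Population risk of matrix least squares with diagonal covariance diag(lam).
        return 0.5 * np.sum(lam[:, None] * (W - W_star) ** 2)

    target = rel_eps * risk(W)
    for t in range(1, max_steps + 1):
        X = rng.standard_normal((B, N)) * np.sqrt(lam)    # batch with covariance diag(lam)
        G = X.T @ (X @ (W - W_star)) / B                  # stochastic gradient
        if optimizer == "signsvd":
            U, _, Vt = np.linalg.svd(G, full_matrices=False)
            W = W - lr * (U @ Vt)
        else:
            W = W - lr * np.sign(G)
        if risk(W) < target:
            return t
    return max_steps

for alpha in (0.5, 1.5, 3.0):
    for beta in (0.7, 1.5, 3.0):
        t_svd = time_to_threshold("signsvd", alpha, beta)
        t_sgd = time_to_threshold("signsgd", alpha, beta)
        print(f"alpha={alpha:3.1f}  beta={beta:3.1f}  SignSVD: {t_svd:5d}  SignSGD: {t_sgd:5d}")
```

If the ordering of these times across the grid does not reproduce the three predicted regions (allowing for the paper's large-N, large-batch caveats), the phase claim is in trouble; matching it is supporting evidence, not proof.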

Figures

Figures reproduced from arXiv: 2605.09552 by Atish Agarwala, Courtney Paquette, Elliot Paquette, Guangyuan Wang, Lucas Benigni, Noah Marshall.

Figure 1
Figure 1. Random-matrix theory predicts Muon and SignSGD loss curves. Half-anisotropic power-law data (α = 1.5, β = 0.7, N = 1024, B = 2048, a hard, bias-dominated phase). Solid: average of 16 numerical simulations of Muon with 5 Newton–Schulz iterations; gray lines: predicted scaling-law exponents from SignSVD theory in Sec. 5. Theory for SignSVD quantitatively predicts the behavior of the stochastic algorithm … view at source ↗
Figure 2
Figure 2. Self-consistent fixed-point system for the deterministic equivalents of the resolvents … view at source ↗
Figure 3
Figure 3. Isotropic SignSVD and SignSGD under their respective optimal learning rates (3.3). Square setting where N_in = N_out = N, Σ_in = Σ_out = Id_N, N/B = 1; 32 runs of SignSVD and SignSGD with learning rate (3.3) for N ∈ {128, 256, 512}; shaded region is the 80% confidence interval, and the lines are the theoretical predictions. (a) Predictions for the risk trajectories in (3.1) match the simulations of SignSVD and SignSGD … view at source ↗
Figure 4
Figure 4. Phase diagram for SignSVD vs. SignSGD in the (α, β) plane (large-batch regime B ≥ N). The boundaries β = 1 and β = α + 1 are the saturation thresholds of SignSVD and SignSGD respectively (Thm. 5.2). Our analysis holds only for α + β > 1 and α > 0. Three phases: which algorithm wins, SignSVD or SignSGD? Dividing (5.4) by (5.3) yields the ratio t_{4ϵ}^{s-SGD} / t_{4ϵ}^{s-SVD}, which sorts (α, β)-space into three ph… view at source ↗
Figure 5
Figure 5. Deterministic risk trajectories R(t), validating the t_{4ϵ} scaling laws across the four phase regimes (isotropic and Phases A, B, C) … view at source ↗
Figure 6
Figure 6. Validation of the fixed-point system (C.59): empirical spectral density of H_t = G_{t+1} G_{t+1}^⊤ versus the deterministic equivalent. Each panel uses the same setup family (isotropic/anisotropic spectra, γ, and R) and compares ρ_emp(x) to ρ_DE(x) = π^{-1} Im m(x + iη) with η = 1/√N_out. The figure shown here was regenerated with scale factor 4 (baseline d = 100 to d = 400, and B scaled accordingly) … view at source ↗
Figure 7
Figure 7. Step 5 projected-drift validation. For each setup, the blue curve is the Monte Carlo … view at source ↗
Figure 8
Figure 8. Off-critical Step 5 validation in the isotropic setup with … view at source ↗
Figure 9
Figure 9. Term-level projected validation using the updated deterministic equivalents for … view at source ↗
Figure 10
Figure 10. Step-7 term-level validation of the variance kernel. Each panel plots … view at source ↗
Figure 11
Figure 11. Half-anisotropic drift and volatility versus Gaussian Monte Carlo at … view at source ↗
Figure 12
Figure 12. Half-anisotropic drift and volatility versus Gaussian Monte Carlo at … view at source ↗
Figure 13
Figure 13. Half-anisotropic drift and volatility versus Gaussian Monte Carlo at … view at source ↗
Figure 14
Figure 14. t_{Tϵ}(ϵ) for SignSVD and SignSGD, bracketed by the upper bound of Lemma H.5 (c0 = 2, numerically evaluated c(c0); target threshold T = K(c0) ≈ 5) and the lower bound (1/η) ∫₀^{τ_{F = Tϵ}} √F du of Lemma H.4. Parameters: α = 1.5, γ = 0.5, N = 1000; left β = 0.7 (SignSVD F-branch), right β = 1.5 (SignSVD saturation regime). Blue circles and red squares are measured t_{Tϵ}; blue/red plus markers are upper bounds; do… view at source ↗
Figure 15
Figure 15. Stochastic Muon (Newton–Schulz quintic, βmom = 0, dashed) against the deterministic SignSVD time-to-ϵ theory (solid) of Sec. 5, for N = 1024, B = 2048, 16 trials per target ϵ, color-coded by ϵ ∈ {2⁻⁴, 2⁻⁶, …, 2⁻¹⁴}. Circles are the theoretical t_{2ϵ}; squares are the simulated t_{2ϵ}. Across the isotropic baseline and all three half-anisotropic phases, the no-momentum Muon trajectories sit on top of … view at source ↗
Figure 16
Figure 16. Half-anisotropic Phase A (α = 1.5, βinit = 0.7, N = 1024, B = 2048, 16 trials). Left: risk trajectories with the matched-LR η⋆ for each target ϵ (color); linestyles encode βmom ∈ {0, 0.9, 0.95, 0.99}; markers indicate the measured t_{2ϵ}. Right: t_{2ϵ} scaling against target ϵ on log–log axes. The dotted reference is the SignSVD slope p_SVD = 0.625 from Sec. 5; the dashed reference is the SignSGD slope p_SGD = 1… view at source ↗
Figure 17
Figure 17. Half-anisotropic Phase B (α = 1.5, βinit = 1.5, N = 1024, B = 2048, 16 trials); axes and conventions as in … view at source ↗
Figure 18
Figure 18. Half-anisotropic Phase C (α = 1.5, βinit = 3.0, N = 1024, B = 2048, 16 trials); axes and conventions as in … view at source ↗
Figure 19
Figure 19. Effective noise constant Neff(βmom) for Muon at N = 1024 across the three half-anisotropic phases. Left: absolute values on a log axis; markers are calibrated Neff. The dashed curve is the raw heuristic Neff(0)·√((1 + β)/(1 − β)) of Eq. (I.1) with ρ = 1; the solid curve is the empirical fit with ρ̂ = 0.842. The three phases overlap to within marker width at every βmom. Right: empirical prefactor ρ(βmom) ≡ N… view at source ↗
read the original abstract

Recently, Muon and related spectral optimizers have demonstrated strong empirical performance as scalable stochastic methods, often outperforming Adam. Yet their behaviour remains poorly understood. We analyze stochastic spectral optimizers, including Muon, on a high-dimensional matrix-valued least squares problem. We derive explicit deterministic dynamics that provide a tractable framework for studying learning behaviour with a focus on (stochastic) SignSVD, which Muon approximates, and (stochastic) SignSGD, the latter serving as a proxy for Adam. Our analysis shows that for large batch size, SignSVD performs a square-root preconditioning with respect to the data covariance spectrum, while for small batch size smaller eigenmodes behave like SGD, slowing down convergence. We contrast with SignSGD which for generic covariance performs no preconditioning and has no transition, leading to different optimal learning rates and convergence characteristics. The two methods match up to a constant factor with isotropic data, but behave differently with anisotropic data. An analysis of a power law covariance model with data exponent $\alpha$ and target exponent $\beta$ shows there are three phases in the $(\alpha,\beta)$ plane: one where SignSGD is uniformly favored, one where SignSVD is uniformly favored, and a third where the two methods exhibit a trade-off in performance.
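
A heuristic, single-mode caricature of what "square-root preconditioning" buys (a gradient-flow cartoon, not the paper's stochastic derivation): write the error in the eigenbasis of the data covariance, with eigenvalue λ_k on mode k.

```latex
% Gradient-flow cartoon on a quadratic risk R = (1/2) \sum_k \lambda_k e_k^2:
\[
  \text{no preconditioning (the SignSGD regime described above):}
  \quad \dot e_k \propto -\eta\, \lambda_k\, e_k ,
\]
\[
  \text{square-root preconditioning (large-batch SignSVD regime):}
  \quad \dot e_k \propto -\eta\, \lambda_k^{1/2}\, e_k .
\]
% Low-eigenvalue (tail) modes are relatively accelerated under the square-root
% preconditioning, which is why anisotropic power-law spectra can separate the two methods.
```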

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes stochastic spectral optimizers including SignSVD (approximating Muon) and SignSGD (proxy for Adam) on a high-dimensional matrix-valued least-squares problem. It derives explicit deterministic dynamics for error evolution, shows that SignSVD performs square-root preconditioning for large batches (reverting to SGD-like behavior for small eigenmodes at small batches) while SignSGD performs no preconditioning, and uses a power-law covariance model with data exponent α and target exponent β to identify three phases in the (α, β) plane: one uniformly favoring SignSGD, one uniformly favoring SignSVD, and one with a performance trade-off.

Significance. If the deterministic approximations and phase classification hold, the work supplies a tractable theoretical framework for predicting when spectral methods outperform simpler sign-based optimizers as a function of spectral anisotropy, with explicit scaling exponents and falsifiable phase boundaries. The derivation of closed-form deterministic dynamics and the power-law analysis constitute clear strengths that enable direct comparison of convergence rates.

major comments (2)
  1. [Deterministic dynamics derivation] The derivation of deterministic dynamics (via expectation or large-batch limits) for the sign nonlinearity does not address whether higher moments or finite-batch fluctuations alter the effective convergence rates for small eigenmodes; this directly affects the location and existence of the trade-off phase in the (α, β) plane.
  2. [Power-law covariance model analysis] The power-law analysis classifies phases by comparing asymptotic scaling exponents obtained from the deterministic rates; without an explicit check that the mean trajectory governs typical sample paths under stochastic sign operations (especially when batch size is small), the trichotomy claim rests on an unverified approximation.
minor comments (2)
  1. [Abstract] The abstract states that the two methods 'match up to a constant factor with isotropic data' but does not quantify the constant or the precise isotropic limit.
  2. [Power-law model] Notation for the data exponent α and target exponent β should be introduced with their precise definitions before the phase-plane analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and valuable feedback on our analysis of stochastic spectral optimizers. We address the two major comments point by point below, clarifying the scope of our deterministic approximations while acknowledging their limitations.

read point-by-point responses
  1. Referee: The derivation of deterministic dynamics (via expectation or large-batch limits) for the sign nonlinearity does not address whether higher moments or finite-batch fluctuations alter the effective convergence rates for small eigenmodes; this directly affects the location and existence of the trade-off phase in the (α, β) plane.

    Authors: Our derivation obtains deterministic dynamics by computing the expectation of the sign update (or taking the large-batch limit), which produces explicit per-eigenmode error recursions. In the manuscript we already note that small eigenmodes revert to SGD-like behavior under small batches because the sign operation on low-magnitude signals becomes effectively stochastic. We agree, however, that higher moments and finite-batch fluctuations are not analyzed in detail and could modify the effective rates, thereby shifting the trade-off phase boundaries. We will add a new subsection discussing the validity regime of the mean-field approximation, including a qualitative argument on when fluctuations remain negligible (large batch or sufficiently strong signal-to-noise per mode) and explicitly stating that a full stochastic approximation theory lies beyond the present scope. revision: partial

  2. Referee: The power-law analysis classifies phases by comparing asymptotic scaling exponents obtained from the deterministic rates; without an explicit check that the mean trajectory governs typical sample paths under stochastic sign operations (especially when batch size is small), the trichotomy claim rests on an unverified approximation.

    Authors: The three phases in the (α, β) plane are obtained by comparing the asymptotic scaling exponents that follow directly from the deterministic rates under the assumed power-law spectra. This yields explicit, falsifiable boundaries. We concur that the classification assumes the mean trajectory is representative of typical paths and that this assumption is least secure for small batches and small eigenmodes, where sign stochasticity can produce larger deviations. We will revise the power-law section to include a short paragraph that (i) states the mean-field nature of the derivation, (ii) identifies the parameter regimes (batch size relative to eigenvalue magnitude) where the approximation is expected to be accurate, and (iii) notes that concentration or large-deviation analysis of the stochastic sign process is left for future work. revision: partial
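
A toy version of the check discussed in the second response: simulate many independent sample paths of a scalar stochastic sign update at several batch sizes and compare the path-to-path spread of the risk to its mean. The scalar noisy least-squares model, step size, and batch sizes below are illustrative assumptions, not the paper's setting.

```python
import numpy as np

def sign_sgd_paths(batch, noise=1.0, n_paths=500, n_steps=300, lr=0.05, theta0=2.0, seed=0):
    """Excess-risk trajectories of scalar sign-SGD on a noisy least-squares problem."""
    rng = np.random.default_rng(seed)
    theta = np.full(n_paths, theta0)                       # true parameter is 0
    risks = np.empty((n_steps, n_paths))
    for t in range(n_steps):
        x = rng.standard_normal((n_paths, batch))
        y = noise * rng.standard_normal((n_paths, batch))  # labels: x * 0 + noise
        grad = np.mean(x * (x * theta[:, None] - y), axis=1)
        theta = theta - lr * np.sign(grad)                 # stochastic sign update
        risks[t] = 0.5 * theta ** 2                        # excess risk per path
    return risks

for batch in (1, 8, 64, 512):
    r = sign_sgd_paths(batch)
    print(f"B={batch:4d}: mean final risk {r[-1].mean():.4f}, "
          f"path-to-path std {r[-1].std():.4f}")
```

If the spread stays small relative to the mean as the batch shrinks, a mean-trajectory description is at least plausible in this toy; a large spread is exactly the regime the referee worries about.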

Circularity Check

0 steps flagged

No circularity: phases derived from explicit deterministic dynamics on posited power-law model

full rationale

The paper posits a high-dimensional matrix least-squares problem with power-law spectra as a proxy model, derives deterministic dynamics for SignSVD and SignSGD error evolution (via expectation/large-batch limits), obtains asymptotic convergence rates as functions of exponents α and β, and partitions the (α,β) plane by comparing those rates. This chain is self-contained first-principles analysis within the model; no parameter is fitted to the target data, no prediction is renamed from a fit, and no load-bearing step reduces to a self-citation or self-definition. The trichotomy is a direct consequence of the derived scalings, not tautological with the inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The analysis rests on a high-dimensional least-squares model, deterministic ODE approximation of stochastic updates, and power-law spectra; no new physical entities are introduced.

free parameters (2)
  • α
    Exponent controlling the decay of the data covariance spectrum in the power-law model.
  • β
    Exponent controlling the decay of the target solution spectrum in the power-law model.
axioms (2)
  • domain assumption Stochastic gradient updates can be replaced by deterministic differential equations in the high-dimensional limit
    Invoked to obtain tractable dynamics for SignSVD and SignSGD.
  • domain assumption Power-law spectra adequately capture the anisotropy present in real training data
    Used to partition the (α, β) plane into performance phases.

pith-pipeline@v0.9.0 · 5550 in / 1413 out tokens · 40000 ms · 2026-05-12T04:01:20.377577+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 4 internal anchors

  1. [1] Ben Adlam and Jeffrey Pennington. Understanding Double Descent Requires A Fine-Grained Bias-Variance Decomposition. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 11022–11032, 2020.
  2. [2] Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm. In International Conference on Learning Representations (ICLR), 2026.
  3. [3] Zhidong Bai and Jack W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices, volume 20. Springer, 2010.
  4. [4] Lukas Balles and Philipp Hennig. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 404–413. PMLR, 2018.
  5. [5] Lukas Balles, Fabian Pedregosa, and Nicolas Le Roux. The Geometry of Sign Gradient Descent. In International Conference on Learning Representations (ICLR), pages 1–24, 2020.
  6. [6] Gérard Ben Arous, Murat A. Erdogdu, Nuri Mert Vural, and Denny Wu. Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws. arXiv preprint arXiv:2508.03688, 2025.
  7. [7] Gérard Ben Arous, Reza Gheissari, and Aukosh Jagannath. High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. Advances in Neural Information Processing Systems (NeurIPS), 35:25349–25362, 2022.
  8. [8] Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024.
  9. [9] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 560–569. PMLR, 2018.
  10. [10] Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A Dynamical Model of Neural Scaling Laws. In Proceedings of the 41st International Conference on Machine Learning (ICML), volume 235 of Proceedings of Machine Learning Research, pages 4345–4382. PMLR, 2024.
  11. [11] Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. How Feature Learning Can Improve Neural Scaling Laws. International Conference on Learning Representations (ICLR), 2025.
  12. [12] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368, 2007.
  13. [13] David Carlson, Volkan Cevher, and Lawrence Carin. Stochastic spectral descent for restricted Boltzmann machines. In Artificial Intelligence and Statistics, pages 111–119. PMLR, 2015.
  14. [14] Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with SGD and random features. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.
  15. [15] Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. Transactions on Machine Learning Research (TMLR), 2026.
  16. [16] Jeremy M. Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E. Dahl, et al. Adaptive gradient methods at the edge of stability. arXiv preprint arXiv:2207.14484, 2022.
  17. [17] Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations (ICLR), pages 1–80, 2021.
  18. [18] Elizabeth Collins-Woodfin, Courtney Paquette, Elliot Paquette, and Inbar Seroussi. Hitting the high-dimensional notes: an ODE for SGD learning dynamics on GLMs and multi-index models. Inf. Inference, 13(4): Paper No. iaae028, 107, 2024.
  19. [19] Elizabeth Collins-Woodfin, Inbar Seroussi, Begoña García Malaxechebarría, Andrew W. Mackenzie, Elliot Paquette, and Courtney Paquette. The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, 2024.
  20. [20] Romain Couillet and Zhenyu Liao. Random Matrix Methods for Machine Learning. Cambridge University Press, 2022.
  21. [21] Damek Davis and Dmitriy Drusvyatskiy. When do spectral gradient updates help in deep learning? arXiv preprint arXiv:2512.04299, 2025.
  22. [22] DeepMind, Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Antoine Dedieu, Claudio Fantacci, Jonathan Godwin, Chris Jones, Ross Hemsley, Tom Hennigan, Matteo Hessel, Shaobo Hou, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King, Markus Kunesch, Lena ... The DeepMind JAX Ecosystem.
  23. [23] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, 2026.
  24. [24] Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.
  25. [25] Zhehang Du and Weijie Su. The Newton-Muon Optimizer. arXiv preprint arXiv:2604.01472, 2026.
  26. [26] Chen Fan, Mark Schmidt, and Christos Thrampoulidis. Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data. arXiv preprint arXiv:2502.04664, 2025.
  27. [27] Damien Ferbach, Katie Everett, Gauthier Gidel, Elliot Paquette, and Courtney Paquette. Dimension-adapted Momentum Outscales SGD. arXiv preprint arXiv:2505.16098, 2025.
  28. [28] Reza Gheissari and Aukosh Jagannath. Universality of high-dimensional scaling limits of stochastic gradient descent. arXiv preprint arXiv:2512.13634, 2025.
  29. [29] Antoine Gonon, Andreea-Alexandra Muşat, and Nicolas Boumal. Insights on Muon from Simple Quadratics. arXiv preprint arXiv:2602.11948, 2026.
  30. [30] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 1842–1850. PMLR, 2018.
  31. [31] David L. Hanson and F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971.
  32. [32] Hong Hu and Yue M. Lu. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 69(3):1932–1964, 2022.
  33. [33] Aukosh Jagannath, Taj Jones-McCormick, and Varnan Sarangian. High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes. arXiv preprint arXiv:2511.03952, 2025.
  34. [34] Ruichen Jiang, Zakaria Mhammedi, Mehryar Mohri, and Aryan Mokhtari. Adaptive Matrix Online Learning through Smoothing with Guarantees for Nonsmooth Nonconvex Optimization. arXiv preprint arXiv:2602.08232, 2026.
  35. [35] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. URL https://kellerjordan.github.io/posts/muon, 2024.
  36. [36] Gyu Yeol Kim and Min-hwan Oh. Convergence of Muon with Newton-Schulz. In International Conference on Learning Representations (ICLR), pages 1–29, 2026.
  37. [37] Jihwan Kim, Dogyoon Song, and Chulhee Yun. Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD? In International Conference on Learning Representations (ICLR), pages 1–89, 2026.
  38. [38] Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, and Jason D. Lee. Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory. arXiv preprint arXiv:2603.26554, 2026.
  39. [39] Kiwon Lee, Andrew Cheng, Elliot Paquette, and Courtney Paquette. Trajectory of mini-batch momentum: batch size saturation and convergence in high dimensions. Advances in Neural Information Processing Systems (NeurIPS), 35:36944–36957, 2022.
  40. [40] Binghui Li, Kaifei Wang, Han Zhong, Pinyan Lu, and Liwei Wang. Muon in Associative Memory Learning: Training Dynamics and Scaling Laws. arXiv preprint arXiv:2602.05725, 2026.
  41. [41] Xuheng Li, Yihe Deng, Jingfeng Wu, Dongruo Zhou, and Quanquan Gu. Risk bounds of accelerated SGD for overparameterized linear regression. In International Conference on Learning Representations (ICLR), pages 1–69, 2024.
  42. [42] Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, and Jason D. Lee. Scaling Laws in Linear Regression: Compute, Parameters, and Data. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 60556–60606, 2024.
  43. [43] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is Scalable for LLM Training. arXiv preprint arXiv:2502.16982, 2025.
  44. [44] Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, and Tuo Zhao. Cosmos: A Hybrid Adaptive Optimizer for Efficient Training of Large Language Models. In International Conference on Learning Representations (ICLR), 2026.
  45. [45] Chao Ma, Wenbo Gong, Meyer Scetbon, and Edward Meeds. Swan: SGD with Normalization and Whitening Enables Stateless LLM Training. In Proceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 of Proceedings of Machine Learning Research, pages 41907–41942. PMLR, 2025.
  46. [46] Noah Marshall, Ke Liang Xiao, Atish Agarwala, and Elliot Paquette. To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-dimensions. In International Conference on Learning Representations (ICLR), pages 1–37, 2025.
  47. [47] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37 of Proceedings of Machine Learning Research, pages 2408–2417. PMLR, 2015.
  48. [48] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022.
  49. [49] James A. Mingo and Roland Speicher. Free Probability and Random Matrices, volume 35 of Fields Institute Monographs. Springer New York, 2017.
  50. [50] Andrea Montanari and Zihao Wang. Phase transitions for feature learning in neural networks. arXiv preprint arXiv:2602.01434, 2026.
  51. [51] Yuri Nesterov. Introductory Lectures on Convex Optimization. Springer, 2004.
  52. [52] Courtney Paquette, Kiwon Lee, Fabian Pedregosa, and Elliot Paquette. SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality. In Proceedings of Thirty Fourth Conference on Learning Theory (COLT), volume 134, pages 3548–3626, 2021.
  53. [53] Courtney Paquette and Elliot Paquette. Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models. Advances in Neural Information Processing Systems (NeurIPS), 34:9229–9240, 2021.
  54. [54] Courtney Paquette, Elliot Paquette, Ben Adlam, and Jeffrey Pennington. Homogenization of SGD in high-dimensions: exact dynamics and generalization properties. Mathematical Programming, pages 1–90, 2024.
  55. [55] Elliot Paquette, Courtney Paquette, Lechao Xiao, and Jeffrey Pennington. 4+3 Phases of Compute-Optimal Neural Scaling Laws. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, 2024.
  56. [56] J. Pennington and P. Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  57. [57] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training Deep Learning Models with Norm-Constrained LMOs. In Proceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 of Proceedings of Machine Learning Research, pages 49069–49104. PMLR, 2025.
  58. [58] Valentin V. Petrov. Sums of Independent Random Variables. Springer Berlin Heidelberg, 1975.
  59. [59] Gabriel Peyré. Muon Dynamics as a Spectral Wasserstein Flow. arXiv preprint arXiv:2604.04891, 2026.
  60. [60] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.
  61. [61] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4, 1964.
  62. [62] Shikai Qiu, Lechao Xiao, Andrew Gordon Wilson, Jeffrey Pennington, and Atish Agarwala. Scaling collapse reveals universal dynamics in compute-optimally trained neural networks. In International Conference on Machine Learning, pages 50697–50720. PMLR, 2025.
  63. [63] Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richtárik. Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs). arXiv preprint arXiv:2505.13416, 2025.
  64. [64] Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J. Shah, et al. Practical Efficiency of Muon for Pretraining. arXiv preprint arXiv:2505.02222, 2025.
  65. [65] Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the Convergence Analysis of Muon. arXiv preprint arXiv:2505.23737, 2025.
  66. [66] Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale. arXiv preprint arXiv:2309.06497, 2023.
  67. [67] Egor Shulgin, Sultan AlRashed, Francesco Orabona, and Peter Richtárik. Beyond the Ideal: Analyzing the Inexact Muon Update. arXiv preprint arXiv:2510.19933, 2025.
  68. [68] James B. Simon, Dhruva Karkada, Nikhil Ghosh, and Mikhail Belkin. More is better in modern machine learning: when infinite overparameterization is optimal and overfitting is obligatory. In International Conference on Learning Representations (ICLR), pages 1–40, 2024.
  69. [69] Weijie Su. Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal? arXiv preprint arXiv:2511.00674, 2025.
  70. [70] Nikolaos Tsilivis, Eitan Gronich, Julia Kempe, and Gal Vardi. Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks. In International Conference on Learning Representations (ICLR), pages 1–27, 2025.
  71. [71] Mark Tuddenham, Adam Prügel-Bennett, and Jonathan Hare. Orthogonalising gradients to speed up neural network optimisation. In International Conference on Learning Representations (ICLR), pages 1–15, 2022.
  72. [72] Aditya Varre and Nicolas Flammarion. Accelerated SGD for non-strongly-convex least squares. In Proceedings of Thirty Fifth Conference on Learning Theory (COLT), volume 135 of Proceedings of Machine Learning Research, pages 2062–2126, 2022.
  73. [73] Bhavya Vasudeva, Puneesh Deora, Yize Zhao, Vatsal Sharan, and Christos Thrampoulidis. How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data. In International Conference on Learning Representations (ICLR), pages 1–36, 2026.
  74. [74] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.
  75. [75] Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and Stabilizing Shampoo Using Adam. In International Conference on Learning Representations (ICLR), pages 1–22, 2025.
  76. [76] Guangyuan Wang, Elliot Paquette, and Atish Agarwala. High-dimensional isotropic scaling dynamics of Muon and SGD. In OPT 2025: Optimization for Machine Learning, 2025.
  77. [77] Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent Y. F. Tan. Muon Outperforms Adam in Tail-End Associative Memory Learning. In International Conference on Learning Representations (ICLR), pages 1–38, 2026.
  78. [78] Alexander Wei, Wei Hu, and Jacob Steinhardt. More than a toy: Random matrix models predict how real-world neural representations generalize. In International Conference on Machine Learning (ICML), 2022.
  79. [79] Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic Pretraining Optimizers and Where to Find Them. In International Conference on Learning Representations (ICLR), pages 1–107, 2026.
  80. [80] Ke Liang Xiao, Noah Marshall, Atish Agarwala, and Elliot Paquette. Exact risk curves of signSGD in High-Dimensions: quantifying preconditioning and noise-compression effects. In International Conference on Learning Representations (ICLR), pages 1–48, 2025.

Showing first 80 references.