pith. machine review for the scientific record.

arxiv: 2605.09552 · v1 · submitted 2026-05-10 · 🧮 math.OC · cs.LG · stat.ML

Recognition: 2 theorem links · Lean Theorem

Phases of Muon: When Muon Eclipses SignSGD

Atish Agarwala, Courtney Paquette, Elliot Paquette, Guangyuan Wang, Lucas Benigni, Noah Marshall

Pith reviewed 2026-05-12 04:01 UTC · model grok-4.3

classification 🧮 math.OC · cs.LG · stat.ML
keywords Muon · SignSGD · SignSVD · power law covariance · spectral optimization · phase diagram · least squares · stochastic gradient descent

The pith

Power-law covariance splits Muon and SignSGD into three performance phases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper studies stochastic spectral optimizers such as Muon on a high-dimensional matrix least-squares problem and compares them with SignSGD. It derives deterministic dynamics showing that SignSVD, which Muon approximates, applies square-root preconditioning to the data covariance at large batch sizes, while at small batch sizes the smaller eigenmodes revert to SGD-like behavior. SignSGD applies no preconditioning for generic covariance and therefore has different optimal learning rates and convergence characteristics. When both the data covariance and the target follow power laws with exponents alpha and beta, the plane of these exponents divides into three regions: one where SignSGD is uniformly better, one where SignSVD is uniformly better, and one where the methods trade off. This partition gives a concrete way to predict which optimizer converges faster once the spectral exponents of a problem are known.
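
To make the comparison concrete, here is a minimal sketch of the two stochastic updates being contrasted, written for a generic matrix least-squares loss. The loss normalization, batch construction, and initialization are illustrative assumptions, and the exact SVD below stands in for the Newton–Schulz iterations that practical Muon uses.

```python
import numpy as np

def batch_gradient(W, W_star, X):
    # Stochastic gradient of a matrix least-squares loss
    #   L(W) = ||X (W - W_star)||_F^2 / (2B),
    # where X is a (B, N_in) batch and W, W_star are (N_in, N_out).
    B = X.shape[0]
    return X.T @ (X @ (W - W_star)) / B

def signsvd_step(W, G, lr):
    # Idealized Muon / SignSVD step: replace the gradient by its matrix sign,
    # i.e. keep the singular vectors of G and set every singular value to 1.
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return W - lr * (U @ Vt)

def signsgd_step(W, G, lr):
    # SignSGD step (proxy for Adam): elementwise sign of the gradient.
    return W - lr * np.sign(G)
```

The only difference between the two steps is which "sign" is taken, a spectral one for SignSVD versus an entrywise one for SignSGD; the paper's comparison is about how that distinction interacts with the covariance spectrum.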

Core claim

We analyze stochastic spectral optimizers, including Muon approximated by SignSVD, and SignSGD as a proxy for Adam, on a high-dimensional matrix-valued least squares problem. For large batch size, SignSVD performs square-root preconditioning with respect to the data covariance spectrum, while for small batch size smaller eigenmodes behave like SGD. SignSGD performs no preconditioning and has no transition, leading to different optimal learning rates and convergence characteristics. An analysis of a power law covariance model with data exponent alpha and target exponent beta shows there are three phases in the (alpha, beta) plane: one where SignSGD is uniformly favored, one where SignSVD is uniformly favored, and a third where the two methods exhibit a trade-off in performance.

What carries the argument

The power-law covariance model with data exponent alpha and target exponent beta, which partitions the (alpha, beta) plane into three regions of relative performance between SignSVD and SignSGD.
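
For readers who want the exponents pinned down, one common way to parameterize such a model is sketched below. This is an illustrative convention, not the paper's exact normalization or definition of the target spectrum.

```latex
% One standard power-law parameterization (illustrative, not the paper's exact normalization):
% eigenvalues of the data covariance decay with exponent alpha,
% and the target's weight on eigenmode k decays with exponent beta.
\[
  \lambda_k(\Sigma) \;\propto\; k^{-\alpha},
  \qquad
  \bigl\| P_{v_k} W^{\star} \bigr\|^2 \;\propto\; k^{-\beta},
  \qquad k = 1, \dots, N,
\]
\[
  \text{with } \alpha > 0 \ \text{ and } \ \alpha + \beta > 1
  \quad \text{(the regime quoted in the Figure 4 caption).}
\]
```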

Load-bearing premise

The high-dimensional matrix-valued least squares problem with power-law spectra is a faithful proxy for the learning dynamics of these optimizers in practical deep neural network training.

What would settle it

Simulating SignSVD and SignSGD on the matrix least-squares problem for a grid of alpha and beta values and finding that the measured convergence rates or final losses do not fall into the three predicted phases would falsify the claim.
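
A hedged sketch of that experiment: sweep a small (alpha, beta) grid, train both optimizers on synthetic power-law data, and record how long each takes to cut the initial risk by a fixed factor. Dimensions, batch size, learning rate, and the stopping rule below are placeholder choices rather than the paper's t_{4ϵ} protocol.

```python
import numpy as np

def time_to_threshold(optimizer, alpha, beta, N=64, B=64, lr=0.02,
                      rel_eps=0.05, max_steps=1500, seed=0):
    """Steps needed to reduce the population risk to rel_eps times its initial value."""
    rng = np.random.default_rng(seed)
    k = np.arange(1, N + 1, dtype=float)
    lam = k ** (-alpha)                                   # power-law covariance spectrum
    W_star = rng.standard_normal((N, N)) * (k ** (-beta / 2))[:, None]  # power-law target
    W = np.zeros((N, N))

    def risk(W):
        # Population risk of matrix least squares with diagonal covariance diag(lam).
        return 0.5 * np.sum(lam[:, None] * (W - W_star) ** 2)

    target = rel_eps * risk(W)
    for t in range(1, max_steps + 1):
        X = rng.standard_normal((B, N)) * np.sqrt(lam)    # batch with covariance diag(lam)
        G = X.T @ (X @ (W - W_star)) / B                  # stochastic gradient
        if optimizer == "signsvd":
            U, _, Vt = np.linalg.svd(G, full_matrices=False)
            W = W - lr * (U @ Vt)
        else:
            W = W - lr * np.sign(G)
        if risk(W) < target:
            return t
    return max_steps

for alpha in (0.5, 1.5, 3.0):
    for beta in (0.7, 1.5, 3.0):
        t_svd = time_to_threshold("signsvd", alpha, beta)
        t_sgd = time_to_threshold("signsgd", alpha, beta)
        print(f"alpha={alpha:3.1f}  beta={beta:3.1f}  SignSVD: {t_svd:5d}  SignSGD: {t_sgd:5d}")
```

If the ordering of these times across the grid does not reproduce the three predicted regions (allowing for the paper's large-N, large-batch caveats), the phase claim is in trouble; matching it is supporting evidence, not proof.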

Figures

Figures reproduced from arXiv: 2605.09552 by Atish Agarwala, Courtney Paquette, Elliot Paquette, Guangyuan Wang, Lucas Benigni, Noah Marshall.

Figure 1
Figure 1. Random-matrix theory predicts Muon and SignSGD loss curves. Half-anisotropic power-law data (α = 1.5, β = 0.7, N = 1024, B = 2048, a hard, bias-dominated phase). Solid: average of 16 numerical simulations of Muon with 5 Newton–Schulz iterations; gray lines: predicted scaling-law exponents from SignSVD theory in Sec. 5. Theory for SignSVD quantitatively predicts the behavior of the stochastic algorithm … view at source ↗
Figure 2
Figure 2. Self-consistent fixed-point system for the deterministic equivalents of the resolvents … view at source ↗
Figure 3
Figure 3. Isotropic SignSVD and SignSGD under their respective optimal learning rates (3.3). Square setting where N_in = N_out = N, Σ_in = Σ_out = Id_N, N/B = 1; 32 runs of SignSVD and SignSGD with learning rate (3.3) for N ∈ {128, 256, 512}; shaded region is the 80% confidence interval, and the lines are the theoretical predictions. (a) Predictions for the risk trajectories in (3.1) match the simulations of SignSVD and SignSGD … view at source ↗
Figure 4
Figure 4. Phase diagram for SignSVD vs. SignSGD in the (α, β) plane (large-batch regime B ≥ N). The boundaries β = 1 and β = α + 1 are the saturation thresholds of SignSVD and SignSGD respectively (Thm. 5.2). Our analysis holds only for α + β > 1 and α > 0. Three phases: which algorithm wins, SignSVD or SignSGD? Dividing (5.4) by (5.3) yields the ratio t_{4ϵ}^{s-SGD} / t_{4ϵ}^{s-SVD}, which sorts (α, β)-space into three ph… view at source ↗
Figure 5
Figure 5. Deterministic risk trajectories R(t), validating the t_{4ϵ} scaling laws across the four phase regimes (isotropic and Phases A, B, C) … view at source ↗
Figure 6
Figure 6. Validation of the fixed-point system (C.59): empirical spectral density of H_t = G_{t+1} G_{t+1}^⊤ versus the deterministic equivalent. Each panel uses the same setup family (isotropic/anisotropic spectra, γ, and R) and compares ρ_emp(x) to ρ_DE(x) = π^{-1} Im m(x + iη) with η = 1/√N_out. The figure shown here was regenerated with scale factor 4 (baseline d = 100 to d = 400, and B scaled accordingly) … view at source ↗
Figure 7
Figure 7. Step 5 projected-drift validation. For each setup, the blue curve is the Monte Carlo … view at source ↗
Figure 8
Figure 8. Off-critical Step 5 validation in the isotropic setup with … view at source ↗
Figure 9
Figure 9. Term-level projected validation using the updated deterministic equivalents for … view at source ↗
Figure 10
Figure 10. Step-7 term-level validation of the variance kernel. Each panel plots … view at source ↗
Figure 11
Figure 11. Half-anisotropic drift and volatility versus Gaussian Monte Carlo at … view at source ↗
Figure 12
Figure 12. Half-anisotropic drift and volatility versus Gaussian Monte Carlo at … view at source ↗
Figure 13
Figure 13. Half-anisotropic drift and volatility versus Gaussian Monte Carlo at … view at source ↗
Figure 14
Figure 14. t_{Tϵ}(ϵ) for SignSVD and SignSGD, bracketed by the upper bound of Lemma H.5 (c0 = 2, numerically evaluated c(c0); target threshold T = K(c0) ≈ 5) and the lower bound (1/η) ∫₀^{τ_{F = Tϵ}} √F du of Lemma H.4. Parameters: α = 1.5, γ = 0.5, N = 1000; left β = 0.7 (SignSVD F-branch), right β = 1.5 (SignSVD saturation regime). Blue circles and red squares are measured t_{Tϵ}; blue/red plus markers are upper bounds; do… view at source ↗
Figure 15
Figure 15. Stochastic Muon (Newton–Schulz quintic, βmom = 0, dashed) against the deterministic SignSVD time-to-ϵ theory (solid) of Sec. 5, for N = 1024, B = 2048, 16 trials per target ϵ, color-coded by ϵ ∈ {2⁻⁴, 2⁻⁶, …, 2⁻¹⁴}. Circles are the theoretical t_{2ϵ}; squares are the simulated t_{2ϵ}. Across the isotropic baseline and all three half-anisotropic phases, the no-momentum Muon trajectories sit on top of … view at source ↗
Figure 16
Figure 16. Half-anisotropic Phase A (α = 1.5, βinit = 0.7, N = 1024, B = 2048, 16 trials). Left: risk trajectories with the matched-LR η⋆ for each target ϵ (color); linestyles encode βmom ∈ {0, 0.9, 0.95, 0.99}; markers indicate the measured t_{2ϵ}. Right: t_{2ϵ} scaling against target ϵ on log–log axes. The dotted reference is the SignSVD slope p_SVD = 0.625 from Sec. 5; the dashed reference is the SignSGD slope p_SGD = 1… view at source ↗
Figure 17
Figure 17. Half-anisotropic Phase B (α = 1.5, βinit = 1.5, N = 1024, B = 2048, 16 trials); axes and conventions as in … view at source ↗
Figure 18
Figure 18. Half-anisotropic Phase C (α = 1.5, βinit = 3.0, N = 1024, B = 2048, 16 trials); axes and conventions as in … view at source ↗
Figure 19
Figure 19. Effective noise constant Neff(βmom) for Muon at N = 1024 across the three half-anisotropic phases. Left: absolute values on a log axis; markers are calibrated Neff. The dashed curve is the raw heuristic Neff(0)·√((1 + β)/(1 − β)) of Eq. (I.1) with ρ = 1; the solid curve is the empirical fit with ρ̂ = 0.842. The three phases overlap to within marker width at every βmom. Right: empirical prefactor ρ(βmom) ≡ N… view at source ↗
read the original abstract

Recently, Muon and related spectral optimizers have demonstrated strong empirical performance as scalable stochastic methods, often outperforming Adam. Yet their behaviour remains poorly understood. We analyze stochastic spectral optimizers, including Muon, on a high-dimensional matrix-valued least squares problem. We derive explicit deterministic dynamics that provide a tractable framework for studying learning behaviour with a focus on (stochastic) SignSVD, which Muon approximates, and (stochastic) SignSGD, the latter serving as a proxy for Adam. Our analysis shows that for large batch size, SignSVD performs a square-root preconditioning with respect to the data covariance spectrum, while for small batch size smaller eigenmodes behave like SGD, slowing down convergence. We contrast with SignSGD which for generic covariance performs no preconditioning and has no transition, leading to different optimal learning rates and convergence characteristics. The two methods match up to a constant factor with isotropic data, but behave differently with anisotropic data. An analysis of a power law covariance model with data exponent $\alpha$ and target exponent $\beta$ shows there are three phases in the $(\alpha,\beta)$ plane: one where SignSGD is uniformly favored, one where SignSVD is uniformly favored, and a third where the two methods exhibit a trade-off in performance.
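
A heuristic, single-mode caricature of what "square-root preconditioning" buys (a gradient-flow cartoon, not the paper's stochastic derivation): write the error in the eigenbasis of the data covariance, with eigenvalue λ_k on mode k.

```latex
% Gradient-flow cartoon on a quadratic risk R = (1/2) \sum_k \lambda_k e_k^2:
\[
  \text{no preconditioning (the SignSGD regime described above):}
  \quad \dot e_k \propto -\eta\, \lambda_k\, e_k ,
\]
\[
  \text{square-root preconditioning (large-batch SignSVD regime):}
  \quad \dot e_k \propto -\eta\, \lambda_k^{1/2}\, e_k .
\]
% Low-eigenvalue (tail) modes are relatively accelerated under the square-root
% preconditioning, which is why anisotropic power-law spectra can separate the two methods.
```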

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes stochastic spectral optimizers including SignSVD (approximating Muon) and SignSGD (proxy for Adam) on a high-dimensional matrix-valued least-squares problem. It derives explicit deterministic dynamics for error evolution, shows that SignSVD performs square-root preconditioning for large batches (reverting to SGD-like behavior for small eigenmodes at small batches) while SignSGD performs no preconditioning, and uses a power-law covariance model with data exponent α and target exponent β to identify three phases in the (α, β) plane: one uniformly favoring SignSGD, one uniformly favoring SignSVD, and one with a performance trade-off.

Significance. If the deterministic approximations and phase classification hold, the work supplies a tractable theoretical framework for predicting when spectral methods outperform simpler sign-based optimizers as a function of spectral anisotropy, with explicit scaling exponents and falsifiable phase boundaries. The derivation of closed-form deterministic dynamics and the power-law analysis constitute clear strengths that enable direct comparison of convergence rates.

major comments (2)
  1. [Deterministic dynamics derivation] The derivation of deterministic dynamics (via expectation or large-batch limits) for the sign nonlinearity does not address whether higher moments or finite-batch fluctuations alter the effective convergence rates for small eigenmodes; this directly affects the location and existence of the trade-off phase in the (α, β) plane.
  2. [Power-law covariance model analysis] The power-law analysis classifies phases by comparing asymptotic scaling exponents obtained from the deterministic rates; without an explicit check that the mean trajectory governs typical sample paths under stochastic sign operations (especially when batch size is small), the trichotomy claim rests on an unverified approximation.
minor comments (2)
  1. [Abstract] The abstract states that the two methods 'match up to a constant factor with isotropic data' but does not quantify the constant or the precise isotropic limit.
  2. [Power-law model] Notation for the data exponent α and target exponent β should be introduced with their precise definitions before the phase-plane analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and valuable feedback on our analysis of stochastic spectral optimizers. We address the two major comments point by point below, clarifying the scope of our deterministic approximations while acknowledging their limitations.

read point-by-point responses
  1. Referee: The derivation of deterministic dynamics (via expectation or large-batch limits) for the sign nonlinearity does not address whether higher moments or finite-batch fluctuations alter the effective convergence rates for small eigenmodes; this directly affects the location and existence of the trade-off phase in the (α, β) plane.

    Authors: Our derivation obtains deterministic dynamics by computing the expectation of the sign update (or taking the large-batch limit), which produces explicit per-eigenmode error recursions. In the manuscript we already note that small eigenmodes revert to SGD-like behavior under small batches because the sign operation on low-magnitude signals becomes effectively stochastic. We agree, however, that higher moments and finite-batch fluctuations are not analyzed in detail and could modify the effective rates, thereby shifting the trade-off phase boundaries. We will add a new subsection discussing the validity regime of the mean-field approximation, including a qualitative argument on when fluctuations remain negligible (large batch or sufficiently strong signal-to-noise per mode) and explicitly stating that a full stochastic approximation theory lies beyond the present scope. revision: partial

  2. Referee: The power-law analysis classifies phases by comparing asymptotic scaling exponents obtained from the deterministic rates; without an explicit check that the mean trajectory governs typical sample paths under stochastic sign operations (especially when batch size is small), the trichotomy claim rests on an unverified approximation.

    Authors: The three phases in the (α, β) plane are obtained by comparing the asymptotic scaling exponents that follow directly from the deterministic rates under the assumed power-law spectra. This yields explicit, falsifiable boundaries. We concur that the classification assumes the mean trajectory is representative of typical paths and that this assumption is least secure for small batches and small eigenmodes, where sign stochasticity can produce larger deviations. We will revise the power-law section to include a short paragraph that (i) states the mean-field nature of the derivation, (ii) identifies the parameter regimes (batch size relative to eigenvalue magnitude) where the approximation is expected to be accurate, and (iii) notes that concentration or large-deviation analysis of the stochastic sign process is left for future work. revision: partial
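
A toy version of the check discussed in the second response: simulate many independent sample paths of a scalar stochastic sign update at several batch sizes and compare the path-to-path spread of the risk to its mean. The scalar noisy least-squares model, step size, and batch sizes below are illustrative assumptions, not the paper's setting.

```python
import numpy as np

def sign_sgd_paths(batch, noise=1.0, n_paths=500, n_steps=300, lr=0.05, theta0=2.0, seed=0):
    """Excess-risk trajectories of scalar sign-SGD on a noisy least-squares problem."""
    rng = np.random.default_rng(seed)
    theta = np.full(n_paths, theta0)                       # true parameter is 0
    risks = np.empty((n_steps, n_paths))
    for t in range(n_steps):
        x = rng.standard_normal((n_paths, batch))
        y = noise * rng.standard_normal((n_paths, batch))  # labels: x * 0 + noise
        grad = np.mean(x * (x * theta[:, None] - y), axis=1)
        theta = theta - lr * np.sign(grad)                 # stochastic sign update
        risks[t] = 0.5 * theta ** 2                        # excess risk per path
    return risks

for batch in (1, 8, 64, 512):
    r = sign_sgd_paths(batch)
    print(f"B={batch:4d}: mean final risk {r[-1].mean():.4f}, "
          f"path-to-path std {r[-1].std():.4f}")
```

If the spread stays small relative to the mean as the batch shrinks, a mean-trajectory description is at least plausible in this toy; a large spread is exactly the regime the referee worries about.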

Circularity Check

0 steps flagged

No circularity: phases derived from explicit deterministic dynamics on posited power-law model

full rationale

The paper posits a high-dimensional matrix least-squares problem with power-law spectra as a proxy model, derives deterministic dynamics for SignSVD and SignSGD error evolution (via expectation/large-batch limits), obtains asymptotic convergence rates as functions of exponents α and β, and partitions the (α,β) plane by comparing those rates. This chain is self-contained first-principles analysis within the model; no parameter is fitted to the target data, no prediction is renamed from a fit, and no load-bearing step reduces to a self-citation or self-definition. The trichotomy is a direct consequence of the derived scalings, not tautological with the inputs.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The analysis rests on a high-dimensional least-squares model, deterministic ODE approximation of stochastic updates, and power-law spectra; no new physical entities are introduced.

free parameters (2)
  • α
    Exponent controlling the decay of the data covariance spectrum in the power-law model.
  • β
    Exponent controlling the decay of the target solution spectrum in the power-law model.
axioms (2)
  • domain assumption Stochastic gradient updates can be replaced by deterministic differential equations in the high-dimensional limit
    Invoked to obtain tractable dynamics for SignSVD and SignSGD.
  • domain assumption Power-law spectra adequately capture the anisotropy present in real training data
    Used to partition the (α, β) plane into performance phases.

pith-pipeline@v0.9.0 · 5550 in / 1413 out tokens · 40000 ms · 2026-05-12T04:01:20.377577+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

90 extracted references · 90 canonical work pages · 4 internal anchors

  1. [1] Ben Adlam and Jeffrey Pennington. Understanding Double Descent Requires A Fine-Grained Bias-Variance Decomposition. In Advances in Neural Information Processing Systems (NeurIPS), volume 33, pages 11022–11032, 2020.
  2. [2] Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The Polar Express: Optimal Matrix Sign Methods and Their Application to the Muon Algorithm. In International Conference on Learning Representations (ICLR), 2026.
  3. [3] Zhidong Bai and Jack W. Silverstein. Spectral Analysis of Large Dimensional Random Matrices, volume 20. Springer, 2010.
  4. [4] Lukas Balles and Philipp Hennig. Dissecting Adam: The Sign, Magnitude and Variance of Stochastic Gradients. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 404–413. PMLR, 2018.
  5. [5] Lukas Balles, Fabian Pedregosa, and Nicolas Le Roux. The Geometry of Sign Gradient Descent. In International Conference on Learning Representations (ICLR), pages 1–24, 2020.
  6. [6] Gérard Ben Arous, Murat A. Erdogdu, Nuri Mert Vural, and Denny Wu. Learning quadratic neural networks in high dimensions: SGD dynamics and scaling laws. arXiv preprint arXiv:2508.03688, 2025.
  7. [7] Gérard Ben Arous, Reza Gheissari, and Aukosh Jagannath. High-dimensional limit theorems for SGD: Effective dynamics and critical scaling. Advances in Neural Information Processing Systems (NeurIPS), 35:25349–25362, 2022.
  8. [8] Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024.
  9. [9] Jeremy Bernstein, Yu-Xiang Wang, Kamyar Azizzadenesheli, and Animashree Anandkumar. signSGD: Compressed optimisation for non-convex problems. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 560–569. PMLR, 2018.
  10. [10] Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. A Dynamical Model of Neural Scaling Laws. In Proceedings of the 41st International Conference on Machine Learning (ICML), volume 235 of Proceedings of Machine Learning Research, pages 4345–4382. PMLR, 2024.
  11. [11] Blake Bordelon, Alexander Atanasov, and Cengiz Pehlevan. How Feature Learning Can Improve Neural Scaling Laws. International Conference on Learning Representations (ICLR), 2025.
  12. [12] Andrea Caponnetto and Ernesto De Vito. Optimal rates for the regularized least-squares algorithm. Foundations of Computational Mathematics, 7:331–368, 2007.
  13. [13] David Carlson, Volkan Cevher, and Lawrence Carin. Stochastic spectral descent for restricted Boltzmann machines. In Artificial Intelligence and Statistics, pages 111–119. PMLR, 2015.
  14. [14] Luigi Carratino, Alessandro Rudi, and Lorenzo Rosasco. Learning with SGD and random features. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.
  15. [15] Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. Transactions on Machine Learning Research (TMLR), 2026.
  16. [16] Jeremy M. Cohen, Behrooz Ghorbani, Shankar Krishnan, Naman Agarwal, Sourabh Medapati, Michal Badura, Daniel Suo, David Cardoze, Zachary Nado, George E. Dahl, et al. Adaptive gradient methods at the edge of stability. arXiv preprint arXiv:2207.14484, 2022.
  17. [17] Jeremy M. Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. In International Conference on Learning Representations (ICLR), pages 1–80, 2021.
  18. [18] Elizabeth Collins-Woodfin, Courtney Paquette, Elliot Paquette, and Inbar Seroussi. Hitting the high-dimensional notes: an ODE for SGD learning dynamics on GLMs and multi-index models. Inf. Inference, 13(4): Paper No. iaae028, 107, 2024.
  19. [19] Elizabeth Collins-Woodfin, Inbar Seroussi, Begoña García Malaxechebarría, Andrew W. Mackenzie, Elliot Paquette, and Courtney Paquette. The High Line: Exact Risk and Learning Rate Curves of Stochastic Adaptive Learning Rate Algorithms. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, 2024.
  20. [20] Romain Couillet and Zhenyu Liao. Random Matrix Methods for Machine Learning. Cambridge University Press, 2022.
  21. [21] Damek Davis and Dmitriy Drusvyatskiy. When do spectral gradient updates help in deep learning? arXiv preprint arXiv:2512.04299, 2025.
  22. [22] DeepMind, Igor Babuschkin, Kate Baumli, Alison Bell, Surya Bhupatiraju, Jake Bruce, Peter Buchlovsky, David Budden, Trevor Cai, Aidan Clark, Ivo Danihelka, Antoine Dedieu, Claudio Fantacci, Jonathan Godwin, Chris Jones, Ross Hemsley, Tom Hennigan, Matteo Hessel, Shaobo Hou, Steven Kapturowski, Thomas Keck, Iurii Kemaev, Michael King, Markus Kunesch, Lena ... The DeepMind JAX Ecosystem.
  23. [23] DeepSeek-AI. DeepSeek-V4: Towards highly efficient million-token context intelligence, 2026.
  24. [24] Aymeric Dieuleveut and Francis Bach. Nonparametric stochastic approximation with large step-sizes. The Annals of Statistics, 44(4):1363–1399, 2016.
  25. [25] Zhehang Du and Weijie Su. The Newton-Muon Optimizer. arXiv preprint arXiv:2604.01472, 2026.
  26. [26] Chen Fan, Mark Schmidt, and Christos Thrampoulidis. Implicit Bias of Spectral Descent and Muon on Multiclass Separable Data. arXiv preprint arXiv:2502.04664, 2025.
  27. [27] Damien Ferbach, Katie Everett, Gauthier Gidel, Elliot Paquette, and Courtney Paquette. Dimension-adapted Momentum Outscales SGD. arXiv preprint arXiv:2505.16098, 2025.
  28. [28] Reza Gheissari and Aukosh Jagannath. Universality of high-dimensional scaling limits of stochastic gradient descent. arXiv preprint arXiv:2512.13634, 2025.
  29. [29] Antoine Gonon, Andreea-Alexandra Muşat, and Nicolas Boumal. Insights on Muon from Simple Quadratics. arXiv preprint arXiv:2602.11948, 2026.
  30. [30] Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In Proceedings of the 35th International Conference on Machine Learning (ICML), volume 80 of Proceedings of Machine Learning Research, pages 1842–1850. PMLR, 2018.
  31. [31] David L. Hanson and F. T. Wright. A bound on tail probabilities for quadratic forms in independent random variables. The Annals of Mathematical Statistics, 42(3):1079–1083, 1971.
  32. [32] Hong Hu and Yue M. Lu. Universality laws for high-dimensional learning with random features. IEEE Transactions on Information Theory, 69(3):1932–1964, 2022.
  33. [33] Aukosh Jagannath, Taj Jones-McCormick, and Varnan Sarangian. High-dimensional limit theorems for SGD: Momentum and Adaptive Step-sizes. arXiv preprint arXiv:2511.03952, 2025.
  34. [34] Ruichen Jiang, Zakaria Mhammedi, Mehryar Mohri, and Aryan Mokhtari. Adaptive Matrix Online Learning through Smoothing with Guarantees for Nonsmooth Nonconvex Optimization. arXiv preprint arXiv:2602.08232, 2026.
  35. [35] Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. URL https://kellerjordan.github.io/posts/muon, 2024.
  36. [36] Gyu Yeol Kim and Min-hwan Oh. Convergence of Muon with Newton-Schulz. In International Conference on Learning Representations (ICLR), pages 1–29, 2026.
  37. [37] Jihwan Kim, Dogyoon Song, and Chulhee Yun. Scaling Laws of SignSGD in Linear Regression: When Does It Outperform SGD? In International Conference on Learning Representations (ICLR), pages 1–89, 2026.
  38. [38] Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, and Jason D. Lee. Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory. arXiv preprint arXiv:2603.26554, 2026.
  39. [39] Kiwon Lee, Andrew Cheng, Elliot Paquette, and Courtney Paquette. Trajectory of mini-batch momentum: batch size saturation and convergence in high dimensions. Advances in Neural Information Processing Systems (NeurIPS), 35:36944–36957, 2022.
  40. [40] Binghui Li, Kaifei Wang, Han Zhong, Pinyan Lu, and Liwei Wang. Muon in Associative Memory Learning: Training Dynamics and Scaling Laws. arXiv preprint arXiv:2602.05725, 2026.
  41. [41] Xuheng Li, Yihe Deng, Jingfeng Wu, Dongruo Zhou, and Quanquan Gu. Risk bounds of accelerated SGD for overparameterized linear regression. In International Conference on Learning Representations (ICLR), pages 1–69, 2024.
  42. [42] Licong Lin, Jingfeng Wu, Sham M. Kakade, Peter L. Bartlett, and Jason D. Lee. Scaling Laws in Linear Regression: Compute, Parameters, and Data. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, pages 60556–60606, 2024.
  43. [43] Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is Scalable for LLM Training. arXiv preprint arXiv:2502.16982, 2025.
  44. [44] Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, and Tuo Zhao. Cosmos: A Hybrid Adaptive Optimizer for Efficient Training of Large Language Models. In International Conference on Learning Representations (ICLR), 2026.
  45. [45] Chao Ma, Wenbo Gong, Meyer Scetbon, and Edward Meeds. Swan: SGD with Normalization and Whitening Enables Stateless LLM Training. In Proceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 of Proceedings of Machine Learning Research, pages 41907–41942. PMLR, 2025.
  46. [46] Noah Marshall, Ke Liang Xiao, Atish Agarwala, and Elliot Paquette. To Clip or not to Clip: the Dynamics of SGD with Gradient Clipping in High-dimensions. In International Conference on Learning Representations (ICLR), pages 1–37, 2025.
  47. [47] James Martens and Roger Grosse. Optimizing neural networks with Kronecker-factored approximate curvature. In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37 of Proceedings of Machine Learning Research, pages 2408–2417. PMLR, 2015.
  48. [48] Song Mei and Andrea Montanari. The generalization error of random features regression: Precise asymptotics and the double descent curve. Communications on Pure and Applied Mathematics, 75(4):667–766, 2022.
  49. [49] James A. Mingo and Roland Speicher. Free Probability and Random Matrices, volume 35 of Fields Institute Monographs. Springer New York, 2017.
  50. [50] Andrea Montanari and Zihao Wang. Phase transitions for feature learning in neural networks. arXiv preprint arXiv:2602.01434, 2026.
  51. [51] Yuri Nesterov. Introductory Lectures on Convex Optimization. Springer, 2004.
  52. [52] Courtney Paquette, Kiwon Lee, Fabian Pedregosa, and Elliot Paquette. SGD in the Large: Average-case Analysis, Asymptotics, and Stepsize Criticality. In Proceedings of Thirty Fourth Conference on Learning Theory (COLT), volume 134, pages 3548–3626, 2021.
  53. [53] Courtney Paquette and Elliot Paquette. Dynamics of Stochastic Momentum Methods on Large-scale, Quadratic Models. Advances in Neural Information Processing Systems (NeurIPS), 34:9229–9240, 2021.
  54. [54] Courtney Paquette, Elliot Paquette, Ben Adlam, and Jeffrey Pennington. Homogenization of SGD in high-dimensions: exact dynamics and generalization properties. Mathematical Programming, pages 1–90, 2024.
  55. [55] Elliot Paquette, Courtney Paquette, Lechao Xiao, and Jeffrey Pennington. 4+3 Phases of Compute-Optimal Neural Scaling Laws. In Advances in Neural Information Processing Systems (NeurIPS), volume 37, 2024.
  56. [56] J. Pennington and P. Worah. Nonlinear random matrix theory for deep learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
  57. [57] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training Deep Learning Models with Norm-Constrained LMOs. In Proceedings of the 42nd International Conference on Machine Learning (ICML), volume 267 of Proceedings of Machine Learning Research, pages 49069–49104. PMLR, 2025.
  58. [58] Valentin V. Petrov. Sums of Independent Random Variables. Springer Berlin Heidelberg, 1975.
  59. [59] Gabriel Peyré. Muon Dynamics as a Spectral Wasserstein Flow. arXiv preprint arXiv:2604.04891, 2026.
  60. [60] Loucas Pillaud-Vivien, Alessandro Rudi, and Francis Bach. Statistical optimality of stochastic gradient descent on hard learning problems through multiple passes. Advances in Neural Information Processing Systems (NeurIPS), 31, 2018.
  61. [61] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4, 1964.
  62. [62] Shikai Qiu, Lechao Xiao, Andrew Gordon Wilson, Jeffrey Pennington, and Atish Agarwala. Scaling collapse reveals universal dynamics in compute-optimally trained neural networks. In International Conference on Machine Learning, pages 50697–50720. PMLR, 2025.
  63. [63] Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richtárik. Gluon: Making Muon & Scion Great Again! (Bridging Theory and Practice of LMO-based Optimizers for LLMs). arXiv preprint arXiv:2505.13416, 2025.
  64. [64] Ishaan Shah, Anthony M. Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J. Shah, et al. Practical Efficiency of Muon for Pretraining. arXiv preprint arXiv:2505.02222, 2025.
  65. [65] Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the Convergence Analysis of Muon. arXiv preprint arXiv:2505.23737, 2025.
  66. [66] Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A Distributed Data-Parallel PyTorch Implementation of the Distributed Shampoo Optimizer for Training Neural Networks At-Scale. arXiv preprint arXiv:2309.06497, 2023.
  67. [67] Egor Shulgin, Sultan AlRashed, Francesco Orabona, and Peter Richtárik. Beyond the Ideal: Analyzing the Inexact Muon Update. arXiv preprint arXiv:2510.19933, 2025.
  68. [68] James B. Simon, Dhruva Karkada, Nikhil Ghosh, and Mikhail Belkin. More is better in modern machine learning: when infinite overparameterization is optimal and overfitting is obligatory. In International Conference on Learning Representations (ICLR), pages 1–40, 2024.
  69. [69] Weijie Su. Isotropic Curvature Model for Understanding Deep Learning Optimization: Is Gradient Orthogonalization Optimal? arXiv preprint arXiv:2511.00674, 2025.
  70. [70] Nikolaos Tsilivis, Eitan Gronich, Julia Kempe, and Gal Vardi. Flavors of Margin: Implicit Bias of Steepest Descent in Homogeneous Neural Networks. In International Conference on Learning Representations (ICLR), pages 1–27, 2025.
  71. [71] Mark Tuddenham, Adam Prügel-Bennett, and Jonathan Hare. Orthogonalising gradients to speed up neural network optimisation. In International Conference on Learning Representations (ICLR), pages 1–15, 2022.
  72. [72] Aditya Varre and Nicolas Flammarion. Accelerated SGD for non-strongly-convex least squares. In Proceedings of Thirty Fifth Conference on Learning Theory (COLT), volume 135 of Proceedings of Machine Learning Research, pages 2062–2126, 2022.
  73. [73] Bhavya Vasudeva, Puneesh Deora, Yize Zhao, Vatsal Sharan, and Christos Thrampoulidis. How Muon's Spectral Design Benefits Generalization: A Study on Imbalanced Data. In International Conference on Learning Representations (ICLR), pages 1–36, 2026.
  74. [74] R. Vershynin. High-Dimensional Probability: An Introduction with Applications in Data Science. Cambridge University Press, 2018.
  75. [75] Nikhil Vyas, Depen Morwani, Rosie Zhao, Mujin Kwun, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham Kakade. SOAP: Improving and Stabilizing Shampoo Using Adam. In International Conference on Learning Representations (ICLR), pages 1–22, 2025.
  76. [76] Guangyuan Wang, Elliot Paquette, and Atish Agarwala. High-dimensional isotropic scaling dynamics of Muon and SGD. In OPT 2025: Optimization for Machine Learning, 2025.
  77. [77] Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent Y. F. Tan. Muon Outperforms Adam in Tail-End Associative Memory Learning. In International Conference on Learning Representations (ICLR), pages 1–38, 2026.
  78. [78] Alexander Wei, Wei Hu, and Jacob Steinhardt. More than a toy: Random matrix models predict how real-world neural representations generalize. In International Conference on Machine Learning (ICML), 2022.
  79. [79] Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic Pretraining Optimizers and Where to Find Them. In International Conference on Learning Representations (ICLR), pages 1–107, 2026.
  80. [80] Ke Liang Xiao, Noah Marshall, Atish Agarwala, and Elliot Paquette. Exact risk curves of signSGD in High-Dimensions: quantifying preconditioning and noise-compression effects. In International Conference on Learning Representations (ICLR), pages 1–48, 2025.

Showing first 80 references.