Denoise First, Orthogonalize Later: Understanding Momentum in Muon via Spectral Filtering

Han Bao; Weiyang Liu; Xianliang Li; Zihan Zhang

arxiv: 2606.03899 · v2 · pith:CZ7WYBMTnew · submitted 2026-06-02 · 💻 cs.LG

Xianliang Li, Zihan Zhang, Weiyang Liu, Han Bao

Pith reviewed 2026-06-28 11:22 UTC · model grok-4.3

classification 💻 cs.LG

keywords Muon optimizer · momentum · spectral filtering · orthogonalization · gradient model · singular subspaces · LLM training

The pith

Momentum in Muon acts as a spectral filter that enlarges the gap between signal and perturbations before orthogonalization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that momentum serves as a spectral filter in the Muon optimizer. Under a signal-plus-perturbation model for gradients, momentum reduces perturbations while keeping the main signal, which increases the separation between them. This separation stabilizes the singular subspaces used in the orthogonalization, leading to more reliable updates. The analysis proves that performing momentum before orthogonalization gives better signal alignment than the reverse order or no momentum. Supporting experiments are provided on tasks including large language model pretraining.

Core claim

Under a structured signal-plus-perturbation gradient model, momentum suppresses perturbations while preserving the dominant signal, thereby enlarging the spectral gap between them. This enlarged gap stabilizes the singular subspaces of the matrix passed to Muon's orthogonalization step, making the resulting update more reliable. We further show that applying momentum before orthogonalization achieves provably stronger alignment with the signal component of the gradient than either reversing this order or simply removing momentum.
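The filtering effect can be illustrated with a minimal NumPy sketch. It rests on assumptions that are not taken from the paper: a planted rank-1 signal of strength 10, independent Gaussian noise at every step, and a plain exponential moving average as the momentum buffer. The helper names (`ema_momentum`, `gap_ratio`) and all sizes are hypothetical. The sketch compares the ratio of the first to the second singular value for a single raw gradient and for the momentum buffer.

```python
import numpy as np


def ema_momentum(grads, beta):
    """Exponential moving average of a list of gradient matrices (zero init)."""
    buf = np.zeros_like(grads[0])
    for g in grads:
        buf = beta * buf + (1.0 - beta) * g
    return buf


def gap_ratio(mat):
    """sigma_1 / sigma_2 of a matrix; larger means a clearer rank-1 signal."""
    s = np.linalg.svd(mat, compute_uv=False)
    return s[0] / s[1]


rng = np.random.default_rng(0)
dim, steps, beta = 50, 200, 0.95  # illustrative sizes, not the paper's

# Hypothetical planted rank-1 signal with unit singular vectors u, v.
u = rng.standard_normal((dim, 1))
u /= np.linalg.norm(u)
v = rng.standard_normal((dim, 1))
v /= np.linalg.norm(v)
signal = 10.0 * u @ v.T

# Assumption: i.i.d. standard Gaussian perturbation at every step.
grads = [signal + rng.standard_normal((dim, dim)) for _ in range(steps)]

raw_ratio = gap_ratio(grads[-1])
filtered_ratio = gap_ratio(ema_momentum(grads, beta))
print(f"raw sigma1/sigma2      = {raw_ratio:.2f}")
print(f"filtered sigma1/sigma2 = {filtered_ratio:.2f}")
```

In this toy setting the ratio comes out larger for the momentum buffer than for the raw gradient, which is the gap enlargement the claim describes. The paper's structured perturbation model is more general than i.i.d. noise, so this is only a sanity check of the intuition.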

What carries the argument

Momentum as a spectral filter that enlarges the gap between dominant signal and perturbations in the gradient matrix before the orthogonalization step.
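A back-of-the-envelope calculation makes the filtering concrete. It uses a simplifying assumption that is not taken from the paper: per-step gradients are $G_t = S + E_t$ with a fixed signal $S$ and perturbations $E_t$ that are zero-mean, independent across steps, and have common entrywise variance $\sigma^2$. The momentum buffer is assumed to be a zero-initialized exponential moving average with coefficient $\beta$, and $T = 1/(1-\beta)$ is the momentum window used in the figure captions.

```latex
\begin{aligned}
M_K &= (1-\beta)\sum_{t=1}^{K}\beta^{K-t} G_t
     = (1-\beta^{K})\,S + (1-\beta)\sum_{t=1}^{K}\beta^{K-t} E_t, \\
\operatorname{Var}\!\big[(M_K)_{ij}\big]
    &= \sigma^2 (1-\beta)^2 \sum_{t=1}^{K}\beta^{2(K-t)}
     = \sigma^2\,\frac{(1-\beta)\,(1-\beta^{2K})}{1+\beta}
     \;\xrightarrow{\,K\to\infty\,}\; \frac{\sigma^2}{2T-1},
\qquad T = \frac{1}{1-\beta}.
\end{aligned}
```

Under these assumptions the signal passes through almost unchanged, while the per-entry perturbation standard deviation shrinks by a factor of $\sqrt{2T-1}$ (for $\beta = 0.95$, $T = 20$, about $\sqrt{39} \approx 6.2$). The same $2T-1$ quantity appears in the reference curves of the figure captions; the exponents used there come from the paper's own analysis and are not derived here.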

If this is right

Momentum before orthogonalization yields stronger alignment with the signal component of the gradient.
The orthogonalized update becomes more reliable due to stabilized singular subspaces.
This mechanism offers an explanation for the performance gains observed when Muon is run with momentum.
The theory offers a starting point for understanding the benefits of momentum in other matrix-based optimizers.
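The ordering claim in the list above can be sketched on a toy problem. Everything here is an assumption made for illustration: the planted rank-1 signal, i.i.d. Gaussian noise, an SVD-based polar factor standing in for Muon's orthogonalization step, and the reading of the three pipelines as Pre-polar (orthogonalize the momentum buffer), Post-polar (average the orthogonalized per-step gradients), and Polar-only (orthogonalize the latest gradient). The alignment score follows the normalized form given in the Figure 26 caption; function names are hypothetical.

```python
import numpy as np


def polar_factor(mat):
    """Orthogonal polar factor U V^T computed from a thin SVD."""
    u, _, vt = np.linalg.svd(mat, full_matrices=False)
    return u @ vt


def ema(mats, beta):
    """Zero-initialized exponential moving average of a list of matrices."""
    acc = np.zeros_like(mats[0])
    for x in mats:
        acc = beta * acc + (1.0 - beta) * x
    return acc


def subspace_alignment(update, u_true, v_true):
    """||U^T A V||_F / sqrt(r) against a planted rank-r subspace pair."""
    r = u_true.shape[1]
    return np.linalg.norm(u_true.T @ update @ v_true) / np.sqrt(r)


rng = np.random.default_rng(0)
dim, steps, beta = 100, 200, 0.95  # illustrative sizes, not the paper's

# Hypothetical planted rank-1 signal; noise is i.i.d. standard Gaussian.
u_true = np.linalg.qr(rng.standard_normal((dim, 1)))[0]
v_true = np.linalg.qr(rng.standard_normal((dim, 1)))[0]
signal = 5.0 * u_true @ v_true.T
grads = [signal + rng.standard_normal((dim, dim)) for _ in range(steps)]

updates = {
    "pre_polar": polar_factor(ema(grads, beta)),               # momentum, then polar
    "post_polar": ema([polar_factor(g) for g in grads], beta),  # polar, then momentum
    "polar_only": polar_factor(grads[-1]),                      # no momentum
}
scores = {k: subspace_alignment(a, u_true, v_true) for k, a in updates.items()}
for name, value in scores.items():
    print(f"{name:>10}: {value:.3f}")
```

On this toy problem the Pre-polar score is the highest of the three. That is consistent with the ordering result, though it is a single synthetic configuration and not a substitute for the paper's proof or probes.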

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar filtering benefits might appear in other optimizers that combine momentum with matrix decompositions.
The signal-plus-perturbation model could be used to derive optimal momentum coefficients for specific gradient structures.
Testing on gradients without clear spectral gaps would clarify the model's applicability.

Load-bearing premise

The gradient follows a structured signal-plus-perturbation model in which momentum enlarges the spectral gap.

What would settle it

Measure singular-subspace alignment and update reliability in a setting where the gradient lacks a dominant signal component clearly separated from perturbations, and check whether momentum still improves performance.
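One way such a measurement could be set up is sketched below. The subspace error is taken to be the sine of the largest principal angle between two subspaces; whether the paper uses this norm or another one for its sin Θ plots is not stated here, so treat that choice as an assumption. The function names are hypothetical, and the planted singular values {12, 8, 5} are borrowed from the Figure 7 caption purely as an example.

```python
import numpy as np


def top_singular_subspaces(mat, rank):
    """Orthonormal bases of the top-`rank` left and right singular subspaces."""
    u, _, vt = np.linalg.svd(mat, full_matrices=False)
    return u[:, :rank], vt[:rank].T


def sin_theta(basis_a, basis_b):
    """Sine of the largest principal angle between two column spaces.

    Both inputs must have orthonormal columns and the same shape.
    """
    cosines = np.linalg.svd(basis_a.T @ basis_b, compute_uv=False)
    cosines = np.clip(cosines, 0.0, 1.0)
    return float(np.sqrt(1.0 - cosines.min() ** 2))


rng = np.random.default_rng(0)

# Hypothetical planted rank-3 signal (20 x 20) plus small Gaussian noise.
signal = np.diag([12.0, 8.0, 5.0] + [0.0] * 17)
noisy = signal + 0.1 * rng.standard_normal(signal.shape)

u_true, v_true = top_singular_subspaces(signal, rank=3)
u_hat, v_hat = top_singular_subspaces(noisy, rank=3)

err_u = sin_theta(u_true, u_hat)
err_v = sin_theta(v_true, v_hat)
print(f"sin Theta_U = {err_u:.3f}, sin Theta_V = {err_v:.3f}")
```

To probe the premise, the same two functions could be applied to a momentum buffer and to a raw gradient in a regime with no clear spectral gap, comparing the errors as the momentum coefficient varies.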

Figures

Figures reproduced from arXiv: 2606.03899 by Han Bao, Weiyang Liu, Xianliang Li, Zihan Zhang.

**Figure 1.** End-to-end validation loss comparisons across (a) NanoGPT training and (b) LLaMA 350M training. The Muon Pre-polar pipeline outperforms the Post-polar and Polar-only pipelines. The full experimental settings are in Appendix F.4.

**Figure 2.** Spectral filtering visualization. (a) Filtered momentum singular value spectra (blue), the raw gradient spectrum (grey), and the mean-gradient spectrum (dashed orange) on layer h.0. (b) Per-step filtering ratio on h.0. (c) Noise-suppression ratio $R(T)$ on each of the layers h.0, h.5, and h.11 ($K = 500$) versus momentum window size $T = 1/(1-\beta)$, with the dashed $(2T-1)^{1/4}$ floor.

**Figure 3.** Stationary probe subspace alignment error $\sin\Theta_U$ and $\sin\Theta_V$ at ranks $r \in \{1, 5, 10\}$ versus momentum window size $T = 1/(1-\beta)$, with the dashed $c_r (2T-1)^{-1/4}$ guide ($c_r$ fitted independently per panel).

**Figure 4.** Stationary probe signal alignment for the three pipelines defined in equations (1)–(3) ($K = 500$). (a) β-sweep at the step-3000 checkpoint of the full-rank signal alignment. (b) Rank-5 signal alignment at β = 0.95 across the five checkpoints. (c) Full-rank signal alignment at β = 0.95 across the same five checkpoints. The reference $\bar{G} = K^{-1} \sum_t G_t$ replaces $G^{\mathrm{sig}}$ on real gradients.

**Figure 5.** Trajectory probe subspace alignment errors $\sin\Theta_U$, $\sin\Theta_V$ at ranks $r \in \{1, 5, 10\}$ versus the momentum window size $T = 1/(1-\beta)$, with the β grid restricted to β ≤ 0.95 ($T \le K/2$). Curves show seed means. Shaded bands show sample standard deviation across seeds.

**Figure 6.** Trajectory probe signal alignment for the three pipelines defined in equations (1)–(3) ($K = 50$, 3-seed mean). (a) β-sweep at the step-3000 checkpoint of the full-rank signal alignment. (b) Full-rank signal alignment at β = 0.95 across training steps. Curves show seed means. Shaded bands show sample standard deviation across seeds.

**Figure 7.** Synthetic stationary filtered singular value spectra under the rank-3 spiked model ($m = n = 100$, $\sigma_n = 1$, $K = 1000$, 10-trial mean) with a BVMZOS perturbation. (Left) The rank-3 spiked model with Gaussian noise. (Right) The rank-3 spiked model with heavy-tailed Student-t noise. The black diamonds at $k = 1, 2, 3$ mark the planted signal singular values $\sigma_k \in \{12, 8, 5\}$. The experimental details are des…

**Figure 8.** CIFAR-10 stationary probe at layer2.0.conv1 (128 × 576), warmup step 500, $K = 2000$. Index range $k \in \{3, \dots, 40\}$. Curve and reference conventions follow figures 2a and 2b.

**Figure 9.** Noise-suppression ratio $R(T)$ (Appendix F.3) on (a) synthetic rank-3 spiked gradients ($m = n = 100$, $\sigma_n = 1$, 1000 steps, 10 trials) under a BVMZOS perturbation and (b) CIFAR-10 stationary gradients at layer2.0.conv1. The noise-suppression ratio $R(T)$ in the synthetic simulation uses the planted signal $G_t^{\mathrm{sig}}$ in place of $\bar{G}$ with a zero-init bias correction (Appendix F.3). Dashed line: $(2T-1)^{1/4}$ floor.

**Figure 10.** Stationary NanoGPT filtered singular value spectra over attention output projections h.0, h.5, h.11 (rows) and training checkpoints 1000, 2000, 3000, 4000, and 5000 (columns), $K = 500$. Mean-gradient spectrum $\sigma_k(\bar{G})$ shown in dashed orange. Axes are shared across all fifteen cells.

**Figure 11.** Stationary NanoGPT per-step filtering ratio $\mathrm{Filt}_k(\beta) = \sigma_k(M_K^{(\beta)}) / \sigma_k(G_K)$ over the same (layer, step) grid as figure 10. Dashed reference at $y = 1$. Axes are shared across all fifteen cells.

**Figure 12.** Stationary NanoGPT noise-suppression ratio $R(T)$ (Appendix F.3) at training checkpoints (a) step 1000 and (b) step 5000. Three attention output projections per panel. Dashed line: $(2T-1)^{1/4}$ floor.

**Figure 13.** Subspace alignment error on the synthetic rank-3 spiked model under a BVMZOS perturbation. Panels report ranks $r \in \{1, 2, 3\}$. $\sin\Theta_U$ (blue) and $\sin\Theta_V$ (orange) are computed against the planted top-$r$ singular subspace $(U_{\mathrm{true}}, V_{\mathrm{true}})$. Dashed line: fitted $c_r (2T-1)^{-1/4}$ guide. Shaded bands: ±1 trial standard deviation across 10 random-seed trials.

**Figure 14.** CIFAR-10 stationary subspace alignment error on layer2.0.conv1 (128 × 576), warmup step 500, $K = 2000$, at ranks $r \in \{1, 5, 10\}$. Curve and reference conventions follow figure 3.

**Figure 15.** Stationary NanoGPT subspace alignment error on attention output projections h.5.attn.c_proj (top) and h.11.attn.c_proj (bottom) at checkpoint step 3000, $K = 500$, at ranks $r \in \{1, 5, 10\}$ (columns). The representative h.0.attn.c_proj panel is in the main text as figure 3.

**Figure 16.** CIFAR-10 trajectory subspace alignment error at training step 1500 on layer2.0.conv1, six-seed mean (seeds 42–47, $K = 100$), at ranks $r \in \{1, 5, 10\}$. Solid lines: six-seed mean. Shaded bands: sample standard deviation across seeds.

**Figure 17.** NanoGPT trajectory subspace alignment error on attention output projections h.5.attn.c_proj (top) and h.11.attn.c_proj (bottom) at training step 3000, trajectory buffer $K = 50$, at ranks $r \in \{1, 5, 10\}$ (columns). 3-seed mean with ±1 sample-standard-deviation bands.

**Figure 18.** NanoGPT trajectory subspace alignment error on MLP output projections h.0.mlp.c_proj (top), h.5.mlp.c_proj (middle), h.11.mlp.c_proj (rows) at training step 3000, trajectory buffer $K = 50$, at ranks $r \in \{1, 5, 10\}$ (columns). Same plotting conventions as figure 17.

**Figure 19.** Synthetic signal alignment versus momentum coefficient β on the rank-3 spiked model ($m = n = 100$, $\sigma_n = 1$, $K = 1000$, 10 random-seed trials) under a BVMZOS perturbation. Panels report ranks $r \in \{1, 2, 3\}$ against the planted top-$r$ singular subspace $(U_{\mathrm{true}}, V_{\mathrm{true}})$. Curve and reference conventions follow figure 4. Shaded bands show trial standard deviation across the 10 trials.

**Figure 20.** CIFAR-10 stationary signal alignment on layer2.0.conv1 (128 × 576), warmup step 500, $K = 2000$, at ranks $r \in \{1, 5, 10\}$ and the full-rank signal alignment. Curve and reference conventions follow figure 4.

**Figure 21.** Stationary NanoGPT signal alignment over attention output projections h.0 (top), h.5 (middle), h.11 (rows) and ranks $r \in \{1, 5, 10\}$ plus full-rank signal alignment (columns) at checkpoint step 3000, $K = 500$. Curve and reference conventions follow figure 4. All twelve cells share the same 8-point β grid {0.5, 0.7, 0.8, 0.9, 0.93, 0.95, 0.97, 0.99}.

**Figure 22.** CIFAR-10 trajectory signal alignment versus β at training step 1500, $K = 100$, on layer2.0.conv1, at the rank-5 (left) and full-rank alignment (right) panels. Curve and reference conventions follow figure 4.

**Figure 23.** CIFAR-10 trajectory ordering history on layer2.0.conv1 at β = 0.95 (trajectory buffer $K = 100$, analysis interval $I = 100$, 15 checkpoints over training steps 100–1500). Curve and reference conventions follow figure 4.

**Figure 24.** All-layer NanoGPT trajectory full-rank signal alignment across every attention and MLP output projection at training step 3000, $K = 50$, β = 0.95, aggregated over three seeds (1337, 1338, 1339). Curve and reference conventions follow figure 4.

**Figure 25.** NanoGPT trajectory signal alignment over training on attention output projections h.0, h.5, and h.11 at $K = 50$, β = 0.95, three-seed mean with sample standard deviation bands across seeds. Curve and reference conventions follow figure 4.

**Figure 26.** Synthetic rank-3 subspace alignment $\|U_3^\top A V_3\|_F / \sqrt{3}$ against the planted $(U_{\mathrm{true}}, V_{\mathrm{true}})$ versus signal strength λ, on the rank-3 spiked model shared with figure 19. Pre-polar (blue squares, $O(M_K^{(\beta)})$), Post-polar (red diamonds, $\widetilde{M}_K^{(\beta)}$), Polar-only (green dashed, $O(G_K)$). Bands are the standard deviation across 10 random-seed trials at β = 0.95.

**Figure 27.** CIFAR-10 stationary signal alignment versus mini-batch size on layer2.0.conv1 (128 × 576) of a ResNet-18, warmup step 500, $K = 200$ per probe, β = 0.95. Pre-polar (blue squares, $O(M_K^{(\beta)})$), Post-polar (red diamonds, $\widetilde{M}_K^{(\beta)}$), Polar-only (green dashed, $O(G_K)$). Panels are rank-1, rank-5, rank-10, and full-rank alignment against $\bar{G}$.

**Figure 28.** NanoGPT stationary signal alignment versus mini-batch size on h.0.attn.c_proj (768 × 768) at the step-3000 checkpoint, $K = 500$ per probe, β = 0.95. Pre-polar (blue squares, $O(M_K^{(\beta)})$), Post-polar (red diamonds, $\widetilde{M}_K^{(\beta)}$), Polar-only (green dashed, $O(G_K)$). Panels are rank-1, rank-5, rank-10, and full-rank alignment against $\bar{G}$.

Original abstract

Muon has recently demonstrated strong empirical performance in large language model training, but the theoretical role of momentum in Muon remains unclear. Existing analyses of Muon either remove momentum to study spectral updates in isolation, or retain momentum without explaining why it improves empirical performance. Our work bridges this gap by showing momentum in Muon acts as a spectral filter. Under a structured signal-plus-perturbation gradient model, we prove that momentum suppresses perturbations while preserving the dominant signal, thereby enlarging the spectral gap between them. This enlarged gap stabilizes the singular subspaces of the matrix passed to Muon's orthogonalization step, making the resulting update more reliable. We further show that applying momentum before orthogonalization achieves provably stronger alignment with the signal component of the gradient than either reversing this order or simply removing momentum. Experiments across diverse tasks, including LLM pretraining, support our theoretical analysis. More broadly, our theory offers a starting point for understanding the benefits of momentum in other matrix-based optimizers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows momentum before orthogonalization in Muon enlarges the spectral gap under a signal-plus-perturbation model, with the ordering result as the concrete new claim.

The key point is that this work explains momentum in Muon as a spectral filter that suppresses noise in the gradient before the orthogonalization step, leading to more stable updates. They prove this under a signal-plus-perturbation model and show the specific order matters for alignment.

What is new is the comparison of ordering: momentum first gives better signal preservation than orthogonalizing first or skipping momentum. The derivation uses the enlarged spectral gap to bound subspace perturbations via standard tools like Davis-Kahan. Experiments on diverse tasks including LLM pretraining provide supporting evidence that matches the theory.

The analysis is internally consistent because the model is stated clearly and the results follow from it. The experiments add credibility by testing on real training scenarios.

One limitation is the dependence on the assumed gradient structure. Real gradients might deviate, so the benefits could vary. But since the paper frames it as an explanation within that model rather than a universal claim, it avoids overreach. No major issues with the citation pattern or reproducibility from what's described.

This is aimed at the optimization community working on second-order or matrix methods for large models. Readers interested in Muon or similar optimizers will find the spectral view useful. The combination of theory and experiments makes it worth a full review.

I recommend sending it to peer review.

Referee Report

0 major / 2 minor

Summary. The paper claims that under an explicitly stated structured signal-plus-perturbation model for gradients, momentum in Muon functions as a spectral low-pass filter: it suppresses perturbation components while preserving the dominant signal, thereby enlarging the gap between their singular values. This gap enlargement is shown to stabilize the singular subspaces passed to Muon's orthogonalization step (via Davis-Kahan-type bounds), and applying momentum before orthogonalization is proved to yield stronger signal alignment than the reverse order or momentum-free updates. Experiments on diverse tasks including LLM pretraining are reported as supporting evidence.

Significance. If the gradient model is realistic, the work supplies a conditional but rigorous theoretical account of momentum's benefit in Muon, filling a gap left by prior analyses that either omit momentum or retain it without explanation. The explicit model definition, direct derivation of the filtering effect and ordering claim, and supporting experiments constitute clear strengths; the analysis is internally consistent on its stated terms. The potential circularity concern raised in the stress-test note does not land, as the model is presented as the scope of the proof rather than a hidden premise tuned post hoc.

minor comments (2)

[Abstract] Abstract, final sentence: the phrase 'starting point for understanding the benefits of momentum in other matrix-based optimizers' would benefit from a brief forward reference to the discussion section where this extension is sketched.
[§4] §4 (Experiments): Table 1 caption could explicitly note the number of random seeds used for the reported means and standard deviations to improve clarity of statistical reliability.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending acceptance. The referee's summary accurately reflects the paper's claims, model assumptions, and experimental support. No major comments were raised that require point-by-point rebuttal.

Circularity Check

0 steps flagged

No significant circularity: conditional proof under explicit model

The central derivation is a conditional proof that momentum enlarges the spectral gap under an explicitly stated signal-plus-perturbation gradient model, followed by standard Davis-Kahan subspace perturbation bounds. The model is introduced as an assumption defining the scope of the analysis rather than being fitted to data or defined in terms of the desired filtering outcome. No load-bearing steps reduce to self-citation chains, fitted parameters renamed as predictions, or ansatzes smuggled via prior work. Experiments on LLM pretraining supply independent empirical checks outside the model. The argument is therefore self-contained on its stated terms and does not collapse to its inputs by construction.
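For orientation, the subspace-stability step has roughly the following shape. This is a generic schematic of Davis-Kahan/Wedin-type bounds with constants and the exact gap condition suppressed; it is not the paper's theorem. The symbols are assumptions introduced here: $M = S + E$ is the matrix passed to orthogonalization, $S$ is a rank-$r$ signal, $E$ is the perturbation, and $U_r$, $\widehat{U}_r$ are the top-$r$ left singular subspaces of $S$ and $M$.

```latex
\big\|\sin\Theta(\widehat{U}_r, U_r)\big\|
\;\lesssim\;
\frac{\|E\|}{\sigma_r(S) - \sigma_{r+1}(S)}
\;=\;
\frac{\|E\|}{\sigma_r(S)}
\qquad \text{when } \operatorname{rank}(S) = r .
```

Read this way, any step that shrinks $\|E\|$ while leaving $\sigma_r(S)$ roughly intact tightens the bound, which is the role the analysis assigns to momentum.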

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The load-bearing premise is the structured signal-plus-perturbation gradient model invoked to prove the filtering property; no free parameters or new entities are named in the abstract.

axioms (1)

domain assumption: The gradient admits a structured decomposition into a dominant signal plus a perturbation, such that momentum enlarges their spectral gap.
This decomposition is the setting in which the proof is carried out (abstract).

Reference graph

Works this paper leans on

79 extracted references · 22 canonical work pages · 9 internal anchors

[1] N. Amsel, D. Persson, C. Musco, and R. M. Gower. The polar express: Optimal matrix sign methods and their application to the Muon algorithm. In Proceedings of the 14th International Conference on Learning Representations, 2026. (cited on p. 18)

[2] J. Ba, M. A. Erdogdu, T. Suzuki, Z. Wang, D. Wu, and G. Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. Advances in Neural Information Processing Systems, 35:37932–37946, 2022. (cited on pp. 4 and 19)

[3] J. Bochnak, M. Coste, and M.-F. Roy. Real Algebraic Geometry, volume 36 of Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge / A Series of Modern Surveys in Mathematics. Springer-Verlag, Berlin, 1998. doi: 10.1007/978-3-662-03718-8. (cited on p. 24)

[4] V. Boreiko, Z. Bu, and S. Zha. Towards understanding orthogonalization in Muon. In Proceedings of the 3rd Workshop on Efficient Systems for Foundation Models, 2025. (cited on p. 18)

[5] G. Braun, H. Bao, W. Huang, and M. Imaizumi. Spectral gradient descent mitigates anisotropy-driven misalignment: A case study in phase retrieval. arXiv preprint arXiv:2601.22652, 2026. (cited on pp. 4 and 18)

[6] D. Busbridge, J. Ramapuram, P. Ablin, T. Likhomanenko, E. G. Dhekane, X. Suau Cuadros, and R. Webb. How to scale your EMA. Advances in Neural Information Processing Systems, 36:73122–73174, 2023. (cited on p. 19)

[7] D. Carlson, V. Cevher, and L. Carin. Stochastic spectral descent for restricted Boltzmann machines. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 111–119. PMLR, 2015. (cited on p. 18)

[8] D. Carlson, E. Collins, Y.-P. Hsieh, L. Carin, and V. Cevher. Preconditioned spectral descent for deep learning. Advances in Neural Information Processing Systems, 28:2971–2979, 2015. (cited on p. 18)

[9] L. Chen, J. Li, and Q. Liu. Muon optimizes under spectral norm constraints. Transactions on Machine Learning Research, 2026. (cited on pp. 2, 6, and 18)

[10] Y. Chikuse. Statistics on Special Manifolds, volume 174. Springer Science & Business Media, 2003. (cited on p. 27)

[11] A. Cutkosky and H. Mehta. Momentum improves normalized SGD. In Proceedings of the 37th International Conference on Machine Learning, pages 2260–2268. PMLR, 2020. (cited on pp. 1 and 18)

[12] D. Davis and D. Drusvyatskiy. When do spectral gradient updates help in deep learning? arXiv preprint arXiv:2512.04299, 2025. (cited on pp. 2 and 18)

[13] A. Defazio. Momentum via primal averaging: Theoretical insights and learning rate schedules for non-convex optimization. arXiv preprint arXiv:2010.00406, 2020. (cited on p. 18)

[14] S. Deng, Z. Ouyang, T. Pang, Z. Liu, R. Jin, S. Yu, and Y. Yang. RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization. arXiv preprint arXiv:2603.20527, 2026. (cited on p. 18)

[15] C. Fan, M. Schmidt, and C. Thrampoulidis. Implicit bias of spectral descent and Muon on multiclass separable data. Advances in Neural Information Processing Systems, 38:39622–39669, 2025. (cited on pp. 2, 6, and 18)

[16] E. S. Gardner Jr. Exponential smoothing: The state of the art. Journal of Forecasting, 4(1):1–28, 1985. (cited on pp. 4 and 19)

[18] B. Ghorbani, S. Krishnan, and Y. Xiao. An investigation into neural net optimization via Hessian eigenvalue density. In Proceedings of the 36th International Conference on Machine Learning, pages 2232–2241. PMLR, 2019. (cited on p. 19)

[19] N. Ghosh, D. Wu, and A. Bietti. Understanding the mechanisms of fast hyperparameter transfer. In Proceedings of the 14th International Conference on Learning Representations, 2026. (cited on p. 4)

[20] G. Goh. Why momentum really works. Distill, 2017. doi: 10.23915/distill.00006. (cited on p. 18)

[21] V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In Proceedings of the 35th International Conference on Machine Learning, pages 1842–1850. PMLR, 2018. (cited on pp. 1 and 18)

[23] G. Gur-Ari, D. A. Roberts, and E. Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018. (cited on pp. 4 and 19)

[24] C. He, Z. Deng, and Z. Lu. Low-rank orthogonalization for large-scale matrix optimization with applications to foundation model training. arXiv preprint arXiv:2509.11983, 2025. (cited on p. 18)

[25] N. J. Higham. Functions of Matrices: Theory and Computation. SIAM, 2008. (cited on p. 28)

[26] K. Jordan, Y. Jin, V. Boza, J. You, F. Cesista, L. Newhouse, and J. Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/. (cited on pp. 1, 2, 3, and 18)

[27] J. Kim, E. Nichani, D. Wu, A. Bietti, and J. D. Lee. Sharp capacity scaling of spectral optimizers in learning associative memory. arXiv preprint arXiv:2603.26554, 2026. (cited on pp. 2 and 18)

[28] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015. (cited on p. 1)

[29] D. Kovalev. Understanding gradient orthogonalization for deep learning via non-Euclidean trust-region optimization. arXiv preprint arXiv:2503.12645, 2025. (cited on pp. 2, 6, and 18)

[30] B. Li, K. Wang, H. Zhong, P. Lu, and L. Wang. Muon in associative memory learning: Training dynamics and scaling laws. arXiv preprint arXiv:2602.05725, 2026. (cited on pp. 2 and 18)

[31] X. Li, J. Luo, Z. Zheng, H. Wang, L. Luo, L. Wen, L. Wu, and S. Xu. On the performance analysis of momentum method: A frequency domain perspective. In Proceedings of the 13th International Conference on Learning Representations, 2025. (cited on pp. 2 and 18)

[32] J. Liu, J. Su, X. Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, Y. Chen, H. Zheng, Y. Liu, S. Liu, B. Yin, W. He, H. Zhu, Y. Wang, J. Wang, M. Dong, Z. Zhang, Y. Kang, H. Zhang, X. Xu, Y. Zhang, Y. Wu, X. Zhou, and Z. Yang. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025. (cited on pp. 1 and 18)

[33] W. Liu, R. Lin, Z. Liu, J. M. Rehg, L. Paull, L. Xiong, L. Song, and A. Weller. Orthogonal over-parameterized training. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7251–7260, 2021. (cited on p. 18)

[34] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations, 2019. (cited on p. 1)

[35] J. Ma, Y. Huang, Y. Chi, and Y. Chen. Preconditioning benefits of spectral orthogonalization in Muon. arXiv preprint arXiv:2601.13474, 2026. (cited on pp. 2 and 18)

[36] A. Mousavi-Hosseini, D. Wu, T. Suzuki, and M. A. Erdogdu. Gradient-based feature learning under structured data. Advances in Neural Information Processing Systems, 36:71449–71485, 2023. (cited on pp. 4 and 19)

[37] Y. Nesterov. A method for solving the convex programming problem with convergence rate O(1/k^2). Soviet Mathematics Doklady, 27(2):372–376, 1983. (cited on pp. 1 and 18)

[38] G. Penedo, H. Kydlíček, A. Lozhkov, M. Mitchell, C. Raffel, L. Von Werra, and T. Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale. Advances in Neural Information Processing Systems, 37:30811–30849, 2024. (cited on p. 30)

[39] T. Pethick, W. Xie, K. Antonakopoulos, Z. Zhu, A. Silveti-Falls, and V. Cevher. Training deep learning models with norm-constrained LMOs. In Proceedings of the 42nd International Conference on Machine Learning, pages 49069–49104, 2025. (cited on pp. 1 and 18)

[40] B. T. Polyak. Some methods of speeding up the convergence of iteration methods. USSR Computational Mathematics and Mathematical Physics, 4(5):1–17, 1964. (cited on pp. 1 and 18)

[41]

Z. Qiu, S. Buchholz, T. Z. Xiao, M. Dax, B. Schölkopf, and W. Liu. Reparameterized LLM training via orthogonal equivalence transformation.Advances in Neural Information Processing Systems, 38: 140775–140821, 2025. (cited on p. 18)

2025
[42]

Z. Qiu, L. Liu, A. Weller, H. Shi, and W. Liu. POET-X: Memory-efficient LLM training by scaling orthogonaltransformation.InProceedingsofthe43rdInternationalConferenceonMachineLearning. PMLR, 2026. (cited on p. 18)

2026
[43]

Raffel, N

C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, and P. J. Liu. Exploring the limits of transfer learning with a unified text-to-text transformer.Journal of Machine Learning Research, 21(140):1–67, 2020. (cited on p. 30)

2020
[44]

Riabinin, E

A. Riabinin, E. Shulgin, K. Gruntkowska, and P. Richtárik. Gluon: Making Muon & Scion great again! (bridging theory and practice of LMO-based optimizers for LLMs). InProceedings of the 3rd High-dimensional Learning Dynamics, 2025. (cited on p. 18)

2025
[45] L. Sagun, U. Evci, V. U. Guney, Y. Dauphin, and L. Bottou. Empirical analysis of the Hessian of over-parametrized neural networks. arXiv preprint arXiv:1706.04454, 2017. (cited on p. 19)

[46] A. Semenov, M. Pagliardini, and M. Jaggi. Benchmarking optimizers for large language model pretraining. arXiv preprint arXiv:2509.01440, 2025. (cited on p. 1)

[47] K. Shi, H. Li, Z. Qiu, Y. Wen, S. Buchholz, and W. Liu. Pion: A spectrum-preserving optimizer via orthogonal equivalence transformation. arXiv preprint arXiv:2605.12492, 2026. (cited on p. 18)

[48] E. Shulgin, S. AlRashed, P. Richtárik, and F. Orabona. Beyond the ideal: Analyzing the inexact Muon update. In Proceedings of the 29th International Conference on Artificial Intelligence and Statistics, 2026. (cited on pp. 2, 6, and 18)

[49] U. Simsekli, L. Sagun, and M. Gurbuzbalaban. A tail-index analysis of stochastic gradient noise in deep neural networks. In Proceedings of the 36th International Conference on Machine Learning, pages 5827–5837. PMLR, 2019. (cited on p. 19)

[50] S. Smith, E. Elsen, and S. De. On the generalization benefit of noise in stochastic gradient descent. In Proceedings of the 37th International Conference on Machine Learning, pages 9058–9067. PMLR, 2020.

[51] J. Su, M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024. (cited on p. 37)

[52] W. Su. Isotropic curvature model for understanding deep learning optimization: Is gradient orthogonalization optimal? arXiv preprint arXiv:2511.00674, 2025. (cited on pp. 2 and 18)

[53] I. Sutskever, J. Martens, G. Dahl, and G. Hinton. On the importance of initialization and momentum in deep learning. In Proceedings of the 30th International Conference on Machine Learning, pages 1139–1147. PMLR, 2013. (cited on pp. 1, 4, 18, and 19)

[54] H. Touvron, T. Lavril, G. Izacard, X. Martinet, M. Lachaux, T. Lacroix, B. Rozière, N. Goyal, E. Hambro, F. Azhar, A. Rodriguez, A. Joulin, E. Grave, and G. Lample. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. (cited on p. 30)

[55] M. Tuddenham, A. Prügel-Bennett, and J. Hare. Orthogonalising gradients to speed up neural network optimisation. arXiv preprint arXiv:2202.07052, 2022. (cited on pp. 2 and 18)

[56] B. Vasudeva, P. Deora, Y. Zhao, V. Sharan, and C. Thrampoulidis. How Muon’s spectral design benefits generalization: A study on imbalanced data. arXiv preprint arXiv:2510.22980, 2025. (cited on p. 18)

[57] R. Vershynin. High-dimensional Probability: An Introduction with Applications in Data Science, volume 47. Cambridge University Press, 2018. (cited on pp. 28 and 29)

[58] N. Vyas, D. Morwani, R. Zhao, I. Shapira, D. Brandfonbrener, L. Janson, and S. M. Kakade. SOAP: Improving and stabilizing Shampoo using Adam for language modeling. In Proceedings of the 13th International Conference on Learning Representations, 2025. (cited on pp. 1 and 18)

[59] R. Wang, S. Malladi, T. Wang, K. Lyu, and Z. Li. The marginal value of momentum for small learning rate SGD. In Proceedings of the 12th International Conference on Learning Representations, 2024. (cited on p. 2)

[60] S. Wang, F. Zhang, J. Li, C. Du, C. Du, T. Pang, Z. Yang, M. Hong, and V. Y. Tan. Muon outperforms Adam in tail-end associative memory learning. arXiv preprint arXiv:2509.26030, 2025. (cited on pp. 2 and 18)

[61] P.-Å. Wedin. Perturbation bounds in connection with singular value decomposition. BIT Numerical Mathematics, 12(1):99–111, 1972. (cited on pp. 7 and 23)

[62] K. Wen, D. Hall, T. Ma, and P. Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046, 2025. (cited on pp. 1 and 21)

[63] Z. Zhu, J. Wu, B. Yu, L. Wu, and J. Ma. The anisotropic noise in stochastic gradient descent: Its behavior of escaping from sharp minima and regularization effects. In Proceedings of the 36th International Conference on Machine Learning, pages 7654–7663. PMLR, 2019. (cited on p. 19)

Appendix Table of Contents: A Related Work; B Setup Conventions and V...

... prove that momentum eliminates the large-batch requirement of normalized SGD, and Defazio [13] reformulates SGD with momentum as primal averaging to obtain sharper non-convex convergence bounds. Inspired by signal processing theory, Li et al. [29] interpret momentum in the frequency domain as a low-pass filter that amplifies low-frequency gradient component...

Load the model from a saved checkpoint and hold every weight fixed, including the target weight $W$. No weight is updated during gradient collection.

For $t = 1, \dots, K$, draw one mini-batch from the dataloader in its natural sequential order (the dataloader’s default ordering without shuffling), run one step of forward and backward propagation, and record the gradient $G_t$ of the target weight $W$. We refer to this as the sequential collection order, which is the default setting used throughout the paper. Fo...

Save the gradient sequence $\{G_t\}_{t=1}^{K}$ to disk in the order the mini-batches were drawn. The collection order is preserved because the downstream momentum buffers are order-dependent. Since the model weights do not change during gradient collection, every $G_t$ is drawn from the same gradient distribution. This protocol therefore synthetically simulates the stat...

Maintain a fixed-capacity First-In, First-Out (FIFO) buffer of the $K$ most recent gradients of the target weight alongside the regular optimizer step. Each training step appends $G_t$ to the buffer and pops the oldest entry once the buffer is full.

Every $I$ training steps (we used $I = 100$ in all CIFAR-10 and NanoGPT trajectory runs), take a checkpoint of the current buffer $\{G_{t-K+1}, \dots, G_t\}$, run the Analysis procedure below, and save the resulting summary.

Continue training without modification. The probe does not alter weight updates, optimizer state, random seeds, or data ordering. The trajectory buffer therefore represents a sliding buffer over the live training trajectory, and the corresponding analysis quantities inherit any non-stationarity in the gradient stream.

Buffer-size selection. The buffer si...

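
The trajectory probe described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the class name `GradientProbe`, the `analyze` callback, and the choice to skip analysis until the buffer is full are all assumptions made for the example.

```python
from collections import deque

import numpy as np


class GradientProbe:
    """Hypothetical read-only probe: keeps the K most recent gradients in a FIFO buffer."""

    def __init__(self, capacity, interval, analyze):
        # deque(maxlen=K) drops the oldest entry automatically once K items are stored.
        self.buffer = deque(maxlen=capacity)
        self.interval = interval      # analysis period I, in training steps
        self.analyze = analyze        # callback taking the list [G_{t-K+1}, ..., G_t]
        self.summaries = []

    def observe(self, step, grad):
        # Copy so later in-place changes by the optimizer cannot alter the record.
        self.buffer.append(np.array(grad, copy=True))
        # Assumption: analysis runs only on a full buffer, every `interval` steps.
        if len(self.buffer) == self.buffer.maxlen and step % self.interval == 0:
            self.summaries.append((step, self.analyze(list(self.buffer))))
```

The probe only reads gradients, so calling `observe` after each backward pass leaves the optimizer state and the data ordering untouched, in line with the description above.
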
Signal reference. Compute the mean gradient $\bar{G} := K^{-1}\sum_{t=1}^{K} G_t$ and its exact SVD $\bar{G} = U\Sigma V^{\top}$. The top-$r$ left and right singular vectors $(U_r, V_r)$ define the signal subspace used as the alignment target on the CIFAR-10 and NanoGPT experiments. On the synthetic simulation the spiked model of Appendix F.4.1 plants a known rank-$r^{\star}$ signal $G^{\mathrm{sig}}_t$ with singular bases Utr...

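Three pipelines. For each $\beta$ in the per-task grid (Appendix F.4), starting from $M_0 = \widetilde{M}_0 = 0$, the Pre-polar and Post-polar pipelines maintain two separate momentum buffers:

$$\text{Pre-polar:}\quad \mathcal{O}\big(M^{(\beta)}_K\big), \quad \text{where } M^{(\beta)}_K := (1-\beta)\sum_{s=0}^{K-1} \beta^{s}\, G_{K-s}, \tag{20}$$

$$\text{Post-polar:}\quad \widetilde{M}^{(\beta)}_K := (1-\beta)\sum_{s=0}^{K-1} \beta^{s}\, \mathcal{O}(G_{K-s}), \tag{21}$$

$$\text{Polar-only:}\quad \mathcal{O}(G_K), \tag{22}$$

where $\mathcal{O}(\cdot)$ is the polar factor introduced in equation ...
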
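
As a hedged illustration of equations (20) to (22), the sketch below builds the three quantities with NumPy. It assumes the polar factor is computed as $UV^{\top}$ from an exact thin SVD; the helper names `polar`, `ema`, and `three_pipelines` are hypothetical and not taken from the paper.

```python
import numpy as np


def polar(G):
    """Polar factor U V^T from a thin SVD (assumed form of O(.) for this sketch)."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt


def ema(mats, beta):
    """Return (1 - beta) * sum_{s=0}^{K-1} beta^s * mats[K-1-s], built from a zero buffer."""
    M = np.zeros_like(mats[0])
    for X in mats:                      # mats must be in collection order
        M = beta * M + (1.0 - beta) * X
    return M


def three_pipelines(grads, beta):
    pre_polar = polar(ema(grads, beta))                  # eq. (20): filter, then orthogonalize
    post_polar = ema([polar(G) for G in grads], beta)    # eq. (21): orthogonalize, then filter
    polar_only = polar(grads[-1])                        # eq. (22): no momentum
    return pre_polar, post_polar, polar_only
```

With `beta = 0` the recursion keeps only the last gradient, so all three outputs coincide; this is a convenient sanity check when wiring the pipelines into an analysis script.
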
Spectral summaries. Record (i) the singular-value sequences of $\bar{G}$ and of the Pre-polar momentum buffer $M^{(\beta)}_K$, (ii) the per-step filtering ratio $\sigma_k(M^{(\beta)}_K)/\sigma_k(G_K)$ at the final collection index, and (iii) the noise-suppression ratio $R(T)$ that compares the operator norm of the raw-gradient residual with that of the momentum residual at momentum window size T = 1/(... The two ratios serve different purposes and use different denominators.

Signal alignment and subspace alignment error metrics. Signal alignment is reported with the rank-$r$ and full-rank signal alignment metrics $\mathrm{Align}_r$ and $\mathrm{Align}_{\mathrm{full}}$ of Appendix F.3. Subspace alignment error panels report the $\sin\Theta$ principal-angle distance at fixed ranks $r \in \{1, 5, 10\}$. All SVDs inside the Analysis procedure are exact float32 decompositions. Newton–Schulz ...

Per-step filtering ratio: the ratio of the $k$-th singular value of the Pre-polar momentum buffer $M^{(\beta)}_K$ (equation (20)) to that of the latest collected raw gradient $G_K$,

$$\mathrm{Filt}_k(\beta) := \frac{\sigma_k\big(M^{(\beta)}_K\big)}{\sigma_k(G_K)}.$$

Both spectra are computed from the same gradient buffer at the final collection step: $G_K$ is the last raw mini-batch gradient (equivalently, the momentum buffer at $\beta = 0$), and $M^{(\beta)}_K$ ...

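
A minimal sketch of the per-step filtering ratio follows, assuming the momentum buffer and the last raw gradient are already available as NumPy arrays. The function name `filtering_ratio` and the `eps` guard against zero singular values are assumptions added for the example.

```python
import numpy as np


def filtering_ratio(M, G_last, eps=1e-12):
    """Elementwise ratio sigma_k(M) / sigma_k(G_last) over all singular-value indices k."""
    s_M = np.linalg.svd(M, compute_uv=False)       # singular values, sorted descending
    s_G = np.linalg.svd(G_last, compute_uv=False)
    # eps is a hypothetical safeguard for rank-deficient G_last; it is not part of the definition.
    return s_M / np.maximum(s_G, eps)
```

Entries below one indicate singular directions whose magnitude the momentum buffer has shrunk relative to the raw gradient.
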
Noise-suppression ratio: the residual operator-norm ratio $R(T)$ of the raw gradient vs. the Pre-polar momentum buffer with probe-side momentum coefficient $\beta$ (associated with the effective sample size $2T-1$). Explicitly,

$$R(T) := \frac{\big\|G_K - \bar{G}\big\|_2}{\big\|M^{(\beta)}_K - \bar{G}\big\|_2}, \qquad \bar{G} := \frac{1}{K}\sum_{t=1}^{K} G_t,$$

where $\bar{G}$ is the in-buffer approximation of $G^{\mathrm{sig}}_t$. Subtracting $\bar{G}$ from both numerator and denominator ...

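
The ratio above can be computed directly from a stored gradient sequence. The sketch below is an illustration under stated assumptions: it takes $\beta$ rather than $T$ as input, rebuilds the momentum buffer from zero, and uses the hypothetical name `noise_suppression_ratio`.

```python
import numpy as np


def noise_suppression_ratio(grads, beta):
    """||G_K - G_bar||_2 / ||M_K - G_bar||_2 with spectral (operator) norms."""
    G_bar = np.mean(grads, axis=0)                 # in-buffer mean gradient
    M = np.zeros_like(grads[0])
    for G in grads:                                # momentum buffer started from zero
        M = beta * M + (1.0 - beta) * G
    num = np.linalg.norm(grads[-1] - G_bar, ord=2) # ord=2 on a matrix is the largest singular value
    den = np.linalg.norm(M - G_bar, ord=2)
    return num / den
```

At `beta = 0` the buffer equals the last gradient and the ratio is exactly one; values above one mean the momentum residual is smaller than the raw-gradient residual.
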
Subspace alignment error: the $\sin\theta$ subspace distance from the top-$r$ singular subspaces of the reference to those of the Pre-polar momentum buffer. Explicitly,

$$\sin\theta_r(A;B) := \big\|\sin\Theta\big(U_r(A), U_r(B)\big)\big\|_2,$$

with the right-subspace version $\sin\theta_r(A^{\top};B^{\top}) = \|\sin\Theta(V_r(A), V_r(B))\|_2$ reported separately. In this paper, we also define $\sin\Theta_U := \sin\theta_r(M^{(\beta)}_K; \bar{G})$ and $\sin\Theta_V$ :...

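
One standard way to evaluate this distance is through the cosines of the principal angles, which are the singular values of $U_r(A)^{\top} U_r(B)$; the spectral norm of $\sin\Theta$ is then the sine of the largest angle. The sketch below follows that route. The function names are hypothetical, and the approach is a common textbook computation rather than something confirmed by the source.

```python
import numpy as np


def top_left_subspace(A, r):
    """Orthonormal basis of the top-r left singular subspace of A."""
    U, _, _ = np.linalg.svd(A, full_matrices=False)
    return U[:, :r]


def sin_theta(A, B, r):
    """Spectral-norm sin-Theta distance between the top-r left singular subspaces."""
    Ua, Ub = top_left_subspace(A, r), top_left_subspace(B, r)
    # Singular values of Ua^T Ub are cosines of the principal angles.
    cosines = np.clip(np.linalg.svd(Ua.T @ Ub, compute_uv=False), 0.0, 1.0)
    # The largest angle has the smallest cosine.
    return float(np.sqrt(1.0 - cosines.min() ** 2))
```

The right-subspace version is obtained by passing transposes, `sin_theta(A.T, B.T, r)`. For very small angles the square root loses precision in floating point, so values near 1e-8 should be read as zero.
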
Signal alignment: the signal alignment comparison applied to Pre-polar $= \mathcal{O}(M^{(\beta)}_K)$, Post-polar $= \widetilde{M}^{(\beta)}_K$, and Polar-only $= \mathcal{O}(G_K)$, reported through the following two metrics. Rank-$r$ signal alignment:

$$\mathrm{Align}_r(A;B) := \frac{\big\|U_r(B)^{\top} A\, V_r(B)\big\|_F}{\sqrt{r}} \in [0,1].$$

Larger values indicate stronger signal alignment. Theorem 2 predicts that Pre-polar achieves higher $\mathrm{Align}_r$ than Post-p...
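
The rank-$r$ alignment metric can be evaluated as below. This is a sketch with a hypothetical function name, assuming `A` is one of the three pipeline outputs and `B` is the reference matrix whose top-$r$ singular vectors define the signal subspace.

```python
import numpy as np


def align_r(A, B, r):
    """||U_r(B)^T A V_r(B)||_F / sqrt(r), using the top-r singular vectors of B."""
    U, _, Vt = np.linalg.svd(B, full_matrices=False)
    Ur = U[:, :r]          # top-r left singular vectors of the reference
    Vr = Vt[:r, :].T       # top-r right singular vectors of the reference
    return float(np.linalg.norm(Ur.T @ A @ Vr, ord="fro") / np.sqrt(r))
```

When `A` is the polar factor of `B` itself, the projected block is the $r \times r$ identity and the metric equals one. The stated range $[0,1]$ relies on `A` having spectral norm at most one, which holds for a polar factor or an averaged sequence of polar factors.
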

Three attention output projections per panel. Dashed line: $(2T-1)^{1/4}$ floor.

[Plot residue: panels for Rank 1, Rank 2, and Rank 3 showing subspace alignment error ($\sin\Theta_U$, $\sin\Theta_V$) against momentum window $T$ on logarithmic axes.]

Figure 1...

[1] N. Amsel, D. Persson, C. Musco, and R. M. Gower. The polar express: Optimal matrix sign methods and their application to the Muon algorithm. In Proceedings of the 14th International Conference on Learning Representations, 2026. (cited on p. 18)

[2] J. Ba, M. A. Erdogdu, T. Suzuki, Z. Wang, D. Wu, and G. Yang. High-dimensional asymptotics of feature learning: How one gradient step improves the representation. Advances in Neural Information Processing Systems, 35:37932–37946, 2022. (cited on pp. 4 and 19)

[3] J. Bochnak, M. Coste, and M.-F. Roy. Real Algebraic Geometry, volume 36 of Ergebnisse der Mathematik und ihrer Grenzgebiete. 3. Folge / A Series of Modern Surveys in Mathematics. Springer-Verlag, Berlin, 1998. doi: 10.1007/978-3-662-03718-8. (cited on p. 24)

[4] V. Boreiko, Z. Bu, and S. Zha. Towards understanding orthogonalization in Muon. In Proceedings of the 3rd Workshop on Efficient Systems for Foundation Models, 2025. (cited on p. 18)

[5] G. Braun, H. Bao, W. Huang, and M. Imaizumi. Spectral gradient descent mitigates anisotropy-driven misalignment: A case study in phase retrieval. arXiv preprint arXiv:2601.22652, 2026. (cited on pp. 4 and 18)

[6] D. Busbridge, J. Ramapuram, P. Ablin, T. Likhomanenko, E. G. Dhekane, X. Suau Cuadros, and R. Webb. How to scale your EMA. Advances in Neural Information Processing Systems, 36:73122–73174, 2023. (cited on p. 19)

[7] D. Carlson, V. Cevher, and L. Carin. Stochastic spectral descent for restricted Boltzmann machines. In Proceedings of the 18th International Conference on Artificial Intelligence and Statistics, pages 111–119. PMLR, 2015. (cited on p. 18)

[8] D. Carlson, E. Collins, Y.-P. Hsieh, L. Carin, and V. Cevher. Preconditioned spectral descent for deep learning. Advances in Neural Information Processing Systems, 28:2971–2979, 2015. (cited on p. 18)

[9] L. Chen, J. Li, and Q. Liu. Muon optimizes under spectral norm constraints. Transactions on Machine Learning Research, 2026. (cited on pp. 2, 6, and 18)

[10] Y. Chikuse. Statistics on Special Manifolds, volume 174. Springer Science & Business Media, 2003. (cited on p. 27)

[11] A. Cutkosky and H. Mehta. Momentum improves normalized SGD. In Proceedings of the 37th International Conference on Machine Learning, pages 2260–2268. PMLR, 2020. (cited on pp. 1 and 18)

[12] D. Davis and D. Drusvyatskiy. When do spectral gradient updates help in deep learning? arXiv preprint arXiv:2512.04299, 2025. (cited on pp. 2 and 18)

[13] A. Defazio. Momentum via primal averaging: Theoretical insights and learning rate schedules for non-convex optimization. arXiv preprint arXiv:2010.00406, 2020. (cited on p. 18)

[14] S. Deng, Z. Ouyang, T. Pang, Z. Liu, R. Jin, S. Yu, and Y. Yang. RMNP: Row-momentum normalized preconditioning for scalable matrix-based optimization. arXiv preprint arXiv:2603.20527, 2026. (cited on p. 18)

[15] C. Fan, M. Schmidt, and C. Thrampoulidis. Implicit bias of spectral descent and Muon on multiclass separable data. Advances in Neural Information Processing Systems, 38:39622–39669, 2025. (cited on pp. 2, 6, and 18)

[16] E. S. Gardner Jr. Exponential smoothing: The state of the art. Journal of Forecasting, 4(1):1–28, 1985. (cited on pp. 4 and 19)

[18] B. Ghorbani, S. Krishnan, and Y. Xiao. An investigation into neural net optimization via Hessian eigenvalue density. In Proceedings of the 36th International Conference on Machine Learning, pages 2232–2241. PMLR, 2019. (cited on p. 19)

[19] N. Ghosh, D. Wu, and A. Bietti. Understanding the mechanisms of fast hyperparameter transfer. In Proceedings of the 14th International Conference on Learning Representations, 2026. (cited on p. 4)

[20] G. Goh. Why momentum really works. Distill, 2017. doi: 10.23915/distill.00006. (cited on p. 18)

[21] V. Gupta, T. Koren, and Y. Singer. Shampoo: Preconditioned stochastic tensor optimization. In Proceedings of the 35th International Conference on Machine Learning, pages 1842–1850. PMLR, 2018. (cited on pp. 1 and 18)

[23] G. Gur-Ari, D. A. Roberts, and E. Dyer. Gradient descent happens in a tiny subspace. arXiv preprint arXiv:1812.04754, 2018. (cited on pp. 4 and 19)

[24] C. He, Z. Deng, and Z. Lu. Low-rank orthogonalization for large-scale matrix optimization with applications to foundation model training. arXiv preprint arXiv:2509.11983, 2025. (cited on p. 18)

[25] N. J. Higham. Functions of Matrices: Theory and Computation. SIAM, 2008. (cited on p. 28)

[26] K. Jordan, Y. Jin, V. Boza, J. You, F. Cesista, L. Newhouse, and J. Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/. (cited on pp. 1, 2, 3, and 18)

[27] J. Kim, E. Nichani, D. Wu, A. Bietti, and J. D. Lee. Sharp capacity scaling of spectral optimizers in learning associative memory. arXiv preprint arXiv:2603.26554, 2026. (cited on pp. 2 and 18)

[28] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the 3rd International Conference on Learning Representations, 2015. (cited on p. 1)

[29] D. Kovalev. Understanding gradient orthogonalization for deep learning via non-Euclidean trust-region optimization. arXiv preprint arXiv:2503.12645, 2025. (cited on pp. 2, 6, and 18)

[30] B. Li, K. Wang, H. Zhong, P. Lu, and L. Wang. Muon in associative memory learning: Training dynamics and scaling laws. arXiv preprint arXiv:2602.05725, 2026. (cited on pp. 2 and 18)

[31] X. Li, J. Luo, Z. Zheng, H. Wang, L. Luo, L. Wen, L. Wu, and S. Xu. On the performance analysis of momentum method: A frequency domain perspective. In Proceedings of the 13th International Conference on Learning Representations, 2025. (cited on pp. 2 and 18)

[32] J. Liu, J. Su, X. Yao, Z. Jiang, G. Lai, Y. Du, Y. Qin, W. Xu, E. Lu, J. Yan, Y. Chen, H. Zheng, Y. Liu, S. Liu, B. Yin, W. He, H. Zhu, Y. Wang, J. Wang, M. Dong, Z. Zhang, Y. Kang, H. Zhang, X. Xu, Y. Zhang, Y. Wu, X. Zhou, and Z. Yang. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025. (cited on pp. 1 and 18)

[33] W. Liu, R. Lin, Z. Liu, J. M. Rehg, L. Paull, L. Xiong, L. Song, and A. Weller. Orthogonal over-parameterized training. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7251–7260, 2021. (cited on p. 18)

[34] I. Loshchilov and F. Hutter. Decoupled weight decay regularization. In Proceedings of the 7th International Conference on Learning Representations, 2019. (cited on p. 1)

[35] J. Ma, Y. Huang, Y. Chi, and Y. Chen. Preconditioning benefits of spectral orthogonalization in Muon. arXiv preprint arXiv:2601.13474, 2026. (cited on pp. 2 and 18)

[36] A. Mousavi-Hosseini, D. Wu, T. Suzuki, and M. A. Erdogdu. Gradient-based feature learning under structured data. Advances in Neural Information Processing Systems, 36:71449–71485, 2023. (cited on pp. 4 and 19)
