
arxiv: 2606.21514 · v1 · pith:SBV655U5new · submitted 2026-06-19 · 💻 cs.LG

Towards Understanding the Power and Limits of the Muon Optimizer: A River-Valley Perspective

Pith reviewed 2026-06-26 14:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords Muon optimizer · river-valley perspective · mixed-spiked matrix sensing · optimization trajectory · LLM training · gradient descent comparison · momentum methods

The pith

Muon moves faster early along the key direction but can oscillate and slow near the solution compared to gradient descent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a trajectory-level theory to explain why Muon shows mixed gains over Adam-like methods in large language model training. It models the loss landscape as a river-valley structure using a mixed-spiked matrix sensing problem whose operator splits into signal, spike, and bulk parts. Without momentum, Muon advances quicker along the information-bearing river direction at the start yet converges more slowly near the river bottom than plain gradient descent. With momentum on general nonconvex problems, the orthogonalized updates strip away residual scale information and produce overshooting plus oscillation close to the target. The analysis therefore recommends a two-stage strategy that switches to gradient-descent-style refinement for the final phase rather than relying on a single fixed schedule.

Core claim

In the momentum-free setting, Muon moves faster along the information-bearing river direction during early optimization, but can converge much more slowly near the river bottom than gradient descent. Extending to general nonconvex objectives with momentum, Muon's orthogonalized update removes residual scale information, making it prone to overshooting and oscillation near the target solution.
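The scale-removal mechanism in the claim above can be illustrated with a minimal sketch. It assumes the orthogonalized update is the polar factor of the gradient matrix, computed here with an exact SVD rather than the Newton-Schulz iteration that practical Muon implementations typically use. The function name `orthogonalize` and the random gradient are hypothetical and are not taken from the paper.

```python
import numpy as np

def orthogonalize(grad: np.ndarray) -> np.ndarray:
    """Polar factor of grad: keep singular vectors, set singular values to 1.

    Assumption: exact SVD stands in for the approximate orthogonalization
    used in practice.
    """
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 3))         # hypothetical gradient matrix

far_update = orthogonalize(grad)           # large gradient, far from the target
near_update = orthogonalize(1e-6 * grad)   # same direction, tiny residual

# The orthogonalized step is the same in both cases ...
updates_match = np.allclose(far_update, near_update, atol=1e-8)
# ... while a plain gradient step shrinks in proportion to the residual.
gd_ratio = np.linalg.norm(1e-6 * grad) / np.linalg.norm(grad)

print("orthogonalized updates match:", updates_match)
print("spectral norm of update:", np.linalg.norm(near_update, 2))
print("GD step ratio (near / far):", gd_ratio)
```

Because the update has unit spectral norm regardless of how small the gradient is, the step length near the solution is set by the learning rate alone, which is one way to read the overshooting behavior described in the claim.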

What carries the argument

The river-valley perspective that decomposes the landscape into a primary river direction toward the solution and perpendicular hill directions of nuisance information, constructed on the mixed-spiked matrix sensing model.

If this is right

  • Muon supplies early speed advantages in landscapes that contain both strong directional signal and long-tail bulk components.
  • Near the target, Muon's orthogonal updates tend to discard scale cues and produce overshooting.
  • Switching from Muon to a gradient-descent-style optimizer in the final phase can reduce oscillation and improve final accuracy.
  • The two-stage schedule outperforms a single fixed learning-rate schedule for Muon on language model tasks.
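The two-stage idea in the bullets above can be sketched on a toy problem. Everything here is an illustrative assumption rather than the paper's setup: the objective is a simple quadratic `0.5 * ||X - target||_F^2` instead of the mixed-spiked model, there is no momentum, plain gradient descent stands in for the GD-like refinement stage (the paper's LLM experiments switch to AdamW), and the names `run`, `lr_muon`, `lr_gd`, and `switch_at` are hypothetical.

```python
import numpy as np

def orthogonalize(grad: np.ndarray) -> np.ndarray:
    """Polar factor of grad via exact SVD (stand-in for Newton-Schulz)."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

def run(target, steps, switch_at, lr_muon=0.1, lr_gd=0.5):
    """Minimize 0.5 * ||X - target||_F^2 from X = 0.

    Steps before `switch_at` use the orthogonalized (Muon-style) update,
    later steps use plain gradient descent. Returns the error per step.
    """
    x = np.zeros_like(target)
    errors = []
    for k in range(steps):
        grad = x - target                      # gradient of the toy quadratic
        if k < switch_at:
            x = x - lr_muon * orthogonalize(grad)
        else:
            x = x - lr_gd * grad
        errors.append(np.linalg.norm(x - target))
    return errors

target = np.diag([1.03, 0.57])                 # hypothetical target matrix

muon_only = run(target, steps=40, switch_at=40)   # never switches
hybrid = run(target, steps=40, switch_at=20)      # switches halfway

print("final error, Muon-style only:", muon_only[-1])
print("final error, Muon-style then GD:", hybrid[-1])
```

In this toy setting the Muon-style run stalls at a distance on the order of `lr_muon` from the target, because each singular value of the residual moves by exactly `lr_muon` per step, while the run that switches to gradient descent keeps contracting the error geometrically. Whether the same picture holds in LLM training is what the paper's experiments probe.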

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same river-valley lens could be applied to other spectral-normalization optimizers to check whether they share the late-stage oscillation pattern.
  • A hybrid optimizer that keeps Muon's early directional speed while restoring scale information late could be designed and tested.
  • The mixed-spiked model offers a concrete testbed for measuring how much bulk-component strength is needed before Muon's disadvantage appears.

Load-bearing premise

The mixed-spiked matrix sensing model whose sensing operator decomposes into signal, spike, and bulk components captures the mixture of anisotropic structure and long-tail information in LLM training landscapes.

What would settle it

Direct measurement of convergence speed near the river bottom in the mixed-spiked model, or observation of oscillation amplitude in late-stage Muon runs during actual language model training.

Figures

Figures reproduced from arXiv: 2606.21514 by Jianhao Ma, Jiaye Teng, Jinji Yang, Runze Shi, Tianqi Shen, Ziye Ma.

Figure 1: A river-valley perspective on GD-like methods (e.g. momentum-GD, Adam), Muon, and their hybrid strategy. (a): GD-like methods explore the river slowly but achieve accurate final convergence. (b): The hybrid method combines fast early exploration with accurate late-stage refinement. (c): Muon explores rapidly along the river but remains more oscillatory near the end of training.
Figure 2: A simple motivating example. This 2D anisotropic spectral slice illustrates the main message of this paper: Muon is highly effective as an early-stage exploration optimizer, but GD/Adam-type dynamics are needed for stable late-stage refinement. (Details in Appendix C.1.1)
Footnote 2: M with superscript and subscript denotes momentum instead of the plain matrix used in Section 3.
Figure 3: Empirical evidence for a mixed-spiked covariance structure in image representations. Across datasets and architectures, the covariance spectrum of pre-classifier feature maps exhibits a pronounced bulk-signal-spike structure. A larger model (SwinV2-B [41]) and a larger dataset (STL-10 [42]) tend to yield more task-relevant eigen-directions, whereas a smaller model (ResNet-18 [43]) or a smaller dataset (CIFAR-10 [4…
Figure 4: LLM pre-training with a “Muon → AdamW” transition. We train with Muon for the first 1.5k iterations and switch to AdamW for the remaining 2.5k iterations. We conduct all the experiments in a language-model pre-training setting. Specifically, we train a 250M-parameter LLaMA-style [56] decoder-only Transformer from scratch, rather than fine-tuning a pretrained LLaMA checkpoint. We use the GPT-2 tokenizer [5…
Figure 5: 2D anisotropic spectral slice case: pure optimizers. Left: optimization trajectories of momentum-GD, Adam, and Muon under a fixed learning rate 0.037. The black arrows indicate the displacement every other iteration. Right: loss versus iteration for the three optimizers. We consider a simple two-dimensional anisotropic spectral landscape as a motivating example for the phenomena discussed in the main text.…
Figure 6: 2D anisotropic spectral slice case: hybrid optimizers. Left: optimization trajectories of the two hybrid schemes under a fixed learning rate 0.037. The black arrows indicate the displacement every other iteration. Right: loss versus iteration for the pure and hybrid optimizers. Overall, the pure-optimizer experiment is consistent with our intuition: Muon achieves rapid early-stage progress but poor final a…
Figure 7: Mixed-spiked MS with a diagonal interaction matrix K̂. Evolution of the reduced coefficients α_k, β_k, γ_k versus iteration. Left: pure GD. Middle: hybrid optimization, which switches from simplified Muon to vanilla GD. Right: pure Muon. … appearing in the discrete simulation and closed-form trajectories. This truncation prevents negative plotted values, but it should not be interpreted as evidence that Muon a…
Figure 8: Mixed-spiked MS with a PSD interaction matrix K̂. Evolution of the reduced coefficients α_k, β_k, γ_k versus iteration (horizontal axis plotted in the symmetric-log scale). Left: pure GD. Middle: hybrid optimization, which switches from simplified Muon to vanilla GD. Right: pure Muon. PSD case. We next consider a more general positive semidefinite interaction matrix K̂, beyond the diagonal setting discussed…
Figure 9: Training a two-layer neural network on MNIST with pure optimizers. Left: training loss of momentum-GD, Adam, and Muon. Right: test accuracy of the three optimizers. Concretely, we train a two-layer neural network with input dimension 28 × 28 = 784, hidden width 256, and output dimension 10, corresponding to the ten MNIST digit classes. We use the ReLU activation function [62] and train with mini-batches of…
Figure 10: Training a two-layer neural network on MNIST with a hybrid optimizer. The optimizer first follows Muon and then switches to Adam. Left: training loss. Right: test accuracy. Motivated by this observation, we further test a hybrid optimization strategy.
Figure 11: LLM pre-training with a “Muon → AdamW” transition. We train with Muon for the first 1k iterations and switch to AdamW for the remaining 3k iterations. C.2.2 Additional Demonstration for Early “Muon → AdamW” Switching. Section 5 presents the training-loss and validation-loss curves for an early “Muon → AdamW” transition at 1.5k iterations. Due to space constraints, we provide the corresponding 1k-switch res…
Figure 12: LLM pre-training with a late “Muon → AdamW” transition. We train with Muon for the first 3k iterations and switch to AdamW for the remaining 1k iterations.
Figure 13: LLM pre-training with a very late “Muon → AdamW” transition. We train with Muon for the first 3.75k iterations and switch to AdamW for the remaining 0.25k iterations. AdamW can still provide effective late-stage refinement even after a long Muon pre-training phase, despite the limited remaining training budget. Overall, these experiments provide supplementary evidence that the “Muon → AdamW” transition is…
Original abstract

Recently, Muon has gained substantial attention as an appealing alternative to Adam-like optimizers, with many works highlighting its advantages through spectral normalization and improved conditioning. Yet this positive theoretical narrative contrasts with its empirical performance in large language model (LLM) training, where Muon's gains over Adam/AdamW are often mixed, schedule-sensitive, and not uniformly superior. To address this gap, we develop a trajectory-level theory characterizing both the strengths and limitations of Muon. We introduce a mixed-spiked matrix sensing model whose sensing operator decomposes into signal, spike, and bulk components, capturing a mixture of anisotropic structure and long-tail information reminiscent of LLM training. On top of it, we adopt a river-valley perspective in which we view the landscape as composed of a river direction flowing to the desired solution and hill directions encoding nuisance or task-irrelevant information. In the momentum-free setting, we show that Muon moves faster along the information-bearing river direction during early optimization, but can converge much more slowly near the river bottom than gradient descent. We then extend the river-valley perspective to general nonconvex objectives with momentum by studying points on the spectral river. There, while Muon converges faster early on, its orthogonalized update removes residual scale information, making it prone to overshooting and oscillation near the target solution. Together, these results suggest that our characterizations extend beyond spiked matrix sensing and motivate switching to GD-like refinement optimizers in the final phase, rather than relying only on a fixed learning-rate schedule for Muon. We also provide preliminary evidence supporting this two-stage approach in language model training experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a mixed-spiked matrix sensing model whose sensing operator decomposes into signal, spike, and bulk components, together with a river-valley perspective that decomposes the landscape into a river direction (information-bearing) and hill directions (nuisance). In the momentum-free setting the paper claims Muon moves faster along the river early but converges more slowly near the river bottom than gradient descent. Extending the perspective to general nonconvex objectives with momentum, the authors argue that Muon's orthogonalized updates remove residual scale information and therefore induce overshooting and oscillation near spectral river points; they recommend a two-stage strategy that switches to GD-like refinement in the final phase and supply preliminary LLM training evidence for this approach.

Significance. If the characterizations hold, the work supplies a trajectory-level explanation for the schedule-sensitive and mixed empirical gains of Muon over Adam/AdamW in LLM training. The mixed-spiked model and river-valley decomposition constitute a novel modeling choice that captures anisotropic structure plus long-tail information; the preliminary experiments provide concrete support for the suggested two-stage optimizer switch. These elements could usefully inform the design of hybrid first-order methods.

major comments (1)
  1. [Extension to general nonconvex objectives with momentum] The extension of the river-valley analysis to general nonconvex objectives with momentum (abstract and the corresponding theoretical section) asserts that the orthogonalized update removes residual scale information and thereby produces overshooting/oscillation. No explicit conditions on curvature, momentum coefficient, or bulk/spike ratios are supplied that would guarantee persistence of the scale-removal effect once the trajectory leaves the spiked sensing operator. Because this step is load-bearing for the two-stage recommendation, the oscillation claim requires additional derivation or counter-example analysis to be fully grounded.
minor comments (2)
  1. [Introduction] Notation for the river and hill directions is introduced in the abstract but would benefit from an explicit one-sentence definition in the first paragraph of the introduction for readers encountering the perspective for the first time.
  2. [Experiments] The preliminary experiments section would be strengthened by reporting the precise learning-rate schedules and the point at which the switch to the GD-like refinement occurs, so that the two-stage protocol can be reproduced.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the detailed comment on the extension to general nonconvex objectives. We address the concern below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: The extension of the river-valley analysis to general nonconvex objectives with momentum (abstract and the corresponding theoretical section) asserts that the orthogonalized update removes residual scale information and thereby produces overshooting/oscillation. No explicit conditions on curvature, momentum coefficient, or bulk/spike ratios are supplied that would guarantee persistence of the scale-removal effect once the trajectory leaves the spiked sensing operator. Because this step is load-bearing for the two-stage recommendation, the oscillation claim requires additional derivation or counter-example analysis to be fully grounded.

    Authors: We agree that the manuscript's extension relies on the mechanism identified at spectral river points within the mixed-spiked model and does not furnish explicit conditions guaranteeing that the scale-removal effect of orthogonal updates persists for arbitrary nonconvex objectives once the trajectory departs the spiked sensing operator. The core observation, that Muon's update discards residual scale information along the river direction, follows directly from the orthogonality property and is illustrated at those points, but the claim of broader applicability is heuristic at present. In the revised version we will supply a short derivation of sufficient conditions on local curvature (near the river bottom) and momentum coefficient under which overshooting is expected, or, if the conditions turn out to be restrictive, include a brief counter-example analysis that delineates when the effect may not hold. This will better support the two-stage recommendation without overstating the current theoretical reach.
    Revision: yes

Circularity Check

0 steps flagged

No circularity; model and river-valley perspective introduced independently without self-referential reductions

Full rationale

The abstract and provided text introduce the mixed-spiked matrix sensing model and river-valley perspective as new constructs to characterize Muon trajectories. Claims of faster early river progress but slower bottom convergence (momentum-free) and scale-removal leading to overshoot (with momentum) are presented as derived results within this framework, with an empirical two-stage suggestion. No equations, self-citations, or fitted parameters are shown that reduce any prediction to the inputs by construction. The extension to general nonconvex objectives is described conceptually rather than via a load-bearing self-citation chain or ansatz smuggling. This is self-contained against external benchmarks, consistent with a normal non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the validity of the newly introduced mixed-spiked model and river-valley perspective, which are postulated to represent LLM training without external validation beyond the preliminary experiments mentioned.

axioms (1)
  • [standard math] Standard assumptions underlying convergence analysis of first-order methods in nonconvex settings
    Invoked when extending the analysis to general nonconvex objectives with momentum.
invented entities (2)
  • mixed-spiked matrix sensing model (no independent evidence)
    purpose: To decompose the sensing operator into signal, spike, and bulk components capturing anisotropic and long-tail structure
    New model constructed to enable the river-valley analysis of Muon trajectories.
  • river-valley perspective (no independent evidence)
    purpose: To decompose the landscape into a river direction toward the solution and orthogonal hill directions for nuisance information
    New viewpoint adopted to characterize early versus late optimization behavior.



Reference graph

Works this paper leans on

61 extracted references · 13 linked inside Pith

  1. [1]

    Improving generalization performance by switching from adam to sgd

    Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from adam to sgd. arXiv preprint arXiv:1712.07628, 2017

  2. [2]

    Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

  3. [3]

    Evolution of optimization methods: Algorithms, scenarios, and evaluations

    Tong Zhang, Jiangning Zhang, Zhucun Xue, Juntao Jiang, Yicheng Xu, Chengming Xu, Teng Hu, Xingyu Xie, Xiaobin Hu, Yabiao Wang, et al. Evolution of optimization methods: Algorithms, scenarios, and evaluations. arXiv preprint arXiv:2604.12968, 2026

  4. [4]

    An exploration of non-euclidean gradient descent: Muon and its many variants.arXiv preprint arXiv:2510.09827, 2025

    Michael Crawshaw, Chirag Modi, Mingrui Liu, and Robert M Gower. An exploration of non-euclidean gradient descent: Muon and its many variants.arXiv preprint arXiv:2510.09827, 2025

  5. [5]

    Muon: An optimizer for hidden layers in neural networks, 2024

    Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon, 6(3):4, 2024

  6. [6]

    Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025

  7. [7]

    Practical efficiency of muon for pretraining.arXiv preprint arXiv:2505.02222, 2025

    Ishaan Shah, Anthony M Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, et al. Practical efficiency of muon for pretraining.arXiv preprint arXiv:2505.02222, 2025

  8. [8]

    Spectra: Rethinking optimizers for llms under spectral anisotropy.arXiv preprint arXiv:2602.11185, 2026

    Zhendong Huang, Hengjie Cao, Fang Dong, Ruijun Huang, Mengyi Chen, Yifeng Yang, Xin Zhang, Anrui Chen, Mingzhi Dong, Yujiang Wang, et al. Spectra: Rethinking optimizers for llms under spectral anisotropy.arXiv preprint arXiv:2602.11185, 2026

  9. [9]

    Spectral gradient descent mitigates anisotropy-driven misalignment: A case study in phase retrieval. arXiv preprint arXiv:2601.22652, 2026

    Guillaume Braun, Han Bao, Wei Huang, and Masaaki Imaizumi. Spectral gradient descent mitigates anisotropy-driven misalignment: A case study in phase retrieval. arXiv preprint arXiv:2601.22652, 2026

  10. [10]

    On the convergence analysis of muon

    Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon. arXiv preprint arXiv:2505.23737, 2025

  11. [11]

    Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

  12. [12]

    Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

    Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

  13. [13]

    Muon: Training and trade-offs with latent attention and moe.arXiv preprint arXiv:2509.24406, 2025

    Sushant Mehta, Raj Dandekar, Rajat Dandekar, and Sreedath Panat. Muon: Training and trade-offs with latent attention and moe.arXiv preprint arXiv:2509.24406, 2025

  14. [14]

    Preconditioning benefits of spectral orthogonalization in muon.arXiv preprint arXiv:2601.13474, 2026

    Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen. Preconditioning benefits of spectral orthogonalization in muon.arXiv preprint arXiv:2601.13474, 2026

  15. [16]

    Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective. arXiv preprint arXiv:2410.05192, 2024

    Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable-decay learning rates: A river valley loss landscape perspective. arXiv preprint arXiv:2410.05192, 2024

  16. [17]

    The anisotropy of time.The Monist, 48(2):219–247, 1964

    Adolf Grünbaum. The anisotropy of time.The Monist, 48(2):219–247, 1964

  17. [18]

    OUP Oxford, 2004

    Robert E Newnham.Properties of materials: anisotropy, symmetry, structure. OUP Oxford, 2004

  18. [19]

    Mapping the large-scale anisotropy in the wmap data.Astronomy & Astrophysics, 464(2):479–485, 2007

    Armando Bernui, B Mota, Marcelo J Reboucas, and R Tavakol. Mapping the large-scale anisotropy in the wmap data.Astronomy & Astrophysics, 464(2):479–485, 2007

  19. [20]

    Anisotropy is everywhere, to see, to measure, and to model.Rock Mechanics and Rock Engineering, 48(4):1323–1339, 2015

    Nick Barton and Eda Quadros. Anisotropy is everywhere, to see, to measure, and to model.Rock Mechanics and Rock Engineering, 48(4):1323–1339, 2015

  20. [21]

    Neural anisotropy directions.Advances in Neural Information Processing Systems, 33:17896–17906, 2020

    Guillermo Ortiz-Jiménez, Apostolos Modas, Seyed-Mohsen Moosavi, and Pascal Frossard. Neural anisotropy directions.Advances in Neural Information Processing Systems, 33:17896–17906, 2020

  21. [22]

    Learning shape correspondence with anisotropic convolutional neural networks.Advances in neural information processing systems, 29, 2016

    Davide Boscaini, Jonathan Masci, Emanuele Rodolà, and Michael Bronstein. Learning shape correspondence with anisotropic convolutional neural networks.Advances in neural information processing systems, 29, 2016

  22. [23]

    Deformable convolutional networks

    Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. InProceedings of the IEEE international conference on computer vision, pages 764–773, 2017. 10 River-valley geometry reveals Muon as an early-stage exploration optimizer rather than late-stage refinement optimizer

  23. [24]

    Anisotropy is inherent to self-attention in transformers

    Nathan Godey, Éric Clergerie, and Benoît Sagot. Anisotropy is inherent to self-attention in transformers. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 35–48, 2024

  24. [25]

    Anisotropy is not inherent to transformers

    Anemily Machina and Robert Mercer. Anisotropy is not inherent to transformers. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4892–4907, 2024

  25. [26]

    Accelerating llm pre-training through flat-direction dynamics enhancement.arXiv preprint arXiv:2602.22681, 2026

    Shuchen Zhu, Rizhen Hu, Mingze Wang, Mou Sun, Xue Wang, Kun Yuan, and Zaiwen Wen. Accelerating llm pre-training through flat-direction dynamics enhancement.arXiv preprint arXiv:2602.22681, 2026

  26. [27]

    Revisiting anisotropy in language transformers: The geometry of learning dynamics. arXiv preprint arXiv:2604.08764, 2026

    Raphael Bernas, Fanny Jourdan, Antonin Poché, and Céline Hudelot. Revisiting anisotropy in language transformers: The geometry of learning dynamics. arXiv preprint arXiv:2604.08764, 2026

  27. [28]

    Accelerating block coordinate descent for llm finetuning via landscape expansion.Advances in Neural Information Processing Systems, 38:56619–56645, 2026

    Qijun Luo, Yifei Shen, Liangzu Peng, Dongsheng Li, and Xiao Li. Accelerating block coordinate descent for llm finetuning via landscape expansion.Advances in Neural Information Processing Systems, 38:56619–56645, 2026

  28. [29]

    Rethinking llm training through information geometry and quantum metrics.arXiv preprint arXiv:2506.15830, 2025

    Riccardo Di Sipio. Rethinking llm training through information geometry and quantum metrics.arXiv preprint arXiv:2506.15830, 2025

  29. [30]

    Gradient descent on neural networks typically occurs at the edge of stability.arXiv preprint arXiv:2103.00065, 2021

    Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability.arXiv preprint arXiv:2103.00065, 2021

  30. [31]

    Analyzing sharpness along gd trajectory: Progressive sharpening and edge of stability.Advances in Neural Information Processing Systems, 35:9983–9994, 2022

    Zixuan Wang, Zhouzi Li, and Jian Li. Analyzing sharpness along gd trajectory: Progressive sharpening and edge of stability.Advances in Neural Information Processing Systems, 35:9983–9994, 2022

  31. [32]

    The origin of edge of stability.arXiv preprint arXiv:2604.20446, 2026

    Litman Elon. The origin of edge of stability.arXiv preprint arXiv:2604.20446, 2026

  32. [33]

    Isotropic curvature model for understanding deep learning optimization: Is gradient orthogonalization optimal?arXiv preprint arXiv:2511.00674, 2025

    Weijie Su. Isotropic curvature model for understanding deep learning optimization: Is gradient orthogonalization optimal?arXiv preprint arXiv:2511.00674, 2025

  33. [34]

    Muon outperforms adam in tail-end associative memory learning.arXiv preprint arXiv:2509.26030, 2025

    Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent YF Tan. Muon outperforms adam in tail-end associative memory learning.arXiv preprint arXiv:2509.26030, 2025

  34. [35]

    Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, and Jason D. Lee. Sharp capacity scaling of spectral optimizers in learning associative memory.arXiv preprint arXiv:2603.26554, 2026

  35. [36]

    Muon optimizes under spectral norm constraints.arXiv preprint arXiv:2506.15054, 2025

    Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints.arXiv preprint arXiv:2506.15054, 2025

  36. [37]

    Lions and muons: Optimization via stochastic frank-wolfe.arXiv preprint arXiv:2506.04192, 2025

    Maria-Eleni Sfyraki and Jun-Kun Wang. Lions and muons: Optimization via stochastic frank-wolfe.arXiv preprint arXiv:2506.04192, 2025

  37. [38]

    Training deep learning models with norm-constrained lmos

    Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos. In International Conference on Machine Learning, pages 49069–49104. PMLR, 2025

  38. [39]

    Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm.arXiv preprint arXiv:2505.16932, 2025

  39. [40]

    Convergence of muon with newton-schulz.arXiv preprint arXiv:2601.19156, 2026

    Gyu Yeol Kim and Min-hwan Oh. Convergence of muon with newton-schulz.arXiv preprint arXiv:2601.19156, 2026

  40. [41]

    Swin transformer v2: Scaling up capacity and resolution

    Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022

  41. [42]

    An analysis of single-layer networks in unsupervised feature learning

    Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011

  42. [43]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  43. [44]

    Alex Krizhevsky and Geoffrey E. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, ON, Canada, 2009

  44. [45]

    Introduction to compressed sensing., 2012

    Mark A Davenport, Marco F Duarte, Yonina C Eldar, and Gitta Kutyniok. Introduction to compressed sensing., 2012

  45. [46]

    Andreas M Tillmann and Marc E Pfetsch. The computational complexity of the restricted isometry property, the nullspace property, and related concepts in compressed sensing. IEEE Transactions on Information Theory, 60(2):1248–1259, 2013.

  46. [47]

    Phase retrieval via matrix completion. SIAM review, 57(2):225–251, 2015

    Emmanuel J Candes, Yonina C Eldar, Thomas Strohmer, and Vladislav Voroninski. Phase retrieval via matrix completion. SIAM review, 57(2):225–251, 2015

  47. [48]

    Universal low-rank matrix recovery from Pauli measurements

    Yi-Kai Liu. Universal low-rank matrix recovery from Pauli measurements. Advances in Neural Information Processing Systems, 24, 2011

  48. [49]

    A blind compressed sensing formulation for collaborative filtering

    Anuj Rajani, Paritosh Mittal, Aishwarya Jain, and Angshul Majumdar. A blind compressed sensing formulation for collaborative filtering. In 2014 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), pages 000438–000443. IEEE, 2014

  49. [50]

    Towards robust and scalable power system state estimation

    Ming Jin, Igor Molybog, Reza Mohammadi-Ghazi, and Javad Lavaei. Towards robust and scalable power system state estimation. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3245–3252. IEEE, 2019

  50. [51]

    Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations

    Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pages 2–47. PMLR, 2018

  51. [52]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  52. [53]

    Distribution of eigenvalues for some sets of random matrices

    Vladimir A Marčenko and Leonid Andreevich Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457–483, 1967

  53. [54]

    Applied linear statistical models

    Herman F Senter. Applied linear statistical models, 2008

  54. [55]

    Dominik Stöger and Mahdi Soltanolkotabi. Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. Advances in Neural Information Processing Systems, 34:23831–23843, 2021

  55. [56]

    Llama: Open and efficient foundation language models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  56. [57]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

  57. [58]

    The Pile: An 800GB dataset of diverse text for language modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  58. [59]

    Lectures on convex optimization

    Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018

  59. [60]

    Perturbation theory for the singular value decomposition

    Gilbert W Stewart. Perturbation theory for the singular value decomposition. SVD and Signal Processing II, Algorithms, Analysis and Applications, pages 99–109, 1991

  60. [61]

    The MNIST database of handwritten digit images for machine learning research [best of the web]

    Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012

  61. [62]

    Rectified linear units improve restricted Boltzmann machines

    Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010

Appendix Contents

A Additional Detai...