Towards Understanding the Power and Limits of the Muon Optimizer: A River-Valley Perspective

Jianhao Ma; Jiaye Teng; Jinji Yang; Runze Shi; Tianqi Shen; Ziye Ma

arxiv: 2606.21514 · v1 · pith:SBV655U5new · submitted 2026-06-19 · 💻 cs.LG

Pith reviewed 2026-06-26 14:21 UTC · model grok-4.3

classification 💻 cs.LG

keywords Muon optimizer · river-valley perspective · mixed-spiked matrix sensing · optimization trajectory · LLM training · gradient descent comparison · momentum methods

The pith

Muon moves faster early along the key direction but can oscillate and slow near the solution compared to gradient descent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a trajectory-level theory to explain why Muon shows mixed gains over Adam-like methods in large language model training. It models the loss landscape as a river-valley structure using a mixed-spiked matrix sensing problem whose operator splits into signal, spike, and bulk parts. Without momentum, Muon advances more quickly than plain gradient descent along the information-bearing river direction at the start, yet converges more slowly near the river bottom. With momentum on general nonconvex problems, the orthogonalized updates strip away residual scale information and produce overshooting and oscillation close to the target. The analysis therefore recommends a two-stage strategy that switches to gradient-descent-style refinement for the final phase rather than relying on a single fixed schedule.

Core claim

In the momentum-free setting, Muon moves faster along the information-bearing river direction during early optimization, but can converge much more slowly near the river bottom than gradient descent. Extending to general nonconvex objectives with momentum, Muon's orthogonalized update removes residual scale information, making it prone to overshooting and oscillation near the target solution.
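
The scale-removal mechanism can be written out in symbols. The block below is a sketch in assumed notation (step size $\eta$, gradient $G_k$, and its SVD factors are our labels, not necessarily the paper's), for a simplified momentum-free Muon with exact orthogonalization; it does not reproduce the mixed-spiked model.

```latex
\begin{aligned}
G_k &= \nabla f(X_k) = U_k \Sigma_k V_k^\top && \text{(compact SVD of the gradient)} \\
\text{GD:}\quad X_{k+1} &= X_k - \eta\, U_k \Sigma_k V_k^\top, & \|X_{k+1} - X_k\|_{\mathrm{op}} &= \eta\, \sigma_{\max}(G_k) \\
\text{Muon:}\quad X_{k+1} &= X_k - \eta\, U_k V_k^\top, & \|X_{k+1} - X_k\|_{\mathrm{op}} &= \eta \quad (G_k \neq 0)
\end{aligned}
```

As $G_k \to 0$ near a minimizer, the GD step shrinks with $\sigma_{\max}(G_k)$, while the orthogonalized step keeps spectral norm $\eta$. Any late-stage shrinkage therefore has to come from the learning-rate schedule or from switching optimizers.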

What carries the argument

The river-valley perspective that decomposes the landscape into a primary river direction toward the solution and perpendicular hill directions of nuisance information, constructed on the mixed-spiked matrix sensing model.

If this is right

Muon supplies early speed advantages in landscapes that contain both strong directional signal and long-tail bulk components.
Near the target, Muon's orthogonal updates tend to discard scale cues and produce overshooting.
Switching from Muon to a gradient-descent-style optimizer in the final phase can reduce oscillation and improve final accuracy.
In the paper's preliminary language-model experiments, the two-stage schedule outperforms a single fixed learning-rate schedule for Muon.
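
The early-speed and late-refinement claims in the list above can be illustrated on a toy problem. The sketch below is not the paper's mixed-spiked model: it assumes a two-parameter diagonal quadratic with one steep hill direction and one flat river direction, for which the orthogonalized update reduces to a sign step on each diagonal entry (the polar factor of a diagonal matrix is the diagonal of signs). All constants and function names are hypothetical.

```python
# Hypothetical toy constants (not taken from the paper).
HILL_CURV, RIVER_CURV = 10.0, 0.05   # steep hill direction, flat river direction
START = (1.02, 3.01)                 # (hill coordinate, river coordinate)
LR = 0.05


def sign(v):
    """Sign with sign(0) = 0."""
    return (v > 0) - (v < 0)


def loss(x):
    return 0.5 * (HILL_CURV * x[0] ** 2 + RIVER_CURV * x[1] ** 2)


def grad(x):
    return (HILL_CURV * x[0], RIVER_CURV * x[1])


def gd_step(x):
    g = grad(x)
    return (x[0] - LR * g[0], x[1] - LR * g[1])


def muon_step(x):
    # For a diagonal gradient, the orthogonalized update is diag(sign(g_i)):
    # the direction is kept and the gradient scale is discarded.
    g = grad(x)
    return (x[0] - LR * sign(g[0]), x[1] - LR * sign(g[1]))


def run(total_steps, switch_step):
    """Muon-style steps before switch_step, GD steps afterwards; returns the loss curve."""
    x, losses = START, [loss(START)]
    for k in range(total_steps):
        x = muon_step(x) if k < switch_step else gd_step(x)
        losses.append(loss(x))
    return losses


gd_curve = run(200, switch_step=0)        # pure GD
muon_curve = run(200, switch_step=200)    # pure Muon-style
hybrid_curve = run(200, switch_step=100)  # Muon-style, then GD

print("loss after 50 steps  (GD, Muon):", gd_curve[50], muon_curve[50])
print("final loss (GD, Muon, hybrid):", gd_curve[-1], muon_curve[-1], hybrid_curve[-1])
```

With these constants, the Muon-style run gets close to the minimum along the flat direction sooner than GD, but its loss then stalls at a floor set by the step size, while the run that switches to GD at step 100 keeps decreasing. Other constants can change the ordering, so this is an illustration rather than evidence.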

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same river-valley lens could be applied to other spectral-normalization optimizers to check whether they share the late-stage oscillation pattern.
A hybrid optimizer that keeps Muon's early directional speed while restoring scale information late could be designed and tested.
The mixed-spiked model offers a concrete testbed for measuring how much bulk-component strength is needed before Muon's disadvantage appears.

Load-bearing premise

The mixed-spiked matrix sensing model whose sensing operator decomposes into signal, spike, and bulk components captures the mixture of anisotropic structure and long-tail information in LLM training landscapes.

What would settle it

Direct measurement of convergence speed near the river bottom in the mixed-spiked model, or observation of oscillation amplitude in late-stage Muon runs during actual language model training.
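
One possible way to quantify "oscillation amplitude" from a logged loss curve is sketched below. The metric is an assumption of this note, not something defined in the paper: it reports the peak-to-peak range over a trailing window and the fraction of consecutive loss changes that flip sign. The function name and window argument are hypothetical.

```python
def oscillation_stats(values, window):
    """Peak-to-peak range and sign-flip fraction over the last `window` values."""
    if window < 3 or window > len(values):
        raise ValueError("window must be between 3 and len(values)")
    tail = values[-window:]
    # Consecutive differences; a sign change between neighbors marks a reversal.
    diffs = [b - a for a, b in zip(tail, tail[1:])]
    flips = sum(1 for d0, d1 in zip(diffs, diffs[1:]) if d0 * d1 < 0)
    return {
        "peak_to_peak": max(tail) - min(tail),
        "flip_fraction": flips / (len(diffs) - 1),
    }


smooth = [1.0 / (k + 1) for k in range(50)]   # monotone decay, no reversals
zigzag = [0.0, 1.0] * 10                      # alternates every step
print(oscillation_stats(smooth, window=10))
print(oscillation_stats(zigzag, window=10))
```

A flip fraction near 1 with a non-shrinking peak-to-peak range would be the signature of the late-stage behavior described above; a noisy minibatch loss would need smoothing before this metric is meaningful.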

Figures

Figures reproduced from arXiv: 2606.21514 by Jianhao Ma, Jiaye Teng, Jinji Yang, Runze Shi, Tianqi Shen, Ziye Ma.

**Figure 1.** A river-valley perspective on GD-like methods (e.g., momentum-GD, Adam), Muon, and their hybrid strategy. (a) GD-like methods explore the river slowly but achieve accurate final convergence. (b) The hybrid method combines fast early exploration with accurate late-stage refinement. (c) Muon explores rapidly along the river but remains more oscillatory near the end of training.

**Figure 2.** A simple motivating example. This 2D anisotropic spectral slice illustrates the main message of this paper: Muon is highly effective as an early-stage exploration optimizer, but GD/Adam-type dynamics are needed for stable late-stage refinement (details in Appendix C.1.1). Footnote: M with superscript and subscript denotes momentum instead of the plain matrix used in Section 3.

**Figure 3.** Empirical evidence for a mixed-spiked covariance structure in image representations. Across datasets and architectures, the covariance spectrum of pre-classifier feature maps exhibits a pronounced bulk-signal-spike structure. A larger model (SwinV2-B [41]) and a larger dataset (STL-10 [42]) tend to yield more task-relevant eigendirections, whereas a smaller model (ResNet-18 [43]) or smaller dataset (CIFAR-10 [4…

**Figure 4.** LLM pre-training with a “Muon → AdamW” transition. We train with Muon for the first 1.5k iterations and switch to AdamW for the remaining 2.5k iterations. We conduct all the experiments in a language-model pre-training setting. Specifically, we train a 250M-parameter LLaMA-style [56] decoder-only Transformer from scratch, rather than fine-tuning a pretrained LLaMA checkpoint. We use the GPT-2 tokenizer [5…

**Figure 5.** 2D anisotropic spectral slice case: pure optimizers. Left: optimization trajectories of momentum-GD, Adam, and Muon under a fixed learning rate of 0.037. The black arrows indicate the displacement every other iteration. Right: loss versus iteration for the three optimizers. We consider a simple two-dimensional anisotropic spectral landscape as a motivating example for the phenomena discussed in the main text. …

**Figure 6.** 2D anisotropic spectral slice case: hybrid optimizers. Left: optimization trajectories of the two hybrid schemes under a fixed learning rate of 0.037. The black arrows indicate the displacement every other iteration. Right: loss versus iteration for the pure and hybrid optimizers. Overall, the pure-optimizer experiment is consistent with our intuition: Muon achieves rapid early-stage progress but poor final a…

**Figure 7.** Mixed-spiked MS with a diagonal interaction matrix $\hat{K}$. Evolution of the reduced coefficients $\alpha_k, \beta_k, \gamma_k$ versus iteration. Left: pure GD. Middle: hybrid optimization, which switches from simplified Muon to vanilla GD. Right: pure Muon. … appearing in the discrete simulation and closed-form trajectories. This truncation prevents negative plotted values, but it should not be interpreted as evidence that Muon a…

**Figure 8.** Mixed-spiked MS with a PSD interaction matrix $\hat{K}$. Evolution of the reduced coefficients $\alpha_k, \beta_k, \gamma_k$ versus iteration (horizontal axis plotted on a symmetric-log scale). Left: pure GD. Middle: hybrid optimization, which switches from simplified Muon to vanilla GD. Right: pure Muon. PSD case: we next consider a more general positive semidefinite interaction matrix $\hat{K}$, beyond the diagonal setting discussed…

**Figure 9.** Training a two-layer neural network on MNIST with pure optimizers. Left: training loss of momentum-GD, Adam, and Muon. Right: test accuracy of the three optimizers. Concretely, we train a two-layer neural network with input dimension 28 × 28 = 784, hidden width 256, and output dimension 10, corresponding to the ten MNIST digit classes. We use the ReLU activation function [62] and train with mini-batches of…

**Figure 10.** Training a two-layer neural network on MNIST with a hybrid optimizer. The optimizer first follows Muon and then switches to Adam. Left: training loss. Right: test accuracy. Motivated by this observation, we further test a hybrid optimization strategy. …

**Figure 11.** LLM pre-training with a “Muon → AdamW” transition. We train with Muon for the first 1k iterations and switch to AdamW for the remaining 3k iterations. From Appendix C.2.2 (Additional Demonstration for Early “Muon → AdamW” Switching): Section 5 presents the training-loss and validation-loss curves for an early “Muon → AdamW” transition at 1.5k iterations. Due to space constraints, we provide the corresponding 1k-switch res…

**Figure 12.** LLM pre-training with a late “Muon → AdamW” transition. We train with Muon for the first 3k iterations and switch to AdamW for the remaining 1k iterations. Panels (a) and (b).

**Figure 13.** LLM pre-training with a very late “Muon → AdamW” transition. We train with Muon for the first 3.75k iterations and switch to AdamW for the remaining 0.25k iterations. AdamW can still provide effective late-stage refinement even after a long Muon pre-training phase, despite the limited remaining training budget. Overall, these experiments provide supplementary evidence that the “Muon → AdamW” transition is…
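
The switch points quoted in the captions of Figures 4, 11, 12, and 13 can be collected into a small schedule helper. This is a sketch under assumptions: the step counts come from the captions, but the 0-indexed convention, the exact step at which the switch takes effect, and all names below are hypothetical and not taken from the authors' code.

```python
TOTAL_STEPS = 4000  # every quoted run sums to 4k iterations

# Muon -> AdamW switch points as read off the figure captions.
SWITCH_STEPS = {
    "figure_11_early": 1000,
    "figure_4": 1500,
    "figure_12_late": 3000,
    "figure_13_very_late": 3750,
}


def optimizer_at(step, switch_step):
    """Name of the optimizer assumed active at a 0-indexed training step."""
    if not 0 <= step < TOTAL_STEPS:
        raise ValueError("step outside the training budget")
    return "muon" if step < switch_step else "adamw"


def phase_lengths(switch_step):
    """(Muon iterations, AdamW iterations) for a given switch point."""
    return switch_step, TOTAL_STEPS - switch_step


for name, switch in SWITCH_STEPS.items():
    print(name, phase_lengths(switch))
```

Recording the switch point this explicitly alongside the learning-rate schedule is the kind of detail needed to reproduce a two-stage run.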

Abstract

Recently, Muon has gained substantial attention as an appealing alternative to Adam-like optimizers, with many works highlighting its advantages through spectral normalization and improved conditioning. Yet this positive theoretical narrative contrasts with its empirical performance in large language model (LLM) training, where Muon's gains over Adam/AdamW are often mixed, schedule-sensitive, and not uniformly superior. To address this gap, we develop a trajectory-level theory characterizing both the strengths and limitations of Muon. We introduce a mixed-spiked matrix sensing model whose sensing operator decomposes into signal, spike, and bulk components, capturing a mixture of anisotropic structure and long-tail information reminiscent of LLM training. On top of it, we adopt a river-valley perspective in which we view the landscape as composed of a river direction flowing to the desired solution and hill directions encoding nuisance or task-irrelevant information. In the momentum-free setting, we show that Muon moves faster along the information-bearing river direction during early optimization, but can converge much more slowly near the river bottom than gradient descent. We then extend the river-valley perspective to general nonconvex objectives with momentum by studying points on the spectral river. There, while Muon converges faster early on, its orthogonalized update removes residual scale information, making it prone to overshooting and oscillation near the target solution. Together, these results suggest that our characterizations extend beyond spiked matrix sensing and motivate switching to GD-like refinement optimizers in the final phase, rather than relying only on a fixed learning-rate schedule for Muon. We also provide preliminary evidence supporting this two-stage approach in language model training experiments.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The river-valley framing on the mixed-spiked model is a fresh way to separate Muon's early river progress from its late-stage issues, but the nonconvex extension with momentum rests on an unproven carry-over of the scale-removal effect.

The paper's core contribution is a new decomposition of the optimization landscape into a river direction and hill directions, applied first to a mixed-spiked matrix sensing model. In the momentum-free case it derives that Muon advances faster along the signal river early on but slows near the bottom compared with plain gradient descent. That part looks internally consistent within the model they chose.

The extension to general nonconvex problems with momentum is where it gets thinner. The claim that orthogonalization removes residual scale and produces overshoot rests on analyzing spectral river points, yet the write-up does not supply explicit conditions on curvature, momentum size, or spike-to-bulk ratios that would make the same mechanism hold outside the sensing operator. The stress-test note correctly flags this gap; without those conditions the oscillation prediction and the two-stage switch recommendation lose their theoretical anchor.

Experiments are described as preliminary and only loosely back the scheduling suggestion. No machine-checked proofs or large-scale controlled ablations appear.

The work is aimed at researchers who already follow Muon and related spectral methods and want a trajectory-level story rather than another convergence-rate bound. It is coherent enough on its own terms to merit referee time, though any review will need to press on the nonconvex step. I would bring it to a reading group for the model construction alone, but I would not cite the oscillation result until the derivation is tightened.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces a mixed-spiked matrix sensing model whose sensing operator decomposes into signal, spike, and bulk components, together with a river-valley perspective that decomposes the landscape into a river direction (information-bearing) and hill directions (nuisance). In the momentum-free setting the paper claims Muon moves faster along the river early but converges more slowly near the river bottom than gradient descent. Extending the perspective to general nonconvex objectives with momentum, the authors argue that Muon's orthogonalized updates remove residual scale information and therefore induce overshooting and oscillation near spectral river points; they recommend a two-stage strategy that switches to GD-like refinement in the final phase and supply preliminary LLM training evidence for this approach.

Significance. If the characterizations hold, the work supplies a trajectory-level explanation for the schedule-sensitive and mixed empirical gains of Muon over Adam/AdamW in LLM training. The mixed-spiked model and river-valley decomposition constitute a novel modeling choice that captures anisotropic structure plus long-tail information; the preliminary experiments provide concrete support for the suggested two-stage optimizer switch. These elements could usefully inform the design of hybrid first-order methods.

major comments (1)

[Extension to general nonconvex objectives with momentum] The extension of the river-valley analysis to general nonconvex objectives with momentum (abstract and the corresponding theoretical section) asserts that the orthogonalized update removes residual scale information and thereby produces overshooting/oscillation. No explicit conditions on curvature, momentum coefficient, or bulk/spike ratios are supplied that would guarantee persistence of the scale-removal effect once the trajectory leaves the spiked sensing operator. Because this step is load-bearing for the two-stage recommendation, the oscillation claim requires additional derivation or counter-example analysis to be fully grounded.
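
A one-dimensional toy can make the referee's question concrete. In one dimension the orthogonalized update of a nonzero scalar is just its sign, so a momentum Muon step moves by exactly the learning rate in the direction of the momentum buffer. The sketch below assumes a scalar quadratic with hypothetical curvature, momentum coefficient, and learning rate, and compares that update with heavy-ball GD. It is not the paper's derivation and does not establish which conditions hold in general.

```python
CURV, MU, LR, X0, STEPS = 1.0, 0.9, 0.05, 1.03, 300  # hypothetical toy constants


def sign(v):
    """Sign with sign(0) = 0."""
    return (v > 0) - (v < 0)


def run(orthogonalize):
    """Momentum descent on f(x) = 0.5 * CURV * x**2; returns all iterates."""
    x, m, xs = X0, 0.0, [X0]
    for _ in range(STEPS):
        m = MU * m + CURV * x                   # momentum buffer accumulates gradients
        step = sign(m) if orthogonalize else m  # Muon-style: keep direction, drop scale
        x -= LR * step
        xs.append(x)
    return xs


def tail_amplitude(xs, window=50):
    """Largest |x| over the last `window` iterates."""
    return max(abs(x) for x in xs[-window:])


muon_xs = run(orthogonalize=True)
heavy_ball_xs = run(orthogonalize=False)

print("lowest iterate, Muon-style:", min(muon_xs))
print("tail amplitude (Muon-style, heavy-ball):",
      tail_amplitude(muon_xs), tail_amplitude(heavy_ball_xs))
```

Both runs overshoot the minimizer on the first pass, because the momentum buffer still points the old way after the gradient changes sign. The difference is in the tail: the heavy-ball oscillation decays geometrically, while the sign-normalized step cannot shrink below the learning rate. How this picture changes with curvature, momentum coefficient, and matrix structure is exactly what the requested conditions would need to pin down.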

minor comments (2)

[Introduction] Notation for the river and hill directions is introduced in the abstract but would benefit from an explicit one-sentence definition in the first paragraph of the introduction for readers encountering the perspective for the first time.
[Experiments] The preliminary experiments section would be strengthened by reporting the precise learning-rate schedules and the point at which the switch to the GD-like refinement occurs, so that the two-stage protocol can be reproduced.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the work's significance and for the detailed comment on the extension to general nonconvex objectives. We address the concern below and will revise the manuscript accordingly.

Referee: The extension of the river-valley analysis to general nonconvex objectives with momentum (abstract and the corresponding theoretical section) asserts that the orthogonalized update removes residual scale information and thereby produces overshooting/oscillation. No explicit conditions on curvature, momentum coefficient, or bulk/spike ratios are supplied that would guarantee persistence of the scale-removal effect once the trajectory leaves the spiked sensing operator. Because this step is load-bearing for the two-stage recommendation, the oscillation claim requires additional derivation or counter-example analysis to be fully grounded.

Authors: We agree that the manuscript's extension relies on the mechanism identified at spectral river points within the mixed-spiked model and does not furnish explicit conditions guaranteeing that the scale-removal effect of orthogonal updates persists for arbitrary nonconvex objectives once the trajectory departs the spiked sensing operator. The core observation, that Muon's update discards residual scale information along the river direction, follows directly from the orthogonality property and is illustrated at those points, but the claim for broader applicability is indeed heuristic at present. In the revised version we will supply a short derivation of sufficient conditions on local curvature (near the river bottom) and momentum coefficient under which overshooting is expected, or, if the conditions turn out to be restrictive, include a brief counter-example analysis that delineates when the effect may not hold. This will better support the two-stage recommendation without overstating the current theoretical reach.

Revision: yes

Circularity Check

0 steps flagged

No circularity; model and river-valley perspective introduced independently without self-referential reductions

The abstract and provided text introduce the mixed-spiked matrix sensing model and river-valley perspective as new constructs to characterize Muon trajectories. Claims of faster early river progress but slower bottom convergence (momentum-free) and scale-removal leading to overshoot (with momentum) are presented as derived results within this framework, with an empirical two-stage suggestion. No equations, self-citations, or fitted parameters are shown that reduce any prediction to the inputs by construction. The extension to general nonconvex objectives is described conceptually rather than via a load-bearing self-citation chain or ansatz smuggling. This is self-contained against external benchmarks, consistent with a normal non-circular finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on the validity of the newly introduced mixed-spiked model and river-valley perspective, which are postulated to represent LLM training without external validation beyond the preliminary experiments mentioned.

axioms (1)

[standard math] Standard assumptions underlying convergence analysis of first-order methods in nonconvex settings
Invoked when extending the analysis to general nonconvex objectives with momentum.

invented entities (2)

mixed-spiked matrix sensing model [no independent evidence]
purpose: To decompose the sensing operator into signal, spike, and bulk components capturing anisotropic and long-tail structure
New model constructed to enable the river-valley analysis of Muon trajectories.

river-valley perspective [no independent evidence]
purpose: To decompose the landscape into a river direction toward the solution and orthogonal hill directions for nuisance information
New viewpoint adopted to characterize early versus late optimization behavior.

Reference graph

Works this paper leans on

61 extracted references · 13 linked inside Pith

[1]

Improving generalization performance by switching from adam to sgd

Nitish Shirish Keskar and Richard Socher. Improving generalization performance by switching from adam to sgd. arXiv preprint arXiv:1712.07628, 2017

Pith/arXiv arXiv 2017
[2]

Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization.arXiv preprint arXiv:1412.6980, 2014

Pith/arXiv arXiv 2014
[3]

Evolution of optimization methods: Algorithms, scenarios, and evaluations

Tong Zhang, Jiangning Zhang, Zhucun Xue, Juntao Jiang, Yicheng Xu, Chengming Xu, Teng Hu, Xingyu Xie, Xiaobin Hu, Yabiao Wang, et al. Evolution of optimization methods: Algorithms, scenarios, and evaluations. arXiv preprint arXiv:2604.12968, 2026

Pith/arXiv arXiv 2026
[4]

An exploration of non-euclidean gradient descent: Muon and its many variants.arXiv preprint arXiv:2510.09827, 2025

Michael Crawshaw, Chirag Modi, Mingrui Liu, and Robert M Gower. An exploration of non-euclidean gradient descent: Muon and its many variants.arXiv preprint arXiv:2510.09827, 2025

arXiv 2025
[5]

Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024.URL https://kellerjordan. github. io/posts/muon, 6(3):4, 2024

2024
[6]

Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025

Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982, 2025

Pith/arXiv arXiv 2025
[7]

Practical efficiency of muon for pretraining.arXiv preprint arXiv:2505.02222, 2025

Ishaan Shah, Anthony M Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, et al. Practical efficiency of muon for pretraining.arXiv preprint arXiv:2505.02222, 2025

arXiv 2025
[8]

Spectra: Rethinking optimizers for llms under spectral anisotropy.arXiv preprint arXiv:2602.11185, 2026

Zhendong Huang, Hengjie Cao, Fang Dong, Ruijun Huang, Mengyi Chen, Yifeng Yang, Xin Zhang, Anrui Chen, Mingzhi Dong, Yujiang Wang, et al. Spectra: Rethinking optimizers for llms under spectral anisotropy.arXiv preprint arXiv:2602.11185, 2026

arXiv 2026
[9]

Spectral gradient descent mitigates anisotropy- driven misalignment: A case study in phase retrieval.arXiv preprint arXiv:2601.22652, 2026

Guillaume Braun, Han Bao, Wei Huang, and Masaaki Imaizumi. Spectral gradient descent mitigates anisotropy- driven misalignment: A case study in phase retrieval.arXiv preprint arXiv:2601.22652, 2026

arXiv 2026
[10]

On the convergence analysis of muon

Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon. arXiv preprint arXiv:2505.23737, 2025

Pith/arXiv arXiv 2025
[11]

Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization.arXiv preprint arXiv:1711.05101, 2017

Pith/arXiv arXiv 2017
[12]

Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

Kimi Team, Yifan Bai, Yiping Bao, Y Charles, Cheng Chen, Guanduo Chen, Haiting Chen, Huarong Chen, Jiahao Chen, Ningxin Chen, et al. Kimi k2: Open agentic intelligence.arXiv preprint arXiv:2507.20534, 2025

Pith/arXiv arXiv 2025
[13]

Muon: Training and trade-offs with latent attention and moe.arXiv preprint arXiv:2509.24406, 2025

Sushant Mehta, Raj Dandekar, Rajat Dandekar, and Sreedath Panat. Muon: Training and trade-offs with latent attention and moe.arXiv preprint arXiv:2509.24406, 2025

arXiv 2025
[14]

Preconditioning benefits of spectral orthogonalization in muon.arXiv preprint arXiv:2601.13474, 2026

Jianhao Ma, Yu Huang, Yuejie Chi, and Yuxin Chen. Preconditioning benefits of spectral orthogonalization in muon.arXiv preprint arXiv:2601.13474, 2026

arXiv 2026
[16]

Understanding warmup-stable- decay learning rates: A river valley loss landscape perspective.arXiv preprint arXiv:2410.05192, 2024

Kaiyue Wen, Zhiyuan Li, Jason Wang, David Hall, Percy Liang, and Tengyu Ma. Understanding warmup-stable- decay learning rates: A river valley loss landscape perspective.arXiv preprint arXiv:2410.05192, 2024

arXiv 2024
[17]

The anisotropy of time.The Monist, 48(2):219–247, 1964

Adolf Grünbaum. The anisotropy of time.The Monist, 48(2):219–247, 1964

1964
[18]

OUP Oxford, 2004

Robert E Newnham.Properties of materials: anisotropy, symmetry, structure. OUP Oxford, 2004

2004
[19]

Mapping the large-scale anisotropy in the wmap data.Astronomy & Astrophysics, 464(2):479–485, 2007

Armando Bernui, B Mota, Marcelo J Reboucas, and R Tavakol. Mapping the large-scale anisotropy in the wmap data.Astronomy & Astrophysics, 464(2):479–485, 2007

2007
[20]

Anisotropy is everywhere, to see, to measure, and to model.Rock Mechanics and Rock Engineering, 48(4):1323–1339, 2015

Nick Barton and Eda Quadros. Anisotropy is everywhere, to see, to measure, and to model.Rock Mechanics and Rock Engineering, 48(4):1323–1339, 2015

2015
[21]

Neural anisotropy directions.Advances in Neural Information Processing Systems, 33:17896–17906, 2020

Guillermo Ortiz-Jiménez, Apostolos Modas, Seyed-Mohsen Moosavi, and Pascal Frossard. Neural anisotropy directions.Advances in Neural Information Processing Systems, 33:17896–17906, 2020

2020
[22]

Learning shape correspondence with anisotropic convolutional neural networks.Advances in neural information processing systems, 29, 2016

Davide Boscaini, Jonathan Masci, Emanuele Rodolà, and Michael Bronstein. Learning shape correspondence with anisotropic convolutional neural networks.Advances in neural information processing systems, 29, 2016

2016
[23]

Deformable convolutional networks

Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. InProceedings of the IEEE international conference on computer vision, pages 764–773, 2017. 10 River-valley geometry reveals Muon as an early-stage exploration optimizer rather than late-stage refinement optimizer

2017
[24]

Anisotropy is inherent to self-attention in transformers

Nathan Godey, Éric Clergerie, and Benoît Sagot. Anisotropy is inherent to self-attention in transformers. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 35–48, 2024

2024
[25]

Anisotropy is not inherent to transformers

Anemily Machina and Robert Mercer. Anisotropy is not inherent to transformers. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 4892–4907, 2024

2024
[26]

Accelerating llm pre-training through flat-direction dynamics enhancement.arXiv preprint arXiv:2602.22681, 2026

Shuchen Zhu, Rizhen Hu, Mingze Wang, Mou Sun, Xue Wang, Kun Yuan, and Zaiwen Wen. Accelerating llm pre-training through flat-direction dynamics enhancement.arXiv preprint arXiv:2602.22681, 2026

arXiv 2026
[27]

Revisiting anisotropy in language transform- ers: The geometry of learning dynamics.arXiv preprint arXiv:2604.08764, 2026

Raphael Bernas, Fanny Jourdan, Antonin Poché, and Céline Hudelot. Revisiting anisotropy in language transform- ers: The geometry of learning dynamics.arXiv preprint arXiv:2604.08764, 2026

Pith/arXiv arXiv 2026
[28]

Accelerating block coordinate descent for llm finetuning via landscape expansion.Advances in Neural Information Processing Systems, 38:56619–56645, 2026

Qijun Luo, Yifei Shen, Liangzu Peng, Dongsheng Li, and Xiao Li. Accelerating block coordinate descent for llm finetuning via landscape expansion.Advances in Neural Information Processing Systems, 38:56619–56645, 2026

2026
[29]

Rethinking llm training through information geometry and quantum metrics.arXiv preprint arXiv:2506.15830, 2025

Riccardo Di Sipio. Rethinking llm training through information geometry and quantum metrics.arXiv preprint arXiv:2506.15830, 2025

arXiv 2025
[30]

Gradient descent on neural networks typically occurs at the edge of stability.arXiv preprint arXiv:2103.00065, 2021

Jeremy M Cohen, Simran Kaur, Yuanzhi Li, J Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability.arXiv preprint arXiv:2103.00065, 2021

arXiv 2021
[31]

Analyzing sharpness along gd trajectory: Progressive sharpening and edge of stability.Advances in Neural Information Processing Systems, 35:9983–9994, 2022

Zixuan Wang, Zhouzi Li, and Jian Li. Analyzing sharpness along gd trajectory: Progressive sharpening and edge of stability.Advances in Neural Information Processing Systems, 35:9983–9994, 2022

2022
[32]

The origin of edge of stability.arXiv preprint arXiv:2604.20446, 2026

Litman Elon. The origin of edge of stability.arXiv preprint arXiv:2604.20446, 2026

Pith/arXiv arXiv 2026
[33]

Isotropic curvature model for understanding deep learning optimization: Is gradient orthogonalization optimal?arXiv preprint arXiv:2511.00674, 2025

Weijie Su. Isotropic curvature model for understanding deep learning optimization: Is gradient orthogonalization optimal?arXiv preprint arXiv:2511.00674, 2025

arXiv 2025
[34]

Muon outperforms adam in tail-end associative memory learning.arXiv preprint arXiv:2509.26030, 2025

Shuche Wang, Fengzhuo Zhang, Jiaxiang Li, Cunxiao Du, Chao Du, Tianyu Pang, Zhuoran Yang, Mingyi Hong, and Vincent YF Tan. Muon outperforms adam in tail-end associative memory learning.arXiv preprint arXiv:2509.26030, 2025

arXiv 2025
[35]

Juno Kim, Eshaan Nichani, Denny Wu, Alberto Bietti, and Jason D. Lee. Sharp capacity scaling of spectral optimizers in learning associative memory.arXiv preprint arXiv:2603.26554, 2026

Pith/arXiv arXiv 2026
[36]

Muon optimizes under spectral norm constraints.arXiv preprint arXiv:2506.15054, 2025

Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints.arXiv preprint arXiv:2506.15054, 2025

arXiv 2025
[37]

Lions and muons: Optimization via stochastic frank-wolfe.arXiv preprint arXiv:2506.04192, 2025

Maria-Eleni Sfyraki and Jun-Kun Wang. Lions and muons: Optimization via stochastic frank-wolfe.arXiv preprint arXiv:2506.04192, 2025

arXiv 2025
[38]

Training deep learning models with norm-constrained lmos

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and V olkan Cevher. Training deep learning models with norm-constrained lmos. InInternational Conference on Machine Learning, pages 49069–49104. PMLR, 2025

2025
[39]

Noah Amsel, David Persson, Christopher Musco, and Robert M. Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm.arXiv preprint arXiv:2505.16932, 2025

Pith/arXiv arXiv 2025
[40]

Convergence of muon with newton-schulz.arXiv preprint arXiv:2601.19156, 2026

Gyu Yeol Kim and Min-hwan Oh. Convergence of muon with newton-schulz.arXiv preprint arXiv:2601.19156, 2026

arXiv 2026
[41]

Swin transformer v2: Scaling up capacity and resolution

Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022

2022
[42]

An analysis of single-layer networks in unsupervised feature learning

Adam Coates, Andrew Ng, and Honglak Lee. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pages 215–223. JMLR Workshop and Conference Proceedings, 2011

2011
[43]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

2016
[44]

Alex Krizhevsky and Geoffrey E. Hinton. Learning multiple layers of features from tiny images. Technical report, University of Toronto, Toronto, ON, Canada, 2009

2009
[45]

Introduction to compressed sensing

Mark A Davenport, Marco F Duarte, Yonina C Eldar, and Gitta Kutyniok. Introduction to compressed sensing, 2012

2012
[46]

Andreas M Tillmann and Marc E Pfetsch. The computational complexity of the restricted isometry property, the nullspace property, and related concepts in compressed sensing. IEEE Transactions on Information Theory, 60(2):1248–1259, 2013

2013
[47]

Phase retrieval via matrix completion

Emmanuel J Candes, Yonina C Eldar, Thomas Strohmer, and Vladislav Voroninski. Phase retrieval via matrix completion. SIAM Review, 57(2):225–251, 2015

2015
[48]

Universal low-rank matrix recovery from Pauli measurements

Yi-Kai Liu. Universal low-rank matrix recovery from Pauli measurements. Advances in Neural Information Processing Systems, 24, 2011

2011
[49]

A blind compressed sensing formulation for collaborative filtering

Anuj Rajani, Paritosh Mittal, Aishwarya Jain, and Angshul Majumdar. A blind compressed sensing formulation for collaborative filtering. In 2014 IEEE International Symposium on Signal Processing and Information Technology (ISSPIT), pages 000438–000443. IEEE, 2014

2014
[50]

Towards robust and scalable power system state estimation

Ming Jin, Igor Molybog, Reza Mohammadi-Ghazi, and Javad Lavaei. Towards robust and scalable power system state estimation. In 2019 IEEE 58th Conference on Decision and Control (CDC), pages 3245–3252. IEEE, 2019

2019
[51]

Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations

Yuanzhi Li, Tengyu Ma, and Hongyang Zhang. Algorithmic regularization in over-parameterized matrix sensing and neural networks with quadratic activations. In Conference On Learning Theory, pages 2–47. PMLR, 2018

2018
[52]

ImageNet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

2009
[53]

Distribution of eigenvalues for some sets of random matrices

Vladimir A Marčenko and Leonid Andreevich Pastur. Distribution of eigenvalues for some sets of random matrices. Mathematics of the USSR-Sbornik, 1(4):457–483, 1967

1967
[54]

Applied linear statistical models

Herman F Senter. Applied linear statistical models, 2008

2008
[55]

Dominik Stöger and Mahdi Soltanolkotabi. Small random initialization is akin to spectral learning: Optimization and generalization guarantees for overparameterized low-rank matrix reconstruction. Advances in Neural Information Processing Systems, 34:23831–23843, 2021

2021
[56]

LLaMA: Open and efficient foundation language models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023
[57]

Language models are unsupervised multitask learners

Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

2019
[58]

The Pile: An 800GB dataset of diverse text for language modeling

Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

Pith/arXiv arXiv 2020
[59]

Lectures on convex optimization

Yurii Nesterov et al. Lectures on convex optimization, volume 137. Springer, 2018

2018
[60]

Perturbation theory for the singular value decomposition

Gilbert W Stewart. Perturbation theory for the singular value decomposition. SVD and Signal Processing II, Algorithms, Analysis and Applications, pages 99–109, 1991

1991
[61]

The MNIST database of handwritten digit images for machine learning research [best of the web]

Li Deng. The MNIST database of handwritten digit images for machine learning research [best of the web]. IEEE Signal Processing Magazine, 29(6):141–142, 2012

2012
[62]

Rectified linear units improve restricted Boltzmann machines

Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th international conference on machine learning (ICML-10), pages 807–814, 2010

2010
