Convergence Analysis of Muon-type Methods with Inexact LMO in the Degenerate Case
Pith reviewed 2026-06-26 13:26 UTC · model grok-4.3
The pith
Under new assumptions, Muon-type methods with an inexact LMO keep provable convergence rates even when the rescaled momentum is degenerate.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors claim that novel assumptions suffice to prove convergence rates for Muon-type methods using inexact LMO in degenerate scenarios, for the general non-convex case and the star-convex case with weight decay, under the layer-wise (L^0, L^1)-smooth assumption.
What carries the argument
Novel assumptions that address the coupling between inexact LMO solutions and optimal step size or momentum when the smallest positive singular value of the rescaled momentum can approach zero.
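To make the difficulty concrete, the sketch below tracks a single singular value under the textbook cubic Newton-Schulz map s ← 1.5s − 0.5s³, which is what the matrix update X ← 1.5X − 0.5XXᵀX does to each singular value of X. This solver is an assumption chosen for illustration, since the page does not say which iterative LMO solver the paper analyzes, and the name `steps_to_tolerance` is hypothetical.

```python
def steps_to_tolerance(sigma0: float, tol: float = 1e-6, max_steps: int = 1000) -> int:
    """Count cubic Newton-Schulz steps until a singular value is within tol of 1.

    Assumes 0 < sigma0 <= 1, i.e. the matrix was rescaled so that its largest
    singular value is at most 1 (illustrative assumption, not the paper's setup).
    """
    sigma = sigma0
    for step in range(max_steps):
        if abs(1.0 - sigma) <= tol:
            return step
        # Scalar form of X <- 1.5*X - 0.5*X @ X.T @ X acting on one singular value.
        sigma = 1.5 * sigma - 0.5 * sigma ** 3
    return max_steps


if __name__ == "__main__":
    for sigma0 in (0.5, 1e-2, 1e-4, 1e-8):
        print(f"sigma0={sigma0:g}  steps needed={steps_to_tolerance(sigma0)}")
```

For small s the map grows the value by a factor of roughly 1.5 per step, so the number of steps needed scales like log(1/s). With a fixed iteration budget, no uniform accuracy is therefore available once the smallest positive singular value is allowed to approach zero.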
If this is right
- Convergence rates hold for general non-convex problems under the layer-wise smoothness and new assumptions.
- Convergence rates hold for star-convex problems with weight decay under the same conditions.
- The results apply directly to practical inexact iterative solvers for the LMO without requiring a uniform lower bound on singular values.
- The analysis covers the layer-wise (L^0, L^1)-smooth setting typical for deep architectures.
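For orientation, the two structural conditions named above are usually written as follows. This is a sketch in assumed notation (per-layer norms ‖·‖_(i) with duals ‖·‖_(i)⋆, and constants L_i^0, L_i^1 for layer i), following the common form in the generalized-smoothness literature. The paper's exact statements are not reproduced on this page and may differ.

```latex
% Layer-wise (L^0, L^1)-smoothness (assumed form): for every layer i and all X, Y,
\[
  \|\nabla_i f(X) - \nabla_i f(Y)\|_{(i)\star}
  \le \bigl(L_i^0 + L_i^1 \,\|\nabla_i f(X)\|_{(i)\star}\bigr)\,\|X_i - Y_i\|_{(i)} .
\]
% Star-convexity (standard form): for a global minimizer X^\star and all X,
\[
  f(X^\star) \ge f(X) + \langle \nabla f(X),\, X^\star - X \rangle .
\]
```

Setting every L_i^1 to zero recovers ordinary layer-wise Lipschitz smoothness of the gradient.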
Where Pith is reading between the lines
- These guarantees suggest Muon-type methods remain theoretically supported when training very deep or large models where momentum degeneracy is common.
- Empirical checks of whether the new assumptions hold on standard neural network losses would test how far the rates extend in practice.
- The same style of assumptions could be tested on other momentum-based first-order methods that use approximate oracles.
Load-bearing premise
The novel assumptions introduced to handle inexact LMO in degenerate momentum cases are valid and sufficient for the stated rates.
What would settle it
A concrete numerical run of a Muon-type method on a simple non-convex problem where the smallest positive singular value approaches zero, the new assumptions are violated, and the predicted convergence rate fails to materialize.
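A much smaller experiment than the one described above already shows the degenerate regime on a 2-by-2 matrix. The sketch below builds M = U diag(1, eps) Vᵀ, whose exact polar factor is U Vᵀ, and measures how far a fixed number of cubic Newton-Schulz steps lands from it. It does not implement the paper's assumptions or a full Muon-type method. The solver, the rotation angles, and all function names are hypothetical choices for illustration.

```python
import math

Matrix = list[list[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Product of two 2x2 matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def transpose(a: Matrix) -> Matrix:
    return [[a[j][i] for j in range(2)] for i in range(2)]


def rotation(theta: float) -> Matrix:
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


def newton_schulz(x: Matrix, steps: int) -> Matrix:
    """Cubic Newton-Schulz iteration X <- 1.5*X - 0.5*X X^T X (inexact polar factor)."""
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxtx[i][j] for j in range(2)] for i in range(2)]
    return x


def polar_error(eps: float, steps: int = 5) -> float:
    """Frobenius distance between the iterated matrix and the exact polar factor."""
    u, v = rotation(0.3), rotation(1.1)  # arbitrary orthogonal factors
    m = matmul(matmul(u, [[1.0, 0.0], [0.0, eps]]), transpose(v))
    exact = matmul(u, transpose(v))
    approx = newton_schulz(m, steps)
    return math.sqrt(sum((approx[i][j] - exact[i][j]) ** 2 for i in range(2) for j in range(2)))


if __name__ == "__main__":
    for eps in (0.5, 1e-2, 1e-4):
        print(f"eps={eps:g}  error after 5 steps={polar_error(eps):.3e}")
```

At a fixed budget the error approaches 1 as eps shrinks, while a larger budget removes it for any fixed eps. Turning this into the test described above would additionally require a Muon-type outer loop and a check of the paper's assumptions, neither of which is attempted here.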
Original abstract
Muon-type methods have demonstrated potentially superior performance over Adam and its variants, and have shown hyperparameter transferability across model sizes when specific norms are chosen for the LMO in deep architectures. However, while the LMO is solved approximately via iterative algorithms in practice, most convergence analyses consider the ideal case where the search direction is the exact solution to the LMO. Recently, the inexact Muon update was analyzed by Shulgin et al. [2025], which reveals a fundamental coupling between the inexactness and the optimal step size and momentum. However, the convergence is guaranteed for the non-degenerate case only, i.e., the smallest positive singular value of the rescaled momentum is assumed to be bounded below by some positive constant when the spectral norm is used. In this work, we investigate Muon-type methods with inexact LMO in the degenerate case, where the smallest positive singular value of the rescaled momentum can approach zero, for the general non-convex case and the star-convex case with weight decay. Novel assumptions are proposed to address the challenges posed by inexact LMO in such degenerate scenarios, and convergence rates are established under the layer-wise $(L^0, L^1)$-smooth assumption for both cases.
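As a reminder of the object the abstract refers to, the linear minimization oracle over a unit norm ball, and its closed form under the spectral norm, can be written as below. The notation is assumed here (trace inner product ⟨·,·⟩, compact SVD M = UΣVᵀ of rank r, nuclear norm ‖·‖_*), and the paper may use a different scaling or sign convention.

```latex
% LMO over the unit ball of a norm \|\cdot\| (standard definition):
\[
  \operatorname{lmo}(M) \in \operatorname*{arg\,min}_{\|D\| \le 1} \langle M, D \rangle .
\]
% Spectral norm, compact SVD M = U \Sigma V^\top with rank r: one minimizer is
\[
  \operatorname{lmo}(M) = -U V^\top,
  \qquad
  \langle M, -U V^\top \rangle = -\sum_{j=1}^{r} \sigma_j(M) = -\|M\|_{*} .
\]
```

An inexact LMO returns only an approximation of U Vᵀ. The degenerate case in the abstract is the one where the smallest positive σ_j of the rescaled momentum is not bounded away from zero.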
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims to establish convergence rates (including O(1/sqrt(T)) for non-convex objectives) for Muon-type methods using inexact linear minimization oracles (LMO) in the degenerate regime, where the smallest positive singular value of the rescaled momentum may approach zero. Novel assumptions are introduced to handle the interaction between inexactness and degeneracy; rates are derived for both the general non-convex case and the star-convex case with weight decay, under a layer-wise (L^0, L^1)-smoothness condition that extends prior work by Shulgin et al. (2025).
Significance. If the novel assumptions hold and are non-vacuous, the extension to the degenerate case is a meaningful contribution, as it removes a restrictive bounded-singular-value condition that may not hold in practice for iterative LMO solvers. The use of layer-wise smoothness is a positive feature for applicability to deep networks. The work provides a clear technical advance over the non-degenerate analysis, but its impact depends on whether the new assumptions can be verified or relaxed.
major comments (3)
- [§3] (novel assumptions): The assumptions introduced to control inexact LMO in the degenerate regime (where the smallest positive singular value of rescaled momentum approaches zero) are not shown to be implied by the layer-wise smoothness or standard boundedness conditions, nor is any concrete verification provided (e.g., a 2-by-2 matrix sequence where the singular value decays at the rate permitted by the analysis while the assumptions remain satisfied).
- [Theorem 4.1] (non-convex rate): The O(1/sqrt(T)) guarantee is derived under the new assumptions; the proof indicates that violation restores the inexactness-step-size coupling identified by Shulgin et al., yet no quantitative sensitivity result or example is given showing the assumptions hold at the boundary of allowed degeneracy.
- [§5] (star-convex case with weight decay): The interaction between weight decay, inexact LMO, and the degeneracy assumption is not separated from the non-convex analysis; it is unclear whether the claimed rate requires additional restrictions on the decay parameter when the singular value vanishes.
minor comments (2)
- [§2] The precise statement of the layer-wise (L^0, L^1)-smoothness assumption should be moved to the preliminaries or early in §2 so that the introduction can focus on the novelty relative to Shulgin et al. without forward references.
- Notation for the rescaled momentum and its singular values is introduced without a dedicated table or display equation summarizing all symbols; this makes cross-referencing between the degenerate-case assumptions and the rate proofs cumbersome.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the technical contribution of extending the analysis to the degenerate regime. We address each major comment below.
Point-by-point responses
-
Referee: [§3] (novel assumptions): The assumptions introduced to control inexact LMO in the degenerate regime (where the smallest positive singular value of rescaled momentum approaches zero) are not shown to be implied by the layer-wise smoothness or standard boundedness conditions, nor is any concrete verification provided (e.g., a 2-by-2 matrix sequence where the singular value decays at the rate permitted by the analysis while the assumptions remain satisfied).
Authors: The novel assumptions are introduced precisely to capture the necessary control on inexact LMO under degeneracy, which is not implied by layer-wise (L^0, L^1)-smoothness or standard boundedness; this separation is the point of the contribution. We agree that an explicit verification example would improve clarity. In revision we will add a remark containing a 2-by-2 matrix sequence in which the smallest positive singular value decays at the rate permitted by the analysis while the assumptions continue to hold. revision: yes
-
Referee: [Theorem 4.1] (non-convex rate): The O(1/sqrt(T)) guarantee is derived under the new assumptions; the proof indicates that violation restores the inexactness-step-size coupling identified by Shulgin et al., yet no quantitative sensitivity result or example is given showing the assumptions hold at the boundary of allowed degeneracy.
Authors: The proof of Theorem 4.1 already identifies the boundary by showing that violation of the assumptions recovers the inexactness-step-size coupling of Shulgin et al. To make this boundary behavior more explicit, we will add a short discussion paragraph (and, if space permits, a brief numerical illustration) in the revised manuscript. revision: partial
-
Referee: [§5] (star-convex case with weight decay): The interaction between weight decay, inexact LMO, and the degeneracy assumption is not separated from the non-convex analysis; it is unclear whether the claimed rate requires additional restrictions on the decay parameter when the singular value vanishes.
Authors: Section 5 re-uses the same degeneracy assumptions as the non-convex case but exploits star-convexity together with the explicit weight-decay term inside the potential function; the proof does not introduce any further restrictions on the decay parameter. We will insert a clarifying sentence at the beginning of §5 that explicitly separates the two analyses and states that no additional restriction on the decay parameter is required. revision: yes
Circularity Check
No significant circularity detected
The paper extends the external result of Shulgin et al. [2025] on inexact Muon updates by introducing novel assumptions to handle the degenerate case where the smallest positive singular value can approach zero. Convergence rates are then derived under the layer-wise (L^0, L^1)-smoothness assumption for non-convex and star-convex settings. No quoted steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the cited prior work is treated as independent external support, and the new assumptions are explicitly presented as additions rather than tautological redefinitions of the target rates. The derivation chain therefore remains self-contained against the provided abstract and description.
Reference graph
Works this paper leans on
-
[3]
Parameter-agnostic optimization under relaxed smoothness
Parameter-agnostic optimization under relaxed smoothness. In International Conference on Artificial Intelligence and Statistics, 2024
2024
-
[37]
The polar express: Optimal matrix sign methods and their application to the muon algorithm
Noah Amsel, David Persson, Christopher Musco, and Robert M Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm. arXiv preprint arXiv:2505.16932, 2025
Pith/arXiv arXiv 2025
-
[38]
Old optimizer, new norm: An anthology
Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024
Pith/arXiv arXiv 2024
-
[39]
Small singular values can increase in lower precision
Christos Boutsikas, Petros Drineas, and Ilse CF Ipsen. Small singular values can increase in lower precision. SIAM Journal on Matrix Analysis and Applications, 45(3):1518--1540, 2024
2024
-
[40]
Stochastic spectral descent for restricted Boltzmann machines
David Carlson, Volkan Cevher, and Lawrence Carin. Stochastic spectral descent for restricted Boltzmann machines. In Artificial intelligence and statistics, pages 111--119. PMLR, 2015a
2015
-
[41]
Stochastic spectral descent for discrete graphical models
David Carlson, Ya-Ping Hsieh, Edo Collins, Lawrence Carin, and Volkan Cevher. Stochastic spectral descent for discrete graphical models. IEEE Journal of Selected Topics in Signal Processing, 10(2):296--311, 2015b
2015
-
[42]
Preconditioned spectral descent for deep learning
David E Carlson, Edo Collins, Ya-Ping Hsieh, Lawrence Carin, and Volkan Cevher. Preconditioned spectral descent for deep learning. Advances in neural information processing systems, 28, 2015c
2015
-
[43]
On the convergence of muon and beyond
Da Chang, Yongxiang Liu, and Ganzhao Yuan. On the convergence of muon and beyond. arXiv preprint arXiv:2509.15816, 2025
Pith/arXiv arXiv 2025
-
[44]
Muon with nesterov momentum: Heavy-tailed noise and (randomized) inexact polar decomposition
Sayantan Choudhury, Xiaoran Cheng, Martin Takáč, Sen Na, and Mladen Kolar. Muon with nesterov momentum: Heavy-tailed noise and (randomized) inexact polar decomposition. arXiv preprint arXiv:2605.06884, 2026
Pith/arXiv arXiv 2026
-
[45]
Error feedback for muon and friends
Kaja Gruntkowska, Alexander Gaponov, Zhirayr Tovmasyan, and Peter Richtárik. Error feedback for muon and friends. arXiv preprint arXiv:2510.00643, 2025
arXiv 2025
-
[46]
Chuan He, Zhanwang Deng, and Zhaosong Lu. Low-rank orthogonalization for large-scale matrix optimization with applications to foundation model training. arXiv preprint arXiv:2509.11983, 2025
Pith/arXiv arXiv 2025
-
[47]
Functions of matrices: theory and computation
Nicholas J Higham. Functions of matrices: theory and computation. SIAM, 2008
2008
-
[48]
Limuon: Light and fast muon optimizer for large models, 2025
Feihu Huang, Yuning Luo, and Songcan Chen. Limuon: Light and fast muon optimizer for large models, 2025. URL https://arxiv.org/abs/2509.14562
Pith/arXiv arXiv 2025
-
[49]
Muon: An optimizer for hidden layers in neural networks
Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. URL https://kellerjordan.github.io/posts/muon, 2024
2024
-
[50]
A study of bfloat16 for deep learning training
Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, et al. A study of bfloat16 for deep learning training. arXiv preprint arXiv:1905.12322, 2019
Pith/arXiv arXiv 2019
-
[51]
Adam: A method for stochastic optimization
Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015
2015
-
[52]
Dmitry Kovalev. Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization. arXiv preprint arXiv:2503.12645, 2025
arXiv 2025
-
[53]
Non-euclidean sgd for structured optimization: Unified analysis and improved rates
Dmitry Kovalev and Ekaterina Borodich. Non-euclidean sgd for structured optimization: Unified analysis and improved rates. arXiv preprint arXiv:2511.11466, 2025
Pith/arXiv arXiv 2025
-
[54]
A note on the convergence of muon and further
Jiaxiang Li and Mingyi Hong. A note on the convergence of muon and further. arXiv e-prints, pages arXiv--2502, 2025
2025
-
[55]
Decoupled weight decay regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019
2019
-
[56]
Signmuon: Communication-efficient distributed muon optimization
Neel Mishra, Kushagara Trivedi, and Pawan Kumar. Signmuon: Communication-efficient distributed muon optimization. arXiv preprint arXiv:2605.16311, 2026
Pith/arXiv arXiv 2026
-
[57]
Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of Zolotarev's functions
Yuji Nakatsukasa and Roland W Freund. Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of Zolotarev's functions. SIAM Review, 58(3):461--493, 2016
2016
-
[58]
Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD
Yuji Nakatsukasa and Nicholas J Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD. SIAM Journal on Scientific Computing, 35(3):A1325--A1349, 2013
2013
-
[59]
Optimizing Halley's iteration for computing the matrix polar decomposition
Yuji Nakatsukasa, Zhaojun Bai, and François Gygi. Optimizing Halley's iteration for computing the matrix polar decomposition. SIAM Journal on Matrix Analysis and Applications, 31(5):2700--2720, 2010
2010
-
[60]
Training deep learning models with norm-constrained lmos
Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos. arXiv preprint arXiv:2502.07529, 2025
Pith/arXiv arXiv 2025
-
[61]
Muon is provably faster with momentum variance reduction
Xun Qian, Hussein Rammal, Dmitry Kovalev, and Peter Richtárik. Muon is provably faster with momentum variance reduction. arXiv preprint arXiv:2512.16598, 2025
arXiv 2025
-
[62]
Communication-efficient gluon in federated learning
Xun Qian, Alexander Gaponov, Grigory Malinovsky, and Peter Richtárik. Communication-efficient gluon in federated learning. arXiv preprint arXiv:2604.10689, 2026
Pith/arXiv arXiv 2026
-
[63]
On the convergence of adam and beyond
Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237, 2019
Pith/arXiv arXiv 2019
-
[64]
Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richtárik. Gluon: Making muon & scion great again! (bridging theory and practice of lmo-based optimizers for llms). arXiv preprint arXiv:2505.13416, 2025
arXiv 2025
-
[65]
Lions and muons: Optimization via stochastic frank-wolfe
Maria-Eleni Sfyraki and Jun-Kun Wang. Lions and muons: Optimization via stochastic frank-wolfe. arXiv preprint arXiv:2506.04192, 2025
arXiv 2025
-
[66]
On the convergence analysis of muon
Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon. arXiv preprint arXiv:2505.23737, 2025
Pith/arXiv arXiv 2025
-
[67]
Beyond the ideal: Analyzing the inexact muon update
Egor Shulgin, Sultan AlRashed, Francesco Orabona, and Peter Richtárik. Beyond the ideal: Analyzing the inexact muon update. arXiv preprint arXiv:2510.19933, 2025
arXiv 2025
-
[68]
Muonq: Enhancing low-bit muon quantization via directional fidelity optimization
Yupeng Su, Ruijie Zhang, Ziyue Liu, Yequan Zhao, and Zheng Zhang. Muonq: Enhancing low-bit muon quantization via directional fidelity optimization. arXiv preprint arXiv:2605.11396, 2026
Pith/arXiv arXiv 2026
-
[69]
Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models
Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9508--9520, 2024
2024
-
[70]
Why gradient clipping accelerates training: A theoretical justification for adaptivity
Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019
arXiv 2019
-
[71]
On provable benefits of muon in federated learning
Xinwen Zhang and Hongchang Gao. On provable benefits of muon in federated learning. arXiv preprint arXiv:2510.03866, 2025
arXiv 2025