Convergence Analysis of Muon-type Methods with Inexact LMO in the Degenerate Case

Peter Richt\'arik; Xun Qian

arxiv: 2606.21581 · v1 · pith:7OVOS26Enew · submitted 2026-06-19 · 🧮 math.OC

Convergence Analysis of Muon-type Methods with Inexact LMO in the Degenerate Case

Xun Qian , Peter Richt\'arik This is my paper

Pith reviewed 2026-06-26 13:26 UTC · model grok-4.3

classification 🧮 math.OC

keywords Muon optimizerinexact LMOdegenerate caseconvergence analysisnon-convex optimizationstar-convex optimizationweight decaylayer-wise smoothness

0 comments

The pith

Muon-type methods achieve convergence rates with inexact LMO even when momentum degeneracy occurs, via new assumptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that convergence rates can still be proven for Muon-type optimizers when the linear minimization oracle is solved only approximately and the rescaled momentum is allowed to degenerate with its smallest positive singular value approaching zero. This matters because real implementations rely on iterative approximate solvers for the LMO and degeneracy arises naturally in deep network training. The analysis covers both general non-convex objectives and star-convex objectives with weight decay. It relies on layer-wise (L^0, L^1)-smoothness and introduces novel assumptions to manage the coupling between inexactness and step-size selection in the degenerate regime. If correct, the guarantees extend to the practical settings where these methods are actually used.

Core claim

The authors claim that novel assumptions suffice to prove convergence rates for Muon-type methods using inexact LMO in degenerate scenarios, for the general non-convex case and the star-convex case with weight decay, under the layer-wise (L^0, L^1)-smooth assumption.

What carries the argument

Novel assumptions that address the coupling between inexact LMO solutions and optimal step size or momentum when the smallest positive singular value of the rescaled momentum can approach zero.

If this is right

Convergence rates hold for general non-convex problems under the layer-wise smoothness and new assumptions.
Convergence rates hold for star-convex problems with weight decay under the same conditions.
The results apply directly to practical inexact iterative solvers for the LMO without requiring a uniform lower bound on singular values.
The analysis covers the layer-wise (L^0, L^1)-smooth setting typical for deep architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

These guarantees suggest Muon-type methods remain theoretically supported when training very deep or large models where momentum degeneracy is common.
Empirical checks of whether the new assumptions hold on standard neural network losses would test how far the rates extend in practice.
The same style of assumptions could be tested on other momentum-based first-order methods that use approximate oracles.

Load-bearing premise

The novel assumptions introduced to handle inexact LMO in degenerate momentum cases are valid and sufficient for the stated rates.

What would settle it

A concrete numerical run of a Muon-type method on a simple non-convex problem where the smallest positive singular value approaches zero, the new assumptions are violated, and the predicted convergence rate fails to materialize.

read the original abstract

Muon-type methods have demonstrated potentially superior performance over Adam and its variants, and have shown hyperparameter transferability across model sizes when specific norms are chosen for the LMO in deep architectures. However, while the LMO is solved approximately via iterative algorithms in practice, most convergence analyses consider the ideal case where the search direction is the exact solution to the LMO. Recently, the inexact Muon update was analyzed by Shulgin et al. [2025], which reveals a fundamental coupling between the inexactness and the optimal step size and momentum. However, the convergence is guaranteed for the non-degenerate case only, i.e., the smallest positive singular value of the rescaled momentum is assumed to be bounded below by some positive constant when the spectral norm is used. In this work, we investigate Muon-type methods with inexact LMO in the degenerate case, where the smallest positive singular value of the rescaled momentum can approach zero, for the general non-convex case and the star-convex case with weight decay. Novel assumptions are proposed to address the challenges posed by inexact LMO in such degenerate scenarios, and convergence rates are established under the layer-wise $(L^0, L^1)$-smooth assumption for both cases.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Extends prior Muon inexact-LMO analysis to the degenerate case via new assumptions, but those assumptions are unverified and carry the result.

read the letter

The paper closes the explicit gap left by Shulgin et al. by treating the case where the smallest positive singular value of the rescaled momentum can approach zero. It gives convergence rates under the layer-wise (L^0, L^1)-smoothness assumption for both the general non-convex setting and the star-convex setting with weight decay, using the inexact LMO.

What is new is the set of assumptions introduced to control the interaction between inexactness and the vanishing singular value. The derivations appear to follow the same style as the earlier work once those assumptions are in place.

The soft spot is exactly those assumptions. They are stated as novel and are not shown to follow from standard boundedness or smoothness conditions, nor is there a simple low-dimensional example confirming they remain non-vacuous when the singular value decays at the rate the analysis permits. If they fail in that regime, the step-size/inexactness coupling identified previously reappears and the claimed rates do not hold.

This is for readers already working inside the Muon convergence program. Someone tracking theoretical support for these optimizers will want to see how the degenerate case is handled, but the value depends on whether the new conditions can be justified or relaxed.

I would send it to peer review so that the proofs and the plausibility of the assumptions can be checked directly.

Referee Report

3 major / 2 minor

Summary. The manuscript claims to establish convergence rates (including O(1/sqrt(T)) for non-convex objectives) for Muon-type methods using inexact linear minimization oracles (LMO) in the degenerate regime, where the smallest positive singular value of the rescaled momentum may approach zero. Novel assumptions are introduced to handle the interaction between inexactness and degeneracy; rates are derived for both the general non-convex case and the star-convex case with weight decay, under a layer-wise (L^0, L^1)-smoothness condition that extends prior work by Shulgin et al. (2025).

Significance. If the novel assumptions hold and are non-vacuous, the extension to the degenerate case is a meaningful contribution, as it removes a restrictive bounded-singular-value condition that may not hold in practice for iterative LMO solvers. The use of layer-wise smoothness is a positive feature for applicability to deep networks. The work provides a clear technical advance over the non-degenerate analysis, but its impact depends on whether the new assumptions can be verified or relaxed.

major comments (3)

[§3] §3 (novel assumptions): The assumptions introduced to control inexact LMO in the degenerate regime (where the smallest positive singular value of rescaled momentum approaches zero) are not shown to be implied by the layer-wise smoothness or standard boundedness conditions, nor is any concrete verification provided (e.g., a 2-by-2 matrix sequence where the singular value decays at the rate permitted by the analysis while the assumptions remain satisfied).
[Theorem 4.1] Theorem 4.1 (non-convex rate): The O(1/sqrt(T)) guarantee is derived under the new assumptions; the proof indicates that violation restores the inexactness-step-size coupling identified by Shulgin et al., yet no quantitative sensitivity result or example is given showing the assumptions hold at the boundary of allowed degeneracy.
[§5] §5 (star-convex case with weight decay): The interaction between weight decay, inexact LMO, and the degeneracy assumption is not separated from the non-convex analysis; it is unclear whether the claimed rate requires additional restrictions on the decay parameter when the singular value vanishes.

minor comments (2)

[§2] The precise statement of the layer-wise (L^0, L^1)-smoothness assumption should be moved to the preliminaries or early in §2 so that the introduction can focus on the novelty relative to Shulgin et al. without forward references.
Notation for the rescaled momentum and its singular values is introduced without a dedicated table or display equation summarizing all symbols; this makes cross-referencing between the degenerate-case assumptions and the rate proofs cumbersome.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the technical contribution of extending the analysis to the degenerate regime. We address each major comment below.

read point-by-point responses

Referee: [§3] §3 (novel assumptions): The assumptions introduced to control inexact LMO in the degenerate regime (where the smallest positive singular value of rescaled momentum approaches zero) are not shown to be implied by the layer-wise smoothness or standard boundedness conditions, nor is any concrete verification provided (e.g., a 2-by-2 matrix sequence where the singular value decays at the rate permitted by the analysis while the assumptions remain satisfied).

Authors: The novel assumptions are introduced precisely to capture the necessary control on inexact LMO under degeneracy, which is not implied by layer-wise (L^0, L^1)-smoothness or standard boundedness; this separation is the point of the contribution. We agree that an explicit verification example would improve clarity. In revision we will add a remark containing a 2-by-2 matrix sequence in which the smallest positive singular value decays at the rate permitted by the analysis while the assumptions continue to hold. revision: yes
Referee: [Theorem 4.1] Theorem 4.1 (non-convex rate): The O(1/sqrt(T)) guarantee is derived under the new assumptions; the proof indicates that violation restores the inexactness-step-size coupling identified by Shulgin et al., yet no quantitative sensitivity result or example is given showing the assumptions hold at the boundary of allowed degeneracy.

Authors: The proof of Theorem 4.1 already identifies the boundary by showing that violation of the assumptions recovers the inexactness-step-size coupling of Shulgin et al. To make this boundary behavior more explicit, we will add a short discussion paragraph (and, if space permits, a brief numerical illustration) in the revised manuscript. revision: partial
Referee: [§5] §5 (star-convex case with weight decay): The interaction between weight decay, inexact LMO, and the degeneracy assumption is not separated from the non-convex analysis; it is unclear whether the claimed rate requires additional restrictions on the decay parameter when the singular value vanishes.

Authors: Section 5 re-uses the same degeneracy assumptions as the non-convex case but exploits star-convexity together with the explicit weight-decay term inside the potential function; the proof does not introduce any further restrictions on the decay parameter. We will insert a clarifying sentence at the beginning of §5 that explicitly separates the two analyses and states that no additional restriction on the decay parameter is required. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper extends the external result of Shulgin et al. [2025] on inexact Muon updates by introducing novel assumptions to handle the degenerate case where the smallest positive singular value can approach zero. Convergence rates are then derived under the layer-wise (L^0, L^1)-smoothness assumption for non-convex and star-convex settings. No quoted steps reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations; the cited prior work is treated as independent external support, and the new assumptions are explicitly presented as additions rather than tautological redefinitions of the target rates. The derivation chain therefore remains self-contained against the provided abstract and description.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only access yields no concrete free parameters, axioms, or invented entities; the paper relies on standard non-convex and star-convex assumptions plus newly proposed conditions whose precise statements are not given.

pith-pipeline@v0.9.1-grok · 5754 in / 1221 out tokens · 43292 ms · 2026-06-26T13:26:26.406598+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

50 extracted references · 14 linked inside Pith

[3]

International Conference on Artificial Intelligence and Statistics , pages=

Parameter-agnostic optimization under relaxed smoothness , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2024 , organization=

2024
[6]

URL https://kellerjordan

Muon: An optimizer for hidden layers in neural networks , author=. URL https://kellerjordan. github. io/posts/muon , volume=
[7]

2008 , publisher=

Functions of matrices: theory and computation , author=. 2008 , publisher=

2008
[8]

SIAM Journal on Matrix Analysis and Applications , volume=

Small singular values can increase in lower precision , author=. SIAM Journal on Matrix Analysis and Applications , volume=. 2024 , publisher=

2024
[10]

International conference on learning representations (ICLR) , volume=

Adam: A method for stochastic optimization , author=. International conference on learning representations (ICLR) , volume=. 2015 , organization=

2015
[12]

International Conference on Learning Representations , year=

Decoupled Weight Decay Regularization , author=. International Conference on Learning Representations , year=
[13]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2024 , publisher=

2024
[15]

Artificial intelligence and statistics , pages=

Stochastic spectral descent for restricted Boltzmann machines , author=. Artificial intelligence and statistics , pages=. 2015 , organization=

2015
[16]

IEEE Journal of Selected Topics in Signal Processing , volume=

Stochastic spectral descent for discrete graphical models , author=. IEEE Journal of Selected Topics in Signal Processing , volume=. 2015 , publisher=

2015
[17]

Advances in neural information processing systems , volume=

Preconditioned spectral descent for deep learning , author=. Advances in neural information processing systems , volume=
[18]

arXiv e-prints , pages=

A note on the convergence of muon and further , author=. arXiv e-prints , pages=
[25]

2025 , eprint=

LiMuon: Light and Fast Muon Optimizer for Large Models , author=. 2025 , eprint=

2025
[32]

SIAM Journal on Matrix Analysis and Applications , volume=

Optimizing Halley's iteration for computing the matrix polar decomposition , author=. SIAM Journal on Matrix Analysis and Applications , volume=. 2010 , publisher=

2010
[33]

SIAM Journal on Scientific Computing , volume=

Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD , author=. SIAM Journal on Scientific Computing , volume=. 2013 , publisher=

2013
[34]

siam REVIEW , volume=

Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of Zolotarev's functions , author=. siam REVIEW , volume=. 2016 , publisher=

2016
[37]

The polar express: Optimal matrix sign methods and their application to the muon algorithm

Noah Amsel, David Persson, Christopher Musco, and Robert M Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm. arXiv preprint arXiv:2505.16932, 2025

Pith/arXiv arXiv 2025
[38]

Old optimizer, new norm: An anthology

Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024

Pith/arXiv arXiv 2024
[39]

Small singular values can increase in lower precision

Christos Boutsikas, Petros Drineas, and Ilse CF Ipsen. Small singular values can increase in lower precision. SIAM Journal on Matrix Analysis and Applications, 45 0 (3): 0 1518--1540, 2024

2024
[40]

Stochastic spectral descent for restricted boltzmann machines

David Carlson, Volkan Cevher, and Lawrence Carin. Stochastic spectral descent for restricted boltzmann machines. In Artificial intelligence and statistics, pages 111--119. PMLR, 2015 a

2015
[41]

Stochastic spectral descent for discrete graphical models

David Carlson, Ya-Ping Hsieh, Edo Collins, Lawrence Carin, and Volkan Cevher. Stochastic spectral descent for discrete graphical models. IEEE Journal of Selected Topics in Signal Processing, 10 0 (2): 0 296--311, 2015 b

2015
[42]

Preconditioned spectral descent for deep learning

David E Carlson, Edo Collins, Ya-Ping Hsieh, Lawrence Carin, and Volkan Cevher. Preconditioned spectral descent for deep learning. Advances in neural information processing systems, 28, 2015 c

2015
[43]

On the convergence of muon and beyond

Da Chang, Yongxiang Liu, and Ganzhao Yuan. On the convergence of muon and beyond. arXiv preprint arXiv:2509.15816, 2025

Pith/arXiv arXiv 2025
[44]

Muon with nesterov momentum: Heavy-tailed noise and (randomized) inexact polar decomposition

Sayantan Choudhury, Xiaoran Cheng, Martin Tak \'a c , Sen Na, and Mladen Kolar. Muon with nesterov momentum: Heavy-tailed noise and (randomized) inexact polar decomposition. arXiv preprint arXiv:2605.06884, 2026

Pith/arXiv arXiv 2026
[45]

Error feedback for muon and friends

Kaja Gruntkowska, Alexander Gaponov, Zhirayr Tovmasyan, and Peter Richt \'a rik. Error feedback for muon and friends. arXiv preprint arXiv:2510.00643, 2025

arXiv 2025
[46]

Low-rank orthogonalization for large-scale matrix optimization with applications to foundation model training

Chuan He, Zhanwang Deng, and Zhaosong Lu. Low-rank orthogonalization for large-scale matrix optimization with applications to foundation model training. arXiv preprint arXiv:2509.11983, 2025

Pith/arXiv arXiv 2025
[47]

Functions of matrices: theory and computation

Nicholas J Higham. Functions of matrices: theory and computation. SIAM, 2008

2008
[48]

Limuon: Light and fast muon optimizer for large models, 2025

Feihu Huang, Yuning Luo, and Songcan Chen. Limuon: Light and fast muon optimizer for large models, 2025. URL https://arxiv.org/abs/2509.14562

Pith/arXiv arXiv 2025
[49]

Muon: An optimizer for hidden layers in neural networks

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. URL https://kellerjordan. github. io/posts/muon, 6, 2024

2024
[50]

A study of bfloat16 for deep learning training

Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, et al. A study of bfloat16 for deep learning training. arXiv preprint arXiv:1905.12322, 2019

Pith/arXiv arXiv 1905
[51]

Adam: A method for stochastic optimization

Diederik Kinga, Jimmy Ba Adam, et al. Adam: A method for stochastic optimization. In International conference on learning representations (ICLR), volume 5. California;, 2015

2015
[52]

Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization

Dmitry Kovalev. Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization. arXiv preprint arXiv:2503.12645, 2025

arXiv 2025
[53]

Non-euclidean sgd for structured optimization: Unified analysis and improved rates

Dmitry Kovalev and Ekaterina Borodich. Non-euclidean sgd for structured optimization: Unified analysis and improved rates. arXiv preprint arXiv:2511.11466, 2025

Pith/arXiv arXiv 2025
[54]

A note on the convergence of muon and further

Jiaxiang Li and Mingyi Hong. A note on the convergence of muon and further. arXiv e-prints, pages arXiv--2502, 2025

2025
[55]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

2019
[56]

Signmuon: Communication-efficient distributed muon optimization

Neel Mishra, Kushagara Trivedi, and Pawan Kumar. Signmuon: Communication-efficient distributed muon optimization. arXiv preprint arXiv:2605.16311, 2026

Pith/arXiv arXiv 2026
[57]

Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of zolotarev's functions

Yuji Nakatsukasa and Roland W Freund. Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of zolotarev's functions. siam REVIEW, 58 0 (3): 0 461--493, 2016

2016
[58]

Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the svd

Yuji Nakatsukasa and Nicholas J Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the svd. SIAM Journal on Scientific Computing, 35 0 (3): 0 A1325--A1349, 2013

2013
[59]

Optimizing halley's iteration for computing the matrix polar decomposition

Yuji Nakatsukasa, Zhaojun Bai, and Fran c ois Gygi. Optimizing halley's iteration for computing the matrix polar decomposition. SIAM Journal on Matrix Analysis and Applications, 31 0 (5): 0 2700--2720, 2010

2010
[60]

Training deep learning models with norm-constrained lmos

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos. arXiv preprint arXiv:2502.07529, 2025

Pith/arXiv arXiv 2025
[61]

Muon is provably faster with momentum variance reduction

Xun Qian, Hussein Rammal, Dmitry Kovalev, and Peter Richtarik. Muon is provably faster with momentum variance reduction. arXiv preprint arXiv:2512.16598, 2025

arXiv 2025
[62]

Communication-efficient gluon in federated learning

Xun Qian, Alexander Gaponov, Grigory Malinovsky, and Peter Richt \'a rik. Communication-efficient gluon in federated learning. arXiv preprint arXiv:2604.10689, 2026

Pith/arXiv arXiv 2026
[63]

On the convergence of adam and beyond

Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237, 2019

Pith/arXiv arXiv 1904
[64]

Gluon: Making muon & scion great again!(bridging theory and practice of lmo-based optimizers for llms)

Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richt \'a rik. Gluon: Making muon & scion great again!(bridging theory and practice of lmo-based optimizers for llms). arXiv preprint arXiv:2505.13416, 2025

arXiv 2025
[65]

Lions and muons: Optimization via stochastic frank-wolfe

Maria-Eleni Sfyraki and Jun-Kun Wang. Lions and muons: Optimization via stochastic frank-wolfe. arXiv preprint arXiv:2506.04192, 2025

arXiv 2025
[66]

On the convergence analysis of muon

Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon. arXiv preprint arXiv:2505.23737, 2025

Pith/arXiv arXiv 2025
[67]

Beyond the ideal: Analyzing the inexact muon update

Egor Shulgin, Sultan AlRashed, Francesco Orabona, and Peter Richt \'a rik. Beyond the ideal: Analyzing the inexact muon update. arXiv preprint arXiv:2510.19933, 2025

arXiv 2025
[68]

Muonq: Enhancing low-bit muon quantization via directional fidelity optimization

Yupeng Su, Ruijie Zhang, Ziyue Liu, Yequan Zhao, and Zheng Zhang. Muonq: Enhancing low-bit muon quantization via directional fidelity optimization. arXiv preprint arXiv:2605.11396, 2026

Pith/arXiv arXiv 2026
[69]

Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models

Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46 0 (12): 0 9508--9520, 2024

2024
[70]

Why gradient clipping accelerates training: A theoretical justification for adaptivity

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019

arXiv 1905
[71]

On provable benefits of muon in federated learning

Xinwen Zhang and Hongchang Gao. On provable benefits of muon in federated learning. arXiv preprint arXiv:2510.03866, 2025

arXiv 2025

[1] [3]

International Conference on Artificial Intelligence and Statistics , pages=

Parameter-agnostic optimization under relaxed smoothness , author=. International Conference on Artificial Intelligence and Statistics , pages=. 2024 , organization=

2024

[2] [6]

URL https://kellerjordan

Muon: An optimizer for hidden layers in neural networks , author=. URL https://kellerjordan. github. io/posts/muon , volume=

[3] [7]

2008 , publisher=

Functions of matrices: theory and computation , author=. 2008 , publisher=

2008

[4] [8]

SIAM Journal on Matrix Analysis and Applications , volume=

Small singular values can increase in lower precision , author=. SIAM Journal on Matrix Analysis and Applications , volume=. 2024 , publisher=

2024

[5] [10]

International conference on learning representations (ICLR) , volume=

Adam: A method for stochastic optimization , author=. International conference on learning representations (ICLR) , volume=. 2015 , organization=

2015

[6] [12]

International Conference on Learning Representations , year=

Decoupled Weight Decay Regularization , author=. International Conference on Learning Representations , year=

[7] [13]

IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2024 , publisher=

2024

[8] [15]

Artificial intelligence and statistics , pages=

Stochastic spectral descent for restricted Boltzmann machines , author=. Artificial intelligence and statistics , pages=. 2015 , organization=

2015

[9] [16]

IEEE Journal of Selected Topics in Signal Processing , volume=

Stochastic spectral descent for discrete graphical models , author=. IEEE Journal of Selected Topics in Signal Processing , volume=. 2015 , publisher=

2015

[10] [17]

Advances in neural information processing systems , volume=

Preconditioned spectral descent for deep learning , author=. Advances in neural information processing systems , volume=

[11] [18]

arXiv e-prints , pages=

A note on the convergence of muon and further , author=. arXiv e-prints , pages=

[12] [25]

2025 , eprint=

LiMuon: Light and Fast Muon Optimizer for Large Models , author=. 2025 , eprint=

2025

[13] [32]

SIAM Journal on Matrix Analysis and Applications , volume=

Optimizing Halley's iteration for computing the matrix polar decomposition , author=. SIAM Journal on Matrix Analysis and Applications , volume=. 2010 , publisher=

2010

[14] [33]

SIAM Journal on Scientific Computing , volume=

Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the SVD , author=. SIAM Journal on Scientific Computing , volume=. 2013 , publisher=

2013

[15] [34]

siam REVIEW , volume=

Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of Zolotarev's functions , author=. siam REVIEW , volume=. 2016 , publisher=

2016

[16] [37]

The polar express: Optimal matrix sign methods and their application to the muon algorithm

Noah Amsel, David Persson, Christopher Musco, and Robert M Gower. The polar express: Optimal matrix sign methods and their application to the muon algorithm. arXiv preprint arXiv:2505.16932, 2025

Pith/arXiv arXiv 2025

[17] [38]

Old optimizer, new norm: An anthology

Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024

Pith/arXiv arXiv 2024

[18] [39]

Small singular values can increase in lower precision

Christos Boutsikas, Petros Drineas, and Ilse CF Ipsen. Small singular values can increase in lower precision. SIAM Journal on Matrix Analysis and Applications, 45 0 (3): 0 1518--1540, 2024

2024

[19] [40]

Stochastic spectral descent for restricted boltzmann machines

David Carlson, Volkan Cevher, and Lawrence Carin. Stochastic spectral descent for restricted boltzmann machines. In Artificial intelligence and statistics, pages 111--119. PMLR, 2015 a

2015

[20] [41]

Stochastic spectral descent for discrete graphical models

David Carlson, Ya-Ping Hsieh, Edo Collins, Lawrence Carin, and Volkan Cevher. Stochastic spectral descent for discrete graphical models. IEEE Journal of Selected Topics in Signal Processing, 10 0 (2): 0 296--311, 2015 b

2015

[21] [42]

Preconditioned spectral descent for deep learning

David E Carlson, Edo Collins, Ya-Ping Hsieh, Lawrence Carin, and Volkan Cevher. Preconditioned spectral descent for deep learning. Advances in neural information processing systems, 28, 2015 c

2015

[22] [43]

On the convergence of muon and beyond

Da Chang, Yongxiang Liu, and Ganzhao Yuan. On the convergence of muon and beyond. arXiv preprint arXiv:2509.15816, 2025

Pith/arXiv arXiv 2025

[23] [44]

Muon with nesterov momentum: Heavy-tailed noise and (randomized) inexact polar decomposition

Sayantan Choudhury, Xiaoran Cheng, Martin Tak \'a c , Sen Na, and Mladen Kolar. Muon with nesterov momentum: Heavy-tailed noise and (randomized) inexact polar decomposition. arXiv preprint arXiv:2605.06884, 2026

Pith/arXiv arXiv 2026

[24] [45]

Error feedback for muon and friends

Kaja Gruntkowska, Alexander Gaponov, Zhirayr Tovmasyan, and Peter Richt \'a rik. Error feedback for muon and friends. arXiv preprint arXiv:2510.00643, 2025

arXiv 2025

[25] [46]

Low-rank orthogonalization for large-scale matrix optimization with applications to foundation model training

Chuan He, Zhanwang Deng, and Zhaosong Lu. Low-rank orthogonalization for large-scale matrix optimization with applications to foundation model training. arXiv preprint arXiv:2509.11983, 2025

Pith/arXiv arXiv 2025

[26] [47]

Functions of matrices: theory and computation

Nicholas J Higham. Functions of matrices: theory and computation. SIAM, 2008

2008

[27] [48]

Limuon: Light and fast muon optimizer for large models, 2025

Feihu Huang, Yuning Luo, and Songcan Chen. Limuon: Light and fast muon optimizer for large models, 2025. URL https://arxiv.org/abs/2509.14562

Pith/arXiv arXiv 2025

[28] [49]

Muon: An optimizer for hidden layers in neural networks

Keller Jordan, Yuchen Jin, Vlado Boza, You Jiacheng, Franz Cecista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. URL https://kellerjordan. github. io/posts/muon, 6, 2024

2024

[29] [50]

A study of bfloat16 for deep learning training

Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, et al. A study of bfloat16 for deep learning training. arXiv preprint arXiv:1905.12322, 2019

Pith/arXiv arXiv 1905

[30] [51]

Adam: A method for stochastic optimization

Diederik Kinga, Jimmy Ba Adam, et al. Adam: A method for stochastic optimization. In International conference on learning representations (ICLR), volume 5. California;, 2015

2015

[31] [52]

Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization

Dmitry Kovalev. Understanding gradient orthogonalization for deep learning via non-euclidean trust-region optimization. arXiv preprint arXiv:2503.12645, 2025

arXiv 2025

[32] [53]

Non-euclidean sgd for structured optimization: Unified analysis and improved rates

Dmitry Kovalev and Ekaterina Borodich. Non-euclidean sgd for structured optimization: Unified analysis and improved rates. arXiv preprint arXiv:2511.11466, 2025

Pith/arXiv arXiv 2025

[33] [54]

A note on the convergence of muon and further

Jiaxiang Li and Mingyi Hong. A note on the convergence of muon and further. arXiv e-prints, pages arXiv--2502, 2025

2025

[34] [55]

Decoupled weight decay regularization

Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019

2019

[35] [56]

Signmuon: Communication-efficient distributed muon optimization

Neel Mishra, Kushagara Trivedi, and Pawan Kumar. Signmuon: Communication-efficient distributed muon optimization. arXiv preprint arXiv:2605.16311, 2026

Pith/arXiv arXiv 2026

[36] [57]

Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of zolotarev's functions

Yuji Nakatsukasa and Roland W Freund. Computing fundamental matrix decompositions accurately via the matrix sign function in two iterations: The power of zolotarev's functions. siam REVIEW, 58 0 (3): 0 461--493, 2016

2016

[37] [58]

Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the svd

Yuji Nakatsukasa and Nicholas J Higham. Stable and efficient spectral divide and conquer algorithms for the symmetric eigenvalue decomposition and the svd. SIAM Journal on Scientific Computing, 35 0 (3): 0 A1325--A1349, 2013

2013

[38] [59]

Optimizing halley's iteration for computing the matrix polar decomposition

Yuji Nakatsukasa, Zhaojun Bai, and Fran c ois Gygi. Optimizing halley's iteration for computing the matrix polar decomposition. SIAM Journal on Matrix Analysis and Applications, 31 0 (5): 0 2700--2720, 2010

2010

[39] [60]

Training deep learning models with norm-constrained lmos

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos. arXiv preprint arXiv:2502.07529, 2025

Pith/arXiv arXiv 2025

[40] [61]

Muon is provably faster with momentum variance reduction

Xun Qian, Hussein Rammal, Dmitry Kovalev, and Peter Richtarik. Muon is provably faster with momentum variance reduction. arXiv preprint arXiv:2512.16598, 2025

arXiv 2025

[41] [62]

Communication-efficient gluon in federated learning

Xun Qian, Alexander Gaponov, Grigory Malinovsky, and Peter Richt \'a rik. Communication-efficient gluon in federated learning. arXiv preprint arXiv:2604.10689, 2026

Pith/arXiv arXiv 2026

[42] [63]

On the convergence of adam and beyond

Sashank J Reddi, Satyen Kale, and Sanjiv Kumar. On the convergence of adam and beyond. arXiv preprint arXiv:1904.09237, 2019

Pith/arXiv arXiv 1904

[43] [64]

Gluon: Making muon & scion great again!(bridging theory and practice of lmo-based optimizers for llms)

Artem Riabinin, Egor Shulgin, Kaja Gruntkowska, and Peter Richt \'a rik. Gluon: Making muon & scion great again!(bridging theory and practice of lmo-based optimizers for llms). arXiv preprint arXiv:2505.13416, 2025

arXiv 2025

[44] [65]

Lions and muons: Optimization via stochastic frank-wolfe

Maria-Eleni Sfyraki and Jun-Kun Wang. Lions and muons: Optimization via stochastic frank-wolfe. arXiv preprint arXiv:2506.04192, 2025

arXiv 2025

[45] [66]

On the convergence analysis of muon

Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon. arXiv preprint arXiv:2505.23737, 2025

Pith/arXiv arXiv 2025

[46] [67]

Beyond the ideal: Analyzing the inexact muon update

Egor Shulgin, Sultan AlRashed, Francesco Orabona, and Peter Richt \'a rik. Beyond the ideal: Analyzing the inexact muon update. arXiv preprint arXiv:2510.19933, 2025

arXiv 2025

[47] [68]

Muonq: Enhancing low-bit muon quantization via directional fidelity optimization

Yupeng Su, Ruijie Zhang, Ziyue Liu, Yequan Zhao, and Zheng Zhang. Muonq: Enhancing low-bit muon quantization via directional fidelity optimization. arXiv preprint arXiv:2605.11396, 2026

Pith/arXiv arXiv 2026

[48] [69]

Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models

Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46 0 (12): 0 9508--9520, 2024

2024

[49] [70]

Why gradient clipping accelerates training: A theoretical justification for adaptivity

Jingzhao Zhang, Tianxing He, Suvrit Sra, and Ali Jadbabaie. Why gradient clipping accelerates training: A theoretical justification for adaptivity. arXiv preprint arXiv:1905.11881, 2019

arXiv 1905

[50] [71]

On provable benefits of muon in federated learning

Xinwen Zhang and Hongchang Gao. On provable benefits of muon in federated learning. arXiv preprint arXiv:2510.03866, 2025

arXiv 2025