
arxiv: 2509.15816 · v5 · submitted 2025-09-19 · 💻 cs.LG

On the Convergence of Muon and Beyond

Pith reviewed 2026-05-18 15:50 UTC · model grok-4.3

classification 💻 cs.LG
keywords Muon optimizer · variance reduction · convergence analysis · stochastic non-convex optimization · momentum methods · neural network training · Polyak-Lojasiewicz condition

The pith

A variance-reduced version of the Muon optimizer attains the optimal anytime convergence rate of roughly O(T to the minus one third).

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the two-batch variance-reduced variant Muon-MVR2 reaches the best possible anytime convergence rate of roughly O(T to the minus one third) when learning rates do not depend on knowing the total number of steps ahead of time. This rate matches the known lower bound for stochastic non-convex problems and improves on prior Muon analyses, which only reached a slower O(T to the minus one quarter). A reader would care because Muon is already used successfully in practice for neural network parameters that have matrix structure, so matching the theoretical optimum narrows the gap between observed performance and guarantees. Under the Polyak-Lojasiewicz condition, the work further gives best-iterate rates on square-root suboptimality for both variants and, when an additional uniform gradient bound holds along the iterates, last-iterate rates on the objective gap.

Core claim

The central claim is that under horizon-free learning-rate schedules, variance reduction enables Muon-MVR2 to attain the optimal anytime convergence rate of tilde O(T to the minus one third) in stochastic non-convex settings, matching the lower bound for this problem class. The authors present this as the first rigorous proof of such a rate for Muon variants. Under the Polyak-Lojasiewicz condition, Muon-MVR1 and Muon-MVR2 achieve best-iterate rates of tilde O(T to the minus one quarter) and tilde O(T to the minus one third) for expected square-root suboptimality, and with an additional uniform gradient bound along the iterates they reach last-iterate rates of O(T to the minus one quarter) and O(T to the minus one third) for the objective gap, respectively.

What carries the argument

Horizon-free learning-rate schedules paired with momentum-based variance reduction in the one-batch Muon-MVR1 and two-batch Muon-MVR2. These control stochastic gradient variance to support the improved rate proofs.
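
The page does not spell out the update rule, so the following is only a minimal sketch of how momentum-based variance reduction can be combined with an orthogonalized (Muon-style) step. It assumes the STORM estimator of Cutkosky and Orabona, which evaluates one fresh sample at two consecutive iterates, and it uses an exact SVD polar factor in place of whatever orthogonalization routine the paper uses. The function names, the toy least-squares objective, the additive noise model, and the constants in `eta` and `beta` are all hypothetical choices for illustration, not the authors' Muon-MVR2.

```python
import numpy as np

def orthogonalize(M):
    """Polar factor U @ Vt of M: same singular vectors, singular values set to 1."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def stoch_grad(X, A, B, noise):
    """Gradient of 0.5 * ||A X - B||_F^2 plus one additive noise sample (toy oracle)."""
    return A.T @ (A @ X - B) + noise

def mvr2_sketch(A, B, steps=200, sigma=0.1, seed=0):
    """STORM-style variance-reduced momentum with an orthogonalized step (illustrative)."""
    rng = np.random.default_rng(seed)
    X = np.zeros((A.shape[1], B.shape[1]))
    X_prev, M = X.copy(), None
    losses = []
    for t in range(1, steps + 1):
        eta = 0.1 / t ** (1.0 / 3.0)          # horizon-free step size, depends on t only
        beta = 1.0 - 1.0 / t ** (2.0 / 3.0)   # assumed momentum weight, also horizon-free
        noise = sigma * rng.standard_normal(X.shape)  # one fresh sample per iteration
        g_new = stoch_grad(X, A, B, noise)
        if M is None:
            M = g_new
        else:
            # Same sample evaluated at the previous iterate: the correction term
            g_old = stoch_grad(X_prev, A, B, noise)
            M = g_new + beta * (M - g_old)
        X_prev = X
        X = X - eta * orthogonalize(M)
        losses.append(0.5 * np.linalg.norm(A @ X - B) ** 2)
    return X, losses
```

Calling `mvr2_sketch(A, B)` on a small well-conditioned problem should drive the loss well below its value at the zero initialization; the sketch says nothing about the rates proved in the paper.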

If this is right

  • Muon-MVR2 matches the optimal convergence rate for the class of stochastic non-convex problems.
  • Under the Polyak-Lojasiewicz condition the variants deliver anytime guarantees on square-root suboptimality and objective gap.
  • The two-batch design in Muon-MVR2 is what enables the tighter rate compared with the one-batch version.
  • Experiments on CIFAR-10 classification and C4 language modeling back the practical value of these variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Variance reduction may be worth testing on other matrix-aware optimizers that currently lack tight theory.
  • The horizon-free property suggests these variants can be used in training runs where the length is not fixed in advance.
  • Checking whether the same rate gains appear in very large models could show if the analysis scales beyond the reported experiments.

Load-bearing premise

The proofs rely on horizon-free learning-rate schedules together with standard assumptions of bounded variance and smoothness; if the target setting violates these, the stated rates may not apply.

What would settle it

An experiment on a smooth non-convex objective with bounded variance where Muon-MVR2 fails to improve beyond O(T to the negative one quarter) under a horizon-free schedule would falsify the central rate claim.
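
One way such an experiment could be read out is by fitting the slope of the convergence measure against the iteration count on a log-log scale. The helper below is a hypothetical sketch, not part of the paper or its codebase: `values` is assumed to hold a positive convergence measure per iteration (for example a running average of gradient norms), and `burn_in` is an arbitrary cutoff for discarding early transients.

```python
import numpy as np

def fit_rate_exponent(values, burn_in=10):
    """Least-squares slope of log(values[t]) against log(t) for t > burn_in."""
    values = np.asarray(values, dtype=float)
    t = np.arange(1, len(values) + 1)
    mask = t > burn_in
    slope, _intercept = np.polyfit(np.log(t[mask]), np.log(values[mask]), 1)
    return slope
```

A fitted slope near -1/3 would be consistent with the claimed rate and a slope near -1/4 with the older one. A single noisy run cannot separate the two reliably, so averaging over seeds and long horizons would be needed before drawing any conclusion.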

Figures

Figures reproduced from arXiv: 2509.15816 by Da Chang, Ganzhao Yuan, Yongxiang Liu.

Figure 1: Training dynamics of Muon-MVR2, Muon-MVR1, Muon-MVR1 (...
Figure 2: LLaMA2-130M train and validation curves on C4 Dataset
Original abstract

The Muon optimizer has demonstrated remarkable empirical success in handling matrix-structured parameters for training neural networks. However, a significant gap remains between its practical performance and theoretical understanding. Existing analyses show that the Muon variants achieve only a suboptimal ergodic convergence rate of $\mathcal{O}(T^{-1/4})$ in stochastic non-convex settings, where $T$ denotes the number of iterations. To study the theoretical limits of Muon, we analyze two momentum-based variance-reduced variants: the one-batch Muon-MVR1 and the two-batch Muon-MVR2. We provide the first rigorous proof that, under \textbf{horizon-free} learning-rate schedules, variance reduction enables Muon-MVR2 to attain the optimal anytime convergence rate $\widetilde{\mathcal{O}}(T^{-1/3})$, matching the lower bound for this problem class. Under the Polyak--\L{}ojasiewicz (PL) condition, we establish anytime guarantees for Muon-MVR1 and Muon-MVR2: they attain best-iterate rates of $\widetilde{\mathcal{O}}(T^{-1/4})$ and $\widetilde{\mathcal{O}}(T^{-1/3})$ for the expected square-root suboptimality, and, given an additional uniform gradient bound along the iterates, achieve last-iterate rates of $\mathcal{O}(T^{-1/4})$ and $\mathcal{O}(T^{-1/3})$ for the objective gap, respectively. Experiments on CIFAR-10 and C4 support the practical effectiveness of the proposed variance-reduced Muon variants. Code is available at \href{https://github.com/MaeChd/MUON-MVR}{Muon-MVR} Codebase.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper analyzes the Muon optimizer for matrix-structured parameters and introduces two momentum-based variance-reduced variants: one-batch Muon-MVR1 and two-batch Muon-MVR2. It provides the first rigorous proofs that, under horizon-free learning-rate schedules in stochastic non-convex settings, Muon-MVR2 attains an optimal anytime convergence rate of ~O(T^{-1/3}) matching the lower bound for the problem class. Under the PL condition, it establishes anytime best-iterate rates of ~O(T^{-1/4}) and ~O(T^{-1/3}) for expected square-root suboptimality, and last-iterate rates of O(T^{-1/4}) and O(T^{-1/3}) for the objective gap (with an additional uniform gradient bound). Experiments on CIFAR-10 and C4 support practical effectiveness, with code released.

Significance. If the convergence claims hold, this work would meaningfully advance the theoretical understanding of Muon by closing the gap with its empirical success in neural network training. Strengths include the provision of rigorous proofs for horizon-free schedules, reproducible code at the linked GitHub repository, and explicit rates under both general stochastic non-convex and PL settings, which could guide optimizer improvements.

major comments (2)
  1. [Abstract] Abstract and main convergence theorems: the claim that Muon-MVR2 attains the optimal ~O(T^{-1/3}) anytime rate 'matching the lower bound for this problem class.' Standard lower bounds for finding an epsilon-stationary point in online stochastic non-convex optimization require Omega(epsilon^{-4}) stochastic gradient queries, implying no algorithm with constant queries per iteration can exceed O(T^{-1/4}). Muon-MVR2 uses two batches per iteration (total queries Theta(T)), so the setting must be clarified as online stochastic versus finite-sum, along with whether T counts iterations or total samples and which specific lower bound is being matched. This directly underpins the 'first rigorous proof' and optimality assertions.
  2. [Convergence Analysis] Proofs of the main rates (likely in the convergence analysis section): the derivations rely on horizon-free learning-rate schedules together with bounded variance and smoothness. The exact theorem statements should explicitly list these assumptions and confirm they suffice for the claimed rates without additional hidden restrictions on the query model.
minor comments (2)
  1. [Experiments] Figure clarity in the experimental section: ensure error bars or multiple runs are reported for the CIFAR-10 and C4 results to allow direct comparison with baselines.
  2. [Notation and Preliminaries] Notation consistency: verify uniform use of tilde-O notation and definition of 'anytime' convergence across theorems and the abstract.
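
Major comment 1 moves between an iteration rate and a query count, and the conversion can be written out as a short worked step. The constant C, the choice of norm, the use of the iterate x_T, and the omission of logarithmic factors are simplifying assumptions made here for illustration, not statements from the paper.

```latex
\begin{align*}
\mathbb{E}\,\|\nabla f(x_T)\| \le C\,T^{-1/4} \le \varepsilon
  &\iff T \ge (C/\varepsilon)^{4}
  &&\text{(query count } \mathcal{O}(\varepsilon^{-4}) \text{ at constant queries per step)} \\
\mathbb{E}\,\|\nabla f(x_T)\| \le C\,T^{-1/3} \le \varepsilon
  &\iff T \ge (C/\varepsilon)^{3}
  &&\text{(query count } N = 2T = \mathcal{O}(\varepsilon^{-3}) \text{ with two queries per step)}
\end{align*}
```

Read this way, the question in the comment is which lower bound, of order epsilon^{-4} or epsilon^{-3}, applies under the paper's assumptions.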

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments. We appreciate the opportunity to clarify the problem setting, query model, and theorem assumptions. We address each major comment below and will incorporate the necessary revisions.

point-by-point responses
  1. Referee: [Abstract] Abstract and main convergence theorems: the claim that Muon-MVR2 attains the optimal ~O(T^{-1/3}) anytime rate 'matching the lower bound for this problem class.' Standard lower bounds for finding an epsilon-stationary point in online stochastic non-convex optimization require Omega(epsilon^{-4}) stochastic gradient queries, implying no algorithm with constant queries per iteration can exceed O(T^{-1/4}). Muon-MVR2 uses two batches per iteration (total queries Theta(T)), so the setting must be clarified as online stochastic versus finite-sum, along with whether T counts iterations or total samples and which specific lower bound is being matched. This directly underpins the 'first rigorous proof' and optimality assertions.

    Authors: We thank the referee for highlighting this important clarification. Our analysis is conducted in the standard stochastic non-convex optimization setting with access to an unbiased stochastic gradient oracle satisfying bounded variance. Here T denotes the number of iterations. Muon-MVR2 performs two independent stochastic gradient evaluations per iteration, yielding a total of Θ(T) queries. The claimed anytime rate of ~O(T^{-1/3}) is expressed in terms of iterations and corresponds to a rate of ~O(N^{-1/3}) in the total number of queries N, that is, a sample complexity of ~O(ε^{-3}). This matches the Ω(ε^{-3}) lower bound for variance-reduced methods in stochastic non-convex optimization (as achieved by SPIDER-type algorithms), which is strictly smaller than the Ω(ε^{-4}) bound that applies to non-variance-reduced methods such as plain SGD. We will revise the abstract and introduction to explicitly state the stochastic oracle model, confirm that T counts iterations (with total samples Θ(T)), and identify the matched lower bound as the one for variance-reduced stochastic methods. These changes strengthen the presentation without altering the proofs or rates. revision: yes

  2. Referee: [Convergence Analysis] Proofs of the main rates (likely in the convergence analysis section): the derivations rely on horizon-free learning-rate schedules together with bounded variance and smoothness. The exact theorem statements should explicitly list these assumptions and confirm they suffice for the claimed rates without additional hidden restrictions on the query model.

    Authors: We agree that explicit enumeration of assumptions will improve clarity. In the revised manuscript we will update the statements of the main theorems to list all assumptions in full: (i) L-smoothness of the objective function, (ii) bounded variance of the stochastic gradients, (iii) the specific horizon-free learning-rate schedule (of the form η_t ∝ t^{-1/3}), and (iv) the query model in which each iteration of Muon-MVR2 makes exactly two independent calls to the stochastic gradient oracle. These assumptions are standard, sufficient for the derived rates, and are the only ones used in the proofs; no additional or hidden restrictions on the query model are present. We will also add a short remark immediately after the theorem statements confirming that the rates hold under precisely these conditions. revision: yes
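
The distinction between a horizon-free schedule and one tuned to a known horizon can be made concrete with a small sketch. The form η_t ∝ t^{-1/3} is the one quoted in the response above; the base value `eta0` and the horizon-dependent comparison schedule are hypothetical and only illustrate the contrast.

```python
def horizon_free_lr(t, eta0=0.1):
    """Step size at iteration t >= 1; depends only on t, so training can stop at any time."""
    return eta0 / t ** (1.0 / 3.0)

def horizon_dependent_lr(t, T, eta0=0.1):
    """Constant step size tuned to a known total horizon T (illustrative contrast only)."""
    return eta0 / T ** (1.0 / 3.0)
```

With the first schedule a guarantee can be stated for every stopping time T simultaneously, which is what "anytime" refers to here. With the second, the value of T must be fixed before training starts.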

Circularity Check

0 steps flagged

No circularity: direct convergence proofs under standard assumptions

full rationale

The paper derives convergence rates for Muon-MVR1 and Muon-MVR2 via explicit analysis of variance-reduced momentum updates under horizon-free schedules, bounded variance, and smoothness. These steps rely on standard stochastic optimization lemmas rather than reducing any claimed rate to a fitted parameter, self-citation chain, or definitional equivalence. The optimality assertion references an external lower bound for the problem class without importing uniqueness theorems or ansatzes from prior author work. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard stochastic optimization assumptions (smoothness, bounded variance) and the Polyak-Lojasiewicz condition for the stronger rates; no free parameters or invented entities are introduced.

axioms (2)
  • domain assumption The objective satisfies the Polyak-Lojasiewicz condition for the PL-rate results.
    Invoked explicitly for the best-iterate and last-iterate guarantees under PL.
  • domain assumption Horizon-free learning-rate schedules are admissible.
    Required for the anytime O(T^{-1/3}) claim.
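
For readers unfamiliar with the first axiom, the Polyak-Lojasiewicz condition in its standard form is sketched below, together with the one-line consequence that explains why the PL results are stated for square-root suboptimality. Here μ > 0 is the PL constant and f^⋆ the minimal objective value; the use of the Euclidean (Frobenius) norm is an assumption, and the paper may state the condition in a different norm.

```latex
\tfrac{1}{2}\,\|\nabla f(x)\|^{2} \;\ge\; \mu\,\bigl(f(x) - f^{\star}\bigr)
\quad\Longrightarrow\quad
\sqrt{f(x) - f^{\star}} \;\le\; \frac{\|\nabla f(x)\|}{\sqrt{2\mu}} .
```

Under this condition, any bound on the gradient norm at an iterate translates directly into a bound on the square root of its suboptimality.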

pith-pipeline@v0.9.0 · 5834 in / 1296 out tokens · 53918 ms · 2026-05-18T15:50:16.967609+00:00 · methodology


Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

    cs.LG 2026-05 unverdicted novelty 8.0

    SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

  2. Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

    math.OC 2026-05 conditional novelty 7.0

    Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.

  3. Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters

    cs.LG 2026-05 unverdicted novelty 7.0

    Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

  4. Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition

    math.OC 2026-05 unverdicted novelty 7.0

    Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

  5. Dimension-Free Saddle-Point Escape in Muon

    cs.LG 2026-05 unverdicted novelty 6.0

    Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.

  6. Muon Does Not Converge on Convex Lipschitz Functions

    cs.LG 2026-05 unverdicted novelty 6.0

    Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.

  7. MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

    cs.LG 2026-03 unverdicted novelty 6.0

    MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

  8. Anytime Training with Schedule-Free Spectral Optimization

    cs.LG 2026-05 unverdicted novelty 5.0

    SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.

  9. MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models

    cs.LG 2026-05 unverdicted novelty 5.0

    MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.

  10. Communication-Efficient Gluon in Federated Learning

    cs.LG 2026-04 unverdicted novelty 5.0

    Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

  11. HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

    cs.LG 2026-03 unverdicted novelty 5.0

    HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
