On the Convergence of Muon and Beyond

Da Chang; Ganzhao Yuan; Yongxiang Liu

arxiv: 2509.15816 · v5 · submitted 2025-09-19 · 💻 cs.LG

Pith reviewed 2026-05-18 15:50 UTC · model grok-4.3

classification 💻 cs.LG

keywords Muon optimizer · variance reduction · convergence analysis · stochastic non-convex optimization · momentum methods · neural network training · Polyak-Lojasiewicz condition

The pith

A variance-reduced version of the Muon optimizer attains the optimal anytime convergence rate of ~O(T^{-1/3}), where the tilde hides logarithmic factors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that the two-batch variance-reduced Muon-MVR2 reaches the best possible anytime convergence rate of ~O(T^{-1/3}), where T is the number of iterations, when learning rates do not depend on knowing the total number of steps ahead of time. This rate matches the known lower bound for stochastic non-convex problems and improves on prior Muon analyses that only reached a slower O(T^{-1/4}). A reader would care because Muon is already used successfully in practice for neural network parameters that have matrix structure, so matching the theoretical optimum narrows the gap between observed performance and guarantees. Under the Polyak-Lojasiewicz condition, the work further gives best-iterate rates for the square-root suboptimality of both variants and, when an extra uniform gradient bound holds along the iterates, last-iterate rates for the objective gap.

Core claim

The central claim is that under horizon-free learning-rate schedules, variance reduction enables Muon-MVR2 to attain the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings, matching the lower bound for this problem class. This is the first rigorous proof of such a rate for Muon variants. Under the Polyak-Lojasiewicz condition, Muon-MVR1 and Muon-MVR2 achieve best-iterate rates of ~O(T^{-1/4}) and ~O(T^{-1/3}) for expected square-root suboptimality, and with an additional uniform gradient bound along the iterates they reach last-iterate rates of O(T^{-1/4}) and O(T^{-1/3}) for the objective gap, respectively.

What carries the argument

Horizon-free learning-rate schedules paired with momentum-based variance reduction in the one-batch Muon-MVR1 and two-batch Muon-MVR2. These control stochastic gradient variance to support the improved rate proofs.
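
As a rough illustration of what a momentum-based variance-reduced Muon step can look like, the sketch below combines a STORM-style gradient estimator with an orthogonalized update. It is not the authors' algorithm: the recursion, the exact SVD-based polar factor (practical Muon implementations typically approximate it with a Newton-Schulz iteration), the toy objective, and all names (`polar_factor`, `mvr_muon_step`, `grad_fn`, `beta`, `lr`) are assumptions made for this example.

```python
import numpy as np


def polar_factor(m):
    """Orthogonalize m: return U @ Vt from its thin SVD (singular values set to 1)."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt


def mvr_muon_step(w, w_prev, m_prev, grad_fn, batch, lr, beta):
    """One hypothetical two-evaluation variance-reduced step (illustrative only)."""
    g_now = grad_fn(w, batch)       # stochastic gradient at the current iterate
    g_old = grad_fn(w_prev, batch)  # same batch, previous iterate
    # Equivalent to (1 - beta) * g_now + beta * (m_prev + g_now - g_old);
    # sharing the batch cancels most of the noise in g_now - g_old.
    m = g_now + beta * (m_prev - g_old)
    return w - lr * polar_factor(m), m


# Toy problem (assumed): f(W) = 0.5 * ||W - target||_F^2 with additive gradient noise.
rng = np.random.default_rng(0)
target = rng.normal(size=(4, 3))


def grad_fn(w, batch):
    return (w - target) + batch  # `batch` plays the role of the sampled noise


w = np.zeros((4, 3))
w_prev = w.copy()
m = grad_fn(w, 0.1 * rng.normal(size=(4, 3)))  # initial gradient estimate
initial_dist = np.linalg.norm(w - target)

for t in range(1, 201):
    batch = 0.1 * rng.normal(size=(4, 3))
    lr = 0.3 * t ** (-1.0 / 3.0)  # horizon-free: depends only on t
    w_next, m = mvr_muon_step(w, w_prev, m, grad_fn, batch, lr, beta=0.9)
    w_prev, w = w, w_next

final_dist = np.linalg.norm(w - target)
```

The step costs two gradient evaluations per iteration, which is the price of the noise cancellation; a one-evaluation variant would have to reuse a gradient computed on an earlier batch instead.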

If this is right

Muon-MVR2 matches the optimal convergence rate for the class of stochastic non-convex problems.
Under the Polyak-Lojasiewicz condition the variants deliver anytime guarantees on square-root suboptimality and objective gap.
The two-batch design in Muon-MVR2 is what enables the tighter rate compared with the one-batch version.
Experiments on CIFAR-10 classification and C4 language modeling back the practical value of these variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Variance reduction may be worth testing on other matrix-aware optimizers that currently lack tight theory.
The horizon-free property suggests these variants can be used in training runs where the length is not fixed in advance.
Checking whether the same rate gains appear in very large models could show if the analysis scales beyond the reported experiments.

Load-bearing premise

The proofs rely on horizon-free learning-rate schedules together with standard assumptions of bounded variance and smoothness; if the target setting violates these, the stated rates may not apply.

What would settle it

An experiment on a smooth non-convex objective with bounded variance where Muon-MVR2 fails to improve beyond O(T^{-1/4}) under a horizon-free schedule would falsify the central rate claim.
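
One way to run such a check empirically is to fit the slope of the stationarity measure against the iteration count on log-log axes. The sketch below uses hypothetical names and synthetic data standing in for measured gradient norms; because the ~O notation hides logarithmic factors and a theorem only gives an upper bound, a fitted slope is indicative rather than conclusive.

```python
import numpy as np


def fit_rate_exponent(steps, values):
    """Slope p of the least-squares line log(values) = p * log(steps) + c."""
    slope, _intercept = np.polyfit(np.log(steps), np.log(values), deg=1)
    return slope


# Synthetic stand-in for a measured curve (assumption: no real data is used here).
steps = np.arange(100, 100_001, 100)
synthetic = 2.0 * steps ** (-1.0 / 3.0)
best_so_far = np.minimum.accumulate(synthetic)  # best-iterate view of the curve
exponent = fit_rate_exponent(steps, best_so_far)
```

On real runs, a slope near -1/4 rather than -1/3 over a long range of iterations would be the pattern worth investigating.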

Figures

Figures reproduced from arXiv: 2509.15816 by Da Chang, Ganzhao Yuan, Yongxiang Liu.

**Figure 2.** LLaMA2-130M train and validation curves on the C4 dataset.

Original abstract

The Muon optimizer has demonstrated remarkable empirical success in handling matrix-structured parameters for training neural networks. However, a significant gap remains between its practical performance and theoretical understanding. Existing analyses show that the Muon variants achieve only a suboptimal ergodic convergence rate of $\mathcal{O}(T^{-1/4})$ in stochastic non-convex settings, where $T$ denotes the number of iterations. To study the theoretical limits of Muon, we analyze two momentum-based variance-reduced variants: the one-batch Muon-MVR1 and the two-batch Muon-MVR2. We provide the first rigorous proof that, under \textbf{horizon-free} learning-rate schedules, variance reduction enables Muon-MVR2 to attain the optimal anytime convergence rate $\widetilde{\mathcal{O}}(T^{-1/3})$, matching the lower bound for this problem class. Under the Polyak--\L{}ojasiewicz (PL) condition, we establish anytime guarantees for Muon-MVR1 and Muon-MVR2: they attain best-iterate rates of $\widetilde{\mathcal{O}}(T^{-1/4})$ and $\widetilde{\mathcal{O}}(T^{-1/3})$ for the expected square-root suboptimality, and, given an additional uniform gradient bound along the iterates, achieve last-iterate rates of $\mathcal{O}(T^{-1/4})$ and $\mathcal{O}(T^{-1/3})$ for the objective gap, respectively. Experiments on CIFAR-10 and C4 support the practical effectiveness of the proposed variance-reduced Muon variants. Code is available at \href{https://github.com/MaeChd/MUON-MVR}{Muon-MVR} Codebase.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper gives the first proof of an O(T^{-1/3}) anytime rate for two-batch variance-reduced Muon under horizon-free schedules, but the optimality claim needs checking against standard stochastic query lower bounds.

The main thing to know is that this paper proves an O(T^{-1/3}) anytime convergence rate for the two-batch variance-reduced Muon-MVR2 under horizon-free learning rate schedules in stochastic non-convex optimization, and claims this matches the lower bound for the class. It also gives rates under the Polyak-Lojasiewicz condition for both MVR1 and MVR2, including last-iterate results with an extra gradient bound assumption.

The paper does well by addressing the theory gap for Muon, which has good practical results but only had suboptimal O(T^{-1/4}) ergodic rates before. Moving to anytime guarantees is practical because training often stops early. The PL extensions are relevant since many loss landscapes have some curvature. The CIFAR-10 and C4 experiments show the variants are competitive, and the open code lets others check and extend the work.

The soft spots center on whether the claimed optimality holds up under standard query complexity. The usual lower bound for stochastic non-convex smooth optimization requires Omega(epsilon^{-4}) gradient evaluations to reach an epsilon-stationary point, which means O(T^{-1/4}) is the best possible if T counts iterations and each uses a fixed number of samples. Since Muon-MVR2 takes two batches per iteration, the total sample count is still Theta(T). To get T^{-1/3} and call it optimal, the analysis likely redefines T as total samples or assumes a finite-sum setting where full gradients are available periodically. The abstract does not make this explicit, so the proofs need to clarify the model and show the lower bound they match. The standard assumptions of bounded variance and smoothness are fine, but any extra conditions for the horizon-free schedule could affect real-world use.

This paper targets researchers focused on the theoretical foundations of optimizers used in large-scale training, particularly matrix-structured parameters. A reader interested in variance reduction techniques or anytime convergence would get the most from it. The combination of new rates, PL results, and experiments makes it worth a serious referee's time, even if the query model details need verification. I recommend putting it through peer review rather than desk rejecting it.
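
The back-and-forth between rates and query counts in this letter rests on a simple conversion. As a sketch, assume a bound of the generic form C T^{-p} on the stationarity measure; the constant C, the exponent p, and the target accuracy ε are notation introduced here, and logarithmic factors are ignored:

```latex
C\,T^{-p} \le \varepsilon
\quad\Longleftrightarrow\quad
T \ge \left(\frac{C}{\varepsilon}\right)^{1/p},
\qquad\text{so}\qquad
p = \tfrac{1}{4} \;\Rightarrow\; T = \mathcal{O}(\varepsilon^{-4}),
\qquad
p = \tfrac{1}{3} \;\Rightarrow\; T = \mathcal{O}(\varepsilon^{-3}).
```

With a constant number of stochastic gradient queries per iteration, the same exponents carry over from iterations to total queries.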

Referee Report

2 major / 2 minor

Summary. The paper analyzes the Muon optimizer for matrix-structured parameters and introduces two momentum-based variance-reduced variants: one-batch Muon-MVR1 and two-batch Muon-MVR2. It provides the first rigorous proofs that, under horizon-free learning-rate schedules in stochastic non-convex settings, Muon-MVR2 attains an optimal anytime convergence rate of ~O(T^{-1/3}) matching the lower bound for the problem class. Under the PL condition, it establishes anytime best-iterate rates of ~O(T^{-1/4}) and ~O(T^{-1/3}) for expected square-root suboptimality, and last-iterate rates of O(T^{-1/4}) and O(T^{-1/3}) for the objective gap (with an additional uniform gradient bound). Experiments on CIFAR-10 and C4 support practical effectiveness, with code released.

Significance. If the convergence claims hold, this work would meaningfully advance the theoretical understanding of Muon by closing the gap with its empirical success in neural network training. Strengths include the provision of rigorous proofs for horizon-free schedules, reproducible code at the linked GitHub repository, and explicit rates under both general stochastic non-convex and PL settings, which could guide optimizer improvements.

major comments (2)

[Abstract] Abstract and main convergence theorems: the claim that Muon-MVR2 attains the optimal ~O(T^{-1/3}) anytime rate 'matching the lower bound for this problem class.' Standard lower bounds for finding an epsilon-stationary point in online stochastic non-convex optimization require Omega(epsilon^{-4}) stochastic gradient queries, implying no algorithm with constant queries per iteration can exceed O(T^{-1/4}). Muon-MVR2 uses two batches per iteration (total queries Theta(T)), so the setting must be clarified as online stochastic versus finite-sum, along with whether T counts iterations or total samples and which specific lower bound is being matched. This directly underpins the 'first rigorous proof' and optimality assertions.
[Convergence Analysis] Proofs of the main rates (likely in the convergence analysis section): the derivations rely on horizon-free learning-rate schedules together with bounded variance and smoothness. The exact theorem statements should explicitly list these assumptions and confirm they suffice for the claimed rates without additional hidden restrictions on the query model.

minor comments (2)

[Experiments] Figure clarity in the experimental section: ensure error bars or multiple runs are reported for the CIFAR-10 and C4 results to allow direct comparison with baselines.
[Notation and Preliminaries] Notation consistency: verify uniform use of tilde-O notation and definition of 'anytime' convergence across theorems and the abstract.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful review and constructive comments. We appreciate the opportunity to clarify the problem setting, query model, and theorem assumptions. We address each major comment below and will incorporate the necessary revisions.

Referee: [Abstract] Abstract and main convergence theorems: the claim that Muon-MVR2 attains the optimal ~O(T^{-1/3}) anytime rate 'matching the lower bound for this problem class.' Standard lower bounds for finding an epsilon-stationary point in online stochastic non-convex optimization require Omega(epsilon^{-4}) stochastic gradient queries, implying no algorithm with constant queries per iteration can exceed O(T^{-1/4}). Muon-MVR2 uses two batches per iteration (total queries Theta(T)), so the setting must be clarified as online stochastic versus finite-sum, along with whether T counts iterations or total samples and which specific lower bound is being matched. This directly underpins the 'first rigorous proof' and optimality assertions.

Authors: We thank the referee for highlighting this important clarification. Our analysis is conducted in the standard stochastic non-convex optimization setting with access to an unbiased stochastic gradient oracle satisfying bounded variance. Here T denotes the number of iterations. Muon-MVR2 performs two independent stochastic gradient evaluations per iteration, yielding a total of Θ(T) queries. The claimed anytime rate of ~O(T^{-1/3}) is expressed in terms of iterations and, since the total number of queries N is Θ(T), corresponds to a rate of ~O(N^{-1/3}) in N, that is, an overall sample complexity of ~O(ε^{-3}). This matches the Ω(ε^{-3}) lower bound for variance-reduced methods in stochastic non-convex optimization (as achieved by SPIDER-type algorithms), which is strictly better than the Ω(ε^{-4}) bound that applies to non-variance-reduced methods such as plain SGD. We will revise the abstract and introduction to explicitly state the stochastic oracle model, confirm that T counts iterations (with total samples Θ(T)), and identify the matched lower bound as the one for variance-reduced stochastic methods. These changes strengthen the presentation without altering the proofs or rates. revision: yes
Referee: [Convergence Analysis] Proofs of the main rates (likely in the convergence analysis section): the derivations rely on horizon-free learning-rate schedules together with bounded variance and smoothness. The exact theorem statements should explicitly list these assumptions and confirm they suffice for the claimed rates without additional hidden restrictions on the query model.

Authors: We agree that explicit enumeration of assumptions will improve clarity. In the revised manuscript we will update the statements of the main theorems to list all assumptions in full: (i) L-smoothness of the objective function, (ii) bounded variance of the stochastic gradients, (iii) the specific horizon-free learning-rate schedule (of the form η_t ∝ t^{-1/3}), and (iv) the query model in which each iteration of Muon-MVR2 makes exactly two independent calls to the stochastic gradient oracle. These assumptions are standard, sufficient for the derived rates, and are the only ones used in the proofs; no additional or hidden restrictions on the query model are present. We will also add a short remark immediately after the theorem statements confirming that the rates hold under precisely these conditions. revision: yes
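
To make "horizon-free" concrete, the sketch below contrasts a step size that depends only on the current iteration with one that needs the total number of steps up front. The t^{-1/3} decay follows the form quoted in the response above; the constant `eta0`, the function names, and the horizon-dependent contrast schedule are illustrative assumptions.

```python
def horizon_free_lr(t, eta0=0.1):
    """Step size at iteration t >= 1; uses no knowledge of the total horizon."""
    return eta0 * t ** (-1.0 / 3.0)


def horizon_dependent_lr(t, total_steps, eta0=0.1):
    """Contrast case (assumed): a constant step tuned to a known horizon."""
    return eta0 * total_steps ** (-1.0 / 3.0)


# The first 100 horizon-free steps are identical whether the run
# lasts 100 or 10_000 iterations.
short_run = [horizon_free_lr(t) for t in range(1, 101)]
long_run = [horizon_free_lr(t) for t in range(1, 10_001)]
prefix_matches = short_run == long_run[:100]
```

Because the schedule never reads the run length, stopping early or extending a run does not invalidate the steps already taken, which is what an anytime guarantee needs.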

Circularity Check

0 steps flagged

No circularity: direct convergence proofs under standard assumptions

The paper derives convergence rates for Muon-MVR1 and Muon-MVR2 via explicit analysis of variance-reduced momentum updates under horizon-free schedules, bounded variance, and smoothness. These steps rely on standard stochastic optimization lemmas rather than reducing any claimed rate to a fitted parameter, self-citation chain, or definitional equivalence. The optimality assertion references an external lower bound for the problem class without importing uniqueness theorems or ansatzes from prior author work. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard stochastic optimization assumptions (smoothness, bounded variance) and the Polyak-Lojasiewicz condition for the stronger rates; no free parameters or invented entities are introduced.

axioms (2)

domain assumption The objective satisfies the Polyak-Lojasiewicz condition for the PL-rate results.
Invoked explicitly for the best-iterate and last-iterate guarantees under PL.
domain assumption Horizon-free learning-rate schedules are admissible.
Required for the anytime O(T^{-1/3}) claim.
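
For reference, the Polyak-Lojasiewicz condition in its standard Euclidean form is sketched below. The paper may state it with a different norm suited to matrix parameters, and the symbols μ and f^⋆ are notation introduced here.

```latex
\frac{1}{2}\,\|\nabla f(x)\|^{2} \;\ge\; \mu\,\bigl(f(x) - f^{\star}\bigr)
\quad\text{for all } x,
\qquad \mu > 0,\quad f^{\star} = \inf_{x} f(x),
\qquad\text{hence}\qquad
\sqrt{f(x) - f^{\star}} \;\le\; \frac{\|\nabla f(x)\|}{\sqrt{2\mu}}.
```

The rearranged inequality shows why a bound on the gradient norm translates directly into a bound on square-root suboptimality.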

Forward citations

Cited by 11 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
cs.LG 2026-05 unverdicted novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
math.OC 2026-05 conditional novelty 7.0

Proposes equivariant optimizers matched to the symmetry groups of embeddings, SwiGLU projections and MoE routers, with experiments showing consistent gains over AdamW on language model pre-training.

Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
cs.LG 2026-05 unverdicted novelty 7.0

Spectral clipping of leading singular values in gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}).

Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition
math.OC 2026-05 unverdicted novelty 7.0

Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.

Dimension-Free Saddle-Point Escape in Muon
cs.LG 2026-05 unverdicted novelty 6.0

Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.

Muon Does Not Converge on Convex Lipschitz Functions
cs.LG 2026-05 unverdicted novelty 6.0

Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
cs.LG 2026-03 unverdicted novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

Anytime Training with Schedule-Free Spectral Optimization
cs.LG 2026-05 unverdicted novelty 5.0

SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.

MiMuon: Mixed Muon Optimizer with Improved Generalization for Large Models
cs.LG 2026-05 unverdicted novelty 5.0

MiMuon is a hybrid optimizer that achieves a generalization error bound of O(1/N) independent of the small singular-value gap that limits the original Muon bound, while retaining the same O(1/T^{1/4}) convergence rate.

Communication-Efficient Gluon in Federated Learning
cs.LG 2026-04 unverdicted novelty 5.0

Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction
cs.LG 2026-03 unverdicted novelty 5.0

HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 11 Pith papers · 11 internal anchors

[1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

work page
[2]

@esa (Ref

\@ifxundefined[1] #1\@undefined \@firstoftwo \@secondoftwo \@ifnum[1] #1 \@firstoftwo \@secondoftwo \@ifx[1] #1 \@firstoftwo \@secondoftwo [2] @ #1 \@temptokena #2 #1 @ \@temptokena \@ifclassloaded agu2001 natbib The agu2001 class already includes natbib coding, so you should not add it explicitly Type <Return> for now, but then later remove the command n...

work page
[3]

\@lbibitem[] @bibitem@first@sw\@secondoftwo \@lbibitem[#1]#2 \@extra@b@citeb \@ifundefined br@#2\@extra@b@citeb \@namedef br@#2 \@nameuse br@#2\@extra@b@citeb \@ifundefined b@#2\@extra@b@citeb @num @parse #2 @tmp #1 NAT@b@open@#2 NAT@b@shut@#2 \@ifnum @merge>\@ne @bibitem@first@sw \@firstoftwo \@ifundefined NAT@b*@#2 \@firstoftwo @num @NAT@ctr \@secondoft...

work page
[4]

@open @close @open @close and [1] URL: #1 \@ifundefined chapter * \@mkboth \@ifxundefined @sectionbib * \@mkboth * \@mkboth\@gobbletwo \@ifclassloaded amsart * \@ifclassloaded amsbook * \@ifxundefined @heading @heading NAT@ctr thebibliography [1] @ \@biblabel @NAT@ctr \@bibsetup #1 @NAT@ctr @ @openbib .11em \@plus.33em \@minus.07em 4000 4000 `\.\@m @bibit...

work page
[5]

Asgo: Adaptive structured gradient optimization.arXiv preprint arXiv:2503.20762,

Kang An, Yuxing Liu, Rui Pan, Shiqian Ma, Donald Goldfarb, and Tong Zhang. Asgo: Adaptive structured gradient optimization. ArXiv, abs/2503.20762, 2025. URL https://api.semanticscholar.org/CorpusID:277321722

work page arXiv 2025
[6]

Lower bounds for non-convex stochastic optimization

Yossi Arjevani, Yair Carmon, John C Duchi, Dylan J Foster, Nathan Srebro, and Blake Woodworth. Lower bounds for non-convex stochastic optimization. Mathematical Programming, 199 0 (1): 0 165--214, 2023

work page 2023
[7]

Old Optimizer, New Norm: An Anthology

Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[8]

Muon optimizes under spectral norm constraints.arXiv preprint arXiv:2506.15054,

Lizhang Chen, Jonathan Li, and Qiang Liu. Muon optimizes under spectral norm constraints. arXiv preprint arXiv:2506.15054, 2025

work page arXiv 2025
[9]

Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Yao Liu, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, and Quoc V. Le. Symbolic discovery of optimization algorithms. ArXiv, abs/2302.06675, 2023. URL https://api.semanticscholar.org/CorpusID:256846990

work page arXiv 2023
[10]

On the Convergence of A Class of Adam-Type Algorithms for Non-Convex Optimization

Xiangyi Chen, Sijia Liu, Ruoyu Sun, and Mingyi Hong. On the convergence of a class of adam-type algorithms for non-convex optimization. ArXiv, abs/1808.02941, 2018. URL https://api.semanticscholar.org/CorpusID:51952942

work page internal anchor Pith review Pith/arXiv arXiv 2018
[11]

Momentum-based variance reduction in non-convex sgd

Ashok Cutkosky and Francesco Orabona. Momentum-based variance reduction in non-convex sgd. ArXiv, abs/1905.10018, 2019. URL https://api.semanticscholar.org/CorpusID:165163984

work page arXiv 1905
[12]

Incorporating nesterov momentum into adam

Timothy Dozat. Incorporating nesterov momentum into adam. 2016

work page 2016
[13]

arXiv preprint arXiv:2502.04664 , year=

Chen Fan, Mark Schmidt, and Christos Thrampoulidis. Implicit bias of spectral descent and muon on multiclass separable data. arXiv preprint arXiv:2502.04664, 2025

work page arXiv 2025
[14]

Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator

Cong Fang, Chris Junchi Li, Zhouchen Lin, and Tong Zhang. Spider: Near-optimal non-convex optimization via stochastic path-integrated differential estimator. Advances in neural information processing systems, 31, 2018

work page 2018
[15]

Stochastic first- and zeroth-order methods for nonconvex stochastic programming.SIAM Journal on Optimization, 23(4):2341–2368, 2013a

Zhishuai Guo, Yi Xu, Wotao Yin, Rong Jin, and Tianbao Yang. A novel convergence analysis for algorithms of the adam family. ArXiv, abs/2112.03459, 2021. URL https://api.semanticscholar.org/CorpusID:244920672

work page arXiv 2021
[16]

Shampoo: Preconditioned stochastic tensor optimization

Vineet Gupta, Tomer Koren, and Yoram Singer. Shampoo: Preconditioned stochastic tensor optimization. In International Conference on Machine Learning, 2018. URL https://api.semanticscholar.org/CorpusID:3585068

work page 2018
[17]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 770--778, 2016

work page 2016
[18]

Convergence of adam for non-convex objectives: Relaxed hyperparameters and non-ergodic case

Meixuan He, Yuqing Liang, Jinlan Liu, and Dongpo Xu. Convergence of adam for non-convex objectives: Relaxed hyperparameters and non-ergodic case. arXiv preprint arXiv:2307.11782, 2023

work page arXiv 2023
[19]

Lecture 6a overview of mini--batch gradient descent

Geoffrey Hinton, Nitish Srivastava, and Kevin Swersky. Lecture 6a overview of mini--batch gradient descent. Coursera Lecture slides https://class. coursera. org/neuralnets-2012-001/lecture,[Online, 2012

work page 2012
[20]

Training Compute-Optimal Large Language Models

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[21]

Super-adam: Faster and universal framework of adaptive gradients

Feihu Huang, Junyi Li, and Heng Huang. Super-adam: Faster and universal framework of adaptive gradients. ArXiv, abs/2106.08208, 2021. URL https://api.semanticscholar.org/CorpusID:235436027

work page arXiv 2021
[22]

Muon: An optimizer for hidden layers in neural networks, 2024

Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks, 2024. URL https://kellerjordan.github.io/posts/muon/

work page 2024
[23]

Linear convergence of gradient and proximal-gradient methods under the polyak- ojasiewicz condition

Hamed Karimi, Julie Nutini, and Mark Schmidt. Linear convergence of gradient and proximal-gradient methods under the polyak- ojasiewicz condition. In Joint European conference on machine learning and knowledge discovery in databases, pp.\ 795--811. Springer, 2016

work page 2016
[24]

Adam: A Method for Stochastic Optimization

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. URL https://api.semanticscholar.org/CorpusID:6628106

work page internal anchor Pith review Pith/arXiv arXiv 2014
[25]

ArXiv Preprint: 2505.21799 , Year =

Tim Tsz-Kit Lau, Qi Long, and Weijie Su. Polargrad: A class of matrix-gradient optimizers from a unifying preconditioning perspective. arXiv preprint arXiv:2505.21799, 2025

work page arXiv 2025
[26]

Convergence of adam under relaxed assumptions

Haochuan Li, Ali Jadbabaie, and Alexander Rakhlin. Convergence of adam under relaxed assumptions. ArXiv, abs/2304.13972, 2023. URL https://api.semanticscholar.org/CorpusID:258352491

work page arXiv 2023
[27]

A note on the convergence of muon

Jiaxiang Li and Mingyi Hong. A note on the convergence of muon. 2025. URL https://api.semanticscholar.org/CorpusID:276116929

work page 2025
[28]

Simple and optimal stochastic gradient methods for nonsmooth nonconvex optimization

Zhize Li and Jian Li. Simple and optimal stochastic gradient methods for nonsmooth nonconvex optimization. Journal of Machine Learning Research, 23 0 (239): 0 1--61, 2022

work page 2022
[29]

Cautious optimizers: Improving training with one line of code.arXiv preprint arXiv:2411.16085,

Kaizhao Liang, Lizhang Chen, Bo Liu, and Qiang Liu. Cautious optimizers: Improving training with one line of code. ArXiv, abs/2411.16085, 2024. URL https://api.semanticscholar.org/CorpusID:274234738

work page arXiv 2024
[30]

Sophia: A scalable stochastic second-order optimizer for language model pre-training, 2024a

Hong Liu, Zhiyuan Li, David Leo Wright Hall, Percy Liang, and Tengyu Ma. Sophia: A scalable stochastic second-order optimizer for language model pre-training. ArXiv, abs/2305.14342, 2023. URL https://api.semanticscholar.org/CorpusID:258841030

work page arXiv 2023
[31]

Muon is Scalable for LLM Training

Jingyuan Liu, Jianling Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, Yanru Chen, Huabin Zheng, Yibo Liu, Shaowei Liu, Bohong Yin, Weiran He, Han Zhu, Yuzhi Wang, Jianzhou Wang, Meng Dong, Zheng Zhang, Yongsheng Kang, Hao Zhang, Xinran Xu, Yutao Zhang, Yuxin Wu, Xinyu Zhou, and Zhilin Yang. Muon is scala...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms

Liming Liu, Zhenghao Xu, Zixuan Zhang, Hao Kang, Zichong Li, Chen Liang, Weizhu Chen, and Tuo Zhao. Cosmos: A hybrid adaptive optimizer for memory-efficient training of llms. arXiv preprint arXiv:2502.17410, 2025 b

work page arXiv 2025
[33]

On the variance of the adaptive learning rate and beyond.arXiv preprint arXiv:1908.03265,

Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, and Jiawei Han. On the variance of the adaptive learning rate and beyond. ArXiv, abs/1908.03265, 2019. URL https://api.semanticscholar.org/CorpusID:199528271

work page arXiv 1908
[34]

Decoupled Weight Decay Regularization

I Loshchilov. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[35]

Adaptive Gradient Methods with Dynamic Bound of Learning Rate

Liangchen Luo, Yuanhao Xiong, Yan Liu, and Xu Sun. Adaptive gradient methods with dynamic bound of learning rate. ArXiv, abs/1902.09843, 2019. URL https://api.semanticscholar.org/CorpusID:67856101

work page internal anchor Pith review Pith/arXiv arXiv 1902
[36]

A method for solving the convex programming problem with convergence rate o(1/k^2)

Yurii Nesterov. A method for solving the convex programming problem with convergence rate o(1/k^2) . Proceedings of the USSR Academy of Sciences, 269: 0 543--547, 1983. URL https://api.semanticscholar.org/CorpusID:145918791

work page 1983
[37]

Training Deep Learning Models with Norm-Constrained LMOs

Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. Training deep learning models with norm-constrained lmos. arXiv preprint arXiv:2502.07529, 2025

work page internal anchor Pith review arXiv 2025
[38]

Exploring the limits of transfer learning with a unified text-to-text transformer

Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21 0 (140): 0 1--67, 2020

work page 2020
[39]

On the Convergence of Adam and Beyond

Sashank J. Reddi, Satyen Kale, and Surinder Kumar. On the convergence of adam and beyond. ArXiv, abs/1904.09237, 2018. URL https://api.semanticscholar.org/CorpusID:3455897

work page internal anchor Pith review Pith/arXiv arXiv 1904
[40]

Convergence bound and critical batch size of muon optimizer

Naoki Sato, Hiroki Naganuma, and Hideaki Iiduka. Convergence bound and critical batch size of muon optimizer. 2025. URL https://api.semanticscholar.org/CorpusID:280140878

work page 2025
[41]

Lions and muons: Optimization via stochastic frank-wolfe

Maria-Eleni Sfyraki and Jun-Kun Wang. Lions and muons: Optimization via stochastic frank-wolfe. arXiv preprint arXiv:2506.04192, 2025

[42]

Practical efficiency of muon for pretraining

Ishaan Shah, Anthony M Polloreno, Karl Stratos, Philip Monk, Adarsh Chaluvaraju, Andrew Hojel, Andrew Ma, Anil Thomas, Ashish Tanwer, Darsh J Shah, et al. Practical efficiency of muon for pretraining. arXiv preprint arXiv:2505.02222, 2025

[43]

On the Convergence Analysis of Muon

Wei Shen, Ruichuan Huang, Minhui Huang, Cong Shen, and Jiawei Zhang. On the convergence analysis of muon. arXiv preprint arXiv:2505.23737, 2025

[44]

Adamuon: Adaptive muon optimizer

Chongjie Si, Debing Zhang, and Wei Shen. Adamuon: Adaptive muon optimizer. arXiv preprint arXiv:2507.11005, 2025

[45]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin R. Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Daniel M. Bikel, Lukas Blecher, Cristian Cantón Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, ...

[46]

Nikhil Vyas, Depen Morwani, Rosie Zhao, Itai Shapira, David Brandfonbrener, Lucas Janson, and Sham M. Kakade. SOAP: Improving and stabilizing shampoo using adam for language modeling. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=IDxZhXrpNf

[47]

Closing the gap between the upper bound and the lower bound of adam's iteration complexity

Bohan Wang, Jingwen Fu, Huishuai Zhang, Nanning Zheng, and Wei Chen. Closing the gap between the upper bound and the lower bound of adam's iteration complexity. ArXiv, abs/2310.17998, 2023. URL https://api.semanticscholar.org/CorpusID:264555523

[48]

Adagrad stepsizes: Sharp convergence over nonconvex landscapes

Rachel Ward, Xiaoxia Wu, and Leon Bottou. Adagrad stepsizes: Sharp convergence over nonconvex landscapes. Journal of Machine Learning Research, 21(219):1--30, 2020

[49]

Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models

Xingyu Xie, Pan Zhou, Huan Li, Zhouchen Lin, and Shuicheng Yan. Adan: Adaptive nesterov momentum algorithm for faster optimizing deep models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9508--9520, 2024. doi:10.1109/TPAMI.2024.3423382

[50]

Linear convergence of adaptive stochastic gradient descent

Yuege Xie, Xiaoxia Wu, and Rachel Ward. Linear convergence of adaptive stochastic gradient descent. In International Conference on Artificial Intelligence and Statistics, pp. 1475--1485. PMLR, 2020a

[51]

Adaptive inertia: Disentangling the effects of adaptive learning rate and momentum

Zeke Xie, Xinrui Wang, Huishuai Zhang, Issei Sato, and Masashi Sugiyama. Adaptive inertia: Disentangling the effects of adaptive learning rate and momentum. In International Conference on Machine Learning, 2020b. URL https://api.semanticscholar.org/CorpusID:248986834

[52]

Mars: Unleashing the power of variance reduction for training large models

Huizhuo Yuan, Yifeng Liu, Shuang Wu, Xun Zhou, and Quanquan Gu. Mars: Unleashing the power of variance reduction for training large models. ArXiv, abs/2411.10438, 2024. URL https://api.semanticscholar.org/CorpusID:274116548

[53]

Adagrad meets muon: Adaptive stepsizes for orthogonal updates

Minxin Zhang, Yuxuan Liu, and Hayden Schaeffer. Adagrad meets muon: Adaptive stepsizes for orthogonal updates. 2025. URL https://api.semanticscholar.org/CorpusID:281091748

[54]

On the convergence of adaptive gradient methods for nonconvex optimization

Dongruo Zhou, Yiqi Tang, Ziyan Yang, Yuan Cao, and Quanquan Gu. On the convergence of adaptive gradient methods for nonconvex optimization. ArXiv, abs/1808.05671, 2018. URL https://api.semanticscholar.org/CorpusID:52040763

[55]

Stochastic nested variance reduction for nonconvex optimization

Dongruo Zhou, Pan Xu, and Quanquan Gu. Stochastic nested variance reduction for nonconvex optimization. Journal of Machine Learning Research, 21(103):1--63, 2020

[56]

Adabelief optimizer: Adapting stepsizes by the belief in observed gradients

Juntang Zhuang, Tommy M. Tang, Yifan Ding, Sekhar Chandra Tatikonda, Nicha C. Dvornek, Xenophon Papademetris, and James S. Duncan. Adabelief optimizer: Adapting stepsizes by the belief in observed gradients. ArXiv, abs/2010.07468, 2020. URL https://api.semanticscholar.org/CorpusID:222377595

