On the Convergence of Muon and Beyond
The Muon optimizer has demonstrated remarkable empirical success in training neural networks with matrix-structured parameters. However, a significant gap remains between its practical performance and its theoretical understanding. Existing analyses show that Muon variants achieve only a suboptimal ergodic convergence rate of $\mathcal{O}(T^{-1/4})$ in stochastic non-convex settings, where $T$ denotes the number of iterations. To probe the theoretical limits of Muon, we analyze two momentum-based variance-reduced variants: the one-batch Muon-MVR1 and the two-batch Muon-MVR2. We provide the first rigorous proof that, under \textbf{horizon-free} learning-rate schedules, variance reduction enables Muon-MVR2 to attain the optimal anytime convergence rate $\widetilde{\mathcal{O}}(T^{-1/3})$, matching the lower bound for this problem class. Under the Polyak--\L{}ojasiewicz (PL) condition, we establish anytime guarantees for Muon-MVR1 and Muon-MVR2: they attain best-iterate rates of $\widetilde{\mathcal{O}}(T^{-1/4})$ and $\widetilde{\mathcal{O}}(T^{-1/3})$ for the expected square-root suboptimality, and, given an additional uniform gradient bound along the iterates, achieve last-iterate rates of $\mathcal{O}(T^{-1/4})$ and $\mathcal{O}(T^{-1/3})$ for the objective gap, respectively. Experiments on CIFAR-10 and C4 support the practical effectiveness of the proposed variance-reduced Muon variants. Code is available in the \href{https://github.com/MaeChd/MUON-MVR}{Muon-MVR} codebase.
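The abstract names the algorithms but not their update rules. As a rough illustration, a two-batch momentum-variance-reduced Muon step could look like the sketch below, assuming a STORM-style estimator and the quintic Newton-Schulz orthogonalization widely used in Muon implementations; `grad_fn`, the hyperparameters, and the exact recursion are illustrative assumptions, not the authors' implementation (see their codebase for that).

```python
import torch

def newton_schulz_orthogonalize(M, steps=5, eps=1e-7):
    # Approximate the orthogonal polar factor U V^T of M with the quintic
    # Newton-Schulz iteration commonly used in Muon implementations.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = M / (M.norm() + eps)           # scale so all singular values are <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                     # iterate on the smaller Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X

def muon_mvr2_step(W, W_prev, d_prev, grad_fn, batch, lr=0.02, beta=0.1):
    # One sketch step of a two-batch variant: the same minibatch is
    # evaluated at the current AND previous iterates (two gradient calls),
    # giving the variance-reduced momentum
    #   d_t = g(W_t; xi_t) + (1 - beta) * (d_{t-1} - g(W_{t-1}; xi_t)),
    # which is orthogonalized before the update.
    g_curr = grad_fn(W, batch)
    if d_prev is None or W_prev is None:
        d = g_curr                     # first step: no correction term yet
    else:
        d = g_curr + (1.0 - beta) * (d_prev - grad_fn(W_prev, batch))
    return W - lr * newton_schulz_orthogonalize(d), d

# Toy usage on a random least-squares problem.
def lsq_grad(W, batch):
    X, Y = batch
    return X.T @ (X @ W - Y) / X.shape[0]

W, W_prev, d = torch.randn(8, 4), None, None
for _ in range(3):
    batch = (torch.randn(32, 8), torch.randn(32, 4))
    W_next, d = muon_mvr2_step(W, W_prev, d, lsq_grad, batch)
    W_prev, W = W, W_next
```

The one-batch Muon-MVR1 would presumably drop the second gradient evaluation, trading one oracle call per step against the lower-variance estimator.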
Forward citations
Cited by 7 Pith papers
- When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices; the bounds predict faster GPT-2 pretraining (the norm identities behind this are sketched after the list).
- Gradient Clipping Beyond Vector Norms: A Spectral Approach for Matrix-Valued Parameters
Spectral clipping of the leading singular values of gradient matrices stabilizes SGD for non-convex problems with heavy-tailed noise and achieves the optimal convergence rate O(K^{(2-2α)/(3α-2)}) (a minimal sketch of the clipping operator follows after the list).
- Muon with Nesterov Momentum: Heavy-Tailed Noise and (Randomized) Inexact Polar Decomposition
Muon with Nesterov momentum and inexact polar decomposition achieves optimal convergence rates of O(ε^(-(3α-2)/(α-1))) under heavy-tailed noise for ε-stationary points in non-convex settings.
- Dimension-Free Saddle-Point Escape in Muon
Muon achieves dimension-free saddle-point escape through non-linear spectral shaping, resolvent calculus, and structural incoherence, yielding an algebraically dimension-free escape bound.
- Muon Does Not Converge on Convex Lipschitz Functions
Muon does not converge on convex Lipschitz functions regardless of learning rate, while error feedback restores theoretical convergence but degrades performance on CIFAR-10 and nanoGPT tasks.
- MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
- Communication-Efficient Gluon in Federated Learning
Compressed Gluon variants using unbiased/contraction compressors and SARAH-style variance reduction achieve convergence guarantees and lower communication costs in federated learning under layer-wise smoothness.
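The first citation's headline claim rests on a pair of norm identities: the signSGD step satisfies ⟨G, sign(G)⟩ = ‖G‖₁, so its per-step descent is measured in the ℓ1 norm, and Muon's orthogonalized step is the matrix analogue with the nuclear norm in its place. A minimal sketch checking both identities, using an exact SVD-based polar factor where Muon would use an approximation; all names here are illustrative:

```python
import torch

def signsgd_step(W, G, lr=1e-3):
    # signSGD: descent per step scales with <G, sign(G)> = ||G||_1.
    return W - lr * torch.sign(G)

def polar_step(W, G, lr=1e-3):
    # Matrix analogue (exact polar factor; Muon approximates U V^T):
    # <G, U V^T> = sum of singular values = nuclear norm of G.
    U, _, Vh = torch.linalg.svd(G, full_matrices=False)
    return W - lr * (U @ Vh)

# Sanity-check the two inner-product identities on a random matrix.
G = torch.randn(4, 6)
assert torch.isclose((G * torch.sign(G)).sum(), G.abs().sum())
U, S, Vh = torch.linalg.svd(G, full_matrices=False)
assert torch.isclose((G * (U @ Vh)).sum(), S.sum(), atol=1e-4)
```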
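The second citation clips the leading singular values of a gradient matrix rather than its vector norm. A minimal sketch of such an operator, assuming a full SVD and a hypothetical threshold `tau`; the paper's actual operator and threshold schedule may differ:

```python
import torch

def spectral_clip(G, tau):
    # Cap each singular value of G at tau, so the returned matrix has
    # spectral norm at most tau while small singular directions pass
    # through unchanged.
    U, S, Vh = torch.linalg.svd(G, full_matrices=False)
    return U @ torch.diag(torch.clamp(S, max=tau)) @ Vh

# Drop-in replacement for a raw SGD step `W -= lr * G`:
# W -= lr * spectral_clip(G, tau=1.0)
```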