SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
hub Mixed citations
Muon optimizes under spectral norm constraints.arXiv preprint arXiv:2506.15054
Mixed citation behavior. Most common role is background (60%).
hub tools
citation-role summary
citation-polarity summary
representative citing papers
Regularized Muon induces a damped Hamiltonian flow on probability measures over matrix parameters, yielding exponential convergence under gradient dominance assumptions.
Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing faster initial multi-step dynamics.
Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.
OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.
Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.
Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.
Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
SoftSignum replaces hard sign with soft-sign in optimizers via temperature control and quantile scheduling, extends to SoftMuon, provides a convergence proof for stochastic non-convex settings, and reports better performance than sign-based methods and AdamW on deep learning tasks.
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.
citing papers explorer
-
When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
-
Move on Muon : A Hamiltonian probability gradient flow perspective of Muon optimizer
Regularized Muon induces a damped Hamiltonian flow on probability measures over matrix parameters, yielding exponential convergence under gradient dominance assumptions.
-
Sharp Capacity Scaling of Spectral Optimizers in Learning Associative Memory
Muon achieves higher storage capacity than SGD and matches Newton's method in one-step recovery rates for associative memory under power-law distributions, while saturating at larger critical batch sizes and showing faster initial multi-step dynamics.
-
On the Convergence of Muon and Beyond
Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.
-
OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality
OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.
-
Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence
Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.
-
Muon-OGD: Muon-based Spectral Orthogonal Gradient Projection for LLM Continual Learning
Muon-OGD introduces a spectral-norm constrained orthogonal projection method solved via dual iterations and Newton-Schulz approximations to improve stability-plasticity trade-off in sequential LLM adaptation.
-
Optimal Projection-Free Adaptive SGD for Matrix Optimization
Proving stability of Leon's preconditioner enables the first tuning-free Nesterov-accelerated projection-free adaptive SGD variant with improved non-smooth non-convex rates.
-
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
-
Softsign: Smooth Sign in Your Optimizer For Better Parameter Heterogeneity Handling
SoftSignum replaces hard sign with soft-sign in optimizers via temperature control and quantile scheduling, extends to SoftMuon, provides a convergence proof for stochastic non-convex settings, and reports better performance than sign-based methods and AdamW on deep learning tasks.
-
Pion: A Spectrum-Preserving Optimizer via Orthogonal Equivalence Transformation
Pion is an optimizer that preserves the singular values of weight matrices in LLM training by applying orthogonal equivalence transformations.
-
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
-
Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training
Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.
- Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers