AdaMuon: Adaptive Muon Optimizer. arXiv preprint arXiv:2507.11005

Chongjie Si, Debing Zhang, Wei Shen · 2025 · arXiv 2507.11005

Canonical reference. 80% of citing Pith papers cite this work as background.

31 Pith papers citing it

citation-role summary

background: 4 · method: 1

citation-polarity summary

background: 4 · use method: 1

representative citing papers

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

Why Muon Outperforms Adam: A Curvature Perspective

cs.LG · 2026-06-03 · conditional · novelty 7.0

Muon outperforms Adam by reducing curvature penalty via lower Normalized Directional Sharpness, as shown via Taylor approximation on LLM training and proven on stylized quadratic problems with heterogeneous curvature.

AMUSE: Anytime Muon with Stable Gradient Evaluation

cs.LG · 2026-05-21 · unverdicted · novelty 7.0

AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.

Rethinking Muon Beyond Pretraining: Spectral Failures and High-Pass Remedies for VLA and RLVR

cs.LG · 2026-05-19 · conditional · novelty 7.0

Pion modifies Muon's Newton-Schulz iterations into a controllable high-pass filter that anchors dominant singular values at 1 while suppressing noisy tails, outperforming Muon and AdamW in VLA and RLVR regimes.

Muon is Not That Special: Random or Inverted Spectra Work Just as Well

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Muon succeeds by guaranteeing local step-size optimality rather than by tracking any ideal global geometry, as random-spectrum and quasi-norm variants match its performance on language models.

Accelerating LMO-Based Optimization via Implicit Gradient Transport

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

LMO-IGT achieves O(ε^{-3.5}) iteration complexity for stochastic LMO optimization via implicit gradient transport with a single gradient per step and introduces the regularized support function as a unified stationarity measure.

A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

cs.LG · 2026-04-19 · unverdicted · novelty 7.0

A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.

On the Convergence of Muon and Beyond

cs.LG · 2025-09-19 · unverdicted · novelty 7.0

Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.

One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining

cs.LG · 2026-06-29 · unverdicted · novelty 6.0

One-step gradient delay is optimizer-dependent rather than intrinsically unstable, with Muon and error-feedback correction enabling async pipeline parallelism to match synchronous performance on models up to 10B parameters.

Aurora: A Leverage-Aware Spectral Optimizer

cs.LG · 2026-06-26 · unverdicted · novelty 6.0

Aurora is a leverage-aware spectral optimizer that enforces uniform row norms in matrix updates while preserving Muon's polar geometry, outperforming Muon and achieving SOTA among spectral methods on modded-nanoGPT.

FOGO: Forgetting-aware Orthogonalization Optimizer

cs.LG · 2026-06-09 · unverdicted · novelty 6.0

FOGO introduces spectral orthogonalization of momentum updates plus a random-projection codebook memory to detect and correct gradient interference, improving convergence and retention over Adam and Muon on imbalanced, continual, and large-model tasks.

OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality

math.OC · 2026-06-07 · unverdicted · novelty 6.0 · 2 refs

OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.

Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss

cs.LG · 2026-06-04 · unverdicted · novelty 6.0

Double preconditioning (DoPr) improves downstream task performance in test-time feedback settings without consistent gains in validation loss.

Stochastic convergence of parallel asynchronous adaptive first-order methods

cs.AI · 2026-06-01 · unverdicted · novelty 6.0

Introduces a class of asynchronous adaptive first-order methods and establishes O(1/sqrt t) convergence (up to logs) for non-convex stochastic optimization under reasonable assumptions.

MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training

cs.LG · 2026-05-26 · unverdicted · novelty 6.0

MONA integrates Nesterov acceleration into Muon's orthogonalization framework, reporting better convergence than Muon and AdamW on MoE models up to 68B parameters trained on 1T tokens and SOTA fine-tuning results.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

math.OC · 2026-05-18 · unverdicted · novelty 6.0 · 2 refs

Proposes equivariant optimizer updates matched to layer symmetries for embeddings, SwiGLU MLPs, and MoE routers, with reported gains in validation loss and training stability on several language model architectures.

Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.

OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.

PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.

Parcae: Scaling Laws For Stable Looped Language Models

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

cs.LG · 2026-03-30 · unverdicted · novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning

cs.LG · 2026-06-12 · unverdicted · novelty 5.0

AdaNAGED combines zeroth-order gradient-free training, automatic parameter adaptation, and LMO-based non-Euclidean geometry with claimed convergence guarantees, demonstrated on OPT-1.3B fine-tuning.

Zeta: Dual Whitening for Matrix Optimization via Coordinate-Adaptive Preconditioning

cs.LG · 2026-06-12 · unverdicted · novelty 5.0

Zeta applies coordinate whitening followed by spectral whitening in a fixed order to reduce orthogonalization error in matrix optimization for neural networks.

Muon Learns More Robust and Transferable Features than Adam

cs.LG · 2026-06-08 · unverdicted · novelty 5.0

Muon learns more robust and transferable features than Adam and SGD, shown via corruption robustness tests, transfer experiments, layer-wise probes, effective rank measurements, and a theoretical proof on margins in a multi-component classification problem.

citing papers explorer

Showing 28 of 28 citing papers after filters.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds
cs.LG · 2026-05-07 · unverdicted · none · ref 33
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
AMUSE: Anytime Muon with Stable Gradient Evaluation
cs.LG · 2026-05-21 · unverdicted · none · ref 38
AMUSE is a new optimizer integrating Muon orthogonalization with Schedule-Free averaging via adaptive interpolation for schedule-free anytime training that improves Pareto frontiers on vision and LLM tasks.
Muon is Not That Special: Random or Inverted Spectra Work Just as Well
cs.LG · 2026-05-11 · unverdicted · none · ref 13
Muon succeeds by guaranteeing local step-size optimality rather than by tracking any ideal global geometry, as random-spectrum and quasi-norm variants match its performance on language models.
Accelerating LMO-Based Optimization via Implicit Gradient Transport
cs.LG · 2026-05-07 · unverdicted · none · ref 13
LMO-IGT achieves O(ε^{-3.5}) iteration complexity for stochastic LMO optimization via implicit gradient transport with a single gradient per step and introduces the regularized support function as a unified stationarity measure.
A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo
cs.LG · 2026-04-19 · unverdicted · none · ref 44
A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.
On the Convergence of Muon and Beyond
cs.LG · 2025-09-19 · unverdicted · none · ref 44
Muon-MVR2 attains the optimal anytime convergence rate of ~O(T^{-1/3}) in stochastic non-convex settings under horizon-free schedules.
One-Step Gradient Delay is Not a Barrier for Large-Scale Asynchronous Pipeline Parallel LLM Pretraining
cs.LG · 2026-06-29 · unverdicted · none · ref 30
One-step gradient delay is optimizer-dependent rather than intrinsically unstable, with Muon and error-feedback correction enabling async pipeline parallelism to match synchronous performance on models up to 10B parameters.
Aurora: A Leverage-Aware Spectral Optimizer
cs.LG · 2026-06-26 · unverdicted · none · ref 38
Aurora is a leverage-aware spectral optimizer that enforces uniform row norms in matrix updates while preserving Muon's polar geometry, outperforming Muon and achieving SOTA among spectral methods on modded-nanoGPT.
FOGO: Forgetting-aware Orthogonalization Optimizer
cs.LG · 2026-06-09 · unverdicted · none · ref 39
FOGO introduces spectral orthogonalization of momentum updates plus a random-projection codebook memory to detect and correct gradient interference, improving convergence and retention over Adam and Muon on imbalanced, continual, and large-model tasks.
OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality
math.OC · 2026-06-07 · unverdicted · none · ref 72 · 2 links
OptMuon combines orthogonalized momentum with trajectory-dependent AdaGrad-Norm adaptation to obtain expected-stationarity rates of order T^{-1/2} + sigma^{1/2}T^{-1/4} or T^{-1/2} + sigma^{1/3}T^{-1/3} that reduce to near-optimal deterministic first-order rates in the zero-noise regime.
Double Preconditioning (DoPr): Optimization for Test-Time Performance, not Validation Loss
cs.LG · 2026-06-04 · unverdicted · none · ref 171
Double preconditioning (DoPr) improves downstream task performance in test-time feedback settings without consistent gains in validation loss.
Stochastic convergence of parallel asynchronous adaptive first-order methods
cs.AI · 2026-06-01 · unverdicted · none · ref 46
Introduces a class of asynchronous adaptive first-order methods and establishes O(1/sqrt t) convergence (up to logs) for non-convex stochastic optimization under reasonable assumptions.
MONA: Muon Optimizer with Nesterov Acceleration for Scalable Language Model Training
cs.LG · 2026-05-26 · unverdicted · none · ref 42
MONA integrates Nesterov acceleration into Muon's orthogonalization framework, reporting better convergence than Muon and AdamW on MoE models up to 68B parameters trained on 1T tokens and SOTA fine-tuning results.
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers
math.OC · 2026-05-18 · unverdicted · none · ref 140 · 2 links
Proposes equivariant optimizer updates matched to layer symmetries for embeddings, SwiGLU MLPs, and MoE routers, with reported gains in validation loss and training stability on several language model architectures.
Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence
cs.LG · 2026-05-13 · unverdicted · none · ref 13
Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling
cs.LG · 2026-05-08 · unverdicted · none · ref 15
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.
PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation
cs.LG · 2026-05-08 · unverdicted · none · ref 48
PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.
Parcae: Scaling Laws For Stable Looped Language Models
cs.LG · 2026-04-14 · unverdicted · none · ref 73
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
cs.LG · 2026-03-30 · unverdicted · none · ref 18
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.
Zero-order Parameter-free Optimization for LMO-based Methods: Novel Approach for Efficient Fine-tuning
cs.LG · 2026-06-12 · unverdicted · none · ref 75
AdaNAGED combines zeroth-order gradient-free training, automatic parameter adaptation, and LMO-based non-Euclidean geometry with claimed convergence guarantees, demonstrated on OPT-1.3B fine-tuning.
Zeta: Dual Whitening for Matrix Optimization via Coordinate-Adaptive Preconditioning
cs.LG · 2026-06-12 · unverdicted · none · ref 20
Zeta applies coordinate whitening followed by spectral whitening in a fixed order to reduce orthogonalization error in matrix optimization for neural networks.
Muon Learns More Robust and Transferable Features than Adam
cs.LG · 2026-06-08 · unverdicted · none · ref 131
Muon learns more robust and transferable features than Adam and SGD, shown via corruption robustness tests, transfer experiments, layer-wise probes, effective rank measurements, and a theoretical proof on margins in a multi-component classification problem.
Can Entry-Wise Clipping Give Spectral Control of Stochastic Gradients?
cs.LG · 2026-05-26 · unverdicted · none · ref 18
Entry-wise clipping achieves spectral control of gradients via localization under heavy-tailed contamination, with O(ε^{-4}) convergence and empirical savings on NanoGPT pretraining.
Convergence of Spectral Descent for Non-smooth Optimization
cs.LG · 2026-05-26 · unverdicted · none · ref 23
Proves linear convergence of Spectral Descent (SD) and Truncated SD for non-smooth convex problems under stated conditions, sublinear rates for regularized versions via Frank-Wolfe, and recovery guarantees for robust low-rank matrix recovery.
Anytime Training with Schedule-Free Spectral Optimization
cs.LG · 2026-05-21 · unverdicted · none · ref 37
SF-NorMuon is a new schedule-free spectral optimizer that closes the gap with tuned AdamW on 125M-772M parameter models across 1-8x Chinchilla horizons while providing stationarity guarantees.
HTMuon: Improving Muon via Heavy-Tailed Spectral Correction
cs.LG · 2026-03-10 · unverdicted · none · ref 25
HTMuon modifies Muon to produce heavier-tailed updates and weight spectra via HT-SR theory, yielding up to 0.98 lower perplexity on LLaMA pretraining and serving as a plug-in for other Muon variants.
Low-rank Orthogonalization for Large-scale Matrix Optimization with Applications to Foundation Model Training
cs.LG · 2025-09-15 · unverdicted · none · ref 49
Proposes low-rank orthogonalization and derives low-rank Muon and MSGD variants that outperform standard Muon on GPT-2 and LLaMA pretraining while providing iteration complexity bounds.
A Note on Stability for Orthogonalized Matrix Momentum with Client Sampling
cs.LG · 2026-06-01 · unverdicted · none · ref 30
Derives finite-round upper-tail guarantee on population-empirical gap for client-sampled orthogonalized matrix momentum under heterogeneous data, with Lipschitz condition on the orthogonalizer.
