arXiv preprint arXiv:2507.11005 , year=

Chongjie Si, Debing Zhang, Wei Shen · 2025 · arXiv 2507.11005

9 Pith papers cite this work. Polarity classification is still indexing.

9 Pith papers citing it

representative citing papers

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds

cs.LG · 2026-05-07 · unverdicted · novelty 8.0

SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.

Muon is Not That Special: Random or Inverted Spectra Work Just as Well

cs.LG · 2026-05-11 · unverdicted · novelty 7.0

Muon succeeds by guaranteeing local step-size optimality rather than by tracking any ideal global geometry, as random-spectrum and quasi-norm variants match its performance on language models.

Accelerating LMO-Based Optimization via Implicit Gradient Transport

cs.LG · 2026-05-07 · unverdicted · novelty 7.0

LMO-IGT achieves O(ε^{-3.5}) iteration complexity for stochastic LMO optimization via implicit gradient transport with a single gradient per step and introduces the regularized support function as a unified stationarity measure.

A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo

cs.LG · 2026-04-19 · unverdicted · novelty 7.0

A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.

Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.

OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.

PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.

Parcae: Scaling Laws For Stable Looped Language Models

cs.LG · 2026-04-14 · unverdicted · novelty 6.0

Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

cs.LG · 2026-03-30 · unverdicted · novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

citing papers explorer

Showing 9 of 9 citing papers.

When and Why SignSGD Outperforms SGD: A Theoretical Study Based on $\ell_1$-norm Lower Bounds cs.LG · 2026-05-07 · unverdicted · none · ref 33
SignSGD provably beats SGD by a factor of d under sparse noise via matched ℓ1-norm upper and lower bounds, with an equivalent result for Muon on matrices, and this predicts faster GPT-2 pretraining.
Muon is Not That Special: Random or Inverted Spectra Work Just as Well cs.LG · 2026-05-11 · unverdicted · none · ref 13
Muon succeeds by guaranteeing local step-size optimality rather than by tracking any ideal global geometry, as random-spectrum and quasi-norm variants match its performance on language models.
Accelerating LMO-Based Optimization via Implicit Gradient Transport cs.LG · 2026-05-07 · unverdicted · none · ref 13
LMO-IGT achieves O(ε^{-3.5}) iteration complexity for stochastic LMO optimization via implicit gradient transport with a single gradient per step and introduces the regularized support function as a unified stationarity measure.
A unified convergence theory for adaptive first-order methods in the nonconvex case, including AdaNorm, full and diagonal AdaGrad, Shampoo and Muo cs.LG · 2026-04-19 · unverdicted · none · ref 44
A unified stochastic convergence theory is developed for adaptive preconditioned first-order methods including AdaGrad variants, Shampoo, and Muon in nonconvex optimization.
Spectral Flattening Is All Muon Needs: How Orthogonalization Controls Learning Rate and Convergence cs.LG · 2026-05-13 · unverdicted · none · ref 13
Muon achieves faster convergence and larger stable learning rates by flattening the singular value spectrum of the momentum buffer through orthogonalization, scaling step size with average rather than maximum singular values.
OrScale: Orthogonalised Optimization with Layer-Wise Trust-Ratio Scaling cs.LG · 2026-05-08 · unverdicted · none · ref 15
OrScale adds a Frobenius-norm trust-ratio layer-wise scaler to Muon’s orthogonalized updates, with per-layer calibration for language models, yielding higher CIFAR-10 accuracy and better language-model pre-training loss than Muon+Moonlight and AdamW.
PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation cs.LG · 2026-05-08 · unverdicted · none · ref 48
PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.
Parcae: Scaling Laws For Stable Looped Language Models cs.LG · 2026-04-14 · unverdicted · none · ref 73
Parcae stabilizes looped LLMs via spectral norm constraints on injection parameters, enabling power-law scaling for training FLOPs and saturating exponential scaling at test time that improves quality over fixed-depth baselines under fixed parameter budgets.
MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration cs.LG · 2026-03-30 · unverdicted · none · ref 18
MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

arXiv preprint arXiv:2507.11005 , year=

fields

years

verdicts

representative citing papers

citing papers explorer