MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

Chang, D · 2026 · cs.LG · arXiv 2603.28254

2 Pith papers cite this work. Polarity classification is still indexing.

2 Pith papers citing it

open full Pith review browse 2 citing papers arXiv PDF

abstract

Orthogonalized-update optimizers such as Muon improve training of matrix-valued parameters, but existing extensions typically either rescale updates after orthogonalization or use heavier whitening-based preconditioners before it. We introduce {\method}, a lightweight family of pre-orthogonalization equilibration schemes for Muon with three forms: two-sided row/column normalization (RC), row normalization (R), and column normalization (C). By rebalancing the momentum matrix before finite-step Newton--Schulz orthogonalization, {\method} improves the geometry seen by orthogonalization. We show that finite-step orthogonalization is governed by the input spectrum, especially stable rank and condition number, and that row/column normalization acts as a zeroth-order surrogate for whitening. For hidden matrix weights, R is the default variant. Theoretically, {\method} (R) retains the standard $\widetilde{\mathcal O}(T^{-1/4})$ Muon-type nonconvex stationarity guarantee with decoupled weight decay and a horizon-free diminishing learning-rate schedule, and extends it to finite-step NS5 up to an explicit inexactness constant. In LLaMA2 pretraining on C4, {\method} (R) consistently outperforms Muon on 130M, 350M, and 1B models, with faster convergence and lower validation perplexity. The code is available at the \href{https://github.com/MaeChd/muon-eq}{MuonEq codebase}.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.

Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers

math.OC · 2026-05-18

citing papers explorer

Showing 2 of 2 citing papers.

PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation cs.LG · 2026-05-08 · unverdicted · none · ref 36 · internal anchor
PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.
Symmetry-Compatible Principle for Optimizer Design: Embeddings, LM Heads, SwiGLU MLPs, and MoE Routers math.OC · 2026-05-18 · unreviewed · ref 25 · internal anchor

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer