pith. machine review for the scientific record.

arxiv: 2604.09967 · v1 · submitted 2026-04-11 · 💻 cs.LG · cs.AI

Recognition: unknown

Muon²: Boosting Muon via Adaptive Second-Moment Preconditioning

Ruijie Zhang, Yequan Zhao, Yupeng Su, Zhengyang Wang, Zheng Zhang, Zi Yang, Ziyue Liu

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 16:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Muon optimizer · second-moment preconditioning · Newton-Schulz iteration · orthogonalization · large model pre-training · adaptive optimization · momentum matrix

The pith

Muon² preconditions momentum with second moments to accelerate orthogonalization and cut Newton-Schulz iterations by 40 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Muon² extends Muon by inserting an Adam-style adaptive second-moment preconditioning step on the momentum matrix before its iterative orthogonalization. The preconditioning improves the spectrum of the ill-conditioned matrix, which speeds convergence of the Newton-Schulz procedure toward a level of orthogonality that is sufficient in practice. The authors track this quality with directional alignment and report large gains at each polar step. In pre-training runs on GPT and LLaMA models from 60M to 1.3B parameters, Muon² beats Muon and recent variants while using 40 percent fewer Newton-Schulz iterations. A factorized variant, Muon²-F, keeps most of the gains with almost no added memory cost.

Core claim

Muon² applies adaptive second-moment preconditioning to the momentum matrix before Newton-Schulz orthogonalization. This step improves the matrix spectrum and thereby accelerates convergence to a practically sufficient orthogonalization quality, measured by directional alignment. The result is better pre-training performance on GPT and LLaMA models across 60M to 1.3B parameters together with a 40 percent reduction in Newton-Schulz iterations per step. The authors also introduce Muon²-F, a memory-efficient factorized form that preserves most of the advantage.

What carries the argument

Adaptive second-moment preconditioning applied to the momentum matrix before Newton-Schulz iterative orthogonalization, which improves conditioning and speeds polar approximation.
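
A minimal sketch of this mechanism, assuming an Adam-style elementwise second moment and a simplified cubic Newton-Schulz polynomial (Muon's production iteration uses tuned higher-order coefficients; all names and hyperparameters here are illustrative, not the paper's):

```python
import numpy as np

def newton_schulz(M, steps=5):
    """Drive M toward its polar factor (the nearest orthogonal matrix).
    Simplified cubic iteration; Muon uses tuned higher-order polynomials."""
    X = M / (np.linalg.norm(M) + 1e-12)  # scale so singular values lie in (0, 1]
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X  # pushes each singular value toward 1
    return X

def muon2_step(W, grad, state, lr=0.02, beta1=0.95, beta2=0.999,
               eps=1e-8, ns_steps=3):
    """One hypothetical Muon²-style update: precondition the momentum
    matrix with second moments, then orthogonalize the result."""
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad      # momentum matrix
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad**2   # second moments
    precond = state["m"] / (np.sqrt(state["v"]) + eps)        # Adam-style preconditioning
    return W - lr * newton_schulz(precond, steps=ns_steps)
```

The claimed saving is that the preconditioned matrix needs fewer `ns_steps` to reach a usable polar approximation than the raw momentum matrix would.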

If this is right

  • Fewer Newton-Schulz iterations per step lower both computation and communication cost during large-scale training.
  • Consistent gains appear across GPT and LLaMA architectures and across model sizes from 60M to 1.3B parameters.
  • Directional alignment serves as a practical proxy for orthogonalization quality.
  • A factorized implementation can retain most benefits while keeping memory overhead negligible.
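
One plausible reading of the directional-alignment metric is the Frobenius cosine similarity between the truncated Newton-Schulz output and the exact polar factor computed via SVD; a sketch under that assumption (the paper's precise definition may differ):

```python
import numpy as np

def polar_factor(M):
    """Exact polar factor of M via SVD, used as the reference answer."""
    U, _, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ Vt

def directional_alignment(X, M):
    """Frobenius cosine similarity between an approximate
    orthogonalization X and the true polar factor of M."""
    P = polar_factor(M)
    return float(np.sum(X * P) / (np.linalg.norm(X) * np.linalg.norm(P)))
```

A value near 1 would indicate that the truncated iteration already points in the right direction, which is the "practically sufficient" bar the paper argues for.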

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same preconditioning step could accelerate other iterative matrix approximations used inside optimizers.
  • Lower per-step iteration counts may ease scaling of matrix-aware optimizers to models beyond 1.3B parameters.
  • Combinations of second-moment preconditioning with alternative orthogonalization schemes remain unexplored and could yield additional savings.

Load-bearing premise

That the spectrum improvement from second-moment preconditioning reliably produces orthogonalization quality sufficient to deliver better end-to-end training performance.
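
A toy numerical probe of this premise, under the idealized assumption that the accumulated second moments exactly track an artificial row-scale disparity (real momentum and second-moment matrices from training are messier, so this only illustrates the direction of the effect):

```python
import numpy as np

rng = np.random.default_rng(0)
G = rng.standard_normal((128, 128))                      # well-conditioned core
scales = np.exp(rng.uniform(-4.0, 4.0, size=(128, 1)))   # wild per-row scale spread
M = scales * G                                           # ill-conditioned "momentum"
V = scales ** 2                                          # idealized second moments
P = M / np.sqrt(V + 1e-12)                               # elementwise preconditioning

def cond(A):
    """Condition number: ratio of largest to smallest singular value."""
    s = np.linalg.svd(A, compute_uv=False)
    return s[0] / s[-1]

print(f"cond(M) = {cond(M):.2e}, cond(P) = {cond(P):.2e}")
```

In this contrived case the preconditioning removes the scale disparity entirely; whether the improvement is comparably large for genuine momentum matrices is exactly what the paper's spectrum analysis needs to establish.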

What would settle it

A controlled pre-training run on a 1.3B-parameter LLaMA model in which Muon² with 40 percent fewer Newton-Schulz iterations reaches equal or lower final validation loss than standard Muon under identical settings.

Figures

Figures reproduced from arXiv: 2604.09967 by Ruijie Zhang, Yequan Zhao, Yupeng Su, Zhengyang Wang, Zheng Zhang, Zi Yang, Ziyue Liu.

Figure 1
Figure 1. Spectral effect of Muon² on the input matrix of the Newton–Schulz iteration. view at source ↗
Figure 3
Figure 3. Cosine similarity of the output matrix vs. true … view at source ↗
Figure 4
Figure 4. Learning rate sweep on GPT-Large comparing … view at source ↗
Figure 5
Figure 5. How the NS iteration maps singular values … view at source ↗
Figure 6
Figure 6. Comparison of convergence zones between Keller (Muon's polar method; Jordan et al., 2024) and PolarExpress (Amsel et al., 2025). view at source ↗
Figure 7
Figure 7. Learning rate sweep on GPT-Small.

           Baseline        Muon² (ours)
    LR     Ns=3    Ns=5    Ns=3    Ns=5
    0.003  25.43   22.11   21.73   21.01
    0.005  24.69   21.75   20.64   20.33
    0.010  25.42   21.99   20.39   19.96
    0.020  24.78   21.47   20.50   20.40
    0.040  25.75   22.22   21.97   21.24

view at source ↗
Figure 9
Figure 9. Learning rate sweep on LLaMA-60M.

           Muon            Muon² (ours)
    LR     Ns=3    Ns=5    Ns=3    Ns=5
    0.04   14.91   14.03   13.44   13.46
    0.05   15.18   14.12   13.50   13.51
    0.06   15.33   14.18   13.58   13.59
    0.07   15.43   14.30   13.67   13.63
    0.08   15.76   14.49   13.71   13.73

view at source ↗
Figure 10
Figure 10. Learning rate sweep on LLaMA-350M.

           Muon            Muon² (ours)
    LR     Ns=3    Ns=5    Ns=3    Ns=5
    0.04   11.63   10.62   10.43   10.21
    0.05   11.86   10.70   10.50   10.26
    0.06   11.98   10.77   10.55   10.31

view at source ↗
Figure 11
Figure 11. Learning rate sweep on LLaMA-1B.

            PolarExpress    Turbo-Muon      NorMuon
    LR      Ns=3    Ns=5    Ns=3    Ns=5    Ns=3    Ns=5
    0.003   31.51   30.45   31.36   30.31   32.50   29.90
    0.005   30.01   29.42   29.74   29.66   30.70   28.40
    0.010   31.32   29.73   29.70   29.79   31.43   28.72
    0.020   30.31   29.66   30.10   29.88   31.05   28.48
    0.040   31.25   30.03   30.07   30.82   30.35   29.09

view at source ↗
read the original abstract

Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, its practical efficiency is limited by the need for multiple Newton--Schulz (NS) iterations per optimization step, which introduces non-trivial computation and communication overhead. We propose Muon$^2$, an extension of Muon that applies Adam-style adaptive second-moment preconditioning before orthogonalization. Our key insight is that the core challenge of polar approximation in Muon lies in the ill-conditioned momentum matrix, of which the spectrum is substantially improved by Muon$^2$, leading to faster convergence toward a practically sufficient orthogonalization. We further characterize the practical orthogonalization quality via directional alignment, under which Muon$^2$ demonstrates dramatic improvement over Muon at each polar step. Across GPT and LLaMA pre-training experiments from 60M to 1.3B parameters, Muon$^2$ consistently outperforms Muon and recent Muon variants while reducing NS iterations by 40\%. We further introduce Muon$^2$-F, a memory-efficient factorized variant that preserves most of the gains of Muon$^2$ with negligible memory overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes Muon², an extension of the Muon optimizer that applies Adam-style adaptive second-moment preconditioning to the momentum matrix prior to Newton-Schulz (NS) orthogonalization. The central claim is that this improves the spectrum of the ill-conditioned matrix, yielding faster NS convergence to a practically sufficient orthogonal update (characterized via directional alignment), which in turn produces better end-to-end pre-training performance. Experiments on GPT and LLaMA models (60M–1.3B parameters) report consistent outperformance over Muon and recent variants together with a 40% reduction in required NS iterations; a memory-efficient factorized variant (Muon²-F) is also introduced.

Significance. If the empirical results hold under rigorous controls, the work would be significant for large-scale optimization: it directly targets the computational and communication overhead of iterative orthogonalization in matrix-structured optimizers while preserving or improving training dynamics. The introduction of directional alignment as a practical metric for orthogonalization quality is a useful conceptual contribution. The scale of the reported experiments (up to 1.3B parameters) and the introduction of a low-memory variant add practical value, though the absence of detailed protocols limits immediate assessment of reproducibility.

major comments (2)
  1. [§5 (Experiments), Abstract] The headline claims of consistent outperformance across model scales and a 40% reduction in NS iterations are presented without error bars, standard deviations across seeds, or full hyperparameter and baseline implementation details. This gap is load-bearing: without a verifiable experimental protocol, the central empirical claim cannot be independently assessed.
  2. [§3 (Method), §4 (Analysis)] The argument that spectrum improvement via second-moment preconditioning produces higher-quality updates rests on directional alignment as a proxy, yet no ablation isolates this mechanism from side effects such as changes in effective step norms or interactions with the learning-rate schedule. The link between alignment gains and end-to-end loss reduction is asserted but not demonstrated to be causal.
minor comments (2)
  1. [Abstract, §4] The phrase 'dramatic improvement' in directional alignment is qualitative; the corresponding figures or tables should report precise quantitative values (e.g., cosine similarity or Frobenius distance to the true polar factor) at each NS step.
  2. Notation for the preconditioned matrix and the exact NS iteration count used in each experiment should be defined once and used consistently; the current presentation leaves the convergence threshold for 'practically sufficient' orthogonalization implicit.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We appreciate the referee's recognition of the potential significance of Muon² for large-scale optimization and the value of directional alignment as a metric. We address each major comment below and commit to revisions that strengthen the empirical and mechanistic claims.

read point-by-point responses
  1. Referee: [§5 (Experiments), Abstract] The headline claims of consistent outperformance across model scales and a 40% reduction in NS iterations are presented without error bars, standard deviations across seeds, or full hyperparameter and baseline implementation details. This gap is load-bearing: without a verifiable experimental protocol, the central empirical claim cannot be independently assessed.

    Authors: We agree that the lack of error bars, seed-wise standard deviations, and exhaustive protocol details weakens the strength of the reported claims. In the revised version we will re-run the primary GPT and LLaMA experiments (60M–1.3B) with at least three independent random seeds, report means and standard deviations for all metrics, add error bars to the relevant figures, and append a detailed experimental protocol section containing full hyperparameter tables, baseline implementation notes, and training schedules. These additions will directly address reproducibility concerns. revision: yes

  2. Referee: [§3 (Method), §4 (Analysis)] The argument that spectrum improvement via second-moment preconditioning produces higher-quality updates rests on directional alignment as a proxy, yet no ablation isolates this mechanism from side effects such as changes in effective step norms or interactions with the learning-rate schedule. The link between alignment gains and end-to-end loss reduction is asserted but not demonstrated to be causal.

    Authors: This point is well taken. While the manuscript analytically shows spectrum improvement and empirically links it to higher directional alignment and better final loss, it does not contain a controlled ablation that holds effective step norm and learning-rate schedule fixed. In revision we will add a targeted ablation subsection that normalizes update norms across Muon and Muon² variants and sweeps learning-rate schedules independently, thereby isolating the contribution of the second-moment preconditioning to alignment quality and training dynamics. revision: yes
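
The norm-matching half of that ablation might be sketched as follows (hypothetical helper, not the authors' implementation):

```python
import numpy as np

def norm_matched(update, reference, eps=1e-12):
    """Rescale an optimizer's update matrix to the Frobenius norm of a
    reference update, so any remaining difference in training dynamics
    is attributable to direction rather than magnitude."""
    scale = np.linalg.norm(reference) / (np.linalg.norm(update) + eps)
    return update * scale
```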

Circularity Check

0 steps flagged

No circularity; empirical claims rest on independent experiments

full rationale

The paper proposes Muon² as a practical extension of Muon that preconditions the momentum matrix with Adam-style second moments to improve spectral conditioning and accelerate Newton-Schulz iterations. The justification is an empirical observation about spectrum improvement and directional alignment, followed by direct pre-training comparisons on GPT and LLaMA models from 60M to 1.3B parameters showing consistent gains and 40% fewer NS steps. No derivation chain exists that reduces a claimed result to a fitted parameter, self-definition, or load-bearing self-citation; all performance assertions are externally verifiable via the reported training runs rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work builds directly on the existing Muon orthogonalization procedure and Adam-style second-moment estimation without introducing new free parameters, axioms, or invented entities in the provided abstract. The central claim depends on the empirical observation that preconditioning improves conditioning, an assumption drawn from prior optimizer literature rather than new postulates.

pith-pipeline@v0.9.0 · 5531 in / 1223 out tokens · 68709 ms · 2026-05-10T16:56:11.164998+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. PolarAdamW: Disentangling Spectral Control and Schur Gauge-Equivariance in Matrix Optimisation

    cs.LG 2026-05 unverdicted novelty 6.0

    PolarAdamW disentangles spectral control from gauge-equivariance in matrix optimizers, with experiments demonstrating their distinct roles on standard versus symmetry-aware neural networks.

Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1]

    Kimi-VL Technical Report

    Kimi-VL technical report. arXiv preprint arXiv:2504.07491.

  2. [2]

    Zhao et al.

    GaLore: Memory-efficient LLM training by gradient low-rank projection. arXiv preprint arXiv:2403.03507.