arxiv: 2602.21545 · v3 · submitted 2026-02-25 · 💻 cs.LG


MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training

Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, Yupeng Su, Liyan Tan, Zheng Zhang


Pith reviewed 2026-05-15 19:21 UTC · model grok-4.3

classification 💻 cs.LG

keywords: Muon optimizer · LLM pre-training · polar orthogonalization · normalization step · blockwise descent · norm imbalance · large language models · second-order optimization


The pith

Muon+ adds one normalization step after polar orthogonalization to fix norm imbalance and improve LLM pre-training over Muon.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Muon's Newton-Schulz polar iterations, intended to orthogonalize momentum updates, actually amplify column- and row-wise norm imbalances in practice. This imbalance tightens the second-order term in a blockwise descent analysis and weakens Muon's per-step progress guarantee. Muon+ inserts a single normalization step right after the polar step, adding no optimizer state or memory cost. Experiments across GPT and LLaMA models from 60M to 7B parameters, under both compute-optimal and extended token budgets, show lower training and validation perplexity and faster overall pre-training.
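
As a rough illustration of the mechanism described above, the following NumPy sketch applies a fixed number of Newton-Schulz steps to a momentum matrix and then normalizes the result. The quintic coefficients are assumed from the public Muon implementation. The row-wise L2 normalization and the names `newton_schulz`, `normalize_rows`, and `muon_plus_update` are hypothetical choices made for illustration, since this summary does not specify which normalization the paper uses.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximate the polar factor of G with a quintic Newton-Schulz iteration.

    Coefficients are assumed from the public Muon code, not taken from the paper.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)    # Frobenius scaling keeps singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                        # iterate on the wide orientation (smaller Gram matrix)
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        B = b * A + c * (A @ A)
        X = a * X + B @ X
    return X.T if transposed else X

def normalize_rows(X, eps=1e-7):
    """Hypothetical post-polar step: rescale every row to unit L2 norm."""
    return X / (np.linalg.norm(X, axis=1, keepdims=True) + eps)

def muon_plus_update(momentum, steps=5):
    """Orthogonalize the momentum, then apply one normalization (illustrative only)."""
    return normalize_rows(newton_schulz(momentum, steps))

rng = np.random.default_rng(0)
momentum = rng.standard_normal((64, 32))  # stand-in for one layer's momentum matrix
update = muon_plus_update(momentum)
```

Because the normalization is a pure function of the orthogonalized matrix, the sketch keeps no buffers beyond the momentum itself, which is consistent with the claim that the step adds no optimizer state.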

Core claim

Muon suffers from a post-polar imbalanced update problem in which practical polar steps amplify norm imbalance, weakening the descent guarantee; Muon+ corrects this by adding one normalization step after orthogonalization, yielding consistent gains in perplexity and pre-training speed on models up to 7B parameters without extra state.

What carries the argument

The post-polar imbalanced update problem, where polar iterations fail to equalize column and row norms, corrected by inserting one normalization step after the Newton-Schulz orthogonalization.

Load-bearing premise

The assumption that post-polar norm imbalance is the dominant practical limitation of Muon and that the added normalization corrects it without offsetting drawbacks at other scales or regimes.

What would settle it

A head-to-head pre-training run on a 7B model where Muon+ shows no improvement in final validation perplexity or total tokens processed compared with Muon under matched hyperparameters would falsify the claim.

Figures

Figures reproduced from arXiv: 2602.21545 by Liyan Tan, Ruijie Zhang, Yequan Zhao, Yupeng Su, Zhengyang Wang, Zheng Zhang, Ziyue Liu.

**Figure 2.** Training loss curves under overtraining for GPT-Base and LLaMA-350M.

**Figure 3.** Validation perplexity sweep for LLaMA models under different settings. Here “none …

**Figure 4.** Validation perplexity sweep for GPT models under different settings.

Original abstract

Muon has recently emerged as a strong optimizer for large language model pre-training, orthogonalizing the momentum matrix via Newton-Schulz polar iterations. A natural intuition is that polar iterations, by flattening the singular spectrum to all ones, should also eliminate column- and row-wise norm imbalance in the update. We show that this is not true in practice: practical polar steps can substantially amplify the imbalance. We term this the post-polar imbalanced update problem, and prove that such imbalance tightens the second-order term in a blockwise descent analysis, weakening Muon's per-step descent guarantee. Motivated by this analysis, we propose Muon+, a one-line fix that inserts a single normalization step after polar orthogonalization. Muon+ adds no optimizer state. Across pre-training experiments on GPT and LLaMA models from 60M to 7B parameters, spanning both compute-optimal budgets and extended token-to-parameter ratios up to approximately 200, Muon+ consistently outperforms Muon in terms of training and validation perplexity, leading to significant overall pre-training speedup.
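
One way to see why an imperfectly flattened spectrum can leave row and column norms uneven is to write the norms in terms of a thin SVD. The notation below ($X$, $U$, $V$, $s_k$, $r$) is introduced here for illustration and is not taken from the paper; it is a standard identity, not the paper's blockwise analysis.

```latex
% Approximate polar output with thin SVD X = U \,\mathrm{diag}(s)\, V^\top,
% X \in \mathbb{R}^{m \times n}, \; r = \min(m, n).
\[
  \lVert X_{i,:} \rVert_2^2 = \sum_{k=1}^{r} s_k^2 \, U_{ik}^2,
  \qquad
  \lVert X_{:,j} \rVert_2^2 = \sum_{k=1}^{r} s_k^2 \, V_{jk}^2 .
\]
% Square case with an exact polar factor (m = n, s_k = 1 for all k):
\[
  \lVert X_{i,:} \rVert_2^2 = \sum_{k=1}^{n} U_{ik}^2 = 1,
  \qquad
  \lVert X_{:,j} \rVert_2^2 = \sum_{k=1}^{n} V_{jk}^2 = 1 .
\]
```

When the $s_k$ are only approximately one, each row or column norm becomes a weighted sum that depends on how that row of $U$ or $V$ loads on the singular directions, so the norms need not be equal. In the rectangular case with $m > n$, the rows of the thin factor $U$ are not unit vectors, so row norms can differ even when every $s_k$ equals one.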

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Muon+ adds one post-polar normalization to the existing Muon optimizer and reports steady perplexity gains on GPT and LLaMA pre-trains up to 7B, but the experiments do not isolate whether the fix works through the claimed descent mechanism or through incidental rescaling.


The main takeaway is that adding one normalization step after polar orthogonalization in Muon gives consistent perplexity improvements when pre-training models up to 7B parameters, without extra memory. The paper does a good job laying out why practical polar iterations can amplify row- and column-wise norm imbalance, using a blockwise descent analysis to show how that weakens the per-step guarantee. Muon+ is the one-line patch, and the experiments on GPT and LLaMA across scales and token budgets show it outperforming the baseline Muon.

What stands out is the simplicity and the fact that the gains hold in both compute-optimal and extended training regimes. The analysis provides a clear motivation rather than just an empirical hack.

The weaker part is that the experiments stay at the level of end-to-end perplexity. There is no direct tracking of the norm imbalance metric the authors diagnose, no ablation that tests a generic scaling instead of their specific normalization, and no variation of the polar iteration count to see whether the problem scales as predicted. This leaves room for the gains to come from an unintended change in update magnitude rather than from the claimed fix to the descent bound.

Readers working on optimizer design for large models will get the most out of this. It is worth sending to peer review because the change is cheap to implement, the results are positive at relevant scales, and the analysis is a step toward understanding why Muon behaves the way it does.

Referee Report

2 major / 2 minor

Summary. The paper identifies a post-polar imbalanced update problem in the Muon optimizer, where practical Newton-Schulz polar iterations amplify row/column norm imbalance in the momentum matrix. It proves via blockwise descent analysis that this imbalance tightens the second-order term and weakens per-step descent guarantees. Muon+ is proposed as a one-line, state-free fix inserting a single normalization step after polar orthogonalization. Experiments on GPT and LLaMA models (60M–7B parameters) under compute-optimal and extended token budgets report consistent gains in training/validation perplexity and overall pre-training speedup.

Significance. If the central claim holds, Muon+ offers a lightweight, memory-neutral improvement to a strong recent optimizer for LLM pre-training, with potential for broad adoption across scales. The blockwise descent analysis supplies theoretical motivation distinguishing the fix from generic scaling, and the empirical scope (multiple architectures, sizes, and token-to-parameter ratios up to ~200) supports practical relevance. The work strengthens the case for orthogonalized momentum methods when the diagnosed mechanism is confirmed.

major comments (2)

[§4] Empirical results: the reported end-to-end perplexity gains do not isolate the post-polar imbalance mechanism. No direct measurements of row/column norm variance before/after the added step, no ablation replacing polar iterations with a pure scaling baseline, and no sensitivity checks on the normalization target (e.g., Frobenius vs. max-norm) are provided; thus the speedup could arise from any effective step-size change rather than the claimed tightening of the descent bound.
[Blockwise descent analysis] Around the equation analyzing the second-order term: the proof that practical polar steps amplify imbalance and weaken the guarantee assumes specific conditions on the singular spectrum and iteration count; these must be verified against the exact Newton-Schulz implementation used in the code, including any early stopping or tolerance settings, to confirm the analysis is not circular with the proposed fix.
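
The measurement and ablation requested in the first major comment could be sketched along the following lines. The max/min norm ratio used as the imbalance metric, the function names, and the synthetic stand-in matrix are all assumptions made for illustration; they are not the paper's metric or data.

```python
import numpy as np

def imbalance(X, axis):
    """Max/min ratio of L2 norms along `axis` (1 = row norms, 0 = column norms)."""
    norms = np.linalg.norm(X, axis=axis)
    return norms.max() / norms.min()

def scale_frobenius(X):
    """Generic rescaling baseline: a single global factor."""
    return X / np.linalg.norm(X)

def normalize_rows(X):
    """Row-wise rescaling: every row is brought to unit L2 norm."""
    return X / np.linalg.norm(X, axis=1, keepdims=True)

rng = np.random.default_rng(0)
# Synthetic stand-in for an update matrix whose rows have uneven norms.
X = rng.standard_normal((64, 32)) * rng.uniform(0.5, 2.0, size=(64, 1))

report = {
    "raw": imbalance(X, axis=1),
    "frobenius": imbalance(scale_frobenius(X), axis=1),
    "row_normalized": imbalance(normalize_rows(X), axis=1),
}
for name, ratio in report.items():
    print(f"{name:>15}: row-norm max/min ratio = {ratio:.3f}")
```

A single global factor leaves the max/min ratio untouched, while row-wise normalization drives it to one, so logging this kind of ratio alongside a global-scaling baseline is one way to separate a step-size effect from a rebalancing effect.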

minor comments (2)

[Algorithm 1] Pseudocode: explicitly state the normalization target (e.g., divide by Frobenius norm or max row norm) and confirm it adds no extra optimizer state or hyperparameters.
[§4] Figure captions and tables: report the number of independent runs and any statistical significance tests for the perplexity differences to support the claim of 'consistent' outperformance.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of Muon+ as a lightweight improvement to the Muon optimizer. We address each major comment below.


Referee: [§4] Empirical results: the reported end-to-end perplexity gains do not isolate the post-polar imbalance mechanism. No direct measurements of row/column norm variance before/after the added step, no ablation replacing polar iterations with a pure scaling baseline, and no sensitivity checks on the normalization target (e.g., Frobenius vs. max-norm) are provided; thus the speedup could arise from any effective step-size change rather than the claimed tightening of the descent bound.

Authors: We acknowledge that the current experiments demonstrate end-to-end performance gains without directly measuring the norm imbalance or providing the suggested ablations. The blockwise descent analysis in the paper provides the theoretical foundation linking the post-polar imbalance to weakened descent guarantees. To address this, in the revised version we will add direct measurements of row and column norm variances before and after the additional normalization step in representative layers. We will also include an ablation study comparing Muon+ to a baseline that applies scaling without polar orthogonalization to better isolate the effect. For the normalization target, we will add a brief discussion explaining the choice of Frobenius norm based on the analysis of the second-order term. revision: yes
Referee: [Blockwise descent analysis] Around the equation analyzing the second-order term: the proof that practical polar steps amplify imbalance and weaken the guarantee assumes specific conditions on the singular spectrum and iteration count; these must be verified against the exact Newton-Schulz implementation used in the code, including any early stopping or tolerance settings, to confirm the analysis is not circular with the proposed fix.

Authors: The analysis assumes that the Newton-Schulz iterations are run for a fixed number of steps sufficient to approximate the polar factor, without early stopping. In our implementation, we use a fixed 5 iterations as is standard in the Muon codebase, with no tolerance-based stopping. We have confirmed that under these settings, the singular values are driven close to 1, and the imbalance amplification occurs as described. We will include a verification note and perhaps a small plot of singular value convergence in the revised manuscript to make this explicit. revision: yes
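
The singular-value check mentioned in this response might look like the sketch below, which tracks the extreme singular values after each of a fixed number of Newton-Schulz steps. The quintic coefficients are assumed from the public Muon implementation, and the random matrix is a stand-in for a real momentum matrix; neither is confirmed by the text above.

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Fixed-step quintic Newton-Schulz iteration, with no tolerance-based stopping.

    Coefficients are assumed from the public Muon code.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)    # Frobenius scaling keeps singular values <= 1
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((32, 64))         # wide stand-in for a momentum matrix

for steps in range(6):
    s = np.linalg.svd(newton_schulz(G, steps), compute_uv=False)
    print(f"steps={steps}  sigma_min={s.min():.3f}  sigma_max={s.max():.3f}")
```

Plotting `sigma_min` and `sigma_max` against the step count would show how tightly the spectrum clusters after five steps, which is the condition the analysis is said to rely on.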

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained


The paper's chain begins with an independent blockwise descent analysis that identifies post-polar norm imbalance, proves its tightening effect on the second-order term, and motivates the added normalization step as a direct response. No equation reduces the Muon+ fix to a fitted quantity defined by the same data, no prediction is statistically forced by construction, and no self-citation or ansatz is invoked to justify the core claim. The reported perplexity gains on GPT/LLaMA runs are presented as empirical outcomes, not as outputs that collapse back to the input analysis by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the blockwise descent analysis that links post-polar imbalance to a weakened second-order term and on the empirical observation that the added normalization improves perplexity across the tested model sizes.

axioms (1)

Domain assumption: blockwise descent analysis assumptions hold for the Muon momentum update.
Invoked to prove that imbalance tightens the second-order term and weakens the per-step descent guarantee.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration
cs.LG 2026-03 unverdicted novelty 6.0

MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 1 Pith paper · 13 internal anchors
