pith. machine review for the scientific record.

arXiv: 2602.21545 · v3 · submitted 2026-02-25 · cs.LG

MUON+: Towards More Effective Muon via One Additional Normalization Step for LLM Pre-training

Pith reviewed 2026-05-15 19:21 UTC · model grok-4.3

classification: cs.LG
keywords: Muon optimizer · LLM pre-training · polar orthogonalization · normalization step · blockwise descent · norm imbalance · large language models · second-order optimization

The pith

Muon+ adds one normalization step after polar orthogonalization to fix norm imbalance and improve LLM pre-training over Muon.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that Muon's Newton-Schulz polar iterations, intended to orthogonalize momentum updates, actually amplify column- and row-wise norm imbalances in practice. This imbalance tightens the second-order term in a blockwise descent analysis and weakens Muon's per-step progress guarantee. Muon+ inserts a single normalization step right after the polar step, adding no optimizer state or memory cost. Experiments across GPT and LLaMA models from 60M to 7B parameters, under both compute-optimal and extended token budgets, show lower training and validation perplexity and faster overall pre-training.
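
The update direction described above can be sketched in NumPy under stated assumptions. The quintic Newton-Schulz coefficients are those of the public Muon reference implementation, not values confirmed for this paper. The row-wise or column-wise target in `normalize` is a hypothetical choice, since this page does not state the paper's normalization target. All function names are illustrative, and learning rate, momentum accumulation, and any shape-dependent scaling are omitted.

```python
import numpy as np


def newton_schulz(G, steps=5, eps=1e-7):
    """Approximate the polar factor of G with a fixed number of quintic iterations.

    Assumption: coefficients follow the public Muon reference implementation.
    """
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (np.linalg.norm(G) + eps)  # Frobenius scaling keeps singular values <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:  # iterate on the wide orientation so X @ X.T is the small Gram matrix
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X


def normalize(O, mode="row", eps=1e-7):
    """Hypothetical post-polar normalization: rescale each row (or column) to unit L2 norm."""
    if mode == "row":
        return O / (np.linalg.norm(O, axis=1, keepdims=True) + eps)
    if mode == "col":
        return O / (np.linalg.norm(O, axis=0, keepdims=True) + eps)
    raise ValueError(f"unknown mode: {mode!r}")


def muon_plus_direction(momentum, mode="row"):
    """Polar step followed by one normalization step; no extra optimizer state is kept."""
    return normalize(newton_schulz(momentum), mode=mode)


rng = np.random.default_rng(0)
M = rng.standard_normal((32, 64))  # synthetic stand-in for a momentum matrix
D = muon_plus_direction(M, mode="row")
print(D.shape, np.linalg.norm(D, axis=1)[:3])
```

The normalization is a pure function of the polar output, which is consistent with the claim that the fix adds no optimizer state.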

Core claim

Muon suffers from a post-polar imbalanced update problem in which practical polar steps amplify norm imbalance, weakening the descent guarantee; Muon+ corrects this by adding one normalization step after orthogonalization, yielding consistent gains in perplexity and pre-training speed on models up to 7B parameters without extra state.

What carries the argument

The post-polar imbalanced update problem, where polar iterations fail to equalize column and row norms, corrected by inserting one normalization step after the Newton-Schulz orthogonalization.

Load-bearing premise

The assumption that post-polar norm imbalance is the dominant practical limitation of Muon and that the added normalization corrects it without offsetting drawbacks at other scales or regimes.

What would settle it

A head-to-head pre-training run on a 7B model in which Muon+ shows no improvement over Muon under matched hyperparameters, either in final validation perplexity or in the number of tokens needed to reach a given perplexity, would falsify the claim.

Figures

Figures reproduced from arXiv: 2602.21545 by Liyan Tan, Ruijie Zhang, Yequan Zhao, Yupeng Su, Zhengyang Wang, Zheng Zhang, Ziyue Liu.

Figure 1: Pre-training GPT and LLaMA models at scales ranging from 130M to 1B parameters.
Figure 2: Training loss curves under overtraining for GPT-Base and LLaMA-350M.
Figure 3: Validation perplexity sweep for LLaMA models under different settings. Here “none …
Figure 4: Validation perplexity sweep for GPT models under different settings.
Original abstract

Muon has recently emerged as a strong optimizer for large language model pre-training, orthogonalizing the momentum matrix via Newton-Schulz polar iterations. A natural intuition is that polar iterations, by flattening the singular spectrum to all ones, should also eliminate column- and row-wise norm imbalance in the update. We show that this is not true in practice: practical polar steps can substantially amplify the imbalance. We term this the post-polar imbalanced update problem, and prove that such imbalance tightens the second-order term in a blockwise descent analysis, weakening Muon's per-step descent guarantee. Motivated by this analysis, we propose Muon+, a one-line fix that inserts a single normalization step after polar orthogonalization. Muon+ adds no optimizer state. Across pre-training experiments on GPT and LLaMA models from 60M to 7B parameters, spanning both compute-optimal budgets and extended token-to-parameter ratios up to approximately 200, Muon+ consistently outperforms Muon in terms of training and validation perplexity, leading to significant overall pre-training speedup.
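
One way to see where the intuition holds and where it can fail is sketched below. The notation is introduced here for illustration and this is not the paper's proof. Write the output of the polar step as $O = U\,\mathrm{diag}(s)\,V^\top$, where $U \in \mathbb{R}^{m \times r}$ and $V \in \mathbb{R}^{n \times r}$ have orthonormal columns and $s_k$ are the singular values after the iterations. The squared row and column norms are then:

```latex
\|O_{i,:}\|_2^2 \;=\; \sum_{k=1}^{r} s_k^2\, U_{ik}^2,
\qquad
\|O_{:,j}\|_2^2 \;=\; \sum_{k=1}^{r} s_k^2\, V_{jk}^2 .
```

If $s_k = 1$ for all $k$ and $r = m \le n$, then $U$ is a square orthogonal matrix and every row norm equals 1, while the squared column norms $\sum_k V_{jk}^2$ sum to $m$ across $n$ columns and need not be equal unless $n = m$. When the $s_k$ are only approximately 1, the weights $s_k^2$ differ across $k$, so the row norms are no longer forced to be equal either.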

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper identifies a post-polar imbalanced update problem in the Muon optimizer, where practical Newton-Schulz polar iterations amplify row/column norm imbalance in the momentum matrix. It proves via blockwise descent analysis that this imbalance tightens the second-order term and weakens per-step descent guarantees. Muon+ is proposed as a one-line, state-free fix inserting a single normalization step after polar orthogonalization. Experiments on GPT and LLaMA models (60M–7B parameters) under compute-optimal and extended token budgets report consistent gains in training/validation perplexity and overall pre-training speedup.

Significance. If the central claim holds, Muon+ offers a lightweight, memory-neutral improvement to a strong recent optimizer for LLM pre-training, with potential for broad adoption across scales. The blockwise descent analysis supplies theoretical motivation distinguishing the fix from generic scaling, and the empirical scope (multiple architectures, sizes, and token-to-parameter ratios up to ~200) supports practical relevance. The work strengthens the case for orthogonalized momentum methods when the diagnosed mechanism is confirmed.

major comments (2)
  1. [§4] Empirical results: the reported end-to-end perplexity gains do not isolate the post-polar imbalance mechanism. The paper provides no direct measurements of row/column norm variance before and after the added step, no ablation replacing polar iterations with a pure scaling baseline, and no sensitivity checks on the normalization target (e.g., Frobenius vs. max-norm). The speedup could therefore arise from any effective step-size change rather than from the claimed tightening of the descent bound.
  2. [Blockwise descent analysis] Around the equation analyzing the second-order term: the proof that practical polar steps amplify imbalance and weaken the guarantee assumes specific conditions on the singular spectrum and iteration count. These must be verified against the exact Newton-Schulz implementation used in the code, including any early stopping or tolerance settings, to confirm the analysis is not circular with the proposed fix.
minor comments (2)
  1. [Algorithm 1] Pseudocode: explicitly state the normalization target (e.g., divide by Frobenius norm or max row norm) and confirm it adds no extra optimizer state or hyperparameters.
  2. [§4] Figure captions and tables: report the number of independent runs and any statistical significance tests for the perplexity differences to support the claim of 'consistent' outperformance.
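
The measurement requested in major comment 1 could look like the sketch below. It is illustrative only: the metric (coefficient of variation of row and column norms), the function name `norm_imbalance`, and the synthetic matrix are assumptions, not quantities taken from the paper. It also shows one reason the choice of normalization target matters: a single global rescale, such as dividing by the Frobenius norm, leaves this imbalance metric unchanged, whereas a per-row rescale removes row imbalance entirely.

```python
import numpy as np


def norm_imbalance(U):
    """Return (row_cv, col_cv): std / mean of the row and column L2 norms of U."""
    row = np.linalg.norm(U, axis=1)
    col = np.linalg.norm(U, axis=0)
    return row.std() / row.mean(), col.std() / col.mean()


rng = np.random.default_rng(0)
# Synthetic update with deliberately uneven row scales (not data from the paper).
U = rng.standard_normal((16, 32)) * np.linspace(0.5, 2.0, 16)[:, None]

before = norm_imbalance(U)
# Global rescale by the Frobenius norm: changes step size, not imbalance.
scaled = norm_imbalance(U / np.linalg.norm(U))
# Per-row rescale: every row norm becomes 1, so row imbalance vanishes.
rowwise = norm_imbalance(U / np.linalg.norm(U, axis=1, keepdims=True))

print("before :", before)
print("global :", scaled)
print("rowwise:", rowwise)
```

Logging such a metric per layer, before and after the added step, would separate a change in imbalance from a change in effective step size.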

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential of Muon+ as a lightweight improvement to the Muon optimizer. We address each major comment below.

Point-by-point responses
  1. Referee: [§4] Empirical results: the reported end-to-end perplexity gains do not isolate the post-polar imbalance mechanism. The paper provides no direct measurements of row/column norm variance before and after the added step, no ablation replacing polar iterations with a pure scaling baseline, and no sensitivity checks on the normalization target (e.g., Frobenius vs. max-norm). The speedup could therefore arise from any effective step-size change rather than from the claimed tightening of the descent bound.

    Authors: We acknowledge that the current experiments demonstrate end-to-end performance gains without directly measuring the norm imbalance or providing the suggested ablations. The blockwise descent analysis in the paper provides the theoretical foundation linking the post-polar imbalance to weakened descent guarantees. In the revised version we will add direct measurements of row and column norm variances before and after the additional normalization step in representative layers. We will also include an ablation study comparing Muon+ to a baseline that applies scaling without polar orthogonalization, to better isolate the effect. For the normalization target, we will add a brief discussion explaining the choice of Frobenius norm based on the analysis of the second-order term.
    Revision: yes

  2. Referee: [Blockwise descent analysis] Around the equation analyzing the second-order term: the proof that practical polar steps amplify imbalance and weaken the guarantee assumes specific conditions on the singular spectrum and iteration count. These must be verified against the exact Newton-Schulz implementation used in the code, including any early stopping or tolerance settings, to confirm the analysis is not circular with the proposed fix.

    Authors: The analysis assumes that the Newton-Schulz iterations are run for a fixed number of steps sufficient to approximate the polar factor, without early stopping. In our implementation, we use a fixed 5 iterations, as is standard in the Muon codebase, with no tolerance-based stopping. We have confirmed that under these settings the singular values are driven close to 1 and the imbalance amplification occurs as described. We will include a verification note and perhaps a small plot of singular value convergence in the revised manuscript to make this explicit.
    Revision: yes
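
The singular-value check mentioned in this response can be approximated without any matrix computation, because an odd-polynomial Newton-Schulz step maps $U\,\mathrm{diag}(s)\,V^\top$ to $U\,\mathrm{diag}(a s + b s^3 + c s^5)\,V^\top$ and so acts on each singular value independently. The sketch below iterates that scalar map five times. The coefficients are an assumption taken from the public Muon reference implementation, and the starting grid of singular values is hypothetical.

```python
import numpy as np

# Assumed coefficients (public Muon reference implementation), not confirmed for this paper.
A, B, C = 3.4445, -4.7750, 2.0315


def ns_scalar_map(sigma, steps=5):
    """Apply the quintic Newton-Schulz polynomial to singular values `steps` times."""
    s = np.asarray(sigma, dtype=float)
    for _ in range(steps):
        s = A * s + B * s**3 + C * s**5
    return s


# Hypothetical starting spectrum after Frobenius scaling: values in (0, 1].
sigma0 = np.linspace(0.01, 1.0, 1000)
sigma5 = ns_scalar_map(sigma0, steps=5)
print("range after 5 steps:", sigma5.min(), sigma5.max())
```

With these coefficients the values settle into a band around 1 rather than converging to exactly 1, so "close to 1" is worth quantifying in any verification plot.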

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The paper's chain begins with an independent blockwise descent analysis that identifies post-polar norm imbalance, proves its tightening effect on the second-order term, and motivates the added normalization step as a direct response. No equation reduces the Muon+ fix to a fitted quantity defined by the same data, no prediction is statistically forced by construction, and no self-citation or ansatz is invoked to justify the core claim. The reported perplexity gains on GPT/LLaMA runs are presented as empirical outcomes, not as outputs that collapse back to the input analysis by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the validity of the blockwise descent analysis that links post-polar imbalance to a weakened second-order term and on the empirical observation that the added normalization improves perplexity across the tested model sizes.

axioms (1)
  • Domain assumption: the assumptions of the blockwise descent analysis hold for the Muon momentum update.
    Invoked to prove that imbalance tightens the second-order term and weakens the per-step descent guarantee.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MuonEq: Balancing Before Orthogonalization with Lightweight Equilibration

    cs.LG 2026-03 unverdicted novelty 6.0

    MuonEq introduces pre-orthogonalization equilibration schemes that improve Muon optimizer performance during large language model pretraining.

A Hyperparameter

A.1 Model Configurations

Model        n_embd   n_layer   n_head   Param (M)
GPT-Small       768        12       12         124
GPT-Base       1024        24       16         362
GPT-Large      1280        36       20         774

Table 7: Architecture configura...