pith. sign in

arxiv: 2606.04058 · v2 · pith:UQMKRGHKnew · submitted 2026-06-02 · 💻 cs.LG · cs.AI

Spectral Scaling Laws of Muon

Pith reviewed 2026-06-28 11:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords Muon optimizerNewton-Schulz iterationsingular value spectrumscaling lawsmomentum bufferorthonormalizationlarge language modelsoptimizer stability
0
0 comments X

The pith

Muon momentum matrices stabilize to layer-dependent power laws in model size after brief burn-in.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tracks the singular-value quantiles of Muon momentum buffers across layers in models from 77M to 2.8B parameters. After an initial period the quantiles settle to stable values that obey clean power-law scaling with model size, where the exponents vary strongly by layer depth. Early and mid-late layers scale mildly, around M to the minus 0.25, so the usual five-step Newton-Schulz iteration keeps them orthonormalized at far larger scales. A subset of late layers scale as steeply as M to the minus 0.96 and will enter the regime where Newton-Schulz fails unless more iterations or adjusted coefficients are supplied. The measured exponents therefore supply a practical, layer-aware rule for choosing the smallest Newton-Schulz configuration that still orthonormalizes the directions that matter.

Core claim

After a short burn-in, the quantiles of the singular value spectrum of the momentum buffer stabilize at values determined by layer type and model size; these stabilization values follow power laws in model size M with layer-dependent exponents, mild scaling around M^{-0.25} for layers up to mid-late depth and aggressive scaling up to M^{-0.96} for some late layers.

What carries the argument

Stabilization of singular-value quantiles of the momentum buffer to layer-dependent power laws in model size.

If this is right

  • The standard five-step Newton-Schulz iteration will continue to orthonormalize layers up to mid-late depth at frontier model sizes.
  • Late layers will require more Newton-Schulz iterations or retuned coefficients at large scales to avoid falling into the failure regime.
  • The measured exponents allow a layer-aware choice of the minimal Newton-Schulz configuration that still orthonormalizes important directions without extra computation.
  • The stabilization is a consistent property of the training dynamics across the tested range of model sizes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the exponents persist, training runs at 100B+ parameters will need depth-dependent Newton-Schulz iteration counts that increase toward the output layers.
  • The clean power-law behavior suggests the training dynamics settle into a scale-invariant spectral regime after the burn-in phase.
  • The same measurement protocol could be applied to other orthonormalized optimizers to test whether comparable layer-dependent scaling appears.

Load-bearing premise

The power-law exponents measured on models up to 2.8B parameters will continue to describe the singular-value stabilization behavior at frontier scales.

What would settle it

Train a model larger than 10B parameters, extract the singular-value quantiles of the momentum buffers in late layers, and check whether they lie on the extrapolated M^{-0.96} line from the 77M-2.8B data.

Figures

Figures reproduced from arXiv: 2606.04058 by Asuman Ozdaglar, Gagik Magakyan, Pablo Parrilo.

Figure 1
Figure 1. Figure 1: Scaling laws for the stabilization values of the singular value quantiles of normalized [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The NS map f(σ) = p ◦5 (σ) for the canonical polynomial p(x) = 2x − 1.5x 3 + 0.5x 5 . The left plot shows the full range σ ∈ [0, 1]. The right plot shows a zoom-in view of the region σ ∈ [0, 0.05]. In practice, different implementations use different NS configurations. The NanoGPT speedrun [Jor￾dan et al., 2024a] uses optimized 5-step polynomials (see [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Pre-training models of sizes 77M, 160M, and 354M parameters with rank- [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: We compare training with low-rank r = 0.1 updates from scratch against first running full Muon for 125 (or 250) steps and then switching to low-rank updates. In all cases the gap relative to full Muon remains large, indicating that the performance difference is not simply due to Muon’s faster convergence at the start of training. 3 Spectral Dynamics of the Momentum Buffer In this section we track the quant… view at source ↗
Figure 5
Figure 5. Figure 5: Quantile evolution for the 50% quantile of the normalized singular values for 3 fixed layer [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: We plot the distribution of normalized singular values for the selected momentum matrices [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Quantile dynamics for the 50% quantile of the normalized momentum matrices for the [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: The NS map f(σ) = b5 ◦ b4 ◦ b3 ◦ b2 ◦ b1(σ) for bi in Equation 2. A.2.2 DeepSeek-V4 NS Coefficients [DeepSeek-AI, 2026] uses the following NS coefficients: a(x) = 2x − 1.5x 3 + 0.5x 5 c(x) = 3.4445x − 4.7750x 3 + 2.0315x 5 (3) The full NS map is then f = a ◦2 ◦ c ◦8 . While this approximation is very good (see [PITH_FULL_IMAGE:figures/full_fig_p012_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The NS map f(σ) = a ◦2 ◦ c ◦8 (σ) for a and c in Equation 3. B Appendix B 77M 160M 354M Model size 0.0 0.2 0.4 0.6 0.8 1.0 t / T Fraction of Muon training to match low rank updates p=0.25 [PITH_FULL_IMAGE:figures/full_fig_p013_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: We observe that Muon needs around (80 − 90)% of the iterations to match the final loss of the rank p = 0.25 run. 77M 160M 354M 600M Model size 0.0 0.2 0.4 0.6 0.8 1.0 t / T Fraction of Muon training to match low rank updates p=0.1 [PITH_FULL_IMAGE:figures/full_fig_p013_10.png] view at source ↗
Figure 12
Figure 12. Figure 12: Quantile evolution for the 25% quantile for all layer types and model sizes. [PITH_FULL_IMAGE:figures/full_fig_p014_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Quantile evolution for the 50% quantile for all layer types and model sizes. [PITH_FULL_IMAGE:figures/full_fig_p015_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Normalized singular value spectra of the 2.8B model at step 1450, with the dominant [PITH_FULL_IMAGE:figures/full_fig_p016_14.png] view at source ↗
Figure 15
Figure 15. Figure 15: Quantile dynamics for the 50% quantile of the normalized momentum matrices for the [PITH_FULL_IMAGE:figures/full_fig_p017_15.png] view at source ↗
Figure 16
Figure 16. Figure 16: Scaling laws for all tracked quantiles and layer types across model sizes. [PITH_FULL_IMAGE:figures/full_fig_p018_16.png] view at source ↗
read the original abstract

Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source state-of-the-art models adopting Muon. To keep these updates tractable, Muon performs the orthonormalization with the Newton--Schulz (NS) iteration. Since NS is only approximate, directions with small singular values fail to be orthonormalized. In Muon, NS is applied to the momentum matrix at every step, yet little is known about how the singular value spectrum of these momentum matrices behaves during training, or how that behavior changes with model size. We present the first systematic study of this question. Tracking singular value quantiles of the momentum buffer across layers in models ranging from 77M to 2.8B parameters, we observe a consistent picture: after a short burn-in, the quantiles stabilize at a value determined by the layer type and model size. These stabilization values follow remarkably clean power laws in model size, with layer-dependent exponents. Layers up to mid-late depth scale very mildly with model size $M$ (around $M^{-0.25}$), so the standard 5-step NS configuration used at academic scale will continue to orthonormalize them at much larger scales. Some of the late layers, however, scale much more aggressively (up to $M^{-0.96}$) and will fall into the NS failure regime at frontier scale unless one uses more NS iterations or better-tuned coefficients. NS iterations are computationally expensive at scale; our laws give practitioners a principled, layer-aware recipe for choosing the minimum NS configuration that still orthonormalizes the directions that matter -- avoiding unnecessary computation without sacrificing update quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper conducts the first systematic empirical study of singular-value quantiles in the momentum buffers of Muon-trained transformers. Tracking models from 77M to 2.8B parameters, it reports that after a short burn-in the quantiles stabilize to layer-dependent values that obey clean power-law scaling in model size M, with exponents ranging from approximately M^{-0.25} (early/mid layers) to M^{-0.96} (late layers). The authors conclude that standard 5-step Newton-Schulz (NS) will remain sufficient for most layers at frontier scale but that late layers will require additional iterations or retuned coefficients.

Significance. If the reported layer-dependent power laws continue to hold, the work supplies a practical, data-driven recipe for choosing minimal NS iteration counts per layer, which could reduce unnecessary orthonormalization cost at scale while preserving update quality. The systematic measurement across depth and size, together with the remarkably clean observed fits, constitutes a useful empirical contribution to the study of orthonormalized optimizers. The study remains purely observational; no theoretical derivation or closed-form prediction is offered.

major comments (2)
  1. [§4 and Figure 4] §4 (Scaling Laws) and Figure 4: The power-law exponents are obtained by fitting quantiles measured on models spanning only a factor of ~36 in size (77M–2.8B). The central claim that late layers will enter the NS failure regime at frontier scales rests on these exponents remaining invariant beyond the observed range; no runs at intermediate or larger scales, no ablation on data mix, learning-rate schedule, or NS coefficients, and no demonstration that the post-burn-in plateau is insensitive to initialization are provided.
  2. [§3.2] §3.2 (Quantile measurement protocol): The definition of the “stabilized” quantile (post-burn-in average) and the precise exclusion rules for early training steps are not stated with sufficient precision to allow independent reproduction or assessment of statistical significance of the reported exponents.
minor comments (2)
  1. [Table 1] Table 1: layer-depth binning boundaries are not explicitly listed; readers cannot map the reported exponents back to concrete layer indices without additional assumptions.
  2. [Figure 2] Figure 2 caption: the y-axis label “quantile” should specify whether it is the 0.01, 0.05, or median singular value to avoid ambiguity when comparing panels.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful and constructive review. We respond point-by-point to the major comments below.

read point-by-point responses
  1. Referee: [§4 and Figure 4] §4 (Scaling Laws) and Figure 4: The power-law exponents are obtained by fitting quantiles measured on models spanning only a factor of ~36 in size (77M–2.8B). The central claim that late layers will enter the NS failure regime at frontier scales rests on these exponents remaining invariant beyond the observed range; no runs at intermediate or larger scales, no ablation on data mix, learning-rate schedule, or NS coefficients, and no demonstration that the post-burn-in plateau is insensitive to initialization are provided.

    Authors: We agree that the model-size range is limited and that the extrapolation to frontier scales assumes continued invariance of the observed exponents. The power-law fits are remarkably clean and consistent across layers within the studied range, which forms the core empirical contribution. We will add a limitations subsection explicitly discussing the restricted scale range, the assumptions required for extrapolation, and the desirability of future validation at larger scales. We cannot supply additional runs, ablations, or initialization-sensitivity tests at this time. revision: partial

  2. Referee: [§3.2] §3.2 (Quantile measurement protocol): The definition of the “stabilized” quantile (post-burn-in average) and the precise exclusion rules for early training steps are not stated with sufficient precision to allow independent reproduction or assessment of statistical significance of the reported exponents.

    Authors: We accept this criticism. The revised manuscript will state the protocol with full precision: the exact burn-in step threshold, the number of subsequent steps over which the quantile is averaged, the precise exclusion rule for early steps, and any associated statistical measures used to assess stability. revision: yes

standing simulated objections not resolved
  • Additional experiments at larger model scales, ablations on data mix, learning-rate schedule, NS coefficients, and sensitivity of the post-burn-in plateau to initialization.

Circularity Check

0 steps flagged

No circularity; purely observational power-law fits on measured quantiles

full rationale

The paper measures singular-value quantiles of momentum buffers in models from 77M to 2.8B parameters, observes post-burn-in stabilization, and fits power laws in model size M with layer-dependent exponents. These are direct empirical fits to data; the reported stabilization values and exponents are not defined in terms of each other by any equation in the paper, nor do any 'predictions' for frontier scales reduce by construction to the fitted inputs. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The derivation chain is self-contained observational analysis.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation that singular-value quantiles stabilize after a short burn-in period and that the stabilized values obey power laws whose exponents are layer-dependent. No free parameters are introduced beyond the fitted exponents themselves; no new entities are postulated.

free parameters (1)
  • layer-dependent power-law exponents
    Exponents such as -0.25 and -0.96 are obtained by fitting the observed stabilization values to model size M.
axioms (1)
  • domain assumption Singular-value quantiles of the momentum buffer reach a stable value after a short burn-in phase that is independent of further training steps.
    The power-law analysis is performed on these stabilized values; the abstract states this stabilization occurs consistently across the studied models.

pith-pipeline@v0.9.1-grok · 5833 in / 1478 out tokens · 35744 ms · 2026-06-28T11:16:54.860341+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

16 extracted references · 14 canonical work pages · 8 internal anchors

  1. [1]

    Disentangling adaptive gradient methods from learning rates.arXiv preprint arXiv:2002.11803,

    Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, and Cyril Zhang. Disentangling adaptive gradient methods from learning rates.arXiv preprint arXiv:2002.11803,

  2. [2]

    Dion2: A simple method to shrink matrix in muon

    Kwangjun Ahn, Noah Amsel, and John Langford. Dion2: A simple method to shrink matrix in muon. arXiv preprint arXiv:2512.16928, 2025a. Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates.arXiv preprint arXiv:2504.05295, 2025b. Rohan Anil, Vineet Gupta, T...

  3. [3]

    DeepSeek-V3 Technical Report

    URL https://leloykun.github. io/ponder/muon-opt-coeffs/. DeepSeek-AI. Deepseek-v3 technical report.arXiv preprint arXiv:2412.19437,

  4. [4]

    GLM-5: from Vibe Coding to Agentic Engineering

    URL https://www. essential.ai/research/infra. GLM-5 Team. Glm-5: from vibe coding to agentic engineering.arXiv preprint arXiv:2602.15763,

  5. [5]

    Training Compute-Optimal Large Language Models

    Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models.arXiv preprint arXiv:2203.15556,

  6. [6]

    Scaling Laws for Neural Language Models

    Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024a. URLhttps://github.com/KellerJordan/modded-nanogpt. Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy B...

  7. [7]

    10 Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, et al. Kimi k2.5: Visual agentic intelligence.arXiv preprint arXiv:2602.02276,

  8. [8]

    NorMuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491,

    Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. NorMuon: Making muon more efficient and scalable.arXiv preprint arXiv:2510.05491,

  9. [9]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, et al. Muon is scalable for llm training.arXiv preprint arXiv:2502.16982,

  10. [10]

    The Llama 3 Herd of Models

    Llama Team. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  11. [11]

    A distributed data-parallel PyTorch implementation of the distributed shampoo optimizer for training neural networks at-scale.arXiv preprint arXiv:2309.06497,

    Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A distributed data-parallel PyTorch implementation of the distributed shampoo optimizer for training neural networks at-scale.arXiv preprint arXiv:2309.06497,

  12. [12]

    2 OLMo 2 Furious

    Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, et al. 2 olmo 2 furious.arXiv preprint arXiv:2501.00656,

  13. [13]

    Fantastic pretraining optimizers and where to find them.arXiv preprint arXiv:2509.02046,

    Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them.arXiv preprint arXiv:2509.02046,

  14. [14]

    Tensor programs v: Tuning large neural networks via zero-shot hyperparameter transfer.arXiv preprint arXiv:2203.03466,

    Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jian Gao. Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer.arXiv preprint arXiv:2203.03466,

  15. [15]

    A Appendix A.1 Details on pre-training We used the modded-nanogpt codebase [Jordan et al., 2024a] for all experiments. All matrix-valued parameters are trained with Muon, while non-matrix parameters (embeddings, LM head, and biases) are trained with AdamW with (β1, β2) = (0.9,0.95) and learning rate 0.002. For both optimizers we use a weight decay of 0.01...

  16. [16]

    While this approximation is very good (see Figure 9), is uses 10 steps and hence is computationally more expensive

    A.2.2 DeepSeek-V4 NS Coefficients [DeepSeek-AI, 2026] uses the following NS coefficients: a(x) = 2x−1.5x 3 + 0.5x5 c(x) = 3.4445x−4.7750x 3 + 2.0315x5 (3) The full NS map is then f=a ◦2 ◦c ◦8. While this approximation is very good (see Figure 9), is uses 10 steps and hence is computationally more expensive. 12 0.0 0.2 0.4 0.6 0.8 1.0 Singular value 0.0 0....