Spectral Scaling Laws of Muon

Asuman Ozdaglar; Gagik Magakyan; Pablo Parrilo

arxiv: 2606.04058 · v2 · pith:UQMKRGHKnew · submitted 2026-06-02 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 11:16 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords Muon optimizer · Newton-Schulz iteration · singular value spectrum · scaling laws · momentum buffer · orthonormalization · large language models · optimizer stability

The pith

Muon momentum matrices stabilize to layer-dependent power laws in model size after brief burn-in.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tracks the singular-value quantiles of Muon momentum buffers across layers in models from 77M to 2.8B parameters. After a short burn-in the quantiles settle to stable values that obey clean power-law scaling with model size M, with exponents that vary strongly with layer depth. Layers up to mid-late depth scale mildly, around M^{-0.25}, so the usual five-step Newton-Schulz iteration keeps orthonormalizing them at far larger scales. A subset of late layers scales as steeply as M^{-0.96} and will enter the regime where Newton-Schulz fails unless more iterations or adjusted coefficients are supplied. The measured exponents therefore supply a practical, layer-aware rule for choosing the smallest Newton-Schulz configuration that still orthonormalizes the directions that matter.

Core claim

After a short burn-in, the quantiles of the singular value spectrum of the momentum buffer stabilize at values determined by layer type and model size; these stabilization values follow power laws in model size M with layer-dependent exponents, mild scaling around M^{-0.25} for layers up to mid-late depth and aggressive scaling up to M^{-0.96} for some late layers.
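
A power law in M is a straight line in log-log space, so the exponents described here can be estimated with an ordinary least-squares fit on logarithms. The following is a minimal sketch under stated assumptions: `fit_power_law` is a hypothetical helper name, and the stabilization values are synthetic numbers generated from an exact M^{-0.25} law, not measurements from the paper.

```python
import math

def fit_power_law(sizes, values):
    """Fit values ~ c * sizes**alpha by least squares on the logarithms."""
    xs = [math.log(m) for m in sizes]
    ys = [math.log(v) for v in values]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    alpha = sxy / sxx                    # slope in log-log space is the exponent
    c = math.exp(y_bar - alpha * x_bar)  # intercept gives the prefactor
    return c, alpha

# Synthetic illustration only: model sizes in parameters and made-up
# stabilized quantiles following an exact M^{-0.25} law.
sizes = [77e6, 160e6, 354e6, 2.8e9]
values = [3.0 * m ** -0.25 for m in sizes]

c, alpha = fit_power_law(sizes, values)
print(f"prefactor c = {c:.3f}, exponent alpha = {alpha:.3f}")
```

On real measurements the fit would be run once per layer type and quantile, and the residuals in log space would indicate how clean each law is.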

What carries the argument

Stabilization of singular-value quantiles of the momentum buffer to layer-dependent power laws in model size.

If this is right

The standard five-step Newton-Schulz iteration will continue to orthonormalize layers up to mid-late depth at frontier model sizes.
Late layers will require more Newton-Schulz iterations or retuned coefficients at large scales to avoid falling into the failure regime.
The measured exponents allow a layer-aware choice of the minimal Newton-Schulz configuration that still orthonormalizes important directions without extra computation.
The stabilization is a consistent property of the training dynamics across the tested range of model sizes.
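
One way to read the layer-aware choice above is as a search for the smallest iteration count that lifts a given normalized singular value close to 1. The sketch below assumes the canonical quintic p(x) = 2x - 1.5x^3 + 0.5x^5 from Figure 2; the `target` tolerance of 0.9 and the helper name `min_ns_steps` are hypothetical, since the source does not state a threshold, and real implementations may use different coefficients.

```python
def p(x):
    """Canonical quintic Newton-Schulz polynomial p(x) = 2x - 1.5x^3 + 0.5x^5."""
    return 2.0 * x - 1.5 * x**3 + 0.5 * x**5

def min_ns_steps(sigma, target=0.9, max_steps=50):
    """Smallest k such that applying p k times to sigma reaches `target`.

    `target` is an assumed tolerance for "orthonormalized enough";
    it is not taken from the paper.
    """
    x = sigma
    for k in range(1, max_steps + 1):
        x = p(x)
        if x >= target:
            return k
    return None  # target not reached within max_steps

for sigma in (0.5, 0.05, 0.005):
    print(f"sigma = {sigma}: {min_ns_steps(sigma)} step(s)")
```

Because p(x) behaves like 2x near zero, each halving of the input singular value costs roughly one extra step, which is why a steeply shrinking quantile eventually outruns a fixed five-step budget.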

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the exponents persist, training runs at 100B+ parameters will need depth-dependent Newton-Schulz iteration counts that increase toward the output layers.
The clean power-law behavior suggests the training dynamics settle into a scale-invariant spectral regime after the burn-in phase.
The same measurement protocol could be applied to other orthonormalized optimizers to test whether comparable layer-dependent scaling appears.
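
The first extension above can be made concrete by solving c · M^α = σ_min for M, which gives the model size at which a layer's stabilized quantile would drop below a chosen threshold. This is a sketch under loudly stated assumptions: the prefactors, the reference value of 0.2 at 1B parameters, and the threshold σ_min = 0.05 are all hypothetical, and only the exponents -0.25 and -0.96 come from the paper.

```python
def crossover_size(c, alpha, sigma_min):
    """Model size M* at which c * M**alpha falls to sigma_min (requires alpha < 0)."""
    if alpha >= 0:
        raise ValueError("expected a decaying power law (alpha < 0)")
    return (sigma_min / c) ** (1.0 / alpha)

# Hypothetical setup: both layer types have a stabilized quantile of 0.2
# at M0 = 1e9 parameters; sigma_min is an assumed NS failure threshold.
M0, value_at_M0, sigma_min = 1e9, 0.2, 0.05

for name, alpha in (("mild", -0.25), ("steep", -0.96)):
    c = value_at_M0 * M0 ** (-alpha)  # prefactor matching the assumed value at M0
    m_star = crossover_size(c, alpha, sigma_min)
    print(f"{name} layer (alpha = {alpha}): crosses sigma_min near {m_star:.3g} parameters")
```

Under these made-up inputs the steep layer crosses the threshold within a factor of about four in model size, while the mild layer needs a factor of 256, which illustrates why the required NS budget would become depth-dependent.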

Load-bearing premise

The power-law exponents measured on models up to 2.8B parameters will continue to describe the singular-value stabilization behavior at frontier scales.

What would settle it

Train a model larger than 10B parameters, extract the singular-value quantiles of the momentum buffers in late layers, and check whether they lie on the extrapolated M^{-0.96} line from the 77M-2.8B data.

Figures

Figures reproduced from arXiv: 2606.04058 by Asuman Ozdaglar, Gagik Magakyan, Pablo Parrilo.

**Figure 2.** The NS map $f(\sigma) = p^{\circ 5}(\sigma)$ for the canonical polynomial $p(x) = 2x - 1.5x^3 + 0.5x^5$. The left plot shows the full range $\sigma \in [0, 1]$. The right plot shows a zoomed-in view of the region $\sigma \in [0, 0.05]$. In practice, different implementations use different NS configurations. The NanoGPT speedrun [Jordan et al., 2024a] uses optimized 5-step polynomials (see …
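
The failure regime visible in the zoomed panel follows from the linear term of the polynomial. A short worked derivation, using only the canonical $p$ stated in the caption (the numeric example $\sigma = 10^{-3}$ is chosen here for illustration):

```latex
\begin{aligned}
p(x) &= 2x - 1.5x^3 + 0.5x^5 \;\le\; 2x \quad \text{for } 0 \le x \le 1, \\
f(\sigma) &= p^{\circ 5}(\sigma) \;\approx\; 2^5 \sigma = 32\,\sigma \quad \text{as } \sigma \to 0, \\
p(1) &= 2 - 1.5 + 0.5 = 1, \qquad p'(1) = 2 - 4.5 + 2.5 = 0 .
\end{aligned}
```

So $\sigma = 1$ is a fixed point with zero slope, while a direction with $\sigma = 10^{-3}$ is mapped to at most about $0.032$ after five steps and is therefore far from orthonormalized.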

**Figure 3.** Pre-training models of sizes 77M, 160M, and 354M parameters with rank-…

**Figure 4.** We compare training with low-rank $r = 0.1$ updates from scratch against first running full Muon for 125 (or 250) steps and then switching to low-rank updates. In all cases the gap relative to full Muon remains large, indicating that the performance difference is not simply due to Muon's faster convergence at the start of training.

3 Spectral Dynamics of the Momentum Buffer. In this section we track the quant…

**Figure 5.** Quantile evolution for the 50% quantile of the normalized singular values for 3 fixed layer…

**Figure 6.** We plot the distribution of normalized singular values for the selected momentum matrices…

**Figure 7.** Quantile dynamics for the 50% quantile of the normalized momentum matrices for the…

**Figure 8.** The NS map $f(\sigma) = b_5 \circ b_4 \circ b_3 \circ b_2 \circ b_1(\sigma)$ for $b_i$ in Equation 2.

A.2.2 DeepSeek-V4 NS Coefficients. [DeepSeek-AI, 2026] uses the following NS coefficients: $a(x) = 2x - 1.5x^3 + 0.5x^5$ and $c(x) = 3.4445x - 4.7750x^3 + 2.0315x^5$ (Equation 3). The full NS map is then $f = a^{\circ 2} \circ c^{\circ 8}$. While this approximation is very good (see Figure 9), it uses 10 steps and hence is computationally more expensive.
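
The composition in Equation 3 can be evaluated directly on scalar singular values, since an odd matrix polynomial acts on each singular value independently. The sketch below is an illustration under assumptions: the helper names are hypothetical, the composition is read in the usual right-to-left order (eight applications of c, then two of a), and the five-step map built from a alone is included only as a point of comparison.

```python
def a(x):
    """a(x) = 2x - 1.5x^3 + 0.5x^5 (coefficients as quoted in Equation 3)."""
    return 2.0 * x - 1.5 * x**3 + 0.5 * x**5

def c(x):
    """c(x) = 3.4445x - 4.7750x^3 + 2.0315x^5 (coefficients as quoted in Equation 3)."""
    return 3.4445 * x - 4.7750 * x**3 + 2.0315 * x**5

def compose(f, n, x):
    """Apply f to x n times."""
    for _ in range(n):
        x = f(x)
    return x

def ns_equation3(sigma):
    """Ten-step map: c applied 8 times, then a applied 2 times."""
    return compose(a, 2, compose(c, 8, sigma))

def ns_five_step(sigma):
    """Five applications of a, for comparison."""
    return compose(a, 5, sigma)

for sigma in (1e-3, 1e-2, 0.5):
    print(f"sigma = {sigma}: five-step = {ns_five_step(sigma):.4f}, "
          f"ten-step = {ns_equation3(sigma):.4f}")
```

The steeper slope of c at the origin (3.4445 instead of 2) is what lets the ten-step map lift much smaller singular values, at the cost of twice as many matrix polynomial evaluations.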

**Figure 9.** The NS map $f(\sigma) = a^{\circ 2} \circ c^{\circ 8}(\sigma)$ for $a$ and $c$ in Equation 3.

[Appendix B plot: "Fraction of Muon training to match low rank updates p=0.25"; x-axis: model size (77M, 160M, 354M); y-axis: t / T.]

**Figure 10.** We observe that Muon needs around 80-90% of the iterations to match the final loss of the rank $p = 0.25$ run.

[Plot: "Fraction of Muon training to match low rank updates p=0.1"; x-axis: model size (77M, 160M, 354M, 600M); y-axis: t / T.]

**Figure 12.** Quantile evolution for the 25% quantile for all layer types and model sizes.

**Figure 13.** Quantile evolution for the 50% quantile for all layer types and model sizes.

**Figure 14.** Normalized singular value spectra of the 2.8B model at step 1450, with the dominant…

**Figure 15.** Quantile dynamics for the 50% quantile of the normalized momentum matrices for the…

**Figure 16.** Scaling laws for all tracked quantiles and layer types across model sizes.

Original abstract

Orthonormalized update rules have rapidly become a leading choice of optimizer for training large language models, with recent open-source state-of-the-art models adopting Muon. To keep these updates tractable, Muon performs the orthonormalization with the Newton-Schulz (NS) iteration. Since NS is only approximate, directions with small singular values fail to be orthonormalized. In Muon, NS is applied to the momentum matrix at every step, yet little is known about how the singular value spectrum of these momentum matrices behaves during training, or how that behavior changes with model size. We present the first systematic study of this question. Tracking singular value quantiles of the momentum buffer across layers in models ranging from 77M to 2.8B parameters, we observe a consistent picture: after a short burn-in, the quantiles stabilize at a value determined by the layer type and model size. These stabilization values follow remarkably clean power laws in model size, with layer-dependent exponents. Layers up to mid-late depth scale very mildly with model size $M$ (around $M^{-0.25}$), so the standard 5-step NS configuration used at academic scale will continue to orthonormalize them at much larger scales. Some of the late layers, however, scale much more aggressively (up to $M^{-0.96}$) and will fall into the NS failure regime at frontier scale unless one uses more NS iterations or better-tuned coefficients. NS iterations are computationally expensive at scale; our laws give practitioners a principled, layer-aware recipe for choosing the minimum NS configuration that still orthonormalizes the directions that matter, avoiding unnecessary computation without sacrificing update quality.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper measures layer-dependent power laws in Muon momentum singular values up to 2.8B but the extrapolation to frontier scales rests on untested assumptions.

The paper tracks singular-value quantiles in Muon momentum buffers across models from 77M to 2.8B parameters. After burn-in these quantiles stabilize, and the stable values follow power laws in model size with exponents that vary by layer depth. Layers up to mid-late depth scale mildly, around M^{-0.25}, while some late layers scale as steeply as M^{-0.96}.

The authors performed the first systematic collection of these spectra at multiple scales and depths. The clean power-law fits and the resulting practical rule for setting Newton-Schulz steps are the concrete advance. The measurements come from actual training runs, which gives the observations a direct empirical basis.

The range stops at 2.8B. The power laws are fitted inside that window, and the claim that late layers will enter the NS failure regime at much larger scales assumes the same exponents continue to hold. The paper reports no runs at intermediate or larger sizes, no sweeps over data mix or optimizer coefficients, and no test of whether the post-burn-in plateau is invariant to those choices. That is the main soft spot.

The work is aimed at researchers tuning Muon or similar orthonormalized optimizers on large models. Anyone who needs to balance NS cost against update quality will find usable numbers here. It deserves a serious referee because the measurements are new and the application is immediate, even though reviewers will press on the scale limitation.

Referee Report

2 major / 2 minor

Summary. The paper conducts the first systematic empirical study of singular-value quantiles in the momentum buffers of Muon-trained transformers. Tracking models from 77M to 2.8B parameters, it reports that after a short burn-in the quantiles stabilize to layer-dependent values that obey clean power-law scaling in model size M, with exponents ranging from approximately M^{-0.25} (early/mid layers) to M^{-0.96} (late layers). The authors conclude that standard 5-step Newton-Schulz (NS) will remain sufficient for most layers at frontier scale but that late layers will require additional iterations or retuned coefficients.

Significance. If the reported layer-dependent power laws continue to hold, the work supplies a practical, data-driven recipe for choosing minimal NS iteration counts per layer, which could reduce unnecessary orthonormalization cost at scale while preserving update quality. The systematic measurement across depth and size, together with the remarkably clean observed fits, constitutes a useful empirical contribution to the study of orthonormalized optimizers. The study remains purely observational; no theoretical derivation or closed-form prediction is offered.

major comments (2)

[§4 and Figure 4] §4 (Scaling Laws) and Figure 4: The power-law exponents are obtained by fitting quantiles measured on models spanning only a factor of ~36 in size (77M–2.8B). The central claim that late layers will enter the NS failure regime at frontier scales rests on these exponents remaining invariant beyond the observed range; no runs at intermediate or larger scales, no ablation on data mix, learning-rate schedule, or NS coefficients, and no demonstration that the post-burn-in plateau is insensitive to initialization are provided.
[§3.2] §3.2 (Quantile measurement protocol): The definition of the “stabilized” quantile (post-burn-in average) and the precise exclusion rules for early training steps are not stated with sufficient precision to allow independent reproduction or assessment of statistical significance of the reported exponents.
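
To make the request in the second major comment concrete, a reproducible protocol would need at least a burn-in cutoff, an averaging window, and a stability criterion. The sketch below shows one possible form and is not the paper's protocol: `burn_in`, `rel_tol`, the helper name `stabilized_value`, and the trajectory values are all hypothetical.

```python
from statistics import mean, pstdev

def stabilized_value(series, burn_in, rel_tol=0.1):
    """Average a per-step quantile series after discarding `burn_in` steps.

    Returns (value, is_stable). `is_stable` is True when the post-burn-in
    standard deviation is at most rel_tol times the mean. Both `burn_in`
    and `rel_tol` are assumed choices, not values from the paper.
    """
    tail = series[burn_in:]
    if not tail:
        raise ValueError("burn_in discards the whole series")
    value = mean(tail)
    is_stable = pstdev(tail) <= rel_tol * abs(value)
    return value, is_stable

# Made-up median trajectory: decays for three steps, then plateaus near 0.10.
series = [0.50, 0.30, 0.15, 0.10, 0.11, 0.09, 0.10, 0.10]

value, ok = stabilized_value(series, burn_in=3)
print(f"stabilized value = {value:.3f}, stable = {ok}")
```

Reporting the spread alongside the mean in this way would also give the uncertainty needed to assess the significance of the fitted exponents.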

minor comments (2)

[Table 1] Table 1: layer-depth binning boundaries are not explicitly listed; readers cannot map the reported exponents back to concrete layer indices without additional assumptions.
[Figure 2] Figure 2 caption: the y-axis label “quantile” should specify whether it is the 0.01, 0.05, or median singular value to avoid ambiguity when comparing panels.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the careful and constructive review. We respond point-by-point to the major comments below.

Referee: [§4 and Figure 4] §4 (Scaling Laws) and Figure 4: The power-law exponents are obtained by fitting quantiles measured on models spanning only a factor of ~36 in size (77M–2.8B). The central claim that late layers will enter the NS failure regime at frontier scales rests on these exponents remaining invariant beyond the observed range; no runs at intermediate or larger scales, no ablation on data mix, learning-rate schedule, or NS coefficients, and no demonstration that the post-burn-in plateau is insensitive to initialization are provided.

Authors: We agree that the model-size range is limited and that the extrapolation to frontier scales assumes continued invariance of the observed exponents. The power-law fits are remarkably clean and consistent across layers within the studied range, which forms the core empirical contribution. We will add a limitations subsection explicitly discussing the restricted scale range, the assumptions required for extrapolation, and the desirability of future validation at larger scales. We cannot supply additional runs, ablations, or initialization-sensitivity tests at this time. revision: partial

Referee: [§3.2] §3.2 (Quantile measurement protocol): The definition of the “stabilized” quantile (post-burn-in average) and the precise exclusion rules for early training steps are not stated with sufficient precision to allow independent reproduction or assessment of statistical significance of the reported exponents.

Authors: We accept this criticism. The revised manuscript will state the protocol with full precision: the exact burn-in step threshold, the number of subsequent steps over which the quantile is averaged, the precise exclusion rule for early steps, and any associated statistical measures used to assess stability. revision: yes

standing simulated objections not resolved

Additional experiments at larger model scales, ablations on data mix, learning-rate schedule, NS coefficients, and sensitivity of the post-burn-in plateau to initialization.

Circularity Check

0 steps flagged

No circularity; purely observational power-law fits on measured quantiles

The paper measures singular-value quantiles of momentum buffers in models from 77M to 2.8B parameters, observes post-burn-in stabilization, and fits power laws in model size M with layer-dependent exponents. These are direct empirical fits to data; the reported stabilization values and exponents are not defined in terms of each other by any equation in the paper, nor do any 'predictions' for frontier scales reduce by construction to the fitted inputs. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The derivation chain is self-contained observational analysis.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical observation that singular-value quantiles stabilize after a short burn-in period and that the stabilized values obey power laws whose exponents are layer-dependent. No free parameters are introduced beyond the fitted exponents themselves; no new entities are postulated.

free parameters (1)

layer-dependent power-law exponents
Exponents such as -0.25 and -0.96 are obtained by fitting the observed stabilization values to model size M.

axioms (1)

domain assumption: Singular-value quantiles of the momentum buffer reach a stable value after a short burn-in phase that is independent of further training steps.
The power-law analysis is performed on these stabilized values; the abstract states this stabilization occurs consistently across the studied models.

Reference graph

Works this paper leans on

16 extracted references · 14 canonical work pages · 8 internal anchors

[1] Naman Agarwal, Rohan Anil, Elad Hazan, Tomer Koren, and Cyril Zhang. Disentangling adaptive gradient methods from learning rates. arXiv preprint arXiv:2002.11803.

[2] Dion2: A simple method to shrink matrix in muon
Kwangjun Ahn, Noah Amsel, and John Langford. Dion2: A simple method to shrink matrix in muon. arXiv preprint arXiv:2512.16928, 2025a. Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates. arXiv preprint arXiv:2504.05295, 2025b. Rohan Anil, Vineet Gupta, T...

[3] DeepSeek-V3 Technical Report
URL https://leloykun.github.io/ponder/muon-opt-coeffs/. DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437.

[4] GLM-5: from Vibe Coding to Agentic Engineering
URL https://www.essential.ai/research/infra. GLM-5 Team. GLM-5: from vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763.

[5] Training Compute-Optimal Large Language Models
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.

[6] Scaling Laws for Neural Language Models
Keller Jordan, Jeremy Bernstein, Brendan Rappazzo, @fernbear.bsky.social, Boza Vlado, You Jiacheng, Franz Cesista, Braden Koszarsky, and @Grad62304977. modded-nanogpt: Speedrunning the nanogpt baseline, 2024a. URL https://github.com/KellerJordan/modded-nanogpt. Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy B...

[7] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, S. H. Cai, Yuan Cao, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276.

[8] Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. NorMuon: Making muon more efficient and scalable. arXiv preprint arXiv:2510.05491.

[9] Muon is Scalable for LLM Training
Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982.

[10] The Llama 3 Herd of Models
Llama Team. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

[11] Hao-Jun Michael Shi, Tsung-Hsien Lee, Shintaro Iwasaki, Jose Gallego-Posada, Zhijing Li, Kaushik Rangadurai, Dheevatsa Mudigere, and Michael Rabbat. A distributed data-parallel PyTorch implementation of the distributed shampoo optimizer for training neural networks at-scale. arXiv preprint arXiv:2309.06497.

[12] 2 OLMo 2 Furious
Team OLMo, Pete Walsh, Luca Soldaini, Dirk Groeneveld, Kyle Lo, et al. 2 OLMo 2 furious. arXiv preprint arXiv:2501.00656.

[13] Kaiyue Wen, David Hall, Tengyu Ma, and Percy Liang. Fantastic pretraining optimizers and where to find them. arXiv preprint arXiv:2509.02046.

[14] Greg Yang, Edward J. Hu, Igor Babuschkin, Szymon Sidor, Xiaodong Liu, David Farhi, Nick Ryder, Jakub Pachocki, Weizhu Chen, and Jian Gao. Tensor programs V: Tuning large neural networks via zero-shot hyperparameter transfer. arXiv preprint arXiv:2203.03466.

[15] A Appendix. A.1 Details on pre-training. We used the modded-nanogpt codebase [Jordan et al., 2024a] for all experiments. All matrix-valued parameters are trained with Muon, while non-matrix parameters (embeddings, LM head, and biases) are trained with AdamW with $(\beta_1, \beta_2) = (0.9, 0.95)$ and learning rate 0.002. For both optimizers we use a weight decay of 0.01...

[16] A.2.2 DeepSeek-V4 NS Coefficients. [DeepSeek-AI, 2026] uses the following NS coefficients: $a(x) = 2x - 1.5x^3 + 0.5x^5$ and $c(x) = 3.4445x - 4.7750x^3 + 2.0315x^5$ (Equation 3). The full NS map is then $f = a^{\circ 2} \circ c^{\circ 8}$. While this approximation is very good (see Figure 9), it uses 10 steps and hence is computationally more expensive.
