Recognition: 1 Lean theorem link
MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization
Pith reviewed 2026-05-13 02:07 UTC · model grok-4.3
The pith
MuonQ enables stable 4-bit quantization of the Muon optimizer's states by optimizing directional fidelity, matching full-precision performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By applying directional fidelity optimization consisting of pre-quantization normalization, power iteration decomposition, and μ-law companding quantization, Muon optimizer states can be quantized to 4 bits while preserving the directional information required for effective updates. Pre-training experiments on GPT-style and LLaMA-style models confirm that this yields training loss and downstream task accuracy comparable to full-precision Muon, with optimizer state memory reduced by up to 7.3 times.
What carries the argument
Directional fidelity optimization, which protects singular vector directions during quantization by normalizing error magnitudes, decomposing via power iteration, and using μ-law companding for dense-value resolution.
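To make the first of these ingredients, pre-quantization normalization, concrete, here is a minimal sketch of what it could look like. Everything below (the function names, the Frobenius-norm scaling, the symmetric 4-bit grid) is an illustrative assumption, not the paper's stated implementation.

```python
import numpy as np

def normalize_then_quantize(m, bits=4):
    """Hedged sketch: rescale the momentum tensor before quantizing it,
    so the error injected at each step has a controlled magnitude
    regardless of how large or small the raw momentum currently is."""
    scale = float(np.linalg.norm(m)) + 1e-12    # full-precision scale (assumed Frobenius norm)
    m_unit = m / scale
    levels = 2 ** (bits - 1) - 1                # symmetric grid: -7..7 at 4 bits
    step = np.abs(m_unit).max() / levels + 1e-12
    q = np.clip(np.round(m_unit / step), -levels, levels).astype(np.int8)
    return q, step, scale

def dequantize(q, step, scale):
    return q.astype(np.float32) * step * scale  # undo both scalings

# Toy check: the relative reconstruction error is tied to the quantization
# grid, not to the raw scale of the momentum tensor.
m = 3.7 * np.random.randn(512, 512).astype(np.float32)
q, step, scale = normalize_then_quantize(m)
err = np.linalg.norm(dequantize(q, step, scale) - m) / np.linalg.norm(m)
print(f"relative error: {err:.4f}")
```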
If this is right
- 4-bit Muon training becomes viable for GPT-style and LLaMA-style pre-training with no measurable degradation in loss or task accuracy.
- Optimizer state memory usage drops by up to 7.3× compared with full-precision storage.
- Large language models can be trained under tighter GPU memory limits while retaining the computational advantages of Muon.
- The quantization approach maintains the orthogonalization benefits of Muon without amplifying directional errors.
Where Pith is reading between the lines
- The same directional preservation tactics could be tested on other optimizers that rely on orthogonalization or direction-only updates.
- Further scaling experiments would be needed to check whether accumulated quantization effects remain negligible beyond the tested model sizes.
- Combining MuonQ with gradient or weight quantization schemes could produce additional memory savings in end-to-end low-bit pipelines.
Load-bearing premise
The three directional fidelity techniques preserve Muon's original training dynamics without introducing instabilities or biases that would only appear at larger scales or on untested model architectures.
What would settle it
If 4-bit MuonQ training on a model substantially larger than those tested produces noticeably higher loss or lower downstream accuracy than full-precision Muon, the claim of close matching would be falsified.
Original abstract
The Muon optimizer has emerged as a compelling alternative to Adam for training large language models, achieving remarkable computational savings through gradient orthogonalization. However, Muon's optimizer state is more sensitive to quantization errors: because the orthogonalization discards the magnitudes of singular values and retains only directional information, even small quantization errors in singular vector directions are amplified in the update. In this work, we propose MuonQ, a low-bit Muon training framework built on the principle of directional fidelity optimization. First, we apply a pre-quantization normalization so that each step introduces quantization errors of the same magnitude, preventing the accumulated error from developing a preferred direction. Second, we introduce a structural decomposition that separately quantizes the dominant singular components via power iteration, ensuring that quantization errors perturb only singular value magnitudes rather than rotating singular vector directions. Third, we adopt $\mu$-law companding quantization to allocate higher resolution to densely packed momentum values, shifting the quantization objective from outlier preservation to dense-region distinguishability. Together, these techniques enable stable 4-bit quantization of Muon's optimizer states. Pre-training experiments on GPT-style and LLaMA-style models demonstrate that MuonQ at 4-bit precision closely matches full-precision Muon in both training loss and downstream task accuracy, while reducing optimizer state memory by up to 7.3 $\times$. Our code is available at https://github.com/YupengSu/MuonQ.
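As an editorial aside on the structural decomposition described in the abstract, here is a hedged sketch of one way a power-iteration split could work: the dominant rank-1 component is pulled out and kept apart, so 4-bit noise lands on the residual and on scalar magnitudes rather than rotating the dominant directions. The function names, iteration count, and uniform residual quantizer below are assumptions for illustration, not the released MuonQ code.

```python
import numpy as np

def power_iteration(m, iters=20, seed=0):
    """Approximate the dominant singular triplet (u, sigma, v) of m."""
    rng = np.random.default_rng(seed)
    v = rng.standard_normal(m.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = m @ v
        u /= np.linalg.norm(u)
        v = m.T @ u
        v /= np.linalg.norm(v)
    sigma = float(u @ m @ v)
    return u, sigma, v

def split_quantize(m, bits=4):
    """Hedged sketch: keep the dominant rank-1 component separate and
    quantize only the residual, so errors perturb magnitudes instead of
    rotating the leading singular directions."""
    u, sigma, v = power_iteration(m)
    residual = m - sigma * np.outer(u, v)
    levels = 2 ** (bits - 1) - 1
    step = np.abs(residual).max() / levels + 1e-12
    q_res = np.clip(np.round(residual / step), -levels, levels).astype(np.int8)
    return (u, sigma, v), q_res, step

def reconstruct(dominant, q_res, step):
    u, sigma, v = dominant
    return sigma * np.outer(u, v) + q_res.astype(np.float32) * step
```

In this toy split the vectors u and v are left in full precision; how (or whether) MuonQ compresses the dominant directions themselves is not determined by the abstract alone.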
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes MuonQ, a 4-bit quantization framework for the Muon optimizer based on directional fidelity optimization. It introduces three techniques: pre-quantization normalization to equalize per-step error magnitudes, power-iteration decomposition to quantize dominant singular components separately (preserving vector directions), and μ-law companding to allocate bits to dense momentum regions. Pre-training results on GPT-style and LLaMA-style models are reported to show that 4-bit MuonQ matches full-precision Muon in training loss and downstream accuracy while cutting optimizer-state memory by up to 7.3×. Code is released.
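For readers checking the third technique named in the summary, a minimal sketch of μ-law companding quantization follows; the μ value of 255, the per-tensor max-abs scaling, and the symmetric 4-bit grid are illustrative assumptions rather than the paper's reported settings.

```python
import numpy as np

def mulaw_quantize(x, mu=255.0, bits=4):
    """Hedged sketch: compress values with the mu-law curve before uniform
    quantization, so the densely packed small momentum entries receive most
    of the available 4-bit resolution."""
    scale = float(np.abs(x).max()) + 1e-12       # assumed per-tensor scaling into [-1, 1]
    xs = x / scale
    comp = np.sign(xs) * np.log1p(mu * np.abs(xs)) / np.log1p(mu)
    levels = 2 ** (bits - 1) - 1
    q = np.clip(np.round(comp * levels), -levels, levels).astype(np.int8)
    return q, scale

def mulaw_dequantize(q, scale, mu=255.0, bits=4):
    levels = 2 ** (bits - 1) - 1
    comp = q.astype(np.float32) / levels
    xs = np.sign(comp) * ((1.0 + mu) ** np.abs(comp) - 1.0) / mu
    return xs * scale
```

Because the logarithmic compression expands the region near zero before the uniform grid is applied, more of the 16 available levels land where most momentum entries actually live, which is the dense-region distinguishability the abstract describes.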
Significance. If the matching performance holds under rigorous controls, the work would be significant for memory-constrained training of large models, as Muon’s orthogonalization makes its state unusually sensitive to directional quantization noise. The open-source code is a clear strength that aids verification. The result is currently limited by the absence of scale, horizon, and ablation details needed to confirm that the techniques prevent compounding directional errors.
major comments (2)
- [Abstract] The claim that 4-bit MuonQ “closely matches” full-precision Muon in loss and accuracy provides no model sizes, training-step counts, number of runs, or quantitative deltas; without these, it is impossible to assess whether the three directional-fidelity techniques actually keep per-step perturbations small enough that Muon’s orthogonalization does not amplify them over realistic horizons.
- [Abstract] No ablation or scaling analysis is described that isolates the contribution of pre-quantization normalization, power-iteration decomposition, or μ-law companding, nor tests whether residual directional bias remains non-compounding at larger widths, depths, or step counts; this directly bears on the central assertion that the techniques preserve training dynamics.
minor comments (1)
- [Abstract] The abstract states a 7.3× memory reduction but does not explicitly tie the factor to the 4-bit setting or report the precise baseline (e.g., FP16 vs. FP32 optimizer state).
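A back-of-envelope check of this minor point (all numbers below are assumptions for illustration; the abstract does not state the baseline precision or the metadata overhead):

```python
# If the baseline stores momentum in FP32 (32 bits per entry) and the 4-bit
# format adds one FP16 scale shared by a group of, say, 64 entries, the
# effective cost is 4 + 16/64 = 4.25 bits per entry, giving roughly a 7.5x
# reduction; smaller groups or extra per-tensor metadata pull this toward
# ~7x. A 7.3x figure is therefore consistent with an FP32 baseline plus
# modest overhead, but the abstract should state the baseline explicitly.
baseline_bits = 32.0
group_size = 64
effective_bits = 4 + 16 / group_size
print(baseline_bits / effective_bits)   # ~7.5
```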
Simulated Author's Rebuttal
We appreciate the referee's feedback highlighting the need for greater specificity in the abstract and additional analysis to support our claims. We provide point-by-point responses below and will make revisions to the manuscript accordingly.
Point-by-point responses
-
Referee: [Abstract] The claim that 4-bit MuonQ “closely matches” full-precision Muon in loss and accuracy provides no model sizes, training-step counts, number of runs, or quantitative deltas; without these, it is impossible to assess whether the three directional-fidelity techniques actually keep per-step perturbations small enough that Muon’s orthogonalization does not amplify them over realistic horizons.
Authors: We agree that the abstract would be strengthened by including more concrete experimental details. While the abstract provides a high-level overview, the full manuscript in Section 4 details the model architectures (GPT-style and LLaMA-style), training durations, run counts, and quantitative results including loss values and accuracy metrics that demonstrate the close matching. We will revise the abstract to incorporate model sizes, training step counts, number of runs, and example quantitative deltas to better allow assessment of the directional fidelity techniques. revision: yes
-
Referee: [Abstract] No ablation or scaling analysis is described that isolates the contribution of pre-quantization normalization, power-iteration decomposition, or μ-law companding, nor tests whether residual directional bias remains non-compounding at larger widths, depths, or step counts; this directly bears on the central assertion that the techniques preserve training dynamics.
Authors: We acknowledge that the abstract does not explicitly describe ablations or scaling analyses. The manuscript motivates each technique and presents overall results showing stable training dynamics. To directly address the isolation of contributions from pre-quantization normalization, power-iteration decomposition, and μ-law companding, as well as to test for non-compounding directional bias at larger scales, we will include a new ablation study and scaling experiments in the revised version. This will provide evidence that the techniques preserve training dynamics over extended horizons. revision: yes
Circularity Check
No circularity; empirical validation of proposed quantization techniques
full rationale
The paper proposes three practical techniques (pre-quantization normalization to equalize error magnitudes, power-iteration decomposition to protect singular-vector directions, and μ-law companding to improve dense-region resolution) for 4-bit Muon optimizer states. These are introduced as engineering choices motivated by the sensitivity of Muon's orthogonalization to directional errors, then validated directly via pre-training runs on GPT-style and LLaMA-style models that report matching loss curves and downstream accuracy. No equations, first-principles derivations, or parameter-fitting steps are presented that reduce to self-definitions, fitted inputs renamed as predictions, or self-citation chains. The central claim therefore rests on external experimental evidence rather than any internal reduction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Singular value decomposition and its approximation via power iteration are valid for decomposing optimizer states.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relevance unclear · quoted claim: “pre-quantization normalization so that each step introduces quantization errors of the same magnitude... structural decomposition that separately quantizes the dominant singular components via power iteration... μ-law companding quantization”
Reference graph
Works this paper leans on
-
[1]
Dion: Distributed orthonormalized updates
Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates. arXiv preprint arXiv:2504.05295, 2025.
-
[2]
Old optimizer, new norm: An anthology
Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024.
-
[3]
Turbo-Muon: Accelerating orthogonality-based optimization with pre-conditioning
Thibaut Boissin, Thomas Massena, Franck Mamalet, and Mathieu Serrurier. Turbo-Muon: Accelerating orthogonality-based optimization with pre-conditioning. arXiv preprint arXiv:2512.04632.
-
[4]
BoolQ: Exploring the surprising difficulty of natural yes/no questions
Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019.
-
[5]
Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
-
[6]
Muon: An optimizer for hidden layers in neural networks
Keller Jordan et al. Muon: An optimizer for hidden layers in neural networks. URL https://kellerjordan.github.io/posts/muon/, 2024.
-
[7]
Scaling laws for neural language models
Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
-
[8]
NorMuon: Making Muon more efficient and scalable
Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. NorMuon: Making Muon more efficient and scalable. arXiv preprint arXiv:2510.05491, 2025.
-
[9]
Muon is scalable for LLM training
Jingyuan Liu, Jianlin Su, Xingcheng Yao, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025.
-
[10]
Can a suit of armor conduct electricity? A new dataset for open book question answering
Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391, Brussels, Belgium, 2018.
-
[11]
The FineWeb datasets: Decanting the web for the finest text data at scale
Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557, 2024.
-
[12]
Adafactor: Adaptive learning rates with sublinear memory cost
Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4596–4604. PMLR, 2018.
-
[13]
Pushing the limits of low-bit optimizers: A focus on EMA dynamics
Cong Xu, Wenbin Liang, Mo Yu, et al. Pushing the limits of low-bit optimizers: A focus on EMA dynamics. arXiv preprint arXiv:2505.00347, 2025.
-
[14]
HellaSwag: Can a Machine Really Finish Your Sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019. doi: 10.18653/v1/P19-1472.
-
[15]
Muon+: Towards better Muon via one additional normalization step
Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, and Zheng Zhang. Muon+: Towards better Muon via one additional normalization step. arXiv preprint arXiv:2602.21545.
discussion (0)