pith. machine review for the scientific record.

arxiv: 2605.11396 · v1 · submitted 2026-05-12 · 💻 cs.LG

Recognition: 1 theorem link

· Lean Theorem

MuonQ: Enhancing Low-Bit Muon Quantization via Directional Fidelity Optimization

Ruijie Zhang, Yequan Zhao, Yupeng Su, Zheng Zhang, Ziyue Liu

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 02:07 UTC · model grok-4.3

classification 💻 cs.LG
keywords Muon optimizer · low-bit quantization · directional fidelity · LLM pre-training · optimizer memory reduction · 4-bit training · gradient orthogonalization

The pith

MuonQ enables stable 4-bit quantization of the Muon optimizer by optimizing directional fidelity to match full-precision performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to make the Muon optimizer practical for low-bit training by addressing its particular sensitivity to quantization errors in singular vector directions. It does so through directional fidelity optimization, which combines pre-quantization normalization to equalize error magnitudes, power iteration decomposition to isolate dominant singular components, and μ-law companding to improve resolution where momentum values cluster. Experiments on GPT-style and LLaMA-style models show that the 4-bit version produces training loss curves and downstream accuracies nearly identical to full-precision Muon. The approach cuts optimizer state memory by up to 7.3×, directly easing the memory bottlenecks of large language model pre-training.

Core claim

By applying directional fidelity optimization consisting of pre-quantization normalization, power iteration decomposition, and μ-law companding quantization, Muon optimizer states can be quantized to 4 bits while preserving the directional information required for effective updates. Pre-training experiments on GPT-style and LLaMA-style models confirm that this yields training loss and downstream task accuracy comparable to full-precision Muon, with optimizer state memory reduced by up to 7.3×.

What carries the argument

Directional fidelity optimization, which protects singular vector directions during quantization by normalizing error magnitudes, decomposing via power iteration, and using μ-law companding for dense-value resolution.
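
To make the mechanism concrete, below is a minimal NumPy sketch of one directional-fidelity quantization step. It is a reading of the abstract and figure notes, not the authors' code: the function names, the rank-1 truncation (the paper sweeps a truncation-rank ratio, defaulting to k = 1/16), the per-tensor scale, and deterministic rounding are illustrative assumptions.

```python
# Illustrative sketch only: normalize, split off the dominant singular
# component via power iteration, and mu-law-quantize just the residual.
import numpy as np

def power_iteration_top1(M, iters=10, seed=0):
    """Approximate the dominant singular triplet (u, s, v) of M."""
    v = np.random.default_rng(seed).standard_normal(M.shape[1])
    v /= np.linalg.norm(v)
    for _ in range(iters):
        u = M @ v
        u /= np.linalg.norm(u)
        v = M.T @ u
        v /= np.linalg.norm(v)
    return u, u @ M @ v, v

def mulaw_quantize_dequantize(x, mu=255.0, bits=4):
    """mu-law compand, round to a uniform low-bit grid, expand back."""
    scale = np.max(np.abs(x)) + 1e-12          # per-tensor scale (assumption)
    y = np.sign(x) * np.log1p(mu * np.abs(x) / scale) / np.log1p(mu)
    levels = 2 ** (bits - 1) - 1               # signed 4-bit grid: 7 levels/side
    y_hat = np.round(y * levels) / levels      # deterministic rounding
    return np.sign(y_hat) * scale * ((1.0 + mu) ** np.abs(y_hat) - 1.0) / mu

def muonq_style_step(M):
    # 1) Pre-quantization normalization: every step injects errors of the
    #    same magnitude, so accumulated error stays roughly isotropic.
    norm = np.linalg.norm(M) + 1e-12
    Mn = M / norm
    # 2) Structural decomposition: the dominant direction is carried exactly,
    #    so quantization perturbs magnitudes, not singular-vector directions.
    u, s, v = power_iteration_top1(Mn)
    residual = Mn - s * np.outer(u, v)
    # 3) Companding quantization of the residual only (cf. Figure 6's MuonQ4).
    return norm * (s * np.outer(u, v) + mulaw_quantize_dequantize(residual))
```

Because the rank-1 component is reattached exactly, all quantization error lives in the residual; this is the separation the Figure 6 note points at when it says MuonQ4 quantizes the residual Rt rather than the full momentum matrix Mt.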

If this is right

  • 4-bit Muon training becomes viable for GPT-style and LLaMA-style pre-training with no measurable degradation in loss or task accuracy.
  • Optimizer state memory drops by up to 7.3× compared with full-precision storage (a back-of-envelope check follows this list).
  • Large language models can be trained under tighter GPU memory limits while retaining the computational advantages of Muon.
  • The quantization approach maintains the orthogonalization benefits of Muon without amplifying directional errors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same directional preservation tactics could be tested on other optimizers that rely on orthogonalization or direction-only updates.
  • Further scaling experiments would be needed to check whether accumulated quantization effects remain negligible beyond the tested model sizes.
  • Combining MuonQ with gradient or weight quantization schemes could produce additional memory savings in end-to-end low-bit pipelines.

Load-bearing premise

The three directional fidelity techniques preserve Muon's original training dynamics without introducing instabilities or biases that would only appear at larger scales or on untested model architectures.

What would settle it

If 4-bit MuonQ training on a model substantially larger than those tested produces noticeably higher loss or lower downstream accuracy than full-precision Muon, the claim of close matching would be falsified.

Figures

Figures reproduced from arXiv: 2605.11396 by Ruijie Zhang, Yequan Zhao, Yupeng Su, Zheng Zhang, Ziyue Liu.

Figure 1. Quantization error accumulation over 50 momentum update steps.
Figure 2. Effect of truncation rank k on post-orthogonalization error under 4-bit companding quantization.
Figure 3. Singular value spectra of original and quantized momentum before and after …
Figure 4. Normalized Muon momentum distribution (top) and quantization interval comparison (bottom); each colored block represents one quantization bin.
Figure 5. Training loss curves for GPT-2 and LLaMA models on FineWeb.
Figure 6. Validation PPL (↓) and optimizer-state memory (GB) across model scales; MuonQ4 closely matches Muon32 in PPL while achieving up to 7.3× memory reduction. Unless otherwise noted, tensor-wise quantization is applied to the full momentum matrix Mt (Muon8/Muon4) or the residual Rt (MuonQ4), with a truncation rank of k = 1/16.
Figure 7. Effect of truncation rank on end-to-end training; left: validation PPL (↓), right: optimizer-state memory (MB).
Figure 8. Training loss of full-precision Muon with and without pre-quantization normalization.
Figure 9. Post-orthogonalization singular value spectra under four granularity combinations.
Figure 10. Training loss of MuonQ with aligned (col-wise U, row-wise S) vs. across (row-wise U, col-wise S) granularity on GPT-2 Small; on GPT-2 Small (124M, 1B FineWeb tokens), the aligned configuration achieves PPL 40.93 vs. 41.30 for the across configuration.
Figure 11. Effect of companding parameter µ on 4-bit row-wise quantization.
Figure 12. Training loss of MuonQ with deterministic vs. stochastic rounding on GPT-2 Small; stochastic rounding degrades performance (42.99 vs. 40.93) and increases training variance.
read the original abstract

The Muon optimizer has emerged as a compelling alternative to Adam for training large language models, achieving remarkable computational savings through gradient orthogonalization. However, Muon's optimizer state is more sensitive to quantization errors: because the orthogonalization discards the magnitudes of singular values and retains only directional information, even small quantization errors in singular vector directions are amplified in the update. In this work, we propose MuonQ, a low-bit Muon training framework built on the principle of directional fidelity optimization. First, we apply a pre-quantization normalization so that each step introduces quantization errors of the same magnitude, preventing the accumulated error from developing a preferred direction. Second, we introduce a structural decomposition that separately quantizes the dominant singular components via power iteration, ensuring that quantization errors perturb only singular value magnitudes rather than rotating singular vector directions. Third, we adopt μ-law companding quantization to allocate higher resolution to densely packed momentum values, shifting the quantization objective from outlier preservation to dense-region distinguishability. Together, these techniques enable stable 4-bit quantization of Muon's optimizer states. Pre-training experiments on GPT-style and LLaMA-style models demonstrate that MuonQ at 4-bit precision closely matches full-precision Muon in both training loss and downstream task accuracy, while reducing optimizer state memory by up to 7.3×. Our code is available at https://github.com/YupengSu/MuonQ.
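
A note on the companding step: the standard μ-law function from signal processing, which the paper's Eq. 10 presumably instantiates (the paper's exact form is not reproduced on this page), maps a normalized value x and inverts as follows.

```latex
% Standard mu-law compander and expander (signal-processing convention);
% assumed here to be the shape of the paper's Eq. 10.
F(x) = \operatorname{sgn}(x)\,\frac{\ln\left(1 + \mu\,|x|\right)}{\ln\left(1 + \mu\right)},
\qquad |x| \le 1,
\qquad
F^{-1}(y) = \operatorname{sgn}(y)\,\frac{(1 + \mu)^{|y|} - 1}{\mu}.
```

Since F is steepest near zero, a uniform low-bit grid applied to F(x) spends more of its levels on the dense near-zero mass of momentum values, which is exactly the shift "from outlier preservation to dense-region distinguishability" the abstract describes.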

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes MuonQ, a 4-bit quantization framework for the Muon optimizer based on directional fidelity optimization. It introduces three techniques: pre-quantization normalization to equalize per-step error magnitudes, power-iteration decomposition to quantize dominant singular components separately (preserving vector directions), and μ-law companding to allocate bits to dense momentum regions. Pre-training results on GPT-style and LLaMA-style models are reported to show that 4-bit MuonQ matches full-precision Muon in training loss and downstream accuracy while cutting optimizer-state memory by up to 7.3×. Code is released.

Significance. If the matching performance holds under rigorous controls, the work would be significant for memory-constrained training of large models, as Muon’s orthogonalization makes its state unusually sensitive to directional quantization noise. The open-source code is a clear strength that aids verification. The result is currently limited by the absence of scale, horizon, and ablation details needed to confirm that the techniques prevent compounding directional errors.

major comments (2)
  1. [Abstract] The claim that 4-bit MuonQ “closely matches” full-precision Muon in loss and accuracy provides no model sizes, training-step counts, number of runs, or quantitative deltas; without these, it is impossible to assess whether the three directional-fidelity techniques actually keep per-step perturbations small enough that Muon’s orthogonalization does not amplify them over realistic horizons.
  2. [Abstract] No ablation or scaling analysis is described that isolates the contribution of pre-quantization normalization, power-iteration decomposition, or μ-law companding, nor tests whether residual directional bias remains non-compounding at larger widths, depths, or step counts; this directly bears on the central assertion that the techniques preserve training dynamics.
minor comments (1)
  1. [Abstract] The abstract states a 7.3× memory reduction but does not explicitly tie the factor to the 4-bit setting or report the precise baseline (e.g., FP16 vs. FP32 optimizer state).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's feedback highlighting the need for greater specificity in the abstract and additional analysis to support our claims. We provide point-by-point responses below and will make revisions to the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] The claim that 4-bit MuonQ “closely matches” full-precision Muon in loss and accuracy provides no model sizes, training-step counts, number of runs, or quantitative deltas; without these, it is impossible to assess whether the three directional-fidelity techniques actually keep per-step perturbations small enough that Muon’s orthogonalization does not amplify them over realistic horizons.

    Authors: We agree that the abstract would be strengthened by including more concrete experimental details. While the abstract provides a high-level overview, the full manuscript in Section 4 details the model architectures (GPT-style and LLaMA-style), training durations, run counts, and quantitative results including loss values and accuracy metrics that demonstrate the close matching. We will revise the abstract to incorporate model sizes, training step counts, number of runs, and example quantitative deltas to better allow assessment of the directional fidelity techniques. revision: yes

  2. Referee: [Abstract] No ablation or scaling analysis is described that isolates the contribution of pre-quantization normalization, power-iteration decomposition, or μ-law companding, nor tests whether residual directional bias remains non-compounding at larger widths, depths, or step counts; this directly bears on the central assertion that the techniques preserve training dynamics.

    Authors: We acknowledge that the abstract does not explicitly describe ablations or scaling analyses. The manuscript motivates each technique and presents overall results showing stable training dynamics. To directly address the isolation of contributions from pre-quantization normalization, power-iteration decomposition, and μ-law companding, as well as to test for non-compounding directional bias at larger scales, we will include a new ablation study and scaling experiments in the revised version. This will provide evidence that the techniques preserve training dynamics over extended horizons. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical validation of proposed quantization techniques

full rationale

The paper proposes three practical techniques (pre-quantization normalization to equalize error magnitudes, power-iteration decomposition to protect singular-vector directions, and μ-law companding to improve dense-region resolution) for 4-bit Muon optimizer states. These are introduced as engineering choices motivated by the sensitivity of Muon's orthogonalization to directional errors, then validated directly via pre-training runs on GPT-style and LLaMA-style models that report matching loss curves and downstream accuracy. No equations, first-principles derivations, or parameter-fitting steps are presented that reduce to self-definitions, fitted inputs renamed as predictions, or self-citation chains. The central claim therefore rests on external experimental evidence rather than any internal reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach relies on standard linear algebra (SVD and power iteration) and established quantization methods without introducing new free parameters, axioms beyond domain standards, or invented entities.

axioms (1)
  • standard math Singular value decomposition and its approximation via power iteration are valid for decomposing optimizer states.
    Invoked in the structural decomposition step to separate dominant singular components.

pith-pipeline@v0.9.0 · 5567 in / 1336 out tokens · 60152 ms · 2026-05-13T02:07:25.147574+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 3 internal anchors

  1. [1]

    Dion: Distributed orthonormalized updates

    Kwangjun Ahn, Byron Xu, Natalie Abreu, Ying Fan, Gagik Magakyan, Pratyusha Sharma, Zheng Zhan, and John Langford. Dion: Distributed orthonormalized updates. arXiv preprint arXiv:2504.05295, 2025.

  2. [2]

    Old optimizer, new norm: An anthology

    Jeremy Bernstein and Laker Newhouse. Old optimizer, new norm: An anthology. arXiv preprint arXiv:2409.20325, 2024.

  3. [3]

    Turbo-muon: Accelerating orthogonality-based optimization with pre-conditioning

    Thibaut Boissin, Thomas Massena, Franck Mamalet, and Mathieu Serrurier. Turbo-muon: Accelerating orthogonality-based optimization with pre-conditioning. arXiv preprint arXiv:2512.04632, 2025.

  4. [4]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ...

  5. [5]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Association for Computational Linguistics. Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018.

  6. [6]

    Effective quantization of Muon optimizer states

    Aman Gupta, Rafael Celente, Abhishek Shivanna, D. T. Braithwaite, Gregory Dexter, Shao Tang, Hiroto Udagawa, Daniel Silva, Rohan Ramanath, and S. Sathiya Keerthi. Effective quantization of Muon optimizer states. arXiv preprint arXiv:2509.23106, 2025.

  7. [7]

    Scaling Laws for Neural Language Models

    URL https://kellerjordan.github.io/posts/muon/. Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.

  8. [8]

    NorMuon: Making Muon more efficient and scalable

    Zichong Li, Liming Liu, Chen Liang, Weizhu Chen, and Tuo Zhao. NorMuon: Making Muon more efficient and scalable. arXiv preprint arXiv:2510.05491, 2025.

  9. [9]

    Muon is Scalable for LLM Training

    Jingyuan Liu, Jianlin Su, Xingcheng Yao, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025.

  10. [10]

    Can a suit of armor conduct electricity? A new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 2381–2391, Brussels, Belgium, 2018.

  11. [11]

    The FineWeb datasets: Decanting the web for the finest text data at scale

    Association for Computational Linguistics. doi: 10.18653/v1/D18-1260. Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557, 2024.

  12. [12]

    Adafactor: Adaptive learning rates with sublinear memory cost

    doi: 10.1145/3474381. Noam Shazeer and Mitchell Stern. Adafactor: Adaptive learning rates with sublinear memory cost. In Proceedings of the 35th International Conference on Machine Learning, volume 80 of Proceedings of Machine Learning Research, pp. 4596–4604. PMLR, 2018.

  13. [13]

    Pushing the limits of low-bit optimizers: A focus on EMA dynamics

    Cong Xu, Wenbin Liang, Mo Yu, et al. Pushing the limits of low-bit optimizers: A focus on EMA dynamics. arXiv preprint arXiv:2505.00347, 2025.

  14. [14]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Association for Computational Linguistics. doi: 10.18653/v1/P19-1472. Ruijie Zhang, Yequan Zhao, Ziyue Liu, Zhengyang Wang, and Zheng Zhang. Muon+: Towards better Muon via one additional normalization step. arXiv preprint arXiv:2602.21545, 2026.

  15. [15]

    Preprint (under review), Appendix A: Training Details

    A.1 Model Architectures. We evaluate on two model families. Table 5 summarizes the architecture configurations.

    Family  Model   d_model  n_layer  n_head  d_ffn  Vocab  Context
    GPT-2   Small   768      12       12      3072   50257  1024
    GPT-2   Medium  1024     24       16      4096   50257  4096
    GPT-2   Large   1280     36       20      5120   50257  8192
    LLaMA   350M    1024     24       16      2736   32000  4096
    LLaMA   1.1B    ...

  16. [16]

    The µ-law companding function (Eq. 10) contains a single hyperparameter µ that controls the degree of nonlinear compression. Following the convention in signal processing, we search over values of the form 2^n − 1 (i.e., 15, 63, 127, 255, 511, 1023), which correspond to the maximum representable integer under n-bit encoding and are the standard choices in ITU-T companding specifications (ITU...

  17. [17]

    Unlike deterministic rounding, stochastic rounding is unbiased in expectation, which has been shown to prevent systematic error accumulation in gradient-based optimization

    Stochastic rounding is a widely used technique in low-precision training (Gupta et al., 2015) that replaces the deterministic round(·) operator with a randomized variant: for a value z, it rounds down to ⌊z⌋ with probability ⌈z⌉ − z and up to ⌈z⌉ with probability z − ⌊z⌋. Unlike deterministic rounding, stochastic rounding is unbiased in expectation, which has been shown to prevent systematic error accumulation in gradient-based optimization...
    that replaces the deterministic round(·) oper- ator with a randomized variant: for a value z, it rounds down to ⌊z⌋ with probability ⌈z⌉ −z and up to ⌈z⌉ with probability z− ⌊z⌋ . Un- like deterministic rounding, stochastic round- ing is unbiased in expectation, which has been shown to prevent systematic error accumula- tion in gradient-based optimization...