Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

Shiliang Wu; Xiaocan Li; Zheng Shen

arxiv: 2605.20402 · v1 · pith:ZOJ2MBIOnew · submitted 2026-05-19 · 💻 cs.LG · cs.AI

Decomposing MXFP4 quantization error for LLM reinforcement learning: reducible bias, recoverable deadzone, and an irreducible floor

Xiaocan Li , Shiliang Wu , Zheng Shen This is my paper

Pith reviewed 2026-05-21 07:56 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords MXFP4quantization errorreinforcement learningLLM post-trainingscale biasdeadzone truncationgrid noiseadaptive quantization noise

0 comments

The pith

MXFP4 quantization error in LLM RL training decomposes exactly into scale bias, deadzone truncation, and grid noise, each driving a separate failure mode.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proves that MXFP4 quantization error splits exactly into three additive components: scale bias from rounding to power-of-two scales, deadzone truncation that zeros out small values, and grid noise from snapping to the nearest 4-bit level. Scale bias builds up in gradients during backpropagation, deadzone hurts the quality of generated rollouts, and grid noise makes the learned policy more random by raising its entropy. By applying macro-block scaling to fix bias, outlier fallback to restore deadzone values, and adaptive quantization noise to tame entropy, the method brings performance back close to full-precision BF16 training on both dense and mixture-of-experts models. This breakdown matters because it replaces a single noise term with specific fixes matched to each training problem.

Core claim

MXFP4 quantization error can be decomposed exactly into three additive components—scale bias, deadzone truncation, and grid noise—where scale bias accumulates multiplicatively through the backward pass and affects gradient accuracy, deadzone truncation degrades rollout quality, and grid noise raises the policy's entropy. Targeted corrections consisting of macro-block scaling, outlier fallback, and adaptive quantization noise recover BF16 accuracy to within 0.7% for a 3B dense model and 3.0% for a 30B MoE model.

What carries the argument

The exact three-way additive decomposition of MXFP4 quantization error into scale bias from power-of-two rounding, deadzone truncation from zeroing small values, and grid noise from rounding to the nearest 4-bit grid.

If this is right

Macro-block scaling reduces scale bias and improves gradient accuracy in the backward pass.
Outlier fallback recovers deadzone-truncated entries and partially reduces scale bias error.
Adaptive quantization noise controls policy entropy raised by grid noise.
Combined application of these corrections restores nearly full BF16 accuracy on tested LLM sizes.
The decomposition enables failure-mode-specific interventions in RL post-training instead of uniform error reduction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the three components interact in ways not captured by the additive model, the corrections may need joint optimization rather than independent application.
This approach could extend to other quantization formats like INT4 or FP8 used in RL training of LLMs.
Testing on a wider range of model architectures and RL algorithms would reveal how general the three failure modes are.
The presence of an irreducible grid noise floor implies a hard limit on how close quantized RL can get to full precision without changing the bit width.

Load-bearing premise

The three quantization error components are strictly additive without significant interactions, and the corrections can be applied together without creating new errors that offset the gains.

What would settle it

Measuring the total quantization error and confirming that it does not equal the sum of the three separately measured components, or running the full correction pipeline and finding that accuracy recovery falls short of the claimed levels on the Qwen models.

Figures

Figures reproduced from arXiv: 2605.20402 by Shiliang Wu, Xiaocan Li, Zheng Shen.

**Figure 2.** Figure 2: Error decomposition analysis. (a) Improving scale precision drives total error to the [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Scale bias from E8M0 scale rounding (Qwen3-30B-A3B-Base, [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Ablation results on GSM8K. (a) MoE: corrections stack incrementally; AQN+MBS+OF [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Training dynamics (MoE, GSM8K). (a) AQN sustains policy entropy, preventing premature [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗

**Figure 6.** Figure 6: OF sensitivity by architecture. (a) Dense: OF provides [PITH_FULL_IMAGE:figures/full_fig_p016_6.png] view at source ↗

**Figure 7.** Figure 7: AQN σstart sensitivity (Dense, MBS+OF). σ = 1% is optimal; 2% overshoots and degrades below no-AQN baseline. E Complementarity with upstream techniques Our two error corrections (MBS, OF) operate during quantization, while AQN operates on the training dynamics. An alternative strategy is to reshape the input distribution before quantization so that the format’s limitations bite less. Stochastic rounding (S… view at source ↗

read the original abstract

MXFP4 arithmetic can dramatically accelerate reinforcement learning (RL) post-training of large language models (LLMs), yet the quantization error introduces severe accuracy degradation. Existing work treats the quantization error as a monolithic noise term, missing the distinct mechanisms upon interpreting how quantization error damages training. We prove an exact three-way decomposition of quantization error and show how each component dominates a distinct RL training pathway. Our theoretical and empirical analysis decomposes the MXFP4 quantization error into three additive components: "scale bias" from power-of-two rounding, "deadzone truncation" from zeroing small values, and "grid noise" from rounding to the nearest 4-bit grid. Each component dominates a distinct RL failure mode: scale bias accumulates multiplicatively through the backward pass, affecting gradient accuracy; deadzone truncation degrades rollout quality; and grid noise raises the policy's entropy. We combine corrections that are RL failure mode-targeted but not component-exclusive: Macro-block scaling to reduce scale bias, Outlier Fallback recovers deadzone entries, but also partially reduces scale bias induced error, and Adaptive Quantization Noise (AQN) for controlling the policy entropy. On Qwen2.5-3B dense and Qwen3-30B-A3B-Base mixture-of-experts model, the targeted corrections recover BF16 accuracy to within 0.7% and 3.0% respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper breaks MXFP4 error into three RL-linked parts and shows targeted fixes recover most accuracy, but the exact additivity and lack of interaction checks are the main open questions.

read the letter

The main thing to know is that this work splits MXFP4 quantization error into scale bias, deadzone truncation, and grid noise, maps each to a different RL training problem, and reports that combining three corrections brings results back close to BF16 on two Qwen models. That decomposition and the failure-mode targeting look like the actual contribution relative to earlier quantization papers. The empirical recoveries—to within 0.7 % on the 3B dense model and 3 % on the 30B MoE—are the clearest positive signal here and suggest the fixes are at least practically useful for people running RL post-training under tight compute budgets. The paper does a reasonable job explaining why each error type hits gradients, rollouts, or entropy differently, which gives readers concrete places to intervene. On the soft spots, the claim of an exact additive decomposition is stated strongly in the abstract but the provided text gives no derivation or proof sketch, so it is difficult to judge whether cross terms really stay small once the corrections run together. The note that the fixes are “not component-exclusive” already flags possible overlap, and the stress-test concern about multiplicative effects in the backward pass and policy updates seems worth checking; if those interactions exist, the clean attribution of the accuracy gains becomes harder to maintain. The experiments are only summarized at a high level, with no visible controls for how the corrections were tuned or whether other baselines were tried. This paper is mainly for engineers and researchers who already work on low-precision RL for LLMs and want a menu of targeted mitigations rather than a monolithic noise model. A reader who cares about making 4-bit arithmetic reliable for scaling post-training would find the breakdown and the reported numbers useful even if the theory needs more support. I would send it to peer review so the derivations and interaction checks can be examined properly.

Referee Report

2 major / 2 minor

Summary. The manuscript claims an exact three-way additive decomposition of MXFP4 quantization error into scale bias (from power-of-two rounding), deadzone truncation (from zeroing small values), and grid noise (from nearest 4-bit grid rounding). Each component is asserted to dominate a distinct RL pathway—scale bias in gradient accuracy, deadzone truncation in rollout quality, and grid noise in policy entropy—and the authors propose targeted but non-exclusive corrections (macro-block scaling, outlier fallback, and adaptive quantization noise) that recover BF16 accuracy to within 0.7% on Qwen2.5-3B and 3.0% on Qwen3-30B-A3B-Base.

Significance. If the decomposition is rigorously exact and the corrections combine without unaccounted interactions, the work could meaningfully advance efficient low-precision RL post-training by moving beyond monolithic noise treatments. The empirical recoveries on both dense and MoE models are practically relevant, but significance is limited by the need to confirm additivity and absence of cross-terms in the RL backward pass.

major comments (2)

[Abstract] Abstract and theoretical analysis: the central claim of an 'exact three-way decomposition' into strictly additive components requires the explicit derivation showing that scale bias, deadzone truncation, and grid noise have no significant cross terms under gradient flow and policy updates; the abstract's statement that corrections are 'not component-exclusive' raises the possibility of overlap that must be quantified.
[Empirical results] Empirical evaluation: the reported recoveries to within 0.7% and 3.0% of BF16 accuracy are load-bearing for the practical claim, yet without ablation controls isolating each correction's contribution or measuring interaction effects in the combined setting, it is unclear whether the gains can be cleanly attributed to addressing each error component independently.

minor comments (2)

[Notation] Clarify the precise mathematical definitions of the MXFP4 grid points and deadzone threshold in the decomposition to allow independent verification.
[Figures] Ensure all figures comparing quantized vs. corrected training curves include error bars or multiple seeds for statistical robustness.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. These have prompted us to strengthen the theoretical justification and empirical attribution in the manuscript. We address each major comment below and outline the planned revisions.

read point-by-point responses

Referee: [Abstract] Abstract and theoretical analysis: the central claim of an 'exact three-way decomposition' into strictly additive components requires the explicit derivation showing that scale bias, deadzone truncation, and grid noise have no significant cross terms under gradient flow and policy updates; the abstract's statement that corrections are 'not component-exclusive' raises the possibility of overlap that must be quantified.

Authors: We agree that an explicit derivation is required to confirm the absence of significant cross terms once gradients are taken. The decomposition itself is exact for the forward quantization operator; however, under back-propagation through the RL loss we will add a short appendix derivation showing that cross terms appear only at second order in the scale factor and are negligible for the policy-gradient and entropy-regularized objectives used in our experiments. Regarding non-exclusivity, we will quantify overlap by reporting the marginal contribution of each correction (macro-block scaling, outlier fallback, AQN) when applied alone versus in all combinations, thereby making the degree of interaction explicit. revision: partial
Referee: [Empirical results] Empirical evaluation: the reported recoveries to within 0.7% and 3.0% of BF16 accuracy are load-bearing for the practical claim, yet without ablation controls isolating each correction's contribution or measuring interaction effects in the combined setting, it is unclear whether the gains can be cleanly attributed to addressing each error component independently.

Authors: We accept that the current results would be strengthened by explicit ablations. In the revision we will add a dedicated subsection presenting (i) each correction applied in isolation, (ii) all pairwise combinations, and (iii) the full three-correction setting, together with the corresponding RL metrics (gradient norm, rollout reward, policy entropy). This will allow readers to attribute performance gains to individual components and to observe any interaction effects directly. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

full rationale

The paper claims an exact three-way additive decomposition of MXFP4 quantization error into scale bias, deadzone truncation, and grid noise, each tied to distinct RL pathways, with targeted corrections recovering near-BF16 accuracy. No equations, fitted parameters, or self-citations appear in the abstract or description that would reduce this decomposition to a definition, prior fit, or author-imported uniqueness theorem. The additivity is presented as a proved theoretical result rather than a renaming or ansatz smuggled via citation. The note that corrections are 'not component-exclusive' is an explicit acknowledgment of potential interactions rather than a hidden circularity. This qualifies as a normal, non-circular finding with independent theoretical content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the premise that quantization error admits an exact additive decomposition into the three named components; this is treated as a domain assumption rather than derived from first principles in the visible text.

axioms (1)

domain assumption MXFP4 quantization error admits an exact three-way additive decomposition into scale bias, deadzone truncation, and grid noise.
Directly stated as 'prove an exact three-way decomposition' in the abstract.

pith-pipeline@v0.9.0 · 5792 in / 1284 out tokens · 52136 ms · 2026-05-21T07:56:00.351186+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We prove an exact three-way decomposition of quantization error ... scale bias ... deadzone truncation ... grid noise ... Lemma 3.2 (Exact orthogonality: DZ ⊥ Scale and DZ ⊥ Grid).

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 7 internal anchors

[1]

Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot : Outlier-free 4-bit inference in rotated LLMs . In Advances in Neural Information Processing Systems, volume 37, 2024

work page 2024
[2]

W. R. Bennett. Spectra of quantized signals. Bell System Technical Journal, 27 0 (3): 0 446--472, 1948

work page 1948
[3]

Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, and Dan Alistarh

Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, and Dan Alistarh. Quartet: Native FP4 training can be optimal for large language models. arXiv preprint arXiv:2505.14669, 2025

work page arXiv 2025
[4]

QuIP : 2-bit quantization of large language models with guarantees

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. QuIP : 2-bit quantization of large language models with guarantees. In Advances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[5]

Oscillation-reduced MXFP4 training for vision transformers

Yuxiang Chen, Haocheng Xi, Jun Zhu, and Jianfei Chen. Oscillation-reduced MXFP4 training for vision transformers. In International Conference on Machine Learning (ICML), 2025

work page 2025
[6]

Unveiling the potential of quantization with MXFP4 : Strategies for quantization error reduction

Jatin Chhugani, Geonhwa Jeong, Bor-Yiing Su, Yunjie Pan, Hanmei Yang, Aayush Ankit, Jiecao Yu, Summer Deng, Yunqing Chen, Nadathur Satish, and Changkyu Kim. Unveiling the potential of quantization with MXFP4 : Strategies for quantization error reduction. arXiv preprint arXiv:2603.08713, 2026

work page arXiv 2026
[7]

FP4 all the way: Fully quantized training of LLMs

Brian Chmiel, Maxim Fishman, Ron Banner, and Daniel Soudry. FP4 all the way: Fully quantized training of LLMs . arXiv preprint arXiv:2505.19115, 2025

work page arXiv 2025
[8]

Grouped sequency-arranged rotation: Optimizing rotation transformation for quantization for free

Euntae Choi, Sumin Song, Woosang Lim, and Sungjoo Yoo. Grouped sequency-arranged rotation: Optimizing rotation transformation for quantization for free. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 165--172, 2025. doi:10.18653/v1/2025.acl-srw.10

work page doi:10.18653/v1/2025.acl-srw.10 2025
[9]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[10]

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han. Four over six: More accurate NVFP4 quantization with adaptive block scaling. arXiv preprint arXiv:2512.02010, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale . In Advances in Neural Information Processing Systems, volume 35, 2022

work page 2022
[12]

QLoRA : Efficient finetuning of quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA : Efficient finetuning of quantized LLMs . In Advances in Neural Information Processing Systems, volume 36, 2023

work page 2023
[13]

SpQR : A sparse-quantized representation for near-lossless LLM weight compression

Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. SpQR : A sparse-quantized representation for near-lossless LLM weight compression. In International Conference on Learning Representations, 2024

work page 2024
[14]

Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Noll Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh

Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Noll Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Bridging the gap between promise and performance for microscaling FP4 quantization. In International Conference on Learning Representations, 2026. Introduces MR-GPTQ (mi...

work page arXiv 2026
[15]

Scaling FP8 training to trillion-token LLMs

Maxim Fishman, Brian Chmiel, Ron Banner, and Daniel Soudry. Scaling FP8 training to trillion-token LLMs . In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=E1EHO0imOb

work page 2025
[16]

Noisy networks for exploration

Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Matteo Hessel, Ian Osband, Alex Graves, Volodymyr Mnih, R \'e mi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rywHCPkAW

work page 2018
[17]

GPTQ : Accurate post-training quantization for generative pre-trained transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ : Accurate post-training quantization for generative pre-trained transformers. In International Conference on Learning Representations, 2023

work page 2023
[18]

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. AReaL : A large-scale asynchronous reinforcement learning system for language reasoning. arXiv preprint arXiv:2505.24298, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Deep learning with limited numerical precision

Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37, pages 1737--1746, 2015. URL http://proceedings.mlr.press/v37/gupta15.html

work page 2015
[20]

Low-precision training of large language models: Methods, challenges, and opportunities

Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Guoxia Wang, Dianhai Yu, Yonggang Wen, and Dacheng Tao. Low-precision training of large language models: Methods, challenges, and opportunities. arXiv preprint arXiv:2505.01043, 2025

work page arXiv 2025
[21]

Towards fully FP8 GEMM LLM training at scale

Alejandro Hern \'a ndez-Cano, Dhia Garbaya, Imanol Schlag, and Martin Jaggi. Towards fully FP8 GEMM LLM training at scale. arXiv preprint arXiv:2505.20524, 2025

work page arXiv 2025
[22]

QeRL : Beyond efficiency---quantization-enhanced reinforcement learning for LLMs

Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, and Yukang Chen. QeRL : Beyond efficiency---quantization-enhanced reinforcement learning for LLMs . In International Conference on Learning Representations, 2026

work page 2026
[23]

Mahoney, and Kurt Keutzer

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM : Dense-and-sparse quantization. In Proceedings of the 41st International Conference on Machine Learning, 2024

work page 2024
[24]

ParoQuant : Pairwise rotation quantization for efficient reasoning LLM inference

Yesheng Liang, Haisheng Chen, Song Han, and Zhijian Liu. ParoQuant : Pairwise rotation quantization for efficient reasoning LLM inference. arXiv preprint arXiv:2511.10645, 2025

work page arXiv 2025
[25]

AWQ : Activation-aware weight quantization for on-device LLM compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ : Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of Machine Learning and Systems, volume 6, 2024

work page 2024
[26]

Llm-qat: Data-free quantization aware training for large language models

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware training for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 467--484, 2024

work page 2024
[27]

SpinQuant : LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant : LLM quantization with learned rotations. In International Conference on Learning Representations, 2025

work page 2025
[28]

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart F. Oberman, Mohammad Shoeybi, Michael Y. Siu, and Hao Wu. FP8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[29]

Adding Gradient Noise Improves Learning for Very Deep Networks

Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[30]

Pretraining large language models with NVFP4

NVIDIA . Pretraining large language models with NVFP4 . arXiv preprint arXiv:2509.25149, 2025

work page arXiv 2025
[31]

Quartet II : Accurate LLM pre-training in NVFP4 by improved unbiased gradient estimation

Andrei Panferov, Erik Schultheis, Soroush Tabesh, and Dan Alistarh. Quartet II : Accurate LLM pre-training in NVFP4 by improved unbiased gradient estimation. arXiv preprint arXiv:2601.22813, 2026

work page arXiv 2026
[32]

Outlier-safe pre-training for robust 4-bit quantization of large language models

Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, and Jaewoo Kang. Outlier-safe pre-training for robust 4-bit quantization of large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 12582--12600, 2025

work page 2025
[33]

OCP microscaling formats ( MX ) specification

Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, et al. OCP microscaling formats ( MX ) specification. Open Compute Project, 2023

work page 2023
[34]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath : Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

QuIP\# : Even better LLM quantization with hadamard incoherence and lattice codebooks

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP\# : Even better LLM quantization with hadamard incoherence and lattice codebooks. In Proceedings of the 41st International Conference on Machine Learning, 2024 a

work page 2024
[36]

QTIP : Quantization with trellises and incoherence processing

Albert Tseng, Qingyao Sun, David Hou, and Christopher De Sa. QTIP : Quantization with trellises and incoherence processing. In Advances in Neural Information Processing Systems, volume 37, 2024 b

work page 2024
[37]

Training llms with mxfp4, 2025

Albert Tseng, Tao Yu, and Youngsuk Park. Training llms with mxfp4, 2025. URL https://arxiv.org/abs/2502.20586

work page arXiv 2025
[38]

Quantization Noise: Roundoff Error in Digital Computation, Signal Processing, Control, and Communications

Bernard Widrow and Istv \'a n Koll \'a r. Quantization Noise: Roundoff Error in Digital Computation, Signal Processing, Control, and Communications. Cambridge University Press, 2008. ISBN 978-0-521-88671-0

work page 2008
[39]

SmoothQuant : Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant : Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023

work page 2023
[40]

Your efficient rl framework secretly brings you off-policy rl training, August 2025

Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Your efficient rl framework secretly brings you off-policy rl training, August 2025. URL https://fengyao.notion.site/off-policy-rl

work page 2025
[41]

ZeroQuant : Efficient and affordable post-training quantization for large-scale transformers

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. ZeroQuant : Efficient and affordable post-training quantization for large-scale transformers. In Advances in Neural Information Processing Systems, volume 35, pages 27168--27183, 2022

work page 2022
[42]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, et al. DAPO : An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[43]

Accurate INT8 training through dynamic block-level fallback

Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, and Jianfei Chen. Accurate INT8 training through dynamic block-level fallback. arXiv preprint arXiv:2503.08040, 2025

work page arXiv 2025
[44]

Practical FP4 training for large-scale MoE models on hopper GPUs

Wuyue Zhang, Chongdong Huang, Chunbo You, Cheng Gu, Fengjuan Wang, and Mou Sun. Practical FP4 training for large-scale MoE models on hopper GPUs . arXiv preprint arXiv:2603.02731, 2026

work page arXiv 2026

[1] [1]

Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot : Outlier-free 4-bit inference in rotated LLMs . In Advances in Neural Information Processing Systems, volume 37, 2024

work page 2024

[2] [2]

W. R. Bennett. Spectra of quantized signals. Bell System Technical Journal, 27 0 (3): 0 446--472, 1948

work page 1948

[3] [3]

Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, and Dan Alistarh

Roberto L. Castro, Andrei Panferov, Soroush Tabesh, Oliver Sieberling, Jiale Chen, Mahdi Nikdan, Saleh Ashkboos, and Dan Alistarh. Quartet: Native FP4 training can be optimal for large language models. arXiv preprint arXiv:2505.14669, 2025

work page arXiv 2025

[4] [4]

QuIP : 2-bit quantization of large language models with guarantees

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. QuIP : 2-bit quantization of large language models with guarantees. In Advances in Neural Information Processing Systems, volume 36, 2023

work page 2023

[5] [5]

Oscillation-reduced MXFP4 training for vision transformers

Yuxiang Chen, Haocheng Xi, Jun Zhu, and Jianfei Chen. Oscillation-reduced MXFP4 training for vision transformers. In International Conference on Machine Learning (ICML), 2025

work page 2025

[6] [6]

Unveiling the potential of quantization with MXFP4 : Strategies for quantization error reduction

Jatin Chhugani, Geonhwa Jeong, Bor-Yiing Su, Yunjie Pan, Hanmei Yang, Aayush Ankit, Jiecao Yu, Summer Deng, Yunqing Chen, Nadathur Satish, and Changkyu Kim. Unveiling the potential of quantization with MXFP4 : Strategies for quantization error reduction. arXiv preprint arXiv:2603.08713, 2026

work page arXiv 2026

[7] [7]

FP4 all the way: Fully quantized training of LLMs

Brian Chmiel, Maxim Fishman, Ron Banner, and Daniel Soudry. FP4 all the way: Fully quantized training of LLMs . arXiv preprint arXiv:2505.19115, 2025

work page arXiv 2025

[8] [8]

Grouped sequency-arranged rotation: Optimizing rotation transformation for quantization for free

Euntae Choi, Sumin Song, Woosang Lim, and Sungjoo Yoo. Grouped sequency-arranged rotation: Optimizing rotation transformation for quantization for free. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, pages 165--172, 2025. doi:10.18653/v1/2025.acl-srw.10

work page doi:10.18653/v1/2025.acl-srw.10 2025

[9] [9]

Training Verifiers to Solve Math Word Problems

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[10] [10]

Four Over Six: More Accurate NVFP4 Quantization with Adaptive Block Scaling

Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han. Four over six: More accurate NVFP4 quantization with adaptive block scaling. arXiv preprint arXiv:2512.02010, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[11] [11]

LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale . In Advances in Neural Information Processing Systems, volume 35, 2022

work page 2022

[12] [12]

QLoRA : Efficient finetuning of quantized LLMs

Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. QLoRA : Efficient finetuning of quantized LLMs . In Advances in Neural Information Processing Systems, volume 36, 2023

work page 2023

[13] [13]

SpQR : A sparse-quantized representation for near-lossless LLM weight compression

Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. SpQR : A sparse-quantized representation for near-lossless LLM weight compression. In International Conference on Learning Representations, 2024

work page 2024

[14] [14]

Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Noll Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh

Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Noll Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Bridging the gap between promise and performance for microscaling FP4 quantization. In International Conference on Learning Representations, 2026. Introduces MR-GPTQ (mi...

work page arXiv 2026

[15] [15]

Scaling FP8 training to trillion-token LLMs

Maxim Fishman, Brian Chmiel, Ron Banner, and Daniel Soudry. Scaling FP8 training to trillion-token LLMs . In International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=E1EHO0imOb

work page 2025

[16] [16]

Noisy networks for exploration

Meire Fortunato, Mohammad Gheshlaghi Azar, Bilal Piot, Jacob Menick, Matteo Hessel, Ian Osband, Alex Graves, Volodymyr Mnih, R \'e mi Munos, Demis Hassabis, Olivier Pietquin, Charles Blundell, and Shane Legg. Noisy networks for exploration. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=rywHCPkAW

work page 2018

[17] [17]

GPTQ : Accurate post-training quantization for generative pre-trained transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ : Accurate post-training quantization for generative pre-trained transformers. In International Conference on Learning Representations, 2023

work page 2023

[18] [18]

AReaL: A Large-Scale Asynchronous Reinforcement Learning System for Language Reasoning

Wei Fu, Jiaxuan Gao, Xujie Shen, Chen Zhu, Zhiyu Mei, Chuyi He, Shusheng Xu, Guo Wei, Jun Mei, Jiashu Wang, Tongkai Yang, Binhang Yuan, and Yi Wu. AReaL : A large-scale asynchronous reinforcement learning system for language reasoning. arXiv preprint arXiv:2505.24298, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[19] [19]

Deep learning with limited numerical precision

Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on Machine Learning (ICML), volume 37, pages 1737--1746, 2015. URL http://proceedings.mlr.press/v37/gupta15.html

work page 2015

[20] [20]

Low-precision training of large language models: Methods, challenges, and opportunities

Zhiwei Hao, Jianyuan Guo, Li Shen, Yong Luo, Han Hu, Guoxia Wang, Dianhai Yu, Yonggang Wen, and Dacheng Tao. Low-precision training of large language models: Methods, challenges, and opportunities. arXiv preprint arXiv:2505.01043, 2025

work page arXiv 2025

[21] [21]

Towards fully FP8 GEMM LLM training at scale

Alejandro Hern \'a ndez-Cano, Dhia Garbaya, Imanol Schlag, and Martin Jaggi. Towards fully FP8 GEMM LLM training at scale. arXiv preprint arXiv:2505.20524, 2025

work page arXiv 2025

[22] [22]

QeRL : Beyond efficiency---quantization-enhanced reinforcement learning for LLMs

Wei Huang, Yi Ge, Shuai Yang, Yicheng Xiao, Huizi Mao, Yujun Lin, Hanrong Ye, Sifei Liu, Ka Chun Cheung, Hongxu Yin, Yao Lu, Xiaojuan Qi, Song Han, and Yukang Chen. QeRL : Beyond efficiency---quantization-enhanced reinforcement learning for LLMs . In International Conference on Learning Representations, 2026

work page 2026

[23] [23]

Mahoney, and Kurt Keutzer

Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM : Dense-and-sparse quantization. In Proceedings of the 41st International Conference on Machine Learning, 2024

work page 2024

[24] [24]

ParoQuant : Pairwise rotation quantization for efficient reasoning LLM inference

Yesheng Liang, Haisheng Chen, Song Han, and Zhijian Liu. ParoQuant : Pairwise rotation quantization for efficient reasoning LLM inference. arXiv preprint arXiv:2511.10645, 2025

work page arXiv 2025

[25] [25]

AWQ : Activation-aware weight quantization for on-device LLM compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ : Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of Machine Learning and Systems, volume 6, 2024

work page 2024

[26] [26]

Llm-qat: Data-free quantization aware training for large language models

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. Llm-qat: Data-free quantization aware training for large language models. In Findings of the Association for Computational Linguistics: ACL 2024, pages 467--484, 2024

work page 2024

[27] [27]

SpinQuant : LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant : LLM quantization with learned rotations. In International Conference on Learning Representations, 2025

work page 2025

[28] [28]

FP8 Formats for Deep Learning

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart F. Oberman, Mohammad Shoeybi, Michael Y. Siu, and Hao Wu. FP8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[29] [29]

Adding Gradient Noise Improves Learning for Very Deep Networks

Arvind Neelakantan, Luke Vilnis, Quoc V. Le, Ilya Sutskever, Lukasz Kaiser, Karol Kurach, and James Martens. Adding gradient noise improves learning for very deep networks. arXiv preprint arXiv:1511.06807, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[30] [30]

Pretraining large language models with NVFP4

NVIDIA . Pretraining large language models with NVFP4 . arXiv preprint arXiv:2509.25149, 2025

work page arXiv 2025

[31] [31]

Quartet II : Accurate LLM pre-training in NVFP4 by improved unbiased gradient estimation

Andrei Panferov, Erik Schultheis, Soroush Tabesh, and Dan Alistarh. Quartet II : Accurate LLM pre-training in NVFP4 by improved unbiased gradient estimation. arXiv preprint arXiv:2601.22813, 2026

work page arXiv 2026

[32] [32]

Outlier-safe pre-training for robust 4-bit quantization of large language models

Jungwoo Park, Taewhoo Lee, Chanwoong Yoon, Hyeon Hwang, and Jaewoo Kang. Outlier-safe pre-training for robust 4-bit quantization of large language models. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (ACL), pages 12582--12600, 2025

work page 2025

[33] [33]

OCP microscaling formats ( MX ) specification

Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, et al. OCP microscaling formats ( MX ) specification. Open Compute Project, 2023

work page 2023

[34] [34]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. DeepSeekMath : Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[35] [35]

QuIP\# : Even better LLM quantization with hadamard incoherence and lattice codebooks

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP\# : Even better LLM quantization with hadamard incoherence and lattice codebooks. In Proceedings of the 41st International Conference on Machine Learning, 2024 a

work page 2024

[36] [36]

QTIP : Quantization with trellises and incoherence processing

Albert Tseng, Qingyao Sun, David Hou, and Christopher De Sa. QTIP : Quantization with trellises and incoherence processing. In Advances in Neural Information Processing Systems, volume 37, 2024 b

work page 2024

[37] [37]

Training llms with mxfp4, 2025

Albert Tseng, Tao Yu, and Youngsuk Park. Training llms with mxfp4, 2025. URL https://arxiv.org/abs/2502.20586

work page arXiv 2025

[38] [38]

Quantization Noise: Roundoff Error in Digital Computation, Signal Processing, Control, and Communications

Bernard Widrow and Istv \'a n Koll \'a r. Quantization Noise: Roundoff Error in Digital Computation, Signal Processing, Control, and Communications. Cambridge University Press, 2008. ISBN 978-0-521-88671-0

work page 2008

[39] [39]

SmoothQuant : Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant : Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023

work page 2023

[40] [40]

Your efficient rl framework secretly brings you off-policy rl training, August 2025

Feng Yao, Liyuan Liu, Dinghuai Zhang, Chengyu Dong, Jingbo Shang, and Jianfeng Gao. Your efficient rl framework secretly brings you off-policy rl training, August 2025. URL https://fengyao.notion.site/off-policy-rl

work page 2025

[41] [41]

ZeroQuant : Efficient and affordable post-training quantization for large-scale transformers

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. ZeroQuant : Efficient and affordable post-training quantization for large-scale transformers. In Advances in Neural Information Processing Systems, volume 35, pages 27168--27183, 2022

work page 2022

[42] [42]

DAPO: An Open-Source LLM Reinforcement Learning System at Scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, et al. DAPO : An open-source LLM reinforcement learning system at scale. arXiv preprint arXiv:2503.14476, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[43] [43]

Accurate INT8 training through dynamic block-level fallback

Pengle Zhang, Jia Wei, Jintao Zhang, Jun Zhu, and Jianfei Chen. Accurate INT8 training through dynamic block-level fallback. arXiv preprint arXiv:2503.08040, 2025

work page arXiv 2025

[44] [44]

Practical FP4 training for large-scale MoE models on hopper GPUs

Wuyue Zhang, Chongdong Huang, Chunbo You, Cheng Gu, Fengjuan Wang, and Mou Sun. Practical FP4 training for large-scale MoE models on hopper GPUs . arXiv preprint arXiv:2603.02731, 2026

work page arXiv 2026