Rethinking Shrinkage Bias in LLM FP4 Pretraining: Geometric Origin, Systemic Impact, and UFP4 Recipe

Changxin Tian; Chaofan Yu; Haitao Zhang; Jia Liu; Jun Zhou; Kunlong Chen; Mingliang Gong; Peijie Jiang; Qian Zhao; Zhiqiang Zhang

arxiv: 2606.20381 · v1 · pith:J3QPRZNSnew · submitted 2026-06-18 · 💻 cs.AI

Qian Zhao, Kunlong Chen, Changxin Tian, Zhonghui Jiang, Haitao Zhang, Chaofan Yu, Peijie Jiang, Mingliang Gong, Jia Liu, Ziqi Liu, Zhiqiang Zhang, Jun Zhou

Pith reviewed 2026-06-26 16:57 UTC · model grok-4.3

classification 💻 cs.AI

keywords FP4 quantization · Shrinkage Bias · LLM pretraining · E2M1 format · UFP4 recipe · Random Hadamard Transform · training stability · uniform quantization grids

The pith

Non-uniform E2M1 FP4 formats create Shrinkage Bias from bin asymmetry that accumulates across layers and drives training instability, while uniform grids avoid it.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that E2M1's non-uniform bins produce a systematic negative rounding error called Shrinkage Bias. This bias multiplies through successive layers and is amplified by the Random Hadamard Transform, supplying a single account for the instability seen in current E2M1 FP4 training runs. Uniform formats such as E1M2 and INT4 have no such geometric error and turn the bucket-utilization gains from RHT into measurable accuracy improvements. The authors introduce the UFP4 recipe, which applies RHT to every training GEMM yet restricts stochastic rounding to the dY term alone, and report lower BF16-relative loss degradation than strong E2M1 baselines on Dense 1.5B, MoE 7.9B, and MoE 124B pretraining.
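
The dY-only stochastic-rounding rule can be illustrated with a minimal sketch. It assumes a plain integer grid with no range clamping, and the `quantize` helper, the `USE_STOCHASTIC` table, and the operand names `x`, `w`, `dy` are hypothetical names chosen for illustration, not the paper's code. Stochastic rounding picks the upper neighbour with probability equal to the fractional part, so its expected value equals the input, whereas round-to-nearest is deterministic.

```python
import math
import random

def round_nearest(x):
    """Deterministic round-to-nearest on the integer grid (ties away from zero)."""
    sign = 1 if x >= 0 else -1
    return sign * math.floor(abs(x) + 0.5)

def round_stochastic(x, rng):
    """Round up with probability equal to the fractional part, so E[q] = x."""
    lo = math.floor(x)
    return lo + (1 if rng.random() < (x - lo) else 0)

# Hypothetical policy table: stochastic rounding only for the output gradient dY.
USE_STOCHASTIC = {"x": False, "w": False, "dy": True}

def quantize(value, operand, rng):
    """Quantize one value using the rounding mode assigned to its operand."""
    if USE_STOCHASTIC[operand]:
        return round_stochastic(value, rng)
    return round_nearest(value)

rng = random.Random(0)
samples = [quantize(2.3, "dy", rng) for _ in range(20000)]
mean_dy = sum(samples) / len(samples)  # averages out close to the input 2.3
q_x = quantize(2.3, "x", rng)          # deterministic: always the nearest level
print(mean_dy, q_x)
```

Each individual `dy` sample is still 2 or 3; only the average recovers the input, which is why stochastic rounding trades a per-element variance increase for the removal of systematic bias.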

Core claim

The central claim is that the geometric asymmetry of E2M1's representable bins produces Shrinkage Bias, a negative rounding error that accumulates multiplicatively across layers and is further amplified by RHT. This bias supplies a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. Uniform grids bypass the grid-geometry error entirely and convert RHT's improved bucket utilization into higher quantization quality. UFP4 is presented as the practical uniform 4-bit recipe that realizes these advantages while restricting stochastic rounding to dY.

What carries the argument

Shrinkage Bias, the systematic negative rounding error caused by the geometric asymmetry of non-uniform E2M1 representable bins; it accumulates multiplicatively across layers and is amplified by RHT.
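
One way to make the bin asymmetry concrete is to look at the rounding cell of each representable magnitude. The sketch below uses the E2M1 magnitudes (0, 0.5, 1, 1.5, 2, 3, 4, 6) and, as an assumed stand-in for E1M2/INT4-style grids, eight equally spaced levels. The helper names and the "cell offset" diagnostic are illustrative assumptions; this page does not state whether the paper defines Shrinkage Bias in exactly this way.

```python
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # magnitudes representable in E2M1
UNIFORM = [float(k) for k in range(8)]            # assumed uniform stand-in grid

def rounding_cell(grid, i):
    """Interval of magnitudes that round-to-nearest maps to grid[i]."""
    lo = grid[i] if i == 0 else (grid[i - 1] + grid[i]) / 2
    hi = grid[i] if i == len(grid) - 1 else (grid[i] + grid[i + 1]) / 2
    return lo, hi

def cell_offset(grid, i):
    """Grid value minus the midpoint of its rounding cell.

    A negative offset means the level sits below the centre of the
    interval it represents, so inputs spread evenly over that cell
    are pulled towards zero on average.
    """
    lo, hi = rounding_cell(grid, i)
    return grid[i] - (lo + hi) / 2

for name, grid in (("E2M1", E2M1), ("uniform", UNIFORM)):
    for i in range(1, len(grid) - 1):  # interior levels only
        print(name, grid[i], rounding_cell(grid, i), cell_offset(grid, i))
```

On the E2M1 grid the interior offsets are zero except at 2.0 and 4.0, where the spacing changes: the cells are [1.75, 2.5] and [3.5, 5.0], giving offsets of -0.125 and -0.25. Every interior cell of the uniform grid is symmetric, so its offsets are all zero.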

If this is right

UFP4 achieves lower BF16-relative loss degradation than E2M1 baselines on Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining.
Uniform grids convert the improved bucket utilization from RHT into higher quantization quality without introducing grid-geometry error.
The bias accumulates multiplicatively across layers, so its effect grows with network depth.
Future accelerators should treat E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
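
The depth dependence can be sketched with a toy model. It assumes that every layer shrinks the signal magnitude by the same factor (1 - eps) and that these factors simply multiply; `eps` and the function name are hypothetical, and the value 0.01 is illustrative rather than a number taken from the paper.

```python
import math

def compounded_scale(eps, depth):
    """Relative signal scale after `depth` layers that each shrink by (1 - eps)."""
    return (1.0 - eps) ** depth

for depth in (1, 10, 100):
    # The second column is the small-eps approximation exp(-eps * depth).
    print(depth, compounded_scale(0.01, depth), math.exp(-0.01 * depth))
```

Under these assumptions a per-layer bias that looks negligible in isolation decays the signal roughly exponentially in depth, which is the sense in which a multiplicative bias grows with network depth.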

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same geometric bias mechanism may appear in any quantization format whose bin boundaries are asymmetrically spaced.
Hardware vendors could prioritize uniform 4-bit support to enable more stable low-precision training at scale.
Selective application of stochastic rounding only to dY may generalize as a stability technique beyond the UFP4 recipe.
Scaling-law studies of FP4 pretraining may need explicit correction terms for cumulative rounding bias.

Load-bearing premise

The geometric asymmetry of E2M1 bins is the primary driver of observed training instability rather than optimizer interactions, hardware rounding, or data-dependent effects.

What would settle it

A controlled multi-layer forward-pass experiment that isolates rounding error in E2M1 versus an otherwise identical uniform grid, or a training run in which E2M1 rounding is forced to be symmetric and instability disappears.
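
A single-layer version of such an isolation test might look like the sketch below, which changes only the grid and holds everything else fixed. The Gaussian inputs, the block size of 32, absmax block scaling, tie-breaking towards the lower level, and all function names are assumptions made for illustration. The script reports the mean signed magnitude error for each grid and asserts no particular outcome.

```python
import random

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
UNIFORM = [float(k) for k in range(8)]  # assumed uniform stand-in grid

def nearest(grid, m):
    """Nearest grid magnitude to m >= 0 (ties go to the lower level)."""
    return min(grid, key=lambda g: (abs(g - m), g))

def quantize_block(block, grid):
    """Absmax-scale a block onto the grid, round to nearest, and scale back."""
    amax = max(abs(v) for v in block)
    if amax == 0.0:
        return list(block)
    scale = amax / grid[-1]
    return [(1 if v >= 0 else -1) * nearest(grid, abs(v) / scale) * scale
            for v in block]

def mean_signed_magnitude_error(grid, blocks):
    """Average of |q| - |x| over all elements; negative values indicate shrinkage."""
    total, count = 0.0, 0
    for block in blocks:
        for v, q in zip(block, quantize_block(block, grid)):
            total += abs(q) - abs(v)
            count += 1
    return total / count

rng = random.Random(0)
blocks = [[rng.gauss(0.0, 1.0) for _ in range(32)] for _ in range(2000)]
for name, grid in (("E2M1", E2M1), ("uniform", UNIFORM)):
    print(name, mean_signed_magnitude_error(grid, blocks))
```

Because both grids see the same blocks, the same scaling rule, and the same rounding rule, any difference in the reported error is attributable to grid shape alone. Extending the harness to several stacked layers would be needed to probe the accumulation claim.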

Original abstract

FP4 training promises substantial reductions in memory and computation cost for LLM pretraining, yet current FP4 hardware paths and recipes, including NVIDIA Blackwell/Rubin-class systems and AMD MI350-series GPUs, remain centered on E2M1 data elements. In this study, we identify a fundamental limitation of that choice: non-uniform formats such as E2M1 inherently suffer from Shrinkage Bias, a systematic negative rounding error caused by the geometric asymmetry of their representable bins. We show that this bias accumulates multiplicatively across layers and is amplified by the Random Hadamard Transform (RHT), providing a unified explanation for the training instability observed in existing E2M1-based FP4 recipes. In contrast, uniform grids (E1M2/INT4) bypass this grid-geometry error and better convert the improved bucket utilization from RHT into higher quantization quality. Based on this finding, we propose UFP4, a uniform 4-bit training recipe that applies RHT to all three training GEMMs while restricting stochastic rounding to dY alone. On Dense 1.5B, MoE 7.9B, and MoE 124B long-run pretraining, UFP4 consistently achieves lower BF16-relative loss degradation than strong E2M1-based baselines, supported by scaling-law analysis and ablation studies. Our results suggest that future accelerators should support E1M2/INT4-style uniform 4-bit grids as first-class training primitives alongside E2M1.
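
For readers unfamiliar with the Random Hadamard Transform, the following minimal sketch shows the usual construction: a random sign flip followed by a normalised Walsh-Hadamard transform. The vector length of 16, the helper names, and the single-outlier input are illustrative assumptions, not settings from the paper. Because the transform is orthogonal, applying it to both operands along the contraction dimension leaves their inner product unchanged while spreading an outlier across all coordinates.

```python
import math
import random

def fwht(v):
    """Normalised fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [x * scale for x in out]

def rht(v, signs):
    """Random Hadamard Transform: elementwise random signs, then the FWHT."""
    return fwht([s * x for s, x in zip(signs, v)])

rng = random.Random(0)
n = 16
signs = [rng.choice((-1.0, 1.0)) for _ in range(n)]
x = [0.0] * n
x[3] = 8.0                                # a single outlier
w = [rng.gauss(0.0, 1.0) for _ in range(n)]

xr, wr = rht(x, signs), rht(w, signs)
dot_before = sum(a * b for a, b in zip(x, w))
dot_after = sum(a * b for a, b in zip(xr, wr))
print(max(abs(v) for v in x), max(abs(v) for v in xr))
print(dot_before, dot_after)
```

After the transform every coordinate of `xr` has magnitude 2.0, so the values fill the quantization buckets far more evenly than the original one-hot outlier did. This is the bucket-utilization effect the abstract refers to.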

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The geometric shrinkage bias claim is plausible but the experiments compare full recipes without isolating bin uniformity from rounding schedule and RHT changes.

The paper's core observation is that E2M1's non-uniform bins create a systematic negative rounding error that compounds across layers and interacts badly with RHT. The authors contrast this with uniform grids and offer UFP4, which applies RHT to every training GEMM but limits stochastic rounding to dY. On a 1.5B dense model and MoE models up to 124B they report lower BF16-relative loss degradation than prior E2M1 recipes, backed by scaling-law fits and ablations.

What stands out is the scale of the runs and the attempt to link a format property directly to observed instability. Running long pretraining on MoE 124B is not trivial, and including scaling laws gives the results more grounding than typical small-model ablations.

The main weakness is that the comparisons do not hold optimizer, stochastic-rounding policy, and RHT application pattern fixed while changing only the grid shape. The reported gains could come from the dY-only rounding rule or from how RHT is scheduled rather than from removing the geometric asymmetry. Without those controls the causal story about bin geometry remains suggestive rather than demonstrated.

The work is aimed at people building or evaluating low-precision training stacks and at accelerator designers deciding which 4-bit formats to prioritize. It deserves referee time because the empirical results are on models large enough to matter and the hypothesis is concrete enough to test. I would send it out rather than desk-reject.

Referee Report

2 major / 2 minor

Summary. The paper claims that E2M1 FP4 formats suffer from Shrinkage Bias due to geometric asymmetry in representable bins, which accumulates multiplicatively across layers and is amplified by Random Hadamard Transform (RHT), explaining training instability in existing FP4 recipes. It proposes UFP4, a uniform 4-bit recipe applying RHT to all three training GEMMs but restricting stochastic rounding to dY, and reports lower BF16-relative loss degradation than E2M1 baselines on Dense 1.5B, MoE 7.9B, and MoE 124B models, backed by scaling-law analysis and ablations. The work suggests that future accelerators should support uniform grids such as E1M2/INT4 as first-class training primitives alongside E2M1.

Significance. If the geometric origin of Shrinkage Bias is confirmed as the dominant mechanism and UFP4's improvements hold under controlled conditions, the result would provide a concrete rationale for shifting FP4 training hardware toward uniform formats, potentially improving stability and quantization quality in large-scale LLM pretraining without additional memory overhead.

major comments (2)

[Experiments / Ablations] Experiments section (and associated ablation tables): the comparison of full E2M1 vs. UFP4 recipes does not isolate bin uniformity as the causal factor. The reported loss gap could be driven by differences in stochastic-rounding schedule (dY-only in UFP4), GEMM ordering, or hardware-specific rounding rather than grid geometry; a controlled contrast holding optimizer, RHT pattern, and rounding implementation fixed while varying only E2M1 vs. uniform bins is required to support the 'geometric origin' claim.
[Bias Analysis] § on bias accumulation and RHT amplification: the multiplicative accumulation argument relies on scaling-law fits, but without explicit derivation showing how the negative rounding error from asymmetric bins propagates through the forward/backward passes (e.g., via a closed-form expression or layer-wise error model), it remains unclear whether the observed instability is primarily geometric or confounded by data-dependent or optimizer effects.

minor comments (2)

[Introduction / Geometric Analysis] Clarify the exact definition of 'Shrinkage Bias' with a small numerical example of E2M1 bin boundaries and the resulting rounding error distribution.
[Related Work] Add a reference to prior work on uniform vs. non-uniform quantization error analysis in low-precision training if not already present.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below, clarifying the controls in our experiments and the support for the geometric bias analysis.

Referee: [Experiments / Ablations] Experiments section (and associated ablation tables): the comparison of full E2M1 vs. UFP4 recipes does not isolate bin uniformity as the causal factor. The reported loss gap could be driven by differences in stochastic-rounding schedule (dY-only in UFP4), GEMM ordering, or hardware-specific rounding rather than grid geometry; a controlled contrast holding optimizer, RHT pattern, and rounding implementation fixed while varying only E2M1 vs. uniform bins is required to support the 'geometric origin' claim.

Authors: We agree that a fully isolated contrast, varying only the quantization grid while holding stochastic rounding schedule, RHT pattern, optimizer, and rounding implementation fixed, would provide stronger causal evidence for the geometric origin. The current UFP4 vs. E2M1 comparison does vary both grid uniformity and rounding schedule. Our existing ablations vary grid type while controlling other factors to the extent hardware permits, but we will add a new controlled experiment in the revision that directly compares E2M1 and E1M2 grids under identical rounding and RHT settings. revision: yes
Referee: [Bias Analysis] § on bias accumulation and RHT amplification: the multiplicative accumulation argument relies on scaling-law fits, but without explicit derivation showing how the negative rounding error from asymmetric bins propagates through the forward/backward passes (e.g., via a closed-form expression or layer-wise error model), it remains unclear whether the observed instability is primarily geometric or confounded by data-dependent or optimizer effects.

Authors: The scaling-law fits and layer-wise bias measurements provide empirical evidence that the negative rounding error accumulates and is amplified by RHT. We acknowledge that an explicit propagation model would strengthen the geometric claim over potential confounds. In the revision we will add a simplified layer-wise error model to the bias analysis section to illustrate the multiplicative effect more formally, while noting that a full closed-form derivation across optimizer and data-dependent dynamics is beyond the current scope. revision: yes
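
The controlled contrast discussed in the first response amounts to a factorial design in which runs are compared only when they differ in the grid. The sketch below enumerates such a design. The factor names and levels (for example `all_operands` and `none`) are hypothetical placeholders for illustration and are not taken from the paper's ablation tables.

```python
from itertools import product

# Hypothetical factor levels; only "grid" should differ within a compared pair.
FACTORS = {
    "grid": ["E2M1", "E1M2"],
    "stochastic_rounding": ["dY_only", "all_operands", "none"],
    "rht": ["all_gemms", "none"],
}

def ablation_matrix(factors):
    """Full factorial list of run configurations."""
    keys = list(factors)
    return [dict(zip(keys, combo))
            for combo in product(*(factors[k] for k in keys))]

def grid_only_pairs(runs):
    """Pairs of runs that are identical except for the quantization grid."""
    pairs = []
    for a in runs:
        for b in runs:
            same_rest = all(a[k] == b[k] for k in a if k != "grid")
            if a["grid"] == "E2M1" and b["grid"] == "E1M2" and same_rest:
                pairs.append((a, b))
    return pairs

runs = ablation_matrix(FACTORS)
pairs = grid_only_pairs(runs)
print(len(runs), len(pairs))
for e2m1_run, e1m2_run in pairs:
    print(e2m1_run, "vs", e1m2_run)
```

With these placeholder levels the design has 12 runs and 6 grid-only pairs. A loss gap that persists across all pairs would support the geometric explanation, while a gap that appears only under one rounding policy would point to the rounding schedule instead.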

Circularity Check

0 steps flagged

No circularity: claims rest on empirical scaling-law analysis and ablations without reduction to fitted inputs or self-referential definitions

full rationale

The paper identifies Shrinkage Bias via geometric analysis of E2M1 bins and demonstrates its impact through scaling-law fits and ablation studies comparing full training recipes (E2M1 baselines vs. UFP4) on Dense 1.5B, MoE 7.9B, and MoE 124B models. No equations are presented that define a quantity in terms of itself, rename a fitted parameter as a prediction, or rely on self-citations for load-bearing uniqueness or ansatz. The central claims (bias accumulation, RHT amplification, UFP4 superiority) are framed as outcomes of the reported experiments rather than derivations that collapse to their inputs by construction. This is the common case of an empirical paper whose results stand or fall on the quality of its controls and measurements, not on definitional circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no equations or experimental details sufficient to enumerate free parameters, axioms, or invented entities; the central claims rest on unstated assumptions about rounding error accumulation and the completeness of the scaling-law analysis.

pith-pipeline@v0.9.1-grok · 5844 in / 1349 out tokens · 31639 ms · 2026-06-26T16:57:38.864840+00:00 · methodology

[1] [1]

Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: outlier-free 4-bit inference in rotated llms. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS '24, Red Hook, NY, USA, 2024. Curran Associates Inc. IS...

2024

[2] [2]

Metis: Training LLM s with FP 4 quantization

Hengjie Cao, Mengyi Chen, Yifeng Yang, Fang Dong, Ruijun Huang, Jixian Zhou, Anrui Chen, Mingzhi Dong, Yujiang Wang, Jinlong Hou, Yuan Cheng, FAN WU, Fan Yang, Tun Lu, Ning Gu, and Li Shang. Metis: Training LLM s with FP 4 quantization. In The Fourteenth International Conference on Learning Representations, 2026. https://openreview.net/forum?id=I2ZrCi5O84

2026

[3] [3]

Mengzhao Chen, Meng Wu, Hui Jin, Zhihang Yuan, Jing Liu, Chaoyi Zhang, Yunshui Li, Jie Huang, Jin Ma, Zeyue Xue, Zhiheng Liu, Xingyan Bin, and Ping Luo. Int v.s. fp: A comprehensive study of fine-grained low-bit quantization formats, 2025. https://arxiv.org/abs/2510.25602

arXiv 2025

[4] [4]

Tetrajet-v2: Accurate nvfp4 training for large language models with oscillation suppression and outlier control, 2026

Yuxiang Chen, Yifan Liu, Xiaoming Xu, Pengle Zhang, Michael Beyer, Martin Rapp, Jun Zhu, and Jianfei Chen. Tetrajet-v2: Accurate nvfp4 training for large language models with oscillation suppression and outlier control, 2026. https://arxiv.org/abs/2510.27527

Pith/arXiv arXiv 2026

[5] [5]

FP 4 all the way: Fully quantized training of large language models

Brian Chmiel, Maxim Fishman, Ron Banner, and Daniel Soudry. FP 4 all the way: Fully quantized training of large language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. https://openreview.net/forum?id=kuzye4EPLR

2026

[6] [6]

Four over six: More accurate nvfp4 quantization with adaptive block scaling, 2026

Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, Keith Wyss, Mahdi Nazemi, Asit Mishra, Carlo del Mundo, Tijmen Blankevoort, and Song Han. Four over six: More accurate nvfp4 quantization with adaptive block scaling, 2026. https://arxiv.org/abs/2512.02010

Pith/arXiv arXiv 2026

[7] [7]

Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Li, Hui Qu, J

DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bing-Li Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dong-Li Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaji...

2024

[8] [8]

Deep learning with limited numerical precision

Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. Deep learning with limited numerical precision. In Francis Bach and David Blei, editors, Proceedings of the 32nd International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning Research, pages 1737--1746, Lille, France, 07--09 Jul 2015. PMLR. https://proc...

2015

[9] [9]

Rae, and Laurent Sifre

Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Oriol Vinyals, Jack W. Rae, and Laurent Sifre...

2022

[10] [10]

Peijie Jiang, Yuqi Feng, Cunyin Peng, Qian Zhao, Jia Liu, Kunlong Chen, Zhiqiang Zhang, and Jun Zhou. PowLU: An activation function for stable pre-training of LLMs, 2026. https://api.semanticscholar.org/CorpusID:288669934

[11] [11]

Hanglin Li, Shuchang Tian, Chen Lin, Zhiyong Zhao, and Kun Zhan. FAAR: Format-aware adaptive rounding for NVFP4, 2026. https://arxiv.org/abs/2603.22370

[12] [12]

Muyang Li, Yujun Lin, Zhekai Zhang, Tianle Cai, Xiuyu Li, Junxian Guo, Enze Xie, Chenlin Meng, Jun-Yan Zhu, and Song Han. SVDQuant: Absorbing outliers by low-rank component for 4-bit diffusion models. In The Thirteenth International Conference on Learning Representations.

[13] [13]

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. In The Thirteenth International Conference on Learning Representations, 2025. https://openreview.net/forum?id=ogO6DGE6FZ

[14] [15]

NVIDIA, Felix Abecassis, Anjulie Agrusa, Dong Ahn, Jonah Alben, Stefania Alborghetti, Michael Andersch, Sivakumar Arayandi, Alexis Bjorlin, Aaron Blakeman, Evan Briones, Ian Buck, Bryan Catanzaro, Muya Chang, Jinhang Choi, Mike Chrzanowski, Eric Chung, Victor Cui, Steve Dai, Bita Darvish Rouhani, Carlo del Mundo, Deena Donia, Burc Eryilmaz, Henry Estela, ... Pretraining large language models with NVFP4. arXiv, 2026.

[15] [16]

Andrei Panferov, Erik Schultheis, Soroush Tabesh, and Dan Alistarh. Quartet II: Accurate LLM pre-training in NVFP4 by improved unbiased gradient estimation, 2026. https://arxiv.org/abs/2601.22813

[16] [19]

Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, Jiaxin Hu, Xianzhi Yu, Lu Hou, Chun Yuan, et al. FlatQuant: Flatness matters for LLM quantization. In International Conference on Machine Learning, pages 57587–57613. PMLR, 2025.

[17] [20]

Mehran Taghian, Yunke Peng, Xing Huang, Yao Wang, Yaoyuan Wang, Wei Guo, Yuanyong Luo, Tianchi Hu, Junsong Wang, Xin Wang, Hu Liu, Yu Cheng, Ziwei Yu, Hongliang Li, Mehdi Rahimifar, Lei Yan, Xuefei Wang, Zhuang Ma, Lei Liu, Hui Yu, Anandharaju Durai Raju, Hoang Le, Hei Yi Mak, Tanzila Rahman, and Shadan Golestan. HiFloat4 format for language model pre-training on Ascend NPUs. arXiv, 2026.

[18] [21]

Ling Team, Ang Li, Ben Liu, Binbin Hu, Bing Li, Bi Zeng, Borui Ye, Caizhi Tang, Changxin Tian, Chao Huang, Chao Zhang, Chen Qian, Chenchen Ju, Chenchen Li, Chengfu Tang, Chilin Fu, C. Ren, Chunwei Wu, Cong Zhang, Cunyin Peng, Dafeng Xu, Daixin Wang, Dalong Zhang, Dingnan Jin, Dingyuan Zhu, Di Hu, Fa-Chang Zhao, Feifan Wu, Feng Zhu, Gangshan Wang, Haitao Z...

2025

[19] [22]

Changxin Tian, Kunlong Chen, Jia Liu, Ziqi Liu, Zhiqiang Zhang, and Jun Zhou. Towards greater leverage: Scaling laws for efficient mixture-of-experts language models. In The Fourteenth International Conference on Learning Representations, 2026. https://openreview.net/forum?id=7r2lkhDGUj

[20] [23]

Ruizhe Wang, Yeyun Gong, Xiao Liu, Guoshuai Zhao, Ziyue Yang, Baining Guo, Zheng-Jun Zha, and Peng Cheng. Optimizing large language model training using FP4 quantization. In Forty-second International Conference on Machine Learning, 2025. https://openreview.net/forum?id=uK7JArZEJM

[21] [24]

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pages 38087–38099. PMLR, 2023.

[22] [25]

Eric Xu. Groundbreaking SuperPoD Interconnect: Leading a New Paradigm for AI Infrastructure - Huawei (huawei.com). https://www.huawei.com/en/news/2025/9/hc-xu-keynote-speech, 2025.

[23] [26]

Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Zican Dong, Yupeng Hou, Beichen Zhang, Yingqian Min, Junjie Zhang, Peiyu Liu, et al. A survey of large language models. Frontiers of Computer Science, 20(12):2012627, 2026.

[24] [27]

Jiaxiang Zou, Yonghao Chen, Ruilong Wu, and Xinyu Chen. MixFP4: Enhancing NVFP4 with adaptive FP4/INT4 block representations, 2026. https://arxiv.org/abs/2605.31035
