DuQuant++: Fine-grained Rotation Enhances Microscaling FP4 Quantization
Pith reviewed 2026-05-10 05:55 UTC · model grok-4.3
The pith
A single outlier-aware rotation suffices for accurate MXFP4 quantization of LLMs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DuQuant++ demonstrates that aligning the block size of an outlier-aware rotation to the MXFP4 microscaling group size of 32 allows the full dual-rotation pipeline to be replaced by one rotation step. The single rotation suppresses activation outliers that would otherwise inflate a block's shared E8M0 scale factor, thereby preserving dynamic range for the remaining elements, while also smoothing the weight distribution and halving online rotation overhead during MXFP4 W4A4 quantization of LLMs such as the LLaMA-3 family.
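To make the failure mode concrete, the sketch below fake-quantizes a 32-element block with a shared power-of-two scale and shows how a single outlier degrades the other 31 elements. The E2M1 value grid follows the OCP MX specification; the floor-based scale recipe and the `quantize_mxfp4_block` helper are illustrative assumptions, not the paper's code.

```python
import numpy as np

# Signed E2M1 magnitudes (OCP MX spec); the sign bit is handled separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_mxfp4_block(x):
    """Fake-quantize one block with a shared power-of-two (E8M0-style) scale.

    Scale recipe: 2^(floor(log2(amax)) - 2), which places the block maximum
    near the top FP4 code (6 = 1.5 * 2^2). Vendor rounding details may
    differ; this is a sketch, not a bit-exact MXFP4 implementation.
    """
    amax = np.max(np.abs(x))
    if amax == 0.0:
        return x.copy()
    scale = 2.0 ** (np.floor(np.log2(amax)) - 2)
    # Round each scaled magnitude to the nearest FP4 grid point.
    idx = np.argmin(np.abs(np.abs(x / scale)[:, None] - FP4_GRID[None, :]), axis=1)
    return np.sign(x) * FP4_GRID[idx] * scale

rng = np.random.default_rng(0)
block = rng.normal(0.0, 1.0, 32)
print("MSE, no outlier:", np.mean((block - quantize_mxfp4_block(block)) ** 2))

block[0] = 64.0  # one outlier pushes the shared scale up to 2^4
q = quantize_mxfp4_block(block)
print("MSE on the other 31 elements:", np.mean((block[1:] - q[1:]) ** 2))
```

With the outlier present, the small elements land closer to zero than to the smallest nonzero code (0.5 * scale) and are flushed to zero, which is exactly the dynamic-range compression the rotation is meant to prevent.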
What carries the argument
An outlier-aware fine-grained rotation whose block size B = 32 matches the MXFP4 microscaling group size, applying a targeted transform to the outlier-concentrated channels within each independently scaled block.
Load-bearing premise
Independent scaling factors per 32-element MXFP4 group remove the cross-block variance that previously required dual rotations and zigzag permutations.
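A minimal sketch of the block-diagonal structure this premise relies on: each 32-element group gets its own orthogonal rotation, so an outlier's energy, and hence its effect on the group's E8M0 scale, cannot leak into a neighboring group. A random orthogonal matrix stands in here for DuQuant's greedily constructed outlier-aware rotation, which this sketch does not reproduce.

```python
import numpy as np

def blockwise_rotate(x, B=32, seed=0):
    """Apply an independent orthogonal rotation to each B-element group.

    Stand-in construction: a random orthogonal factor per block. Only the
    block-diagonal structure matters for the cross-group argument.
    """
    rng = np.random.default_rng(seed)
    groups = x.reshape(-1, B)
    out = np.empty_like(groups)
    for i, g in enumerate(groups):
        q, _ = np.linalg.qr(rng.normal(size=(B, B)))  # orthogonal factor
        out[i] = q @ g
    return out.reshape(-1)

x = np.zeros(64)
x[3] = 32.0                   # outlier lives in group 0
y = blockwise_rotate(x)
print(np.abs(y[:32]).max())   # outlier energy spread within group 0
print(np.abs(y[32:]).max())   # group 1 is untouched: 0.0
```

In a real W4A4 pipeline the inverse rotation is folded into the adjacent weights so the layer output is unchanged; the sketch only shows that a block-diagonal rotation with B = 32 keeps scale inflation local to one MXFP4 group.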
What would settle it
An experiment on an LLM in which the single-rotation version produces measurably higher quantization error or lower task accuracy than the original dual-rotation pipeline under identical MXFP4 W4A4 settings.
original abstract
The MXFP4 microscaling format, which partitions tensors into blocks of 32 elements sharing an E8M0 scaling factor, has emerged as a promising substrate for efficient LLM inference, backed by native hardware support on NVIDIA Blackwell Tensor Cores. However, activation outliers pose a unique challenge under this format: a single outlier inflates the shared block scale, compressing the effective dynamic range of the remaining elements and causing significant quantization error. Existing rotation-based remedies, including randomized Hadamard and learnable rotations, are data-agnostic and therefore unable to specifically target the channels where outliers concentrate. We propose DuQuant++, which adapts the outlier-aware fine-grained rotation of DuQuant to the MXFP4 format by aligning the rotation block size with the microscaling group size (B = 32). Because each MXFP4 group possesses an independent scaling factor, the cross-block variance issue that necessitates dual rotations and a zigzag permutation in the original DuQuant becomes irrelevant, enabling DuQuant++ to replace the entire pipeline with a single outlier-aware rotation, which halves the online rotation cost while simultaneously smoothing the weight distribution. Extensive experiments on the LLaMA-3 family under MXFP4 W4A4 quantization show that DuQuant++ consistently achieves state-of-the-art performance. Our code is available at https://github.com/Hsu1023/DuQuant-v2.
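The abstract's contrast between data-agnostic and outlier-aware rotations aside, even a plain Hadamard rotation illustrates why rotating before quantization helps under a shared block scale. The sketch below (reusing the illustrative floor-based E8M0 scale recipe from the earlier snippet; the outlier magnitude and seed are arbitrary) counts how many elements fall below the smallest nonzero FP4 code before and after rotation.

```python
import numpy as np
from scipy.linalg import hadamard

B = 32
H = hadamard(B) / np.sqrt(B)   # orthonormal, data-agnostic rotation
rng = np.random.default_rng(1)
x = rng.normal(0.0, 1.0, B)
x[7] = 64.0                    # one outlier channel

for name, b in [("raw", x), ("rotated", H @ x)]:
    scale = 2.0 ** (np.floor(np.log2(np.abs(b).max())) - 2)
    # Elements closer to zero than to the smallest nonzero code round to 0.
    flushed = np.mean(np.abs(b) < 0.25 * scale)
    print(f"{name:8s} scale=2^{int(np.log2(scale)):+d}  flushed-to-zero={flushed:.2f}")
```

A Hadamard rotation spreads the outlier uniformly, which already rescues the block; DuQuant++'s claim is that a data-dependent rotation targeted at the outlier channels does better still, which only the experiments can establish.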
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes DuQuant++, an adaptation of the DuQuant outlier-aware rotation method to the MXFP4 microscaling FP4 format for LLM quantization. By setting the rotation block size B=32 to match the MXFP4 group size and exploiting independent per-group E8M0 scaling factors, the authors replace the original DuQuant's dual rotations plus zigzag permutation with a single outlier-aware rotation. This is claimed to halve online rotation cost while smoothing weight distributions, with extensive experiments on LLaMA-3 models under W4A4 quantization demonstrating state-of-the-art performance.
Significance. If the central empirical claims hold, the work provides a practical, lower-overhead rotation strategy for microscaling quantization that directly targets activation outliers, which is relevant for efficient inference on hardware with native MXFP4 support such as NVIDIA Blackwell Tensor Cores.
major comments (2)
- [Method / justification for single rotation] The central engineering claim—that independent per-group scaling in MXFP4 fully eliminates the cross-block variance issue, rendering dual rotations and zigzag permutation unnecessary—is asserted without supporting analysis or ablation. No demonstration is given that outlier-induced variance does not propagate across groups or that single-rotation performance matches the dual-rotation baseline under MXFP4 (this directly supports the cost-halving and SOTA claims).
- [Experiments] The experiments section reports SOTA results on LLaMA-3 under MXFP4 W4A4 but lacks explicit ablation tables isolating the effect of replacing dual rotations with the single outlier-aware rotation; without these, it is difficult to attribute gains specifically to the proposed simplification rather than other factors.
minor comments (1)
- [Abstract] The abstract would be strengthened by including at least one key quantitative result (e.g., perplexity or accuracy delta versus the strongest baseline) to ground the SOTA claim.
Simulated Author's Rebuttal
We thank the referee for the careful review and constructive feedback. We address each major comment point by point below. We agree that additional analysis and ablations will strengthen the paper and will incorporate them in the revision.
point-by-point responses
-
Referee: [Method / justification for single rotation] The central engineering claim—that independent per-group scaling in MXFP4 fully eliminates the cross-block variance issue, rendering dual rotations and zigzag permutation unnecessary—is asserted without supporting analysis or ablation. No demonstration is given that outlier-induced variance does not propagate across groups or that single-rotation performance matches the dual-rotation baseline under MXFP4 (this directly supports the cost-halving and SOTA claims).
Authors: We appreciate this observation. The justification is based on the MXFP4 format property that each 32-element group has an independent E8M0 scaling factor, unlike standard quantization where a shared scale allows outlier variance to propagate across blocks. This independence isolates the effect of outliers to their own group, so a single outlier-aware rotation (with B=32) suffices to smooth distributions within each group without needing dual rotations or zigzag permutation. While this is stated in the method section, we agree more explicit support is needed. In the revision we will add a short theoretical paragraph explaining the lack of cross-group propagation and an ablation comparing single-rotation DuQuant++ against an adapted dual-rotation baseline under MXFP4 to directly support the cost and performance claims. revision: yes
-
Referee: [Experiments] The experiments section reports SOTA results on LLaMA-3 under MXFP4 W4A4 but lacks explicit ablation tables isolating the effect of replacing dual rotations with the single outlier-aware rotation; without these, it is difficult to attribute gains specifically to the proposed simplification rather than other factors.
Authors: We agree that the current experiments emphasize end-to-end comparisons rather than isolating the single-rotation simplification. We will add dedicated ablation tables in the revised manuscript that directly compare (i) DuQuant++ (single rotation), (ii) an MXFP4-adapted dual-rotation variant, and (iii) the original DuQuant pipeline. These tables will quantify the performance difference and overhead reduction attributable to the simplification enabled by per-group scaling. revision: yes
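The independence claim in the first response is at least easy to check at the level of the scale statistic; a small sketch (same illustrative floor-based E8M0 recipe as above) shows that an injected outlier moves only its own group's scale.

```python
import numpy as np

def mxfp4_group_scales(x, B=32):
    """One E8M0-style scale per B-element group: 2^(floor(log2(amax_g)) - 2)."""
    amax = np.abs(x.reshape(-1, B)).max(axis=1)
    return 2.0 ** (np.floor(np.log2(amax)) - 2)

rng = np.random.default_rng(2)
x = rng.normal(0.0, 1.0, 128)    # four 32-element MXFP4 groups
print(mxfp4_group_scales(x))     # e.g. [0.5 0.5 0.5 0.5]
x[5] = 512.0                     # outlier lands in group 0 only
print(mxfp4_group_scales(x))     # group 0's scale jumps to 128; others unchanged
```

This verifies isolation of the scale statistic only; whether single-rotation accuracy actually matches an MXFP4-adapted dual-rotation baseline is the question the promised ablation tables must answer.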
Circularity Check
Empirical adaptation of prior method with no self-referential derivation
full rationale
The paper presents DuQuant++ as an engineering adaptation of the outlier-aware fine-grained rotation from DuQuant, aligned to MXFP4 by setting rotation block size B=32 to match the microscaling group size. The key simplification (replacing dual rotations and zigzag permutation with a single rotation) is asserted because independent per-group E8M0 scales make cross-block variance irrelevant. No mathematical equations, derivations, or predictions are provided that reduce by construction to fitted inputs, self-definitions, or prior self-citations. Validity is supported by external experiments on LLaMA-3 under W4A4, not internal consistency. The citation to DuQuant is present but not load-bearing for any uniqueness theorem or ansatz; the central claim remains an empirical observation rather than a closed logical loop.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Activation outliers concentrate in specific channels that can be targeted by a data-dependent rotation.
Reference graph
Works this paper leans on
- [1] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. arXiv preprint arXiv:2404.00456, 2024.
- [2] Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457, 2018.
- [3] Jack Cook, Junxian Guo, Guangxuan Xiao, Yujun Lin, and Song Han. Four over six: More accurate NVFP4 quantization with adaptive block scaling. arXiv preprint arXiv:2512.02010, 2025.
- [4] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
- [5] Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, et al. Bridging the gap between promise and performance for microscaling FP4 quantization. arXiv preprint arXiv:2509.23202, 2025.
- [6] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323, 2022.
- [7] Hong Huang and Dapeng Wu. Quaff: Quantized parameter-efficient fine-tuning under outlier spatial stability hypothesis. arXiv preprint arXiv:2505.14742, 2025.
- [8] Hong Huang, Decheng Wu, Rui Cen, Guanghua Yu, Zonghang Li, Kai Liu, Jianchen Zhu, Peng Chen, Xue Liu, and Dapeng Wu. Tequila: Trapping-free ternary quantization for large language models. arXiv preprint arXiv:2509.23809, 2025.
- [9] Hong Huang, Decheng Wu, Qiangqiang Hu, Guanghua Yu, Jinhai Yang, Jianchen Zhu, Xue Liu, and Dapeng Wu. Sherry: Hardware-efficient 1.25-bit ternary quantization via fine-grained sparsification. arXiv preprint arXiv:2601.07892, 2026.
- [10] Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. DuQuant: Distributing outliers via dual transformation makes stronger quantized LLMs. Advances in Neural Information Processing Systems, 37:87766–87800, 2024.
- [11] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. AWQ: Activation-aware weight quantization for LLM compression and acceleration. arXiv preprint arXiv:2306.00978, 2023.
- [12] Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. QServe: W4A8KV4 quantization and system co-design for efficient LLM serving. arXiv preprint arXiv:2405.04532, 2024.
- [13] Wenyuan Liu, Haoqian Meng, Yilun Luo, Peng Zhang, and Xindian Ma. MicroMix: Efficient mixed-precision quantization with microscaling formats for large language models. arXiv preprint arXiv:2508.02343, 2025.
- [14] Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, and Rongrong Ji. AffineQuant: Affine transformation quantization for large language models. arXiv preprint arXiv:2403.12544, 2024.
- [15] Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? A new dataset for open book question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, 2018.
- [16] Yuantian Shao, Peisong Wang, Yuanteng Chen, Chang Xu, Zhihui Wei, and Jian Cheng. Block rotation is all you need for MXFP4 quantization. arXiv preprint arXiv:2511.04214, 2025.
- [17] Mingjie Sun, Xinlei Chen, J. Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024.
- [18] Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. arXiv preprint arXiv:2402.04396, 2024.
- [19] Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1648–1665, 2023.
- [20] Zhanhao Xie, Yuexiao Ma, Xiawu Zheng, Fei Chao, Wanchen Sui, Yong Li, Shen Li, and Rongrong Ji. Automated fine-grained mixture-of-experts quantization. In Findings of the Association for Computational Linguistics: ACL 2025, pages 27024–27037, 2025.
- [21] Haobo Xu, Sirui Chen, Ruizhong Qiu, Yuchen Yan, Chen Luo, Monica Cheng, Jingrui He, and Hanghang Tong. Prune as you generate: Online rollout pruning for faster and better RLVR. arXiv preprint arXiv:2603.24840, 2026.
- [22] Lianwei Yang, Haisong Gong, Haokun Lin, Yichen Wu, Zhenan Sun, and Qingyi Gu. DopQ-ViT: Towards distribution-friendly and outlier-aware post-training quantization for vision transformers. arXiv preprint arXiv:2408.03291, 2024.
- [23] Lianwei Yang, Haokun Lin, Tianchen Zhao, Yichen Wu, Hongyu Zhu, Ruiqi Xie, Zhenan Sun, Yu Wang, and Qingyi Gu. LRQ-DiT: Log-rotation post-training quantization of diffusion transformers for text-to-image generation. arXiv preprint arXiv:2508.03485, 2025.
- [24] Jingxuan Zhang, Yunta Hsieh, Zhongwei Wang, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, and Mi Zhang. QuantVLA: Scale-calibrated post-training quantization for vision-language-action models. arXiv preprint arXiv:2602.20309, 2026.