dMX: Differentiable Mixed-Precision Assignment for Low-Precision Floating-Point Formats

Felix Marty; Giuseppe Franco; Ian Colbert; Nicholas Fraser; Pablo Monteagudo-Lago

arxiv: 2606.04115 · v1 · pith:DHJVQG66new · submitted 2026-06-02 · 💻 cs.LG · cs.AI

Pith reviewed 2026-06-28 11:18 UTC · model grok-4.3

keywords: mixed-precision quantization · low-precision floating-point · MXFP · differentiable assignment · large language models · bit-width optimization · Pareto trade-off

The pith

Continuous scalar offsets per layer enable learnable mixed-precision MXFP assignments that improve accuracy-bitwidth trade-offs in LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formulates the per-layer choice of low-precision floating-point format as an optimization over a single continuous scalar offset per layer. A temperature annealing schedule gradually turns these offsets into discrete, hardware-valid MXFP formats, while a regularization term holds the average bit-width near a chosen target. Experiments across the Llama, Qwen3, and SmolLM2 families show that the resulting assignments Pareto-dominate both uniform quantization and KL-divergence-based layer selection on perplexity and zero-shot accuracy. The approach therefore offers a practical way to navigate the quality-versus-inference-cost frontier without manual format search.

Core claim

By replacing the discrete choice of floating-point format for each layer with a continuous scalar offset and annealing that offset to a valid MXFP code, the method obtains mixed-precision configurations whose average bit-width can be steered by regularization; these configurations consistently dominate both uniform bit-width baselines and KL-based selection heuristics on WikiText-2 perplexity and four zero-shot benchmarks.

What carries the argument

The per-layer scalar offset that collapses the multi-format design space into one learnable continuous variable, allowing gradient flow through the quantization choice before annealing enforces hardware compatibility.
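
The relaxation can be pictured with a small sketch. This page does not give the paper's mapping F(β, T), so everything here is an assumption made for illustration: the sigmoid gate, the restriction to two element widths (4 and 8 bits, as in the MXFP8–MXFP4 comparison of Figure 3), and the name `soft_bitwidth`. It is not the authors' implementation.

```python
import math


def soft_bitwidth(beta, temperature, low_bits=4.0, high_bits=8.0):
    """Map a continuous per-layer offset to a bit-width between two formats.

    Hypothetical relaxation: a sigmoid gate on beta / temperature. A high
    temperature gives a smooth blend; a temperature close to zero pushes
    the output to low_bits or high_bits depending on the sign of beta.
    """
    z = beta / temperature
    # Numerically stable sigmoid: never exponentiates a large positive value.
    if z >= 0:
        gate = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        gate = e / (1.0 + e)
    return low_bits + (high_bits - low_bits) * gate


if __name__ == "__main__":
    for t in (1.0, 0.1, 1e-4):
        print(t, soft_bitwidth(0.3, t), soft_bitwidth(-0.3, t))
```

With this form, an offset of 0 at temperature 1 sits halfway between the two formats, and lowering the temperature moves a positive offset toward 8 bits and a negative one toward 4 bits, which is the kind of gradual discretization the annealing schedule is described as providing.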

If this is right

The learned assignments achieve lower average bit-width at matched perplexity or accuracy across multiple LLM families.
The target-aware regularizer lets users directly specify an inference-cost budget and obtain a matching mixed-precision layout.
Final configurations are guaranteed to be valid MXFP formats usable on existing hardware without further manual adjustment.
The method outperforms KL-divergence heuristics in producing Pareto-dominant points on the quality-bitwidth plane.
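
A minimal sketch of what a target-aware bit-width penalty could look like follows. The quadratic form, the `weight` parameter, and the function names are assumptions; the page only says that a regularizer steers the average bit-width toward a budget and that Figure 4 compares a simple average with a tensor-size-weighted one.

```python
def average_bits(bits, sizes=None):
    """Mean bit-width over layers; weighted by tensor size when sizes is given."""
    if sizes is None:
        return sum(bits) / len(bits)
    return sum(b * n for b, n in zip(bits, sizes)) / sum(sizes)


def target_penalty(bits, target, sizes=None, weight=1.0):
    """Assumed quadratic pull of the average bit-width toward the target."""
    return weight * (average_bits(bits, sizes) - target) ** 2


if __name__ == "__main__":
    layer_bits = [8.0, 4.0]       # hypothetical per-layer bit-widths
    layer_sizes = [1, 3]          # hypothetical relative tensor sizes
    print(average_bits(layer_bits))               # simple average
    print(average_bits(layer_bits, layer_sizes))  # size-weighted average
    print(target_penalty(layer_bits, 5.0, layer_sizes))
```

The two averages differ whenever layers have unequal sizes, so the same assignment can meet a budget under one definition and miss it under the other.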

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same scalar-offset relaxation could be tested on integer or block-floating-point formats if analogous continuous parameterizations are defined.
Hardware teams could treat the learned per-layer bit-width maps as target profiles when designing accelerators that support variable MXFP lanes.
The annealing-plus-regularization pattern offers a template for other discrete assignment problems inside training loops, such as routing or sparsity pattern selection.
If the continuous phase already yields near-optimal assignments, the discretization step might be replaced by a simple rounding rule in future variants.

Load-bearing premise

The annealing schedule will map the learned continuous offsets to discrete MXFP formats without erasing the quality gains observed while the offsets were still continuous.

What would settle it

After annealing completes, measure the final model's WikiText-2 perplexity; if it exceeds both the continuous-phase perplexity and the KL-heuristic perplexity by more than the paper's reported margins at the same average bit-width, the central claim fails.
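
The test above can be written as a small predicate. This is only a restatement of the condition in code; the function name and arguments are hypothetical, and the numbers in the usage lines are placeholders rather than results from the paper.

```python
def claim_survives(final_ppl, continuous_ppl, kl_ppl, margin):
    """Return False when the falsification condition in the text is met.

    All perplexities are assumed to be measured at the same average
    bit-width; margin stands in for the paper's reported margin.
    """
    worse_than_continuous = final_ppl - continuous_ppl > margin
    worse_than_kl = final_ppl - kl_ppl > margin
    return not (worse_than_continuous and worse_than_kl)


if __name__ == "__main__":
    # Placeholder values, not measurements.
    print(claim_survives(final_ppl=10.2, continuous_ppl=10.0, kl_ppl=10.5, margin=0.3))
    print(claim_survives(final_ppl=11.5, continuous_ppl=10.0, kl_ppl=10.5, margin=0.3))
```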

Figures

Figures reproduced from arXiv: 2606.04115 by Felix Marty, Giuseppe Franco, Ian Colbert, Nicholas Fraser, Pablo Monteagudo-Lago.

**Figure 2.** Overview of the dMX pipeline. All blue elements highlight the main contributions of this work. The pre-trained LLM (left) contains a learned continuous offset β_i for each layer i, which parameterizes the bit-width used in that layer. During the forward pass these offsets are mapped to discrete format assignments β̂ = F(β, T). A task loss and a user-defined regularization term R on β jointly drive the gradi…

**Figure 3.** Comparison of the target-aware penalty and the simple scaling penalty for MXFP8–MXFP4 …

**Figure 4.** Comparison of simple-average and tensor-size-weighted bit-width regularization for …

**Figure 5.** Learned bit-width optimization vs. KL divergence-based pre-selected layer precision for …

**Figure 6.** Forward and bitwidth-gradient transfer curves of the inner MX-FP quantizer at …

**Figure 7.** Comparison of the MXFP8/MXFP4 mixed precision quantization and the MXFP6/MXFP4 …

**Figure 8.** Comparison between continuous forward-pass bit-width learning and a discretized forward …

Original abstract

Quantizing large language models (LLMs) to low-precision floating-point representations is central to efficient deployment, yet applying a single bit-width uniformly across all layers is sub-optimal in terms of both performance and accuracy. This work introduces dMX, a differentiable mixed-precision quantization framework for learnable floating-point bit-width assignment. We study its application for the microscaling floating-point (MXFP) family of data types defined by the Open Compute Project (OCP) standard. The per-layer bit-width assignment is formulated as a continuous optimization problem in which each layer's floating-point format is parameterized by a scalar parameter, folding the multi-variate design space into a single learnable offset. During training this offset takes continuous values, avoiding sudden oscillations between discrete quantization formats. A temperature-based annealing schedule progressively discretizes the learned offsets, ensuring that the final configuration maps to hardware-compatible MXFP formats without abrupt transitions between training and inference behavior. A target-aware regularization term steers the average bit-width toward a user-specified budget, serving as a coarse-grained proxy for inference cost and balancing model quality against deployment efficiency. We performed experiments on different families of LLMs, such as Llama, Qwen3, and SmolLM2, evaluating perplexity on WikiText-2 and accuracy on four zero-shot reasoning benchmarks. Across these settings, dMX consistently yields Pareto-dominating models and improves over Kullback-Leibler (KL) divergence-based layer-selection heuristics, efficiently navigating trade-offs between model quality and average bit-width.
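
For readers unfamiliar with MXFP, the following sketch fake-quantizes one block to MXFP4. It follows the general scheme of the OCP MX specification as understood here (a shared power-of-two scale per block, FP4 E2M1 elements, 32 elements per block in the standard), but it is a simplified illustration, not the paper's quantizer: the shared exponent is not clamped to the E8M0 range, ties round toward the smaller magnitude, and the function name is hypothetical.

```python
import math

# Magnitudes representable by an FP4 E2M1 element.
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
# Exponent of the largest E2M1 normal value (6.0 = 1.5 * 2**2).
E2M1_EMAX = 2


def mxfp4_quantize_block(block):
    """Fake-quantize one block: shared power-of-two scale, E2M1 elements."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Shared scale chosen so the largest magnitude lands near the top of the grid.
    scale = 2.0 ** (math.floor(math.log2(amax)) - E2M1_EMAX)
    out = []
    for x in block:
        # Saturate at the largest representable magnitude.
        mag = min(abs(x) / scale, E2M1_GRID[-1])
        # Nearest grid point; on a tie, min() keeps the smaller magnitude.
        nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
        out.append(math.copysign(nearest * scale, x))
    return out


if __name__ == "__main__":
    print(mxfp4_quantize_block([12.0, 1.0, 0.3, -9.8]))
```

Because every element in a block shares one scale, a single large value coarsens the grid for its neighbours, which is one reason a layer's sensitivity to 4-bit versus 8-bit elements can vary.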

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

dMX gives a workable differentiable scalar offset plus annealing for per-layer MXFP assignment and reports better tradeoffs than KL selection on several LLMs, but the abstract supplies no pre/post-discretization numbers to confirm the final hardware assignments preserve the gains.

The main new piece is the reduction of the MXFP format choice to a single continuous scalar offset per layer, combined with a temperature annealing schedule that gradually forces the offset to a valid discrete format. A regularization term also pulls the average bit width toward a chosen budget.

The experiments run on Llama, Qwen3, and SmolLM2 families, tracking WikiText-2 perplexity and four zero-shot tasks. They show consistent Pareto improvements over the KL-divergence layer-selection baseline, which is concrete evidence that the method can navigate the quality-versus-bit-width space better than the obvious static heuristic.
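
To make "Pareto improvement" concrete, here is a short sketch of a dominance check on (average bit-width, perplexity) pairs, where lower is better on both axes. The function names are illustrative and the data points are made up for the example; they are not results from the paper.

```python
def dominates(a, b):
    """True if point a is at least as good as b on both axes and better on one.

    Points are (avg_bits, perplexity); lower is better for both.
    """
    return a[0] <= b[0] and a[1] <= b[1] and (a[0] < b[0] or a[1] < b[1])


def pareto_front(points):
    """Keep the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


if __name__ == "__main__":
    # Hypothetical configurations, for illustration only.
    configs = [(4.0, 12.0), (6.0, 9.0), (8.0, 8.0), (6.0, 8.5), (5.0, 9.0)]
    print(pareto_front(configs))
```

In this toy set, the (6.0, 9.0) configuration drops out because another point reaches the same perplexity with fewer bits and a third reaches lower perplexity at the same bits.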

The soft spot is exactly the transition the stress-test flags. The abstract describes the annealing but does not report the quality numbers right before and after the offsets snap to hardware-valid MXFP values, nor any ablation on the schedule hyperparameters. Without those checks, it is not yet clear whether the reported gains survive the final discretization step that actually matters for deployment.

The work is aimed at people building mixed-precision pipelines for the new MXFP hardware. Anyone who needs a practical, learnable alternative to hand-tuned or KL-based layer choices will find the experimental comparisons useful even if they later have to re-validate the discretization themselves.

The paper is coherent on its own terms and supplies reproducible experimental settings, so it deserves a serious referee to examine the implementation and the annealing behavior in detail.

Referee Report

1 major / 0 minor

Summary. The paper introduces dMX, a differentiable mixed-precision quantization method for MXFP formats in LLMs. Per-layer formats are parameterized by a single learnable scalar offset; a temperature-based annealing schedule progressively discretizes these offsets to hardware-valid MXFP assignments, while a target-aware regularization term controls average bit-width. Experiments on Llama, Qwen3, and SmolLM2 families report consistent Pareto improvements in WikiText-2 perplexity and zero-shot accuracy over uniform quantization and KL-divergence layer-selection heuristics.

Significance. If the continuous-to-discrete transition preserves the reported gains, the framework provides a practical, optimization-driven alternative to heuristic mixed-precision assignment for OCP-standard MXFP types, directly addressing the sub-optimality of uniform bit-widths while respecting hardware constraints. The folding of the format space into a scalar offset and the use of annealing plus regularization are technically interesting if empirically validated.

major comments (1)

[Abstract] Abstract (method description): the central Pareto-dominance claim requires that post-annealing discrete MXFP assignments retain the continuous-phase quality gains and outperform KL baselines. No pre- vs. post-annealing metrics, ablation on temperature schedule hyperparameters, or quantitative evidence of discretization stability is referenced, leaving open the possibility that abrupt quality drops exceed the reported margins over baselines.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the technical approach and for highlighting the need to strengthen evidence around the annealing process. We agree that explicit validation of discretization stability is essential to support the reported Pareto improvements and will revise the manuscript accordingly.

Referee: [Abstract] Abstract (method description): the central Pareto-dominance claim requires that post-annealing discrete MXFP assignments retain the continuous-phase quality gains and outperform KL baselines. No pre- vs. post-annealing metrics, ablation on temperature schedule hyperparameters, or quantitative evidence of discretization stability is referenced, leaving open the possibility that abrupt quality drops exceed the reported margins over baselines.

Authors: We acknowledge that the current manuscript does not explicitly report pre- versus post-annealing metrics or ablations on the temperature schedule. In the revision we will add these results: (1) tables comparing WikiText-2 perplexity and zero-shot accuracy before and after annealing for all evaluated models and bit-width budgets; (2) an ablation varying the annealing temperature schedule hyperparameters (initial temperature, decay rate, and final temperature) with corresponding performance curves; and (3) a quantitative stability analysis measuring the maximum quality drop during the final discretization steps relative to the reported margins over KL baselines. These additions will be placed in a new subsection of the experimental results and referenced from the abstract and method sections.

revision: yes
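
The three schedule hyperparameters named in the response (initial temperature, decay rate, final temperature) fit a simple exponential schedule, sketched below. The exponential form and the function name are assumptions; this page does not state which schedule the paper uses.

```python
def temperature_schedule(initial, decay, final, steps):
    """Exponentially decayed temperatures, floored at the final temperature."""
    temps = []
    t = initial
    for _ in range(steps):
        temps.append(max(t, final))
        t *= decay
    return temps


if __name__ == "__main__":
    # Hypothetical settings for illustration.
    print(temperature_schedule(initial=1.0, decay=0.5, final=0.1, steps=5))
```

An ablation of the kind described would vary these three arguments and record model quality at each step, so that any drop near the floor temperature becomes visible.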

Circularity Check

0 steps flagged

No circularity; empirical results independent of parameterization and regularization

The paper formulates per-layer MXFP assignment via a continuous scalar offset, applies temperature annealing for discretization, and uses a target-aware regularization term to control average bit-width. Reported gains consist of measured perplexity on WikiText-2 and zero-shot accuracy on standard benchmarks for Llama, Qwen3, and SmolLM2 models, compared against KL heuristics. These evaluation metrics are external to the training loss components and are not shown to equal the regularization target or offset parameterization by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the provided text to justify core claims. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no concrete information on free parameters, axioms, or invented entities; all arrays left empty.


[1] [1]

GPTQ: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

Pith/arXiv arXiv 2022

[2] [2]

AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. InProceedings of Machine Learning and Systems, 2024

2024

[3] [3]

SmoothQuant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the 40th International Conference on Machine Learning, 2023

2023

[4] [4]

FP8 versus INT8 for efficient deep learning inference.arXiv preprint arXiv:2303.17951, 2023

Mart van Baalen, Andrey Kuzmin, Suparna S Nair, Yuwei Ren, Eric Mahurin, Chirag Patel, Sundar Subramanian, Sanghyuk Lee, Markus Nagel, Joseph Soriaga, and Tijmen Blankevoort. FP8 versus INT8 for efficient deep learning inference.arXiv preprint arXiv:2303.17951, 2023

arXiv 2023

[5] [5]

Microscaling data formats for deep learning.arXiv preprint arXiv:2310.10537, 2023

Bita Darvish Rouhani, Ritchie Zhao, Ankit More, Mathew Hall, Alireza Khodamoradi, Summer Deng, Dhruv Choudhary, Marius Cornea, Eric Dellinger, Kristof Denolf, et al. Microscaling data formats for deep learning.arXiv preprint arXiv:2310.10537, 2023

arXiv 2023

[6] [6]

OCP microscaling formats (MX) specification

Bita Darvish Rouhani, Nitin Garegrat, Tom Savell, Ankit More, Kyung-Nam Han, Ritchie Zhao, Mathew Hall, Jasmine Klar, Eric Chung, Yuan Yu, Michael Schulte, Ralph Wittig, Ian Bratt, Nigel Stephens, Jelena Milanovic, John Brothers, Pradeep Dubey, Marius Cornea, Alexander Heinecke, Andres Rodriguez, Martin Langhammer, Summer Deng, Maxim Naumov, Paulius Micik...

2023

[7] [7]

Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh

Vage Egiazarian, Roberto L. Castro, Denis Kuznedelev, Andrei Panferov, Eldar Kurtic, Shubhra Pandit, Alexandre Marques, Mark Kurtz, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Bridging the gap between promise and performance for microscaling fp4 quantization, 2025

2025

[8] [8]

MixQuant: Pushing the limits of block rotations in post-training quantization

Sai Sanjeet, Ian Colbert, Pablo Monteagudo-Lago, Giuseppe Franco, Yaman Umuroglu, and Nicholas J Fraser. MixQuant: Pushing the limits of block rotations in post-training quantization. arXiv preprint arXiv:2601.22347, 2026

Pith/arXiv arXiv 2026

[9] [9]

Gradient-free training of quantized neural networks. arXiv preprint arXiv:2410.09734, 2024

Noa Cohen, Omkar Joglekar, Dotan Di Castro, Vladimir Tchuiev, Shir Kozlovsky, and Michal Moshkovitz. Gradient-free training of quantized neural networks. arXiv preprint arXiv:2410.09734, 2024

arXiv 2024

[10] [10]

Mixed precision quantization of ConvNets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090, 2018

Bichen Wu, Yanghan Wang, Peizhao Zhang, Yuandong Tian, Peter Vajda, and Kurt Keutzer. Mixed precision quantization of ConvNets via differentiable neural architecture search. arXiv preprint arXiv:1812.00090, 2018

Pith/arXiv arXiv 2018

[11] [11]

InfoQ: Mixed-precision quantization via global information flow

Mehmet Emre Akbulut, Hazem Hesham Yousef Shalby, Fabrizio Pittorino, and Manuel Roveri. InfoQ: Mixed-precision quantization via global information flow. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026

2026

[12] [12]

Mix-QSAM: Mixed-precision quantization of the segment anything model

Navin Ranjan and Andreas Savakis. Mix-QSAM: Mixed-precision quantization of the segment anything model. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025

2025

[13] [13]

HAWQ: Hessian AWare quantization of neural networks with mixed-precision

Zhen Dong, Zhewei Yao, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. HAWQ: Hessian AWare quantization of neural networks with mixed-precision. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 293–302, 2019

2019

[14] [14]

HAWQ-V2: Hessian aware trace-weighted quantization of neural networks

Zhen Dong, Zhewei Yao, Yaohui Cai, Daiyaan Arfeen, Amir Gholami, Michael W. Mahoney, and Kurt Keutzer. HAWQ-V2: Hessian aware trace-weighted quantization of neural networks. In Advances in Neural Information Processing Systems, volume 33, 2020

2020

[15] [15]

FracBits: Mixed precision quantization via fractional bit-widths

Linjie Yang and Qing Jin. FracBits: Mixed precision quantization via fractional bit-widths. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 10612–10620, 2021

2021

[16] [16]

SDQ: Stochastic differentiable quantization with mixed precision

Xijie Huang, Zhiqiang Shen, Shichao Li, Zechun Liu, Xianghong Hu, Jeffry Wicaksana, Eric Xing, and Kwang-Ting Cheng. SDQ: Stochastic differentiable quantization with mixed precision. In Proceedings of the 39th International Conference on Machine Learning, pages 9295–9309, 2022

2022

[17] [17]

BSQ: Exploring bit-level sparsity for mixed-precision neural network quantization

Huanrui Yang, Lin Duan, Yiran Chen, and Hai Li. BSQ: Exploring bit-level sparsity for mixed-precision neural network quantization. In Proceedings of the International Conference on Learning Representations, 2021

2021

[18] [18]

Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

Yoshua Bengio, Nicholas Léonard, and Aaron Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. arXiv preprint arXiv:1308.3432, 2013

Pith/arXiv arXiv 2013

[19] [19]

SpinQuant: LLM quantization with learned rotations. arXiv preprint arXiv:2405.16406, 2024

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. arXiv preprint arXiv:2405.16406, 2024

Pith/arXiv arXiv 2024

[20] [20]

FP8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022

Paulius Micikevicius, Dusan Stosic, Neil Burgess, Marius Cornea, Pradeep Dubey, Richard Grisenthwaite, Sangwon Ha, Alexander Heinecke, Patrick Judd, John Kamalu, Naveen Mellempudi, Stuart Oberman, Mohammad Shoeybi, Michael Siu, and Hao Wu. FP8 formats for deep learning. arXiv preprint arXiv:2209.05433, 2022

Pith/arXiv arXiv 2022

[21] [21]

The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024

[22] [22]

Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025

[23] [23]

SmolLM2: When smol goes big – data-centric training of a small language model. arXiv preprint arXiv:2502.02737, 2025

Loubna Ben Allal, Anton Lozhkov, Elie Bakouch, Gabriel Martín Blázquez, Guilherme Penedo, Lewis Tunstall, Andrés Marafioti, Hynek Kydlíček, Agustín Piqueres Lajarín, Vaibhav Srivastav, et al. SmolLM2: When smol goes big – data-centric training of a small language model. arXiv preprint arXiv:2502.02737, 2025

Pith/arXiv arXiv 2025

[24] [24]

The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557, 2024

Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, Anton Lozhkov, Margaret Mitchell, Colin Raffel, Leandro Von Werra, and Thomas Wolf. The FineWeb datasets: Decanting the web for the finest text data at scale. arXiv preprint arXiv:2406.17557, 2024

Pith/arXiv arXiv 2024

[25] [25]

Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In Proceedings of the International Conference on Learning Representations, 2017

2017

[26] [26]

Think you have solved question answering? try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try ARC, the AI2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

Pith/arXiv arXiv 2018

[27] [27]

HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019

2019

[28] [28]

WinoGrande: An adversarial Winograd schema challenge at scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. WinoGrande: An adversarial Winograd schema challenge at scale. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 8732–8740, 2020

2020

[29] [29]

LightEval: A lightweight framework for LLM evaluation, 2023

Nathan Habib, Clémentine Fourrier, Hynek Kydlíček, Thomas Wolf, and Lewis Tunstall. LightEval: A lightweight framework for LLM evaluation, 2023

2023

[30] [30]

Xilinx/brevitas, 2025

Giuseppe Franco, Alessandro Pappalardo, and Nicholas J Fraser. Xilinx/brevitas, 2025

2025

[31] [31]

HAQ: Hardware-aware automated quantization with mixed precision

Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: Hardware-aware automated quantization with mixed precision. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8612–8620, 2019

2019

[32] [32]

HAWQ-V3: Dyadic neural network quantization

Zhewei Yao, Zhen Dong, Zhangcheng Zheng, Amir Gholami, Jiali Yu, Eric Tan, Leyuan Wang, Qijing Huang, Yida Wang, Michael W. Mahoney, and Kurt Keutzer. HAWQ-V3: Dyadic neural network quantization. In Proceedings of the 38th International Conference on Machine Learning, pages 11875–11886, 2021

2021

[33] [33]

Towards mixed-precision quantization of neural networks via constrained optimization

Weihan Chen, Peisong Wang, and Jian Cheng. Towards mixed-precision quantization of neural networks via constrained optimization. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 5350–5359, 2021

2021

[34] [34]

APTQ: Attention-aware post-training mixed-precision quantization for large language models

Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, and Hao Yu. APTQ: Attention-aware post-training mixed-precision quantization for large language models. In Proceedings of the 61st IEEE/ACM Design Automation Conference, 2024

2024

[35] [35]

ResQ: Mixed-precision quantization of large language models with low-rank residuals. arXiv preprint arXiv:2412.14363, 2024

Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, and Xin Wang. ResQ: Mixed-precision quantization of large language models with low-rank residuals. arXiv preprint arXiv:2412.14363, 2024

arXiv 2024

[36] [36]

Rethinking differentiable search for mixed-precision neural networks

Zhaowei Cai and Nuno Vasconcelos. Rethinking differentiable search for mixed-precision neural networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2346–2355, 2020

2020

[37] [37]

Q-ViT: Fully differentiable quantization for vision transformer. arXiv preprint arXiv:2201.07703, 2022

Zhexin Li, Tong Yang, Peisong Wang, and Jian Cheng. Q-ViT: Fully differentiable quantization for vision transformer. arXiv preprint arXiv:2201.07703, 2022

arXiv 2022

[38] [38]

HMQ: Hardware friendly mixed precision quantization block for CNNs

Hai Victor Habi, Roy H. Jennings, and Arnon Netzer. HMQ: Hardware friendly mixed precision quantization block for CNNs. In Computer Vision – ECCV 2020, pages 448–463. Springer, 2020

2020

[39] [39]

Categorical reparameterization with Gumbel-Softmax

Eric Jang, Shixiang Gu, and Ben Poole. Categorical reparameterization with Gumbel-Softmax. arXiv preprint arXiv:1611.01144, 2016

Pith/arXiv arXiv 2016

[40] [40]

The concrete distribution: A continuous relaxation of discrete random variables

Chris J. Maddison, Andriy Mnih, and Yee Whye Teh. The concrete distribution: A continuous relaxation of discrete random variables. In Proceedings of the International Conference on Learning Representations, 2017

2017

[41] [41]

Bayesian bits: Unifying quantization and pruning

Mart van Baalen, Christos Louizos, Markus Nagel, Rana Ali Amjad, Ying Wang, Tijmen Blankevoort, and Max Welling. Bayesian bits: Unifying quantization and pruning. In Advances in Neural Information Processing Systems, volume 33, 2020

2020

[42] [42]

Mixed precision DNNs: All you need is a good parametrization

Stefan Uhlich, Lukas Mauch, Fabien Cardinaux, Kazuki Yoshiyama, Javier Alonso Garcia, Stephen Tiedemann, Thomas Kemp, and Akira Nakamura. Mixed precision DNNs: All you need is a good parametrization. In Proceedings of the International Conference on Learning Representations, 2020

2020

[43] [43]

Micromix: Efficient mixed-precision quantization with microscaling formats for large language models

Wenyuan Liu, Haoqian Meng, Yilun Luo, Yafei Zhao, Peng Zhang, and Xindian Ma. Micromix: Efficient mixed-precision quantization with microscaling formats for large language models. arXiv preprint arXiv:2508.02343, 2025

arXiv 2025

[44] [44]

Mixture compressor for mixture-of-experts LLMs gains more

Wei Huang, Yue Liao, Jianhui Liu, Ruifei He, Haoru Tan, Shiming Zhang, Hongsheng Li, Si Liu, and Xiaojuan Qi. Mixture compressor for mixture-of-experts LLMs gains more. In The Thirteenth International Conference on Learning Representations, 2025

2025

[45] [45]

IEEE standard for floating-point arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008), pages 1–84, 2019

IEEE. IEEE standard for floating-point arithmetic. IEEE Std 754-2019 (Revision of IEEE 754-2008), pages 1–84, 2019

2019

[46] [46]

Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks

Sambhav R. Jain, Albert Gural, Michael Wu, and Chris H. Dick. Trained quantization thresholds for accurate and efficient fixed-point inference of deep neural networks. In Proceedings of the 3rd Machine Learning and Systems (MLSys) Conference, 2020

2020

[47] [47]

PyTorch: An imperative style, high-performance deep learning library

Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-perf...

2019