OffQ: Taming Structured Outliers in LLM Quantization by Offsetting

Haoqi Wang; Jiawei Zhuang; Lorenz K. Mueller; Lukas Cavigelli; Mathieu Salzmann

arxiv: 2606.07116 · v1 · pith:OKPKPDNAnew · submitted 2026-06-05 · 💻 cs.LG · cs.AI· cs.CL

OffQ: Taming Structured Outliers in LLM Quantization by Offsetting

Haoqi Wang , Lorenz K. Mueller , Jiawei Zhuang , Mathieu Salzmann , Lukas Cavigelli This is my paper

Pith reviewed 2026-06-27 22:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords LLM quantizationactivation outliersoffsetting mechanismuniform quantization4-bit inferencePCA rotationW4A4KV4structured outliers

0 comments

The pith

Rotating activations to pack outliers into one channel then absorbing it via shared offset enables uniform 4-bit LLM quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that activation outliers in large language models follow a low-dimensional structured pattern that can be isolated and neutralized without complex per-tensor scaling. It does so by applying top-1 PCA to locate the outlier subspace, rotating the activations to concentrate extreme values into a single channel, and converting that channel's magnitude into a shared offset subtracted across the tensor. A sympathetic reader would care because this would let standard uniform-grid quantizers operate at 4 bits on weights, activations, and KV caches while keeping accuracy close to full precision. If the claim holds, deployment on hardware limited to uniform precision becomes simpler and more memory-efficient.

Core claim

OffQ first identifies a low-dimensional outlier subspace in the activations using a proposed top-1 PCA, and then concentrates high-magnitude activations into 1 channel via rotation. OffQ then absorbs this concentrated outlier channel by converting its magnitude into a shared offset, thereby reducing the standard deviation of the activations. This offsetting strategy enables effective W4A4KV4 quantization of LLMs using deployment-friendly uniform-grid and uniform-precision quantization.

What carries the argument

The offsetting mechanism that isolates the outlier subspace via top-1 PCA, rotates to concentrate high-magnitude values into one channel, and converts that channel into a shared offset.

If this is right

LLMs achieve effective 4-bit quantization for weights, activations, and KV caches using only uniform grids and uniform precision.
The method avoids per-tensor adjustments that would defeat the uniform-grid goal.
Accuracy improves consistently over prior outlier-handling baselines across multiple LLM architectures and evaluation benchmarks.
Low-bit efficiency is preserved because the offset is shared and requires no additional per-channel storage or computation at inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same rotation-plus-offset pattern might apply to outlier patterns in non-transformer models such as convolutional networks.
Hardware designs could add native support for a single shared offset per tensor to further speed up the resulting 4-bit kernels.
Testing the method at 3-bit or 2-bit widths would show whether the concentration step scales before the offset itself becomes too coarse.

Load-bearing premise

The low-dimensional outlier subspace identified by top-1 PCA can be rotated to concentrate all high-magnitude values into a single channel whose magnitude can be removed by a shared offset without unacceptable distortion.

What would settle it

Quantizing an LLM such as Llama to W4A4KV4 after the rotation and offset step and measuring whether perplexity on WikiText or accuracy on MMLU drops by more than a few percent relative to the 16-bit baseline.

Figures

Figures reproduced from arXiv: 2606.07116 by Haoqi Wang, Jiawei Zhuang, Lorenz K. Mueller, Lukas Cavigelli, Mathieu Salzmann.

**Figure 2.** Figure 2: The overall quantization pipeline of OffQ. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: Effect of group size on quantization performance of OffQ on Llama 3-8B with a hidden [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗

read the original abstract

Low-bit quantization has been widely adopted to accelerate the inference of large language models (LLMs) by significantly reducing computational cost and memory usage. However, activation outliers pose a major challenge to effective quantization, often leading to notable performance degradation. In this paper, we introduce OffQ, a method designed to mitigate activation outliers in low-bit quantization through a novel offsetting mechanism. Specifically, OffQ first identifies a low-dimensional outlier subspace in the activations using a proposed top-1 PCA, and then concentrates high-magnitude activations into 1 channel via rotation. OffQ then absorbs this concentrated outlier channel by converting its magnitude into a shared offset, thereby reducing the standard deviation of the activations. This offsetting strategy enables effective W4A4KV4 quantization of LLMs using deployment-friendly uniform-grid and uniform-precision quantization. Extensive experiments across diverse LLM architectures and benchmarks demonstrate that OffQ outperforms state-of-the-art baselines, consistently improving model accuracy while preserving low-bit efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

OffQ's PCA-plus-rotation-plus-offset trick is a practical attempt at uniform 4-bit LLM quantization, but the claim hinges on an untested assumption that top-1 PCA isolates all structured outliers into one absorbable channel.

read the letter

The core idea is to run top-1 PCA on activations, rotate so the outlier energy lands in a single channel, then turn that channel's magnitude into a shared offset subtracted from the tensor to shrink the overall range. This is meant to let standard uniform-grid W4A4KV4 work without per-channel zero points or other deployment headaches.

What stands out is the explicit focus on keeping everything uniform and hardware-friendly while still claiming accuracy gains over prior outlier-handling methods. The abstract positions the offsetting step as the novel piece after the rotation.

The soft spot is exactly the one the stress test flags. If the outlier energy is not confined to a single direction, the top eigenvector rotation will leave residual high-magnitude values spread across channels. Subtracting one shared offset from the whole tensor cannot recenter all of them at once without either inflating the range elsewhere or forcing non-uniform adjustments that defeat the premise. The abstract gives no derivation or ablation showing that the 1D subspace assumption holds across layers or models, and the full text is not supplied here to check whether later sections close that gap.

The experiments are described only at a high level, so it is impossible to judge whether the reported improvements survive proper controls or rest on post-hoc tuning. Citation patterns and reproducibility details are also absent from what is visible.

This is the kind of paper that would interest people shipping 4-bit inference on standard hardware. It deserves a serious referee because the problem is concrete and the proposed fix is simple enough to test, even though the current evidence is too thin to assess whether the central assumption actually holds.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces OffQ, a quantization technique for LLMs that identifies a low-dimensional outlier subspace in activations via top-1 PCA, rotates the tensor to concentrate high-magnitude values into a single channel, and absorbs that channel by converting its magnitude into a shared offset subtracted from the entire tensor. This is claimed to reduce activation standard deviation enough to enable effective W4A4KV4 quantization under deployment-friendly uniform-grid and uniform-precision schemes, with experiments showing consistent accuracy gains over baselines across LLM architectures and benchmarks.

Significance. If the core offsetting mechanism holds without introducing unacceptable distortion or requiring non-uniform adjustments, the result would be significant for practical LLM deployment, as it would allow standard uniform 4-bit quantization to achieve competitive accuracy without custom hardware or per-tensor zero-points.

major comments (1)

[Method description (top-1 PCA and rotation step)] Method description (top-1 PCA and rotation step): the central claim that top-1 PCA isolates the outlier subspace such that a single rotation concentrates every high-magnitude activation into one channel (whose value can then be replaced by a shared offset) is load-bearing but unsupported by any derivation or analysis. If outlier energy has components orthogonal to the top eigenvector, post-rotation channels will retain large values; subtracting one shared offset cannot recenter all of them without either increasing range in other channels or requiring per-channel adjustments, directly contradicting the uniform-grid, uniform-precision premise.

minor comments (1)

[Abstract] Abstract: the statement that experiments demonstrate consistent improvement would be strengthened by explicit mention of the specific models, bit-width configurations, and evaluation metrics used.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. Below we respond point-by-point to the single major comment.

read point-by-point responses

Referee: Method description (top-1 PCA and rotation step): the central claim that top-1 PCA isolates the outlier subspace such that a single rotation concentrates every high-magnitude activation into one channel (whose value can then be replaced by a shared offset) is load-bearing but unsupported by any derivation or analysis. If outlier energy has components orthogonal to the top eigenvector, post-rotation channels will retain large values; subtracting one shared offset cannot recenter all of them without either increasing range in other channels or requiring per-channel adjustments, directly contradicting the uniform-grid, uniform-precision premise.

Authors: We agree that the manuscript presents the top-1 PCA choice primarily through empirical motivation rather than a formal derivation or proof that all outlier energy lies exactly along the leading eigenvector. The approach is grounded in the documented low-rank structure of activation outliers in LLMs (as referenced in the related-work section), and the rotation is defined to align the dominant principal direction with a single coordinate axis. We will revise the method section to include (i) a quantitative report of the fraction of activation variance captured by the top-1 component across layers and models and (ii) a short discussion of the residual approximation error when the outlier subspace is not perfectly one-dimensional. Regarding the orthogonal-component concern, the shared-offset step is applied after rotation precisely to recenter the now-concentrated channel; any residual orthogonal energy remains in the other channels but, per our measurements, exhibits substantially lower magnitude, allowing the subsequent uniform 4-bit grid to operate without per-channel zero-points. The experimental tables demonstrate that this suffices for the reported accuracy recovery. We therefore view the referee’s observation as identifying a useful clarification rather than an outright contradiction, and the planned additions will make the supporting evidence explicit. revision: partial

Circularity Check

0 steps flagged

No circularity: procedural method with no self-referential equations or load-bearing self-citations

full rationale

The paper presents OffQ as an algorithmic pipeline (top-1 PCA to identify outlier subspace, rotation to concentrate into one channel, conversion of that channel's magnitude into a shared offset) without any equations, fitted parameters renamed as predictions, or self-citations that justify the core steps. The abstract and description treat the approach as a sequence of operations whose effectiveness is evaluated empirically on benchmarks, not derived from first principles that loop back to the inputs. No load-bearing claim reduces by construction to its own definition or prior self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5708 in / 953 out tokens · 19458 ms · 2026-06-27T22:35:27.204594+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

90 extracted references · 7 canonical work pages · 4 internal anchors

[1]

Framequant: Flexible low-bit quantization for transformers

Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, and Vikas Singh. Framequant: Flexible low-bit quantization for transformers. InProceedings of International Conference on Machine Learning (ICML), 2024

2024
[2]

KurTail : Kurtosis-based LLM quantization

Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski, Evangelos Eleftheriou, and Martino Dazzi. KurTail : Kurtosis-based LLM quantization. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 17404–17419, Suzhou, China, 2025. Association for Computational Linguistics

2025
[3]

Quantization error propagation: Revisiting layer-wise post-training quantization

Yamato Arai and Yuma Ichikawa. Quantization error propagation: Revisiting layer-wise post-training quantization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[4]

Quantization error propagation: Revisiting layer-wise post-training quantization

Yamato Arai and Yuma Ichikawa. Quantization error propagation: Revisiting layer-wise post-training quantization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026
[5]

Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman

Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. InThe Twelfth International Conference on Learning Representations, 2024

2024
[6]

QUIK: Towards end-to-end 4-bit inference on generative large language models

Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, and Dan Alistarh. QUIK: Towards end-to-end 4-bit inference on generative large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3355–3371, Miami, Florida, USA, 2024. Association for Computati...

2024
[7]

Quarot: Outlier-free 4-bit inference in rotated llms

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. Advances in Neural Information Processing Systems, 37:100213–100240, 2024

2024
[8]

Castro, Torsten Hoefler, and Dan Alistarh

Saleh Ashkboos, Mahdi Nikdan, Soroush Tabesh, Roberto L. Castro, Torsten Hoefler, and Dan Alistarh. HALO: Hadamard-assisted lower-precision optimization for LLMs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[9]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. InThirty-Fourth AAAI Conference on Artificial Intelligence, 2020

2020
[10]

Understanding and overcoming the chal- lenges of efficient transformer quantization

Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Understanding and overcoming the chal- lenges of efficient transformer quantization. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7947–7969, 2021

2021
[11]

PyramidKV: Dynamic KV cache compression based on pyramidal information funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. InSecond Conference on Language Modeling, 2025

2025
[12]

Quip: 2-bit quantization of large language models with guarantees, 2024

Jerry Chee, Yaohui Cai, V olodymyr Kuleshov, and Christopher De Sa. Quip: 2-bit quantization of large language models with guarantees, 2024

2024
[13]

Efficien- tQAT: Efficient quantization-aware training for large language models

Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. Efficien- tQAT: Efficient quantization-aware training for large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10081–10100, Vienna, Austria, 2025. Association for Computational L...

2025
[14]

Rotate, clip, and partition: Towards W2A4KV4 quantization by integrating rotation and learnable non-uniform quantizer

Euntae Choi, Sumin Song, Woosang Lim, and Sungjoo Yoo. Rotate, clip, and partition: Towards W2A4KV4 quantization by integrating rotation and learnable non-uniform quantizer. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 7568–7590, Suzhou, China, 2025. Association for Computational Linguistics

2025
[15]

BoolQ: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ...

2019
[16]

Association for Computational Linguistics. 10
[17]

Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

2018
[18]

Vision transformers need registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. InThe Twelfth International Conference on Learning Representations, 2024

2024
[19]

Llm.int8(): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35:30318–30332, 2022

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35:30318–30332, 2022

2022
[20]

Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh

Tim Dettmers, Ruslan A. Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. SpQR: A sparse-quantized representation for near- lossless LLM weight compression. InThe Twelfth International Conference on Learning Representations, 2024

2024
[21]

BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation

DaYou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, and Ningyi Xu. BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 102–116, Bangkok, Thailand, 2024. Association for Computational Linguistics

2024
[22]

OPTQ: Accurate quantization for generative pre-trained transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate quantization for generative pre-trained transformers. InThe Eleventh International Conference on Learning Representations, 2023

2023
[23]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

LQ-loRA: Low-rank plus quantized matrix decomposition for efficient language model finetuning

Han Guo, Philip Greengard, Eric Xing, and Yoon Kim. LQ-loRA: Low-rank plus quantized matrix decomposition for efficient language model finetuning. InThe Twelfth International Conference on Learning Representations, 2024

2024
[25]

Zipcache: Accurate and efficient KV cache quantization with salient token identification

Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. Zipcache: Accurate and efficient KV cache quantization with salient token identification. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024
[26]

Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami

Coleman Richard Charles Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024
[27]

Mlwq: Efficient small language model deployment via multi-level weight quantization

Chun Hu, Junhui He, Shangyu Wu, Yuxin He, Chun Jason Xue, and Qingan Li. Mlwq: Efficient small language model deployment via multi-level weight quantization. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8078–8088, 2025

2025
[28]

OSTQuant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting

Xing Hu, Yuan Cheng, Dawei Yang, Zhixuan Chen, Zukang Xu, JiangyongYu, XUCHEN, Zhihang Yuan, Zhe jiang, and Sifan Zhou. OSTQuant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[29]

Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291, 2024

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xi- aojuan Qi. Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291, 2024

work page arXiv 2024
[30]

SliM-LLM: Salience-driven mixed-precision quantization for large language models

Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, and Xiaojuan Qi. SliM-LLM: Salience-driven mixed-precision quantization for large language models. InProceedings of the 42nd International Conference on Machine Learning, pages 25672–25692. PMLR, 2025

2025
[31]

Quantization and training of neural networks for efficient integer- arithmetic-only inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer- arithmetic-only inference. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018

2018
[32]

Understanding and improving knowledge distillation for quantization aware training of large transformer encoders

Minsoo Kim, Sihwa Lee, Suk-Jin Hong, Du-Seong Chang, and Jungwook Choi. Understanding and improving knowledge distillation for quantization aware training of large transformer encoders. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6713–6725, Abu Dhabi, United Arab Emirates, 2022. Association for Computat...

2022
[33]

Mahoney, and Kurt Keutzer

Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM: Dense and sparse quantization, 2024

2024
[34]

Mahoney, and Kurt Keutzer

Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM: Dense-and-sparse quantization. InProceedings of the 41st International Conference on Machine Learning, pages 23901–23923. PMLR, 2024

2024
[35]

Quantizing deep convolutional networks for efficient inference: A whitepaper

Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[36]

give me BF16 or give me death

Eldar Kurtic, Alexandre Noll Marques, Shubhra Pandit, Mark Kurtz, and Dan Alistarh. “give me BF16 or give me death”? accuracy-performance trade-offs in LLM quantization. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26872–26886, Vienna, Austria, 2025. Association for Computational ...

2025
[37]

Flexround: Learnable rounding based on element-wise division for post-training quantization

Jung Hyun Lee, Jeonghoon Kim, Se Jung Kwon, and Dongsoo Lee. Flexround: Learnable rounding based on element-wise division for post-training quantization. InInternational Conference on Machine Learning, pages 18913–18939. PMLR, 2023

2023
[38]

Rethinking residual errors in compensation-based LLM quantization

Shuaiting Li, Juncan Deng, Kedong Xu, Rongtao Deng, Hong Gu, Minghan Jiang, Haibin Shen, and Kejie Huang. Rethinking residual errors in compensation-based LLM quantization. InThe Fourteenth International Conference on Learning Representations, 2026

2026
[39]

GPTAQ: Efficient finetuning-free quantization for asymmetric calibration

Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, and Priyadarshini Panda. GPTAQ: Efficient finetuning-free quantization for asymmetric calibration. InForty-second International Conference on Machine Learning, 2025

2025
[40]

Duquant: Distributing outliers via dual transformation makes stronger quantized llms

Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms. Advances in Neural Information Processing Systems, 37:87766–87800, 2024

2024
[41]

Awq: Activation-aware weight quantization for llm compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. InMLSys, 2024

2024
[42]

Qserve: W4a8kv4 quantization and system co-design for efficient llm serving.Proceedings of Machine Learning and Systems, 7, 2025

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving.Proceedings of Machine Learning and Systems, 7, 2025

2025
[43]

QLLM: Accurate and efficient low-bitwidth quantization for large language models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, and Bohan Zhuang. QLLM: Accurate and efficient low-bitwidth quantization for large language models. InInternational Conference on Learning Representations (ICLR), 2024

2024
[44]

Vptq: Extreme low-bit vector post-training quantization for large language models

Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, and Mao Yang. Vptq: Extreme low-bit vector post-training quantization for large language models. InThe 2024 Conference on Empirical Methods in Natural Language Processing, 2024

2024
[45]

LLM-QAT: Data-free quantization aware training for large language models

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. LLM-QAT: Data-free quantization aware training for large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 467–484, Bangkok, Thailand, 2024. Association for Computationa...

2024
[46]

Kivi: A tuning-free asymmetric 2bit quantization for kv cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. InInternational Conference on Machine Learning, pages 32332–32344. PMLR, 2024

2024
[47]

Spinquant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: LLM quantization with learned rotations. InThe Thirteenth International Conference on Learning Representations, 2025

2025
[48]

Affinequant: Affine transformation quantization for large language models

Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, and Rongrong Ji. Affinequant: Affine transformation quantization for large language models. InThe Twelfth International Conference on Learning Representations, 2024

2024
[49]

Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017. 12

2017
[50]

Llama 3.2: Revolutionizing edge ai and vision with open, customizable models

Meta AI. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. https: //ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ , 2024. Accessed: 2026-05-05

2024
[51]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, 2018. Association for Computational Linguistics

2018
[52]

Muller, Philippe Bich, Jiawei Zhuang, Ahmet Celik, Luca Benfenati, and Lukas Cavigelli

Lorenz K. Muller, Philippe Bich, Jiawei Zhuang, Ahmet Celik, Luca Benfenati, and Lukas Cavigelli. Sinq: Sinkhorn-normalized quantization for calibration-free low-precision llm weights, 2025

2025
[53]

Self-distilled quantization: Achieving high compression rates in transformer-based language models

James O’Neill and Sourav Dutta. Self-distilled quantization: Achieving high compression rates in transformer-based language models. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1329–1339, Toronto, Canada, 2023. Association for Computational Linguistics

2023
[54]

Automatic differentiation in pytorch

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017

2017
[55]

Qwen2.5 technical report, 2025

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

2025
[56]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.arXiv preprint arXiv:1907.10641, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[57]

Social iqa: Commonsense reasoning about social interactions

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Commonsense reasoning about social interactions. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, 2019

2019
[58]

Nestquant: nested lattice quantization for matrix products and LLMs

Semyon Savkin, Eitan Porat, Or Ordentlich, and Yury Polyanskiy. Nestquant: nested lattice quantization for matrix products and LLMs. InForty-second International Conference on Machine Learning, 2025

2025
[59]

Resq: Mixed-precision quantization of large language models with low-rank residuals

Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, and Xin Wang. Resq: Mixed-precision quantization of large language models with low-rank residuals. InForty-second International Conference on Machine Learning, 2025

2025
[60]

Omniquant: Omnidirectionally calibrated quantization for large language models

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. InThe Twelfth International Conference on Learning Representations, 2024

2024
[61]

Dartquant: Efficient rotational distribution calibration for LLM quantization

Yuantian Shao, Yuanteng Chen, Peisong Wang, Jianlin Yu, Jing Lin, Yiwu Yao, Zhihui Wei, and Jian Cheng. Dartquant: Efficient rotational distribution calibration for LLM quantization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026
[62]

NSNQuant: A double normalization approach for calibration-free low-bit vector quantization of KV cache

Donghyun Son, Euntae Choi, and Sungjoo Yoo. NSNQuant: A double normalization approach for calibration-free low-bit vector quantization of KV cache. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025
[63]

Accurate kv cache quantization with outlier tokens tracing

Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, and Min Zhang. Accurate kv cache quantization with outlier tokens tracing. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12895–12915, 2025

2025
[64]

Massive activations in large language models

Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. InFirst Conference on Language Modeling, 2024

2024
[65]

Flatquant: Flatness matters for LLM quantization

Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, JiaxinHu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, and Jun Yao. Flatquant: Flatness matters for LLM quantization. In Forty-second International Conference on Machine Learning, 2025

2025
[66]

CUTLASS, 2023

Vijay Thakkar, Pradeep Ramani, Cris Cecka, Aniket Shivam, Honghao Lu, Ethan Yan, Jack Kosaian, Mark Hoemmen, Haicheng Wu, Andrew Kerr, Matt Nicely, Duane Merrill, Dustyn Blasig, Aditya Atluri, Fengqi Qiao, Piotr Majcher, Paul Springer, Markus Hohnerbach, Jin Wang, and Manish Gupta. CUTLASS, 2023. 13

2023
[67]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[68]

QuIP$\#$: Even better LLM quantization with hadamard incoherence and lattice codebooks

Albert Tseng, Jerry Chee, Qingyao Sun, V olodymyr Kuleshov, and Christopher De Sa. QuIP$\#$: Even better LLM quantization with hadamard incoherence and lattice codebooks. InForty-first International Conference on Machine Learning, 2024

2024
[69]

Qtip: Quantization with trellises and incoherence processing.Advances in Neural Information Processing Systems, 37:59597–59620, 2024

Albert Tseng, Qingyao Sun, David Hou, and Christopher De. Qtip: Quantization with trellises and incoherence processing.Advances in Neural Information Processing Systems, 37:59597–59620, 2024

2024
[70]

Gptvq: The blessing of dimensionality in llm quantization.arXiv preprint arXiv:2402.15319, 2024

Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, and Paul Whatmough. Gptvq: The blessing of dimensionality in llm quantization.arXiv preprint arXiv:2402.15319, 2024

work page arXiv 2024
[71]

Bitnet: Scaling 1-bit transformers for large language models, 2023

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models, 2023

2023
[72]

Sinder: Repairing the singular defects of dinov2

Haoqi Wang, Tong Zhang, and Mathieu Salzmann. Sinder: Repairing the singular defects of dinov2. In European Conference on Computer Vision, pages 20–35. Springer, 2024

2024
[73]

Demystifying singular defects in large language models

Haoqi Wang, Tong Zhang, and Mathieu Salzmann. Demystifying singular defects in large language models. InForty-second International Conference on Machine Learning, 2025

2025
[74]

Outlier suppression: Pushing the limit of low-bit transformer language models

Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit transformer language models. Advances in Neural Information Processing Systems, 35:17402–17414, 2022

2022
[75]

Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling

Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1648–1665, Singapore, 2023. Association fo...

2023
[76]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

2020
[77]

DFRot: Achieving outlier-free and massive activation-free for rotated LLMs with refined rotation

Jingyang Xiang and Sai Qian Zhang. DFRot: Achieving outlier-free and massive activation-free for rotated LLMs with refined rotation. InSecond Conference on Language Modeling, 2025

2025
[78]

SmoothQuant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. InProceedings of the 40th International Conference on Machine Learning, 2023

2023
[79]

Onebit: Towards extremely low-bit large language models

Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. Onebit: Towards extremely low-bit large language models. InAdvances in Neural Information Processing Systems, pages 66357–66382, 2024

2024
[80]

CRVQ: Channel-relaxed vector quantization for extreme compression of LLMs.Transactions of the Association for Computational Linguistics (TACL), 13: 1488–1506, 2025

Yuzhuang Xu, Shiyu Ji, Qingfu Zhu, and Wanxiang Che. CRVQ: Channel-relaxed vector quantization for extreme compression of LLMs.Transactions of the Association for Computational Linguistics (TACL), 13: 1488–1506, 2025

2025

Showing first 80 references.

[1] [1]

Framequant: Flexible low-bit quantization for transformers

Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, and Vikas Singh. Framequant: Flexible low-bit quantization for transformers. InProceedings of International Conference on Machine Learning (ICML), 2024

2024

[2] [2]

KurTail : Kurtosis-based LLM quantization

Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski, Evangelos Eleftheriou, and Martino Dazzi. KurTail : Kurtosis-based LLM quantization. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 17404–17419, Suzhou, China, 2025. Association for Computational Linguistics

2025

[3] [3]

Quantization error propagation: Revisiting layer-wise post-training quantization

Yamato Arai and Yuma Ichikawa. Quantization error propagation: Revisiting layer-wise post-training quantization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[4] [4]

Quantization error propagation: Revisiting layer-wise post-training quantization

Yamato Arai and Yuma Ichikawa. Quantization error propagation: Revisiting layer-wise post-training quantization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026

[5] [5]

Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman

Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. InThe Twelfth International Conference on Learning Representations, 2024

2024

[6] [6]

QUIK: Towards end-to-end 4-bit inference on generative large language models

Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, and Dan Alistarh. QUIK: Towards end-to-end 4-bit inference on generative large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3355–3371, Miami, Florida, USA, 2024. Association for Computati...

2024

[7] [7]

Quarot: Outlier-free 4-bit inference in rotated llms

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. Advances in Neural Information Processing Systems, 37:100213–100240, 2024

2024

[8] [8]

Castro, Torsten Hoefler, and Dan Alistarh

Saleh Ashkboos, Mahdi Nikdan, Soroush Tabesh, Roberto L. Castro, Torsten Hoefler, and Dan Alistarh. HALO: Hadamard-assisted lower-precision optimization for LLMs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[9] [9]

Piqa: Reasoning about physical commonsense in natural language

Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. InThirty-Fourth AAAI Conference on Artificial Intelligence, 2020

2020

[10] [10]

Understanding and overcoming the chal- lenges of efficient transformer quantization

Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Understanding and overcoming the chal- lenges of efficient transformer quantization. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7947–7969, 2021

2021

[11] [11]

PyramidKV: Dynamic KV cache compression based on pyramidal information funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. InSecond Conference on Language Modeling, 2025

2025

[12] [12]

Quip: 2-bit quantization of large language models with guarantees, 2024

Jerry Chee, Yaohui Cai, V olodymyr Kuleshov, and Christopher De Sa. Quip: 2-bit quantization of large language models with guarantees, 2024

2024

[13] [13]

Efficien- tQAT: Efficient quantization-aware training for large language models

Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. Efficien- tQAT: Efficient quantization-aware training for large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10081–10100, Vienna, Austria, 2025. Association for Computational L...

2025

[14] [14]

Rotate, clip, and partition: Towards W2A4KV4 quantization by integrating rotation and learnable non-uniform quantizer

Euntae Choi, Sumin Song, Woosang Lim, and Sungjoo Yoo. Rotate, clip, and partition: Towards W2A4KV4 quantization by integrating rotation and learnable non-uniform quantizer. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 7568–7590, Suzhou, China, 2025. Association for Computational Linguistics

2025

[15] [15]

BoolQ: Exploring the surprising difficulty of natural yes/no questions

Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ...

2019

[16] [16]

Association for Computational Linguistics. 10

[17] [17]

Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

2018

[18] [18]

Vision transformers need registers

Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. InThe Twelfth International Conference on Learning Representations, 2024

2024

[19] [19]

Llm.int8(): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35:30318–30332, 2022

Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35:30318–30332, 2022

2022

[20] [20]

Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh

Tim Dettmers, Ruslan A. Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. SpQR: A sparse-quantized representation for near- lossless LLM weight compression. InThe Twelfth International Conference on Learning Representations, 2024

2024

[21] [21]

BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation

DaYou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, and Ningyi Xu. BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 102–116, Bangkok, Thailand, 2024. Association for Computational Linguistics

2024

[22] [22]

OPTQ: Accurate quantization for generative pre-trained transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate quantization for generative pre-trained transformers. InThe Eleventh International Conference on Learning Representations, 2023

2023

[23] [23]

The Llama 3 Herd of Models

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[24] [24]

LQ-loRA: Low-rank plus quantized matrix decomposition for efficient language model finetuning

Han Guo, Philip Greengard, Eric Xing, and Yoon Kim. LQ-loRA: Low-rank plus quantized matrix decomposition for efficient language model finetuning. InThe Twelfth International Conference on Learning Representations, 2024

2024

[25] [25]

Zipcache: Accurate and efficient KV cache quantization with salient token identification

Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. Zipcache: Accurate and efficient KV cache quantization with salient token identification. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024

[26] [26]

Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami

Coleman Richard Charles Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

2024

[27] [27]

Mlwq: Efficient small language model deployment via multi-level weight quantization

Chun Hu, Junhui He, Shangyu Wu, Yuxin He, Chun Jason Xue, and Qingan Li. Mlwq: Efficient small language model deployment via multi-level weight quantization. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8078–8088, 2025

2025

[28] [28]

OSTQuant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting

Xing Hu, Yuan Cheng, Dawei Yang, Zhixuan Chen, Zukang Xu, JiangyongYu, XUCHEN, Zhihang Yuan, Zhe jiang, and Sifan Zhou. OSTQuant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[29] [29]

Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291, 2024

Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xi- aojuan Qi. Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291, 2024

work page arXiv 2024

[30] [30]

SliM-LLM: Salience-driven mixed-precision quantization for large language models

Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, and Xiaojuan Qi. SliM-LLM: Salience-driven mixed-precision quantization for large language models. InProceedings of the 42nd International Conference on Machine Learning, pages 25672–25692. PMLR, 2025

2025

[31] [31]

Quantization and training of neural networks for efficient integer- arithmetic-only inference

Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer- arithmetic-only inference. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018

2018

[32] [32]

Understanding and improving knowledge distillation for quantization aware training of large transformer encoders

Minsoo Kim, Sihwa Lee, Suk-Jin Hong, Du-Seong Chang, and Jungwook Choi. Understanding and improving knowledge distillation for quantization aware training of large transformer encoders. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6713–6725, Abu Dhabi, United Arab Emirates, 2022. Association for Computat...

2022

[33] [33]

Mahoney, and Kurt Keutzer

Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM: Dense and sparse quantization, 2024

2024

[34] [34]

Mahoney, and Kurt Keutzer

Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM: Dense-and-sparse quantization. InProceedings of the 41st International Conference on Machine Learning, pages 23901–23923. PMLR, 2024

2024

[35] [35]

Quantizing deep convolutional networks for efficient inference: A whitepaper

Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[36] [36]

give me BF16 or give me death

Eldar Kurtic, Alexandre Noll Marques, Shubhra Pandit, Mark Kurtz, and Dan Alistarh. “give me BF16 or give me death”? accuracy-performance trade-offs in LLM quantization. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26872–26886, Vienna, Austria, 2025. Association for Computational ...

2025

[37] [37]

Flexround: Learnable rounding based on element-wise division for post-training quantization

Jung Hyun Lee, Jeonghoon Kim, Se Jung Kwon, and Dongsoo Lee. Flexround: Learnable rounding based on element-wise division for post-training quantization. InInternational Conference on Machine Learning, pages 18913–18939. PMLR, 2023

2023

[38] [38]

Rethinking residual errors in compensation-based LLM quantization

Shuaiting Li, Juncan Deng, Kedong Xu, Rongtao Deng, Hong Gu, Minghan Jiang, Haibin Shen, and Kejie Huang. Rethinking residual errors in compensation-based LLM quantization. InThe Fourteenth International Conference on Learning Representations, 2026

2026

[39] [39]

GPTAQ: Efficient finetuning-free quantization for asymmetric calibration

Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, and Priyadarshini Panda. GPTAQ: Efficient finetuning-free quantization for asymmetric calibration. InForty-second International Conference on Machine Learning, 2025

2025

[40] [40]

Duquant: Distributing outliers via dual transformation makes stronger quantized llms

Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms. Advances in Neural Information Processing Systems, 37:87766–87800, 2024

2024

[41] [41]

Awq: Activation-aware weight quantization for llm compression and acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. InMLSys, 2024

2024

[42] [42]

Qserve: W4a8kv4 quantization and system co-design for efficient llm serving.Proceedings of Machine Learning and Systems, 7, 2025

Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving.Proceedings of Machine Learning and Systems, 7, 2025

2025

[43] [43]

QLLM: Accurate and efficient low-bitwidth quantization for large language models

Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, and Bohan Zhuang. QLLM: Accurate and efficient low-bitwidth quantization for large language models. InInternational Conference on Learning Representations (ICLR), 2024

2024

[44] [44]

Vptq: Extreme low-bit vector post-training quantization for large language models

Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, and Mao Yang. Vptq: Extreme low-bit vector post-training quantization for large language models. InThe 2024 Conference on Empirical Methods in Natural Language Processing, 2024

2024

[45] [45]

LLM-QAT: Data-free quantization aware training for large language models

Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. LLM-QAT: Data-free quantization aware training for large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 467–484, Bangkok, Thailand, 2024. Association for Computationa...

2024

[46] [46]

Kivi: A tuning-free asymmetric 2bit quantization for kv cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. InInternational Conference on Machine Learning, pages 32332–32344. PMLR, 2024

2024

[47] [47]

Spinquant: LLM quantization with learned rotations

Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: LLM quantization with learned rotations. InThe Thirteenth International Conference on Learning Representations, 2025

2025

[48] [48]

Affinequant: Affine transformation quantization for large language models

Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, and Rongrong Ji. Affinequant: Affine transformation quantization for large language models. InThe Twelfth International Conference on Learning Representations, 2024

2024

[49] [49]

Pointer sentinel mixture models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017. 12

2017

[50] [50]

Llama 3.2: Revolutionizing edge ai and vision with open, customizable models

Meta AI. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. https: //ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ , 2024. Accessed: 2026-05-05

2024

[51] [51]

Can a suit of armor conduct electricity? a new dataset for open book question answering

Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, 2018. Association for Computational Linguistics

2018

[52] [52]

Muller, Philippe Bich, Jiawei Zhuang, Ahmet Celik, Luca Benfenati, and Lukas Cavigelli

Lorenz K. Muller, Philippe Bich, Jiawei Zhuang, Ahmet Celik, Luca Benfenati, and Lukas Cavigelli. Sinq: Sinkhorn-normalized quantization for calibration-free low-precision llm weights, 2025

2025

[53] [53]

Self-distilled quantization: Achieving high compression rates in transformer-based language models

James O’Neill and Sourav Dutta. Self-distilled quantization: Achieving high compression rates in transformer-based language models. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1329–1339, Toronto, Canada, 2023. Association for Computational Linguistics

2023

[54] [54]

Automatic differentiation in pytorch

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017

2017

[55] [55]

Qwen2.5 technical report, 2025

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

2025

[56] [56]

WinoGrande: An Adversarial Winograd Schema Challenge at Scale

Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.arXiv preprint arXiv:1907.10641, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907

[57] [57]

Social iqa: Commonsense reasoning about social interactions

Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Commonsense reasoning about social interactions. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, 2019

2019

[58] [58]

Nestquant: nested lattice quantization for matrix products and LLMs

Semyon Savkin, Eitan Porat, Or Ordentlich, and Yury Polyanskiy. Nestquant: nested lattice quantization for matrix products and LLMs. InForty-second International Conference on Machine Learning, 2025

2025

[59] [59]

Resq: Mixed-precision quantization of large language models with low-rank residuals

Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, and Xin Wang. Resq: Mixed-precision quantization of large language models with low-rank residuals. InForty-second International Conference on Machine Learning, 2025

2025

[60] [60]

Omniquant: Omnidirectionally calibrated quantization for large language models

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. InThe Twelfth International Conference on Learning Representations, 2024

2024

[61] [61]

Dartquant: Efficient rotational distribution calibration for LLM quantization

Yuantian Shao, Yuanteng Chen, Peisong Wang, Jianlin Yu, Jing Lin, Yiwu Yao, Zhihui Wei, and Jian Cheng. Dartquant: Efficient rotational distribution calibration for LLM quantization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

2026

[62] [62]

NSNQuant: A double normalization approach for calibration-free low-bit vector quantization of KV cache

Donghyun Son, Euntae Choi, and Sungjoo Yoo. NSNQuant: A double normalization approach for calibration-free low-bit vector quantization of KV cache. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

2025

[63] [63]

Accurate kv cache quantization with outlier tokens tracing

Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, and Min Zhang. Accurate kv cache quantization with outlier tokens tracing. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12895–12915, 2025

2025

[64] [64]

Massive activations in large language models

Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. InFirst Conference on Language Modeling, 2024

2024

[65] [65]

Flatquant: Flatness matters for LLM quantization

Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, JiaxinHu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, and Jun Yao. Flatquant: Flatness matters for LLM quantization. In Forty-second International Conference on Machine Learning, 2025

2025

[66] [66]

CUTLASS, 2023

Vijay Thakkar, Pradeep Ramani, Cris Cecka, Aniket Shivam, Honghao Lu, Ethan Yan, Jack Kosaian, Mark Hoemmen, Haicheng Wu, Andrew Kerr, Matt Nicely, Duane Merrill, Dustyn Blasig, Aditya Atluri, Fengqi Qiao, Piotr Majcher, Paul Springer, Markus Hohnerbach, Jin Wang, and Manish Gupta. CUTLASS, 2023. 13

2023

[67] [67]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[68] [68]

QuIP$\#$: Even better LLM quantization with hadamard incoherence and lattice codebooks

Albert Tseng, Jerry Chee, Qingyao Sun, V olodymyr Kuleshov, and Christopher De Sa. QuIP$\#$: Even better LLM quantization with hadamard incoherence and lattice codebooks. InForty-first International Conference on Machine Learning, 2024

2024

[69] [69]

Qtip: Quantization with trellises and incoherence processing.Advances in Neural Information Processing Systems, 37:59597–59620, 2024

Albert Tseng, Qingyao Sun, David Hou, and Christopher De. Qtip: Quantization with trellises and incoherence processing.Advances in Neural Information Processing Systems, 37:59597–59620, 2024

2024

[70] [70]

Gptvq: The blessing of dimensionality in llm quantization.arXiv preprint arXiv:2402.15319, 2024

Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, and Paul Whatmough. Gptvq: The blessing of dimensionality in llm quantization.arXiv preprint arXiv:2402.15319, 2024

work page arXiv 2024

[71] [71]

Bitnet: Scaling 1-bit transformers for large language models, 2023

Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models, 2023

2023

[72] [72]

Sinder: Repairing the singular defects of dinov2

Haoqi Wang, Tong Zhang, and Mathieu Salzmann. Sinder: Repairing the singular defects of dinov2. In European Conference on Computer Vision, pages 20–35. Springer, 2024

2024

[73] [73]

Demystifying singular defects in large language models

Haoqi Wang, Tong Zhang, and Mathieu Salzmann. Demystifying singular defects in large language models. InForty-second International Conference on Machine Learning, 2025

2025

[74] [74]

Outlier suppression: Pushing the limit of low-bit transformer language models

Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit transformer language models. Advances in Neural Information Processing Systems, 35:17402–17414, 2022

2022

[75] [75]

Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling

Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1648–1665, Singapore, 2023. Association fo...

2023

[76] [76]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

2020

[77] [77]

DFRot: Achieving outlier-free and massive activation-free for rotated LLMs with refined rotation

Jingyang Xiang and Sai Qian Zhang. DFRot: Achieving outlier-free and massive activation-free for rotated LLMs with refined rotation. InSecond Conference on Language Modeling, 2025

2025

[78] [78]

SmoothQuant: Accurate and efficient post-training quantization for large language models

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. InProceedings of the 40th International Conference on Machine Learning, 2023

2023

[79] [79]

Onebit: Towards extremely low-bit large language models

Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. Onebit: Towards extremely low-bit large language models. InAdvances in Neural Information Processing Systems, pages 66357–66382, 2024

2024

[80] [80]

CRVQ: Channel-relaxed vector quantization for extreme compression of LLMs.Transactions of the Association for Computational Linguistics (TACL), 13: 1488–1506, 2025

Yuzhuang Xu, Shiyu Ji, Qingfu Zhu, and Wanxiang Che. CRVQ: Channel-relaxed vector quantization for extreme compression of LLMs.Transactions of the Association for Computational Linguistics (TACL), 13: 1488–1506, 2025

2025