pith. sign in

arxiv: 2606.07116 · v1 · pith:OKPKPDNAnew · submitted 2026-06-05 · 💻 cs.LG · cs.AI· cs.CL

OffQ: Taming Structured Outliers in LLM Quantization by Offsetting

Pith reviewed 2026-06-27 22:35 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL
keywords LLM quantizationactivation outliersoffsetting mechanismuniform quantization4-bit inferencePCA rotationW4A4KV4structured outliers
0
0 comments X

The pith

Rotating activations to pack outliers into one channel then absorbing it via shared offset enables uniform 4-bit LLM quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that activation outliers in large language models follow a low-dimensional structured pattern that can be isolated and neutralized without complex per-tensor scaling. It does so by applying top-1 PCA to locate the outlier subspace, rotating the activations to concentrate extreme values into a single channel, and converting that channel's magnitude into a shared offset subtracted across the tensor. A sympathetic reader would care because this would let standard uniform-grid quantizers operate at 4 bits on weights, activations, and KV caches while keeping accuracy close to full precision. If the claim holds, deployment on hardware limited to uniform precision becomes simpler and more memory-efficient.

Core claim

OffQ first identifies a low-dimensional outlier subspace in the activations using a proposed top-1 PCA, and then concentrates high-magnitude activations into 1 channel via rotation. OffQ then absorbs this concentrated outlier channel by converting its magnitude into a shared offset, thereby reducing the standard deviation of the activations. This offsetting strategy enables effective W4A4KV4 quantization of LLMs using deployment-friendly uniform-grid and uniform-precision quantization.

What carries the argument

The offsetting mechanism that isolates the outlier subspace via top-1 PCA, rotates to concentrate high-magnitude values into one channel, and converts that channel into a shared offset.

If this is right

  • LLMs achieve effective 4-bit quantization for weights, activations, and KV caches using only uniform grids and uniform precision.
  • The method avoids per-tensor adjustments that would defeat the uniform-grid goal.
  • Accuracy improves consistently over prior outlier-handling baselines across multiple LLM architectures and evaluation benchmarks.
  • Low-bit efficiency is preserved because the offset is shared and requires no additional per-channel storage or computation at inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same rotation-plus-offset pattern might apply to outlier patterns in non-transformer models such as convolutional networks.
  • Hardware designs could add native support for a single shared offset per tensor to further speed up the resulting 4-bit kernels.
  • Testing the method at 3-bit or 2-bit widths would show whether the concentration step scales before the offset itself becomes too coarse.

Load-bearing premise

The low-dimensional outlier subspace identified by top-1 PCA can be rotated to concentrate all high-magnitude values into a single channel whose magnitude can be removed by a shared offset without unacceptable distortion.

What would settle it

Quantizing an LLM such as Llama to W4A4KV4 after the rotation and offset step and measuring whether perplexity on WikiText or accuracy on MMLU drops by more than a few percent relative to the 16-bit baseline.

Figures

Figures reproduced from arXiv: 2606.07116 by Haoqi Wang, Jiawei Zhuang, Lorenz K. Mueller, Lukas Cavigelli, Mathieu Salzmann.

Figure 1
Figure 1. Figure 1: Visualization of the outlier concentration of the top-1 PCA and the subsequent offsetting [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The overall quantization pipeline of OffQ. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Effect of group size on quantization performance of OffQ on Llama 3-8B with a hidden [PITH_FULL_IMAGE:figures/full_fig_p009_3.png] view at source ↗
read the original abstract

Low-bit quantization has been widely adopted to accelerate the inference of large language models (LLMs) by significantly reducing computational cost and memory usage. However, activation outliers pose a major challenge to effective quantization, often leading to notable performance degradation. In this paper, we introduce OffQ, a method designed to mitigate activation outliers in low-bit quantization through a novel offsetting mechanism. Specifically, OffQ first identifies a low-dimensional outlier subspace in the activations using a proposed top-1 PCA, and then concentrates high-magnitude activations into 1 channel via rotation. OffQ then absorbs this concentrated outlier channel by converting its magnitude into a shared offset, thereby reducing the standard deviation of the activations. This offsetting strategy enables effective W4A4KV4 quantization of LLMs using deployment-friendly uniform-grid and uniform-precision quantization. Extensive experiments across diverse LLM architectures and benchmarks demonstrate that OffQ outperforms state-of-the-art baselines, consistently improving model accuracy while preserving low-bit efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript introduces OffQ, a quantization technique for LLMs that identifies a low-dimensional outlier subspace in activations via top-1 PCA, rotates the tensor to concentrate high-magnitude values into a single channel, and absorbs that channel by converting its magnitude into a shared offset subtracted from the entire tensor. This is claimed to reduce activation standard deviation enough to enable effective W4A4KV4 quantization under deployment-friendly uniform-grid and uniform-precision schemes, with experiments showing consistent accuracy gains over baselines across LLM architectures and benchmarks.

Significance. If the core offsetting mechanism holds without introducing unacceptable distortion or requiring non-uniform adjustments, the result would be significant for practical LLM deployment, as it would allow standard uniform 4-bit quantization to achieve competitive accuracy without custom hardware or per-tensor zero-points.

major comments (1)
  1. [Method description (top-1 PCA and rotation step)] Method description (top-1 PCA and rotation step): the central claim that top-1 PCA isolates the outlier subspace such that a single rotation concentrates every high-magnitude activation into one channel (whose value can then be replaced by a shared offset) is load-bearing but unsupported by any derivation or analysis. If outlier energy has components orthogonal to the top eigenvector, post-rotation channels will retain large values; subtracting one shared offset cannot recenter all of them without either increasing range in other channels or requiring per-channel adjustments, directly contradicting the uniform-grid, uniform-precision premise.
minor comments (1)
  1. [Abstract] Abstract: the statement that experiments demonstrate consistent improvement would be strengthened by explicit mention of the specific models, bit-width configurations, and evaluation metrics used.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. Below we respond point-by-point to the single major comment.

read point-by-point responses
  1. Referee: Method description (top-1 PCA and rotation step): the central claim that top-1 PCA isolates the outlier subspace such that a single rotation concentrates every high-magnitude activation into one channel (whose value can then be replaced by a shared offset) is load-bearing but unsupported by any derivation or analysis. If outlier energy has components orthogonal to the top eigenvector, post-rotation channels will retain large values; subtracting one shared offset cannot recenter all of them without either increasing range in other channels or requiring per-channel adjustments, directly contradicting the uniform-grid, uniform-precision premise.

    Authors: We agree that the manuscript presents the top-1 PCA choice primarily through empirical motivation rather than a formal derivation or proof that all outlier energy lies exactly along the leading eigenvector. The approach is grounded in the documented low-rank structure of activation outliers in LLMs (as referenced in the related-work section), and the rotation is defined to align the dominant principal direction with a single coordinate axis. We will revise the method section to include (i) a quantitative report of the fraction of activation variance captured by the top-1 component across layers and models and (ii) a short discussion of the residual approximation error when the outlier subspace is not perfectly one-dimensional. Regarding the orthogonal-component concern, the shared-offset step is applied after rotation precisely to recenter the now-concentrated channel; any residual orthogonal energy remains in the other channels but, per our measurements, exhibits substantially lower magnitude, allowing the subsequent uniform 4-bit grid to operate without per-channel zero-points. The experimental tables demonstrate that this suffices for the reported accuracy recovery. We therefore view the referee’s observation as identifying a useful clarification rather than an outright contradiction, and the planned additions will make the supporting evidence explicit. revision: partial

Circularity Check

0 steps flagged

No circularity: procedural method with no self-referential equations or load-bearing self-citations

full rationale

The paper presents OffQ as an algorithmic pipeline (top-1 PCA to identify outlier subspace, rotation to concentrate into one channel, conversion of that channel's magnitude into a shared offset) without any equations, fitted parameters renamed as predictions, or self-citations that justify the core steps. The abstract and description treat the approach as a sequence of operations whose effectiveness is evaluated empirically on benchmarks, not derived from first principles that loop back to the inputs. No load-bearing claim reduces by construction to its own definition or prior self-citation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no free parameters, axioms, or invented entities are identifiable from the provided text.

pith-pipeline@v0.9.1-grok · 5708 in / 953 out tokens · 19458 ms · 2026-06-27T22:35:27.204594+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

90 extracted references · 7 canonical work pages · 4 internal anchors

  1. [1]

    Framequant: Flexible low-bit quantization for transformers

    Harshavardhan Adepu, Zhanpeng Zeng, Li Zhang, and Vikas Singh. Framequant: Flexible low-bit quantization for transformers. InProceedings of International Conference on Machine Learning (ICML), 2024

  2. [2]

    KurTail : Kurtosis-based LLM quantization

    Mohammad Sadegh Akhondzadeh, Aleksandar Bojchevski, Evangelos Eleftheriou, and Martino Dazzi. KurTail : Kurtosis-based LLM quantization. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 17404–17419, Suzhou, China, 2025. Association for Computational Linguistics

  3. [3]

    Quantization error propagation: Revisiting layer-wise post-training quantization

    Yamato Arai and Yuma Ichikawa. Quantization error propagation: Revisiting layer-wise post-training quantization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  4. [4]

    Quantization error propagation: Revisiting layer-wise post-training quantization

    Yamato Arai and Yuma Ichikawa. Quantization error propagation: Revisiting layer-wise post-training quantization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  5. [5]

    Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman

    Saleh Ashkboos, Maximilian L. Croci, Marcelo Gennari do Nascimento, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. InThe Twelfth International Conference on Learning Representations, 2024

  6. [6]

    QUIK: Towards end-to-end 4-bit inference on generative large language models

    Saleh Ashkboos, Ilia Markov, Elias Frantar, Tingxuan Zhong, Xincheng Wang, Jie Ren, Torsten Hoefler, and Dan Alistarh. QUIK: Towards end-to-end 4-bit inference on generative large language models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3355–3371, Miami, Florida, USA, 2024. Association for Computati...

  7. [7]

    Quarot: Outlier-free 4-bit inference in rotated llms

    Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms. Advances in Neural Information Processing Systems, 37:100213–100240, 2024

  8. [8]

    Castro, Torsten Hoefler, and Dan Alistarh

    Saleh Ashkboos, Mahdi Nikdan, Soroush Tabesh, Roberto L. Castro, Torsten Hoefler, and Dan Alistarh. HALO: Hadamard-assisted lower-precision optimization for LLMs. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  9. [9]

    Piqa: Reasoning about physical commonsense in natural language

    Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. Piqa: Reasoning about physical commonsense in natural language. InThirty-Fourth AAAI Conference on Artificial Intelligence, 2020

  10. [10]

    Understanding and overcoming the chal- lenges of efficient transformer quantization

    Yelysei Bondarenko, Markus Nagel, and Tijmen Blankevoort. Understanding and overcoming the chal- lenges of efficient transformer quantization. InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 7947–7969, 2021

  11. [11]

    PyramidKV: Dynamic KV cache compression based on pyramidal information funneling

    Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and Wen Xiao. PyramidKV: Dynamic KV cache compression based on pyramidal information funneling. InSecond Conference on Language Modeling, 2025

  12. [12]

    Quip: 2-bit quantization of large language models with guarantees, 2024

    Jerry Chee, Yaohui Cai, V olodymyr Kuleshov, and Christopher De Sa. Quip: 2-bit quantization of large language models with guarantees, 2024

  13. [13]

    Efficien- tQAT: Efficient quantization-aware training for large language models

    Mengzhao Chen, Wenqi Shao, Peng Xu, Jiahao Wang, Peng Gao, Kaipeng Zhang, and Ping Luo. Efficien- tQAT: Efficient quantization-aware training for large language models. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10081–10100, Vienna, Austria, 2025. Association for Computational L...

  14. [14]

    Rotate, clip, and partition: Towards W2A4KV4 quantization by integrating rotation and learnable non-uniform quantizer

    Euntae Choi, Sumin Song, Woosang Lim, and Sungjoo Yoo. Rotate, clip, and partition: Towards W2A4KV4 quantization by integrating rotation and learnable non-uniform quantizer. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 7568–7590, Suzhou, China, 2025. Association for Computational Linguistics

  15. [15]

    BoolQ: Exploring the surprising difficulty of natural yes/no questions

    Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. BoolQ: Exploring the surprising difficulty of natural yes/no questions. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), ...

  16. [16]

    Association for Computational Linguistics. 10

  17. [17]

    Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge, 2018

  18. [18]

    Vision transformers need registers

    Timothée Darcet, Maxime Oquab, Julien Mairal, and Piotr Bojanowski. Vision transformers need registers. InThe Twelfth International Conference on Learning Representations, 2024

  19. [19]

    Llm.int8(): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35:30318–30332, 2022

    Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. Llm.int8(): 8-bit matrix multiplication for transformers at scale.Advances in neural information processing systems, 35:30318–30332, 2022

  20. [20]

    Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh

    Tim Dettmers, Ruslan A. Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. SpQR: A sparse-quantized representation for near- lossless LLM weight compression. InThe Twelfth International Conference on Learning Representations, 2024

  21. [21]

    BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation

    DaYou Du, Yijia Zhang, Shijie Cao, Jiaqi Guo, Ting Cao, Xiaowen Chu, and Ningyi Xu. BitDistiller: Unleashing the potential of sub-4-bit LLMs via self-distillation. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 102–116, Bangkok, Thailand, 2024. Association for Computational Linguistics

  22. [22]

    OPTQ: Accurate quantization for generative pre-trained transformers

    Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. OPTQ: Accurate quantization for generative pre-trained transformers. InThe Eleventh International Conference on Learning Representations, 2023

  23. [23]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al- Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  24. [24]

    LQ-loRA: Low-rank plus quantized matrix decomposition for efficient language model finetuning

    Han Guo, Philip Greengard, Eric Xing, and Yoon Kim. LQ-loRA: Low-rank plus quantized matrix decomposition for efficient language model finetuning. InThe Twelfth International Conference on Learning Representations, 2024

  25. [25]

    Zipcache: Accurate and efficient KV cache quantization with salient token identification

    Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. Zipcache: Accurate and efficient KV cache quantization with salient token identification. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  26. [26]

    Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami

    Coleman Richard Charles Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W. Mahoney, Sophia Shao, Kurt Keutzer, and Amir Gholami. KVQuant: Towards 10 million context length LLM inference with KV cache quantization. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  27. [27]

    Mlwq: Efficient small language model deployment via multi-level weight quantization

    Chun Hu, Junhui He, Shangyu Wu, Yuxin He, Chun Jason Xue, and Qingan Li. Mlwq: Efficient small language model deployment via multi-level weight quantization. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 8078–8088, 2025

  28. [28]

    OSTQuant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting

    Xing Hu, Yuan Cheng, Dawei Yang, Zhixuan Chen, Zukang Xu, JiangyongYu, XUCHEN, Zhihang Yuan, Zhe jiang, and Sifan Zhou. OSTQuant: Refining large language model quantization with orthogonal and scaling transformations for better distribution fitting. InThe Thirteenth International Conference on Learning Representations, 2025

  29. [29]

    Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291, 2024

    Wei Huang, Yangdong Liu, Haotong Qin, Ying Li, Shiming Zhang, Xianglong Liu, Michele Magno, and Xi- aojuan Qi. Billm: Pushing the limit of post-training quantization for llms.arXiv preprint arXiv:2402.04291, 2024

  30. [30]

    SliM-LLM: Salience-driven mixed-precision quantization for large language models

    Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, and Xiaojuan Qi. SliM-LLM: Salience-driven mixed-precision quantization for large language models. InProceedings of the 42nd International Conference on Machine Learning, pages 25672–25692. PMLR, 2025

  31. [31]

    Quantization and training of neural networks for efficient integer- arithmetic-only inference

    Benoit Jacob, Skirmantas Kligys, Bo Chen, Menglong Zhu, Matthew Tang, Andrew Howard, Hartwig Adam, and Dmitry Kalenichenko. Quantization and training of neural networks for efficient integer- arithmetic-only inference. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2704–2713, 2018

  32. [32]

    Understanding and improving knowledge distillation for quantization aware training of large transformer encoders

    Minsoo Kim, Sihwa Lee, Suk-Jin Hong, Du-Seong Chang, and Jungwook Choi. Understanding and improving knowledge distillation for quantization aware training of large transformer encoders. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 6713–6725, Abu Dhabi, United Arab Emirates, 2022. Association for Computat...

  33. [33]

    Mahoney, and Kurt Keutzer

    Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM: Dense and sparse quantization, 2024

  34. [34]

    Mahoney, and Kurt Keutzer

    Sehoon Kim, Coleman Richard Charles Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. SqueezeLLM: Dense-and-sparse quantization. InProceedings of the 41st International Conference on Machine Learning, pages 23901–23923. PMLR, 2024

  35. [35]

    Quantizing deep convolutional networks for efficient inference: A whitepaper

    Raghuraman Krishnamoorthi. Quantizing deep convolutional networks for efficient inference: A whitepaper. arXiv preprint arXiv:1806.08342, 2018

  36. [36]

    give me BF16 or give me death

    Eldar Kurtic, Alexandre Noll Marques, Shubhra Pandit, Mark Kurtz, and Dan Alistarh. “give me BF16 or give me death”? accuracy-performance trade-offs in LLM quantization. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26872–26886, Vienna, Austria, 2025. Association for Computational ...

  37. [37]

    Flexround: Learnable rounding based on element-wise division for post-training quantization

    Jung Hyun Lee, Jeonghoon Kim, Se Jung Kwon, and Dongsoo Lee. Flexround: Learnable rounding based on element-wise division for post-training quantization. InInternational Conference on Machine Learning, pages 18913–18939. PMLR, 2023

  38. [38]

    Rethinking residual errors in compensation-based LLM quantization

    Shuaiting Li, Juncan Deng, Kedong Xu, Rongtao Deng, Hong Gu, Minghan Jiang, Haibin Shen, and Kejie Huang. Rethinking residual errors in compensation-based LLM quantization. InThe Fourteenth International Conference on Learning Representations, 2026

  39. [39]

    GPTAQ: Efficient finetuning-free quantization for asymmetric calibration

    Yuhang Li, Ruokai Yin, Donghyun Lee, Shiting Xiao, and Priyadarshini Panda. GPTAQ: Efficient finetuning-free quantization for asymmetric calibration. InForty-second International Conference on Machine Learning, 2025

  40. [40]

    Duquant: Distributing outliers via dual transformation makes stronger quantized llms

    Haokun Lin, Haobo Xu, Yichen Wu, Jingzhi Cui, Yingtao Zhang, Linzhan Mou, Linqi Song, Zhenan Sun, and Ying Wei. Duquant: Distributing outliers via dual transformation makes stronger quantized llms. Advances in Neural Information Processing Systems, 37:87766–87800, 2024

  41. [41]

    Awq: Activation-aware weight quantization for llm compression and acceleration

    Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. InMLSys, 2024

  42. [42]

    Qserve: W4a8kv4 quantization and system co-design for efficient llm serving.Proceedings of Machine Learning and Systems, 7, 2025

    Yujun Lin, Haotian Tang, Shang Yang, Zhekai Zhang, Guangxuan Xiao, Chuang Gan, and Song Han. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving.Proceedings of Machine Learning and Systems, 7, 2025

  43. [43]

    QLLM: Accurate and efficient low-bitwidth quantization for large language models

    Jing Liu, Ruihao Gong, Xiuying Wei, Zhiwei Dong, Jianfei Cai, and Bohan Zhuang. QLLM: Accurate and efficient low-bitwidth quantization for large language models. InInternational Conference on Learning Representations (ICLR), 2024

  44. [44]

    Vptq: Extreme low-bit vector post-training quantization for large language models

    Yifei Liu, Jicheng Wen, Yang Wang, Shengyu Ye, Li Lyna Zhang, Ting Cao, Cheng Li, and Mao Yang. Vptq: Extreme low-bit vector post-training quantization for large language models. InThe 2024 Conference on Empirical Methods in Natural Language Processing, 2024

  45. [45]

    LLM-QAT: Data-free quantization aware training for large language models

    Zechun Liu, Barlas Oguz, Changsheng Zhao, Ernie Chang, Pierre Stock, Yashar Mehdad, Yangyang Shi, Raghuraman Krishnamoorthi, and Vikas Chandra. LLM-QAT: Data-free quantization aware training for large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 467–484, Bangkok, Thailand, 2024. Association for Computationa...

  46. [46]

    Kivi: A tuning-free asymmetric 2bit quantization for kv cache

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. InInternational Conference on Machine Learning, pages 32332–32344. PMLR, 2024

  47. [47]

    Spinquant: LLM quantization with learned rotations

    Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. Spinquant: LLM quantization with learned rotations. InThe Thirteenth International Conference on Learning Representations, 2025

  48. [48]

    Affinequant: Affine transformation quantization for large language models

    Yuexiao Ma, Huixia Li, Xiawu Zheng, Feng Ling, Xuefeng Xiao, Rui Wang, Shilei Wen, Fei Chao, and Rongrong Ji. Affinequant: Affine transformation quantization for large language models. InThe Twelfth International Conference on Learning Representations, 2024

  49. [49]

    Pointer sentinel mixture models

    Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. In International Conference on Learning Representations, 2017. 12

  50. [50]

    Llama 3.2: Revolutionizing edge ai and vision with open, customizable models

    Meta AI. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. https: //ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/ , 2024. Accessed: 2026-05-05

  51. [51]

    Can a suit of armor conduct electricity? a new dataset for open book question answering

    Todor Mihaylov, Peter Clark, Tushar Khot, and Ashish Sabharwal. Can a suit of armor conduct electricity? a new dataset for open book question answering. InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2381–2391, Brussels, Belgium, 2018. Association for Computational Linguistics

  52. [52]

    Muller, Philippe Bich, Jiawei Zhuang, Ahmet Celik, Luca Benfenati, and Lukas Cavigelli

    Lorenz K. Muller, Philippe Bich, Jiawei Zhuang, Ahmet Celik, Luca Benfenati, and Lukas Cavigelli. Sinq: Sinkhorn-normalized quantization for calibration-free low-precision llm weights, 2025

  53. [53]

    Self-distilled quantization: Achieving high compression rates in transformer-based language models

    James O’Neill and Sourav Dutta. Self-distilled quantization: Achieving high compression rates in transformer-based language models. InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1329–1339, Toronto, Canada, 2023. Association for Computational Linguistics

  54. [54]

    Automatic differentiation in pytorch

    Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in pytorch. 2017

  55. [55]

    Qwen2.5 technical report, 2025

    Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

  56. [56]

    WinoGrande: An Adversarial Winograd Schema Challenge at Scale

    Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. Winogrande: An adversarial winograd schema challenge at scale.arXiv preprint arXiv:1907.10641, 2019

  57. [57]

    Social iqa: Commonsense reasoning about social interactions

    Maarten Sap, Hannah Rashkin, Derek Chen, Ronan Le Bras, and Yejin Choi. Social iqa: Commonsense reasoning about social interactions. InProceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 4463–4473, 2019

  58. [58]

    Nestquant: nested lattice quantization for matrix products and LLMs

    Semyon Savkin, Eitan Porat, Or Ordentlich, and Yury Polyanskiy. Nestquant: nested lattice quantization for matrix products and LLMs. InForty-second International Conference on Machine Learning, 2025

  59. [59]

    Resq: Mixed-precision quantization of large language models with low-rank residuals

    Utkarsh Saxena, Sayeh Sharify, Kaushik Roy, and Xin Wang. Resq: Mixed-precision quantization of large language models with low-rank residuals. InForty-second International Conference on Machine Learning, 2025

  60. [60]

    Omniquant: Omnidirectionally calibrated quantization for large language models

    Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. Omniquant: Omnidirectionally calibrated quantization for large language models. InThe Twelfth International Conference on Learning Representations, 2024

  61. [61]

    Dartquant: Efficient rotational distribution calibration for LLM quantization

    Yuantian Shao, Yuanteng Chen, Peisong Wang, Jianlin Yu, Jing Lin, Yiwu Yao, Zhihui Wei, and Jian Cheng. Dartquant: Efficient rotational distribution calibration for LLM quantization. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  62. [62]

    NSNQuant: A double normalization approach for calibration-free low-bit vector quantization of KV cache

    Donghyun Son, Euntae Choi, and Sungjoo Yoo. NSNQuant: A double normalization approach for calibration-free low-bit vector quantization of KV cache. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  63. [63]

    Accurate kv cache quantization with outlier tokens tracing

    Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, and Min Zhang. Accurate kv cache quantization with outlier tokens tracing. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12895–12915, 2025

  64. [64]

    Massive activations in large language models

    Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. InFirst Conference on Language Modeling, 2024

  65. [65]

    Flatquant: Flatness matters for LLM quantization

    Yuxuan Sun, Ruikang Liu, Haoli Bai, Han Bao, Kang Zhao, Yuening Li, JiaxinHu, Xianzhi Yu, Lu Hou, Chun Yuan, Xin Jiang, Wulong Liu, and Jun Yao. Flatquant: Flatness matters for LLM quantization. In Forty-second International Conference on Machine Learning, 2025

  66. [66]

    CUTLASS, 2023

    Vijay Thakkar, Pradeep Ramani, Cris Cecka, Aniket Shivam, Honghao Lu, Ethan Yan, Jack Kosaian, Mark Hoemmen, Haicheng Wu, Andrew Kerr, Matt Nicely, Duane Merrill, Dustyn Blasig, Aditya Atluri, Fengqi Qiao, Piotr Majcher, Paul Springer, Markus Hohnerbach, Jin Wang, and Manish Gupta. CUTLASS, 2023. 13

  67. [67]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models.arXiv preprint arXiv:2307.09288, 2023

  68. [68]

    QuIP$\#$: Even better LLM quantization with hadamard incoherence and lattice codebooks

    Albert Tseng, Jerry Chee, Qingyao Sun, V olodymyr Kuleshov, and Christopher De Sa. QuIP$\#$: Even better LLM quantization with hadamard incoherence and lattice codebooks. InForty-first International Conference on Machine Learning, 2024

  69. [69]

    Qtip: Quantization with trellises and incoherence processing.Advances in Neural Information Processing Systems, 37:59597–59620, 2024

    Albert Tseng, Qingyao Sun, David Hou, and Christopher De. Qtip: Quantization with trellises and incoherence processing.Advances in Neural Information Processing Systems, 37:59597–59620, 2024

  70. [70]

    Gptvq: The blessing of dimensionality in llm quantization.arXiv preprint arXiv:2402.15319, 2024

    Mart van Baalen, Andrey Kuzmin, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, and Paul Whatmough. Gptvq: The blessing of dimensionality in llm quantization.arXiv preprint arXiv:2402.15319, 2024

  71. [71]

    Bitnet: Scaling 1-bit transformers for large language models, 2023

    Hongyu Wang, Shuming Ma, Li Dong, Shaohan Huang, Huaijie Wang, Lingxiao Ma, Fan Yang, Ruiping Wang, Yi Wu, and Furu Wei. Bitnet: Scaling 1-bit transformers for large language models, 2023

  72. [72]

    Sinder: Repairing the singular defects of dinov2

    Haoqi Wang, Tong Zhang, and Mathieu Salzmann. Sinder: Repairing the singular defects of dinov2. In European Conference on Computer Vision, pages 20–35. Springer, 2024

  73. [73]

    Demystifying singular defects in large language models

    Haoqi Wang, Tong Zhang, and Mathieu Salzmann. Demystifying singular defects in large language models. InForty-second International Conference on Machine Learning, 2025

  74. [74]

    Outlier suppression: Pushing the limit of low-bit transformer language models

    Xiuying Wei, Yunchen Zhang, Xiangguo Zhang, Ruihao Gong, Shanghang Zhang, Qi Zhang, Fengwei Yu, and Xianglong Liu. Outlier suppression: Pushing the limit of low-bit transformer language models. Advances in Neural Information Processing Systems, 35:17402–17414, 2022

  75. [75]

    Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling

    Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1648–1665, Singapore, 2023. Association fo...

  76. [76]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

  77. [77]

    DFRot: Achieving outlier-free and massive activation-free for rotated LLMs with refined rotation

    Jingyang Xiang and Sai Qian Zhang. DFRot: Achieving outlier-free and massive activation-free for rotated LLMs with refined rotation. InSecond Conference on Language Modeling, 2025

  78. [78]

    SmoothQuant: Accurate and efficient post-training quantization for large language models

    Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. InProceedings of the 40th International Conference on Machine Learning, 2023

  79. [79]

    Onebit: Towards extremely low-bit large language models

    Yuzhuang Xu, Xu Han, Zonghan Yang, Shuo Wang, Qingfu Zhu, Zhiyuan Liu, Weidong Liu, and Wanxiang Che. Onebit: Towards extremely low-bit large language models. InAdvances in Neural Information Processing Systems, pages 66357–66382, 2024

  80. [80]

    CRVQ: Channel-relaxed vector quantization for extreme compression of LLMs.Transactions of the Association for Computational Linguistics (TACL), 13: 1488–1506, 2025

    Yuzhuang Xu, Shiyu Ji, Qingfu Zhu, and Wanxiang Che. CRVQ: Channel-relaxed vector quantization for extreme compression of LLMs.Transactions of the Association for Computational Linguistics (TACL), 13: 1488–1506, 2025

Showing first 80 references.