Statistically-Lossless Quantization of Large Language Models
Pith reviewed 2026-05-09 16:17 UTC · model grok-4.3
The pith
Large language models can be compressed below 4 bits per parameter while keeping task accuracy within natural sampling variance and leaving output distributions nearly identical.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using a layer-wise non-uniform method with asymmetric quantization and wide bitwidth search, the approach reaches task-lossless compression at well below 4 bits per parameter (as low as 3.3 bits depending on the model), distribution-lossless compression at 5 to 6 bits per parameter on average, and inference speedups of 1.7 to 3.6 times relative to FP16 with optimized kernels.
What carries the argument
SLQ, a layer-wise search procedure that selects non-uniform asymmetric bitwidths per layer, together with the Expected Acceptance Rate metric for measuring token-distribution agreement and the gamma-squared variance law showing that symmetric quantization inflates noise variance by a factor of gamma squared relative to asymmetric quantization.
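The search itself is only named here, not specified. A minimal, hypothetical greedy variant is sketched below, assuming a per-layer fidelity_drop callback measured on calibration data; the callback, candidate bitwidths, and threshold are illustrative, and SLQ's actual objective and procedure are those defined in the paper.

```python
# Hypothetical greedy layer-wise bitwidth allocation (illustrative only):
# start every layer at the widest candidate bitwidth, then give each layer the
# narrowest bitwidth whose estimated fidelity loss on calibration data stays
# under a per-layer budget. SLQ's real search objective and candidate grids
# are defined in the paper, not reproduced here.

def greedy_bitwidth_search(layers, candidate_bits, fidelity_drop, target_drop):
    """layers: layer names; candidate_bits: sorted ascending, e.g. [3, 4, 5, 6, 8];
    fidelity_drop(layer, bits): estimated fidelity loss (e.g. 1 - EAR) if `layer`
    is quantized to `bits` (hypothetical calibration-set callback);
    target_drop: per-layer budget, e.g. 1 - 0.99 for an EAR >= 0.99 target."""
    allocation = {layer: candidate_bits[-1] for layer in layers}  # start widest
    for layer in layers:
        for bits in candidate_bits:  # try the narrowest width first
            if fidelity_drop(layer, bits) <= target_drop:
                allocation[layer] = bits
                break
    return allocation

# Toy usage with a made-up per-layer sensitivity table.
drops = {("attn", 3): 0.02, ("attn", 4): 0.005, ("mlp", 3): 0.004}
allocation = greedy_bitwidth_search(
    layers=["attn", "mlp"],
    candidate_bits=[3, 4, 5, 6, 8],
    fidelity_drop=lambda layer, bits: drops.get((layer, bits), 0.0),
    target_drop=0.01,
)
print(allocation)  # {'attn': 4, 'mlp': 3}
```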
If this is right
- Zero-shot benchmark accuracy stays within natural sampling variance at bit rates below 4 per parameter.
- Next-token output distributions remain practically indistinguishable at 5 to 6 bits per parameter on average.
- Inference runs 1.7 to 3.6 times faster than full-precision floating point with suitable kernels.
- Symmetric quantization cannot achieve distribution-level fidelity because it inflates noise variance by a factor of gamma squared relative to asymmetric quantization (see the numerical sketch below).
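A minimal numerical sketch of that last point, assuming a skewed Gaussian weight distribution and round-to-nearest uniform grids; the distribution, bitwidths, and helper function are illustrative, and the paper's gamma and its proof are not reproduced here. Forcing the grid onto a zero-centered range, as symmetric quantization does, visibly inflates the error variance on shifted weights:

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed "weights": most of the mass sits on one side of zero.
w = rng.normal(loc=0.5, scale=0.2, size=100_000)

def quantize(w, bits, symmetric):
    """Round-to-nearest uniform quantization; symmetric forces a zero-centered range."""
    qmax = 2**bits - 1
    if symmetric:
        a = np.abs(w).max()                   # range [-a, a], centered at zero
        scale, zero = 2 * a / qmax, a
    else:
        scale = (w.max() - w.min()) / qmax    # range fitted to the actual support
        zero = -w.min()
    q = np.clip(np.round((w + zero) / scale), 0, qmax)
    return q * scale - zero

for bits in (3, 4, 5):
    var_sym = np.var(w - quantize(w, bits, symmetric=True))
    var_asym = np.var(w - quantize(w, bits, symmetric=False))
    print(f"{bits} bits: symmetric/asymmetric error-variance ratio = {var_sym / var_asym:.2f}")
```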
Where Pith is reading between the lines
- The same layer-wise asymmetric search could be tested on other neural architectures such as vision transformers to check for similar compression gains.
- Hardware designers might prioritize support for asymmetric low-bit kernels given the measured speedups.
- Refining the search objective to target even lower average bits while holding the agreement metric fixed would be a direct next experiment.
Load-bearing premise
That matching benchmark accuracy within natural sampling variance and reaching high token agreement rates means the quantized model is practically indistinguishable from the original in real use, and that the per-layer search results will generalize beyond the tested models.
What would settle it
Applying the method to a previously unseen large language model or benchmark set and observing either task accuracy falling outside sampling variance or the token agreement rate dropping below the chosen threshold such as 0.99.
Original abstract
Model quantization has become essential for efficient large language model deployment, yet existing approaches involve clear trade-offs: methods such as GPTQ and AWQ achieve practical compression but are lossy, while lossless techniques preserve fidelity but typically do not accelerate inference. This paper explores the middle ground of statistically-lossless compression through three complementary notions of losslessness for quantized LLMs. First, task-lossless compression preserves zero-shot benchmark accuracy within natural sampling variance and remains achievable at aggressive bitwidths. Second, we formalize the stricter notion of distribution-lossless compression, requiring the quantized model's next-token distribution to be practically indistinguishable from the original, and propose the Expected Acceptance Rate (EAR), the maximum token-agreement probability under optimal coupling, as a directly interpretable fidelity metric (for example, EAR >= 0.99 indicates 99% agreement). Third, we prove a gamma-squared variance law showing that symmetric quantization inflates noise variance by gamma squared relative to asymmetric quantization, making asymmetry necessary for distribution-lossless fidelity but not for task-level preservation. Using SLQ, a layer-wise non-uniform method with asymmetric quantization and wide bitwidth search, we achieve task-lossless compression at well below 4 bits per parameter (as low as 3.3 bits depending on the model), distribution-lossless compression at 5 to 6 bits per parameter on average, and inference speedups of 1.7 to 3.6x relative to FP16 with optimized kernels. Source code is available at https://github.com/IST-DASLab/SLQ.
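The EAR definition quoted above has a direct closed form for a pair of next-token distributions: under an optimal coupling, the agreement probability at each position is the total overlap sum_t min(p_t, q_t). A minimal sketch of that computation follows; the function name and toy arrays are illustrative, not the released implementation.

```python
import numpy as np

def expected_acceptance_rate(p, q):
    """Average optimal-coupling agreement between two next-token distributions.

    p, q: arrays of shape (positions, vocab) whose rows each sum to 1.
    Per position, the best possible agreement probability is sum_t min(p_t, q_t);
    EAR averages this over positions (sketch, not the paper's code).
    """
    return float(np.minimum(p, q).sum(axis=-1).mean())

# Toy check: identical rows give EAR = 1.0; a small probability shift gives 0.9.
p = np.array([[0.7, 0.2, 0.1]])
q = np.array([[0.6, 0.3, 0.1]])
print(expected_acceptance_rate(p, p))  # 1.0
print(expected_acceptance_rate(p, q))  # 0.9
```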
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces three main contributions to statistically-lossless quantization for LLMs: task-lossless compression (zero-shot accuracy preserved within sampling variance), distribution-lossless compression (next-token distributions practically indistinguishable, measured by the Expected Acceptance Rate, EAR), and a proof of the gamma-squared variance law establishing that symmetric quantization inflates noise variance by a gamma^2 factor relative to asymmetric quantization. It proposes the SLQ method (layer-wise non-uniform asymmetric quantization with a wide bitwidth search) and reports task-lossless compression at as low as 3.3 bits/parameter, distribution-lossless compression at 5-6 bits on average, and 1.7-3.6x inference speedups vs FP16, with public code.
Significance. If the central claims hold, the work is significant for bridging lossy practical quantization (e.g., GPTQ/AWQ) and strictly lossless methods by providing theoretical justification for asymmetry via the variance law, a new interpretable fidelity metric (EAR), and practical speed gains with optimized kernels. Explicit credit is due for the machine-checked-style proof of the gamma-squared law, reproducible code release, and falsifiable predictions on bitwidth thresholds.
major comments (3)
- [§4] §4 (gamma-squared variance law proof): the derivation of the gamma^2 inflation factor assumes per-dimension independence and specific moment conditions on the weight distribution; it is unclear whether these hold under the layer-wise non-uniform quantization used in SLQ, which could weaken the claim that asymmetry is strictly necessary for distribution-lossless results.
- [§5.3] §5.3 and Algorithm 1 (layer-wise bitwidth search): bitwidth allocation is optimized to satisfy EAR or accuracy targets on the same calibration data used to compute quantization statistics and evaluate zero-shot benchmarks; no held-out validation set, cross-model testing, or ablation on search regularization is reported, which directly bears on whether the 3.3-bit task-lossless and 5-6-bit distribution-lossless claims generalize beyond the evaluated models.
- [Table 3] Table 3 (bitwidth and EAR results): the reported average bits and EAR values for distribution-lossless regime lack error bars or sensitivity analysis to calibration prompt choice; given that EAR is defined via optimal coupling, small changes in token statistics could alter the 0.99 threshold crossings.
minor comments (3)
- [§3.1] Notation for EAR in §3.1 is introduced without an explicit equation reference; add Eq. number for the maximum token-agreement probability under optimal coupling.
- [Figure 2] Figure 2 (speedup vs bitwidth): axis labels and legend are too small for print; increase font size and add a horizontal line at 1.0x for FP16 reference.
- [Related Work] The related work section lacks a reference to prior work on asymmetric quantization that would motivate the gamma^2 analysis.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We address each major point below and indicate planned revisions to clarify assumptions, strengthen generalization evidence, and add robustness analysis.
Point-by-point responses
- Referee: [§4] §4 (gamma-squared variance law proof): the derivation of the gamma^2 inflation factor assumes per-dimension independence and specific moment conditions on the weight distribution; it is unclear whether these hold under the layer-wise non-uniform quantization used in SLQ, which could weaken the claim that asymmetry is strictly necessary for distribution-lossless results.
Authors: The gamma-squared variance law is derived as a general property of symmetric versus asymmetric quantization noise under the standard assumptions of per-dimension error independence and finite second moments on the weights; these conditions are satisfied by the per-weight quantization step in SLQ even when the overall procedure is layer-wise and non-uniform. The law itself does not depend on uniformity or global model structure. Empirical results further support necessity of asymmetry, as symmetric variants fail to reach the distribution-lossless regime at comparable bit rates. We will revise §4 to restate the assumptions explicitly and confirm their applicability to SLQ. revision: partial
- Referee: [§5.3] §5.3 and Algorithm 1 (layer-wise bitwidth search): bitwidth allocation is optimized to satisfy EAR or accuracy targets on the same calibration data used to compute quantization statistics and evaluate zero-shot benchmarks; no held-out validation set, cross-model testing, or ablation on search regularization is reported, which directly bears on whether the 3.3-bit task-lossless and 5-6-bit distribution-lossless claims generalize beyond the evaluated models.
Authors: The search in Algorithm 1 uses the calibration set only to compute statistics and to decide per-layer bitwidths that meet the EAR/accuracy targets on that set; final zero-shot benchmark numbers are always reported on completely disjoint evaluation suites. Results already span multiple model families and sizes with consistent bitwidth thresholds. To further address generalization concerns we will add an ablation varying calibration sets and search regularization strength, plus explicit cross-model validation numbers. revision: partial
- Referee: [Table 3] Table 3 (bitwidth and EAR results): the reported average bits and EAR values for distribution-lossless regime lack error bars or sensitivity analysis to calibration prompt choice; given that EAR is defined via optimal coupling, small changes in token statistics could alter the 0.99 threshold crossings.
Authors: EAR values are computed as averages over >10k tokens per model. We will add error bars derived from multiple random calibration subsets and a short sensitivity study showing that the 0.99 threshold crossings remain stable under prompt variations typical of the calibration distribution. revision: yes
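A minimal sketch of the kind of resampling check described in that last response (an illustration in Pith's voice, not the authors' protocol): given per-token agreement values sum_t min(p_t, q_t) on a calibration set, error bars for EAR can be estimated from random subsets.

```python
import numpy as np

def ear_error_bars(per_token_agreement, n_subsets=20, subset_frac=0.5, seed=0):
    """Mean and spread of EAR over random calibration subsets (illustrative).

    per_token_agreement: 1-D array of per-position values sum_t min(p_t, q_t).
    """
    rng = np.random.default_rng(seed)
    n = len(per_token_agreement)
    k = max(1, int(subset_frac * n))
    means = [rng.choice(per_token_agreement, size=k, replace=False).mean()
             for _ in range(n_subsets)]
    return float(np.mean(means)), float(np.std(means))

# Toy usage on synthetic agreement values clustered near 0.99.
values = np.clip(np.random.default_rng(1).normal(0.992, 0.01, size=10_000), 0.0, 1.0)
mean, std = ear_error_bars(values)
print(f"EAR = {mean:.4f} +/- {std:.4f}")
```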
Circularity Check
No significant circularity; derivation and claims are self-contained
full rationale
The paper's central elements consist of a mathematical proof of the gamma-squared variance law (independent of any fitted parameters or data), an empirical layer-wise search procedure for bitwidth allocation using calibration statistics to meet EAR/accuracy targets, and reported outcomes on zero-shot benchmarks. No load-bearing step reduces by the paper's own equations to a self-defined quantity, a fitted input renamed as prediction, or a self-citation chain. The search optimizes allocations to satisfy explicit fidelity thresholds on calibration data, but the task-lossless and distribution-lossless claims are presented as measured results rather than derivations forced by construction. This is the expected non-finding for an empirical compression method with an accompanying proof.
Axiom & Free-Parameter Ledger
free parameters (1)
- per-layer bitwidths
axioms (1)
- domain assumption: symmetric quantization inflates noise variance by gamma squared relative to asymmetric quantization
invented entities (2)
- Expected Acceptance Rate (EAR): no independent evidence
- SLQ quantization method: no independent evidence
Forward citations
Cited by 1 Pith paper
- Grid Games: The Power of Multiple Grids for Quantizing Large Language Models. Allowing each quantization group to select among multiple 4-bit grids improves accuracy over single-grid FP4 for both post-training and pre-training of LLMs.
Reference graph
Works this paper leans on
- [1] Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report. arXiv preprint arXiv:2309.16609.
- [2] Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, and James Zou. The price reversal phenomenon: When cheaper reasoning models end up costing more. arXiv preprint arXiv:2603.23971.
- [3] Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- [4] Elias Frantar, Roberto L. Castro, Jiale Chen, Torsten Hoefler, and Dan Alistarh. MARLIN: Mixed-precision auto-regressive parallel inference on large language models. arXiv preprint arXiv:2408.11743.
- [5] Moshik Hershcovitch, Andrew Wood, Leshem Choshen, Guy Girmonsky, Roy Leibovitz, Or Ozeri, Ilias Ennmouri, Michal Malka, Peter Chin, Swaminathan Sundararaman, et al. ZipNN: Lossless compression for AI models. In 2025 IEEE 18th International Conference on Cloud Computing (CLOUD), pp. 186–198. IEEE.
- [6] Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. arXiv preprint arXiv:2405.16406, 2024.
- [7] Tommaso Pegolotti, Elias Frantar, Dan Alistarh, and Markus Püschel. QIGen: Generating efficient kernels for quantized inference on large language models. arXiv preprint arXiv:2307.03738.
- [8] Mingjie Sun et al. Efficient Shapley value-based non-uniform pruning of large language models. arXiv preprint arXiv:2505.01731.
- [9] Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. In Proceedings of the 41st International Conference on Machine Learning. PMLR, 2024; Albert Tseng, Qingyao Sun, David Hou, and Christopher De Sa. QTIP: Quantization with trellises and...; Model-preserving adaptive rounding.
- [10] Mart Van Baalen, Andrey Kuzmin, Ivan Koryakovskiy, Markus Nagel, Peter Couperus, Cedric Bastoul, Eric Mahurin, Tijmen Blankevoort, and Paul Whatmough. GPTVQ: The blessing of dimensionality for LLM quantization. arXiv preprint arXiv:2402.15319.
- [11] Zeyu Yang, Tianyi Zhang, Jianwen Xie, Chuan Li, Zhaozhuo Xu, and Anshumali Shrivastava. To compress or not? Pushing the frontier of lossless GenAI model weights compression with exponent concentration. arXiv preprint arXiv:2510.02676.
- [12] Lu Yin, You Wu, Zhenyu Zhang, Cheng-Yu Hsieh, Yaqing Wang, Yiling Jia, Gen Li, Ajay Jaiswal, Mykola Pechenizkiy, Yi Liang, et al. Outlier weighed layerwise sparsity (OWL): A missing secret sauce for pruning LLMs to high sparsity. arXiv preprint arXiv:2310.05175.
- [13] Yizhou Zhang, Yongchao Dong, and Kenji Kawaguchi. Investigating layer importance in large language models. arXiv preprint arXiv:2409.14381. URL https://openreview.net/forum?id=xdNAVP7TGy.
- [14] Junchen Zhao, Ali Derakhshan, Dushyant Bharadwaj, Jayden Kana Hyman, Junhao Dong, Sangeetha Abdu Jyothi, and Ian Harris. IMPQ: Interaction-aware layerwise mixed precision quantization for LLMs. arXiv preprint arXiv:2509.15455.