Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization
Pith reviewed 2026-05-10 17:30 UTC · model grok-4.3
The pith
Codebook initialization determines the optimization basin in extreme LLM quantization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The dominant bottleneck in additive quantization at extreme rates is codebook initialization, not the subsequent optimization steps. Greedy sequential initialization frequently traps the process in poor local optima that beam search and PV-tuning cannot escape. The representational ratio ρ = N/KM quantifies when this happens; the bottleneck becomes critical at 2 bits per parameter (bpp). OA-EM, an output-aware expectation-maximization initialization that employs a Hessian-weighted Mahalanobis distance, consistently yields superior codebooks that lead to better final performance after tuning across multiple model architectures.
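Under these definitions, ρ and the implied bit rate are easy to compute. A minimal sketch (the variable naming below is an assumption, since this summary does not fix which of K and M denotes codebook count versus codebook size):

```python
import math

def representational_ratio(n_groups, n_codebooks, codebook_size):
    # rho = N / (K * M): weight groups per unit of codebook capacity.
    # Larger rho means more groups compete for the same codebook capacity,
    # which is where the review says initialization becomes critical.
    return n_groups / (n_codebooks * codebook_size)

def bits_per_parameter(n_codebooks, codebook_size, group_size):
    # In additive quantization, each group of `group_size` weights stores
    # one log2(codebook_size)-bit index per codebook.
    return n_codebooks * math.log2(codebook_size) / group_size

# Illustrative numbers: a 4096x4096 layer in groups of 8 weights,
# two 256-entry codebooks -> 2.0 bpp.
n_groups = 4096 * 4096 // 8
print(bits_per_parameter(2, 256, 8))                  # 2.0
print(representational_ratio(n_groups, 2, 256))       # 4096.0
```

The point of the sketch is only that ρ grows with model size at a fixed codebook budget, so the extreme-compression regime (2 bpp) is exactly where ρ is largest.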
What carries the argument
The output-aware EM (OA-EM) initialization that selects codebook vectors according to their Hessian-weighted Mahalanobis distance to model outputs, thereby steering the optimizer toward better basins before beam search and PV-tuning begin.
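One EM step under this distance can be sketched as follows. Shapes, function names, and the use of a single shared proxy Hessian per layer are illustrative assumptions, not the paper's API:

```python
import numpy as np

def hessian_mahalanobis_assign(W, C, H):
    """E-step sketch: assign each weight group to the codeword minimising
    the Hessian-weighted Mahalanobis distance
        d(w, c) = (w - c)^T H (w - c),
    so directions that matter more for the layer output (large H) are
    matched more tightly than a plain Euclidean k-means would.
    W: (n, d) weight groups, C: (k, d) codewords, H: (d, d) proxy Hessian."""
    diff = W[:, None, :] - C[None, :, :]               # (n, k, d)
    dist = np.einsum('nkd,de,nke->nk', diff, H, diff)  # quadratic form per pair
    return dist.argmin(axis=1)

def m_step(W, assign, k):
    # M-step: with a single shared positive-definite H, the H-weighted
    # least-squares minimiser for each codeword is just the mean of its
    # assigned groups (H cancels in the normal equations).
    new_C = []
    for j in range(k):
        members = W[assign == j]
        # Re-seed empty codewords from a random group (a common k-means fix).
        new_C.append(members.mean(axis=0) if len(members)
                     else W[np.random.randint(len(W))])
    return np.stack(new_C)
```

Iterating these two steps to convergence gives an output-aware codebook before beam search and PV-tuning ever run, which is the sense in which initialization steers the basin.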
If this is right
- At 2 bpp, poor initialization can increase perplexity by orders of magnitude, while better starts keep degradation modest.
- OA-EM improves the quality-compute trade-off across search budgets and three model architectures.
- The performance gap between initializations widens as the representational ratio ρ increases.
- The method works without changing the downstream beam search or tuning procedures.
- The same initialization sensitivity appears in both Llama and Qwen architectures.
Where Pith is reading between the lines
- Optimization landscapes for quantized LLMs contain separated basins that initialization must navigate before search begins.
- Similar initialization bottlenecks may exist in other additive or product quantization schemes at low rates.
- Adaptive initialization that monitors ρ during training could generalize the approach beyond fixed compression targets.
- The geometry insight suggests that post-training quantization methods should allocate more compute to initialization than to later refinement stages.
Load-bearing premise
That Hessian-weighted Mahalanobis distance reliably identifies which initial codebooks will lead to better final optimization basins after search and tuning.
What would settle it
Apply both OA-EM and standard greedy initialization to the same model at 2 bpp, run identical beam search and PV-tuning budgets on both, and measure final perplexity; a large, consistent gap favoring OA-EM supports the claim, while comparable results would falsify it.
read the original abstract
Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and finetuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
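For context, the O(1) lookup-table dequantisation the abstract refers to can be sketched as follows (shapes and names are illustrative): in additive quantization each group is reconstructed by summing one codeword per codebook, rather than concatenating them as in product quantization.

```python
import numpy as np

def aq_dequantize(codebooks, codes):
    """Lookup-table dequantisation for additive quantization.
    codebooks: (M, K, d) -- M codebooks, each with K codewords of dim d
    codes:     (n, M)    -- for each of n groups, one index per codebook
    Returns (n, d) reconstructed weight groups. Each group costs M table
    lookups and M-1 adds -- constant per group, independent of model size."""
    M = codebooks.shape[0]
    # Gather one codeword per codebook for every group, then sum over codebooks.
    return sum(codebooks[m][codes[:, m]] for m in range(M))
```

The constant per-group cost is what makes the scheme attractive for edge inference: decoding never touches floating-point weights larger than the codebooks themselves.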
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that codebook initialisation is the dominant bottleneck in additive quantization for extreme LLM compression. Greedy sequential initialisation often places models in poor optimisation basins that beam search and PV-tuning cannot overcome, with the problem severity scaling with the representational ratio ρ = N/KM. The authors propose OA-EM, an output-aware EM initialisation using Hessian-weighted Mahalanobis distance, and report that it produces better solutions after PV-tuning, dominating the quality-compute frontier across Llama 3.2 3B, Llama 3.1 8B, and Qwen 2.5 3B at multiple bit rates, with particularly large gains at 2 bpp.
Significance. If the empirical results hold, the work is significant for edge deployment of LLMs because it identifies a practical, low-cost lever (initialisation) that improves extreme quantization quality without increasing search budget or tuning compute. The multi-architecture validation and explicit scaling with ρ provide a useful diagnostic for when initialisation matters most. The emphasis on optimisation geometry in compressed spaces is a constructive contribution to quantization research.
major comments (1)
- Abstract: the central claim that 'poor initialisation can degrade perplexity by orders of magnitude' at 2 bpp and that OA-EM 'dominates the quality-compute frontier' is load-bearing, yet the abstract provides no quantitative deltas, error bars, number of runs, or exact baseline definitions, leaving the magnitude and reliability of the reported gains unverified from the given text.
minor comments (1)
- The abstract uses the notation 'ρ = N/KM' but renders it as 'r{ho}', which is a typesetting error that should be corrected to the Greek letter rho.
Simulated Author's Rebuttal
We thank the referee for their positive evaluation of the work's significance for edge LLM deployment and for the constructive feedback on the abstract. We respond to the major comment below and will incorporate revisions accordingly.
read point-by-point responses
-
Referee: Abstract: the central claim that 'poor initialisation can degrade perplexity by orders of magnitude' at 2 bpp and that OA-EM 'dominates the quality-compute frontier' is load-bearing, yet the abstract provides no quantitative deltas, error bars, number of runs, or exact baseline definitions, leaving the magnitude and reliability of the reported gains unverified from the given text.
Authors: We agree that the abstract would be strengthened by embedding specific quantitative support for these claims. The full manuscript reports concrete perplexity values demonstrating degradations by multiple orders of magnitude at 2 bpp under greedy sequential initialization (detailed in Tables 1-3 and Figures 2-4 across Llama 3.2 3B, Llama 3.1 8B, and Qwen 2.5 3B), with OA-EM yielding consistent improvements that dominate the quality-compute frontier at fixed search budgets and after PV-tuning. We will revise the abstract to include key deltas (e.g., the observed perplexity ranges and relative gains), explicit baseline definitions (greedy sequential initialization vs. OA-EM), and clarification of the experimental setup (single-run results with fixed random seeds for reproducibility across the three architectures and multiple bit rates). This makes the abstract self-contained while preserving brevity. No additional experiments are required.
Revision: yes
Circularity Check
No significant circularity detected
full rationale
The paper's central claims—that codebook initialisation is the dominant bottleneck whose severity scales with representational ratio ρ, and that OA-EM reaches better basins—are grounded in empirical results across three independent model architectures and multiple bit rates. No equations reduce to self-definitions or fitted inputs by construction, no load-bearing self-citations are invoked for uniqueness or ansatzes, and ρ functions as an analytical descriptor rather than a circular construct. The derivation chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Artem Babenko and Victor Lempitsky. 2014. Additive quantization for extreme vector compression. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition
2014
-
[2]
Yonatan Bisk, Rowan Zellers, Ronan Le Bras, Jianfeng Gao, and Yejin Choi. 2020. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI Conference on Artificial Intelligence
2020
-
[3]
Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. 2023. QuIP: 2-bit quantization of large language models with guarantees. In Advances in Neural Information Processing Systems (NeurIPS)
2023
-
[4]
Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have solved question answering? Try ARC, the AI2 Reasoning Challenge. arXiv preprint arXiv:1803.05457
2018
-
[5]
Arthur P. Dempster, Nan M. Laird, and Donald B. Rubin. 1977. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society: Series B, 39(1):1--22
1977
-
[6]
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems (NeurIPS)
2022
-
[7]
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. QLoRA: Efficient finetuning of quantized LLMs. In Advances in Neural Information Processing Systems (NeurIPS)
2023
-
[8]
Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. 2024. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. In Proceedings of the International Conference on Learning Representations (ICLR)
2024
-
[9]
Vage Egiazarian, Andrei Panferov, Denis Kuznedelev, Elias Frantar, Artem Babenko, and Dan Alistarh. 2024. Extreme compression of large language models via additive quantization. In Proceedings of the International Conference on Machine Learning (ICML)
2024
-
[10]
Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. 2023. GPTQ: Accurate post-training quantization for generative pre-trained transformers. In Proceedings of the International Conference on Learning Representations (ICLR)
2023
-
[11]
Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2023. A framework for few-shot language model evaluation. V0.4.0
2023
-
[12]
Amir Gholami, Sehoon Kim, Zhen Dong, Zhewei Yao, Michael W. Mahoney, and Kurt Keutzer. 2021. A survey of quantization methods for efficient neural network inference. arXiv preprint arXiv:2103.13630
2021
-
[13]
Aaron Grattafiori et al. 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783
2024
-
[14]
Hervé Jégou, Matthijs Douze, and Cordelia Schmid. 2011. Product quantization for nearest neighbor search. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(1):117--128
2011
-
[15]
Sehoon Kim, Coleman Hooper, Amir Gholami, Zhen Dong, Xiuyu Li, Sheng Shen, Michael W. Mahoney, and Kurt Keutzer. 2024. SqueezeLLM: Dense-and-sparse quantization. In Proceedings of the International Conference on Machine Learning (ICML)
2024
-
[16]
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. 2024. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. In Proceedings of the Conference on Machine Learning and Systems (MLSys)
2024
-
[17]
Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. 2025. SpinQuant: LLM quantization with learned rotations. In Proceedings of the International Conference on Learning Representations (ICLR)
2025
-
[18]
Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, and Dan Alistarh. 2025. https://doi.org/10.18653/v1/2025.naacl-long.543 HIGGS: Pushing the limits of large language model quantization via the linearity theorem. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational...
-
[19]
Vladimir Malinovskii, Denis Mazur, Ivan Ilin, Denis Kuznedelev, Konstantin Burlachenko, Kai Yi, Dan Alistarh, and Peter Richtárik. 2024. PV-Tuning: Beyond straight-through estimation for extreme LLM compression. In Advances in Neural Information Processing Systems (NeurIPS)
2024
-
[20]
Julieta Martinez, Shobhit Zakhmi, Holger H. Hoos, and James J. Little. 2018. LSQ++: Lower running time and higher recall in multi-codebook quantization. In Proceedings of the European Conference on Computer Vision (ECCV)
2018
-
[21]
Xu Ouyang, Tao Ge, Thomas Hartvigsen, Zhisong Zhang, Haitao Mi, and Dong Yu. 2025. https://doi.org/10.18653/v1/2025.acl-long.1555 Low-bit quantization favors undertrained LLMs. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 32338--32348, Vienna, Austria. Association for Computati...
-
[22]
Denis Paperno, Germán Kruszewski, Angeliki Lazaridou, Ngoc Quan Pham, Raffaella Bernardi, Sandro Pezzelle, Marco Baroni, Gemma Boleda, and Raquel Fernández. 2016. https://doi.org/10.18653/v1/P16-1144 The LAMBADA dataset: Word prediction requiring a broad discourse context. In Proceedings of the 54th Annual Meeting of the Association for Computati...
-
[23]
Qwen Team. 2024. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115
2024
-
[24]
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J. Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. JMLR, 21(140):1--67
2020
-
[25]
Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. WinoGrande: An adversarial Winograd schema challenge at scale. Communications of the ACM, 64(9):99--106
2021
-
[26]
Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. 2024. OmniQuant: Omnidirectionally calibrated quantization for large language models. In Proceedings of the International Conference on Learning Representations (ICLR)
2024
-
[27]
Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. 2024a. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. In Proceedings of the International Conference on Machine Learning (ICML)
2024
-
[28]
Albert Tseng, Qingyao Sun, David Hou, and Christopher De Sa. 2024b. QTIP: Quantization with trellises and incoherence processing. In Advances in Neural Information Processing Systems (NeurIPS)
2024
-
[30]
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. SmoothQuant: Accurate and efficient post-training quantization for large language models. In Proceedings of the International Conference on Machine Learning (ICML)
2023
-
[31]
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. https://doi.org/10.18653/v1/P19-1472 HellaSwag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791--4800, Florence, Italy. Association for Computational Linguistics
-
[32]
Xi Zhang, Xiaolin Wu, Jiamang Wang, and Weisi Lin. 2025. Learning grouped lattice vector quantizers for low-bit LLM compression. In Advances in Neural Information Processing Systems (NeurIPS)
2025
discussion (0)