Recognition: no theorem link
AAAC: Activation-Aware Adaptive Codebooks for 4-bit LLM Weight Quantization
Pith reviewed 2026-05-12 01:03 UTC · model grok-4.3
The pith
Two learned scalar codebooks per layer, chosen by activation-weighted error and encoded with no extra storage, let 4-bit LLM quantization finish in minutes while beating fixed-codebook baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that two scalar codebooks totaling 64 bytes, learned per layer and selected group-wise to minimize activation-weighted reconstruction error, with the selection bit-packed into the otherwise-unused sign of the positive scale, produce lower quantization error than fixed grids while requiring only minutes of computation on one GPU and no storage beyond the quantized model weights.
What carries the argument
The pair of learned scalar codebooks per layer whose selection for each weight group is decided by activation-weighted reconstruction error and stored without overhead in the sign bit of the scale factor.
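The selection mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, shapes, and the max-abs scale rule are our assumptions; only the structure (two 16-entry codebooks, activation-weighted error, choice packed into the scale's sign) comes from the summary.

```python
import numpy as np

def select_codebook(group, act_sq, codebooks):
    """Pick, per weight group, whichever of two 16-entry scalar codebooks
    gives the lower activation-weighted reconstruction error.

    group:     (g,) weights in one quantization group
    act_sq:    (g,) mean squared activation per channel (calibration statistic)
    codebooks: (2, 16) learned scalar codebooks for this layer
    """
    best_err, best = np.inf, None
    for idx, cb in enumerate(codebooks):
        # Positive scale mapping the group's range onto the codebook's range
        # (max-abs rule assumed for illustration).
        scale = np.max(np.abs(group)) / np.max(np.abs(cb))
        # Nearest-entry assignment: each weight gets a 4-bit code.
        codes = np.argmin(np.abs(group[:, None] - scale * cb[None, :]), axis=1)
        recon = scale * cb[codes]
        # Activation-weighted reconstruction error.
        err = np.sum(act_sq * (group - recon) ** 2)
        if err < best_err:
            best_err, best = err, (idx, scale, codes)
    idx, scale, codes = best
    # Encode the codebook choice in the sign of the otherwise-positive scale:
    # zero extra storage.
    stored_scale = -scale if idx == 1 else scale
    return stored_scale, codes
```

A group whose values line up with codebook 0 comes back with a positive stored scale; one matching codebook 1 comes back with a negative stored scale, and the decoder recovers both the scale and the choice from that single signed value.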
If this is right
- 4-bit quantized models retain higher downstream accuracy than those produced by fixed-codebook methods.
- Quantization of large models becomes feasible on modest hardware without multi-hour runtimes.
- No additional memory footprint is introduced beyond the quantized weights themselves.
- The same selection logic can be applied uniformly across different model families.
Where Pith is reading between the lines
- The activation-weighted selection rule may extend naturally to mixed-precision or per-token quantization settings.
- If activation statistics shift between training and deployment, the same lightweight selection could be recomputed on the fly without retraining the codebooks.
- The approach hints that many quantization artifacts arise from mismatch between a single global grid and layer-specific weight-activation correlations.
Load-bearing premise
Two learned scalar codebooks per layer are sufficient to represent the weight distributions needed for low reconstruction error when the choice is guided by activation-weighted error.
What would settle it
A side-by-side run on the same model and benchmark showing that this method yields worse final task accuracy (or higher perplexity) than a gradient-assisted baseline run for hours would disprove the accuracy claim.
read the original abstract
Post-training weight-only quantization to 4 bits is widely used to reduce the memory and compute costs of large language model inference. Existing PTQ methods, such as AWQ and GPTQ, improve how weights are mapped onto a fixed 4-bit grid through scaling, clipping, or error compensation. To further improve accuracy, methods such as OmniQuant and QuIP# use gradient-assisted algorithms at the cost of hours of quantization time. In this work, we propose AAAC (Activation-Aware Adaptive Codebooks), a lightweight method for 4-bit LLM weight quantization. AAAC replaces the fixed scalar codebook used in standard quantization with two small learned scalar codebooks (64 bytes) per layer. Each group of weights selects the codebook that minimizes activation-weighted reconstruction error, encoding the choice in the unused sign bit of the group's positive scale and adding zero storage overhead. AAAC completes in 3-30 minutes on a single GPU and adds no memory beyond the model itself. We evaluate against AWQ, GPTQ, IF4, GPTVQ, OmniQuant, SqueezeLLM, and QuIP# across model families. AAAC outperforms baselines at orders-of-magnitude less quantization time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AAAC, a post-training 4-bit weight quantization method for LLMs that replaces the standard fixed scalar codebook with two small learned scalar codebooks (64 bytes total) per layer. Per-group codebook selection is performed by minimizing activation-weighted reconstruction error, with the choice encoded in the unused sign bit of the positive scale factor at zero storage overhead. The method is claimed to run in 3-30 minutes on a single GPU with no added memory and to outperform AWQ, GPTQ, IF4, GPTVQ, OmniQuant, SqueezeLLM, and QuIP# across model families.
Significance. If the empirical gains hold under rigorous validation, AAAC would offer a practical compromise between the speed of fixed-grid PTQ methods and the higher accuracy (but much higher cost) of gradient-assisted approaches such as OmniQuant and QuIP#, while preserving the zero-overhead property of the encoding trick. The core idea of activation-aware adaptive codebooks is a modest but potentially useful algorithmic increment.
major comments (3)
- [Abstract / Method] Abstract and method description: the central claim that two learned scalar codebooks per layer suffice to capture necessary weight distributions without introducing new artifacts rests on an unvalidated assumption. No ablation is referenced that tests whether two codebooks remain adequate for layers exhibiting multimodality, heavy tails, or structured outliers, nor is there a convergence criterion or initialization procedure for the codebook learning step.
- [Abstract] Abstract: the assertion that AAAC 'outperforms baselines' is stated without any numerical results, error bars, per-model tables, or statistical tests in the provided summary. Because the accuracy advantage is the primary load-bearing claim, the absence of concrete evidence (e.g., perplexity or zero-shot accuracy deltas versus AWQ/GPTQ) prevents assessment of whether the activation-weighted selection actually reduces residual error below optimized fixed-grid baselines.
- [Method] Method: the paper states that codebook selection adds 'zero storage overhead' by using the sign bit, yet provides no analysis of whether this encoding remains robust when the positive scale itself is near zero or when quantization noise interacts with the sign-bit decision across groups.
minor comments (1)
- [Abstract] The abstract lists seven baselines but does not indicate whether all were re-implemented under identical settings or whether published numbers were used; a short clarification on experimental protocol would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications and note the revisions we will incorporate.
read point-by-point responses
-
Referee: [Abstract / Method] Abstract and method description: the central claim that two learned scalar codebooks per layer suffice to capture necessary weight distributions without introducing new artifacts rests on an unvalidated assumption. No ablation is referenced that tests whether two codebooks remain adequate for layers exhibiting multimodality, heavy tails, or structured outliers, nor is there a convergence criterion or initialization procedure for the codebook learning step.
Authors: We appreciate the referee's point on validation. Our empirical results across multiple model families show that two codebooks per layer consistently capture the relevant weight distributions without introducing artifacts, as evidenced by the reported accuracy gains. To strengthen this, we will add an ablation study varying the number of codebooks (1 vs. 2 vs. 4) on representative layers and include details on the codebook learning procedure: initialization via k-means clustering on a random subset of weights and convergence after a fixed number of iterations or when the weighted reconstruction error stabilizes. revision: yes
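The learning step the authors describe (k-means initialization, iterate until the weighted error stabilizes) can be sketched as a 1-D weighted Lloyd iteration. This is our reading of the rebuttal, not the paper's exact procedure; the initialization-by-sampling and fixed iteration count are assumptions.

```python
import numpy as np

def learn_codebook(weights, importance, k=16, iters=20, seed=0):
    """1-D importance-weighted k-means (Lloyd) sketch of the codebook
    learning step: sample k weights as initial centroids, then alternate
    nearest-centroid assignment and weighted centroid updates."""
    rng = np.random.default_rng(seed)
    centers = np.sort(rng.choice(weights, size=k, replace=False))
    for _ in range(iters):
        # Assign each weight to its nearest codebook entry.
        assign = np.argmin(np.abs(weights[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                # Importance-weighted centroid update (activation-aware).
                centers[j] = np.average(weights[mask], weights=importance[mask])
    return np.sort(centers)
```

With uniform importance this reduces to ordinary scalar k-means; non-uniform importance pulls codebook entries toward weights whose channels see large activations, which is the "activation-aware" part of the design.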
-
Referee: [Abstract] Abstract: the assertion that AAAC 'outperforms baselines' is stated without any numerical results, error bars, per-model tables, or statistical tests in the provided summary. Because the accuracy advantage is the primary load-bearing claim, the absence of concrete evidence (e.g., perplexity or zero-shot accuracy deltas versus AWQ/GPTQ) prevents assessment of whether the activation-weighted selection actually reduces residual error below optimized fixed-grid baselines.
Authors: The abstract is a high-level summary of the work. The full manuscript contains detailed numerical evidence in Tables 1-4 and Figures 2-5, including perplexity and zero-shot accuracy values, deltas versus AWQ/GPTQ and other baselines, and discussions of the improvements. We will revise the abstract to briefly reference the scale of the observed gains to make the claim more self-contained. revision: partial
-
Referee: [Method] Method: the paper states that codebook selection adds 'zero storage overhead' by using the sign bit, yet provides no analysis of whether this encoding remains robust when the positive scale itself is near zero or when quantization noise interacts with the sign-bit decision across groups.
Authors: The scale factor is computed as a positive value (e.g., based on the maximum absolute weight per group), so the sign bit is always available for encoding the codebook choice. Groups with near-zero scales have negligible impact on the model output, and the codebook selection is based on activation-weighted error computed before final encoding. We will add a short robustness discussion in the method section, including analysis of small-scale cases and verification that quantization noise does not affect the sign-bit decision in practice. revision: yes
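The encode/decode round trip at issue is simple enough to state directly. A minimal sketch, assuming fp16-stored scales (the function names are ours); as the rebuttal notes, it relies on the scale being strictly positive, since a scale of exactly zero would have no distinguishable sign bit to carry the choice.

```python
import numpy as np

def encode_scale(scale, codebook_idx):
    # The per-group scale is constructed positive (e.g. max-abs based),
    # so its sign bit is free: a negative stored scale means "codebook 1".
    assert scale > 0, "encoding requires a strictly positive scale"
    s = np.float16(scale)
    return np.float16(-s) if codebook_idx == 1 else s

def decode_scale(stored):
    # Recover (positive scale, codebook index) from the signed fp16 value.
    idx = 1 if stored < 0 else 0
    return np.float16(abs(stored)), idx
```

Note that the round trip is exact for any representable fp16 scale, because flipping the sign bit never perturbs the magnitude; the referee's near-zero concern is about the *impact* of tiny-scale groups, not about the bit itself being lost.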
Circularity Check
No circularity: AAAC is an independent algorithmic proposal with empirical evaluation.
full rationale
The paper introduces AAAC as a new PTQ method that substitutes the standard fixed 4-bit scalar codebook with two small learned scalar codebooks (64 bytes) per layer, where each weight group selects the codebook minimizing activation-weighted reconstruction error (encoded in the scale sign bit). No equations or claims in the provided text reduce the reported accuracy or runtime gains to quantities defined by the same fitted codebooks or by self-citation. The method is presented as a lightweight algorithmic change evaluated directly against external baselines (AWQ, GPTQ, OmniQuant, etc.) on multiple model families. The central claim of outperformance at 3-30 min quantization time rests on the empirical results of this change rather than any self-referential definition, fitted-input prediction, or imported uniqueness theorem. This is a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
free parameters (1)
- two scalar codebooks per layer
Reference graph
Works this paper leans on
- [1] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. 2023.
- [2] Extreme Compression of Large Language Models via Additive Quantization. 2024.
- [3] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. 2023.
- [4]
- [5] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. 2024.
- [6]
- [7]
- [8]
- [9] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 2023.
- [10] LLaMA: Open and Efficient Foundation Language Models. 2023.
- [11] GPTVQ: The Blessing of Dimensionality for LLM Quantization. 2025.
- [12] OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. 2024.
- [13]
- [14] QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. 2024.
- [15] Fast Inference from Transformers via Speculative Decoding. 2023.
- [16] Accelerating Large Language Model Decoding with Speculative Sampling. 2023.
- [17] QSpec: Speculative Decoding with Complementary Quantization Schemes. 2025.
- [18] Gao, Leo; Tow, Jonathan; Abbasi, Baber; Biderman, Stella; Black, Sid; DiPofi, Anthony; Foster, Charles; Golding, Laurence; Hsu, Jeffrey; Le Noac'h, Alain; Li, Haonan; McDonell, Kyle; Muennighoff, Niklas; Ociepa, Chris; Phang, Jason; Reynolds, Laria; Schoelkopf, Hailey; Skowron, Aviya; Sutawika, Lintang... doi:10.5281/zenodo.12608602.
- [19] Pretraining Large Language Models with NVFP4. 2026.