Recognition: no theorem link
AAAC: Activation-Aware Adaptive Codebooks for 4-bit LLM Weight Quantization
Pith reviewed 2026-05-12 01:03 UTC · model grok-4.3
The pith
Two learned scalar codebooks per layer, chosen by activation-weighted error and encoded with no extra storage, let 4-bit LLM quantization finish in minutes while beating fixed-codebook baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that two scalar codebooks totaling 64 bytes, learned per layer and selected group-wise to minimize activation-weighted reconstruction error, with the selection bit-packed into the otherwise-unused sign of the positive scale, produce lower quantization error than fixed grids while requiring only minutes of computation on one GPU and no storage beyond the quantized model weights.
What carries the argument
The pair of learned scalar codebooks per layer whose selection for each weight group is decided by activation-weighted reconstruction error and stored without overhead in the sign bit of the scale factor.
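The selection mechanism described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, shapes, and the max-abs scale rule are our assumptions; only the structure (two 16-entry codebooks, activation-weighted error, choice packed into the scale's sign) comes from the summary.

```python
import numpy as np

def select_codebook(group, act_sq, codebooks):
    """Pick, per weight group, whichever of two 16-entry scalar codebooks
    gives the lower activation-weighted reconstruction error.

    group:     (g,) weights in one quantization group
    act_sq:    (g,) mean squared activation per channel (calibration statistic)
    codebooks: (2, 16) learned scalar codebooks for this layer
    """
    best_err, best = np.inf, None
    for idx, cb in enumerate(codebooks):
        # Positive scale mapping the group's range onto the codebook's range
        # (max-abs rule assumed for illustration).
        scale = np.max(np.abs(group)) / np.max(np.abs(cb))
        # Nearest-entry assignment: each weight gets a 4-bit code.
        codes = np.argmin(np.abs(group[:, None] - scale * cb[None, :]), axis=1)
        recon = scale * cb[codes]
        # Activation-weighted reconstruction error.
        err = np.sum(act_sq * (group - recon) ** 2)
        if err < best_err:
            best_err, best = err, (idx, scale, codes)
    idx, scale, codes = best
    # Encode the codebook choice in the sign of the otherwise-positive scale:
    # zero extra storage.
    stored_scale = -scale if idx == 1 else scale
    return stored_scale, codes
```

A group whose values line up with codebook 0 comes back with a positive stored scale; one matching codebook 1 comes back with a negative stored scale, and the decoder recovers both the scale and the choice from that single signed value.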
If this is right
- 4-bit quantized models retain higher downstream accuracy than those produced by fixed-codebook methods.
- Quantization of large models becomes feasible on modest hardware without multi-hour runtimes.
- No additional memory footprint is introduced beyond the quantized weights themselves.
- The same selection logic can be applied uniformly across different model families.
Where Pith is reading between the lines
- The activation-weighted selection rule may extend naturally to mixed-precision or per-token quantization settings.
- If activation statistics shift between training and deployment, the same lightweight selection could be recomputed on the fly without retraining the codebooks.
- The approach hints that many quantization artifacts arise from mismatch between a single global grid and layer-specific weight-activation correlations.
Load-bearing premise
Two learned scalar codebooks per layer are sufficient to represent the weight distributions needed for low reconstruction error when the choice is guided by activation-weighted error.
What would settle it
A side-by-side run on the same model and benchmark showing that this method yields worse final task accuracy (or higher perplexity) than a gradient-assisted baseline run for hours would disprove the accuracy claim.
read the original abstract
Post-training weight-only quantization to 4 bits is widely used to reduce the memory and compute costs of large language model inference. Existing PTQ methods, such as AWQ and GPTQ, improve how weights are mapped onto a fixed 4-bit grid through scaling, clipping, or error compensation. To further improve accuracy, methods such as OmniQuant and QuIP# use gradient-assisted algorithms at the cost of hours of quantization time. In this work, we propose AAAC (Activation-Aware Adaptive Codebooks), a lightweight method for 4-bit LLM weight quantization. AAAC replaces the fixed scalar codebook used in standard quantization with two small learned scalar codebooks (64 bytes) per layer. Each group of weights selects the codebook that minimizes activation-weighted reconstruction error, encoding the choice in the unused sign bit of the group's positive scale and adding zero storage overhead. AAAC completes in 3-30 minutes on a single GPU and adds no memory beyond the model itself. We evaluate against AWQ, GPTQ, IF4, GPTVQ, OmniQuant, SqueezeLLM, and QuIP# across model families. AAAC outperforms baselines at orders-of-magnitude less quantization time.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AAAC, a post-training 4-bit weight quantization method for LLMs that replaces the standard fixed scalar codebook with two small learned scalar codebooks (64 bytes total) per layer. Per-group codebook selection is performed by minimizing activation-weighted reconstruction error, with the choice encoded in the unused sign bit of the positive scale factor at zero storage overhead. The method is claimed to run in 3-30 minutes on a single GPU with no added memory and to outperform AWQ, GPTQ, IF4, GPTVQ, OmniQuant, SqueezeLLM, and QuIP# across model families.
Significance. If the empirical gains hold under rigorous validation, AAAC would offer a practical compromise between the speed of fixed-grid PTQ methods and the higher accuracy (but much higher cost) of gradient-assisted approaches such as OmniQuant and QuIP#, while preserving the zero-overhead property of the encoding trick. The core idea of activation-aware adaptive codebooks is a modest but potentially useful algorithmic increment.
major comments (3)
- [Abstract / Method] Abstract and method description: the central claim that two learned scalar codebooks per layer suffice to capture necessary weight distributions without introducing new artifacts rests on an unvalidated assumption. No ablation is referenced that tests whether two codebooks remain adequate for layers exhibiting multimodality, heavy tails, or structured outliers, nor is there a convergence criterion or initialization procedure for the codebook learning step.
- [Abstract] Abstract: the assertion that AAAC 'outperforms baselines' is stated without any numerical results, error bars, per-model tables, or statistical tests in the provided summary. Because the accuracy advantage is the primary load-bearing claim, the absence of concrete evidence (e.g., perplexity or zero-shot accuracy deltas versus AWQ/GPTQ) prevents assessment of whether the activation-weighted selection actually reduces residual error below optimized fixed-grid baselines.
- [Method] Method: the paper states that codebook selection adds 'zero storage overhead' by using the sign bit, yet provides no analysis of whether this encoding remains robust when the positive scale itself is near zero or when quantization noise interacts with the sign-bit decision across groups.
minor comments (1)
- [Abstract] The abstract lists seven baselines but does not indicate whether all were re-implemented under identical settings or whether published numbers were used; a short clarification on experimental protocol would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications and note the revisions we will incorporate.
read point-by-point responses
-
Referee: [Abstract / Method] Abstract and method description: the central claim that two learned scalar codebooks per layer suffice to capture necessary weight distributions without introducing new artifacts rests on an unvalidated assumption. No ablation is referenced that tests whether two codebooks remain adequate for layers exhibiting multimodality, heavy tails, or structured outliers, nor is there a convergence criterion or initialization procedure for the codebook learning step.
Authors: We appreciate the referee's point on validation. Our empirical results across multiple model families show that two codebooks per layer consistently capture the relevant weight distributions without introducing artifacts, as evidenced by the reported accuracy gains. To strengthen this, we will add an ablation study varying the number of codebooks (1 vs. 2 vs. 4) on representative layers and include details on the codebook learning procedure: initialization via k-means clustering on a random subset of weights and convergence after a fixed number of iterations or when the weighted reconstruction error stabilizes. revision: yes
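The learning step the authors describe (k-means initialization, iterate until the weighted error stabilizes) can be sketched as a 1-D weighted Lloyd iteration. This is our reading of the rebuttal, not the paper's exact procedure; the initialization-by-sampling and fixed iteration count are assumptions.

```python
import numpy as np

def learn_codebook(weights, importance, k=16, iters=20, seed=0):
    """1-D importance-weighted k-means (Lloyd) sketch of the codebook
    learning step: sample k weights as initial centroids, then alternate
    nearest-centroid assignment and weighted centroid updates."""
    rng = np.random.default_rng(seed)
    centers = np.sort(rng.choice(weights, size=k, replace=False))
    for _ in range(iters):
        # Assign each weight to its nearest codebook entry.
        assign = np.argmin(np.abs(weights[:, None] - centers[None, :]), axis=1)
        for j in range(k):
            mask = assign == j
            if mask.any():
                # Importance-weighted centroid update (activation-aware).
                centers[j] = np.average(weights[mask], weights=importance[mask])
    return np.sort(centers)
```

With uniform importance this reduces to ordinary scalar k-means; non-uniform importance pulls codebook entries toward weights whose channels see large activations, which is the "activation-aware" part of the design.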
-
Referee: [Abstract] Abstract: the assertion that AAAC 'outperforms baselines' is stated without any numerical results, error bars, per-model tables, or statistical tests in the provided summary. Because the accuracy advantage is the primary load-bearing claim, the absence of concrete evidence (e.g., perplexity or zero-shot accuracy deltas versus AWQ/GPTQ) prevents assessment of whether the activation-weighted selection actually reduces residual error below optimized fixed-grid baselines.
Authors: The abstract is a high-level summary of the work. The full manuscript contains detailed numerical evidence in Tables 1-4 and Figures 2-5, including perplexity and zero-shot accuracy values, deltas versus AWQ/GPTQ and other baselines, and discussions of the improvements. We will revise the abstract to briefly reference the scale of the observed gains to make the claim more self-contained. revision: partial
-
Referee: [Method] Method: the paper states that codebook selection adds 'zero storage overhead' by using the sign bit, yet provides no analysis of whether this encoding remains robust when the positive scale itself is near zero or when quantization noise interacts with the sign-bit decision across groups.
Authors: The scale factor is computed as a positive value (e.g., based on the maximum absolute weight per group), so the sign bit is always available for encoding the codebook choice. Groups with near-zero scales have negligible impact on the model output, and the codebook selection is based on activation-weighted error computed before final encoding. We will add a short robustness discussion in the method section, including analysis of small-scale cases and verification that quantization noise does not affect the sign-bit decision in practice. revision: yes
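The encode/decode round trip at issue is simple enough to state directly. A minimal sketch, assuming fp16-stored scales (the function names are ours); as the rebuttal notes, it relies on the scale being strictly positive, since a scale of exactly zero would have no distinguishable sign bit to carry the choice.

```python
import numpy as np

def encode_scale(scale, codebook_idx):
    # The per-group scale is constructed positive (e.g. max-abs based),
    # so its sign bit is free: a negative stored scale means "codebook 1".
    assert scale > 0, "encoding requires a strictly positive scale"
    s = np.float16(scale)
    return np.float16(-s) if codebook_idx == 1 else s

def decode_scale(stored):
    # Recover (positive scale, codebook index) from the signed fp16 value.
    idx = 1 if stored < 0 else 0
    return np.float16(abs(stored)), idx
```

Note that the round trip is exact for any representable fp16 scale, because flipping the sign bit never perturbs the magnitude; the referee's near-zero concern is about the *impact* of tiny-scale groups, not about the bit itself being lost.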
Circularity Check
No circularity: AAAC is an independent algorithmic proposal with empirical evaluation.
full rationale
The paper introduces AAAC as a new PTQ method that substitutes the standard fixed 4-bit scalar codebook with two small learned scalar codebooks (64 bytes) per layer, where each weight group selects the codebook minimizing activation-weighted reconstruction error (encoded in the scale sign bit). No equations or claims in the provided text reduce the reported accuracy or runtime gains to quantities defined by the same fitted codebooks or by self-citation. The method is presented as a lightweight algorithmic change evaluated directly against external baselines (AWQ, GPTQ, OmniQuant, etc.) on multiple model families. The central claim of outperformance at 3-30 min quantization time rests on the empirical results of this change rather than any self-referential definition, fitted-input prediction, or imported uniqueness theorem. This is a self-contained empirical contribution.
Axiom & Free-Parameter Ledger
free parameters (1)
- two scalar codebooks per layer
Reference graph
Works this paper leans on
- [1] SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression. 2023.
- [2] Extreme Compression of Large Language Models via Additive Quantization. 2024.
- [3] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. 2023.
- [4]
- [5] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. 2024.
- [6]
- [7]
- [8]
- [9] Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. 2023.
- [10] LLaMA: Open and Efficient Foundation Language Models. 2023.
- [11] GPTVQ: The Blessing of Dimensionality for LLM Quantization. 2025.
- [12] OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models. 2024.
- [13]
- [14] QuIP#: Even Better LLM Quantization with Hadamard Incoherence and Lattice Codebooks. 2024.
- [15] Fast Inference from Transformers via Speculative Decoding. 2023.
- [16] Accelerating Large Language Model Decoding with Speculative Sampling. 2023.
- [17] QSpec: Speculative Decoding with Complementary Quantization Schemes. 2025.
- [18] Gao, Leo; Tow, Jonathan; Abbasi, Baber; Biderman, Stella; Black, Sid; DiPofi, Anthony; Foster, Charles; Golding, Laurence; Hsu, Jeffrey; Le Noac'h, Alain; Li, Haonan; McDonell, Kyle; Muennighoff, Niklas; Ociepa, Chris; Phang, Jason; Reynolds, Laria; Schoelkopf, Hailey; Skowron, Aviya; Sutawika, Lintang... doi:10.5281/zenodo.12608602.
- [19] Pretraining Large Language Models with NVFP4. 2026.