High-Rate Quantized Matrix Multiplication II
Pith reviewed 2026-05-14 19:26 UTC · model grok-4.3
The pith
WaterSIC uses waterfilling on the input covariance to make scalar quantization of LLM weights basis-independent and within 0.25 bit per entry of the information-theoretic limit.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When the covariance matrix Σ_X is known, waterfilling applied to its eigenvalues produces a quantization scheme whose high-rate distortion depends only on det(Σ_X) and lies within a multiplicative factor of 2πe/12 of the information-theoretic minimum distortion for the weighted mean squared error problem.
What carries the argument
Waterfilling rate allocation on the eigenvalues of Σ_X to set per-coordinate step sizes for scalar integer quantizers.
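As a concrete sketch of this allocation, here is a minimal reverse-waterfilling routine on a known eigenvalue vector. The function name, bisection scheme, and example numbers are illustrative, not taken from the paper:

```python
import numpy as np

def reverse_waterfill(eigvals, total_rate_bits):
    """Choose a water level theta so that the active-coordinate rates
    sum to the budget:  sum_i max(0, 0.5*log2(lam_i/theta)) == total_rate_bits.
    Per-coordinate distortion is then D_i = min(lam_i, theta)."""
    lo, hi = 0.0, float(max(eigvals))
    for _ in range(100):  # bisection: rate decreases as theta rises
        theta = 0.5 * (lo + hi)
        rate = sum(max(0.0, 0.5 * np.log2(l / theta)) for l in eigvals)
        if rate > total_rate_bits:
            lo = theta  # level too low -> too much rate; raise it
        else:
            hi = theta
    rates = np.array([max(0.0, 0.5 * np.log2(l / theta)) for l in eigvals])
    dists = np.minimum(eigvals, theta)
    return theta, rates, dists

# Example: 4 eigenvalues, budget of 8 bits total across coordinates.
# The smallest eigenvalue (0.01) falls below the water level and gets 0 bits.
theta, rates, dists = reverse_waterfill(np.array([4.0, 1.0, 0.25, 0.01]), 8.0)
```

The resulting per-coordinate distortions D_i = min(λ_i, θ) translate to scalar quantizer step sizes via the standard high-rate rule Δ_i = sqrt(12 D_i).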
Load-bearing premise
The high-rate regime approximation accurately describes the distortion achieved at the bit widths used in practical LLM quantization, with the covariance matrix known and stationary.
What would settle it
Direct computation of quantization MSE on Llama-3-8B layer weights at 4 bits per entry, compared against the value predicted from det(Σ_X) scaled by the 2πe/12 factor.
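A toy version of that settling experiment, with a synthetic covariance standing in for an actual Llama-3-8B layer's Σ_X (the dimension, rate, and equal-distortion step choice are our assumptions). It checks only the granular Δ²/12 noise model; the full test would also estimate the coded rate:

```python
import numpy as np

rng = np.random.default_rng(0)
n, R = 8, 4.0                                  # dimension and rate (bits/entry), stand-ins
A = rng.standard_normal((n, n))
Sigma = A @ A.T / n + 0.5 * np.eye(n)          # synthetic, well-conditioned Sigma_X

# High-rate Gaussian benchmark: D* = det(Sigma)^(1/n) * 2^(-2R);
# the reviewed claim places WaterSIC at (2*pi*e/12) * D*.
D_star = np.exp(np.linalg.slogdet(Sigma)[1] / n) * 2.0 ** (-2 * R)
D_pred = (2 * np.pi * np.e / 12) * D_star

# Empirical MSE of an unbounded uniform quantizer in the eigenbasis,
# with a common step chosen so that Delta^2/12 = D_pred.
lam, U = np.linalg.eigh(Sigma)
X = rng.multivariate_normal(np.zeros(n), Sigma, size=200_000)
Y = X @ U                                      # decorrelate
Delta = np.sqrt(12 * D_pred)
mse = np.mean((Y - Delta * np.round(Y / Delta)) ** 2)
ratio = mse / D_pred                           # ~1 if the white-noise model holds
```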
Original abstract
This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where covariance matrix $\Sigma_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as "WaterSIC") that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $\Sigma_X$ and, thus, unlike existing schemes, is immune to applying random rotations); and (b) within a multiplicative factor of $\frac{2\pi e}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $\Sigma_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper is the second part of a study on quantized matrix multiplication, focusing on the setting where the covariance matrix Σ_X of the input columns is known (as arises in weight-only post-training quantization of LLMs). It shows how classical reverse waterfilling can be applied to improve rate allocation in schemes such as GPTQ, and analyzes the WaterSIC scheme (scalar INT quantizers) whose high-rate performance is claimed to be (a) fully characterized by det(Σ_X) and therefore basis-independent, and (b) within a multiplicative factor of 2πe/12 (≈0.25 bit per entry) of the Gaussian rate-distortion bound. Experiments on Llama-3-8B layers indicate that randomly rotated GPTQ lies within 0.1 bit of WaterSIC, suggesting near-optimality in the high-rate regime.
Significance. If the high-rate analysis and finite-rate experiments hold, the work supplies a clean theoretical link between classical reverse-waterfilling source coding and practical LLM quantization. The basis-free characterization of WaterSIC and the explicit 0.25-bit gap to the information-theoretic limit are useful benchmarks; the observation that rotated GPTQ is already close to this limit supplies a concrete, low-overhead route to near-optimal weight-only quantization. The manuscript also demonstrates the value of importing classical rate-distortion tools into the LLM compression literature.
major comments (3)
- [High-rate analysis] High-rate analysis section (around the derivation of the 2πe/12 gap): the claim that WaterSIC lies within 0.25 bit/entry of the rate-distortion limit rests on the R→∞ regime in which quantization noise is white, additive, and independent of higher-order moments. At the 3–5 bit per coordinate rates typical after waterfilling for Llama-3-8B layers, overload probability, non-uniform bin loading, and marginal kurtosis can enlarge the gap; the manuscript provides no finite-rate error analysis or simulation that quantifies how much the gap inflates at these rates.
- [Experiments] Experimental evaluation (Llama-3-8B results): the comparison between WaterSIC and rotated GPTQ is performed on a single model and a limited set of layers. Because the gap to the information-theoretic bound is asserted to be small (0.1 bit), the result is sensitive to the particular covariance structure of Llama-3-8B; a broader test across multiple model families and bit-widths is needed to support the general claim that rotated GPTQ is near-optimal.
- [Basis independence] Section on basis independence: the statement that WaterSIC performance depends only on det(Σ_X) is derived under the high-rate white-noise approximation. It is not immediately clear whether the same invariance holds once finite-rate effects (granularity, overload) are included; a short counter-example or additional derivation at moderate rates would strengthen the claim.
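For reference, the two ways the gap is quoted in the claim (a multiplicative factor of 2πe/12, or 0.25 bit/entry) are the same number, since distortion scales as 2^(-2R) at high rate, so a distortion factor c costs 0.5·log2(c) extra bits:

```python
import math

# D ~ 2^(-2R) at high rate, so a multiplicative distortion factor c
# corresponds to a rate penalty of 0.5*log2(c) bits per entry.
gap_factor = 2 * math.pi * math.e / 12   # ~1.423: scalar-quantizer gap to the Gaussian bound
gap_bits = 0.5 * math.log2(gap_factor)   # ~0.2546 bit/entry, the "0.25 bit" in the claim
```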
minor comments (2)
- [Abstract] The abstract and introduction should explicitly state the bit-width range (e.g., 3–5 bits after waterfilling) for which the 0.25-bit gap is claimed to remain representative.
- [WaterSIC description] Notation: the relationship between the waterfilling solution for the weighted MSE problem and the scalar quantizer step sizes used in WaterSIC could be written out more explicitly (one additional displayed equation would suffice).
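One plausible form of the displayed equation the comment asks for, under the standard high-rate model in which a uniform scalar quantizer with step Δ_i incurs distortion Δ_i²/12 (the symbols λ_i, θ follow the usual waterfilling notation and are our reconstruction, not the paper's):

```latex
R_i \;=\; \max\!\Bigl(0,\ \tfrac{1}{2}\log_2\frac{\lambda_i}{\theta}\Bigr),
\qquad
D_i \;=\; \min(\lambda_i,\theta) \;=\; \frac{\Delta_i^2}{12}
\quad\Longrightarrow\quad
\Delta_i \;=\; \sqrt{12\,\min(\lambda_i,\theta)},
```

where λ_i are the eigenvalues of Σ_X and θ is the water level chosen so that the per-coordinate rates R_i sum to the total rate budget.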
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the scope and limitations of our high-rate analysis. We address each major point below, indicating planned revisions where appropriate.
Point-by-point responses
-
Referee: [High-rate analysis] High-rate analysis section (around the derivation of the 2πe/12 gap): the claim that WaterSIC lies within 0.25 bit/entry of the rate-distortion limit rests on the R→∞ regime in which quantization noise is white, additive, and independent of higher-order moments. At the 3–5 bit per coordinate rates typical after waterfilling for Llama-3-8B layers, overload probability, non-uniform bin loading, and marginal kurtosis can enlarge the gap; the manuscript provides no finite-rate error analysis or simulation that quantifies how much the gap inflates at these rates.
Authors: We agree that the 2πe/12 gap is derived under the high-rate white-noise approximation. At the 3-5 bit rates relevant to Llama-3-8B after waterfilling, finite-rate effects such as overload and kurtosis can indeed increase the actual gap. We will revise the manuscript to explicitly state this assumption and its limitations, and we will add a short discussion noting that the experimental results (0.1 bit gap for rotated GPTQ) provide empirical evidence that the inflation remains modest in practice. A full finite-rate analysis is beyond the current scope. revision: partial
-
Referee: [Experiments] Experimental evaluation (Llama-3-8B results): the comparison between WaterSIC and rotated GPTQ is performed on a single model and a limited set of layers. Because the gap to the information-theoretic bound is asserted to be small (0.1 bit), the result is sensitive to the particular covariance structure of Llama-3-8B; a broader test across multiple model families and bit-widths is needed to support the general claim that rotated GPTQ is near-optimal.
Authors: The experiments use Llama-3-8B to illustrate behavior on a recent, representative LLM under realistic covariance structures. The claim is presented as suggestive rather than universal. We will add text clarifying the limited scope and that the near-optimality observation holds for this model family in the high-rate regime. Broader validation across additional models is desirable but cannot be completed in the current revision cycle. revision: no
-
Referee: [Basis independence] Section on basis independence: the statement that WaterSIC performance depends only on det(Σ_X) is derived under the high-rate white-noise approximation. It is not immediately clear whether the same invariance holds once finite-rate effects (granularity, overload) are included; a short counter-example or additional derivation at moderate rates would strengthen the claim.
Authors: The determinant characterization follows directly from the high-rate analysis where the quantization noise is white and the distortion depends only on the eigenvalues via waterfilling. We will revise the section to note that this invariance is approximate at finite rates and may be perturbed by granularity and overload. A full moderate-rate derivation or counter-example is left for future work, but the high-rate result remains a useful benchmark. revision: partial
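The high-rate invariance claim bottoms out in a linear-algebra identity: det(QΣ_XQᵀ) = det(Σ_X) for any orthogonal Q. A quick numeric check on a synthetic covariance (not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(1)
n = 16
A = rng.standard_normal((n, n))
Sigma = A @ A.T                                    # synthetic covariance
Q, _ = np.linalg.qr(rng.standard_normal((n, n)))   # random orthogonal matrix

# log|det| before and after rotation -- and hence the claimed
# high-rate WaterSIC distortion -- are identical.
logdet0 = np.linalg.slogdet(Sigma)[1]
logdet1 = np.linalg.slogdet(Q @ Sigma @ Q.T)[1]
```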
Remaining open items
- A complete finite-rate error analysis quantifying gap inflation at 3–5 bits per coordinate
- Broader experimental validation across multiple model families and bit-widths
Circularity Check
No significant circularity; claims rest on classical information-theoretic results
Full rationale
The paper applies the standard reverse waterfilling solution from classical rate-distortion theory to allocate rates based on the eigenvalues of Σ_X and invokes the known high-rate scalar quantization gap of 2πe/12 relative to the Gaussian bound. These are independent external results, not derived or redefined inside the paper. The basis-free characterization via det(Σ_X) follows directly from the waterfilling formula without redefinition. GPTQ comparisons rely on external Llama-3-8B weights rather than any fitted parameters or self-citation chains. No load-bearing step reduces by construction to the paper's own inputs or prior self-citations.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Reverse waterfilling optimally distributes rate for weighted mean-squared-error source coding
Reference graph
Works this paper leans on
- [1] O. Ordentlich and Y. Polyanskiy, "High-rate quantized matrix multiplication I," 2026.
- [2] O. Ordentlich and Y. Polyanskiy, "Optimal quantization for matrix multiplication," arXiv preprint arXiv:2410.13780, 2024.
- [3] M. Nagel, R. A. Amjad, M. van Baalen, C. Louizos, and T. Blankevoort, "Up or down? Adaptive rounding for post-training quantization," in International Conference on Machine Learning. PMLR, 2020, pp. 7197–7206.
- [4] E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, "OPTQ: Accurate quantization for generative pre-trained transformers," in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=tcbBPnfwxS
- [5] T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, "LLM.int8(): 8-bit matrix multiplication for transformers at scale," Advances in Neural Information Processing Systems, vol. 35, pp. 30318–30332, 2022.
- [6] B. Hassibi, D. G. Stork, and G. J. Wolff, "Optimal brain surgeon and general network pruning," in IEEE International Conference on Neural Networks. IEEE, 1993, pp. 293–299.
- [7] E. Lifar, S. Savkin, O. Ordentlich, and Y. Polyanskiy, "WaterSIC: Information-theoretically (near) optimal linear layer quantization," arXiv preprint arXiv:2603.04956, 2026.
- [8] Y. Li, R. Gong, X. Tan, Y. Yang, P. Hu, Q. Zhang, F. Yu, W. Wang, and S. Gu, "BRECQ: Pushing the limit of post-training quantization by block reconstruction," arXiv preprint arXiv:2102.05426, 2021.
- [9] A. Tseng, Z. Sun, and C. De Sa, "Model-preserving adaptive rounding," arXiv preprint arXiv:2505.22988, 2025.
- [10] H. Badri and A. Shaji, "Half-quadratic quantization of large machine learning models," November 2023. [Online]. Available: https://mobiusml.github.io/hqq_blog/
- [11] G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, "SmoothQuant: Accurate and efficient post-training quantization for large language models," in International Conference on Machine Learning. PMLR, 2023, pp. 38087–38099.
- [12] S. Savkin, E. Porat, O. Ordentlich, and Y. Polyanskiy, "NestQuant: Nested lattice quantization for matrix products and LLMs," arXiv preprint arXiv:2502.09720, 2025.
- [13] S. Zhang, H. Zhang, I. Colbert, and R. Saab, "Qronos: Correcting the past by shaping the future... in post-training quantization," arXiv preprint arXiv:2505.11695, 2025.
- [14] H. Zhang, S. Zhang, I. Colbert, and R. Saab, "Provable post-training quantization: Theoretical analysis of OPTQ and Qronos," arXiv preprint arXiv:2508.04853, 2025.
- [15] Y. Polyanskiy and Y. Wu, Information Theory: From Coding to Learning. Cambridge University Press, 2024.
- [16] A. Harbuzova, O. Ordentlich, and Y. Polyanskiy, "Price of metric universality in vector quantization is at most 0.11 bit," arXiv preprint arXiv:2602.05790, 2026.
- [17] J. Chee, Y. Cai, V. Kuleshov, and C. M. De Sa, "QuIP: 2-bit quantization of large language models with guarantees," Advances in Neural Information Processing Systems, vol. 36, pp. 4396–4429, 2023.
- [18] J. Chen, Y. Shabanzadeh, E. Crnčević, T. Hoefler, and D. Alistarh, "The geometry of LLM quantization: GPTQ as Babai's nearest plane algorithm," arXiv preprint arXiv:2507.18553, 2025.
- [19] J. Birnick, "The lattice geometry of neural network quantization: A short equivalence proof of GPTQ and Babai's algorithm," arXiv preprint arXiv:2508.01077, 2025.
- [20] R. Zamir, Lattice Coding for Signals and Networks: A Structured Coding Approach to Quantization, Modulation, and Multiuser Information Theory. Cambridge University Press, 2014.
- [21] A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa, "QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks," arXiv preprint arXiv:2402.04396, 2024.
- [22] O. Ordentlich, O. Regev, and B. Weiss, "Bounds on the density of smooth lattice coverings," arXiv preprint arXiv:2311.04644, 2023.
- [23] W. R. Bennett, "Spectra of quantized signals," The Bell System Technical Journal, vol. 27, no. 3, pp. 446–472, 1948.
- [24] P. Panter and W. Dite, "Quantization distortion in pulse-count modulation with nonuniform spacing of levels," Proceedings of the IRE, vol. 39, no. 1, pp. 44–48, 1951.
- [25] P. Zador, "Asymptotic quantization error of continuous signals and the quantization dimension," IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 139–149, 1982.
- [26] A. Gersho, "Asymptotically optimal block quantization," IEEE Transactions on Information Theory, vol. 25, no. 4, pp. 373–380, 1979.
- [27] M. V. Eyuboglu and G. D. Forney, "Lattice and trellis quantization with lattice- and trellis-bounded codebooks: High-rate theory for memoryless sources," IEEE Transactions on Information Theory, vol. 39, no. 1, pp. 46–59, 1993.
- [28] R. Zamir and M. Feder, "On lattice quantization noise," IEEE Transactions on Information Theory, vol. 42, no. 4, pp. 1152–1159, 1996.
- [29] J. H. Conway and N. J. A. Sloane, Sphere Packings, Lattices and Groups, 3rd ed., ser. Grundlehren der mathematischen Wissenschaften. New York: Springer-Verlag, 1999, vol. 290.
- [30] O. Ordentlich, "The Voronoi spherical CDF for lattices and linear codes: New bounds for quantization and coding," arXiv preprint arXiv:2506.19791, 2025.
- [31] P. W. Wolniansky, G. J. Foschini, G. D. Golden, and R. A. Valenzuela, "V-BLAST: An architecture for realizing very high data rates over the rich-scattering wireless channel," in 1998 URSI International Symposium on Signals, Systems, and Electronics. IEEE, 1998, pp. 295–300.
- [32] L. Babai, "On Lovász' lattice reduction and the nearest lattice point problem," Combinatorica, vol. 6, no. 1, pp. 1–13, 1986.
- [33] U. Fincke and M. Pohst, "Improved methods for calculating vectors of short length in a lattice, including a complexity analysis," Mathematics of Computation, vol. 44, no. 170, pp. 463–471, 1985.
- [34] E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, "Closest point search in lattices," IEEE Transactions on Information Theory, vol. 48, no. 8, pp. 2201–2214, 2002.
- [35] G. D. Forney, "Trellis shaping," IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 281–300, 1992.
- [36] S. Savkin, E. Porat, O. Ordentlich, and Y. Polyanskiy, "NestQuant: Nested lattice quantization for matrix products and LLMs," Proc. International Conference on Machine Learning (ICML), 2025.
- [37] A. Tseng, Q. Sun, D. Hou, and C. De Sa, "QTIP: Quantization with trellises and incoherence processing," arXiv preprint arXiv:2406.11235, 2024.
- [38] N. Elhage, R. Lasenby, and C. Olah, "Privileged bases in the transformer residual stream," Transformer Circuits Thread, 2023. [Online]. Available: https://transformer-circuits.pub/2023/privileged-basis/index.html
- [40] E. Agrell and B. Allen, "On the best lattice quantizers," IEEE Transactions on Information Theory, vol. 69, no. 12, pp. 7650–7658, 2023.
discussion (0)