pith. machine review for the scientific record.

arxiv: 2605.13768 · v1 · submitted 2026-05-13 · 💻 cs.LG · cs.AI · cs.IT · math.IT

Recognition: unknown

High-Rate Quantized Matrix Multiplication II

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 19:26 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.IT · math.IT
keywords quantized matrix multiplication · LLM post-training quantization · waterfilling · weighted mean squared error · high-rate distortion · GPTQ · scalar quantization · covariance matrix

The pith

WaterSIC uses waterfilling on the input covariance to make scalar quantization of LLM weights basis-independent and within 0.25 bit per entry of the information-theoretic limit.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper examines quantized matrix multiplication when the covariance matrix of the second factor is known, a setting common in weight-only post-training quantization of large language models. It links the task to weighted mean squared error source coding and shows that waterfilling allocates rate across coordinates to minimize distortion. The WaterSIC scheme, built from scalar integer quantizers, achieves high-rate distortion determined by the determinant of the covariance matrix alone. This makes the distortion immune to random basis rotations and keeps it within a multiplicative factor of 2πe/12 of the optimal rate-distortion bound. GPTQ with random rotations performs nearly as well on real models such as Llama-3-8B.
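
The waterfilling step is classical enough to sketch in a few lines. Below is a minimal Python illustration, assuming (as the paper's Figure 1 caption does) that weights are modeled as N(0, 1), so each eigenvalue λ_i of Σ_X acts as a per-coordinate variance; the bisection loop and the toy spectrum are our own scaffolding, not the paper's code.

    import numpy as np

    def reverse_waterfill(lam, R_avg, iters=200):
        """Classical reverse waterfilling. Given eigenvalues lam of Sigma_X
        (acting as per-coordinate variances for N(0,1)-modeled weights) and
        an average rate budget R_avg in bits/entry, bisect on the water
        level theta so the rates R_i = max(0, 0.5*log2(lam_i/theta))
        meet the budget."""
        lam = np.asarray(lam, dtype=float)
        lo, hi = 1e-12, float(lam.max())   # at theta = max(lam) all rates are 0
        for _ in range(iters):
            theta = 0.5 * (lo + hi)
            rates = np.maximum(0.0, 0.5 * np.log2(lam / theta))
            if rates.mean() > R_avg:       # too much rate spent: raise the water
                lo = theta
            else:
                hi = theta
        return theta, rates

    # toy spectrum: a few large eigenvalues, many small ones
    lam = np.sort(np.random.default_rng(0).exponential(1.0, 64))[::-1]
    theta, rates = reverse_waterfill(lam, R_avg=4.0)
    print(round(theta, 6), rates[:4])      # large-eigenvalue coords get more bits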

Core claim

When the covariance matrix Σ_X is known, waterfilling applied to its eigenvalues produces a quantization scheme whose high-rate distortion depends only on det(Σ_X) and lies within a multiplicative factor of 2πe/12 of the information-theoretic minimum distortion for the weighted mean squared error problem.
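
The 2πe/12 factor is the classical high-rate penalty of entropy-coded scalar uniform quantization relative to the Gaussian rate-distortion function; expressed as rate it is ½·log₂(2πe/12) bits per entry. A quick arithmetic check:

    import math

    gap_factor = 2 * math.pi * math.e / 12      # multiplicative distortion gap
    gap_bits = 0.5 * math.log2(gap_factor)      # same gap expressed as rate
    print(f"{gap_factor:.4f}x distortion == {gap_bits:.4f} bit/entry")
    # -> 1.4233x distortion == 0.2546 bit/entry, the "0.25 bit" quoted above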

What carries the argument

Waterfilling rate allocation on the eigenvalues of Σ_X to set per-coordinate step sizes for scalar integer quantizers.
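
Concretely, once the water level θ is fixed, equalizing the per-coordinate weighted distortion λ_i·Δ_i²/12 at θ pins down the step sizes. A minimal sketch of that mapping (the collapse-to-zero handling of below-water coordinates is our simplifying assumption):

    import numpy as np

    def step_sizes(lam, theta):
        """High-rate step sizes: equalize per-coordinate weighted distortion
        lam_i * Delta_i**2 / 12 at the water level theta, i.e.
        Delta_i = sqrt(12*theta/lam_i) for active coordinates (lam_i > theta).
        Coordinates below the water level get zero rate (quantized to 0)."""
        lam = np.asarray(lam, dtype=float)
        delta = np.full(lam.shape, np.inf)      # inactive: collapse to zero
        active = lam > theta
        delta[active] = np.sqrt(12.0 * theta / lam[active])
        return delta

    print(step_sizes([4.0, 1.0, 0.25, 0.01], theta=0.04))
    # finer steps (more bits) where Sigma_X puts more energy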

Load-bearing premise

The high-rate regime approximation accurately describes the distortion achieved at the bit widths used in practical LLM quantization, with the covariance matrix known and stationary.
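
The premise can be probed directly in one dimension: uniform rounding of N(0, 1) samples follows the Δ²/12 law closely for small steps and departs from it once the step is comparable to the standard deviation. A toy check (ours, not the paper's):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(1_000_000)
    for delta in (0.25, 1.0, 2.0, 4.0):        # finer step == higher rate
        mse = np.mean((np.round(x / delta) * delta - x) ** 2)
        print(f"step {delta}: measured {mse:.4f}  vs  delta^2/12 = {delta**2 / 12:.4f}")
    # at step 4.0 nearly everything rounds to zero and the law visibly fails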

What would settle it

Direct computation of quantization MSE on Llama-3-8B layer weights at 4 bits per entry, compared against the value predicted from det(Σ_X) scaled by the 2πe/12 factor.
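
A hedged mock-up of that computation on synthetic data: `W` is i.i.d. N(0, 1) as a stand-in for real layer weights, the spectrum `lam` is invented, the quantizer is plain uniform rounding, and R is the entropy rate under the high-rate approximation rather than a fixed INT width; none of this is the paper's actual pipeline.

    import numpy as np

    rng = np.random.default_rng(0)
    n, m, R = 64, 50_000, 4.0                  # dim, samples, avg bits/entry

    lam = np.sort(rng.uniform(0.05, 4.0, n))[::-1]  # stand-in spectrum of Sigma_X
    geo = np.exp(np.mean(np.log(lam)))              # det(Sigma_X)**(1/n)

    # high-rate prediction for weighted MSE per entry at entropy rate R
    pred = (2 * np.pi * np.e / 12) * geo * 2.0 ** (-2 * R)

    # waterfilling step sizes equalizing weighted distortion at level pred
    delta = np.sqrt(12.0 * pred / lam)

    # quantize synthetic N(0,1) weights coordinate-wise (eigenbasis of Sigma_X)
    W = rng.standard_normal((m, n))
    err = np.round(W / delta) * delta - W
    measured = np.mean(err ** 2 * lam)         # weighted MSE per entry

    print(f"predicted {pred:.3e}  measured {measured:.3e}")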

Figures

Figures reproduced from arXiv: 2605.13768 by Or Ordentlich, Yury Polyanskiy.

Figure 1
Figure 1: Illustrating Σ_X of activations entering various layers of Llama-3-8B when processing the Wikitext-2 dataset. Note that this estimate of the rate advantage assumes weight matrices are well modeled by N(0, I_n); in particular, actual weight matrices were never used for this plot. Note that, while our argument only gives a lower bound on the minimal distortion, it can be shown to be achievable asymptotical… view at source ↗
Figure 2
Figure 2: Illustration of the generalSIC quantization algorithm. In generalSIC the n codebooks C_1, . . . , C_n ⊂ R whose product is C ⊂ R^n can be arbitrary, and each of them is further scaled by the corresponding α_i. This scaling can in general be absorbed into the codebooks' definition. However, since we allow A to depend on Σ_X through the matrix U ∈ R^{n×n} while the codebooks C_1, . . . , C_n are not allowed to depend on Σ_X, we do not absorb A into the codebooks. [PITH_FULL_IMAGE:figures/full_fig_p009_2.png] view at source ↗
Figure 3
Figure 3: Illustration of the quantization regions for the [PITH_FULL_IMAGE:figures/full_fig_p010_3.png] view at source ↗
Figure 4
Figure 4: Performance of several weight-only quantization [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗
Figure 5
Figure 5: Illustrating rate advantage of WaterSIC over SIC for [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗
Figure 6
Figure 6: Illustrating Cholesky diagonals U²_{k,k} for a randomly rotated V⊤Σ_X V and the accuracy of approximation (33) in terms of the spectrum of Σ_X. GPTQ with random rotation can be accurately estimated by combining (10) and (33). It is an interesting open problem to estimate the worst possible gap (over possible spectra λ_j ≥ ϵ) between GPTQ with rotation and WaterSIC (which in turn is 0.25 bit away from the information-theore… view at source ↗
read the original abstract

This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where covariance matrix $\Sigma_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as “WaterSIC”) that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $\Sigma_X$ and, thus, unlike existing schemes, is immune to applying random rotations); and (b) within a multiplicative factor of $\frac{2\pi e}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $\Sigma_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper is the second part of a study on quantized matrix multiplication, focusing on the setting where the covariance matrix Σ_X of the input columns is known (as arises in weight-only post-training quantization of LLMs). It shows how classical reverse waterfilling can be applied to improve rate allocation in schemes such as GPTQ, and analyzes the WaterSIC scheme (scalar INT quantizers) whose high-rate performance is claimed to be (a) fully characterized by det(Σ_X) and therefore basis-independent, and (b) within a multiplicative factor of 2πe/12 (≈0.25 bit per entry) of the Gaussian rate-distortion bound. Experiments on Llama-3-8B layers indicate that randomly rotated GPTQ lies within 0.1 bit of WaterSIC, suggesting near-optimality in the high-rate regime.

Significance. If the high-rate analysis and finite-rate experiments hold, the work supplies a clean theoretical link between classical reverse-waterfilling source coding and practical LLM quantization. The basis-free characterization of WaterSIC and the explicit 0.25-bit gap to the information-theoretic limit are useful benchmarks; the observation that rotated GPTQ is already close to this limit supplies a concrete, low-overhead route to near-optimal weight-only quantization. The manuscript also demonstrates the value of importing classical rate-distortion tools into the LLM compression literature.

major comments (3)
  1. [High-rate analysis] High-rate analysis section (around the derivation of the 2πe/12 gap): the claim that WaterSIC lies within 0.25 bit/entry of the rate-distortion limit rests on the R→∞ regime in which quantization noise is white, additive, and independent of higher-order moments. At the 3–5 bit per coordinate rates typical after waterfilling for Llama-3-8B layers, overload probability, non-uniform bin loading, and marginal kurtosis can enlarge the gap; the manuscript provides no finite-rate error analysis or simulation that quantifies how much the gap inflates at these rates (a sketch isolating the overload term follows this list).
  2. [Experiments] Experimental evaluation (Llama-3-8B results): the comparison between WaterSIC and rotated GPTQ is performed on a single model and a limited set of layers. Because the gap to the information-theoretic bound is asserted to be small (0.1 bit), the result is sensitive to the particular covariance structure of Llama-3-8B; a broader test across multiple model families and bit-widths is needed to support the general claim that rotated GPTQ is near-optimal.
  3. [Basis independence] Section on basis independence: the statement that WaterSIC performance depends only on det(Σ_X) is derived under the high-rate white-noise approximation. It is not immediately clear whether the same invariance holds once finite-rate effects (granularity, overload) are included; a short counter-example or additional derivation at moderate rates would strengthen the claim.
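
(Sketch referenced in major comment 1.) The overload term that comment names is easy to isolate: a b-bit uniform quantizer with step Δ covers [−L, L] with L = 2^(b−1)·Δ, and the Gaussian mass outside that support is exactly what the Δ²/12 analysis drops. The step Δ = 0.5 below is illustrative, not taken from the paper.

    import math

    def overload_prob(b, delta, sigma=1.0):
        """A b-bit uniform quantizer with step delta covers [-L, L] with
        L = 2**(b-1) * delta. Returns P(|X| > L) for X ~ N(0, sigma**2):
        the overload mass ignored by the granular Delta**2/12 analysis."""
        L = 2 ** (b - 1) * delta
        return math.erfc(L / (sigma * math.sqrt(2)))

    for b in (3, 4, 5):                        # the range the report flags
        print(f"b={b}: overload prob {overload_prob(b, delta=0.5):.2e}")
    # overload collapses rapidly with b, but can dominate at 3 bits
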
minor comments (2)
  1. [Abstract] The abstract and introduction should explicitly state the bit-width range (e.g., 3–5 bits after waterfilling) for which the 0.25-bit gap is claimed to remain representative.
  2. [WaterSIC description] Notation: the relationship between the waterfilling solution for the weighted MSE problem and the scalar quantizer step sizes used in WaterSIC could be written out more explicitly (one additional displayed equation would suffice; a candidate form follows this list).
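
(Candidate equation referenced in minor comment 2.) One plausible displayed form, reconstructed from the high-rate model above in our notation, not necessarily the authors' wording:

    Δ_k = sqrt(12·θ / λ_k)   for active coordinates λ_k > θ,

with θ the waterfilling level set by the rate budget (1/n)·Σ_k max(0, ½·log₂(λ_k/θ)) = R. Every active coordinate then contributes the same weighted granular distortion λ_k·Δ_k²/12 = θ, and the entropy-coded total lands within the 2πe/12 factor of det(Σ_X)^{1/n}·2^{−2R}.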

Simulated Author's Rebuttal

3 responses · 2 unresolved

We thank the referee for the constructive comments, which help clarify the scope and limitations of our high-rate analysis. We address each major point below, indicating planned revisions where appropriate.

read point-by-point responses
  1. Referee: [High-rate analysis] High-rate analysis section (around the derivation of the 2πe/12 gap): the claim that WaterSIC lies within 0.25 bit/entry of the rate-distortion limit rests on the R→∞ regime in which quantization noise is white, additive, and independent of higher-order moments. At the 3–5 bit per coordinate rates typical after waterfilling for Llama-3-8B layers, overload probability, non-uniform bin loading, and marginal kurtosis can enlarge the gap; the manuscript provides no finite-rate error analysis or simulation that quantifies how much the gap inflates at these rates.

    Authors: We agree that the 2πe/12 gap is derived under the high-rate white-noise approximation. At the 3–5 bit rates relevant to Llama-3-8B after waterfilling, finite-rate effects such as overload and kurtosis can indeed increase the actual gap. We will revise the manuscript to explicitly state this assumption and its limitations, and we will add a short discussion noting that the experimental results (0.1 bit gap for rotated GPTQ) provide empirical evidence that the inflation remains modest in practice. A full finite-rate analysis is beyond the current scope. revision: partial

  2. Referee: [Experiments] Experimental evaluation (Llama-3-8B results): the comparison between WaterSIC and rotated GPTQ is performed on a single model and a limited set of layers. Because the gap to the information-theoretic bound is asserted to be small (0.1 bit), the result is sensitive to the particular covariance structure of Llama-3-8B; a broader test across multiple model families and bit-widths is needed to support the general claim that rotated GPTQ is near-optimal.

    Authors: The experiments use Llama-3-8B to illustrate behavior on a recent, representative LLM under realistic covariance structures. The claim is presented as suggestive rather than universal. We will add text clarifying the limited scope and that the near-optimality observation holds for this model family in the high-rate regime. Broader validation across additional models is desirable but cannot be completed in the current revision cycle. revision: no

  3. Referee: [Basis independence] Section on basis independence: the statement that WaterSIC performance depends only on det(Σ_X) is derived under the high-rate white-noise approximation. It is not immediately clear whether the same invariance holds once finite-rate effects (granularity, overload) are included; a short counter-example or additional derivation at moderate rates would strengthen the claim.

    Authors: The determinant characterization follows directly from the high-rate analysis where the quantization noise is white and the distortion depends only on the eigenvalues via waterfilling. We will revise the section to note that this invariance is approximate at finite rates and may be perturbed by granularity and overload. A full moderate-rate derivation or counter-example is left for future work, but the high-rate result remains a useful benchmark. revision: partial

standing simulated objections not resolved
  • A complete finite-rate error analysis quantifying gap inflation at 3–5 bits per coordinate
  • Broader experimental validation across multiple model families and bit-widths

Circularity Check

0 steps flagged

No significant circularity; claims rest on classical information-theoretic results

full rationale

The paper applies the standard reverse waterfilling solution from classical rate-distortion theory to allocate rates based on the eigenvalues of Σ_X and invokes the known high-rate scalar quantization gap of 2πe/12 relative to the Gaussian bound. These are independent external results, not derived or redefined inside the paper. The basis-free characterization via det(Σ_X) follows directly from the waterfilling formula without redefinition. GPTQ comparisons rely on external Llama-3-8B weights rather than any fitted parameters or self-citation chains. No load-bearing step reduces by construction to the paper's own inputs or prior self-citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis rests on the classical reverse waterfilling theorem for weighted MSE source coding and the high-rate quantization approximation; no new free parameters or invented entities are introduced in the abstract.

axioms (1)
  • standard math Reverse waterfilling solution optimally distributes rate for weighted mean squared error source coding
    Invoked to improve rate allocation over uniform allocation in GPTQ.

pith-pipeline@v0.9.0 · 5580 in / 1247 out tokens · 46165 ms · 2026-05-14T19:26:02.994717+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

39 extracted references · 39 canonical work pages · 2 internal anchors

  1. [1]

    High-rate quantized matrix multiplication I,

    O. Ordentlich and Y. Polyanskiy, “High-rate quantized matrix multiplication I,” 2026

  2. [2]

    Optimal quantization for matrix multiplication,

    O. Ordentlich and Y. Polyanskiy, “Optimal quantization for matrix multiplication,” arXiv preprint arXiv:2410.13780, 2024

  3. [3]

    Up or down? adaptive rounding for post-training quantization,

    M. Nagel, R. A. Amjad, M. Van Baalen, C. Louizos, and T. Blankevoort, “Up or down? adaptive rounding for post-training quantization,” in International Conference on Machine Learning. PMLR, 2020, pp. 7197–7206

  4. [4]

    OPTQ: Accurate quantization for generative pre-trained transformers,

    E. Frantar, S. Ashkboos, T. Hoefler, and D. Alistarh, “OPTQ: Accurate quantization for generative pre-trained transformers,” in The Eleventh International Conference on Learning Representations, 2023. [Online]. Available: https://openreview.net/forum?id=tcbBPnfwxS

  5. [5]

    GPT3.int8(): 8-bit matrix multiplication for transformers at scale,

    T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer, “GPT3.int8(): 8-bit matrix multiplication for transformers at scale,” Advances in Neural Information Processing Systems, vol. 35, pp. 30318–30332, 2022

  6. [6]

    Optimal brain surgeon and general network pruning,

    B. Hassibi, D. G. Stork, and G. J. Wolff, “Optimal brain surgeon and general network pruning,” in IEEE International Conference on Neural Networks. IEEE, 1993, pp. 293–299

  7. [7]

    WaterSIC: information-theoretically (near) optimal linear layer quantization,

    E. Lifar, S. Savkin, O. Ordentlich, and Y. Polyanskiy, “WaterSIC: information-theoretically (near) optimal linear layer quantization,” arXiv preprint arXiv:2603.04956, 2026

  8. [8]

    BRECQ: Pushing the limit of post-training quantization by block reconstruction,

    Y. Li, R. Gong, X. Tan, Y. Yang, P. Hu, Q. Zhang, F. Yu, W. Wang, and S. Gu, “BRECQ: Pushing the limit of post-training quantization by block reconstruction,” arXiv preprint arXiv:2102.05426, 2021

  9. [9]

    Model-preserving adaptive rounding,

    A. Tseng, Z. Sun, and C. De Sa, “Model-preserving adaptive rounding,” arXiv preprint arXiv:2505.22988, 2025

  10. [10]

    Half-quadratic quantization of large machine learning models,

    H. Badri and A. Shaji, “Half-quadratic quantization of large machine learning models,” November 2023. [Online]. Available: https://mobiusml.github.io/hqq_blog/

  11. [11]

    SmoothQuant: Accurate and efficient post-training quantization for large language models,

    G. Xiao, J. Lin, M. Seznec, H. Wu, J. Demouth, and S. Han, “SmoothQuant: Accurate and efficient post-training quantization for large language models,” in International Conference on Machine Learning. PMLR, 2023, pp. 38087–38099

  12. [12]

    NestQuant: Nested lattice quantization for matrix products and LLMs,

    S. Savkin, E. Porat, O. Ordentlich, and Y. Polyanskiy, “NestQuant: Nested lattice quantization for matrix products and LLMs,” arXiv preprint arXiv:2502.09720, 2025

  13. [13]

    Qronos: Correcting the past by shaping the future... in post-training quantization,

    S. Zhang, H. Zhang, I. Colbert, and R. Saab, “Qronos: Correcting the past by shaping the future... in post-training quantization,” arXiv preprint arXiv:2505.11695, 2025

  14. [14]

    Provable post-training quantization: Theoretical analysis of OPTQ and Qronos,

    H. Zhang, S. Zhang, I. Colbert, and R. Saab, “Provable post-training quantization: Theoretical analysis of OPTQ and Qronos,” arXiv preprint arXiv:2508.04853, 2025

  15. [15]

    Information theory: From coding to learning,

    Y. Polyanskiy and Y. Wu, Information Theory: From Coding to Learning. Cambridge University Press, 2024

  16. [16]

    Price of metric universality in vector quantization is at most 0.11 bit,

    A. Harbuzova, O. Ordentlich, and Y. Polyanskiy, “Price of metric universality in vector quantization is at most 0.11 bit,” arXiv preprint arXiv:2602.05790, 2026

  17. [17]

    QuIP: 2-bit quantization of large language models with guarantees,

    J. Chee, Y. Cai, V. Kuleshov, and C. M. De Sa, “QuIP: 2-bit quantization of large language models with guarantees,” Advances in Neural Information Processing Systems, vol. 36, pp. 4396–4429, 2023

  18. [18]

    The geometry of LLM quantization: GPTQ as Babai’s nearest plane algorithm,

    J. Chen, Y. Shabanzadeh, E. Crnčević, T. Hoefler, and D. Alistarh, “The geometry of LLM quantization: GPTQ as Babai’s nearest plane algorithm,” arXiv preprint arXiv:2507.18553, 2025

  19. [19]

    The lattice geometry of neural network quantization – a short equivalence proof of GPTQ and Babai’s algorithm,

    J. Birnick, “The lattice geometry of neural network quantization – a short equivalence proof of GPTQ and Babai’s algorithm,” arXiv preprint arXiv:2508.01077, 2025

  20. [20]

    Lattice Coding for Signals and Networks: A Structured Coding Approach to Quantization, Modulation, and Multiuser Information Theory,

    R. Zamir, Lattice Coding for Signals and Networks: A Structured Coding Approach to Quantization, Modulation, and Multiuser Information Theory. Cambridge University Press, 2014

  21. [21]

    QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks,

    A. Tseng, J. Chee, Q. Sun, V. Kuleshov, and C. De Sa, “QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks,” arXiv preprint arXiv:2402.04396, 2024

  22. [22]

    Bounds on the density of smooth lattice coverings,

    O. Ordentlich, O. Regev, and B. Weiss, “Bounds on the density of smooth lattice coverings,” arXiv preprint arXiv:2311.04644, 2023

  23. [23]

    Spectra of quantized signals,

    W. R. Bennett, “Spectra of quantized signals,” The Bell System Technical Journal, vol. 27, no. 3, pp. 446–472, 1948

  24. [24]

    Quantization distortion in pulse-count modulation with nonuniform spacing of levels,

    P. Panter and W. Dite, “Quantization distortion in pulse-count modulation with nonuniform spacing of levels,” Proceedings of the IRE, vol. 39, no. 1, pp. 44–48, 1951

  25. [25]

    Asymptotic quantization error of continuous signals and the quantization dimension,

    P. Zador, “Asymptotic quantization error of continuous signals and the quantization dimension,” IEEE Transactions on Information Theory, vol. 28, no. 2, pp. 139–149, 1982

  26. [26]

    Asymptotically optimal block quantization,

    A. Gersho, “Asymptotically optimal block quantization,” IEEE Transactions on Information Theory, vol. 25, no. 4, pp. 373–380, 1979

  27. [27]

    Lattice and trellis quantization with lattice- and trellis-bounded codebooks – high-rate theory for memoryless sources,

    M. V. Eyuboglu and G. D. Forney, “Lattice and trellis quantization with lattice- and trellis-bounded codebooks – high-rate theory for memoryless sources,” IEEE Transactions on Information Theory, vol. 39, no. 1, pp. 46–59, 1993

  28. [28]

    On lattice quantization noise,

    R. Zamir and M. Feder, “On lattice quantization noise,” IEEE Transactions on Information Theory, vol. 42, no. 4, pp. 1152–1159, 1996

  29. [29]

    Sphere Packings, Lattices and Groups,

    J. H. Conway and N. J. A. Sloane, Sphere Packings, Lattices and Groups, 3rd ed., ser. Grundlehren der mathematischen Wissenschaften. New York: Springer-Verlag, 1999, vol. 290

  30. [30]

    The Voronoi spherical CDF for lattices and linear codes: New bounds for quantization and coding,

    O. Ordentlich, “The Voronoi spherical CDF for lattices and linear codes: New bounds for quantization and coding,” arXiv preprint arXiv:2506.19791, 2025

  31. [31]

    V-BLAST: An architecture for realizing very high data rates over the rich-scattering wireless channel,

    P. W. Wolniansky, G. J. Foschini, G. D. Golden, and R. A. Valenzuela, “V-BLAST: An architecture for realizing very high data rates over the rich-scattering wireless channel,” in 1998 URSI International Symposium on Signals, Systems, and Electronics. Conference Proceedings (Cat. No. 98EX167). IEEE, 1998, pp. 295–300

  32. [32]

    On Lovász’ lattice reduction and the nearest lattice point problem,

    L. Babai, “On Lovász’ lattice reduction and the nearest lattice point problem,” Combinatorica, vol. 6, no. 1, pp. 1–13, 1986

  33. [33]

    Improved methods for calculating vectors of short length in a lattice, including a complexity analysis,

    U. Fincke and M. Pohst, “Improved methods for calculating vectors of short length in a lattice, including a complexity analysis,” Mathematics of Computation, vol. 44, no. 170, pp. 463–471, 1985

  34. [34]

    Closest point search in lattices,

    E. Agrell, T. Eriksson, A. Vardy, and K. Zeger, “Closest point search in lattices,” IEEE Transactions on Information Theory, vol. 48, no. 8, pp. 2201–2214, 2002

  35. [35]

    Trellis shaping,

    G. D. Forney, “Trellis shaping,” IEEE Transactions on Information Theory, vol. 38, no. 2, pp. 281–300, 1992

  36. [36]

    NestQuant: Nested lattice quantization for matrix products and LLMs,

    S. Savkin, E. Porat, O. Ordentlich, and Y. Polyanskiy, “NestQuant: Nested lattice quantization for matrix products and LLMs,” Proc. International Conference on Machine Learning (ICML), 2025

  37. [37]

    QTIP: Quantization with trellises and incoherence processing,

    A. Tseng, Q. Sun, D. Hou, and C. De Sa, “QTIP: Quantization with trellises and incoherence processing,” arXiv preprint arXiv:2406.11235, 2024

  38. [38]

    Privileged bases in the transformer residual stream,

    N. Elhage, R. Lasenby, and C. Olah, “Privileged bases in the transformer residual stream,” Transformer Circuits Thread, 2023. [Online]. Available: https://transformer-circuits.pub/2023/privileged-basis/index.html

  39. [40]

    On the best lattice quantizers,

    E. Agrell and B. Allen, “On the best lattice quantizers,” IEEE Transactions on Information Theory, vol. 69, no. 12, pp. 7650–7658, 2023