
arxiv: 2606.23406 · v1 · pith:KAFFW3OSnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI

HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

Pith reviewed 2026-06-26 08:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords post-training quantization · large language models · KV cache compression · lattice quantization · Hadamard transform · Rice coding · diffusion models · rate-distortion optimization

The pith

HyperQuant achieves superior rate-distortion tradeoffs for LLM and diffusion model quantization by applying a per-tile randomized Hadamard transform before lattice quantization and Rice coding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HyperQuant as a unified post-training pipeline that quantizes both weights and KV cache in large transformers. It first applies a randomized Hadamard transform per tile to make coordinate distributions approximately Gaussian, then quantizes the result to a low-dimensional optimal lattice such as E8 or D4. The lattice indices are encoded with near-entropy-optimal Rice coding, and a bias-correction step is added for the KV cache to keep inner products unbiased. Experiments show this construction beats HIGGS on weight quantization from 3 to 5 bits per scalar and outperforms TurboQuant and OCTOPUS on KV cache down to 1.7 bits per scalar, while also handling a 19B-parameter video diffusion model without visible artifacts.
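As a rough illustration of the first stage, the sketch below applies a randomized Hadamard transform to one tile: random sign flips followed by an orthonormal fast Walsh-Hadamard transform. The tile size, the seed handling, the toy outlier pattern, and all function names are assumptions made for this example, not details taken from the paper.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def random_signs(n, seed):
    """Diagonal of the random sign matrix D (hypothetical seeding scheme)."""
    rng = random.Random(seed)
    return [rng.choice((-1.0, 1.0)) for _ in range(n)]


def rht(tile, signs):
    """Randomized Hadamard transform: y = H D x."""
    return fwht([s * x for s, x in zip(signs, tile)])


def inverse_rht(coeffs, signs):
    """Inverse: x = D H y, because H is symmetric orthonormal and D squares to I."""
    return [s * v for s, v in zip(signs, fwht(coeffs))]


def excess_kurtosis(xs):
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2) - 3.0


# Toy tile: Gaussian bulk plus eight large outliers at random positions.
rng = random.Random(0)
tile = [rng.gauss(0.0, 1.0) for _ in range(1024)]
for idx in rng.sample(range(1024), 8):
    tile[idx] = 25.0

signs = random_signs(len(tile), seed=1)
rotated = rht(tile, signs)
restored = inverse_rht(rotated, signs)

kurt_before = excess_kurtosis(tile)
kurt_after = excess_kurtosis(rotated)
print("excess kurtosis before:", round(kurt_before, 2))
print("excess kurtosis after: ", round(kurt_after, 2))
```

The transform spreads each outlier over all coordinates of the tile, so the excess kurtosis drops sharply while the tile's energy is preserved and the original values are exactly recoverable.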

Core claim

HyperQuant demonstrates that the combination of a per-tile randomized Hadamard transform, quantization onto low-dimensional lattices (E8, D4, A2, Z), variable-length Rice coding, and KV bias correction produces lower distortion at a given rate than existing methods, yielding approximately 3.9x compression on linear weights and 3.79x on the KV cache at 4 bits per scalar, with near-lossless end-to-end quality on an H100.

What carries the argument

The per-tile randomized Hadamard transform that reshapes weight and activation distributions to be approximately Gaussian, enabling near-optimal lattice quantization followed by Rice coding.
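For the lattice stage, the sketch below implements the classic Conway-Sloane nearest-point search for D_n and for E8, with E8 taken as the union of D8 and D8 shifted by 1/2 in every coordinate. It is a reference-style illustration only: the step-size handling and the function names are assumptions, and the paper's actual encoder may differ.

```python
def nearest_dn(x):
    """Nearest point of D_n: integer vectors whose coordinates sum to an even number."""
    rounded = [float(round(v)) for v in x]
    if int(sum(rounded)) % 2 == 0:
        return rounded
    # Odd parity: re-round the coordinate with the largest rounding error the other way.
    errors = [v - r for v, r in zip(x, rounded)]
    worst = max(range(len(x)), key=lambda i: abs(errors[i]))
    rounded[worst] += 1.0 if errors[worst] >= 0 else -1.0
    return rounded


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_e8(x):
    """Nearest point of E8 = D8 union (D8 + 1/2): try both cosets, keep the closer one."""
    if len(x) != 8:
        raise ValueError("E8 needs 8-dimensional input")
    cand_int = nearest_dn(x)
    cand_half = [v + 0.5 for v in nearest_dn([v - 0.5 for v in x])]
    return cand_int if sq_dist(x, cand_int) <= sq_dist(x, cand_half) else cand_half


def quantize_e8(block, step):
    """Scale by 1/step, snap to E8, scale back (step is a hypothetical tuning knob)."""
    point = nearest_e8([v / step for v in block])
    return [p * step for p in point]


block = [0.31, -0.12, 0.58, 0.07, -0.44, 0.93, 0.15, -0.26]
print(quantize_e8(block, step=0.5))
```

A smaller `step` lowers distortion and raises the entropy of the lattice indices, which is the rate-distortion knob the later coding stage has to pay for.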

If this is right

  • The method outperforms HIGGS at every operating point between 3 and 5 bits per scalar on weights.
  • It exceeds TurboQuant and OCTOPUS performance on KV-cache quantization down to 1.7 bits per scalar.
  • End-to-end compression reaches approximately 3.9 times for linear weights and 3.79 times for KV cache at 4 bits per scalar while remaining near lossless on H100 hardware.
  • The same pipeline quantizes a 19B-parameter diffusion transformer video model with no observable per-frame artifacts.
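The compression figures can be read back as effective bit rates. The sketch below assumes a 16-bit (bf16) baseline per scalar; that matches the bf16 baselines named in the figure captions, but applying it to the headline ratios is an assumption, and the helper names are illustrative.

```python
BASELINE_BITS = 16.0  # assumption: bf16 storage before quantization


def effective_bits(compression_ratio, baseline_bits=BASELINE_BITS):
    """Stored bits per scalar implied by an end-to-end compression ratio."""
    return baseline_bits / compression_ratio


def overhead_bits(compression_ratio, nominal_bps, baseline_bits=BASELINE_BITS):
    """Bits per scalar spent beyond the nominal rate."""
    return effective_bits(compression_ratio, baseline_bits) - nominal_bps


for name, ratio in (("linear weights", 3.9), ("KV cache", 3.79)):
    print(f"{name}: {effective_bits(ratio):.2f} effective bits per scalar, "
          f"{overhead_bits(ratio, 4.0):.2f} above the nominal 4 bps")
```

Under that assumption the ratios correspond to roughly 4.10 and 4.22 stored bits per scalar, so about 0.1 to 0.2 bits per scalar would go to whatever the pipeline stores besides the 4 bps payload.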

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The memory savings could allow larger batch sizes or longer context lengths on the same GPU without retraining.
  • Because the pipeline integrates directly with existing 8-bit and 4-bit Tensor-Core paths, it may accelerate inference on current accelerator hardware without new silicon.
  • The Gaussianizing property of the transform might be reusable as a preprocessing step for other quantization or compression techniques that assume Gaussian inputs.
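To make the memory argument concrete, the sketch below estimates KV-cache size per token. The model shape (32 layers, 8 KV heads, head dimension 128) is the published Llama-3.1-8B configuration; the 16-bit baseline, the 8 GiB budget, and the assumption that the 3.79x ratio holds at every context length are illustrative choices, not results from the paper.

```python
def kv_bytes_per_token(layers, kv_heads, head_dim, bits_per_scalar):
    """Bytes of KV cache added per generated token (keys plus values)."""
    scalars = 2 * layers * kv_heads * head_dim
    return scalars * bits_per_scalar / 8.0


def max_tokens(budget_bytes, bytes_per_token):
    """How many cached tokens fit in a fixed memory budget."""
    return int(budget_bytes // bytes_per_token)


baseline = kv_bytes_per_token(layers=32, kv_heads=8, head_dim=128, bits_per_scalar=16.0)
compressed = baseline / 3.79  # assumption: the reported KV ratio applies uniformly

budget = 8 * 2**30  # hypothetical 8 GiB reserved for the KV cache
print("bf16 bytes per token:      ", baseline)
print("compressed bytes per token:", round(compressed))
print("tokens in budget, bf16:      ", max_tokens(budget, baseline))
print("tokens in budget, compressed:", max_tokens(budget, compressed))
```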

Load-bearing premise

The randomized Hadamard transform produces coordinate distributions close enough to Gaussian that the selected low-dimensional lattices achieve near-minimal quantization error, and the KV bias correction preserves the semantics of attention inner products.
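One standard way to obtain unbiased reconstructions is subtractive dithering, sketched below for the scalar lattice Z. It is offered only to illustrate what "unbiased under inner products" means; the text here does not specify the paper's bias-correction variants, and the step size, vector length, and helper names are assumptions. The key vector is deliberately contrived so that plain rounding is biased in the same direction on every coordinate.

```python
import random


def quantize_plain(x, step):
    """Deterministic rounding to the nearest multiple of step."""
    return [step * round(v / step) for v in x]


def quantize_dithered(x, step, rng):
    """Subtractive dither: add uniform noise before rounding, subtract it afterwards."""
    out = []
    for v in x:
        u = rng.uniform(-0.5, 0.5) * step
        out.append(step * round((v + u) / step) - u)
    return out


def dot(a, b):
    return sum(p * q for p, q in zip(a, b))


rng = random.Random(0)
key = [0.3] * 64     # every coordinate sits at the same offset inside its cell
query = [1.0] * 64
step = 1.0

exact = dot(query, key)
plain = dot(query, quantize_plain(key, step))
trials = [dot(query, quantize_dithered(key, step, rng)) for _ in range(2000)]
mean_dithered = sum(trials) / len(trials)

print("exact inner product:      ", round(exact, 3))
print("plain rounding:           ", round(plain, 3))
print("dithered, mean over trials:", round(mean_dithered, 3))
```

Plain rounding collapses every coordinate to zero here, so the attention score is lost entirely, while the dithered estimate is noisy per trial but centered on the exact value.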

What would settle it

Measure the actual mean-squared error after applying the randomized Hadamard transform and lattice quantization to a real transformer weight tile and compare it against the known minimum distortion of the lattice at that rate; a large gap would falsify the near-optimality claim.
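The "known minimum distortion" in such a test can be anchored with textbook constants. The sketch below lists the normalized second moments G of the four lattices (standard values from the lattice literature, rounded to six decimals for D4 and E8) and converts them into the high-resolution gap to the Gaussian rate-distortion bound for an entropy-coded lattice quantizer. It also checks the step²/12 rule for the scalar lattice empirically. High-resolution theory is itself an assumption that weakens at low rates, and the names used are illustrative.

```python
import math
import random

# Normalized second moments; 1/(2*pi*e) is the limit for very high dimensions.
NSM = {
    "Z": 1.0 / 12.0,
    "A2": 5.0 / (36.0 * math.sqrt(3.0)),
    "D4": 0.076603,
    "E8": 0.071682,
}
SPHERE_LIMIT = 1.0 / (2.0 * math.pi * math.e)


def gap_db(g):
    """Distortion penalty in dB relative to the rate-distortion bound at equal rate."""
    return 10.0 * math.log10(g / SPHERE_LIMIT)


def gap_bits(g):
    """Equivalent rate penalty in bits per scalar at equal distortion."""
    return 0.5 * math.log2(g / SPHERE_LIMIT)


for name, g in NSM.items():
    print(f"{name}: G={g:.6f}  gap={gap_db(g):.3f} dB  ({gap_bits(g):.3f} bits)")

# Empirical check on the scalar lattice: MSE should be close to step**2 / 12.
rng = random.Random(0)
step = 0.25
samples = [rng.gauss(0.0, 1.0) for _ in range(100_000)]
mse = sum((x - step * round(x / step)) ** 2 for x in samples) / len(samples)
predicted = step ** 2 / 12.0
print("measured / predicted MSE:", round(mse / predicted, 3))
```

Running the same comparison on real post-transform tiles, with the lattice's own G in place of 1/12, would give the gap that the falsification test asks for.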

Figures

Figures reproduced from arXiv: 2606.23406 by Hadar Sackstein, Tomer Solberg, Yuval Domb.

Figure 1: HyperQuant end-to-end pipeline. Encode (top, left to right) and Decode (bottom, right to left), each inverse directly below its forward block. Colour marks applicability: blue is the shared core (weights and KV), orange dashed is bias correction (KV cache only, ablated in Section 7), and purple is the cast to the Tensor-Core format. The RHT has no decode block: orthogonal along the contraction axis, it is …
Figure 2: Weight quantization on Llama-3.1-8B at matched bps across the four lattices ( …
Figure 3: KV-only HyperQuant on Llama-3.1-8B with bf16 weights, shown over the two working regimes. (Left) High-compression regime (1.7–2.5 bps): QJL/signs rotation pulls ahead of plain none as the rate falls, reaching ∼0.5 PPL at 2.0 bps. (Right) High-quality regime (2.5–4.0 bps): all six bias-correction variants collapse onto the bf16 baseline, within 0.04 PPL of one another at b ≥ 2.5.

Two operating regimes …
Figure 4: Per-prompt PSNR (a) and LPIPS (b), and per-frame PSNR (mean …
Figure 5: Sample frames from HyperQuant on LTX-2: bf16 baseline (top), int8 + lattice (middle), and per-frame error map (bottom). No visible artefacts; per-frame PSNR is flat across the 49-frame window.

Per-frame analysis. Per-frame PSNR is essentially constant across the 49-frame window (Figure 5): quantization noise does not compound through the DiT’s temporal conditioning. The low absolute PSNR (22–23 dB) reflec…
Figure 6: Llama-3.1-8B WikiText-2 PPL versus bits per scalar for …
Figure 7: The two components of the HyperQuant-vs-HIGGS gap.

  bps   HIGGS-p2 SNR   A2+Rice SNR   total ∆   entropy-coding piece   unbounded-codebook piece
  3.0   15.28 dB       16.02 dB      0.74 dB   0.61 dB                0.13 dB
  4.0   21.10 dB       22.21 dB      1.11 dB   0.75 dB                0.36 dB
  5.0   26.29 dB       28.26 dB      1.97 dB   0.18 dB                1.79 dB
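The numbers in the Figure 7 caption can be checked for internal consistency: the total gap should equal the SNR difference and also the sum of the two pieces. The short sketch below does that and reports how much of the gap each piece explains; the row values are copied from the caption and the variable names are only for illustration.

```python
# (bps, HIGGS-p2 SNR, A2+Rice SNR, total gap, entropy-coding piece, unbounded-codebook piece), all in dB
rows = [
    (3.0, 15.28, 16.02, 0.74, 0.61, 0.13),
    (4.0, 21.10, 22.21, 1.11, 0.75, 0.36),
    (5.0, 26.29, 28.26, 1.97, 0.18, 1.79),
]

consistent = True
entropy_share = {}
for bps, higgs_snr, ours_snr, total, entropy_piece, codebook_piece in rows:
    snr_gap_ok = abs((ours_snr - higgs_snr) - total) < 0.005
    pieces_ok = abs((entropy_piece + codebook_piece) - total) < 0.005
    consistent = consistent and snr_gap_ok and pieces_ok
    entropy_share[bps] = entropy_piece / total
    print(f"{bps} bps: entropy coding explains {100 * entropy_share[bps]:.0f}% of the gap")

print("rows consistent:", consistent)
```

On these figures the entropy-coding piece accounts for most of the gap at 3 bps and only a small part at 5 bps, where the unbounded-codebook piece dominates.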
Figure 8: PPL vs. RHT tile size on Llama-3.1-8B (E8^int, no MMA cast, true RHT), as ∆PPL relative to the bf16 baseline for the weight, KV, and joint W+KV paths at 3 bps (left) and 4 bps (right). Markers are the mean over four RHT seeds; error bars are ±1 standard deviation. KV is nearly tile-invariant with negligible seed variance, while for the weight and W+KV paths the per-tile differences fall within the seed er…
Figure 9: HIGGS codebook index distributions are non-uniform on Llama weights, with …
Figure 10: Empirical Rice rate versus target SNR on iid-Gaussian tiles ( …
Figure 11: E8 calibration detail. Top: achieved Rice rate tracks the lattice ideal within ≈ 0.1 bps over 20–30 dB (the dashed scalar_bps is the byte-aligned int8 fallback, which steps with the integer byte budget; the scalar_entropy curve, the marginal entropy of the stripped indices, lies essentially on the lattice ideal, cf. Section B.4). Bottom: sampled int8 overflow rate P(|c_i| > 127), which is 0 at every opera…

Original abstract

We present HyperQuant (Hadamard, optimallY Packing, Entropy Rice-coding), a unified post-training quantization pipeline for the weights and the KV cache of large language and diffusion transformers. Across a suite of self-contained experiments (Table 1), HyperQuant outperforms the recent HIGGS scheme at every operating point from 3 to 5 bits per scalar (bps) on weights, and beats both TurboQuant and OCTOPUS on KV quantization down to 1.7 bps. Beyond the LLM setting, HyperQuant quantizes the 19B-parameter LTX-2 DiT video model with no observable per-frame artifacts. End-to-end on an H100 at 4 bps, HyperQuant compresses the linear weights ~3.9x and the KV cache ~3.79x at near-lossless quality. HyperQuant combines four known ideas into a single construction: (i) a per-tile Randomized Hadamard Transform that makes the per-coordinate distribution of weights and activations approximately Gaussian; (ii) quantization to a low-dimensional optimal lattice (E8, D4, A2, or Z); (iii) lossless bit-stripping and near-entropy-optimal variable-length Rice coding of the lattice indices; and (iv) bias-correction methods for the KV cache that keep the reconstruction unbiased under inner products, preserving attention semantics. We further integrate the pipeline with 8-bit and 4-bit Tensor-Core MMA paths (fp8-e4m3, int8, nvfp4, mxfp4), and find that int8 beats fp8 on the post-RHT lattice output. Project page: https://moonmath.ai/hyperquant/
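Stage (iii) of the abstract relies on Rice coding of the lattice indices. The sketch below shows a textbook Rice coder for signed integers, using a zigzag map to make them non-negative and a bit string for readability. The parameter `k`, the zigzag convention, and the example indices are assumptions for illustration; the paper's bit-stripping step and its GPU-side layout are not reproduced here.

```python
def zigzag(n):
    """Map signed to non-negative integers: 0, -1, 1, -2, 2 become 0, 1, 2, 3, 4."""
    return 2 * n if n >= 0 else -2 * n - 1


def unzigzag(u):
    return u // 2 if u % 2 == 0 else -(u + 1) // 2


def rice_encode(values, k):
    """Rice code with parameter k: unary quotient, a 0 terminator, then a k-bit remainder."""
    bits = []
    for v in values:
        u = zigzag(v)
        quotient, remainder = u >> k, u & ((1 << k) - 1)
        bits.append("1" * quotient + "0")
        if k:
            bits.append(format(remainder, f"0{k}b"))
    return "".join(bits)


def rice_decode(bitstring, k, count):
    values, pos = [], 0
    for _ in range(count):
        quotient = 0
        while bitstring[pos] == "1":
            quotient += 1
            pos += 1
        pos += 1  # skip the terminating 0
        remainder = int(bitstring[pos:pos + k], 2) if k else 0
        pos += k
        values.append(unzigzag((quotient << k) | remainder))
    return values


indices = [0, -1, 2, 1, 0, -3, 0, 1]  # hypothetical lattice indices, peaked around zero
encoded = rice_encode(indices, k=1)
decoded = rice_decode(encoded, k=1, count=len(indices))
print(encoded, len(encoded), "bits")
```

These eight indices take 22 bits with `k=1`, against 24 bits for a fixed 3-bit code covering the same range; the saving comes from giving the frequent small magnitudes the shortest codewords.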

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces HyperQuant, a post-training quantization pipeline for LLM and diffusion model weights and KV caches. It combines a per-tile Randomized Hadamard Transform (RHT) to produce approximately Gaussian marginals, quantization onto low-dimensional lattices (E8, D4, A2, Z), lossless Rice coding of lattice indices, and KV bias correction to preserve inner-product semantics. Experiments (Table 1) claim consistent outperformance versus HIGGS (weights, 3–5 bps), TurboQuant and OCTOPUS (KV, down to 1.7 bps), plus artifact-free quantization of the 19B LTX-2 DiT model and ~3.9× / 3.79× compression on H100 at 4 bps with near-lossless quality. The pipeline is also integrated with 8-bit/4-bit Tensor-Core paths.

Significance. If the reported gains are reproducible and the lattice advantage is shown to stem from the RHT-induced near-Gaussianity rather than from unstated implementation details, the work would supply a concrete rate-distortion construction that unifies weight and KV quantization while remaining hardware-aware. The explicit integration with fp8/int8/nvfp4/mxfp4 MMA paths and the self-contained experimental suite are practical strengths. The absence of quantitative distribution diagnostics, however, leaves open whether the headline superiority can be credited to the claimed optimality of the lattices.

major comments (2)
  1. [Abstract / method section] The central performance claims rest on the assertion that per-tile RHT yields “approximately Gaussian” per-coordinate distributions for which E8/D4/A2/Z lattices are near-optimal. No skewness, excess-kurtosis, or Kolmogorov–Smirnov statistics are supplied for the actual weight or KV tensors, nor is any tile-wise variation quantified. Because the rate-distortion optimality argument is load-bearing for attributing gains over HIGGS/TurboQuant/OCTOPUS to the lattice step rather than to other pipeline components, this verification is required.
  2. [Table 1] Experimental protocol: the abstract states that HyperQuant “outperforms … at every operating point,” yet the provided text supplies no error bars, number of random seeds, tile-exclusion criteria, or precise definition of “near-lossless quality.” These details are necessary to assess whether the reported 3–5 bps and 1.7 bps operating points are robust.
minor comments (1)
  1. [Abstract] The abstract mentions integration with “fp8-e4m3, int8, nvfp4, mxfp4” but does not state which of these paths was used for the headline H100 numbers; a single clarifying sentence would remove ambiguity.
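The distribution diagnostics requested in the first major comment are cheap to compute. The sketch below is a stdlib-only illustration of skewness, excess kurtosis, and a Kolmogorov-Smirnov distance against a standard normal, run on synthetic Gaussian and Laplace samples rather than on real tiles. Because the samples are standardized with their own mean and variance, the usual KS p-value tables do not apply directly; the sample sizes and names are assumptions.

```python
import math
import random


def standardize(xs):
    n = len(xs)
    mean = sum(xs) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in xs) / n)
    return [(x - mean) / std for x in xs]


def skewness(xs):
    z = standardize(xs)
    return sum(v ** 3 for v in z) / len(z)


def excess_kurtosis(xs):
    z = standardize(xs)
    return sum(v ** 4 for v in z) / len(z) - 3.0


def ks_distance_to_normal(xs):
    """Largest gap between the empirical CDF of standardized xs and the N(0, 1) CDF."""
    z = sorted(standardize(xs))
    n = len(z)
    worst = 0.0
    for i, v in enumerate(z):
        cdf = 0.5 * (1.0 + math.erf(v / math.sqrt(2.0)))
        worst = max(worst, (i + 1) / n - cdf, cdf - i / n)
    return worst


rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(4096)]
laplace = [rng.expovariate(1.0) * rng.choice((-1.0, 1.0)) for _ in range(4096)]

for name, sample in (("gaussian", gaussian), ("laplace", laplace)):
    print(name,
          "skew=%.3f" % skewness(sample),
          "excess_kurtosis=%.3f" % excess_kurtosis(sample),
          "ks=%.4f" % ks_distance_to_normal(sample))
```

Applied per tile before and after the transform, the same three numbers would also give the tile-to-tile variation that the comment asks for.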

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional quantitative support would strengthen the attribution of gains to the lattice step and improve experimental transparency. We respond point-by-point below.

  1. Referee: [Abstract / method section] The central performance claims rest on the assertion that per-tile RHT yields “approximately Gaussian” per-coordinate distributions for which E8/D4/A2/Z lattices are near-optimal. No skewness, excess-kurtosis, or Kolmogorov–Smirnov statistics are supplied for the actual weight or KV tensors, nor is any tile-wise variation quantified. Because the rate-distortion optimality argument is load-bearing for attributing gains over HIGGS/TurboQuant/OCTOPUS to the lattice step rather than to other pipeline components, this verification is required.

    Authors: We agree that the manuscript would benefit from explicit distribution diagnostics to support the near-Gaussianity claim and to strengthen the link between the RHT and lattice optimality. While the theoretical motivation for RHT is standard in the literature, we did not report kurtosis, skewness, or KS statistics. In revision we will add these metrics (skewness, excess kurtosis, and KS p-values) computed on representative tiles from both weight and KV tensors, plus a brief quantification of tile-to-tile variation. This will allow direct assessment of how closely the post-RHT marginals match the Gaussian assumption underlying the lattice choice. revision: yes

  2. Referee: [Table 1] Experimental protocol: the abstract states that HyperQuant “outperforms … at every operating point,” yet the provided text supplies no error bars, number of random seeds, tile-exclusion criteria, or precise definition of “near-lossless quality.” These details are necessary to assess whether the reported 3–5 bps and 1.7 bps operating points are robust.

    Authors: The post-training quantization pipeline is deterministic once the per-tile RHT seeds are fixed; we used a single fixed seed per model for full reproducibility, so classical error bars over random seeds do not apply. We will add an explicit statement of this fact, report the seed value, state that no tiles were excluded, and define “near-lossless” quantitatively (perplexity within 0.5 % of the unquantized baseline for LLMs; no visible per-frame artifacts plus PSNR > 38 dB for the DiT model). These clarifications will be inserted into the experimental protocol section and Table 1 caption. revision: yes

Circularity Check

0 steps flagged

No circularity: constructive pipeline validated on external baselines


The paper presents HyperQuant as a pipeline that combines four known techniques (per-tile RHT, low-dimensional lattices, Rice coding, KV bias correction) and reports empirical gains against external baselines (HIGGS, TurboQuant, OCTOPUS) in self-contained experiments. No equations, parameters, or self-citations reduce any reported performance metric or optimality claim to a quantity fitted or defined inside the same paper. The Gaussianity assumption is an empirical premise, not a self-referential derivation step.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The construction rests on the domain assumption that a randomized Hadamard transform sufficiently Gaussianizes the data for lattice optimality and that bias correction leaves inner products unbiased; no free parameters or new entities are introduced beyond standard quantization choices.

axioms (2)
  • Domain assumption: the per-tile randomized Hadamard transform produces approximately Gaussian coordinate distributions for weights and activations.
    Explicitly stated as the justification for using Gaussian-optimal lattices.
  • Domain assumption: bias-correction methods keep reconstruction unbiased under inner products and thereby preserve attention semantics.
    Required for the KV-cache claim to hold without changing model behavior.

