HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

Hadar Sackstein; Tomer Solberg; Yuval Domb

arxiv: 2606.23406 · v1 · pith:KAFFW3OSnew · submitted 2026-06-22 · 💻 cs.LG · cs.AI

HyperQuant: A Rate-Distortion-Optimal Quantization Pipeline for Large Language and Diffusion Models

Yuval Domb , Hadar Sackstein , Tomer Solberg This is my paper

Pith reviewed 2026-06-26 08:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords post-training quantizationlarge language modelsKV cache compressionlattice quantizationHadamard transformRice codingdiffusion modelsrate-distortion optimization

0 comments

The pith

HyperQuant achieves superior rate-distortion tradeoffs for LLM and diffusion model quantization by applying a per-tile randomized Hadamard transform before lattice quantization and Rice coding.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces HyperQuant as a unified post-training pipeline that quantizes both weights and KV cache in large transformers. It first applies a randomized Hadamard transform per tile to make coordinate distributions approximately Gaussian, then quantizes the result to a low-dimensional optimal lattice such as E8 or D4. The lattice indices are encoded with near-entropy-optimal Rice coding, and a bias-correction step is added for the KV cache to keep inner products unbiased. Experiments show this construction beats HIGGS on weight quantization from 3 to 5 bits per scalar and outperforms TurboQuant and OCTOPUS on KV cache down to 1.7 bits per scalar, while also handling a 19B-parameter video diffusion model without visible artifacts.

Core claim

HyperQuant demonstrates that the combination of a per-tile randomized Hadamard transform, quantization onto low-dimensional lattices (E8, D4, A2, Z), variable-length Rice coding, and KV bias correction produces lower distortion at given rates than existing methods, yielding 3.9x compression on linear weights and 3.79x on KV cache at 4 bits per scalar with near-lossless end-to-end quality on an H100.

What carries the argument

The per-tile randomized Hadamard transform that reshapes weight and activation distributions to be approximately Gaussian, enabling near-optimal lattice quantization followed by Rice coding.

If this is right

The method outperforms HIGGS at every operating point between 3 and 5 bits per scalar on weights.
It exceeds TurboQuant and OCTOPUS performance on KV-cache quantization down to 1.7 bits per scalar.
End-to-end compression reaches approximately 3.9 times for linear weights and 3.79 times for KV cache at 4 bits per scalar while remaining near lossless on H100 hardware.
The same pipeline quantizes a 19B-parameter diffusion transformer video model with no observable per-frame artifacts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The memory savings could allow larger batch sizes or longer context lengths on the same GPU without retraining.
Because the pipeline integrates directly with existing 8-bit and 4-bit Tensor-Core paths, it may accelerate inference on current accelerator hardware without new silicon.
The Gaussianizing property of the transform might be reusable as a preprocessing step for other quantization or compression techniques that assume Gaussian inputs.

Load-bearing premise

The randomized Hadamard transform produces coordinate distributions close enough to Gaussian that the selected low-dimensional lattices achieve near-minimal quantization error, and the KV bias correction preserves the semantics of attention inner products.

What would settle it

Measure the actual mean-squared error after applying the randomized Hadamard transform and lattice quantization to a real transformer weight tile and compare it against the known minimum distortion of the lattice at that rate; a large gap would falsify the near-optimality claim.

Figures

Figures reproduced from arXiv: 2606.23406 by Hadar Sackstein, Tomer Solberg, Yuval Domb.

**Figure 1.** Figure 1: HyperQuant end-to-end pipeline. Encode (top, left to right) and Decode (bottom, right to left), each inverse directly below its forward block. Colour marks applicability: blue is the shared core (weights and KV), orange dashed is bias correction (KV cache only, ablated in Section 7), and purple is the cast to the Tensor-Core format. The RHT has no decode block: orthogonal along the contraction axis, it is … view at source ↗

**Figure 2.** Figure 2: Weight quantization on Llama-3.1-8B at matched bps across the four lattices ( [PITH_FULL_IMAGE:figures/full_fig_p016_2.png] view at source ↗

**Figure 3.** Figure 3: KV-only HyperQuant on Llama-3.1-8B with bf16 weights, shown over the two working regimes. (Left) High-compression regime (1.7–2.5 bps): QJL/signs rotation pulls ahead of plain none as the rate falls, reaching ∼ 0.5 PPL at 2.0 bps. (Right) High-quality regime (2.5–4.0 bps): all six bias-correction variants collapse onto the bf16 baseline, within 0.04 PPL of one another at b ≥ 2.5. Two operating regimes [PI… view at source ↗

**Figure 4.** Figure 4: Per-prompt PSNR (a) and LPIPS (b), and per-frame PSNR (mean [PITH_FULL_IMAGE:figures/full_fig_p019_4.png] view at source ↗

**Figure 5.** Figure 5: Sample frames from HyperQuant on LTX-2: bf16 baseline (top), int8 + lattice (middle), and per-frame error map (bottom). No visible artefacts; per-frame PSNR is flat across the 49-frame window. Per-frame analysis. Per-frame PSNR is essentially constant across the 49-frame window (Figure 5): quantization noise does not compound through the DiT’s temporal conditioning. The low absolute PSNR (22–23 dB) reflec… view at source ↗

**Figure 6.** Figure 6: Llama-3.1-8B WikiText-2 PPL versus bits per scalar for [PITH_FULL_IMAGE:figures/full_fig_p022_6.png] view at source ↗

**Figure 7.** Figure 7: The two components of the HyperQuant-vs-HIGGS gap. bps HIGGS-p2 SNR A2+Rice SNR total ∆ entropy-coding piece unbounded-codebook piece 3.0 15.28 dB 16.02 dB 0.74 dB 0.61 dB 0.13 dB 4.0 21.10 dB 22.21 dB 1.11 dB 0.75 dB 0.36 dB 5.0 26.29 dB 28.26 dB 1.97 dB 0.18 dB 1.79 dB [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗

**Figure 8.** Figure 8: PPL vs. RHT tile size on Llama-3.1-8B (Eint 8 , no MMA cast, true RHT), as ∆PPL relative to the bf16 baseline for the weight, KV, and joint W+KV paths at 3 bps (left) and 4 bps (right). Markers are the mean over four RHT seeds; error bars are ±1 standard deviation. KV is nearly tile-invariant with negligible seed variance, while for the weight and W+KV paths the per-tile differences fall within the seed er… view at source ↗

**Figure 9.** Figure 9: HIGGS codebook index distributions are non-uniform on Llama weights, with [PITH_FULL_IMAGE:figures/full_fig_p027_9.png] view at source ↗

**Figure 10.** Figure 10: Empirical Rice rate versus target SNR on iid-Gaussian tiles ( [PITH_FULL_IMAGE:figures/full_fig_p035_10.png] view at source ↗

**Figure 11.** Figure 11: E8 calibration detail. Top: achieved Rice rate tracks the lattice ideal within ≈ 0.1 bps over 20–30 dB (the dashed scalar_bps is the byte-aligned int8 fallback, which steps with the integer byte budget; the scalar_entropy curve, the marginal entropy of the stripped indices, lies essentially on the lattice ideal, cf. Section B.4). Bottom: sampled int8 overflow rate P(|ci | > 127), which is 0 at every opera… view at source ↗

read the original abstract

We present HyperQuant (Hadamard, optimallY Packing, Entropy Rice-coding), a unified post-training quantization pipeline for the weights and the KV cache of large language and diffusion transformers. Across a suite of self-contained experiments (Table 1), HyperQuant outperforms the recent HIGGS scheme at every operating point from 3 to 5 bits per scalar (bps) on weights, and beats both TurboQuant and OCTOPUS on KV quantization down to 1.7 bps. Beyond the LLM setting, HyperQuant quantizes the 19B-parameter LTX-2 DiT video model with no observable per-frame artifacts. End-to-end on an H100 at 4 bps, HyperQuant compresses the linear weights ~3.9x and the KV cache ~3.79x at near-lossless quality. HyperQuant combines four known ideas into a single construction: (i) a per-tile Randomized Hadamard Transform that makes the per-coordinate distribution of weights and activations approximately Gaussian; (ii) quantization to a low-dimensional optimal lattice (E8, D4, A2, or Z); (iii) lossless bit-stripping and near-entropy-optimal variable-length Rice coding of the lattice indices; and (iv) bias-correction methods for the KV cache that keep the reconstruction unbiased under inner products, preserving attention semantics. We further integrate the pipeline with 8-bit and 4-bit Tensor-Core MMA paths (fp8-e4m3, int8, nvfp4, mxfp4), and find that int8 beats fp8 on the post-RHT lattice output. Project page: https://moonmath.ai/hyperquant/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HyperQuant wires together four existing quantization pieces into a working pipeline that reports better numbers than HIGGS on weights and TurboQuant/OCTOPUS on KV cache, plus clean diffusion results, but the Gaussian assumption after RHT lacks supporting measurements.

read the letter

HyperQuant applies a per-tile randomized Hadamard transform, quantizes the results with low-dimensional lattices (E8, D4, A2, Z), uses Rice coding, and adds KV bias correction to keep inner products unbiased. It claims to beat HIGGS at every point from 3 to 5 bits on weights and the other two schemes on KV down to 1.7 bits, with 3.9x weight and 3.79x KV compression at 4 bits on H100 and no visible artifacts on the 19B LTX-2 DiT model.

The concrete engineering combination and the hardware tensor-core paths (including the int8 vs fp8 finding) are the parts that look new relative to the cited priors. Extending the same pipeline to both weights and KV cache in a diffusion transformer is also less common.

The main soft spot is the claim that the post-RHT coordinates are close enough to Gaussian for the chosen lattices to be near-optimal. The abstract asserts the approximation without any reported checks on kurtosis, skewness, or tile-wise variation, so it is not clear how much of the reported gain actually comes from the lattice step versus the other components. The KV bias correction is a reasonable safeguard, but its effect on attention also rests on the paper's internal tests.

Experiments are described as self-contained, yet the abstract supplies no error bars or exclusion details, which makes the outperformance numbers hard to judge without the full tables. Citation pattern looks standard and there is no circular fitting.

This is for practitioners who need a drop-in quantizer that hits those bit rates on both LLM and diffusion workloads. A reader focused on inference engineering would find the compression factors and H100 numbers useful.

It deserves peer review because the pipeline is testable and the practical claims are specific enough to evaluate directly.

Referee Report

2 major / 1 minor

Summary. The paper introduces HyperQuant, a post-training quantization pipeline for LLM and diffusion model weights and KV caches. It combines a per-tile Randomized Hadamard Transform (RHT) to produce approximately Gaussian marginals, quantization onto low-dimensional lattices (E8, D4, A2, Z), lossless Rice coding of lattice indices, and KV bias correction to preserve inner-product semantics. Experiments (Table 1) claim consistent outperformance versus HIGGS (weights, 3–5 bps), TurboQuant and OCTOPUS (KV, down to 1.7 bps), plus artifact-free quantization of the 19B LTX-2 DiT model and ~3.9× / 3.79× compression on H100 at 4 bps with near-lossless quality. The pipeline is also integrated with 8-bit/4-bit Tensor-Core paths.

Significance. If the reported gains are reproducible and the lattice advantage is shown to stem from the RHT-induced near-Gaussianity rather than from unstated implementation details, the work would supply a concrete rate-distortion construction that unifies weight and KV quantization while remaining hardware-aware. The explicit integration with fp8/int8/nvfp4/mxfp4 MMA paths and the self-contained experimental suite are practical strengths. The absence of quantitative distribution diagnostics, however, leaves open whether the headline superiority can be credited to the claimed optimality of the lattices.

major comments (2)

[Abstract / method section] Abstract and method description: the central performance claims rest on the assertion that per-tile RHT yields “approximately Gaussian” per-coordinate distributions for which E8/D4/A2/Z lattices are near-optimal. No kurtosis, skewness, Kolmogorov–Smirnov statistic, or excess-kurtosis values are supplied for the actual weight or KV tensors, nor is any tile-wise variation quantified. Because the rate-distortion optimality argument is load-bearing for attributing gains over HIGGS/TurboQuant/OCTOPUS to the lattice step rather than to other pipeline components, this verification is required.
[Table 1] Table 1 and experimental protocol: the abstract states that HyperQuant “outperforms … at every operating point,” yet the provided text supplies neither error bars, number of random seeds, exclusion criteria for tiles, nor the precise definition of “near-lossless quality.” These details are necessary to assess whether the reported 3–5 bps and 1.7 bps operating points are robust.

minor comments (1)

[Abstract] The abstract mentions integration with “fp8-e4m3, int8, nvfp4, mxfp4” but does not state which of these paths was used for the headline H100 numbers; a single clarifying sentence would remove ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments correctly identify areas where additional quantitative support would strengthen the attribution of gains to the lattice step and improve experimental transparency. We respond point-by-point below.

read point-by-point responses

Referee: [Abstract / method section] Abstract and method description: the central performance claims rest on the assertion that per-tile RHT yields “approximately Gaussian” per-coordinate distributions for which E8/D4/A2/Z lattices are near-optimal. No kurtosis, skewness, Kolmogorov–Smirnov statistic, or excess-kurtosis values are supplied for the actual weight or KV tensors, nor is any tile-wise variation quantified. Because the rate-distortion optimality argument is load-bearing for attributing gains over HIGGS/TurboQuant/OCTOPUS to the lattice step rather than to other pipeline components, this verification is required.

Authors: We agree that the manuscript would benefit from explicit distribution diagnostics to support the near-Gaussianity claim and to strengthen the link between the RHT and lattice optimality. While the theoretical motivation for RHT is standard in the literature, we did not report kurtosis, skewness, or KS statistics. In revision we will add these metrics (kurtosis, excess kurtosis, skewness, and KS p-values) computed on representative tiles from both weight and KV tensors, plus a brief quantification of tile-to-tile variation. This will allow direct assessment of how closely the post-RHT marginals match the Gaussian assumption underlying the lattice choice. revision: yes
Referee: [Table 1] Table 1 and experimental protocol: the abstract states that HyperQuant “outperforms … at every operating point,” yet the provided text supplies neither error bars, number of random seeds, exclusion criteria for tiles, nor the precise definition of “near-lossless quality.” These details are necessary to assess whether the reported 3–5 bps and 1.7 bps operating points are robust.

Authors: The post-training quantization pipeline is deterministic once the per-tile RHT seeds are fixed; we used a single fixed seed per model for full reproducibility, so classical error bars over random seeds do not apply. We will add an explicit statement of this fact, report the seed value, state that no tiles were excluded, and define “near-lossless” quantitatively (perplexity within 0.5 % of the unquantized baseline for LLMs; no visible per-frame artifacts plus PSNR > 38 dB for the DiT model). These clarifications will be inserted into the experimental protocol section and Table 1 caption. revision: yes

Circularity Check

0 steps flagged

No circularity: constructive pipeline validated on external baselines

full rationale

The paper presents HyperQuant as a pipeline that combines four known techniques (per-tile RHT, low-dimensional lattices, Rice coding, KV bias correction) and reports empirical gains against external baselines (HIGGS, TurboQuant, OCTOPUS) in self-contained experiments. No equations, parameters, or self-citations reduce any reported performance metric or optimality claim to a quantity fitted or defined inside the same paper. The Gaussianity assumption is an empirical premise, not a self-referential derivation step.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The construction rests on the domain assumption that a randomized Hadamard transform sufficiently Gaussianizes the data for lattice optimality and that bias correction leaves inner products unbiased; no free parameters or new entities are introduced beyond standard quantization choices.

axioms (2)

domain assumption Per-tile randomized Hadamard transform produces approximately Gaussian coordinate distributions for weights and activations.
Explicitly stated as the justification for using Gaussian-optimal lattices.
domain assumption Bias-correction methods keep reconstruction unbiased under inner products and thereby preserve attention semantics.
Required for the KV-cache claim to hold without changing model behavior.

pith-pipeline@v0.9.1-grok · 5849 in / 1430 out tokens · 23010 ms · 2026-06-26T08:44:00.951788+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 21 canonical work pages · 10 internal anchors

[1]

The Llama-3 herd of models

Meta AI. The Llama-3 herd of models. Meta AI research publication, 2024. URLhttps: //ai.meta.com/research/publications/the-llama-3-herd-of-models/

2024
[2]

The fast Johnson–Lindenstrauss transform and approximate nearest neighbors.SIAM Journal on Computing, 39(1):302–322, 2009

Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors.SIAM Journal on Computing, 39(1):302–322, 2009

2009
[3]

Slicegpt: Compress large language models by deleting rows and columns

Saleh Ashkboos, Maximilian L. Croci, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. InICLR, 2024. URLhttps: //arxiv.org/abs/2401.15024

work page arXiv 2024
[4]

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

Mark Boss, Vikram Voleti, Simon Donné, and Shimon Vainer. Octopus: Optimized kv cache for transformers via octahedral parametrization under optimal squared error quantization, 2026. URLhttps://arxiv.org/abs/2605.21226

work page internal anchor Pith review Pith/arXiv arXiv 2026
[5]

QuIP: 2-bit quantization of large language models with guarantees

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. QuIP: 2-bit quantization of large language models with guarantees. InNeurIPS, 2023. URLhttps://arxiv.org/abs/ 2307.13304

work page arXiv 2023
[6]

Conway and Neil J

John H. Conway and Neil J. A. Sloane.Sphere Packings, Lattices and Groups. Springer, 3rd edition, 1999

1999
[7]

Cover and Joy A

Thomas M. Cover and Joy A. Thomas.Elements of Information Theory. Wiley, 2nd edition, 2006

2006
[8]

A sparse Johnson–Lindenstrauss transform

Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson–Lindenstrauss transform. InSTOC, 2010

2010
[9]

SpQR: A sparse-quantized representation for near-lossless LLM weight compression

Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. InICLR, 2024. URLhttps://arxiv. org/abs/2306.03078

work page arXiv 2024
[10]

Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding

Jarek Duda. Asymmetric numeral systems: Entropy coding combining speed of Huffman coding with compression rate of arithmetic coding.arXiv preprint arXiv:1311.2540, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[11]

On the closeness of the random-dither mapping to the information- theoretic optimum for vector quantization.IEEE Transactions on Information Theory, 51(10): 3617–3631, 2005

Uri Erez and Ram Zamir. On the closeness of the random-dither mapping to the information- theoretic optimum for vector quantization.IEEE Transactions on Information Theory, 51(10): 3617–3631, 2005

2005
[12]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024. URLhttps://arxiv.org/abs/2403.03206

work page internal anchor Pith review Pith/arXiv arXiv 2024
[13]

David Forney and Lee-Fang Wei

G. David Forney and Lee-Fang Wei. Multidimensional constellations—Part I: Introduction, figures of merit, and generalized cross constellations.IEEE Journal on Selected Areas in Communications, 7(6):877–892, 1989

1989
[14]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post- training quantization for generative pre-trained transformers. InICLR, 2023. URLhttps: //arxiv.org/abs/2210.17323. 29

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Herbert Gish and John N. Pierce. Asymptotically efficient quantizing.IEEE Transactions on Information Theory, 14(5):676–683, 1968

1968
[16]

Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.SIAM Review, 53(2):217–288, 2011. URLhttps://arxiv.org/abs/0909.4061

work page internal anchor Pith review Pith/arXiv arXiv 2011
[17]

LTX-Video: A real-time video generation model

Lightricks. LTX-Video: A real-time video generation model. GitHub repository, 2024. URL https://github.com/Lightricks/LTX-Video

2024
[18]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. InMLSys, 2024. URLhttps://arxiv.org/ abs/2306.00978

work page internal anchor Pith review Pith/arXiv arXiv 2024
[19]

Lookabaugh and Robert M

Thomas D. Lookabaugh and Robert M. Gray. High-resolution quantization theory and the vector quantizer advantage.IEEE Transactions on Information Theory, 35(5):1020–1033, 1989

1989
[20]

Pushing the limits of large language model quantization via the linearity theorem

Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, and Dan Alistarh. Pushing the limits of large language model quantization via the linearity theorem. InNAACL,
[21]

URLhttps://arxiv.org/abs/2411.17525

work page arXiv
[22]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. InICLR, 2017. URLhttps://arxiv.org/abs/1609.07843

work page internal anchor Pith review Pith/arXiv arXiv 2017
[24]

URLhttps://arxiv.org/abs/2209.05433

work page internal anchor Pith review Pith/arXiv arXiv
[25]

NVIDIA hopper h100 architecture white paper

NVIDIA. NVIDIA hopper h100 architecture white paper. NVIDIA white paper, 2022. URL https://resources.nvidia.com/en-us-tensor-core

2022
[26]

NVIDIA blackwell architecture technical brief

NVIDIA. NVIDIA blackwell architecture technical brief. NVIDIA technical brief, 2024. URL https://resources.nvidia.com/en-us-blackwell-architecture

2024
[27]

OCP Microscaling Formats (MX) specification v1.0

Open Compute Project. OCP Microscaling Formats (MX) specification v1.0. OCP specification, 2023. URL https://www.opencompute.org/documents/ ocp-microscaling-formats-mx-v1-0-spec-final-pdf

2023
[28]

Optimal quadratic quantization for numerics: The Gaussian case.Monte Carlo Methods and Applications, 9(2):135–165, 2003

Gilles Pagès and Jacques Printems. Optimal quadratic quantization for numerics: The Gaussian case.Monte Carlo Methods and Applications, 9(2):135–165, 2003

2003
[29]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023. URLhttps://arxiv.org/abs/2212.09748

work page internal anchor Pith review Pith/arXiv arXiv 2023
[30]

Efficiently scaling transformer inference

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xu, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. In MLSys, 2023. URLhttps://arxiv.org/abs/2211.05102

work page arXiv 2023
[31]

Robert F. Rice. Some practical universal noiseless coding techniques.JPL Publication 79-22, 1979

1979
[32]

Microscaling data formats for deep learning, 2023 b

Bita Darvish Rouhani, Nitin Garegrat, Tom Madian, Jeremy Lo, Brian Cook, Daniel Pinto, et al. Microscaling data formats for deep learning.arXiv preprint arXiv:2310.10537, 2023. URLhttps://arxiv.org/abs/2310.10537. 30

work page arXiv 2023
[33]

NestQuant: Nested lattice quantization for matrix products and LLMs

Semyon Savkin, Eitan Porat Chen, Or Lou, and Yury Polyanskiy. NestQuant: Nested lattice quantization for matrix products and LLMs. InICML, 2025. URLhttps://arxiv.org/abs/ 2502.09720

work page arXiv 2025
[34]

Dither signals and their effect on quantization noise.IEEE Transactions on Communication Technology, 12(4):162–165, 1964

Leonard Schuchman. Dither signals and their effect on quantization noise.IEEE Transactions on Communication Technology, 12(4):162–165, 1964

1964
[35]

”Omniquant: Omnidirectionally calibrated quantiza- tion for large language models.” arXiv preprint arXiv:2308.13137 (2023)

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally calibrated quantiza- tion for large language models. InICLR, 2024. URLhttps://arxiv.org/abs/2308.13137

work page arXiv 2024
[36]

QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks. InICML,
[37]

URLhttps://arxiv.org/abs/2402.04396

work page arXiv
[38]

Massively parallel Huffman decoding on GPUs

André Weißenberger and Bertil Schmidt. Massively parallel Huffman decoding on GPUs. In Proceedings of the 47th International Conference on Parallel Processing (ICPP), 2018

2018
[39]

Sullivan, Gisle Bjøntegaard, and Ajay Luthra

Thomas Wiegand, Gary J. Sullivan, Gisle Bjøntegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard.IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003

2003
[40]

Smoothquant: Accurate and efficient post-training quantization for large language models, 2024

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML, 2023. URLhttps://arxiv.org/abs/2211.10438

work page arXiv 2023
[41]

Cambridge University Press, 2014

Ram Zamir.Lattice Coding for Signals and Networks. Cambridge University Press, 2014

2014
[42]

On universal quantization by randomized uniform/lattice quantizers

Ram Zamir and Meir Feder. On universal quantization by randomized uniform/lattice quantizers. IEEE Transactions on Information Theory, 38(2):428–436, 1992

1992
[43]

QJL: 1-bit quantized JL transform for KV cache quantization with zero overhead.arXiv preprint arXiv:2406.03482, 2024

Amir Zandieh, Majid Daliri, and Insu Han. QJL: 1-bit quantized JL transform for KV cache quantization with zero overhead.arXiv preprint arXiv:2406.03482, 2024. URLhttps: //arxiv.org/abs/2406.03482

work page arXiv 2024
[44]

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. InICLR, 2026. URL https://arxiv.org/ abs/2504.19874. A Proof of subtractive-dither unbiasedness This appendix gives a self-contained proof that subtractive-dithered lattice quantization satisfies the Schuchman conditions ...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[1] [1]

The Llama-3 herd of models

Meta AI. The Llama-3 herd of models. Meta AI research publication, 2024. URLhttps: //ai.meta.com/research/publications/the-llama-3-herd-of-models/

2024

[2] [2]

The fast Johnson–Lindenstrauss transform and approximate nearest neighbors.SIAM Journal on Computing, 39(1):302–322, 2009

Nir Ailon and Bernard Chazelle. The fast Johnson–Lindenstrauss transform and approximate nearest neighbors.SIAM Journal on Computing, 39(1):302–322, 2009

2009

[3] [3]

Slicegpt: Compress large language models by deleting rows and columns

Saleh Ashkboos, Maximilian L. Croci, Torsten Hoefler, and James Hensman. SliceGPT: Compress large language models by deleting rows and columns. InICLR, 2024. URLhttps: //arxiv.org/abs/2401.15024

work page arXiv 2024

[4] [4]

OCTOPUS: Optimized KV Cache for Transformers via Octahedral Parametrization Under optimal Squared error quantization

Mark Boss, Vikram Voleti, Simon Donné, and Shimon Vainer. Octopus: Optimized kv cache for transformers via octahedral parametrization under optimal squared error quantization, 2026. URLhttps://arxiv.org/abs/2605.21226

work page internal anchor Pith review Pith/arXiv arXiv 2026

[5] [5]

QuIP: 2-bit quantization of large language models with guarantees

Jerry Chee, Yaohui Cai, Volodymyr Kuleshov, and Christopher De Sa. QuIP: 2-bit quantization of large language models with guarantees. InNeurIPS, 2023. URLhttps://arxiv.org/abs/ 2307.13304

work page arXiv 2023

[6] [6]

Conway and Neil J

John H. Conway and Neil J. A. Sloane.Sphere Packings, Lattices and Groups. Springer, 3rd edition, 1999

1999

[7] [7]

Cover and Joy A

Thomas M. Cover and Joy A. Thomas.Elements of Information Theory. Wiley, 2nd edition, 2006

2006

[8] [8]

A sparse Johnson–Lindenstrauss transform

Anirban Dasgupta, Ravi Kumar, and Tamás Sarlós. A sparse Johnson–Lindenstrauss transform. InSTOC, 2010

2010

[9] [9]

SpQR: A sparse-quantized representation for near-lossless LLM weight compression

Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. InICLR, 2024. URLhttps://arxiv. org/abs/2306.03078

work page arXiv 2024

[10] [10]

Asymmetric numeral systems: entropy coding combining speed of Huffman coding with compression rate of arithmetic coding

Jarek Duda. Asymmetric numeral systems: Entropy coding combining speed of Huffman coding with compression rate of arithmetic coding.arXiv preprint arXiv:1311.2540, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013

[11] [11]

On the closeness of the random-dither mapping to the information- theoretic optimum for vector quantization.IEEE Transactions on Information Theory, 51(10): 3617–3631, 2005

Uri Erez and Ram Zamir. On the closeness of the random-dither mapping to the information- theoretic optimum for vector quantization.IEEE Transactions on Information Theory, 51(10): 3617–3631, 2005

2005

[12] [12]

Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

Patrick Esser et al. Scaling rectified flow transformers for high-resolution image synthesis. arXiv preprint arXiv:2403.03206, 2024. URLhttps://arxiv.org/abs/2403.03206

work page internal anchor Pith review Pith/arXiv arXiv 2024

[13] [13]

David Forney and Lee-Fang Wei

G. David Forney and Lee-Fang Wei. Multidimensional constellations—Part I: Introduction, figures of merit, and generalized cross constellations.IEEE Journal on Selected Areas in Communications, 7(6):877–892, 1989

1989

[14] [14]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post- training quantization for generative pre-trained transformers. InICLR, 2023. URLhttps: //arxiv.org/abs/2210.17323. 29

work page internal anchor Pith review Pith/arXiv arXiv 2023

[15] [15]

Herbert Gish and John N. Pierce. Asymptotically efficient quantizing.IEEE Transactions on Information Theory, 14(5):676–683, 1968

1968

[16] [16]

Nathan Halko, Per-Gunnar Martinsson, and Joel A. Tropp. Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions.SIAM Review, 53(2):217–288, 2011. URLhttps://arxiv.org/abs/0909.4061

work page internal anchor Pith review Pith/arXiv arXiv 2011

[17] [17]

LTX-Video: A real-time video generation model

Lightricks. LTX-Video: A real-time video generation model. GitHub repository, 2024. URL https://github.com/Lightricks/LTX-Video

2024

[18] [18]

AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. AWQ: Activation-aware weight quantization for on-device LLM compression and acceleration. InMLSys, 2024. URLhttps://arxiv.org/ abs/2306.00978

work page internal anchor Pith review Pith/arXiv arXiv 2024

[19] [19]

Lookabaugh and Robert M

Thomas D. Lookabaugh and Robert M. Gray. High-resolution quantization theory and the vector quantizer advantage.IEEE Transactions on Information Theory, 35(5):1020–1033, 1989

1989

[20] [20]

Pushing the limits of large language model quantization via the linearity theorem

Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, and Dan Alistarh. Pushing the limits of large language model quantization via the linearity theorem. InNAACL,

[21] [21]

URLhttps://arxiv.org/abs/2411.17525

work page arXiv

[22] [22]

Pointer Sentinel Mixture Models

Stephen Merity, Caiming Xiong, James Bradbury, and Richard Socher. Pointer sentinel mixture models. InICLR, 2017. URLhttps://arxiv.org/abs/1609.07843

work page internal anchor Pith review Pith/arXiv arXiv 2017

[23] [24]

URLhttps://arxiv.org/abs/2209.05433

work page internal anchor Pith review Pith/arXiv arXiv

[24] [25]

NVIDIA hopper h100 architecture white paper

NVIDIA. NVIDIA hopper h100 architecture white paper. NVIDIA white paper, 2022. URL https://resources.nvidia.com/en-us-tensor-core

2022

[25] [26]

NVIDIA blackwell architecture technical brief

NVIDIA. NVIDIA blackwell architecture technical brief. NVIDIA technical brief, 2024. URL https://resources.nvidia.com/en-us-blackwell-architecture

2024

[26] [27]

OCP Microscaling Formats (MX) specification v1.0

Open Compute Project. OCP Microscaling Formats (MX) specification v1.0. OCP specification, 2023. URL https://www.opencompute.org/documents/ ocp-microscaling-formats-mx-v1-0-spec-final-pdf

2023

[27] [28]

Optimal quadratic quantization for numerics: The Gaussian case.Monte Carlo Methods and Applications, 9(2):135–165, 2003

Gilles Pagès and Jacques Printems. Optimal quadratic quantization for numerics: The Gaussian case.Monte Carlo Methods and Applications, 9(2):135–165, 2003

2003

[28] [29]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InICCV, 2023. URLhttps://arxiv.org/abs/2212.09748

work page internal anchor Pith review Pith/arXiv arXiv 2023

[29] [30]

Efficiently scaling transformer inference

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xu, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. In MLSys, 2023. URLhttps://arxiv.org/abs/2211.05102

work page arXiv 2023

[30] [31]

Robert F. Rice. Some practical universal noiseless coding techniques.JPL Publication 79-22, 1979

1979

[31] [32]

Microscaling data formats for deep learning, 2023 b

Bita Darvish Rouhani, Nitin Garegrat, Tom Madian, Jeremy Lo, Brian Cook, Daniel Pinto, et al. Microscaling data formats for deep learning.arXiv preprint arXiv:2310.10537, 2023. URLhttps://arxiv.org/abs/2310.10537. 30

work page arXiv 2023

[32] [33]

NestQuant: Nested lattice quantization for matrix products and LLMs

Semyon Savkin, Eitan Porat Chen, Or Lou, and Yury Polyanskiy. NestQuant: Nested lattice quantization for matrix products and LLMs. InICML, 2025. URLhttps://arxiv.org/abs/ 2502.09720

work page arXiv 2025

[33] [34]

Dither signals and their effect on quantization noise.IEEE Transactions on Communication Technology, 12(4):162–165, 1964

Leonard Schuchman. Dither signals and their effect on quantization noise.IEEE Transactions on Communication Technology, 12(4):162–165, 1964

1964

[34] [35]

”Omniquant: Omnidirectionally calibrated quantiza- tion for large language models.” arXiv preprint arXiv:2308.13137 (2023)

Wenqi Shao, Mengzhao Chen, Zhaoyang Zhang, Peng Xu, Lirui Zhao, Zhiqian Li, Kaipeng Zhang, Peng Gao, Yu Qiao, and Ping Luo. OmniQuant: Omnidirectionally calibrated quantiza- tion for large language models. InICLR, 2024. URLhttps://arxiv.org/abs/2308.13137

work page arXiv 2024

[35] [36]

QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks

Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with hadamard incoherence and lattice codebooks. InICML,

[36] [37]

URLhttps://arxiv.org/abs/2402.04396

work page arXiv

[37] [38]

Massively parallel Huffman decoding on GPUs

André Weißenberger and Bertil Schmidt. Massively parallel Huffman decoding on GPUs. In Proceedings of the 47th International Conference on Parallel Processing (ICPP), 2018

2018

[38] [39]

Sullivan, Gisle Bjøntegaard, and Ajay Luthra

Thomas Wiegand, Gary J. Sullivan, Gisle Bjøntegaard, and Ajay Luthra. Overview of the H.264/AVC video coding standard.IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560–576, 2003

2003

[39] [40]

Smoothquant: Accurate and efficient post-training quantization for large language models, 2024

Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In ICML, 2023. URLhttps://arxiv.org/abs/2211.10438

work page arXiv 2023

[40] [41]

Cambridge University Press, 2014

Ram Zamir.Lattice Coding for Signals and Networks. Cambridge University Press, 2014

2014

[41] [42]

On universal quantization by randomized uniform/lattice quantizers

Ram Zamir and Meir Feder. On universal quantization by randomized uniform/lattice quantizers. IEEE Transactions on Information Theory, 38(2):428–436, 1992

1992

[42] [43]

QJL: 1-bit quantized JL transform for KV cache quantization with zero overhead.arXiv preprint arXiv:2406.03482, 2024

Amir Zandieh, Majid Daliri, and Insu Han. QJL: 1-bit quantized JL transform for KV cache quantization with zero overhead.arXiv preprint arXiv:2406.03482, 2024. URLhttps: //arxiv.org/abs/2406.03482

work page arXiv 2024

[43] [44]

TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate

Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. InICLR, 2026. URL https://arxiv.org/ abs/2504.19874. A Proof of subtractive-dither unbiasedness This appendix gives a self-contained proof that subtractive-dithered lattice quantization satisfies the Schuchman conditions ...

work page internal anchor Pith review Pith/arXiv arXiv 2026