pith. machine review for the scientific record.

arxiv: 2605.06014 · v1 · submitted 2026-05-07 · 💻 cs.LG · cs.AI · cs.DS · cs.NI

Recognition: unknown

Quantizing With Randomized Hadamard Transforms: Efficient Heuristic Now Proven

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:02 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.DS · cs.NI
keywords randomized Hadamard transform · uniform random rotation · gradient compression · vector quantization · Kolmogorov distance · Wasserstein distance · high-dimensional data · orthogonal preprocessing

The pith

Two randomized Hadamard transforms bring every coordinate's marginal distribution within O(d^{-1/2}) of a standard Gaussian.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that composing two randomized Hadamard transforms on any d-dimensional vector produces marginal coordinate distributions after normalization that are O(d^{-1/2})-close to standard normal in both Kolmogorov and 1-Wasserstein distance. This bound is then substituted directly into the existing performance analyses of DRIVE and QUIC-FL to show that the resulting compression error matches the asymptotic behavior obtained with uniform random rotations. For vector quantization the authors show that two transforms may not suffice and prove that a third drives coordinate covariances to zero at a sufficient rate, ensuring the expected error for any fixed bounded codebook matches the uniform-rotation case up to vanishing terms. The same closeness also motivates a linear-time moment check that can decide at runtime whether one, two, or three transforms are enough for a given input.

Core claim

After any input vector is rotated by the product of two independent randomized Hadamard matrices and then normalized, the one-dimensional marginal law of each fixed coordinate lies within O(d^{-1/2}) of N(0,1) in Kolmogorov distance and in 1-Wasserstein distance. Substituting these explicit bounds into the error expressions for DRIVE and QUIC-FL yields that the expected quantization distortion is asymptotically identical to the distortion obtained when the same schemes are preceded by a uniform random rotation. A parallel argument for three transforms shows that the covariance between any two fixed coordinates decays fast enough that the expected error of any fixed, bounded VQ codebook matches the uniform-rotation case up to an additive term that vanishes with the dimension.
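
A minimal NumPy sketch of the preprocessing this claim describes, assuming d is a power of two. The function names and the final sqrt(d)/norm scaling are our choices (the page does not spell out the paper's exact normalization), so treat this as an illustration rather than the paper's construction.

    import numpy as np

    def fwht(x):
        # In-place fast Walsh-Hadamard transform, O(d log d); len(x) must be a power of two.
        d, h = x.shape[0], 1
        while h < d:
            for i in range(0, d, 2 * h):
                a, b = x[i:i + h].copy(), x[i + h:i + 2 * h].copy()
                x[i:i + h], x[i + h:i + 2 * h] = a + b, a - b
            h *= 2
        return x

    def randomized_hadamard(x, rng):
        # One RHT: random sign flip, then the Hadamard transform scaled to stay orthogonal.
        d = x.shape[0]
        signs = rng.choice([-1.0, 1.0], size=d)
        return fwht(x * signs) / np.sqrt(d)

    def two_rht_normalize(x, rng):
        # Compose two independent RHTs, then rescale so each coordinate has unit-variance scale.
        d = x.shape[0]
        y = randomized_hadamard(randomized_hadamard(np.asarray(x, dtype=float), rng), rng)
        return np.sqrt(d) * y / np.linalg.norm(y)   # ||y|| = ||x|| since the composed map is orthogonal

    rng = np.random.default_rng(0)
    x = np.zeros(1024); x[0] = 1.0     # a basis vector, an adversarial input for a single RHT
    z = two_rht_normalize(x, rng)      # each fixed coordinate of z should now look near-N(0,1)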

What carries the argument

The composition of two (or three) independent randomized Hadamard transforms, together with explicit Kolmogorov and Wasserstein bounds on the resulting marginal coordinate distributions.

If this is right

  • DRIVE and QUIC-FL achieve the same asymptotic compression performance as when preceded by uniform random rotations.
  • Any fixed bounded codebook for vector quantization yields the same expected error with three RHTs as with a uniform random rotation, up to an additive term vanishing with dimension.
  • The linear-time moment check lets a system use fewer than three transforms on typical inputs while retaining the worst-case guarantees.
  • Practical implementations of gradient compression, model quantization, and KV-cache compression can replace slow uniform rotations with fast Hadamard-based preprocessing without asymptotic loss; a hedged end-to-end sketch follows this list.
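
To make the first and last bullets concrete, here is a hedged end-to-end sketch: rotate with two RHTs, keep one sign bit per coordinate plus a single scale, and invert the rotation on the receiver using shared randomness. It reuses fwht from the sketch above; the scale choice is a deliberate simplification and is not DRIVE's or QUIC-FL's actual estimator.

    import numpy as np

    # Illustrative one-bit round trip on top of RHT preprocessing (assumes fwht from the earlier sketch).

    def rotate(x, sign_vectors):
        # Apply the composition of RHTs defined by the given sign vectors; each stage is orthogonal.
        y = np.asarray(x, dtype=float)
        for s in sign_vectors:
            y = fwht(y * s) / np.sqrt(y.shape[0])
        return y

    def unrotate(y, sign_vectors):
        # Invert rotate(): the scaled Hadamard is its own inverse and the sign flips undo themselves.
        x = np.array(y, dtype=float)
        for s in reversed(sign_vectors):
            x = s * (fwht(x) / np.sqrt(x.shape[0]))
        return x

    rng = np.random.default_rng(1)
    d = 1 << 12
    x = rng.standard_normal(d) ** 3                              # heavy-tailed test vector
    signs = [rng.choice([-1.0, 1.0], size=d) for _ in range(2)]  # shared randomness (e.g., a common seed)
    y = rotate(x, signs)
    scale = np.linalg.norm(y) / np.sqrt(d)                       # one float sent alongside the sign bits
    x_hat = unrotate(scale * np.sign(y), signs)                  # receiver replays the signs and inverts
    rel_err = np.linalg.norm(x - x_hat) / np.linalg.norm(x)
    print(f"relative L2 error of the illustrative 1-bit round trip: {rel_err:.3f}")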

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same marginal-convergence argument may apply to other high-dimensional tasks that rely on near-Gaussian marginals after orthogonal preprocessing, such as approximate nearest-neighbor search.
  • An adaptive choice of one versus two transforms based on the moment check could reduce average runtime on non-adversarial data without sacrificing the theoretical safety net; a hypothetical sketch of such a check follows this list.
  • Extending the covariance decay analysis to higher-order moments or to joint distributions of small blocks could cover additional quantization schemes that currently lack guarantees.
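
The paper's dynamic check is only described here as an O(d) test on the input's moments; the statistic and thresholds below are hypothetical stand-ins chosen to illustrate the linear cost and the one/two/three-way decision, not the authors' criterion.

    import numpy as np

    def hypothetical_rht_count(x, kurtosis_tol=0.5):
        # Hypothetical O(d) moment check (illustrative only; not the paper's criterion).
        d = x.shape[0]
        z = np.sqrt(d) * x / np.linalg.norm(x)       # normalize so a "nice" input looks ~ N(0,1)
        m4 = np.mean(z ** 4)                          # second moment of z is 1 by construction
        excess_kurtosis = m4 - 3.0
        if abs(excess_kurtosis) < kurtosis_tol and np.max(np.abs(z)) < 4 * np.log(d):
            return 1      # already Gaussian-looking: one RHT may be enough in practice
        elif abs(excess_kurtosis) < 10 * kurtosis_tol:
            return 2      # marginals need fixing: two RHTs, matching the worst-case guarantee
        return 3          # very spiky input or blockwise/VQ use: three RHTs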

Load-bearing premise

The performance guarantees of DRIVE, QUIC-FL, and the VQ analyses depend only on the marginal laws or pairwise covariances of the rotated coordinates and admit direct substitution of the new distance bounds without extra error terms.

What would settle it

Measure the empirical Kolmogorov distance of a fixed coordinate after two RHTs to a standard Gaussian for a sequence of growing dimensions d and verify whether the observed distance remains bounded by C/sqrt(d) for some constant C independent of d.
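
A minimal simulation sketch of that test, reusing two_rht_normalize from the sketch under the core claim and assuming SciPy is available. The dimension grid, sample count, and probe input are arbitrary choices, and the Monte Carlo error of the empirical CDF (about n^{-1/2} for n samples) must be pushed below d^{-1/2} before sqrt(d) times the measured distance says anything reliable about the constant C.

    import numpy as np
    from scipy.stats import kstest

    rng = np.random.default_rng(0)
    n_samples = 2000                          # increase with d so sampling noise stays below d^{-1/2}
    for log_d in (8, 10, 12, 14):
        d = 1 << log_d
        x = np.zeros(d); x[0] = 1.0           # basis vector: a hard input for a single RHT
        coord = np.array([two_rht_normalize(x, rng)[0] for _ in range(n_samples)])
        ks = kstest(coord, "norm").statistic  # empirical Kolmogorov distance to N(0,1)
        print(f"d={d:6d}  KS={ks:.4f}  sqrt(d)*KS={ks * np.sqrt(d):.2f}")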

read the original abstract

Uniform random rotations (URRs) are a common preprocessing step in modern quantization approaches used for gradient compression, inference acceleration, KV-cache compression, model weight quantization, and approximate nearest-neighbor search in vector databases. In practice, URRs are often replaced by randomized Hadamard transforms (RHTs), which preserve orthogonality while admitting fast implementations. The remaining issue is the performance for worst-case inputs. With a URR, each coordinate is individually distributed as a shifted beta distribution, which converges to a Gaussian distribution in high dimensions. Generally, one RHT is not suitable in the worst case, as individual coordinates can be far from these distributions. We show that after composing two RHTs on any $d$-sized input vector, the marginal distribution of every fixed coordinate of the normalized rotated vector is within $O(d^{-1/2})$ of a standard Gaussian both in Kolmogorov distance and in $1$-Wasserstein distance. We then plug these bounds into the analyses of modern compression schemes, namely DRIVE and QUIC-FL, and show that two RHTs achieve performance that asymptotically matches URRs. However, we show that two RHTs may not be sufficient for Vector Quantization (VQ), which often requires weak correlation across fixed-size blocks of coordinates (as opposed to only marginal distribution convergence for single coordinates). We prove that a composition of three RHTs leads to decaying coordinate covariance. This ensures that any fixed, bounded, multi-dimensional VQ codebook optimized for URRs has the same expected error when using three RHTs, up to an additive term that vanishes with the dimension. Finally, because practical inputs are rarely adversarial, we propose a linear-time ${O}(d)$ check on the input's moments to dynamically adapt the number of RHTs used at runtime to improve performance.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that composing two randomized Hadamard transforms (RHTs) on any d-dimensional input vector ensures that every fixed coordinate of the normalized output is within O(d^{-1/2}) of a standard Gaussian in both Kolmogorov and 1-Wasserstein distance. These marginal bounds are substituted into the existing analyses of DRIVE and QUIC-FL to conclude that two RHTs achieve asymptotic performance matching uniform random rotations (URRs). For vector quantization, three RHTs are shown to produce decaying coordinate covariances, ensuring that any fixed bounded VQ codebook optimized for URRs has matching expected error up to a vanishing additive term. A linear-time O(d) moment-based check is proposed to dynamically select the number of RHTs at runtime.

Significance. If the substitution of the marginal bounds is rigorous and the error terms vanish, the work supplies the missing theoretical justification for the widespread practical replacement of expensive uniform random rotations by fast RHTs in gradient compression, KV-cache quantization, and approximate nearest-neighbor search. The explicit O(d^{-1/2}) rates, the separate treatment of marginal versus joint (covariance) requirements, and the dynamic adaptation heuristic are concrete strengths. The derivation of the distributional bounds directly from the algebraic properties of Hadamard matrices and random sign vectors, without fitted parameters, is a positive contribution.

major comments (2)
  1. [analysis of DRIVE/QUIC-FL substitution] The central claim that two RHTs asymptotically match URR performance in DRIVE and QUIC-FL (abstract and the substitution paragraph following the marginal bounds) rests on direct insertion of the O(d^{-1/2}) Kolmogorov/Wasserstein distances. No explicit propagation of the error through the quantization-error or convergence-rate expressions is supplied; it is therefore unclear whether d-dependent factors (e.g., summation over d coordinates, Lipschitz constants of the test functions, or higher-moment terms) produce a vanishing total difference or a non-vanishing residual as d→∞.
  2. [VQ covariance argument] For the VQ result, the statement that three RHTs suffice because they produce “decaying coordinate covariance” (abstract) does not specify the decay rate or the precise dependence on block size. Without this, it is impossible to verify that the additive error term for any fixed bounded codebook indeed vanishes uniformly.
minor comments (2)
  1. [statement of main theorem] The normalization step applied to the rotated vector is referenced but never written explicitly; the precise scaling factor should appear in the statement of the main distributional theorem.
  2. [abstract and introduction] Several O(·) statements in the abstract and introduction omit the dependence of the hidden constants on other parameters (e.g., the number of RHT compositions or the input norm).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough review and constructive feedback on our manuscript. The two major comments identify areas where additional rigor and explicit details will strengthen the presentation. We address each point below and will revise the manuscript to incorporate the necessary clarifications and expansions.

read point-by-point responses
  1. Referee: The central claim that two RHTs asymptotically match URR performance in DRIVE and QUIC-FL (abstract and the substitution paragraph following the marginal bounds) rests on direct insertion of the O(d^{-1/2}) Kolmogorov/Wasserstein distances. No explicit propagation of the error through the quantization-error or convergence-rate expressions is supplied; it is therefore unclear whether d-dependent factors (e.g., summation over d coordinates, Lipschitz constants of the test functions, or higher-moment terms) produce a vanishing total difference or a non-vanishing residual as d→∞.

    Authors: We agree that the current substitution paragraph performs a direct insertion of the marginal bounds without a fully expanded propagation analysis. The manuscript relies on the fact that the existing DRIVE and QUIC-FL error expressions involve expectations of coordinate-wise functions whose sensitivity is controlled (typically Lipschitz with constants that do not introduce non-vanishing d-factors after normalization). However, to make the vanishing of the total difference fully rigorous, we will add an explicit error-propagation lemma in the revision. This lemma will bound the difference in quantization error or convergence rates by a multiple of the marginal distance, confirming that all d-dependent terms (sums over coordinates, Lipschitz factors, moment terms) yield an overall o(1) residual as d→∞. The addition will follow the marginal-bounds section (a hedged sketch of this kind of transfer bound follows these responses). revision: yes

  2. Referee: For the VQ result, the statement that three RHTs suffice because they produce “decaying coordinate covariance” (abstract) does not specify the decay rate or the precise dependence on block size. Without this, it is impossible to verify that the additive error term for any fixed bounded codebook indeed vanishes uniformly.

    Authors: The referee is correct that the abstract and high-level statement do not quantify the covariance decay rate or its dependence on block size. The appendix proof establishes that three RHTs produce coordinate covariances that decay with dimension, which is sufficient for the additive error to vanish for any fixed bounded codebook. In the revision we will extract the explicit decay rate from the proof, state its dependence on block size, and verify that the resulting additive term vanishes uniformly as d→∞ whenever the block size remains fixed (or grows slower than d). These details will be added to the abstract, the VQ theorem statement, and the surrounding discussion. revision: yes
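
For concreteness, the kind of step both responses promise can be hedged as a standard Lipschitz/1-Wasserstein transfer bound plus the appendix's decomposition of the codebook loss; the LaTeX below is our paraphrase, not the authors' lemma.

    % Hedged paraphrase, not the paper's statements.
    % (i) Error propagation: if a per-coordinate error functional f is L-Lipschitz,
    % Kantorovich--Rubinstein duality transfers the marginal bound, and averaging over
    % the d coordinates (as normalized error expressions do) preserves the rate:
    \[
      \Bigl| \tfrac{1}{d}\sum_{i=1}^{d} \mathbb{E}[f(Y_i)] - \mathbb{E}[f(G)] \Bigr|
      \;\le\; \tfrac{1}{d}\sum_{i=1}^{d} L\, W_1(Y_i, G)
      \;\le\; L \cdot O(d^{-1/2}), \qquad G \sim \mathcal{N}(0,1).
    \]
    % (ii) VQ additive term: for a codebook C with \|c\|_2 \le B, the appendix decomposes
    % L_C(v) = \|v\|_2^2 + g_C(v), with g_C(v) = \min_{c\in C}(\|c\|_2^2 - 2\langle v,c\rangle)
    % being 2B-Lipschitz, so a block-level 1-Wasserstein bound transfers directly:
    \[
      \bigl| \mathbb{E}[g_C(V)] - \mathbb{E}[g_C(W)] \bigr| \;\le\; 2B\, W_1(V, W),
    \]
    % which is where the three-RHT covariance-decay rate must enter to make the
    % right-hand side vanish with the dimension.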

Circularity Check

0 steps flagged

No circularity: new algebraic concentration bounds on RHT marginals are derived independently and substituted into prior analyses without reduction to fits or self-definitions.

full rationale

The paper establishes O(d^{-1/2}) Kolmogorov and Wasserstein bounds on each marginal coordinate after two RHTs via direct analysis of Hadamard matrix properties and random sign vectors. These bounds are then inserted into the existing DRIVE and QUIC-FL error analyses to obtain asymptotic equivalence with URRs. The VQ extension similarly introduces a fresh covariance decay result for three RHTs. No quoted step reduces a claimed prediction to a fitted parameter, self-citation chain, or definitional tautology; the core distributional claims rest on external probabilistic facts rather than renaming or smuggling prior results. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper relies on standard mathematical properties of Hadamard matrices and random sign vectors with no new free parameters or invented entities.

axioms (2)
  • standard math Randomized Hadamard transforms are orthogonal and can be computed in O(d log d) time via the fast Walsh-Hadamard transform with random sign flips.
    Invoked as the efficient replacement for uniform random rotations.
  • domain assumption The performance of DRIVE, QUIC-FL, and VQ depends on coordinate marginals or pairwise covariances in a manner that permits direct substitution of distribution-distance bounds.
    Required to transfer the new bounds into existing error analyses.

pith-pipeline@v0.9.0 · 5668 in / 1506 out tokens · 60978 ms · 2026-05-08T14:02:55.455355+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

46 extracted references · 10 canonical work pages · 2 internal anchors

  1. [1] Milton Abramowitz and Irene A. Stegun. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables, volume 55. US Government Printing Office, 1948.
  2. [2] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform. In Proceedings of the Thirty-Eighth Annual ACM Symposium on Theory of Computing, pages 557–563, 2006.
  3. [3] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein generative adversarial networks. In International Conference on Machine Learning, pages 214–223. PMLR, 2017.
  4. [4] Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L. Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. QuaRot: Outlier-free 4-bit inference in rotated LLMs. Advances in Neural Information Processing Systems, 37:100213–100240, 2024.
  5. [5] Ran B. Basat, Yaniv Ben-Itzhak, Michael Mitzenmacher, and Shay Vargaftik. Optimal and approximate adaptive stochastic quantization. Advances in Neural Information Processing Systems, 37:94265–94291, 2024.
  6. [6] Ran Ben-Basat, Yaniv Ben-Itzhak, Gal Mendelson, Michael Mitzenmacher, Amit Portnoy, and Shay Vargaftik. A note on TurboQuant and the earlier DRIVE/EDEN line of work. arXiv preprint arXiv:2604.18555, 2026.
  7. [7] Ran Ben Basat, Yaniv Ben-Itzhak, Michael Mitzenmacher, and Shay Vargaftik. Better than optimal: Improving adaptive stochastic quantization using shared randomness. Proceedings of the ACM on Measurement and Analysis of Computing Systems, 9(3):1–44, 2025.
  8. [8] Ran Ben-Basat, Michael Mitzenmacher, and Shay Vargaftik. How to send a real number using a single bit (and some shared randomness). In ICALP 2021, pages 1–20. Schloss Dagstuhl - Leibniz-Zentrum für Informatik, 2021.
  9. [9] Ran Ben-Basat, Amit Portnoy, Gil Einziger, Yaniv Ben-Itzhak, Michael Mitzenmacher, et al. Accelerating federated learning with quick distributed mean estimation. In Forty-First International Conference on Machine Learning, 2024.
  10. [10] Yaniv Ben-Itzhak, Helen Möllering, Benny Pinkas, Thomas Schneider, Ajith Suresh, Oleksandr Tkachenko, Shay Vargaftik, Christian Weinert, Hossein Yalame, and Avishay Yanai. ScionFL: Efficient and robust secure quantized aggregation. In 2024 IEEE Conference on Secure and Trustworthy Machine Learning (SaTML), pages 490–511. IEEE, 2024.
  11. [11] Vidmantas Kastytis Bentkus and Dainius Dzindzalieta. A tight Gaussian bound for weighted sums of Rademacher random variables. Bernoulli, 21(2):1231–1237, 2015.
  12. [12] Sebastian Caldas, Jakub Konečný, H. Brendan McMahan, and Ameet Talwalkar. Expanding the reach of federated learning by reducing client resource requirements. arXiv preprint arXiv:1812.07210, 2018.
  13. [13] Louis H. Y. Chen. Stein's method of normal approximation: Some recollections and reflections. The Annals of Statistics, 49(4):1850–1863, 2021.
  14. [14] Louis H. Y. Chen, Larry Goldstein, and Qi-Man Shao. Normal Approximation by Stein's Method. Springer Science & Business Media, 2010.
  15. [15] Xiaoqi Chen, Shay Vargaftik, and Ran Ben Basat. When ML training cuts through congestion: Just-in-time gradient compression via packet trimming. In Proceedings of the 23rd ACM Workshop on Hot Topics in Networks, pages 177–185, 2024.
  16. [16] Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems, 35:30318–30332, 2022.
  17. [17] Tim Dettmers, Ruslan Svirschevski, Vage Egiazarian, Denis Kuznedelev, Elias Frantar, Saleh Ashkboos, Alexander Borzunov, Torsten Hoefler, and Dan Alistarh. SpQR: A sparse-quantized representation for near-lossless LLM weight compression. arXiv preprint arXiv:2306.03078, 2023.
  18. [18] Ron Dorfman, Shay Vargaftik, Yaniv Ben-Itzhak, and Kfir Yehuda Levy. DoCoFL: Downlink compression for cross-device federated learning. In International Conference on Machine Learning, pages 8356–8388. PMLR, 2023.
  19. [19] Jianyang Gao, Yutong Gou, Yuexuan Xu, Yongyi Yang, Cheng Long, and Raymond Chi-Wing Wong. Practical and asymptotically optimal quantization of high-dimensional vectors in Euclidean space for approximate nearest neighbor search. Proceedings of the ACM on Management of Data, 3(3):1–26, 2025.
  20. [20] Jianyang Gao and Cheng Long. RaBitQ: Quantizing high-dimensional vectors with a theoretical error bound for approximate nearest neighbor search. Proceedings of the ACM on Management of Data, 2(3):1–27, 2024.
  21. [21] Wenchen Han, Shay Vargaftik, Michael Mitzenmacher, and Ran Ben Basat. DynamIQ: Accelerating gradient synchronization using compressed multi-hop all-reduce. arXiv preprint arXiv:2602.08923, 2026.
  22. [22] Wenchen Han, Shay Vargaftik, Michael Mitzenmacher, Brad Karp, and Ran Ben Basat. Beyond throughput and compression ratios: Towards high end-to-end utility of gradient compression. In Proceedings of the 23rd ACM Workshop on Hot Topics in Networks, pages 186–194, 2024.
  23. [23] Kathy Horadam. Hadamard Matrices and Their Applications. Princeton University Press, 2012.
  24. [24] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492, 2016.
  25. [25] Changhun Lee, Jungyu Jin, Taesu Kim, Hyungjun Kim, and Eunhyeok Park. OWQ: Outlier-aware weight quantization for efficient fine-tuning and inference of large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 13355–13364, 2024.
  26. [26] Minghao Li, Ran Ben Basat, Shay Vargaftik, ChonLam Lao, Kevin Xu, Xinran Tang, Michael Mitzenmacher, and Minlan Yu. THC: Accelerating distributed deep learning using tensor homomorphic compression. In USENIX Symposium on Networked Systems Design and Implementation, 2024.
  27. [27] Vladimir Malinovskii, Andrei Panferov, Ivan Ilin, Han Guo, Peter Richtárik, and Dan Alistarh. Higgs: Pushing the limits of large language model quantization via the linearity theorem. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025.
  28. [28] Or Ordentlich and Yury Polyanskiy. Optimal quantization for matrix multiplication. IEEE Transactions on Information Theory, 2025.
  29. [29] Andrei Panferov, Erik Schultheis, Soroush Tabesh, and Dan Alistarh. Quartet II: Accurate LLM pre-training in NVFP4 by improved unbiased gradient estimation. arXiv preprint arXiv:2601.22813, 2026.
  30. [30] Martin Raič. A multivariate central limit theorem for Lipschitz and smooth test functions. arXiv preprint arXiv:1812.08268, 2018.
  31. [31] Ananda Theertha Suresh, Felix X. Yu, Sanjiv Kumar, and H. Brendan McMahan. Distributed mean estimation with limited communication. In International Conference on Machine Learning, 2017.
  32. [32] Albert Tseng, Jerry Chee, Qingyao Sun, Volodymyr Kuleshov, and Christopher De Sa. QuIP#: Even better LLM quantization with Hadamard incoherence and lattice codebooks. Proceedings of Machine Learning Research, 235:48630, 2024.
  33. [33] Albert Tseng, Qingyao Sun, David Hou, and Christopher De Sa. QTIP: Quantization with trellises and incoherence processing. Advances in Neural Information Processing Systems, 37:59597–59620, 2024.
  34. [34] Ilya Tyurin. New estimates of the convergence rate in the Lyapunov theorem. arXiv preprint arXiv:0912.0726, 2009.
  35. [35] Shay Vargaftik, Ran Ben Basat, Amit Portnoy, Gal Mendelson, Yaniv Ben Itzhak, and Michael Mitzenmacher. EDEN: Communication-efficient and robust distributed mean estimation for federated learning. In International Conference on Machine Learning, pages 21984–22014. PMLR, 2022.
  36. [36] Shay Vargaftik, Ran Ben-Basat, Amit Portnoy, Gal Mendelson, Yaniv Ben-Itzhak, and Michael Mitzenmacher. DRIVE: One-bit distributed mean estimation. Advances in Neural Information Processing Systems, 34:362–377, 2021.
  37. [37] Cédric Villani. Optimal Transport: Old and New, volume 338. Springer, 2009.
  38. [38] Ertza Warraich, Omer Shabtai, Khalid Manaa, Shay Vargaftik, Yonatan Piasetzky, Matty Kadosh, Lalith Suresh, and Muhammad Shahbaz. OptiReduce: Resilient and tail-optimal AllReduce for distributed deep learning in the cloud. In 22nd USENIX Symposium on Networked Systems Design and Implementation (NSDI 25), pages 685–703, 2025.
  39. [39] Yongyi Yang, Jianyang Gao, and Wei Hu. RaanA: A fast, flexible, and data-efficient post-training quantization algorithm. arXiv preprint arXiv:2504.03717, 2025.
  40. [40] Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. TurboQuant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874, 2025.
  41. [41] Zeyu Zhang, Haiying Shen, Shay Vargaftik, Ran Ben Basat, Michael Mitzenmacher, and Minlan Yu. HACK: Homomorphic acceleration via compression of the key-value cache for disaggregated LLM inference. In Proceedings of the ACM SIGCOMM 2025 Conference, pages 1245–1247, 2025.

  42.–46. [42]–[46] Internal anchors rather than external works: fragments of the paper's own appendix (the 2-RHT variance computation, the Lipschitz decomposition of the codebook loss g_C, the covariance-mismatch case analysis, and the operator-norm bound on Σ^{-1/2}) that the reference extractor listed alongside the bibliography.