Inner Product Aware Quantization: Provably Fast, Accurate, and Adaptive Algorithms
Pith reviewed 2026-06-28 22:47 UTC · model grok-4.3
The pith
Quantization methods targeting inner product preservation admit faster algorithms than standard approaches.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Formulating inner product aware quantization objectives leads to a tight connection with adaptive stochastic quantization, enabling the development of provably fast exact and approximate algorithms that inspire practical methods 2-10 times faster than prior state-of-the-art while maintaining quality.
What carries the argument
Inner product aware quantization objectives that approximately preserve inner products with worst-case and average-case inputs, connected to adaptive stochastic quantization.
If this is right
- Provably fast exact and approximate algorithms exist for the inner product preservation objectives.
- Practical algorithms perform well across a variety of workload distributions.
- Standard ASQ can be accelerated by 2-10× using the inspired practical algorithms.
- Adaptive quantization techniques become more efficient and tractable in practical settings.
Where Pith is reading between the lines
- These methods might be combined with other compression techniques like pruning for further gains in neural network deployment.
- Extending the worst-case and average-case analyses to specific application domains could yield tailored variants.
- The unbiased property of the quantization methods may allow for better integration into stochastic gradient descent pipelines.
Load-bearing premise
The formulated objectives capture natural desiderata for approximately preserving inner products with worst-case and average-case inputs.
What would settle it
A benchmark where the proposed practical algorithms do not achieve 2-10 times speedup over prior methods on standard inner product workloads while keeping comparable accuracy would falsify the main practical claim.
Figures
read the original abstract
Quantization is a fundamental tool used to compress datasets, neural network weights, and memory usage in a range of computational tasks. Many downstream applications of vector quantization perform inner products with arbitrary inputs. This motivates the study of inner product aware quantization schemes that approximately preserve inner products with unseen vectors -- in contrast to simply minimizing the mean-squared error. In this work, we formulate objectives that capture natural desiderata and develop adaptive and unbiased quantization methods that approximately preserve inner products with worst-case and average-case inputs. An analysis of these objectives shows a tight connection with the well-studied notion of Adaptive Stochastic Quantization (ASQ). We develop provably fast exact and approximate algorithms for our objectives. Our theoretical results inspire efficient practical algorithms that perform well across a variety of workload distributions. They also lead to practical algorithms for standard ASQ which are 2-10$\times$ faster than prior state-of-the-art methods while maintaining quality. These theoretical and empirical results contribute towards making adaptive quantization techniques more efficient and tractable in practical settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript formulates inner-product-aware quantization objectives that approximately preserve inner products with worst-case and average-case inputs (as opposed to MSE minimization), establishes a tight connection between these objectives and Adaptive Stochastic Quantization (ASQ), develops provably fast exact and approximate algorithms for the objectives, and derives practical algorithms that achieve 2-10× speedups over prior state-of-the-art methods for standard ASQ while preserving quality across varied workload distributions.
Significance. If the claimed theoretical guarantees and empirical speedups are substantiated, the work would meaningfully advance practical quantization for inner-product-heavy tasks in machine learning. The explicit connection to ASQ together with the provision of both exact/approximate algorithms and faster practical implementations for the standard ASQ setting constitute a clear strength; the emphasis on adaptive, unbiased methods with worst-case and average-case analysis is a useful framing.
minor comments (3)
- [Abstract] The abstract states that the analysis 'shows a tight connection' to ASQ; a brief sentence clarifying whether this is an equivalence, a reduction, or an approximation under specific conditions would improve readability.
- [Abstract] The claim of 'provably fast exact and approximate algorithms' would benefit from a one-sentence statement of the key complexity results (e.g., time or space bounds) already in the abstract or introduction.
- Figure and table captions should explicitly state the workload distributions and baseline implementations used to obtain the reported 2-10× speedups so that the empirical claims are immediately verifiable.
Simulated Author's Rebuttal
We thank the referee for the thorough summary and positive evaluation of the manuscript, including the recognition of the theoretical connections to ASQ and the practical speedups. The recommendation for minor revision is appreciated. No specific major comments were raised in the report.
Circularity Check
No significant circularity detected
full rationale
The provided abstract and description formulate inner-product-preserving objectives, note a connection to ASQ, and claim provably fast algorithms with empirical speedups. No equations, derivations, or self-referential steps are visible in the text. No load-bearing claim reduces by construction to a fit, self-definition, or self-citation chain. The argument structure (formulation to analysis to algorithms) appears self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Qsgd: Communication-efficient sgd via gradient quantization and encoding
Dan Alistarh, Demjan Grubic, Jerry Li, Ryota Tomioka, and Milan Vojnovic. Qsgd: Communication-efficient sgd via gradient quantization and encoding. Advances in neural information processing systems , 30, 2017
2017
-
[2]
Geometric applications of a matrix searching algorithm
Alok Aggarwal, Maria Klawe, Shlomo Moran, Peter Shor, and Robert Wilber. Geometric applications of a matrix searching algorithm. In Proceedings of the second annual symposium on Computational geometry , pages 285--292, 1986
1986
-
[3]
Finding a minimum weight k-link path in graphs with monge property and applications
Alok Aggarwal, Baruch Schieber, and Takashi Tokuyama. Finding a minimum weight k-link path in graphs with monge property and applications. In Proceedings of the ninth annual symposium on Computational geometry , pages 189--197, 1993
1993
-
[4]
Optimal and approximate adaptive stochastic quantization
Ran Ben-Basat, Yaniv Ben-Itzhak, Michael Mitzenmacher, and Shay Vargaftik. Optimal and approximate adaptive stochastic quantization. Advances in Neural Information Processing Systems , 37:94265--94291, 2024
2024
-
[5]
Better than optimal: Improving adaptive stochastic quantization using shared randomness
Ran Ben Basat, Yaniv Ben-Itzhak, Michael Mitzenmacher, and Shay Vargaftik. Better than optimal: Improving adaptive stochastic quantization using shared randomness. Proceedings of the ACM on Measurement and Analysis of Computing Systems , 9(3):1--44, 2025
2025
-
[6]
Post training 4-bit quantization of convolutional networks for rapid-deployment
Ron Banner, Yury Nahshan, and Daniel Soudry. Post training 4-bit quantization of convolutional networks for rapid-deployment. Advances in neural information processing systems , 32, 2019
2019
-
[7]
Riley Carlson, John Bauer, and Christopher D. Manning. A new pair of gloves, 2025
2025
-
[8]
Neural gradients are near-lognormal: improved quantized and sparse training
Brian Chmiel, Liad Ben-Uri, Moran Shkolnik, Elad Hoffer, Ron Banner, and Daniel Soudry. Neural gradients are near-lognormal: improved quantized and sparse training. arXiv preprint arXiv:2006.08173 , 2020
-
[9]
Optq: Accurate quantization for generative pre-trained transformers
E Frantar, S Ashkboos, T Hoefler, and D Alistarh. Optq: Accurate quantization for generative pre-trained transformers. 2023. In URL https://openreview. net/forum , 2023
2023
-
[10]
Optimal algorithms for approximate clustering
Tom\' a s Feder and Daniel Greene. Optimal algorithms for approximate clustering. In Proceedings of the Twentieth Annual ACM Symposium on Theory of Computing , STOC '88, page 434–444, New York, NY, USA, 1988. Association for Computing Machinery
1988
-
[11]
Don’t waste your bits! squeeze activations and gradients for deep neural networks via tinyscript
Fangcheng Fu, Yuzheng Hu, Yihan He, Jiawei Jiang, Yingxia Shao, Ce Zhang, and Bin Cui. Don’t waste your bits! squeeze activations and gradients for deep neural networks via tinyscript. In International Conference on Machine Learning , pages 3304--3314. PMLR, 2020
2020
-
[12]
Generalized selection and ranking: sorted matrices
Greg N Frederickson and Donald B Johnson. Generalized selection and ranking: sorted matrices. SIAM Journal on computing , 13(1):14--30, 1984
1984
-
[13]
Adaptive gradient quantization for data-parallel sgd
Fartash Faghri, Iman Tabrizian, Ilia Markov, Dan Alistarh, Daniel M Roy, and Ali Ramezani-Kebrya. Adaptive gradient quantization for data-parallel sgd. Advances in neural information processing systems , 33:3174--3185, 2020
2020
-
[14]
Practical and asymptotically optimal quantization of high-dimensional vectors in euclidean space for approximate nearest neighbor search
Jianyang Gao, Yutong Gou, Yuexuan Xu, Yongyi Yang, Cheng Long, and Raymond Chi-Wing Wong. Practical and asymptotically optimal quantization of high-dimensional vectors in euclidean space for approximate nearest neighbor search. Proceedings of the ACM on Management of Data , 3(3):1--26, 2025
2025
-
[15]
Optimized product quantization
Tiezheng Ge, Kaiming He, Qifa Ke, and Jian Sun. Optimized product quantization. IEEE transactions on pattern analysis and machine intelligence , 36(4):744--755, 2013
2013
-
[16]
Some simplified np-complete problems
Michael R Garey, David S Johnson, and Larry Stockmeyer. Some simplified np-complete problems. In Proceedings of the sixth annual ACM symposium on Theory of computing , pages 47--63, 1974
1974
-
[17]
Fast Exact k-Means, k-Medians and Bregman Divergence Clustering in 1D
Allan Gr nlund, Kasper Green Larsen, Alexander Mathiasen, Jesper Sindahl Nielsen, Stefan Schneider, and Mingzhou Song. Fast exact k-means, k-medians and bregman divergence clustering in 1d. arXiv preprint arXiv:1701.07204 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[18]
Clustering to minimize the maximum intercluster distance
Teofilo F Gonzalez. Clustering to minimize the maximum intercluster distance. Theoretical computer science , 38:293--306, 1985
1985
-
[19]
Horn and Charles R
Roger A. Horn and Charles R. Johnson. Matrix Analysis . Cambridge University Press, 2nd edition, 2012
2012
-
[20]
A frustratingly easy post-training quantization scheme for llms
Yongkweon Jeon, Chungman Lee, Kyungphil Park, and Ho-young Kim. A frustratingly easy post-training quantization scheme for llms. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages 14446--14461, 2023
2023
-
[21]
Selection in x+ y and matrices with sorted rows and columns
Andranik Mirzaian and Eshrat Arjomandi. Selection in x+ y and matrices with sorted rows and columns. Information processing letters , 20(1):13--17, 1985
1985
-
[22]
Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. Mixed precision training. arXiv preprint arXiv:1710.03740 , 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[23]
A survey of product quantization
Yusuke Matsui, Yusuke Uchida, Herve Jegou, and Shin'ichi Satoh. A survey of product quantization. ITE Transactions on Media Technology and Applications , 6(1):2--10, 2018
2018
-
[24]
Up or down? adaptive rounding for post-training quantization
Markus Nagel, Rana Ali Amjad, Mart Van Baalen, Christos Louizos, and Tijmen Blankevoort. Up or down? adaptive rounding for post-training quantization. In International conference on machine learning , pages 7197--7206. PMLR, 2020
2020
-
[25]
Nuqsgd: Provably communication-efficient data-parallel sgd via nonuniform quantization
Ali Ramezani-Kebrya, Fartash Faghri, Ilya Markov, Vitalii Aksenov, Dan Alistarh, and Daniel M Roy. Nuqsgd: Provably communication-efficient data-parallel sgd via nonuniform quantization. Journal of Machine Learning Research , 22(114):1--43, 2021
2021
-
[26]
Computing a minimum weightk-link path in graphs with the concave monge property
Baruch Schieber. Computing a minimum weightk-link path in graphs with the concave monge property. Journal of Algorithms , 29(2):204--222, 1998
1998
-
[27]
Flexgen: High-throughput generative inference of large language models with a single gpu
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher R \'e , Ion Stoica, and Ce Zhang. Flexgen: High-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning , pages 31094--31116. PMLR, 2023
2023
-
[28]
Bayesian neural networks become heavier-tailed with depth
Mariia Vladimirova, Julyan Arbel, and Pablo Mesejo. Bayesian neural networks become heavier-tailed with depth. In NeurIPS 2018-Thirty-second Conference on Neural Information Processing Systems , pages 1--7, 2018
2018
-
[29]
The concave least-weight subsequence problem revisited
Robert Wilber. The concave least-weight subsequence problem revisited. Journal of Algorithms , 9(3):418--425, 1988
1988
-
[30]
Optimal quantization by matrix searching
Xiaolin Wu. Optimal quantization by matrix searching. Journal of algorithms , 12(4):663--673, 1991
1991
-
[31]
TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate
Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. Turboquant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874 , 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[32]
Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning
Hantian Zhang, Jerry Li, Kaan Kara, Dan Alistarh, Ji Liu, and Ce Zhang. Zipml: Training linear models with end-to-end low precision, and a little bit of deep learning. In International Conference on Machine Learning , pages 4035--4043. PMLR, 2017
2017
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.