Nearly Optimal Attention Coresets
Pith reviewed 2026-05-08 06:22 UTC · model grok-4.3
The pith
Attention outputs for bounded-norm queries can be approximated by a coreset whose size scales as the square root of the dimension times an exponential in the query-norm bound, divided by the target error.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For any set of unit-norm keys and values in ℝ^d there exists a subset of size at most O(√d · e^{ρ+o(ρ)}/ε) such that the attention vector computed on the subset differs by at most ε in Euclidean norm from the attention vector computed on the full set, and the guarantee holds simultaneously for every query whose norm is at most ρ.
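Restated in the abstract's notation, the guarantee and the size bound read:
\[
\left\| \operatorname{Attn}(q,K,V) - \operatorname{Attn}(q,K',V') \right\| \le \varepsilon
\quad \text{for all } \|q\| \le \rho,
\qquad
|K'| = |V'| \le O\!\left( \frac{\sqrt{d}\; e^{\rho + o(\rho)}}{\varepsilon} \right).
\]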
What carries the argument
The attention coreset: a small subset of the original key-value pairs chosen so that the softmax-weighted average of values remains close for all queries up to a given norm bound.
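To make the object concrete, here is a minimal numpy sketch of the quantity being bounded. Two hedges: the uniform random subset is a placeholder selection rule, not the paper's construction, and the unscaled exp(⟨k, q⟩) softmax is an assumption, since the review does not pin down the attention variant (the referee's first minor comment asks for exactly this definition).

```python
import numpy as np

def attn(q, K, V):
    """Softmax-weighted average of values: rows of K are keys, rows of V are values."""
    logits = K @ q                      # inner products <k_i, q>; no 1/sqrt(d) scaling assumed
    w = np.exp(logits - logits.max())   # numerically stable softmax weights
    return (w / w.sum()) @ V            # weighted average of the value rows

rng = np.random.default_rng(0)
n, d, rho = 10_000, 64, 1.0
K = rng.normal(size=(n, d)); K /= np.linalg.norm(K, axis=1, keepdims=True)  # unit-norm keys
V = rng.normal(size=(n, d)); V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-norm values

idx = rng.choice(n, size=512, replace=False)          # placeholder: uniform sampling, not the paper's rule
q = rng.normal(size=d); q *= rho / np.linalg.norm(q)  # a rho-bounded query

err = np.linalg.norm(attn(q, K, V) - attn(q, K[idx], V[idx]))
print(f"||Attn(q,K,V) - Attn(q,K',V')|| = {err:.4f}")
```

The coreset guarantee asks that this error stay below ε not for one sampled q but simultaneously for every q with ‖q‖ ≤ ρ.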
If this is right
- The attention step can be performed in space independent of the total number of original keys.
- The coreset size depends only on the dimension d, the query-norm bound ρ, and the error ε (illustrated numerically after this list).
- The construction improves the best previously known coreset size for the same guarantee.
- The matching lower bound of Ω(√d e^ρ/ε) shows the size is nearly optimal.
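A back-of-the-envelope rendering of the size bound; the leading constant and the o(ρ) correction are not specified in the claim, so c = 1 and a dropped o(ρ) below are assumptions made purely for illustration.

```python
import math

def coreset_size_bound(d: int, rho: float, eps: float, c: float = 1.0) -> float:
    # Shape of the upper bound O(sqrt(d) * e^{rho + o(rho)} / eps); the constant c
    # and the omitted o(rho) term are illustrative assumptions, not paper values.
    return c * math.sqrt(d) * math.exp(rho) / eps

# The bound never mentions n: the same size suffices for 1e5 or 1e9 original keys.
for n in (10**5, 10**7, 10**9):
    print(f"n = {n:>13,}: size bound ~ {coreset_size_bound(d=4096, rho=2.0, eps=0.05):,.0f}")
```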
Where Pith is reading between the lines
- The same selection technique might yield small coresets for other kernel similarity computations that use softmax.
- In streaming settings one could maintain such a coreset incrementally as new keys arrive.
- Relaxing the exact unit-norm requirement on keys and values would make the result applicable to typical learned embeddings.
Load-bearing premise
Keys and values must all have exactly unit norm, while every query has norm bounded by ρ.
What would settle it
An explicit collection of unit vectors in dimension d for which every subset smaller than the stated size fails to ε-approximate attention for some ρ-bounded query would disprove the upper bound (a minimal refutation harness is sketched below).
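A sketch of the refutation direction, with attn repeated from the earlier snippet so this one is self-contained. Random query sampling only lower-bounds the supremum over ρ-bounded queries, so this can exhibit a failing query and thereby refute a candidate coreset, but it can never certify one.

```python
import numpy as np

def attn(q, K, V):
    w = np.exp(K @ q - (K @ q).max())
    return (w / w.sum()) @ V

def worst_observed_error(K, V, idx, rho, trials=1_000, seed=0):
    """Largest ||Attn(q,K,V) - Attn(q,K[idx],V[idx])|| over sampled queries of norm rho.
    Exceeding eps refutes the candidate coreset; staying below eps proves nothing,
    since only finitely many queries are tried."""
    rng = np.random.default_rng(seed)
    worst = 0.0
    for _ in range(trials):
        q = rng.normal(size=K.shape[1])
        q *= rho / np.linalg.norm(q)
        worst = max(worst, np.linalg.norm(attn(q, K, V) - attn(q, K[idx], V[idx])))
    return worst
```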
Original abstract
We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $(K,V)$ in $\mathbb{R}^d$, there exists a subset $(K',V')$ of size at most $O(\sqrt{d}\, e^{\rho+o(\rho)}/\varepsilon)$ such that \[ \left\| \operatorname{Attn}(q,K,V)- \operatorname{Attn}(q,K',V') \right\| \le \varepsilon \] simultaneously for all queries whose norm is bounded by $\rho$. This outperforms the best known results for this problem. We also offer an improved lower bound showing that $\varepsilon$-coresets must have size $\Omega(\sqrt{d}\, e^{\rho}/\varepsilon)$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proves the existence of nearly optimal ε-coresets for the attention mechanism. For any set of unit-norm keys and values (K, V) in R^d, there exists a subset (K', V') of size O(√d e^{ρ + o(ρ)} / ε) such that ||Attn(q, K, V) - Attn(q, K', V')|| ≤ ε holds simultaneously for all queries q with ||q|| ≤ ρ. The paper also establishes a matching lower bound of Ω(√d e^ρ / ε).
Significance. If correct, the result is significant for data structures and theoretical computer science because it supplies nearly tight bounds on attention coreset size, improving prior upper bounds while matching the lower bound up to the o(ρ) factor. The explicit existence proof for uniform approximation over a bounded query set is a useful primitive for streaming and small-space attention approximation.
major comments (1)
- [Main theorem / upper-bound proof] The o(ρ) term in the exponent must be shown to be independent of d and ε (or explicitly bounded) so that the claimed near-optimality relative to the Ω(√d e^ρ / ε) lower bound is rigorous; without this, the gap between the upper and lower bounds could be larger than stated.
minor comments (2)
- [Preliminaries] Define the precise attention function (including any scaling or softmax variant) at the first use rather than assuming standard notation.
- [Introduction] Add a short paragraph contrasting the new bound with the best previous upper bound cited in the introduction.
Simulated Author's Rebuttal
We thank the referee for the careful reading, positive assessment, and constructive comment on the main theorem. We address the point below and will incorporate a clarification in the revision.
Point-by-point responses
Referee: [Main theorem / upper-bound proof] The o(ρ) term in the exponent must be shown to be independent of d and ε (or explicitly bounded) so that the claimed near-optimality relative to the Ω(√d e^ρ / ε) lower bound is rigorous; without this, the gap between the upper and lower bounds could be larger than stated.
Authors: We agree that explicit independence strengthens the near-optimality claim. In the proof of Theorem 3.1 (the existence upper bound), the o(ρ) term originates from the asymptotic analysis of the ε-net covering number on the sphere for queries of norm ≤ ρ, combined with a uniform approximation of the softmax via truncated exponential sums; the rate of the little-o is controlled solely by the choice of auxiliary approximation parameters (e.g., the truncation threshold and net fineness) and does not depend on the dimension d or the error ε. Consequently the upper bound is of the form O(√d · e^{ρ(1+o(1))} / ε), where the o(1) → 0 as ρ → ∞ uniformly in d and ε. We will revise the statement of the main theorem and add a short remark after the proof explicitly recording this independence (and, if space permits, a concrete bound such as O(ρ/log ρ) for the additive term). This makes the comparison to the Ω(√d e^ρ / ε) lower bound fully rigorous. Revision: yes.
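For orientation on the net argument invoked here, a standard volumetric covering estimate for the radius-ρ ball at fineness δ is the following; whether this exact estimate is the one used in Theorem 3.1 is our assumption, not a quote from the paper.
\[
N\!\left( B_\rho^d, \delta \right) \;\le\; \left( 1 + \frac{2\rho}{\delta} \right)^{d},
\]
so a union bound over the net contributes only d·log(1 + 2ρ/δ) to exponents in concentration arguments, consistent with the authors' statement that the choice of net fineness affects the coreset exponent only through the o(ρ) term.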
Circularity Check
No significant circularity
Full rationale
The paper presents a first-principles existence proof establishing that for any unit-norm keys and values in R^d, there is a coreset of size O(√d e^{ρ+o(ρ)}/ε) that ε-approximates attention uniformly over all queries with ||q|| ≤ ρ. This is paired with an independent lower bound of Ω(√d e^ρ/ε). The derivation relies on explicit mathematical arguments and stated assumptions rather than self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations. No step reduces the claimed result to its own inputs by construction; the result is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Existence of a small subset that preserves the attention output for all bounded-norm queries