pith. machine review for the scientific record.

arxiv: 2605.05602 · v1 · submitted 2026-05-07 · 💻 cs.DS · cs.AI


Nearly Optimal Attention Coresets

Alexandr Andoni, Edo Liberty, Eldar Kleiner


Pith reviewed 2026-05-08 06:22 UTC · model grok-4.3

classification 💻 cs.DS cs.AI
keywords attention mechanism · coresets · approximation algorithms · data structures · softmax · machine learning

The pith

Attention outputs for bounded queries can be approximated by a coreset whose size scales as the square root of dimension times an exponential in the norm bound.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that any collection of unit-length keys and values admits a small subset that reproduces the full attention output within additive error ε for every query whose norm is at most ρ. The subset needs only O(√d e^{ρ+o(ρ)}/ε) key-value pairs. A nearly matching lower bound proves that no substantially smaller subset works in the worst case. This would let the attention step run in space independent of the original input length while still handling an entire family of possible queries at once.

Core claim

For any set of unit-norm keys and values in R^d there exists a subset of size at most O(√d e^{ρ+o(ρ)}/ε) such that the attention vector computed on the subset differs by at most ε in Euclidean norm from the attention vector computed on the full set, and the guarantee holds simultaneously for every query whose norm is at most ρ.
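Stated operationally, the guarantee compares full and subset attention at matching queries. A minimal NumPy sketch of that comparison, using a uniform random subset purely as a stand-in for the paper's (existence-only) construction:

```python
import numpy as np

def attn(q, K, V):
    """Attn(q, K, V): softmax(<k_i, q>)-weighted average of the value vectors."""
    s = K @ q                       # inner products <k_i, q>
    w = np.exp(s - s.max())         # numerically stable softmax weights
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
n, d, rho = 5000, 16, 1.0
K = rng.normal(size=(n, d)); K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.normal(size=(n, d)); V /= np.linalg.norm(V, axis=1, keepdims=True)

# Illustration only: a uniform random subset is NOT the paper's construction.
m = 400
idx = rng.choice(n, size=m, replace=False)
q = rng.normal(size=d); q *= rho / np.linalg.norm(q)   # a query with ||q|| = rho
err = np.linalg.norm(attn(q, K, V) - attn(q, K[idx], V[idx]))
print(err)
```

A random subset gives no uniform guarantee over all ρ-bounded queries; the content of the theorem is that some subset of the stated size achieves error at most ε for every such query simultaneously.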

What carries the argument

The attention coreset: a small subset of the original key-value pairs chosen so that the softmax-weighted average of values remains close for all queries up to a given norm bound.

If this is right

  • The attention step can be performed in space independent of the total number of original keys.
  • The coreset size depends only on dimension, the query-norm bound rho, and the error epsilon.
  • The construction improves the best previously known coreset size for the same guarantee.
  • The matching lower bound of Ω(√d e^ρ/ε) shows the size is nearly optimal.
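The size bound in the last two bullets can be made concrete by plugging in numbers; the sketch below ignores the constant hidden by O(·) and the o(ρ) term in the exponent, so the figures are orders of magnitude only:

```python
import math

def coreset_size_bound(d, rho, eps):
    # Order-of-magnitude only: ignores the O(.) constant and the o(rho) term.
    return math.sqrt(d) * math.exp(rho) / eps

# Independent of the number of keys n: the same budget serves n = 10^4 or 10^9.
for rho in (1.0, 2.0, 4.0):
    print(rho, math.ceil(coreset_size_bound(d=64, rho=rho, eps=0.1)))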

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection technique might yield small coresets for other kernel similarity computations that use softmax.
  • In streaming settings one could maintain such a coreset incrementally as new keys arrive.
  • Relaxing the exact unit-norm requirement on keys and values would make the result applicable to typical learned embeddings.
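The streaming bullet above is speculative. One natural shape for it is a buffer-and-rebuild loop; `build_coreset` below is a hypothetical placeholder, since the paper proves existence rather than giving an algorithm, and repeated rebuilds would compound the approximation error, which this sketch ignores:

```python
class StreamingAttentionCoreset:
    """Sketch of incremental coreset maintenance (editorial extension, not the paper).

    build_coreset(pairs, m) is a hypothetical routine returning an m-pair
    attention coreset; the paper's result is existential, so no concrete
    builder is given there.
    """

    def __init__(self, capacity, build_coreset):
        self.capacity = capacity
        self.build = build_coreset
        self.pairs = []

    def insert(self, key, value):
        self.pairs.append((key, value))
        if len(self.pairs) > 2 * self.capacity:
            # Rebuild once the buffer doubles, shrinking it back to capacity.
            self.pairs = self.build(self.pairs, self.capacity)

# Usage with a trivial stand-in builder that keeps the first m pairs:
cs = StreamingAttentionCoreset(10, lambda pairs, m: pairs[:m])
for i in range(100):
    cs.insert(i, -i)
print(len(cs.pairs))
```

The buffer never exceeds twice the capacity plus one pending insertion, so space stays independent of the stream length.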

Load-bearing premise

Keys and values must all have exactly unit norm while every query has norm bounded by rho.
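Because the guarantee is void outside this regime, the precondition is worth checking explicitly before invoking the result; a minimal sketch:

```python
import numpy as np

def premise_holds(K, V, q, rho, tol=1e-8):
    """Check the result's precondition: unit-norm keys and values, ||q|| <= rho."""
    unit_keys = np.allclose(np.linalg.norm(K, axis=1), 1.0, atol=tol)
    unit_vals = np.allclose(np.linalg.norm(V, axis=1), 1.0, atol=tol)
    return unit_keys and unit_vals and np.linalg.norm(q) <= rho + tol

K = np.eye(3)   # three unit-norm keys
V = np.eye(3)   # three unit-norm values
print(premise_holds(K, V, np.array([0.3, 0.4, 0.0]), rho=1.0))  # True: ||q|| = 0.5
```

Rescaling keys or values to force unit norm changes the attention output itself, so this is a precondition check rather than a repair.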

What would settle it

An explicit collection of unit-norm keys and values in dimension d for which every subset of at most the stated size fails to approximate attention within epsilon for some rho-bounded query would disprove the upper bound.

read the original abstract

We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $(K,V)$ in $\mathbb{R}^d$, there exists a subset $(K',V')$ of size at most $O({\sqrt{d} e^{\rho+o(\rho)}/\varepsilon})$ such that \[ \left\| \operatorname{Attn}(q,K,V)- \operatorname{Attn}(q,K',V') \right\| \le \varepsilon \] simultaneously for all queries whose norm is bounded by $\rho$. This outperforms the best known results for this problem. We also offer an improved lower bound showing that $\varepsilon$-coresets must have size $\Omega({\sqrt{d} e^{\rho}/\epsilon})$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proves the existence of nearly optimal ε-coresets for the attention mechanism. For any set of unit-norm keys and values (K, V) in R^d, there exists a subset (K', V') of size O(√d e^{ρ + o(ρ)} / ε) such that ||Attn(q, K, V) - Attn(q, K', V')|| ≤ ε holds simultaneously for all queries q with ||q|| ≤ ρ. The paper also establishes a matching lower bound of Ω(√d e^ρ / ε).

Significance. If correct, the result is significant for data structures and theoretical CS because it supplies nearly tight bounds on attention coreset size, improving prior upper bounds while matching the lower bound up to the o(ρ) factor. The explicit existence proof for uniform approximation over a bounded query set is a useful primitive for streaming and small-space attention approximations.

major comments (1)
  1. [Main theorem / upper-bound proof] The o(ρ) term in the exponent of the existence upper bound must be shown to be independent of d and ε (or explicitly bounded) so that the claimed near-optimality relative to the Ω(√d e^ρ / ε) lower bound is rigorous; without this, the gap between the upper and lower bounds could be larger than stated.
minor comments (2)
  1. [Preliminaries] Define the precise attention function (including any scaling or softmax variant) at the first use rather than assuming standard notation.
  2. [Introduction] Add a short paragraph contrasting the new bound with the best previous upper bound cited in the introduction.
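For reference, the definition the first minor comment asks for, in the standard unscaled form consistent with the abstract's notation (the paper may instead fold a 1/√d temperature into the keys), would read:

```latex
\operatorname{Attn}(q, K, V)
\;=\; \sum_{i=1}^{n} \frac{e^{\langle q, k_i \rangle}}{\sum_{j=1}^{n} e^{\langle q, k_j \rangle}}\, v_i
\;=\; \operatorname{softmax}(Kq)^{\top} V .
```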

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading, positive assessment, and constructive comment on the main theorem. We address the point below and will incorporate a clarification in the revision.

read point-by-point responses
  1. Referee: [Main theorem / upper-bound proof] The o(ρ) term in the exponent of the existence upper bound must be shown to be independent of d and ε (or explicitly bounded) so that the claimed near-optimality relative to the Ω(√d e^ρ / ε) lower bound is rigorous; without this, the gap between the upper and lower bounds could be larger than stated.

    Authors: We agree that explicit independence strengthens the near-optimality claim. In the proof of Theorem 3.1 (the existence upper bound), the o(ρ) term originates from the asymptotic analysis of the ε-net covering number on the sphere for queries of norm ≤ρ combined with a uniform approximation of the softmax via truncated exponential sums; the rate of the little-o is controlled solely by the choice of auxiliary approximation parameters (e.g., the truncation threshold and net fineness) and does not depend on dimension d or error ε. Consequently the upper bound is of the form O(√d ⋅ e^{ρ(1+o(1))} / ε) where the o(1)→0 as ρ→∞ uniformly in d and ε. We will revise the statement of the main theorem and add a short remark after the proof explicitly recording this independence (and, if space permits, a concrete bound such as O(ρ/log ρ) for the additive term). This makes the comparison to the Ω(√d e^ρ / ε) lower bound fully rigorous. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents a first-principles existence proof establishing that for any unit-norm keys and values in R^d, there is a coreset of size O(√d e^{ρ+o(ρ)}/ε) that ε-approximates attention uniformly over all queries with ||q|| ≤ ρ. This is paired with an independent lower bound of Ω(√d e^ρ/ε). The derivation relies on explicit mathematical arguments and stated assumptions rather than self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations. No step reduces the claimed result to its own inputs by construction; the result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The result rests on standard mathematical existence arguments (likely probabilistic method or greedy selection) that are common in approximation-algorithm literature; no new entities or fitted parameters are introduced in the abstract.

axioms (1)
  • standard math Existence of a small subset that preserves attention output for all bounded-norm queries
    Typical for coreset existence proofs in high-dimensional geometry and approximation algorithms.

pith-pipeline@v0.9.0 · 5437 in / 1251 out tokens · 65713 ms · 2026-05-08T06:22:36.042210+00:00 · methodology

discussion (0)

