pith. machine review for the scientific record.

arxiv: 2605.05602 · v1 · submitted 2026-05-07 · 💻 cs.DS · cs.AI


Nearly Optimal Attention Coresets

Alexandr Andoni, Edo Liberty, Eldar Kleiner


Pith reviewed 2026-05-08 06:22 UTC · model grok-4.3

classification 💻 cs.DS cs.AI
keywords attention mechanism · coresets · approximation algorithms · data structures · softmax · machine learning

The pith

Attention outputs for bounded queries can be approximated by a coreset whose size scales as the square root of dimension times an exponential in the norm bound.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that any collection of unit-length keys and values admits a small subset that reproduces the full attention output within additive error ε for every query whose norm is at most ρ. The subset needs only O(√d e^{ρ+o(ρ)}/ε) key-value pairs. A nearly matching lower bound proves that no substantially smaller subset works in the worst case. This would let the attention step run in space independent of the original input length while still handling an entire family of possible queries at once.

Core claim

For any set of unit-norm keys and values in R^d there exists a subset of size at most O(√d e^{ρ+o(ρ)}/ε) such that the attention vector computed on the subset differs by at most ε in Euclidean norm from the attention vector computed on the full set, and the guarantee holds simultaneously for every query whose norm is at most ρ.
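Stated operationally, the guarantee compares full and subset attention at matching queries. A minimal NumPy sketch of that comparison, using a uniform random subset purely as a stand-in for the paper's (existence-only) construction:

```python
import numpy as np

def attn(q, K, V):
    """Attn(q, K, V): softmax(<k_i, q>)-weighted average of the value vectors."""
    s = K @ q                       # inner products <k_i, q>
    w = np.exp(s - s.max())         # numerically stable softmax weights
    w /= w.sum()
    return w @ V

rng = np.random.default_rng(0)
n, d, rho = 5000, 16, 1.0
K = rng.normal(size=(n, d)); K /= np.linalg.norm(K, axis=1, keepdims=True)
V = rng.normal(size=(n, d)); V /= np.linalg.norm(V, axis=1, keepdims=True)

# Illustration only: a uniform random subset is NOT the paper's construction.
m = 400
idx = rng.choice(n, size=m, replace=False)
q = rng.normal(size=d); q *= rho / np.linalg.norm(q)   # a query with ||q|| = rho
err = np.linalg.norm(attn(q, K, V) - attn(q, K[idx], V[idx]))
print(err)
```

A random subset gives no uniform guarantee over all ρ-bounded queries; the content of the theorem is that some subset of the stated size achieves error at most ε for every such query simultaneously.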

What carries the argument

The attention coreset: a small subset of the original key-value pairs chosen so that the softmax-weighted average of values remains close for all queries up to a given norm bound.

If this is right

  • The attention step can be performed in space independent of the total number of original keys.
  • The coreset size depends only on dimension, the query-norm bound rho, and the error epsilon.
  • The construction improves the best previously known coreset size for the same guarantee.
  • The matching lower bound of Ω(√d e^ρ/ε) shows the size is nearly optimal.
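The size bound in the last two bullets can be made concrete by plugging in numbers; the sketch below ignores the constant hidden by O(·) and the o(ρ) term in the exponent, so the figures are orders of magnitude only:

```python
import math

def coreset_size_bound(d, rho, eps):
    # Order-of-magnitude only: ignores the O(.) constant and the o(rho) term.
    return math.sqrt(d) * math.exp(rho) / eps

# Independent of the number of keys n: the same budget serves n = 10^4 or 10^9.
for rho in (1.0, 2.0, 4.0):
    print(rho, math.ceil(coreset_size_bound(d=64, rho=rho, eps=0.1)))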

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection technique might yield small coresets for other kernel similarity computations that use softmax.
  • In streaming settings one could maintain such a coreset incrementally as new keys arrive.
  • Relaxing the exact unit-norm requirement on keys and values would make the result applicable to typical learned embeddings.
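The streaming bullet above is speculative. One natural shape for it is a buffer-and-rebuild loop; `build_coreset` below is a hypothetical placeholder, since the paper proves existence rather than giving an algorithm, and repeated rebuilds would compound the approximation error, which this sketch ignores:

```python
class StreamingAttentionCoreset:
    """Sketch of incremental coreset maintenance (editorial extension, not the paper).

    build_coreset(pairs, m) is a hypothetical routine returning an m-pair
    attention coreset; the paper's result is existential, so no concrete
    builder is given there.
    """

    def __init__(self, capacity, build_coreset):
        self.capacity = capacity
        self.build = build_coreset
        self.pairs = []

    def insert(self, key, value):
        self.pairs.append((key, value))
        if len(self.pairs) > 2 * self.capacity:
            # Rebuild once the buffer doubles, shrinking it back to capacity.
            self.pairs = self.build(self.pairs, self.capacity)

# Usage with a trivial stand-in builder that keeps the first m pairs:
cs = StreamingAttentionCoreset(10, lambda pairs, m: pairs[:m])
for i in range(100):
    cs.insert(i, -i)
print(len(cs.pairs))
```

The buffer never exceeds twice the capacity plus one pending insertion, so space stays independent of the stream length.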

Load-bearing premise

Keys and values must all have exactly unit norm while every query has norm bounded by rho.
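Because the guarantee is void outside this regime, the precondition is worth checking explicitly before invoking the result; a minimal sketch:

```python
import numpy as np

def premise_holds(K, V, q, rho, tol=1e-8):
    """Check the result's precondition: unit-norm keys and values, ||q|| <= rho."""
    unit_keys = np.allclose(np.linalg.norm(K, axis=1), 1.0, atol=tol)
    unit_vals = np.allclose(np.linalg.norm(V, axis=1), 1.0, atol=tol)
    return unit_keys and unit_vals and np.linalg.norm(q) <= rho + tol

K = np.eye(3)   # three unit-norm keys
V = np.eye(3)   # three unit-norm values
print(premise_holds(K, V, np.array([0.3, 0.4, 0.0]), rho=1.0))  # True: ||q|| = 0.5
```

Rescaling keys or values to force unit norm changes the attention output itself, so this is a precondition check rather than a repair.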

What would settle it

An explicit collection of unit-norm keys and values in dimension d for which every subset of at most the stated size fails to approximate attention within epsilon for some rho-bounded query would disprove the upper bound.

read the original abstract

We consider the problem of estimating the Attention mechanism in small space, and prove the existence of coresets for it of nearly optimal size. Specifically, we show that for any set of unit-norm keys and values $(K,V)$ in $\mathbb{R}^d$, there exists a subset $(K',V')$ of size at most $O({\sqrt{d} e^{\rho+o(\rho)}/\varepsilon})$ such that \[ \left\| \operatorname{Attn}(q,K,V)- \operatorname{Attn}(q,K',V') \right\| \le \varepsilon \] simultaneously for all queries whose norm is bounded by $\rho$. This outperforms the best known results for this problem. We also offer an improved lower bound showing that $\varepsilon$-coresets must have size $\Omega({\sqrt{d} e^{\rho}/\epsilon})$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper proves the existence of nearly optimal ε-coresets for the attention mechanism. For any set of unit-norm keys and values (K, V) in R^d, there exists a subset (K', V') of size O(√d e^{ρ + o(ρ)} / ε) such that ||Attn(q, K, V) - Attn(q, K', V')|| ≤ ε holds simultaneously for all queries q with ||q|| ≤ ρ. The paper also establishes a matching lower bound of Ω(√d e^ρ / ε).

Significance. If correct, the result is significant for data structures and theoretical CS because it supplies nearly tight bounds on attention coreset size, improving prior upper bounds while matching the lower bound up to the o(ρ) factor. The explicit existence proof for uniform approximation over a bounded query set is a useful primitive for streaming and small-space attention approximations.

major comments (1)
  1. [Main theorem / upper-bound proof] The o(ρ) term in the exponent of the existence upper bound must be shown to be independent of d and ε (or explicitly bounded) so that the claimed near-optimality relative to the Ω(√d e^ρ / ε) lower bound is rigorous; without this, the gap between the upper and lower bounds could be larger than stated.
minor comments (2)
  1. [Preliminaries] Define the precise attention function (including any scaling or softmax variant) at the first use rather than assuming standard notation.
  2. [Introduction] Add a short paragraph contrasting the new bound with the best previous upper bound cited in the introduction.
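For reference, the definition the first minor comment asks for, in the standard unscaled form consistent with the abstract's notation (the paper may instead fold a 1/√d temperature into the keys), would read:

```latex
\operatorname{Attn}(q, K, V)
\;=\; \sum_{i=1}^{n} \frac{e^{\langle q, k_i \rangle}}{\sum_{j=1}^{n} e^{\langle q, k_j \rangle}}\, v_i
\;=\; \operatorname{softmax}(Kq)^{\top} V .
```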

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the careful reading, positive assessment, and constructive comment on the main theorem. We address the point below and will incorporate a clarification in the revision.

read point-by-point responses
  1. Referee: [Main theorem / upper-bound proof] The o(ρ) term in the exponent of the existence upper bound must be shown to be independent of d and ε (or explicitly bounded) so that the claimed near-optimality relative to the Ω(√d e^ρ / ε) lower bound is rigorous; without this, the gap between the upper and lower bounds could be larger than stated.

    Authors: We agree that explicit independence strengthens the near-optimality claim. In the proof of Theorem 3.1 (the existence upper bound), the o(ρ) term originates from the asymptotic analysis of the ε-net covering number on the sphere for queries of norm ≤ρ combined with a uniform approximation of the softmax via truncated exponential sums; the rate of the little-o is controlled solely by the choice of auxiliary approximation parameters (e.g., the truncation threshold and net fineness) and does not depend on dimension d or error ε. Consequently the upper bound is of the form O(√d ⋅ e^{ρ(1+o(1))} / ε) where the o(1)→0 as ρ→∞ uniformly in d and ε. We will revise the statement of the main theorem and add a short remark after the proof explicitly recording this independence (and, if space permits, a concrete bound such as O(ρ/log ρ) for the additive term). This makes the comparison to the Ω(√d e^ρ / ε) lower bound fully rigorous. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents a first-principles existence proof establishing that for any unit-norm keys and values in R^d, there is a coreset of size O(√d e^{ρ+o(ρ)}/ε) that ε-approximates attention uniformly over all queries with ||q|| ≤ ρ. This is paired with an independent lower bound of Ω(√d e^ρ/ε). The derivation relies on explicit mathematical arguments and stated assumptions rather than self-definitional reductions, fitted parameters renamed as predictions, or load-bearing self-citations. No step reduces the claimed result to its own inputs by construction; the result is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The result rests on standard mathematical existence arguments (likely probabilistic method or greedy selection) that are common in approximation-algorithm literature; no new entities or fitted parameters are introduced in the abstract.

axioms (1)
  • standard math Existence of a small subset that preserves attention output for all bounded-norm queries
    Typical for coreset existence proofs in high-dimensional geometry and approximation algorithms.

pith-pipeline@v0.9.0 · 5437 in / 1251 out tokens · 65713 ms · 2026-05-08T06:22:36.042210+00:00 · methodology

discussion (0)

