arxiv: 2407.11550 · v5 · pith:6S2KKZLVnew · submitted 2024-07-16 · 💻 cs.CL · cs.AI

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, S. Kevin Zhou

Pith reviewed 2026-05-17 11:12 UTC · model grok-4.3

keywords: KV cache eviction · adaptive budget allocation · LLM inference · attention heads · long-context modeling · cache compression · efficiency optimization

The pith

A theoretical upper bound on attention loss from KV cache eviction enables adaptive per-head budget allocation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a theoretical upper bound on the loss between attention outputs before and after KV cache eviction. This bound accounts for the optimization goal of earlier eviction techniques and indicates how to assign different cache budgets to individual attention heads. The authors use the bound to create Ada-KV, the first head-wise adaptive allocation method that works as a plug-in with existing eviction approaches. Tests on 13 Ruler and 16 LongBench datasets under both question-aware and question-agnostic conditions show clear quality gains over uniform allocation baselines.

Core claim

A derived loss upper bound between pre- and post-eviction attention outputs explains the target of prior cache eviction work and supports optimizing budget allocation separately for each attention head, resulting in the Ada-KV strategy that yields measurable quality improvements while reducing cache size.

What carries the argument

Theoretical loss upper bound between pre- and post-eviction attention outputs, which guides the choice of non-uniform cache budgets across heads.
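
To make the shape of such a bound concrete, here is a simplified single-head version. It is worked under assumptions that are ours rather than the paper's: eviction keeps an index set $S$, the surviving attention weights are renormalized, and the value rows $v_j$ are used directly without an output projection. The symbols $a_j$, $m$, and $S$ are illustrative notation, and the paper's actual theorem may differ in constants and norms.

```latex
\begin{align*}
o &= \sum_{j=1}^{n} a_j v_j, \qquad
\hat{o} = \sum_{j \in S} \frac{a_j}{m}\, v_j, \qquad
m = \sum_{j \in S} a_j > 0, \\
\lVert o - \hat{o} \rVert_1
  &\le \sum_{j \in S} a_j \left(\frac{1}{m} - 1\right) \lVert v_j \rVert_1
     + \sum_{j \notin S} a_j \lVert v_j \rVert_1 \\
  &\le \bigl[(1 - m) + (1 - m)\bigr] \max_j \lVert v_j \rVert_1
   = 2\,(1 - m)\,\max_j \lVert v_j \rVert_1 .
\end{align*}
```

Under these assumptions the bound shrinks as the retained attention mass $m$ grows. That makes keeping the highest-weight entries the natural per-head target, and it suggests that a head whose mass is spread over many entries needs a larger budget to reach the same $m$.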

Load-bearing premise

The loss upper bound accurately captures the quality impact of eviction and attention heads show distinct enough patterns to benefit from unequal budgets.
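
The second half of this premise can be probed on synthetic data. The sketch below is an illustration under stated assumptions, not the paper's implementation. The heads are random softmax distributions with different sharpness, and the adaptive rule is a hypothetical one: pool the weights of all heads, keep the global top-B, and count how many survivors each head contributes. The function names `uniform_budgets`, `pooled_topk_budgets`, and `retained_mass` are invented for this example.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max(axis=-1, keepdims=True))
    return z / z.sum(axis=-1, keepdims=True)

def uniform_budgets(num_heads, total_budget):
    """Split the total budget as evenly as possible across heads."""
    base = total_budget // num_heads
    budgets = np.full(num_heads, base)
    budgets[: total_budget - base * num_heads] += 1  # hand out the remainder
    return budgets

def pooled_topk_budgets(weights, total_budget):
    """Hypothetical adaptive rule: global top-B over all heads, counted per head."""
    num_heads, n = weights.shape
    flat = weights.ravel()
    top = np.argpartition(flat, -total_budget)[-total_budget:]
    return np.bincount(top // n, minlength=num_heads)

def retained_mass(weights, budgets):
    """Attention mass each head keeps when it retains its top-b entries."""
    sorted_w = np.sort(weights, axis=-1)[:, ::-1]
    return np.array([sorted_w[h, :b].sum() for h, b in enumerate(budgets)])

rng = np.random.default_rng(0)
num_heads, n, total_budget = 4, 256, 128
# Assumed heterogeneity: larger sharpness gives a more concentrated head.
sharpness = np.array([0.2, 0.5, 1.0, 4.0])[:, None]
weights = softmax(rng.normal(size=(num_heads, n)) * sharpness)

uniform = uniform_budgets(num_heads, total_budget)
adaptive = pooled_topk_budgets(weights, total_budget)

print("uniform budgets :", uniform, "mass:", retained_mass(weights, uniform).sum())
print("adaptive budgets:", adaptive, "mass:", retained_mass(weights, adaptive).sum())
```

For a fixed total budget, the pooled rule keeps at least as much total attention mass as the uniform split, because it selects the B largest weights overall. The trade-off is that a very concentrated head can end up with few entries; whether and how the paper guards against that is not stated on this page.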

What would settle it

Applying the head-wise adaptive allocation derived from the bound to new models or datasets and measuring no quality improvement or a drop relative to uniform allocation.

Original abstract

Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache required for long-sequence inference. Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime while preserving generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, ignoring the unique attention patterns of each head. In this paper, we establish a theoretical loss upper bound between pre- and post-eviction attention output, explaining the optimization target of prior cache eviction methods, while guiding the optimization of adaptive budget allocation. Base on this, we propose Ada-KV, the first head-wise adaptive budget allocation strategy. It offers plug-and-play benefits, enabling seamless integration with prior cache eviction methods. Extensive evaluations on 13 datasets from Ruler and 16 datasets from LongBench, all conducted under both question-aware and question-agnostic scenarios, demonstrate substantial quality improvements over existing methods. Our code is available at https://github.com/FFY0/AdaKV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims to establish a theoretical loss upper bound between pre- and post-eviction attention outputs that explains the optimization target of prior uniform KV cache eviction methods and guides the design of head-wise adaptive budget allocation. Building on this, it proposes Ada-KV as the first such adaptive strategy, which integrates plug-and-play with existing eviction methods. Extensive experiments on 13 Ruler and 16 LongBench datasets under question-aware and question-agnostic settings report consistent quality gains over baselines.

Significance. If the bound holds and is reasonably tight, the work supplies a principled, mechanics-derived target for cache eviction that moves beyond uniform allocation, with the adaptive strategy offering practical efficiency gains. The broad dataset coverage and open-sourced code strengthen the empirical case and reproducibility. This could influence future KV cache designs by providing an independent theoretical grounding for non-uniform budgets.

major comments (2)

[Theoretical Analysis] Theoretical bound section: the derivation of the loss upper bound from attention output mechanics is independent of final quality metrics, but the manuscript should explicitly show (e.g., via the per-head score distribution properties used in the proof) that adaptive budget allocation reduces the bound more than uniform allocation without additional unstated assumptions on tail behavior across heads. If the bound treats heads symmetrically inside the derivation, the theoretical motivation for Ada-KV would require further justification beyond the observed empirical gains.
[§4] §4 (budget allocation algorithm): the post-hoc choices in computing per-head budgets are not fully detailed in how they interact with the bound; an ablation confirming that the adaptive rule minimizes the derived bound (rather than a proxy) would make the link between theory and method load-bearing.

minor comments (3)

[Abstract] Abstract: 'Base on this' should read 'Based on this'.
[Experiments] The manuscript would benefit from a short paragraph or table entry quantifying bound tightness (actual vs. upper-bound loss) on a representative subset of the evaluated datasets.
[Method] Notation for the per-head budget variables should be introduced once and used consistently; current usage mixes symbols in the method description.
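
The tightness check requested in the [Experiments] comment could be prototyped as follows. This is a minimal sketch on random data, assuming a simplified single-head bound of 2(1 - m) * max_j ||v_j||_1, where m is the attention mass retained after keeping the top-`budget` entries and renormalizing. That bound follows from the triangle inequality and is our simplification, not the paper's theorem; `eviction_tightness` is a hypothetical helper.

```python
import numpy as np

def eviction_tightness(scores, values, budget):
    """Return (actual L1 loss, simplified upper bound) for one attention head."""
    w = np.exp(scores - scores.max())
    w /= w.sum()                                  # softmax attention weights
    keep = np.argsort(w)[-budget:]                # indices of the top-`budget` weights
    mass = w[keep].sum()                          # retained attention mass m
    full_out = w @ values                         # pre-eviction output
    kept_out = (w[keep] / mass) @ values[keep]    # post-eviction, renormalized
    actual = np.abs(full_out - kept_out).sum()
    bound = 2.0 * (1.0 - mass) * np.abs(values).sum(axis=1).max()
    return actual, bound

rng = np.random.default_rng(0)
scores = rng.normal(size=512) * 2.0               # synthetic attention logits
values = rng.normal(size=(512, 64))               # synthetic value rows

results = {}
for budget in (32, 128, 384):
    actual, bound = eviction_tightness(scores, values, budget)
    results[budget] = (actual, bound)
    print(f"budget={budget:4d} actual={actual:.4f} bound={bound:.4f} ratio={actual / bound:.4f}")
```

A ratio close to 1 would indicate a tight bound. On real models the same measurement would use captured attention scores and value states instead of random draws.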

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. The comments raise valid points about strengthening the explicit connection between the theoretical bound and the adaptive allocation. We respond to each major comment below and indicate the revisions we will incorporate.

Referee: [Theoretical Analysis] Theoretical bound section: the derivation of the loss upper bound from attention output mechanics is independent of final quality metrics, but the manuscript should explicitly show (e.g., via the per-head score distribution properties used in the proof) that adaptive budget allocation reduces the bound more than uniform allocation without additional unstated assumptions on tail behavior across heads. If the bound treats heads symmetrically inside the derivation, the theoretical motivation for Ada-KV would require further justification beyond the observed empirical gains.

Authors: We thank the referee for this observation. The upper bound is derived separately for each attention head from the mechanics of its attention scores, and it is additive across heads because the final output is assembled from head-wise results. For a given retained budget, the per-head bound is a monotonic function of the dispersion and tail properties of that head's score distribution, so heads with heavier tails or greater dispersion contribute larger terms to the total bound. Allocating the global budget in proportion to these head-specific quantities yields a strictly smaller total bound than uniform allocation for the same aggregate budget. No cross-head assumptions on tail behavior are required; the argument relies only on the observed heterogeneity in per-head score distributions, which is already used in the proof. In the revised manuscript we will insert a short paragraph immediately after the bound derivation that explicitly compares the total bound value under adaptive versus uniform allocation, using the per-head distribution properties to demonstrate the reduction.
revision: yes

Referee: [§4] §4 (budget allocation algorithm): the post-hoc choices in computing per-head budgets are not fully detailed in how they interact with the bound; an ablation confirming that the adaptive rule minimizes the derived bound (rather than a proxy) would make the link between theory and method load-bearing.

Authors: We agree that the interaction between the allocation rule and the bound can be made more transparent. The per-head budgets in §4 are obtained by partitioning the total budget in proportion to the per-head upper-bound values estimated directly from each head's attention-score distribution; this is a closed-form rule that targets minimization of the sum of the per-head bounds. To make the link load-bearing, we will add a new ablation subsection that evaluates the numerical value of the derived loss upper bound itself (not a downstream quality metric) for Ada-KV, uniform allocation, and two alternative heuristics. The results will confirm that the adaptive rule produces the lowest bound value among the compared strategies under identical total budgets.
revision: yes

Circularity Check

0 steps flagged

Theoretical loss upper bound derived from attention mechanics provides independent grounding

The paper's central derivation establishes a loss upper bound between pre- and post-eviction attention outputs directly from the mechanics of attention computation. This bound is used to reinterpret prior uniform-eviction methods and to motivate head-wise adaptive budget allocation as an optimization step. No equations or claims reduce the bound or the adaptive strategy to fitted parameters, self-citations, or definitional equivalences; the bound is presented as a first-principles result that remains sensitive to per-head score distributions. Empirical results on Ruler and LongBench are reported separately as validation rather than as the source of the bound itself. The derivation chain is therefore self-contained against external attention mathematics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard transformer attention mechanics and a derived bound; no new free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)

domain assumption: Attention output difference after eviction admits an upper bound that can guide per-head budget decisions.
Invoked to justify the adaptive allocation over uniform budgets.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
cs.PF 2026-04 unverdicted novelty 7.0

HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.

Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
cs.LG 2026-05 unverdicted novelty 6.0

SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.

Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
cs.CV 2026-05 unverdicted novelty 6.0

Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.

Compute Where it Counts: Self Optimizing Language Models
cs.LG 2026-05 unverdicted novelty 6.0

SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...

Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
cs.LG 2026-05 unverdicted novelty 6.0

A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
cs.CL 2026-05 conditional novelty 6.0

ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...

RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache ...

Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
cs.CL 2026-05 unverdicted novelty 6.0

LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.

Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
cs.LG 2026-05 unverdicted novelty 6.0

Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...

CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
cs.DC 2026-04 unverdicted novelty 6.0

CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...

OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
cs.CV 2026-03 conditional novelty 6.0

OVGGT achieves constant O(1) memory and compute for streaming 3D geometry reconstruction by using FFN-residual-based KV cache compression and dynamic anchor protection, matching state-of-the-art accuracy on long sequences.

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
cs.LG 2026-02 conditional novelty 6.0

RAT+ pretrains a single dense recurrent-augmented attention model that supports flexible dilated sparse inference after short adaptation, matching dense accuracy at moderate dilation and losing only 1-3 points at high...

HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
cs.DC 2026-04 unverdicted novelty 5.0

HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
cs.CV 2026-04 unverdicted novelty 5.0

StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.

AudioKV: KV Cache Eviction in Efficient Large Audio Language Models
cs.SD 2026-04 unverdicted novelty 5.0

AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.

Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs
cs.CV 2026-04 unverdicted novelty 5.0

A data-driven adaptive policy for KV-cache bit-width selection based on token importance features reduces decoding latency by ~18% and improves accuracy over static quantization while staying near FP16 levels on SmolL...

From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs
cs.IR 2025-04 unverdicted novelty 5.0

The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 17 Pith papers · 16 internal anchors

[1]

A survey on recent advances in llm-based multi-turn dialogue systems

Zihao Yi, Jiarui Ouyang, Yuwen Liu, Tianhao Liao, Zhe Xu, and Ying Shen. A survey on recent advances in llm-based multi-turn dialogue systems. arXiv preprint arXiv:2402.18013, 2024

work page arXiv 2024
[2]

Summedits: measuring llm ability at factual reasoning through the lens of summarization

Philippe Laban, Wojciech Kry´sci´nski, Divyansh Agarwal, Alexander Richard Fabbri, Caiming Xiong, Shafiq Joty, and Chien-Sheng Wu. Summedits: measuring llm ability at factual reasoning through the lens of summarization. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9662–9676, 2023

work page 2023
[3]

Llm-based code generation method for golang compiler testing

Qiuhan Gu. Llm-based code generation method for golang compiler testing. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 2201–2203, 2023

work page 2023
[4]

GPT-4 Technical Report

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[5]

The claude 3 model family: Opus, sonnet, haiku, March 2024

Anthropic. The claude 3 model family: Opus, sonnet, haiku, March 2024. Accessed: 2024-07- 09

work page 2024
[6]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean- baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[8]

H2o: Heavy-hitter oracle for efficient generative inference of large language models

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[9]

PyramidInfer: Pyramid KV cache compression for high-throughput LLM inference

Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. PyramidInfer: Pyramid KV cache compression for high-throughput LLM inference. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics ACL 2024, pages 3258–3270, Bangkok, Thailand and virtual meeting, August 2024. Association...

work page 2024
[10]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Yichi Zhang, Bofei Gao, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[11]

SnapKV: LLM knows what you are looking for before generation

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

work page 2024
[12]

Llm kv cache compression made easy, 2024

Maximilian Jeblick Simon Jegou. Llm kv cache compression made easy, 2024

work page 2024
[13]

Catalyst: Optimizing cache management for large in-memory key-value systems

Kefei Wang and Feng Chen. Catalyst: Optimizing cache management for large in-memory key-value systems. Proceedings of the VLDB Endowment, 16(13):4339–4352, 2023. 11

work page 2023
[14]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2004
[15]

Lm-infinite: Zero-shot extreme length generalization for large language models, 2024

Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Zero-shot extreme length generalization for large language models, 2024

work page 2024
[16]

Efficient Streaming Language Models with Attention Sinks

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[17]

Scissorhands: Exploiting the persistence of impor- tance hypothesis for llm kv cache compression at test time

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of impor- tance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024

work page 2024
[18]

On the efficacy of eviction policy for key-value constrained generative language model inference

Siyu Ren and Kenny Q Zhu. On the efficacy of eviction policy for key-value constrained generative language model inference. arXiv preprint arXiv:2402.06262, 2024

work page arXiv 2024
[19]

Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention

Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. Advances in Neural Information Processing Systems, 37:52481–52515, 2024

work page 2024
[20]

Mminference: Accelerating pre-filling for long-context vlms via modality-aware permutation sparse attention.arXiv preprint arXiv:2504.16083, 2025

Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, et al. Mminference: Accelerating pre-filling for long-context vlms via modality-aware permutation sparse attention.arXiv preprint arXiv:2504.16083, 2025

work page arXiv 2025
[21]

Retrievalattention: Accelerating long- context llm inference via vector retrieval

Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, et al. Retrievalattention: Accelerating long- context llm inference via vector retrieval. arXiv preprint arXiv:2409.10516, 2024

work page arXiv 2024
[22]

Arkvale: Efficient generative llm inference with recallable key-value eviction

Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, and Yun Liang. Arkvale: Efficient generative llm inference with recallable key-value eviction. Advances in Neural Information Processing Systems, 37:113134– 113155, 2024

work page 2024
[23]

Pqcache: Product quantization-based kvcache for long context llm inference

Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and Bin Cui. Pqcache: Product quantization-based kvcache for long context llm inference. Proceedings of the ACM on Management of Data, 3(3):1–30, 2025

work page 2025
[24]

Breaking the boundaries of long- context llm inference: Adaptive kv management on a single commodity gpu

He Sun, Li Li, Mingjun Xiao, and Chengzhong Xu. Breaking the boundaries of long- context llm inference: Adaptive kv management on a single commodity gpu. arXiv preprint arXiv:2506.20187, 2025

work page arXiv 2025
[25]

Deja vu: Contextual sparsity for efficient llms at inference time, 2023

Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivas- tava, Ce Zhang, Yuandong Tian, Christopher Re, and Beidi Chen. Deja vu: Contextual sparsity for efficient llms at inference time, 2023

work page 2023
[26]

Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in Neural Information Processing Systems, 35:16344–16359, 2022

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in Neural Information Processing Systems, 35:16344–16359, 2022

work page 2022
[27]

FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

work page 2023
[29]

The llama 3 herd of models, 2024

Aaron Grattafiori, Abhimanyu Dubey, and Abhinav Jauhri .et al. The llama 3 herd of models, 2024. 12

work page 2024
[30]

Mistral 7B

Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[31]

Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023

Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023

work page 2023
[32]

RULER: What's the Real Context Size of Your Long-Context Language Models?

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[33]

LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[34]

Prompt cache: Modular attention reuse for low-latency inference

In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems, 6:325–338, 2024

work page 2024
[35]

SGLang: Efficient Execution of Structured Language Model Programs

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. arXiv preprint arXiv:2312.07104, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Kvzip: Query-agnostic kv cache compression with context reconstruction

Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W Lee, Sangdoo Yun, and Hyun Oh Song. Kvzip: Query-agnostic kv cache compression with context reconstruction. arXiv preprint arXiv:2505.23416, 2025

work page arXiv 2025
[37]

Expected attention: Kv cache compres- sion by estimating attention from future queries distribution

Alessio Devoto, Maximilian Jeblick, and Simon Jégou. Expected attention: Kv cache compres- sion by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636, 2025

work page arXiv 2025
[38]

Draft-based approximate inference for llms

Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, and Kangwook Lee. Draft-based approximate inference for llms. arXiv preprint arXiv:2506.08373, 2025

work page arXiv 2025
[39]

Identify critical kv cache in llm inference from an output perturbation perspective

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Identify critical kv cache in llm inference from an output perturbation perspective. arXiv preprint arXiv:2502.03805, 2025

work page arXiv 2025
[40]

Kevin Zhou, and Xike Xie

Yuan Feng, Haoyu Guo, JunLin Lv, S. Kevin Zhou, and Xike Xie. Taming the fragility of kv cache eviction in llm inference, 2025

work page 2025
[41]

Duoattention: Efficient long-context llm inference with retrieval and streaming heads

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819, 2024

work page arXiv 2024
[42]

Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning

Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning. arXiv preprint arXiv:2410.19258, 2024

work page arXiv 2024
[43]

Zeroquant: Efficient and affordable post-training quantization for large-scale transformers

Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems, 35:27168–27183, 2022

work page 2022
[44]

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[45]

Qaq: Quality adaptive quantization for llm kv cache

Shichen Dong, Wen Cheng, Jiayu Qin, and Wei Wang. Qaq: Quality adaptive quantization for llm kv cache. arXiv preprint arXiv:2403.04643, 2024

work page arXiv 2024
[46]

Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding

Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, and Beidi Chen. Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. arXiv preprint arXiv:2404.11912, 2024. 13

work page arXiv 2024
[47]

Longspec: Long-context lossless speculative decoding with efficient drafting and verification

Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, and Bo An. Longspec: Long-context lossless speculative decoding with efficient drafting and verification. In ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

work page
[48]

Specvlm: Enhancing speculative decoding of video llms via verifier-guided token pruning

Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, and Huan Li. Specvlm: Enhancing speculative decoding of video llms via verifier-guided token pruning. arXiv preprint arXiv:2508.16201, 2025

work page arXiv 2025
[49]

Needle In A Haystack - pressure testing LLMs

Gregory Kamradt. Needle In A Haystack - pressure testing LLMs. Github, 2023

work page 2023
[50]

Zoology: Measuring and improving recall in efficient language models

Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Ré. Zoology: Measuring and improving recall in efficient language models. In ICLR, 2024

work page 2024
[51]

The narrativeqa reading comprehension challenge

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018

work page 2018
[52]

A dataset of information-seeking questions and answers anchored in research papers

Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011, 2021

work page arXiv 2021
[53]

HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[54]

Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Donia Scott, Nuria Bel, and Chengqing Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online), December 2020. International...

work page 2020
[55]

Musique: Multihop questions via single-hop question composition

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

work page 2022
[56]

Efficient attentions for long document summarization

Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summarization. arXiv preprint arXiv:2104.02112, 2021

work page arXiv 2021
[57]

Qmsum: A new benchmark for query-based multi-domain meeting summarization

Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. Qmsum: A new benchmark for query-based multi-domain meeting summarization. arXiv preprint arXiv:2104.05938, 2021

work page arXiv 2021
[58]

Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model

Alexander R Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R Radev. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. arXiv preprint arXiv:1906.01749, 2019

work page internal anchor Pith review Pith/arXiv arXiv 2019
[59]

Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension, 2017

work page 2017
[60]

Samsum corpus: A human-annotated dialogue dataset for abstractive summarization

Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. arXiv preprint arXiv:1911.12237, 2019

work page arXiv 2019
[61]

Learning question classifiers

Xin Li and Dan Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002

work page 2002
[62]

Longcoder: A long-range pre-trained language model for code completion, 2023

Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. Longcoder: A long-range pre-trained language model for code completion, 2023

work page 2023
[63]

Repobench: Benchmarking repository-level code auto-completion systems, 2023

Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems, 2023

work page 2023
[64]

Lost in the Middle: How Language Models Use Long Contexts

Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[65]

Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps

Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020

work page 2020
[66]

Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages 1601–1611, 2017

work page 2017
[67]

Longcoder: A long-range pre-trained language model for code completion

Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. Longcoder: A long-range pre-trained language model for code completion. arXiv preprint arXiv:2306.14893, 2023

work page arXiv 2023
[68]

RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091, 2023

A Appendix

A.1 Additional Related Works

Additional works also mitigate the challenges posed by massive KV Caches during long-sequence inference while not reducing the number of cache elements. These ...

work page internal anchor Pith review Pith/arXiv arXiv 2023

word-a 2. word-b 3. word-c 4. word-a 5. word-d 6. word-a 7. word-e 8. word-f ......
Question: What are the 10 most common words in the above list?
Answer: The top 10 words that appear most often in the list are:

Frequent Words Extraction (FWE) Task Template:
Read the following coded text and track the frequency of each coded word. Find the three most freq...

work page