Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference
Pith reviewed 2026-05-17 11:12 UTC · model grok-4.3
The pith
A theoretical upper bound on attention loss from KV cache eviction enables adaptive per-head budget allocation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A derived loss upper bound between pre- and post-eviction attention outputs explains the target of prior cache eviction work and supports optimizing budget allocation separately for each attention head, resulting in the Ada-KV strategy that yields measurable quality improvements while reducing cache size.
What carries the argument
Theoretical loss upper bound between pre- and post-eviction attention outputs, which guides the choice of non-uniform cache budgets across heads.
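For intuition, a bound of this kind can be worked out for a single head under simple assumptions. What follows is a generic derivation and not the paper's theorem. It assumes softmax weights $a_i$ over $n$ cached tokens, value vectors $v_i$, a retained index set $R$ with retained mass $m > 0$, and renormalization of the retained weights after eviction; all symbols are introduced here for illustration only.

```latex
\begin{align*}
o &= \sum_{i=1}^{n} a_i v_i, \qquad
\hat{o} = \sum_{i \in R} \frac{a_i}{m} v_i, \qquad
m = \sum_{i \in R} a_i, \\
o - \hat{o} &= \sum_{i \notin R} a_i v_i \;-\; \Bigl(\tfrac{1}{m} - 1\Bigr) \sum_{i \in R} a_i v_i, \\
\lVert o - \hat{o} \rVert
  &\le \Bigl[\sum_{i \notin R} a_i + \Bigl(\tfrac{1}{m} - 1\Bigr) m\Bigr] \max_i \lVert v_i \rVert
  = 2\,(1 - m)\,\max_i \lVert v_i \rVert .
\end{align*}
```

Under these assumptions the bound depends on the eviction choice only through the retained mass $m$, so for a fixed budget it is smallest when the highest-weight entries are kept. That is consistent with the statement that the bound explains the target of prior eviction methods.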
Load-bearing premise
The loss upper bound accurately captures the quality impact of eviction and attention heads show distinct enough patterns to benefit from unequal budgets.
What would settle it
The claim would be undermined if applying the head-wise adaptive allocation derived from the bound to new models or datasets produced no quality improvement, or a drop, relative to uniform allocation.
Original abstract
Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache required for long-sequence inference. Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime while preserving generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, ignoring the unique attention patterns of each head. In this paper, we establish a theoretical loss upper bound between pre- and post-eviction attention output, explaining the optimization target of prior cache eviction methods, while guiding the optimization of adaptive budget allocation. Base on this, we propose Ada-KV, the first head-wise adaptive budget allocation strategy. It offers plug-and-play benefits, enabling seamless integration with prior cache eviction methods. Extensive evaluations on 13 datasets from Ruler and 16 datasets from LongBench, all conducted under both question-aware and question-agnostic scenarios, demonstrate substantial quality improvements over existing methods. Our code is available at https://github.com/FFY0/AdaKV.
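The abstract does not spell out the allocation rule, so the following is only a minimal sketch of one way head-wise adaptive budgets could be computed. It assumes a pooled top-k rule: the globally largest attention weights are selected across all heads, and each head's budget is the number of selected weights that fall in it. The function name `adaptive_head_budgets` and its parameters are hypothetical and are not taken from the AdaKV repository.

```python
import numpy as np

def adaptive_head_budgets(attn, total_budget):
    """Split total_budget cache slots across heads (illustrative rule).

    attn: array of shape (num_heads, seq_len); each row holds one head's
    attention weights over the cached tokens.
    Returns an integer array of per-head budgets summing to total_budget.
    Assumption: budgets follow where the globally largest weights fall.
    """
    num_heads, seq_len = attn.shape
    if not 0 <= total_budget <= num_heads * seq_len:
        raise ValueError("total_budget out of range")
    flat = attn.ravel()
    # Indices of the total_budget largest weights pooled over all heads.
    top = np.argsort(flat, kind="stable")[::-1][:total_budget]
    # Integer division by seq_len recovers the head of each selected weight.
    head_of = top // seq_len
    return np.bincount(head_of, minlength=num_heads)

# One peaked head and one flat head, 4 slots in total (uniform would give 2 each).
example = np.array([[0.7, 0.1, 0.1, 0.1],
                    [0.25, 0.25, 0.25, 0.25]])
budgets = adaptive_head_budgets(example, 4)
```

In this toy case the peaked head receives 1 slot and the flat head receives 3, since the flat head has more weights above the pooled cutoff. Any eviction method that keeps each head's top entries could then be run with these per-head budgets in place of a uniform one.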
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to establish a theoretical loss upper bound between pre- and post-eviction attention outputs that explains the optimization target of prior uniform KV cache eviction methods and guides the design of head-wise adaptive budget allocation. Building on this, it proposes Ada-KV as the first such adaptive strategy, which integrates with existing eviction methods in a plug-and-play manner. Extensive experiments on 13 Ruler and 16 LongBench datasets under question-aware and question-agnostic settings report consistent quality gains over baselines.
Significance. If the bound holds and is reasonably tight, the work supplies a principled, mechanics-derived target for cache eviction that moves beyond uniform allocation, with the adaptive strategy offering practical efficiency gains. The broad dataset coverage and open-sourced code strengthen the empirical case and reproducibility. This could influence future KV cache designs by providing an independent theoretical grounding for non-uniform budgets.
major comments (2)
- [Theoretical Analysis] Theoretical bound section: the derivation of the loss upper bound from attention output mechanics is independent of final quality metrics, but the manuscript should explicitly show (e.g., via the per-head score distribution properties used in the proof) that adaptive budget allocation reduces the bound more than uniform allocation without additional unstated assumptions on tail behavior across heads. If the bound treats heads symmetrically inside the derivation, the theoretical motivation for Ada-KV would require further justification beyond the observed empirical gains.
- [§4] §4 (budget allocation algorithm): the post-hoc choices in computing per-head budgets are not fully detailed in how they interact with the bound; an ablation confirming that the adaptive rule minimizes the derived bound (rather than a proxy) would make the link between theory and method load-bearing.
minor comments (3)
- [Abstract] Abstract: 'Base on this' should read 'Based on this'.
- [Experiments] The manuscript would benefit from a short paragraph or table entry quantifying bound tightness (actual vs. upper-bound loss) on a representative subset of the evaluated datasets.
- [Method] Notation for the per-head budget variables should be introduced once and used consistently; current usage mixes symbols in the method description.
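The bound-tightness check requested in the minor comments could be prototyped along the following lines. This is a sketch under stated assumptions and not the paper's bound: it uses a generic single-head bound, 2 × (1 − retained attention mass) × the largest L1 norm of any value vector, which holds when the retained weights are renormalized after eviction. The function name and the synthetic inputs are hypothetical.

```python
import numpy as np

def eviction_loss_and_bound(scores, values, budget):
    """Return (actual L1 output change, generic upper bound) for one head.

    scores: (seq_len,) pre-softmax attention logits for one query.
    values: (seq_len, dim) value vectors.
    budget: number of cache entries kept, 1 <= budget <= seq_len.
    """
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Keep the entries with the largest attention weights.
    keep = np.argsort(weights)[-budget:]
    kept_mass = weights[keep].sum()
    full_out = weights @ values
    # After eviction, softmax renormalizes over the retained entries only.
    evicted_out = (weights[keep] / kept_mass) @ values[keep]
    actual = np.abs(full_out - evicted_out).sum()
    # Generic bound (an assumption of this sketch, not the paper's theorem).
    bound = 2.0 * (1.0 - kept_mass) * np.abs(values).sum(axis=1).max()
    return actual, bound

rng = np.random.default_rng(0)
scores = rng.normal(size=64)
values = rng.normal(size=(64, 8))
for budget in (4, 16, 64):
    actual, bound = eviction_loss_and_bound(scores, values, budget)
    print(f"budget={budget:3d}  actual={actual:.4f}  bound={bound:.4f}")
```

Reporting the ratio of actual loss to bound per budget, on real attention scores instead of random ones, would give the kind of tightness table the comment asks for.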
Simulated Author's Rebuttal
We thank the referee for the positive evaluation and recommendation for minor revision. The comments raise valid points about strengthening the explicit connection between the theoretical bound and the adaptive allocation. We respond to each major comment below and indicate the revisions we will incorporate.
Point-by-point responses
-
Referee: [Theoretical Analysis] Theoretical bound section: the derivation of the loss upper bound from attention output mechanics is independent of final quality metrics, but the manuscript should explicitly show (e.g., via the per-head score distribution properties used in the proof) that adaptive budget allocation reduces the bound more than uniform allocation without additional unstated assumptions on tail behavior across heads. If the bound treats heads symmetrically inside the derivation, the theoretical motivation for Ada-KV would require further justification beyond the observed empirical gains.
Authors: We thank the referee for this observation. The upper bound is derived separately for each attention head from the mechanics of its attention scores, and it is additive across heads because the final output is assembled from head-wise results. For a given retained budget, the per-head bound is a monotonic function of the dispersion and tail properties of that head's score distribution, so heads with heavier tails or greater dispersion contribute larger terms to the total bound. Allocating the global budget in proportion to these head-specific quantities therefore yields a strictly smaller total bound than uniform allocation for the same aggregate budget. No cross-head assumptions on tail behavior are required; the argument relies only on the observed heterogeneity in per-head score distributions, which is already used in the proof. In the revised manuscript we will insert a short paragraph immediately after the bound derivation that explicitly compares the total bound value under adaptive versus uniform allocation, using the per-head distribution properties to demonstrate the reduction. revision: yes
-
Referee: [§4] §4 (budget allocation algorithm): the post-hoc choices in computing per-head budgets are not fully detailed in how they interact with the bound; an ablation confirming that the adaptive rule minimizes the derived bound (rather than a proxy) would make the link between theory and method load-bearing.
Authors: We agree that the interaction between the allocation rule and the bound can be made more transparent. The per-head budgets in §4 are obtained by partitioning the total budget in proportion to the per-head upper-bound values estimated directly from each head's attention-score distribution; this is a closed-form rule that targets minimization of the sum of the per-head bounds. To make the link load-bearing, we will add a new ablation subsection that evaluates the numerical value of the derived loss upper bound itself (not a downstream quality metric) for Ada-KV, uniform allocation, and two alternative heuristics. The results will confirm that the adaptive rule produces the lowest bound value among the compared strategies under identical total budgets. revision: yes
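The comparison discussed in these responses, evaluating a bound value under uniform versus adaptive allocation at equal total budget, can be sketched as follows. Two assumptions are made loudly here: the per-head bound is replaced by a proxy (the attention mass evicted from that head), and the adaptive rule is a pooled top-k selection across heads, which is swapped in because the exact proportional rule is not given in the text. All names and the synthetic attention data are hypothetical.

```python
import numpy as np

def evicted_mass(attn, budgets):
    """Sum over heads of the attention mass lost when head h keeps its
    budgets[h] largest weights. attn has shape (num_heads, seq_len)."""
    sorted_desc = np.sort(attn, axis=1)[:, ::-1]
    kept = sum(sorted_desc[h, :b].sum() for h, b in enumerate(budgets))
    return attn.sum() - kept

def pooled_topk_budgets(attn, total_budget):
    """Per-head counts of the total_budget largest weights over all heads."""
    seq_len = attn.shape[1]
    top = np.argsort(attn.ravel())[::-1][:total_budget]
    return np.bincount(top // seq_len, minlength=attn.shape[0])

rng = np.random.default_rng(0)
# Synthetic heads with different concentration: small temperature = peaked.
temperature = np.array([[0.1], [0.5], [1.0], [4.0]])
logits = rng.normal(size=(4, 128)) / temperature
attn = np.exp(logits - logits.max(axis=1, keepdims=True))
attn /= attn.sum(axis=1, keepdims=True)

total = 4 * 16
uniform = np.full(4, 16)
adaptive = pooled_topk_budgets(attn, total)
loss_uniform = evicted_mass(attn, uniform)
loss_adaptive = evicted_mass(attn, adaptive)
print("adaptive budgets:", adaptive.tolist())
print(f"evicted mass  uniform={loss_uniform:.4f}  adaptive={loss_adaptive:.4f}")
```

For this proxy, pooled top-k keeps the largest possible total mass for a given total budget, so its evicted mass can never exceed that of uniform allocation. Whether the same holds for the paper's actual bound and allocation rule is what the promised ablation would need to show.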
Circularity Check
Theoretical loss upper bound derived from attention mechanics provides independent grounding
full rationale
The paper's central derivation establishes a loss upper bound between pre- and post-eviction attention outputs directly from the mechanics of attention computation. This bound is used to reinterpret prior uniform-eviction methods and to motivate head-wise adaptive budget allocation as an optimization step. No equations or claims reduce the bound or the adaptive strategy to fitted parameters, self-citations, or definitional equivalences; the bound is presented as a first-principles result that remains sensitive to per-head score distributions. Empirical results on Ruler and LongBench are reported separately as validation rather than as the source of the bound itself. The derivation chain is therefore self-contained against external attention mathematics.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the attention output difference after eviction admits an upper bound that can guide per-head budget decisions.
Forward citations
Cited by 18 Pith papers
-
HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing
HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.
-
Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
-
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
-
Compute Where it Counts: Self Optimizing Language Models
SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...
-
Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
-
ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...
-
RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache ...
-
Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
-
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.
-
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...
-
CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...
-
OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer
OVGGT achieves constant O(1) memory and compute for streaming 3D geometry reconstruction by using FFN-residual-based KV cache compression and dynamic anchor protection, matching state-of-the-art accuracy on long sequences.
-
RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference
RAT+ pretrains a single dense recurrent-augmented attention model that supports flexible dilated sparse inference after short adaptation, matching dense accuracy at moderate dilation and losing only 1-3 points at high...
-
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
-
StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression
StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.
-
AudioKV: KV Cache Eviction in Efficient Large Audio Language Models
AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.
-
Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs
A data-driven adaptive policy for KV-cache bit-width selection based on token importance features reduces decoding latency by ~18% and improves accuracy over static quantization while staying near FP16 levels on SmolL...
-
From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs
The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.