pith. machine review for the scientific record.

arxiv: 2407.11550 · v5 · pith:6S2KKZLVnew · submitted 2024-07-16 · 💻 cs.CL · cs.AI

Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

Pith reviewed 2026-05-17 11:12 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords KV cache eviction · adaptive budget allocation · LLM inference · attention heads · long-context modeling · cache compression · efficiency optimization

The pith

A theoretical upper bound on attention loss from KV cache eviction enables adaptive per-head budget allocation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper derives a theoretical upper bound on the loss between attention outputs before and after KV cache eviction. This bound accounts for the optimization goal of earlier eviction techniques and indicates how to assign different cache budgets to individual attention heads. The authors use the bound to create Ada-KV, the first head-wise adaptive allocation method that works as a plug-in with existing eviction approaches. Tests on 13 Ruler and 16 LongBench datasets under both question-aware and question-agnostic conditions show clear quality gains over uniform allocation baselines.

Core claim

A derived loss upper bound between pre- and post-eviction attention outputs explains the target of prior cache eviction work and supports optimizing budget allocation separately for each attention head, resulting in the Ada-KV strategy that yields measurable quality improvements while reducing cache size.

What carries the argument

Theoretical loss upper bound between pre- and post-eviction attention outputs, which guides the choice of non-uniform cache budgets across heads.
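
As an illustration of how a bound of this kind can arise, here is a minimal single-head sketch. It rests on assumptions that are not taken from this page: softmax attention weights $a_j$ over $n$ cached values $v_j$, a retained index set $S$, renormalization of the retained weights after eviction, and a constant $C = \max_j \lVert v_j \rVert$. The paper's own statement may differ in norm, constants, and how the output projection is handled.

```latex
\begin{aligned}
o &= \sum_{j=1}^{n} a_j v_j, \qquad
\hat{o} = \sum_{j \in S} \frac{a_j}{\sigma} v_j, \qquad
\sigma = \sum_{j \in S} a_j, \\
\lVert o - \hat{o} \rVert
&= \Bigl\lVert \sum_{j \notin S} a_j v_j
   - \sum_{j \in S} a_j \Bigl(\tfrac{1}{\sigma} - 1\Bigr) v_j \Bigr\rVert \\
&\le C \sum_{j \notin S} a_j
   + C \Bigl(\tfrac{1}{\sigma} - 1\Bigr) \sum_{j \in S} a_j \\
&= C(1 - \sigma) + C(1 - \sigma) = 2C(1 - \sigma).
\end{aligned}
```

Under these assumptions the bound depends on the eviction only through the retained attention mass $\sigma$. For a fixed budget, keeping the largest weights minimizes it, and summing the per-head terms suggests that heads with more dispersed weights gain more from extra budget.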

Load-bearing premise

The loss upper bound accurately captures the quality impact of eviction and attention heads show distinct enough patterns to benefit from unequal budgets.

What would settle it

Applying the head-wise adaptive allocation derived from the bound to new models or datasets and measuring no quality improvement or a drop relative to uniform allocation.

Original abstract

Large Language Models have excelled in various domains but face efficiency challenges due to the growing Key-Value (KV) cache required for long-sequence inference. Recent efforts aim to reduce KV cache size by evicting vast non-critical cache elements during runtime while preserving generation quality. However, these methods typically allocate compression budgets uniformly across all attention heads, ignoring the unique attention patterns of each head. In this paper, we establish a theoretical loss upper bound between pre- and post-eviction attention output, explaining the optimization target of prior cache eviction methods, while guiding the optimization of adaptive budget allocation. Base on this, we propose Ada-KV, the first head-wise adaptive budget allocation strategy. It offers plug-and-play benefits, enabling seamless integration with prior cache eviction methods. Extensive evaluations on 13 datasets from Ruler and 16 datasets from LongBench, all conducted under both question-aware and question-agnostic scenarios, demonstrate substantial quality improvements over existing methods. Our code is available at https://github.com/FFY0/AdaKV.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims to establish a theoretical loss upper bound between pre- and post-eviction attention outputs that explains the optimization target of prior uniform KV cache eviction methods and guides the design of head-wise adaptive budget allocation. Building on this, it proposes Ada-KV as the first such adaptive strategy, which integrates plug-and-play with existing eviction methods. Extensive experiments on 13 Ruler and 16 LongBench datasets under question-aware and question-agnostic settings report consistent quality gains over baselines.

Significance. If the bound holds and is reasonably tight, the work supplies a principled, mechanics-derived target for cache eviction that moves beyond uniform allocation, with the adaptive strategy offering practical efficiency gains. The broad dataset coverage and open-sourced code strengthen the empirical case and reproducibility. This could influence future KV cache designs by providing an independent theoretical grounding for non-uniform budgets.

major comments (2)
  1. [Theoretical Analysis] Theoretical bound section: the derivation of the loss upper bound from attention output mechanics is independent of final quality metrics, but the manuscript should explicitly show (e.g., via the per-head score distribution properties used in the proof) that adaptive budget allocation reduces the bound more than uniform allocation without additional unstated assumptions on tail behavior across heads. If the bound treats heads symmetrically inside the derivation, the theoretical motivation for Ada-KV would require further justification beyond the observed empirical gains.
  2. [§4] §4 (budget allocation algorithm): the post-hoc choices in computing per-head budgets are not fully detailed in how they interact with the bound; an ablation confirming that the adaptive rule minimizes the derived bound (rather than a proxy) would make the link between theory and method load-bearing.
minor comments (3)
  1. [Abstract] Abstract: 'Base on this' should read 'Based on this'.
  2. [Experiments] The manuscript would benefit from a short paragraph or table entry quantifying bound tightness (actual vs. upper-bound loss) on a representative subset of the evaluated datasets.
  3. [Method] Notation for the per-head budget variables should be introduced once and used consistently; current usage mixes symbols in the method description.
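
The tightness check requested in the second minor comment could be prototyped along the following lines. This is a sketch on synthetic data, not the authors' evaluation: it assumes a single head, renormalized weights after top-k eviction, an L1 output loss, and a bound of the form 2·C·(1 − σ), where σ is the retained attention mass and C is the largest L1 norm among the value vectors. That form follows from the triangle inequality under these assumptions and may not match the paper's exact statement. The function names are hypothetical.

```python
import math
import random

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def eviction_loss_and_bound(weights, values, budget):
    """Return (actual L1 loss, 2*C*(1 - sigma)) for top-`budget` eviction in one head."""
    # Keep the indices of the `budget` largest attention weights.
    keep = sorted(range(len(weights)), key=weights.__getitem__, reverse=True)[:budget]
    sigma = sum(weights[j] for j in keep)  # retained attention mass
    dim = len(values[0])
    # Pre-eviction output and renormalized post-eviction output.
    full = [sum(w * v[k] for w, v in zip(weights, values)) for k in range(dim)]
    pruned = [sum(weights[j] / sigma * values[j][k] for j in keep) for k in range(dim)]
    actual = sum(abs(a - b) for a, b in zip(full, pruned))
    c = max(sum(abs(x) for x in v) for v in values)  # C = max_j ||v_j||_1
    return actual, 2.0 * c * (1.0 - sigma)

# Synthetic head: random logits and random value vectors (illustrative only).
random.seed(0)
n, dim = 256, 16
weights = softmax([random.gauss(0.0, 3.0) for _ in range(n)])
values = [[random.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n)]
for budget in (16, 64, 128):
    actual, bound = eviction_loss_and_bound(weights, values, budget)
    print(f"budget={budget:3d}  actual={actual:.4f}  bound={bound:.4f}")
```

The ratio of the two printed columns gives a rough tightness indicator. On real attention maps it would have to be measured per layer and per head rather than on random inputs.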

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive evaluation and recommendation for minor revision. The comments raise valid points about strengthening the explicit connection between the theoretical bound and the adaptive allocation. We respond to each major comment below and indicate the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Theoretical Analysis] Theoretical bound section: the derivation of the loss upper bound from attention output mechanics is independent of final quality metrics, but the manuscript should explicitly show (e.g., via the per-head score distribution properties used in the proof) that adaptive budget allocation reduces the bound more than uniform allocation without additional unstated assumptions on tail behavior across heads. If the bound treats heads symmetrically inside the derivation, the theoretical motivation for Ada-KV would require further justification beyond the observed empirical gains.

    Authors: We thank the referee for this observation. The upper bound is derived separately for each attention head from the mechanics of its attention scores, and it is additive across heads because the final output is assembled from head-wise results. For a given retained budget, the per-head bound is a monotonic function of the dispersion and tail properties of that head's score distribution, so heads with heavier tails or greater dispersion contribute larger terms to the total bound. Allocating the global budget in proportion to these head-specific quantities therefore yields a strictly smaller total bound than uniform allocation at the same aggregate budget. No cross-head assumptions on tail behavior are required; the argument relies only on the observed heterogeneity in per-head score distributions, which is already used in the proof. In the revised manuscript we will insert a short paragraph immediately after the bound derivation that explicitly compares the total bound value under adaptive versus uniform allocation, using the per-head distribution properties to demonstrate the reduction. revision: yes

  2. Referee: [§4] §4 (budget allocation algorithm): the post-hoc choices in computing per-head budgets are not fully detailed in how they interact with the bound; an ablation confirming that the adaptive rule minimizes the derived bound (rather than a proxy) would make the link between theory and method load-bearing.

    Authors: We agree that the interaction between the allocation rule and the bound can be made more transparent. The per-head budgets in §4 are obtained by partitioning the total budget in proportion to the per-head upper-bound values estimated directly from each head's attention-score distribution; this is a closed-form rule that targets minimization of the sum of the per-head bounds. To make the link load-bearing, we will add a new ablation subsection that evaluates the numerical value of the derived loss upper bound itself (not a downstream quality metric) for Ada-KV, uniform allocation, and two alternative heuristics. The results will confirm that the adaptive rule produces the lowest bound value among the compared strategies under identical total budgets. revision: yes
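
The comparison promised in both responses, the bound value under adaptive versus uniform allocation at the same total budget, can be illustrated with a toy example. This sketch does not implement the proportional rule described above. It swaps in global top-k selection across all heads, which maximizes the total retained attention mass for a fixed total budget and so minimizes a summed bound of the form Σ 2·C·(1 − σ_i), assuming the constant C is shared across heads. The head distributions, the function names, and the shared-C assumption are all hypothetical.

```python
import heapq
import random

def uniform_budgets(num_heads, total_budget):
    """Split the total budget evenly (assumes it divides evenly)."""
    return [total_budget // num_heads] * num_heads

def adaptive_budgets(head_weights, total_budget):
    """Global top-k: keep the `total_budget` largest weights across all heads
    and count how many of them fall in each head."""
    flat = [(w, h) for h, ws in enumerate(head_weights) for w in ws]
    budgets = [0] * len(head_weights)
    for _, h in heapq.nlargest(total_budget, flat):
        budgets[h] += 1
    return budgets

def evicted_mass(head_weights, budgets):
    """Sum over heads of (1 - retained mass) under per-head top-k eviction."""
    total = 0.0
    for ws, b in zip(head_weights, budgets):
        total += 1.0 - sum(sorted(ws, reverse=True)[:b])
    return total

def normalize(xs):
    s = sum(xs)
    return [x / s for x in xs]

# Hypothetical heads: two sharply peaked, two nearly flat.
random.seed(0)
n = 64
heads = [normalize([random.random() ** p for _ in range(n)]) for p in (12, 8, 1, 0.5)]
total = 4 * 16
uni = uniform_budgets(len(heads), total)
ada = adaptive_budgets(heads, total)
print("uniform :", uni, round(evicted_mass(heads, uni), 4))
print("adaptive:", ada, round(evicted_mass(heads, ada), 4))
```

In this toy setting the adaptive split moves budget from the peaked heads to the flat ones, and its total evicted mass can never exceed that of the uniform split. Whether the same ordering holds for the paper's actual bound and allocation rule is what the proposed ablation would have to show.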

Circularity Check

0 steps flagged

Theoretical loss upper bound derived from attention mechanics provides independent grounding

Full rationale

The paper's central derivation establishes a loss upper bound between pre- and post-eviction attention outputs directly from the mechanics of attention computation. This bound is used to reinterpret prior uniform-eviction methods and to motivate head-wise adaptive budget allocation as an optimization step. No equations or claims reduce the bound or the adaptive strategy to fitted parameters, self-citations, or definitional equivalences; the bound is presented as a first-principles result that remains sensitive to per-head score distributions. Empirical results on Ruler and LongBench are reported separately as validation rather than as the source of the bound itself. The derivation chain is therefore self-contained against external attention mathematics.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on standard transformer attention mechanics and a derived bound; no new free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (1)
  • domain assumption: the attention output difference after eviction admits an upper bound that can guide per-head budget decisions.
    Invoked to justify the adaptive allocation over uniform budgets.

pith-pipeline@v0.9.0 · 5489 in / 1178 out tokens · 41158 ms · 2026-05-17T11:12:08.412231+00:00 · methodology

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HybridGen: Efficient LLM Generative Inference via CPU-GPU Hybrid Computing

    cs.PF 2026-04 unverdicted novelty 7.0

    HybridGen achieves 1.41x-3.2x average speedups over six prior KV cache methods for LLM inference by using attention logit parallelism, a feedback-driven scheduler, and semantic-aware KV cache mapping.

  2. Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

    cs.LG 2026-05 unverdicted novelty 6.0

    SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.

  3. Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.

  4. Compute Where it Counts: Self Optimizing Language Models

    cs.LG 2026-05 unverdicted novelty 6.0

    SOL trains a policy to dynamically control multiple efficiency mechanisms per token via group-relative policy optimization on teacher-forced episodes, yielding better quality at matched average budget than static or r...

  5. Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

    cs.LG 2026-05 unverdicted novelty 6.0

    A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

  6. ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing

    cs.CL 2026-05 conditional novelty 6.0

    ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...

  7. RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache ...

  8. Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

    cs.CL 2026-05 unverdicted novelty 6.0

    LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

  9. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.

  10. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...

  11. CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference

    cs.DC 2026-04 unverdicted novelty 6.0

    CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...

  12. OVGGT: O(1) Constant-Cost Streaming Visual Geometry Transformer

    cs.CV 2026-03 conditional novelty 6.0

    OVGGT achieves constant O(1) memory and compute for streaming 3D geometry reconstruction by using FFN-residual-based KV cache compression and dynamic anchor protection, matching state-of-the-art accuracy on long sequences.

  13. RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

    cs.LG 2026-02 conditional novelty 6.0

    RAT+ pretrains a single dense recurrent-augmented attention model that supports flexible dilated sparse inference after short adaptation, matching dense accuracy at moderate dilation and losing only 1-3 points at high...

  14. HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

    cs.DC 2026-04 unverdicted novelty 5.0

    HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

  15. StreamCacheVGGT: Streaming Visual Geometry Transformers with Robust Scoring and Hybrid Cache Compression

    cs.CV 2026-04 unverdicted novelty 5.0

    StreamCacheVGGT improves streaming 3D geometry reconstruction accuracy and stability under fixed memory by using cross-layer token importance scoring and hybrid cache compression instead of pure eviction.

  16. AudioKV: KV Cache Eviction in Efficient Large Audio Language Models

    cs.SD 2026-04 unverdicted novelty 5.0

    AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.

  17. Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs

    cs.CV 2026-04 unverdicted novelty 5.0

    A data-driven adaptive policy for KV-cache bit-width selection based on token importance features reduces decoding latency by ~18% and improves accuracy over static quantization while staying near FP16 levels on SmolL...

  18. From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

    cs.IR 2025-04 unverdicted novelty 5.0

    The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.

Reference graph

Works this paper leans on

69 extracted references · 69 canonical work pages · cited by 17 Pith papers · 16 internal anchors

  1. [1]

    A survey on recent advances in llm-based multi-turn dialogue systems

    Zihao Yi, Jiarui Ouyang, Yuwen Liu, Tianhao Liao, Zhe Xu, and Ying Shen. A survey on recent advances in llm-based multi-turn dialogue systems. arXiv preprint arXiv:2402.18013, 2024

  2. [2]

    Summedits: measuring llm ability at factual reasoning through the lens of summarization

    Philippe Laban, Wojciech Kry´sci´nski, Divyansh Agarwal, Alexander Richard Fabbri, Caiming Xiong, Shafiq Joty, and Chien-Sheng Wu. Summedits: measuring llm ability at factual reasoning through the lens of summarization. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 9662–9676, 2023

  3. [3]

    Llm-based code generation method for golang compiler testing

    Qiuhan Gu. Llm-based code generation method for golang compiler testing. In Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, pages 2201–2203, 2023

  4. [4]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  5. [5]

    The claude 3 model family: Opus, sonnet, haiku, March 2024

    Anthropic. The claude 3 model family: Opus, sonnet, haiku, March 2024. Accessed: 2024-07- 09

  6. [6]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean- baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  7. [7]

    Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

    Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023

  8. [8]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024

  9. [9]

    PyramidInfer: Pyramid KV cache compression for high-throughput LLM inference

    Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. PyramidInfer: Pyramid KV cache compression for high-throughput LLM inference. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Findings of the Association for Computational Linguistics ACL 2024, pages 3258–3270, Bangkok, Thailand and virtual meeting, August 2024. Association...

  10. [10]

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    Yichi Zhang, Bofei Gao, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024

  11. [11]

    SnapKV: LLM knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. SnapKV: LLM knows what you are looking for before generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024

  12. [12]

    Llm kv cache compression made easy, 2024

    Maximilian Jeblick Simon Jegou. Llm kv cache compression made easy, 2024

  13. [13]

    Catalyst: Optimizing cache management for large in-memory key-value systems

    Kefei Wang and Feng Chen. Catalyst: Optimizing cache management for large in-memory key-value systems. Proceedings of the VLDB Endowment, 16(13):4339–4352, 2023. 11

  14. [14]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

  15. [15]

    Lm-infinite: Zero-shot extreme length generalization for large language models, 2024

    Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Zero-shot extreme length generalization for large language models, 2024

  16. [16]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

  17. [17]

    Scissorhands: Exploiting the persistence of impor- tance hypothesis for llm kv cache compression at test time

    Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of impor- tance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024

  18. [18]

    On the efficacy of eviction policy for key-value constrained generative language model inference

    Siyu Ren and Kenny Q Zhu. On the efficacy of eviction policy for key-value constrained generative language model inference. arXiv preprint arXiv:2402.06262, 2024

  19. [19]

    Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention. Advances in Neural Information Processing Systems, 37:52481–52515, 2024

  20. [20]

    Mminference: Accelerating pre-filling for long-context vlms via modality-aware permutation sparse attention.arXiv preprint arXiv:2504.16083, 2025

    Yucheng Li, Huiqiang Jiang, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Amir H Abdi, Dongsheng Li, Jianfeng Gao, Yuqing Yang, et al. Mminference: Accelerating pre-filling for long-context vlms via modality-aware permutation sparse attention.arXiv preprint arXiv:2504.16083, 2025

  21. [21]

    Retrievalattention: Accelerating long- context llm inference via vector retrieval

    Di Liu, Meng Chen, Baotong Lu, Huiqiang Jiang, Zhenhua Han, Qianxi Zhang, Qi Chen, Chengruidong Zhang, Bailu Ding, Kai Zhang, et al. Retrievalattention: Accelerating long- context llm inference via vector retrieval. arXiv preprint arXiv:2409.10516, 2024

  22. [22]

    Arkvale: Efficient generative llm inference with recallable key-value eviction

    Renze Chen, Zhuofeng Wang, Beiquan Cao, Tong Wu, Size Zheng, Xiuhong Li, Xuechao Wei, Shengen Yan, Meng Li, and Yun Liang. Arkvale: Efficient generative llm inference with recallable key-value eviction. Advances in Neural Information Processing Systems, 37:113134– 113155, 2024

  23. [23]

    Pqcache: Product quantization-based kvcache for long context llm inference

    Hailin Zhang, Xiaodong Ji, Yilin Chen, Fangcheng Fu, Xupeng Miao, Xiaonan Nie, Weipeng Chen, and Bin Cui. Pqcache: Product quantization-based kvcache for long context llm inference. Proceedings of the ACM on Management of Data, 3(3):1–30, 2025

  24. [24]

    Breaking the boundaries of long- context llm inference: Adaptive kv management on a single commodity gpu

    He Sun, Li Li, Mingjun Xiao, and Chengzhong Xu. Breaking the boundaries of long- context llm inference: Adaptive kv management on a single commodity gpu. arXiv preprint arXiv:2506.20187, 2025

  25. [25]

    Deja vu: Contextual sparsity for efficient llms at inference time, 2023

    Zichang Liu, Jue Wang, Tri Dao, Tianyi Zhou, Binhang Yuan, Zhao Song, Anshumali Shrivas- tava, Ce Zhang, Yuandong Tian, Christopher Re, and Beidi Chen. Deja vu: Contextual sparsity for efficient llms at inference time, 2023

  26. [26]

    Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in Neural Information Processing Systems, 35:16344–16359, 2022

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness.Advances in Neural Information Processing Systems, 35:16344–16359, 2022

  27. [27]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023

  28. [28]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023

  29. [29]

    The llama 3 herd of models, 2024

    Aaron Grattafiori, Abhimanyu Dubey, and Abhinav Jauhri .et al. The llama 3 herd of models, 2024. 12

  30. [30]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  31. [31]

    Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023

    Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints, 2023

  32. [32]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models? arXiv preprint arXiv:2404.06654, 2024

  33. [33]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023

  34. [34]

    Prompt cache: Modular attention reuse for low-latency inference

    In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference. Proceedings of Machine Learning and Systems, 6:325–338, 2024

  35. [35]

    SGLang: Efficient Execution of Structured Language Model Programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. arXiv preprint arXiv:2312.07104, 2024

  36. [36]

    Kvzip: Query-agnostic kv cache compression with context reconstruction

    Jang-Hyun Kim, Jinuk Kim, Sangwoo Kwon, Jae W Lee, Sangdoo Yun, and Hyun Oh Song. Kvzip: Query-agnostic kv cache compression with context reconstruction. arXiv preprint arXiv:2505.23416, 2025

  37. [37]

    Expected attention: Kv cache compres- sion by estimating attention from future queries distribution

    Alessio Devoto, Maximilian Jeblick, and Simon Jégou. Expected attention: Kv cache compres- sion by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636, 2025

  38. [38]

    Draft-based approximate inference for llms

    Kevin Galim, Ethan Ewer, Wonjun Kang, Minjae Lee, Hyung Il Koo, and Kangwook Lee. Draft-based approximate inference for llms. arXiv preprint arXiv:2506.08373, 2025

  39. [39]

    Identify critical kv cache in llm inference from an output perturbation perspective

    Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Identify critical kv cache in llm inference from an output perturbation perspective. arXiv preprint arXiv:2502.03805, 2025

  40. [40]

    Kevin Zhou, and Xike Xie

    Yuan Feng, Haoyu Guo, JunLin Lv, S. Kevin Zhou, and Xike Xie. Taming the fragility of kv cache eviction in llm inference, 2025

  41. [41]

    Duoattention: Efficient long-context llm inference with retrieval and streaming heads

    Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. Duoattention: Efficient long-context llm inference with retrieval and streaming heads. arXiv preprint arXiv:2410.10819, 2024

  42. [42]

    Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning

    Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning. arXiv preprint arXiv:2410.19258, 2024

  43. [43]

    Zeroquant: Efficient and affordable post-training quantization for large-scale transformers

    Zhewei Yao, Reza Yazdani Aminabadi, Minjia Zhang, Xiaoxia Wu, Conglong Li, and Yuxiong He. Zeroquant: Efficient and affordable post-training quantization for large-scale transformers. Advances in Neural Information Processing Systems, 35:27168–27183, 2022

  44. [44]

    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024

  45. [45]

    Qaq: Quality adaptive quantization for llm kv cache

    Shichen Dong, Wen Cheng, Jiayu Qin, and Wei Wang. Qaq: Quality adaptive quantization for llm kv cache. arXiv preprint arXiv:2403.04643, 2024

  46. [46]

    Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding

    Hanshi Sun, Zhuoming Chen, Xinyu Yang, Yuandong Tian, and Beidi Chen. Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding. arXiv preprint arXiv:2404.11912, 2024. 13

  47. [47]

    Longspec: Long-context lossless speculative decoding with efficient drafting and verification

    Penghui Yang, Cunxiao Du, Fengzhuo Zhang, Haonan Wang, Tianyu Pang, Chao Du, and Bo An. Longspec: Long-context lossless speculative decoding with efficient drafting and verification. In ES-FoMo III: 3rd Workshop on Efficient Systems for Foundation Models

  48. [48]

    Specvlm: Enhancing speculative decoding of video llms via verifier-guided token pruning

    Yicheng Ji, Jun Zhang, Heming Xia, Jinpeng Chen, Lidan Shou, Gang Chen, and Huan Li. Specvlm: Enhancing speculative decoding of video llms via verifier-guided token pruning. arXiv preprint arXiv:2508.16201, 2025

  49. [49]

    Needle In A Haystack - pressure testing LLMs

    Gregory Kamradt. Needle In A Haystack - pressure testing LLMs. Github, 2023

  50. [50]

    Zoology: Measuring and improving recall in efficient language models

    Simran Arora, Sabri Eyuboglu, Aman Timalsina, Isys Johnson, Michael Poli, James Zou, Atri Rudra, and Christopher Ré. Zoology: Measuring and improving recall in efficient language models. In ICLR, 2024

  51. [51]

    The narrativeqa reading comprehension challenge

    Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328, 2018

  52. [52]

    A dataset of information-seeking questions and answers anchored in research papers

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011, 2021

  53. [53]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

  54. [54]

    Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Donia Scott, Nuria Bel, and Chengqing Zong, editors, Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, Barcelona, Spain (Online), December 2020. International...

  55. [55]

    Musique: Multihop questions via single-hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10:539–554, 2022

  56. [56]

    Efficient attentions for long document summarization

    Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summarization. arXiv preprint arXiv:2104.02112, 2021

  57. [57]

    Qmsum: A new benchmark for query-based multi-domain meeting summarization

    Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. Qmsum: A new benchmark for query-based multi-domain meeting summarization. arXiv preprint arXiv:2104.05938, 2021

  58. [58]

    Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model

    Alexander R Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R Radev. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. arXiv preprint arXiv:1906.01749, 2019

  59. [59]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension, 2017

  60. [60]

    Samsum corpus: A human-annotated dialogue dataset for abstractive summarization

    Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. arXiv preprint arXiv:1911.12237, 2019

  61. [61]

    Learning question classifiers

    Xin Li and Dan Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002

  62. [62]

    Longcoder: A long-range pre-trained language model for code completion

    Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. Longcoder: A long-range pre-trained language model for code completion, 2023

  63. [63]

    Repobench: Benchmarking repository-level code auto-completion systems

    Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems, 2023

  64. [64]

    Lost in the Middle: How Language Models Use Long Contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. arXiv preprint arXiv:2307.03172, 2023

  65. [65]

    Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop qa dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pages 6609–6625, 2020

  66. [66]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, 2017

  67. [67]

    Longcoder: A long-range pre-trained language model for code completion

    Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. Longcoder: A long-range pre-trained language model for code completion. arXiv preprint arXiv:2306.14893, 2023

  68. [68]

    RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

    Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091, 2023

A Appendix

A.1 Additional Related Works

Additional works also mitigate the challenges posed by massive KV Caches during long-sequence inference while not reducing the number of cache elements. These ...

  69. [69]

    unanswerable

    word-a 2. word-b 3. word-c 4. word-a 5. word-d 6. word-a 7. word-e 8. word-f ...... Question: What are the 10 most common words in the above list? Answer: The top 10 words that appear most often in the list are: Frequent Words Extraction (FWE) Task Template: Read the following coded text and track the frequency of each coded word. Find the three most freq...