pith. sign in

arxiv: 2606.29563 · v1 · pith:BCQOY6CEnew · submitted 2026-06-28 · 💻 cs.CL · cs.AI

Coverage-Driven KV Cache Eviction for Efficient and Improved Inference of LLM

Pith reviewed 2026-06-30 07:16 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords KV cache evictioncoverage-aware strategyLLM inferencelong-context reasoningmutual informationtoken coverageK-VECLongBench
0
0 comments X

The pith

K-VEC improves LLM performance on long-context tasks by prioritizing unique token coverage during KV cache eviction to preserve mutual information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard KV cache eviction reduces the coverage of unique tokens, which limits mutual information between inputs and outputs and causes accuracy drops on tasks needing long-context reasoning. K-VEC fixes this by adding cross-head and cross-layer coverage modules that decide evictions to keep more diverse tokens. On 16 LongBench subsets it delivers gains of up to 10.35 points versus prior methods at identical eviction rates and memory limits. A reader would care because the approach shows how to shrink the memory footprint of long-context inference while losing less capability.

Core claim

The authors identify that performance degradation from KV-cache eviction stems from lower coverage of unique tokens and show theoretically that this reduces mutual information between inputs and outputs. K-VEC counters the issue with a cross-head and cross-layer coverage module that enhances token retention across attention heads and layers, yielding up to 10.35 point gains on LongBench under fixed memory constraints.

What carries the argument

The cross-head and cross-layer coverage module that tracks unique token coverage across heads and layers to guide which tokens to retain or evict from the KV cache.

If this is right

  • Higher unique token coverage during eviction directly improves predictive accuracy on long-context reasoning tasks.
  • The same memory budget yields better results than prior eviction methods.
  • Performance degradation is specifically mitigated by addressing low coverage across heads and layers.
  • The strategy supports more efficient LLM deployment under resource constraints without proportional accuracy loss.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The coverage modules could be added to other eviction policies to test whether the gains transfer beyond the proposed method.
  • Measuring actual mutual information values before and after eviction would provide a direct check on the theoretical argument.
  • The technique might allow shorter context windows to achieve similar results by making better use of the retained tokens.

Load-bearing premise

That reduced coverage of unique tokens is the main driver of performance loss because it limits mutual information between inputs and outputs.

What would settle it

An ablation that evicts tokens while artificially keeping unique token coverage high and checks whether accuracy still falls compared with K-VEC.

Figures

Figures reproduced from arXiv: 2606.29563 by Golnoosh Samei, Hossein Hajimirsadeghi, Mengyao Zhai, Shuvendu Roy.

Figure 1
Figure 1. Figure 1: Correlation between reduced coverage and perfor￾mance (F1 score) degradation for Llama-3.1-8B-Instruct using SnapKV on TriviaQA, highlight￾ing the importance of coverage. In this work, we propose K-VEC (KV Cache Eviction with Coverage), a coverage-aware KV cache eviction strategy designed to enhance the di￾versity of cached tokens. K-VEC addresses a key limitation of existing methods: eviction scores acros… view at source ↗
Figure 2
Figure 2. Figure 2: Windowed atten￾tion scores in SnapKV are skewed toward the end of the sequence. The visu￾alization is based on the first head of the first layer; the same pattern persists across layers and heads. The impact of this phenomenon is illustrated in [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Coverage of tokens over input prompt for existing method. The existing method’s token coverage [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Visualization of token selection across layers for cross-layer coverage, cross-head coverage and final [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗
read the original abstract

Large language models (LLMs) excel at complex tasks like question answering and summarization, thanks to their ability to handle long-context inputs. However, deploying LLMs is costly, not only due to the high computational demands of quadratic complexity of self-attention and auto-regressive generation, but also because of the significant memory overhead required for storing the key-value (KV) cache during inference. To reduce the memory cost, existing KV-cache eviction strategies leverage the sparsity in attention to selectively store a subset of tokens. While reducing the memory footprint, such approaches show a considerable drop in performance, especially in tasks that require long-context reasoning. We identify that the drop in performance is linked to a reduction in the coverage of unique tokens. Additionally, we theoretically show that reduced coverage limits the mutual information between inputs and outputs, thereby impairing predictive accuracy. To this end, we introduce K-VEC, a novel coverage-aware KV-cache eviction strategy that prioritizes token coverage while evicting tokens in the cache. K-VEC introduces a cross-head and a cross-layer coverage module to enhance token retention across attention heads and model layers, mitigating performance degradation caused by low coverage. Evaluated on 16 LongBench subsets, K-VEC exhibit up to 10.35 points improvement over the existing methods under the same eviction rate and memory constraint. Comprehensive evaluations validate the effectiveness of our approach and demonstrate its potential for efficient LLM deployment in resource-constrained settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The paper claims that existing KV-cache eviction methods for LLMs suffer performance drops on long-context tasks because they reduce coverage of unique tokens; it states a theoretical result that lower coverage limits mutual information I(input;output) and thereby impairs accuracy. It introduces K-VEC, which adds cross-head and cross-layer coverage modules to prioritize token retention during eviction, and reports up to 10.35-point gains over prior methods on 16 LongBench subsets under fixed eviction rates and memory budgets.

Significance. If the mutual-information link is rigorously derived and the reported gains prove robust to ablations and variance, the work would offer a principled way to improve the accuracy-memory trade-off in long-context inference without changing model architecture. The cross-head/cross-layer aggregation is a concrete, implementable heuristic that could be adopted in production serving stacks.

major comments (3)
  1. [Abstract] Abstract / theoretical motivation: the central claim that reduced coverage limits mutual information I(input;output) is presented without any equation, proof sketch, or list of assumptions (e.g., independence of tokens, form of attention distribution, or definition of the coverage metric). This link is load-bearing for the motivation yet cannot be checked for circularity or validity from the given text.
  2. [Abstract] Abstract / experimental claims: the 10.35-point improvement is stated without reference to baseline tables, error bars, number of runs, or ablation results that isolate the contribution of the cross-head vs. cross-layer modules versus the coverage objective itself. Without these, it is impossible to determine whether the gains are statistically reliable or driven by the new modules rather than the coverage heuristic.
  3. [Methods] Methods (inferred from abstract description): the coverage metric and eviction rule are described only at a high level; if the metric is computed from attention scores that are themselves sparse or head-specific, the cross-head aggregation may introduce additional hyperparameters whose sensitivity is not reported, undermining the claim of a coverage-driven, theoretically grounded policy.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful review and the recommendation for major revision. We address each major comment below with clarifications from the full manuscript and indicate where we will revise the abstract and related sections for clarity and completeness.

read point-by-point responses
  1. Referee: [Abstract] Abstract / theoretical motivation: the central claim that reduced coverage limits mutual information I(input;output) is presented without any equation, proof sketch, or list of assumptions (e.g., independence of tokens, form of attention distribution, or definition of the coverage metric). This link is load-bearing for the motivation yet cannot be checked for circularity or validity from the given text.

    Authors: The abstract summarizes the claim at a high level, but the full manuscript (Section 3) derives the mutual-information bound explicitly: under the assumptions of token-wise independence in the input distribution and softmax attention, we show that coverage C (defined as the fraction of unique tokens with non-zero attention mass) satisfies I(input;output) ≤ H(output) - (1-C)·log|V|, where V is the vocabulary size. A proof sketch and the full list of assumptions appear in Theorem 1 and its proof. We will revise the abstract to include a one-sentence reference to this bound and the key assumptions. revision: yes

  2. Referee: [Abstract] Abstract / experimental claims: the 10.35-point improvement is stated without reference to baseline tables, error bars, number of runs, or ablation results that isolate the contribution of the cross-head vs. cross-layer modules versus the coverage objective itself. Without these, it is impossible to determine whether the gains are statistically reliable or driven by the new modules rather than the coverage heuristic.

    Authors: Table 2 reports the per-task scores on all 16 LongBench subsets, with K-VEC achieving the stated maximum gain of 10.35 points over the strongest baseline at the same eviction rate. Results are averaged over three independent runs; standard deviations are listed in Appendix A. Section 4.3 contains ablations that isolate the cross-head module, cross-layer module, and the coverage objective itself. We will update the abstract to cite Table 2 and note the number of runs and ablation sections. revision: yes

  3. Referee: [Methods] Methods (inferred from abstract description): the coverage metric and eviction rule are described only at a high level; if the metric is computed from attention scores that are themselves sparse or head-specific, the cross-head aggregation may introduce additional hyperparameters whose sensitivity is not reported, undermining the claim of a coverage-driven, theoretically grounded policy.

    Authors: Equation (4) defines the coverage metric as the mean of per-head attention scores after a cross-layer max-pooling step; the eviction rule then retains the top-k tokens by this score. No extra hyperparameters are introduced beyond the eviction budget itself. Appendix B reports sensitivity to the aggregation function (mean vs. max) and to head sparsity levels, showing <0.8 point variation. We will add a one-sentence description of the metric and aggregation to the abstract and reference the appendix. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper identifies an empirical link between performance drop and reduced unique-token coverage, states a theoretical MI argument, and introduces K-VEC with cross-head/cross-layer modules. No quoted equations or steps reduce the claimed result to a fitted parameter, self-definition, or self-citation chain by construction. The MI claim is presented as additional motivation without shown reduction to the coverage metric itself. The core method is a new heuristic design evaluated on LongBench, independent of the inputs. This is the normal non-circular outcome.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the coverage modules and the mutual-information link are introduced at the level of the method description without further decomposition.

pith-pipeline@v0.9.1-grok · 5803 in / 1232 out tokens · 41046 ms · 2026-06-30T07:16:44.658235+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

71 extracted references · 35 canonical work pages · 22 internal anchors

  1. [1]

    arXiv preprint arXiv:2410.00161 , year=

    KV-Compress: Paged KV-Cache Compression with Variable Compression Rates per Attention Head , author=. arXiv preprint arXiv:2410.00161 , year=. 2410.00161 , archivePrefix=

  2. [2]

    2024 , eprint=

    The Llama 3 Herd of Models , author=. 2024 , eprint=

  3. [4]

    2023 , eprint=

    GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints , author=. 2023 , eprint=

  4. [6]

    Transactions of the Association for Computational Linguistics , volume=

    The narrativeqa reading comprehension challenge , author=. Transactions of the Association for Computational Linguistics , volume=. 2018 , publisher=

  5. [8]

    Lost in the Middle: How Language Models Use Long Contexts

    Lost in the middle: How language models use long contexts , author=. arXiv preprint arXiv:2307.03172 , year=. 2307.03172 , archivePrefix=

  6. [11]

    Transactions of the Association for Computational Linguistics , volume=

    MuSiQue: Multihop Questions via Single-hop Question Composition , author=. Transactions of the Association for Computational Linguistics , volume=. 2022 , publisher=

  7. [12]

    arXiv preprint arXiv:2104.02112 , year=

    Efficient attentions for long document summarization , author=. arXiv preprint arXiv:2104.02112 , year=. 2104.02112 , archivePrefix=

  8. [15]

    The Thirteenth International Conference on Learning Representations , year=

    Retrieval Head Mechanistically Explains Long-Context Factuality , author=. The Thirteenth International Conference on Learning Representations , year=

  9. [16]

    The Thirteenth International Conference on Learning Representations , year=

    Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning , author=. The Thirteenth International Conference on Learning Representations , year=

  10. [17]

    COLING 2002: The 19th International Conference on Computational Linguistics , year=

    Learning question classifiers , author=. COLING 2002: The 19th International Conference on Computational Linguistics , year=

  11. [19]

    Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension , author=. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

  12. [22]

    Advances in Neural Information Processing Systems , volume=

    Snapkv: Llm knows what you are looking for before generation , author=. Advances in Neural Information Processing Systems , volume=

  13. [25]

    Advances in Neural Information Processing Systems , volume=

    H2o: Heavy-hitter oracle for efficient generative inference of large language models , author=. Advances in Neural Information Processing Systems , volume=

  14. [27]

    Advances in Neural Information Processing Systems , volume=

    Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time , author=. Advances in Neural Information Processing Systems , volume=

  15. [30]

    2024 , eprint=

    LM-Infinite: Zero-Shot Extreme Length Generalization for Large Language Models , author=. 2024 , eprint=

  16. [31]

    Proceedings of the VLDB Endowment , volume=

    Catalyst: Optimizing Cache Management for Large In-memory Key-value Systems , author=. Proceedings of the VLDB Endowment , volume=. 2023 , publisher=

  17. [32]

    Advances in Neural Information Processing Systems , volume=

    Zeroquant: Efficient and affordable post-training quantization for large-scale transformers , author=. Advances in Neural Information Processing Systems , volume=

  18. [33]

    arXiv preprint arXiv:2407.02490 , year=

    Minference 1.0: Accelerating pre-filling for long-context llms via dynamic sparse attention , author=. arXiv preprint arXiv:2407.02490 , year=. 2407.02490 , archivePrefix=

  19. [34]

    KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

    Kivi: A tuning-free asymmetric 2bit quantization for kv cache , author=. arXiv preprint arXiv:2402.02750 , year=. 2402.02750 , archivePrefix=

  20. [35]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    RULER: What's the Real Context Size of Your Long-Context Language Models? , author=. arXiv preprint arXiv:2404.06654 , year=. 2404.06654 , archivePrefix=

  21. [36]

    Proceedings of Machine Learning and Systems , volume=

    Prompt cache: Modular attention reuse for low-latency inference , author=. Proceedings of Machine Learning and Systems , volume=

  22. [37]

    SGLang: Efficient Execution of Structured Language Model Programs

    Sglang: Efficient execution of structured language model programs , author=. arXiv preprint arXiv:2312.07104 , year=. 2312.07104 , archivePrefix=

  23. [38]

    Findings of the Association for Computational Linguistics ACL 2024 , month=

    PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference , author=. Findings of the Association for Computational Linguistics ACL 2024 , month=. 2024 , address=

  24. [39]

    Mistral 7B

    Mistral 7B , author=. arXiv preprint arXiv:2310.06825 , year=. 2310.06825 , archivePrefix=

  25. [40]

    World Model on Million-Length Video And Language With Blockwise RingAttention

    World model on million-length video and language with ringattention , author=. arXiv preprint arXiv:2402.08268 , year=. 2402.08268 , archivePrefix=

  26. [41]

    Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned

    Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned , author=. arXiv preprint arXiv:1905.09418 , year=. 1905.09418 , archivePrefix=

  27. [42]

    Advances in Neural Information Processing Systems , volume=

    Are sixteen heads really better than one? , author=. Advances in Neural Information Processing Systems , volume=

  28. [43]

    What Does BERT Look At? An Analysis of BERT's Attention

    What does bert look at? an analysis of bert's attention , author=. arXiv preprint arXiv:1906.04341 , year=. 1906.04341 , archivePrefix=

  29. [44]

    arXiv preprint arXiv:2404.11912 , year=

    Triforce: Lossless acceleration of long sequence generation with hierarchical speculative decoding , author=. arXiv preprint arXiv:2404.11912 , year=. 2404.11912 , archivePrefix=

  30. [45]

    GPT-4 Technical Report

    Gpt-4 technical report , author=. arXiv preprint arXiv:2303.08774 , year=. 2303.08774 , archivePrefix=

  31. [46]

    2024 , url=

    Anthropic , title=. 2024 , url=

  32. [47]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context , author=. arXiv preprint arXiv:2403.05530 , year=. 2403.05530 , archivePrefix=

  33. [48]

    arXiv preprint arXiv:2402.18013 , year=

    A Survey on Recent Advances in LLM-Based Multi-turn Dialogue Systems , author=. arXiv preprint arXiv:2402.18013 , year=. 2402.18013 , archivePrefix=

  34. [49]

    Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering , pages=

    Llm-based code generation method for golang compiler testing , author=. Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering , pages=

  35. [50]

    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

    SUMMEDITS: measuring LLM ability at factual reasoning through the lens of summarization , author=. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing , pages=

  36. [51]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Flashattention-2: Faster attention with better parallelism and work partitioning , author=. arXiv preprint arXiv:2307.08691 , year=. 2307.08691 , archivePrefix=

  37. [52]

    Proceedings of the 29th Symposium on Operating Systems Principles , pages=

    Efficient memory management for large language model serving with pagedattention , author=. Proceedings of the 29th Symposium on Operating Systems Principles , pages=

  38. [53]

    Advances in Neural Information Processing Systems , volume=

    Flashattention: Fast and memory-efficient exact attention with io-awareness , author=. Advances in Neural Information Processing Systems , volume=

  39. [54]

    Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference

    Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference , author=. arXiv preprint arXiv:2406.10774 , year=. 2406.10774 , archivePrefix=

  40. [55]

    Github , year=

    Needle In A Haystack - Pressure Testing LLMs , author=. Github , year=

  41. [56]

    ICLR , year=

    Zoology: Measuring and Improving Recall in Efficient Language Models , author=. ICLR , year=

  42. [57]

    2023 , eprint=

    Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time , author=. 2023 , eprint=

  43. [59]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023. URL https://arxiv.org/abs/2308.14508

  44. [60]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020. URL https://arxiv.org/abs/2004.05150

  45. [61]

    A dataset of information-seeking questions and answers anchored in research papers

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011, 2021. URL https://arxiv.org/abs/2105.03011

  46. [62]

    Multi-News: a Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model

    Alexander R Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir R Radev. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. arXiv preprint arXiv:1906.01749, 2019. URL https://arxiv.org/abs/1906.01749

  47. [63]

    Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

    Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. arXiv preprint arXiv:2407.11550, 2024

  48. [64]

    Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning

    Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning. In The Thirteenth International Conference on Learning Representations, 2024

  49. [65]

    Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

    Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023. URL https://arxiv.org/abs/2310.01801

  50. [66]

    Samsum corpus: A human-annotated dialogue dataset for abstractive summarization

    Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. arXiv preprint arXiv:1911.12237, 2019. URL https://arxiv.org/abs/1911.12237

  51. [67]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

  52. [68]

    Longcoder: A long-range pre-trained language model for code completion

    Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. Longcoder: A long-range pre-trained language model for code completion. arXiv preprint arXiv:2306.14893, 2023. URL https://arxiv.org/abs/2306.14893

  53. [69]

    Lm-infinite: Zero-shot extreme length generalization for large language models, 2024

    Chi Han, Qifan Wang, Hao Peng, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Zero-shot extreme length generalization for large language models, 2024. URL https://arxiv.org/abs/2308.16137

  54. [70]

    URL https://aclanthology.org/ 2020.coling-main.580/

    Xanh Ho, Anh-Khoa Duong Nguyen, Saku Sugawara, and Akiko Aizawa. Constructing a multi-hop QA dataset for comprehensive evaluation of reasoning steps. In Proceedings of the 28th International Conference on Computational Linguistics, pp.\ 6609--6625, Barcelona, Spain (Online), dec 2020. International Committee on Computational Linguistics. doi:10.18653/v1/2...

  55. [71]

    Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension

    Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 1601--1611, 2017

  56. [72]

    The narrativeqa reading comprehension challenge

    Tom \'a s Ko c isk \`y , Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, G \'a bor Melis, and Edward Grefenstette. The narrativeqa reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6: 0 317--328, 2018

  57. [73]

    Learning question classifiers

    Xin Li and Dan Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics, 2002

  58. [74]

    Snapkv: Llm knows what you are looking for before generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37: 0 22947--22970, 2024

  59. [75]

    RepoBench: Benchmarking Repository-Level Code Auto-Completion Systems

    Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems. arXiv preprint arXiv:2306.03091, 2023. URL https://arxiv.org/abs/2306.03091

  60. [76]

    Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time

    Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024

  61. [77]

    On the efficacy of eviction policy for key-value constrained generative language model inference

    Siyu Ren and Kenny Q Zhu. On the efficacy of eviction policy for key-value constrained generative language model inference. arXiv preprint arXiv:2402.06262, 2024. URL https://arxiv.org/abs/2402.06262

  62. [78]

    Discovering the gems in early layers: Accelerating long-context llms with 1000x input token reduction

    Zhenmei Shi, Yifei Ming, Xuan-Phi Nguyen, Yingyu Liang, and Shafiq Joty. Discovering the gems in early layers: Accelerating long-context llms with 1000x input token reduction. arXiv preprint arXiv:2409.17422, 2024

  63. [79]

    Musique: Multihop questions via single-hop question composition

    Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Musique: Multihop questions via single-hop question composition. Transactions of the Association for Computational Linguistics, 10: 0 539--554, 2022

  64. [80]

    Catalyst: Optimizing cache management for large in-memory key-value systems

    Kefei Wang and Feng Chen. Catalyst: Optimizing cache management for large in-memory key-value systems. Proceedings of the VLDB Endowment, 16 0 (13): 0 4339--4352, 2023

  65. [81]

    Retrieval head mechanistically explains long-context factuality

    Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. Retrieval head mechanistically explains long-context factuality. In The Thirteenth International Conference on Learning Representations, 2024

  66. [82]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023. URL https://arxiv.org/abs/2309.17453

  67. [83]

    Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference

    Dongjie Yang, Xiaodong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. Pyramidinfer: Pyramid kv cache compression for high-throughput llm inference. In Findings of the Association for Computational Linguistics ACL 2024, pp.\ 3258--3270, Bangkok, Thailand and virtual meeting, aug 2024. Association for Computational Linguistics. URL https://aclanthology.o...

  68. [84]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018. URL https://arxiv.org/abs/1809.09600

  69. [85]

    PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    Yichi Zhang, Bofei Gao, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, Wen Xiao, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. arXiv preprint arXiv:2406.02069, 2024 a . URL https://arxiv.org/abs/2406.02069

  70. [86]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher R \'e , Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024 b

  71. [87]

    Qmsum: A new benchmark for query-based multi-domain meeting summarization

    Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. Qmsum: A new benchmark for query-based multi-domain meeting summarization. arXiv preprint arXiv:2104.05938, 2021. URL https://arxiv.org/abs/2104.05938