GRKV: Global Regression for Training-Free KV Cache Compression in Long-Context LLMs

Haoyi Wu; Jialong Han; Jianhuang Lai; Junjie Peng; Kewei Tu; Xiaohua Xie; You Wu

arxiv: 2605.31105 · v1 · pith:6PLDTK63new · submitted 2026-05-29 · 💻 cs.CL

Junjie Peng, You Wu, Haoyi Wu, Jialong Han, Xiaohua Xie, Kewei Tu, Jianhuang Lai

Pith reviewed 2026-06-28 22:52 UTC · model grok-4.3

classification 💻 cs.CL

keywords KV cache compression · long-context LLMs · ridge regression · token merging · attention output matching · training-free compression · LongBench · RULER benchmark

The pith

Ridge regression merges KV cache tokens to cut memory use while raising long-context accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models store prior tokens in a key-value cache whose size grows with context length and quickly exhausts memory. Span-based eviction keeps coherent blocks but forces later merges onto boundary tokens, creating an unbalanced pattern that loses information. GRKV counters this by solving a ridge-regression problem that chooses merge weights to make the compressed cache produce attention outputs as close as possible to the full cache. The regression step spreads evicted-token information evenly across retained tokens and adds regularization to limit over-smoothing. On LongBench and RULER the method is reported to be the only merging approach that lifts overall scores rather than lowering them, all without any training.
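The imbalance described above can be illustrated with a toy count. The sketch below is not the paper's procedure: it assumes a purely positional "nearest retained token" merge rule (the figures mention key-similarity matching and a local rule, but the exact rule is not given here), and the sequence length, span positions, and the helper name `merge_loads` are all hypothetical.

```python
import numpy as np

def merge_loads(n_tokens, retained):
    """Count how many evicted tokens each retained token absorbs when every
    evicted token merges into its positionally nearest retained token.
    This local rule is an assumption for illustration; ties go to the
    earlier retained token."""
    retained = np.asarray(sorted(retained))
    evicted = np.setdiff1d(np.arange(n_tokens), retained)
    # Distance from every evicted position to every retained position.
    dist = np.abs(evicted[:, None] - retained[None, :])
    carrier = dist.argmin(axis=1)  # index into `retained`
    return retained, np.bincount(carrier, minlength=len(retained))

n = 64
span_keep = list(range(8, 16)) + list(range(40, 48))  # two contiguous spans
token_keep = list(range(0, n, 4))                      # 16 scattered tokens

_, span_loads = merge_loads(n, span_keep)
_, token_loads = merge_loads(n, token_keep)

print("span-based  max load:", span_loads.max(),
      "| carriers used:", int((span_loads > 0).sum()))
print("token-based max load:", token_loads.max(),
      "| carriers used:", int((token_loads > 0).sum()))
```

In this toy setup both schemes keep 16 of 64 tokens, yet span-based retention routes all 48 evicted tokens onto the 4 span-boundary tokens, while scattered retention spreads them over all 16 retained tokens.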

Core claim

GRKV is a training-free KV-cache merging method that directly minimizes the discrepancy between compressed-cache and full-cache attention outputs. It uses ridge-regression-based merge steps to distribute information from evicted tokens across retained tokens while regularizing the updates to prevent over-smoothing. Across the LongBench and RULER long-context benchmarks, GRKV is the only merging method that improves overall performance with minimal overhead.

What carries the argument

Ridge-regression merge steps that solve for coefficients minimizing attention-output discrepancy between compressed and full caches.
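To make the idea concrete, here is a minimal NumPy sketch of one possible ridge-regression merge step. The abstract does not give the exact loss, so everything specific is an assumption: the sketch keeps the retained keys fixed, updates only the retained values, omits the output projection, penalizes the squared size of the value update with a hypothetical weight `lam`, and uses a hypothetical function name `ridge_merge_values`.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # subtract max for stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention_output(Q, K, V):
    """Single-head attention output (output projection omitted)."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def ridge_merge_values(Q, K_full, V_full, keep_idx, lam=1.0):
    """Assumed objective (not taken from the paper):
        min_V' ||A_ret V' - O_full||^2 + lam * ||V' - V_ret||^2
    where A_ret is the attention of Q over the retained keys and O_full is
    the full-cache attention output for the same queries."""
    K_ret, V_ret = K_full[keep_idx], V_full[keep_idx]
    A_ret = softmax(Q @ K_ret.T / np.sqrt(Q.shape[-1]))   # (m, c)
    target = attention_output(Q, K_full, V_full)          # (m, d_v)
    residual = target - A_ret @ V_ret
    gram = A_ret.T @ A_ret + lam * np.eye(len(keep_idx))  # (c, c)
    delta = np.linalg.solve(gram, A_ret.T @ residual)     # closed-form ridge
    return V_ret + delta

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    Q = rng.normal(size=(16, 8))    # observed queries
    K = rng.normal(size=(64, 8))    # full-cache keys
    V = rng.normal(size=(64, 8))    # full-cache values
    keep = np.arange(0, 64, 4)      # hypothetical retained indices
    V_new = ridge_merge_values(Q, K, V, keep, lam=0.1)
    full = attention_output(Q, K, V)
    print("error, eviction only:", np.linalg.norm(attention_output(Q, K[keep], V[keep]) - full))
    print("error, ridge merge  :", np.linalg.norm(attention_output(Q, K[keep], V_new) - full))
```

Under this assumed objective the error on the fitting queries cannot exceed the eviction-only error, because leaving the values unchanged is itself a feasible solution. The penalty on the update is one simple way to express "regularizing the updates"; the paper's actual regularizer may differ.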

If this is right

Span retention plus merging can be made balanced enough to avoid the usual performance penalty.
Information from discarded tokens can be redistributed without retraining the model.
Regularized regression keeps the merge from collapsing distinct token representations.
The added computation stays small enough to preserve the memory savings of compression.
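For a sense of the memory at stake, the following back-of-the-envelope sketch computes KV-cache size from standard transformer bookkeeping. The model configuration (32 layers, 8 KV heads, head dimension 128, 16-bit elements) and the 1,024-token budget are hypothetical and are not taken from the paper; only the 16K and 96K context lengths echo the range reported in Figure 4.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_elem=2, batch=1):
    """Bytes needed to cache keys and values for `seq_len` tokens.
    All defaults are hypothetical, chosen only for illustration."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch

GIB = 1024 ** 3
MIB = 1024 ** 2

for ctx in (16 * 1024, 96 * 1024):
    print(f"{ctx:>6} tokens, full cache: {kv_cache_bytes(ctx) / GIB:.1f} GiB")

budget = 1024  # hypothetical fixed per-head token budget after compression
print(f"{budget:>6} tokens, compressed: {kv_cache_bytes(budget) / MIB:.0f} MiB")
```

With these assumed numbers each token costs 128 KiB of cache, so a fixed budget keeps the cache constant while the full cache grows linearly with context length.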

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same regression idea could be tried on activation compression or other memory-bound stages of inference.
Combining the merge with existing quantization schemes might compound the memory reduction.
The approach may scale to contexts longer than those tested if the regression remains stable.

Load-bearing premise

Directly minimizing attention-output discrepancy via ridge regression will produce better end-task accuracy without introducing other unmeasured distortions.

What would settle it

Apply the GRKV merge on a held-out long-context benchmark and observe that its scores fall below those of a simple eviction baseline.

Figures

Figures reproduced from arXiv: 2605.31105 by Haoyi Wu, Jialong Han, Jianhuang Lai, Junjie Peng, Kewei Tu, Xiaohua Xie, You Wu.

**Figure 1.** KV merge map on NARRATIVEQA. Using the same key-similarity-based matching strategy as D2O/KVMerger, we merge 250 evicted tokens into the 244 tokens retained by SnapKV. Each point represents a matched (evicted, retained) token pair, and the color indicates the cosine similarity between their keys. Black bars mark the retained-token spans along the x-axis.

**Figure 2.** Overview of how eviction granularity reshapes merge assignments. (a) Token-based retention with a local merge rule, in which evicted tokens are merged into nearby retained tokens, producing relatively dispersed assignments. (b) Span-based retention with the same local merge rule, which concentrates many evicted tokens onto a small set of boundary carrier tokens. (c) Span-based retention with the global mer…

**Figure 3.** Cross-window consistency of window-level attention outputs on HOTPOTQA. Each cell reports the cosine similarity between two window summaries $s_W$ computed from full-cache attention outputs. […]ducing a reduced cache $K_{\mathrm{Ret}}, V_{\mathrm{Ret}} \in \mathbb{R}^{c \times d}$ with $c < n$. For a query $q_i$, the full-cache attention output is $o_i^{(\mathrm{full})} = A_i^{(\mathrm{full})} V_{\mathrm{full}} W_O$, where $A_i^{(\mathrm{full})} = \mathrm{softmax}(q_i K_{\mathrm{full}}^{\top} / \sqrt{d})$. Using the retained cache, the output b…

**Figure 4.** Peak memory, TTFT (time to first token), and decoding latency measured across 16K–96K context lengths.

**Figure 5.** Additional KV merge maps on HOTPOTQA. We visualize key-similarity merge assignments (evicted → retained) after SnapKV’s span-based retention across multiple layers and heads. Each point denotes a matched token pair, and the color indicates the cosine similarity between their key vectors. The black bars on the x-axis mark the retained spans. The left and right columns correspond to two representative contex…

**Figure 6.** Additional cross-window consistency on NARRATIVEQA. Each heatmap cell reports the cosine similarity between two window summaries $s_W$ computed from full-cache attention outputs at a specific layer and head. We plot several representative heads from middle and later layers. Overall, the matrices are dominated by high-similarity regions (values close to one), indicating that window-level attention outputs are …

**Figure 7.** TTFT and decoding latency on two A6000 GPUs across 16K–32K context lengths and batch sizes 1–3.

Original abstract

Large language models (LLMs) with extended context lengths rely on the key-value (KV) cache to support attention over prior tokens. However, maintaining the KV cache incurs substantial memory overhead, motivating KV-cache compression methods that enforce a fixed budget through eviction and merging. Modern eviction methods increasingly adopt span-based retention because preserving contiguous spans is empirically effective and better preserves semantic coherence. Yet, when combined with post-eviction merging, span-based retention concentrates merges onto a small set of span-boundary carrier tokens, producing a highly imbalanced merge pattern that exacerbates over-merging and increases information loss. To address this imbalance, we propose GRKV (Global Regression for KV Cache), a training-free KV-cache merging method that directly minimizes the discrepancy between compressed-cache and full-cache attention outputs. GRKV uses ridge-regression-based merge steps to distribute information from evicted tokens across retained tokens, while regularizing the updates to prevent over-smoothing. Across the LongBench and RULER long-context benchmarks, GRKV is the only merging method that improves overall performance with minimal overhead.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GRKV applies ridge regression to balance merges after span retention in KV cache compression, but query dependence and thin evidence on gains are the main things to watch.

The paper introduces GRKV, a training-free method that uses ridge regression to merge KV cache entries after span-based retention. The goal is to fix the imbalance where merges concentrate on a few boundary tokens, leading to over-merging.

The novelty lies in applying a global regression that minimizes the difference between compressed-cache and full-cache attention outputs, with regularization to prevent over-smoothing. The method builds on span-based retention and adds this step to distribute evicted-token information more evenly.

The paper does well in identifying a specific practical issue in current KV compression techniques and offering a concrete, training-free solution. The claim that it is the only merging method improving performance on LongBench and RULER is notable if supported.

Soft spots include a potential mismatch with future queries. The regression is fitted to current attention outputs, so the merged representations may serve later queries less well, and the approximation error could grow as generation proceeds. The abstract provides no numbers, error bars, or detailed ablations, so the size of the improvement cannot be judged from the summary alone. The concern about generalization to future queries seems valid and should be checked against the full paper.

The ridge-regression math looks straightforward, but the paper needs to show that the data-dependent fitting is independent of the reported performance metric.

This is for researchers focused on efficient long-context LLM inference. It would be useful for those looking for practical compression techniques.

It deserves peer review to examine the experimental results and address the generalization issue.

Recommendation: Yes, it should go to peer review rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper proposes GRKV, a training-free KV-cache merging method for long-context LLMs. It identifies that span-based retention plus post-eviction merging produces imbalanced merges concentrated on span-boundary tokens, and addresses this by applying ridge-regression merge steps that directly minimize the discrepancy between compressed-cache and full-cache attention outputs while regularizing to avoid over-smoothing. The central empirical claim is that GRKV is the only merging method that improves overall performance on the LongBench and RULER benchmarks with minimal overhead.

Significance. If the empirical gains are shown to arise from a generalizable approximation rather than query-specific fitting, the method would offer a practical, training-free route to KV-cache compression that preserves more semantic coherence than prior eviction-plus-merge baselines. The explicit use of a global regression objective with regularization is a clear technical contribution over purely heuristic merging rules.

major comments (2)

[Method / Abstract claim] Method description (ridge-regression objective): the merge coefficients are obtained by regressing on attention outputs computed with the current query (or snapshot of recent queries) against the same cache being compressed. Because subsequent tokens employ different queries, the single-step discrepancy objective does not constrain approximation error growth; the manuscript must demonstrate that the resulting merged representations remain effective for future queries, or the translation from per-step attention matching to end-task gains is not guaranteed.
[Abstract / Results] Abstract and results: the claim that GRKV is 'the only merging method that improves overall performance' is asserted without reported quantitative deltas, standard deviations, or per-task breakdowns in the supplied text. If the full manuscript likewise omits ablations that isolate the regression step from the evaluation queries, the independence of the fitting procedure from the reported metric remains unshown and directly affects the strength of the central claim.

minor comments (1)

[Abstract] The abstract states the ridge-regression objective but does not specify the exact loss formulation, the choice of regularization parameter, or how many queries are used to form the regression matrix; adding these details would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. Below we address each major comment in turn, indicating planned revisions where appropriate.

Referee: [Method / Abstract claim] Method description (ridge-regression objective): the merge coefficients are obtained by regressing on attention outputs computed with the current query (or snapshot of recent queries) against the same cache being compressed. Because subsequent tokens employ different queries, the single-step discrepancy objective does not constrain approximation error growth; the manuscript must demonstrate that the resulting merged representations remain effective for future queries, or the translation from per-step attention matching to end-task gains is not guaranteed.

Authors: We agree that the per-step regression is computed with respect to the query (or recent-query snapshot) present at the compression step. The ridge objective is nevertheless formulated globally over the full retained set rather than locally per token, which we expect to yield more stable approximations. To substantiate generalization, the revision will add (i) a plot of attention-output discrepancy measured on held-out future queries after each merge step and (ii) an ablation that continues generation for several hundred tokens post-compression while tracking both attention fidelity and downstream accuracy. (revision: partial)
Referee: [Abstract / Results] Abstract and results: the claim that GRKV is 'the only merging method that improves overall performance' is asserted without reported quantitative deltas, standard deviations, or per-task breakdowns in the supplied text. If the full manuscript likewise omits ablations that isolate the regression step from the evaluation queries, the independence of the fitting procedure from the reported metric remains unshown and directly affects the strength of the central claim.

Authors: We accept that the abstract statement requires quantitative backing. The full manuscript already reports per-task LongBench and RULER scores; the revision will (a) insert concrete overall deltas and standard deviations into both the abstract and results tables, and (b) add an explicit ablation that performs the ridge regression on a disjoint set of queries from those used at evaluation time, thereby isolating the contribution of the regression objective itself. (revision: yes)
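The held-out-query check discussed in this exchange can be sketched in a few lines. This is only an illustration under stated assumptions: the ridge step below is an assumed formulation (fixed retained keys, updated values, update penalty `lam`), the data are random rather than drawn from a model, and the names `fit_ridge_values` and `heldout_report` are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attn(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def fit_ridge_values(Q_fit, K, V, keep, lam):
    """Assumed ridge step: update retained values so that attention over the
    retained keys matches full-cache outputs on the fitting queries."""
    A = softmax(Q_fit @ K[keep].T / np.sqrt(Q_fit.shape[-1]))
    residual = attn(Q_fit, K, V) - A @ V[keep]
    delta = np.linalg.solve(A.T @ A + lam * np.eye(len(keep)), A.T @ residual)
    return V[keep] + delta

def heldout_report(Q_fit, Q_eval, K, V, keep, lam=0.1):
    """Attention-output error before and after merging, measured on the
    fitting queries and on a disjoint set of evaluation queries."""
    V_new = fit_ridge_values(Q_fit, K, V, keep, lam)

    def err(Q, V_ret):
        return float(np.linalg.norm(attn(Q, K[keep], V_ret) - attn(Q, K, V)))

    return {
        "fit_before": err(Q_fit, V[keep]), "fit_after": err(Q_fit, V_new),
        "eval_before": err(Q_eval, V[keep]), "eval_after": err(Q_eval, V_new),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, V = rng.normal(size=(64, 8)), rng.normal(size=(64, 8))
    Q_fit, Q_eval = rng.normal(size=(16, 8)), rng.normal(size=(16, 8))
    for name, value in heldout_report(Q_fit, Q_eval, K, V, np.arange(0, 64, 4)).items():
        print(f"{name:>11}: {value:.3f}")
```

Only the fitting-set error is guaranteed not to increase under this assumed objective. Whether the evaluation-set error also drops is exactly the empirical question the referee raises, and random data like this says nothing about how real model queries behave.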

Circularity Check

0 steps flagged

No significant circularity; method is a defined heuristic with separate empirical validation

full rationale

The paper defines GRKV as a training-free procedure that applies ridge regression to merge weights specifically to minimize the per-step discrepancy between compressed-cache and full-cache attention outputs. This minimization is an explicit design choice in the algorithm, not a derived prediction of downstream benchmark scores. The reported gains on LongBench and RULER are presented as experimental outcomes measured after applying the method, not as quantities that reduce algebraically or statistically to the regression objective by construction. No self-citations, uniqueness theorems, or ansatzes are invoked in the abstract to justify core steps, and the fitting uses only the current cache state rather than the final task metric. The derivation chain therefore remains self-contained as a proposed compression heuristic supported by independent benchmark results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient technical detail to enumerate free parameters, axioms, or invented entities; the ridge regression step implies at least one regularization hyperparameter whose value is not stated.
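The role of that unstated hyperparameter can be seen in a small sweep. The sketch below uses a generic ridge solve on synthetic data; the matrix shapes, the row-stochastic stand-in for attention weights, and the candidate values of `lam` are all assumptions made for illustration, not settings from the paper.

```python
import numpy as np

def ridge_update(A, R, lam):
    """Closed-form ridge solution of min_D ||A D - R||^2 + lam * ||D||^2."""
    return np.linalg.solve(A.T @ A + lam * np.eye(A.shape[1]), A.T @ R)

def sweep(A, R, lams):
    """For each lam, return (lam, update norm, remaining fit residual)."""
    rows = []
    for lam in lams:
        delta = ridge_update(A, R, lam)
        rows.append((lam,
                     float(np.linalg.norm(delta)),
                     float(np.linalg.norm(A @ delta - R))))
    return rows

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.random((32, 16))
    A /= A.sum(axis=1, keepdims=True)   # row-stochastic, like attention weights
    R = rng.normal(size=(32, 8))        # synthetic residual to be explained
    for lam, upd, res in sweep(A, R, [1e-3, 1e-2, 1e-1, 1.0, 10.0]):
        print(f"lam={lam:<6g} update norm={upd:9.3f}  residual={res:7.3f}")
```

A larger `lam` shrinks the update applied to the retained entries and leaves more of the residual unexplained, so the value chosen trades attention fidelity against how strongly retained tokens are altered.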

Reference graph

Works this paper leans on

37 extracted references · 9 canonical work pages

[1]

Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, and 1 others. 2024. https://arxiv.org/abs/2412.08905 Phi-4 technical report. arXiv preprint arXiv:2412.08905

Pith/arXiv arXiv 2024
[2]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others. 2023. https://arxiv.org/abs/2303.08774 Gpt-4 technical report . arXiv preprint arXiv:2303.08774

Pith/arXiv arXiv 2023
[3]

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. https://doi.org/10.18653/v1/2024.acl-long.172 LongBench: A bilingual, multitask benchmark for long context understanding. In Proceedings of the 62nd Annual Meeting of the Association for ...

work page doi:10.18653/v1/2024.acl-long.172 2024
[4]

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and 1 others. 2024. https://arxiv.org/abs/2406.02069 Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling . arXiv preprint arXiv:2406.02069

Pith/arXiv arXiv 2024
[5]

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S Abdelfattah, and Kai-Chiang Wu. 2024. https://arxiv.org/abs/2407.21118 Palu: Compressing kv-cache with low-rank projection . arXiv preprint arXiv:2407.21118

arXiv 2024
[6]

Wanyun Cui and Mingwei Xu. 2026. https://proceedings.neurips.cc/paper_files/paper/2025/file/750b0f9fccafad88e0da366315e03d1a-Paper-Conference.pdf Homogeneous keys, heterogeneous values: Exploiting local kv cache asymmetry for long-context llms . Advances in Neural Information Processing Systems, 38:81628--81650

2026
[7]

Tri Dao. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/98ed250b203d1ac6b24bbcf263e3d4a7-Paper-Conference.pdf Flashattention-2: Faster attention with better parallelism and work partitioning . In International Conference on Learning Representations, volume 2024, pages 35549--35562

2024
[8]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/67d57c32e20fd0a7a302cb81d36e40d5-Paper-Conference.pdf Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in neural information processing systems, 35:16344--16359

2022
[9]

Alessio Devoto, Maximilian Jeblick, and Simon Jégou. 2025. https://arxiv.org/abs/2510.00636 Expected attention: Kv cache compression by estimating attention from future queries distribution. arXiv preprint arXiv:2510.00636

arXiv 2025
[10]

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. 2025. https://arxiv.org/abs/2502.03805 Identify critical kv cache in llm inference from an output perturbation perspective . arXiv preprint arXiv:2502.03805

Pith/arXiv arXiv 2025
[11]

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. 2026. https://proceedings.neurips.cc/paper_files/paper/2025/file/a40ff56daab9f4808b1e18350c8a11ce-Paper-Conference.pdf Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference . Advances in Neural Information Processing Systems, 38:113152--113188

2026
[12]

Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/f649556471416b35e60ae0de7c1e3619-Paper-Conference.pdf Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning . In International Conference on Learning Representations, volume ...

2025
[13]

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/639a9a172c044fbb64175b5fad42e9a5-Paper-Conference.pdf Model tells you what to discard: Adaptive kv cache compression for llms . In International Conference on Learning Representations, volume 2024, pages 22975--22988

2024
[14]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3 herd of models . arXiv preprint arXiv:2407.21783

Pith/arXiv arXiv 2024
[15]

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. https://arxiv.org/abs/2404.06654 Ruler: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654

Pith/arXiv arXiv 2024
[16]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. https://arxiv.org/ab...

Pith/arXiv arXiv 2023
[17]

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. https://doi.org/10.1162/tacl_a_00023 The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317--328

work page doi:10.1162/tacl_a_00023 2018
[18]

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. https://doi.org/10.52202/079017-0722 Snapkv: Llm knows what you are looking for before generation . Advances in Neural Information Processing Systems, 37:22947--22970

work page doi:10.52202/079017-0722 2024
[19]

Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. 2024. https://doi.org/10.52202/079017-4443 Minicache: Kv cache compression in depth dimension for large language models . Advances in Neural Information Processing Systems, 37:139997--140031

work page doi:10.52202/079017-4443 2024
[20]

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/a452a7c6c463e4ae8fbdc614c6e983e6-Paper-Conference.pdf Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time ...

2023
[21]

Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, and Roy Schwartz. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1043 Transformers are multi-state RNNs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18724--18741, Miami, Florida, USA. Association for Computational Linguistics

work page doi:10.18653/v1/2024.emnlp-main.1043 2024
[22]

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. https://doi.org/10.18653/v1/N19-4009 fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48...

work page doi:10.18653/v1/n19-4009 2019
[23]

Siyu Ren and Kenny Q Zhu. 2024. https://arxiv.org/abs/2402.06262 On the efficacy of eviction policy for key-value constrained generative language model inference . arXiv preprint arXiv:2402.06262

arXiv 2024
[24]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, and 1 others. 2024. https://arxiv.org/abs/2408.00118 Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118

Pith/arXiv arXiv 2024
[25]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, and 1 others. 2023a. https://arxiv.org/abs/2302.13971 Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

Pith/arXiv arXiv 2023
[26]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023b. https://arxiv.org/abs/2307.09288 Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

Pith/arXiv arXiv 2023
[27]

Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, and 1 others. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/d862f7f5445255090de13b825b880d59-Paper-Conference.pdf D2o: Dynamic discriminative operations for efficient long-context inference of large language models . In Inter...

2025
[28]

Yixuan Wang, Haoyu Qiao, Lujun Li, Qingfu Zhu, and Wanxiang Che. 2025. https://arxiv.org/abs/2508.16134 Commonkv: Compressing kv cache with cross-layer parameter sharing . arXiv preprint arXiv:2508.16134

arXiv 2025
[29]

Zheng Wang, Boxiao Jin, Zhongzhi Yu, and Minjia Zhang. 2024. https://arxiv.org/abs/2407.08454 Model tells you where to merge: Adaptive kv cache merging for llms on long-context tasks . arXiv preprint arXiv:2407.08454

arXiv 2024
[30]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://doi.org/10.18653/v1/2020.emnlp-demos.6 Transformers: Sta...

work page doi:10.18653/v1/2020.emnlp-demos.6 2020
[31]

Haoyi Wu and Kewei Tu. 2024. https://doi.org/10.18653/v1/2024.acl-long.602 Layer-condensed KV cache for efficient inference of large language models . In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11175--11188, Bangkok, Thailand. Association for Computational Linguistics

work page doi:10.18653/v1/2024.acl-long.602 2024
[32]

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/5c1ddd2e59df46fd2aa85c833b1b36ed-Paper-Conference.pdf Duoattention: Efficient long-context llm inference with retrieval and streaming heads . In International Conference on Learning Represent...

2025
[33]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/5e5fd18f863cbe6d8ae392a93fd271c9-Paper-Conference.pdf Efficient streaming language models with attention sinks . In International Conference on Learning Representations, volume 2024, pages 21875--21895

2024
[34]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report . arXiv preprint arXiv:2505.09388

Pith/arXiv arXiv 2025
[35]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. https://doi.org/10.18653/v1/D18-1259 HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369--2380, Brussels...

work page doi:10.18653/v1/d18-1259 2018
[36]

Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji. 2024. https://openreview.net/forum?id=LCTmppB165 Cam: Cache merging for memory-efficient llms inference . In Forty-first international conference on machine learning

2024
[37]

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, and 1 others. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/6ceefa7b15572587b78ecfcebb2827f8-Paper-Conference.pdf H2o: Heavy-hitter oracle for efficient generative inference of large language model...

2023

[1] [1]

Marah Abdin, Jyoti Aneja, Harkirat Behl, S \'e bastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, and 1 others. 2024. https://arxiv.org/abs/2412.08905 Phi-4 technical report . arXiv preprint arXiv:2412.08905

Pith/arXiv arXiv 2024

[2] [2]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, and 1 others. 2023. https://arxiv.org/abs/2303.08774 Gpt-4 technical report . arXiv preprint arXiv:2303.08774

Pith/arXiv arXiv 2023

[3] [3]

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. https://doi.org/10.18653/v1/2024.acl-long.172 L ong B ench: A bilingual, multitask benchmark for long context understanding . In Proceedings of the 62nd Annual Meeting of the Association for ...

work page doi:10.18653/v1/2024.acl-long.172 2024

[4] [4]

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, and 1 others. 2024. https://arxiv.org/abs/2406.02069 Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling . arXiv preprint arXiv:2406.02069

Pith/arXiv arXiv 2024

[5] [5]

Chi-Chih Chang, Wei-Cheng Lin, Chien-Yu Lin, Chong-Yan Chen, Yu-Fang Hu, Pei-Shuo Wang, Ning-Chi Huang, Luis Ceze, Mohamed S Abdelfattah, and Kai-Chiang Wu. 2024. https://arxiv.org/abs/2407.21118 Palu: Compressing kv-cache with low-rank projection . arXiv preprint arXiv:2407.21118

arXiv 2024

[6] [6]

Wanyun Cui and Mingwei Xu. 2026. https://proceedings.neurips.cc/paper_files/paper/2025/file/750b0f9fccafad88e0da366315e03d1a-Paper-Conference.pdf Homogeneous keys, heterogeneous values: Exploiting local kv cache asymmetry for long-context llms . Advances in Neural Information Processing Systems, 38:81628--81650

2026

[7] [7]

Tri Dao. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/98ed250b203d1ac6b24bbcf263e3d4a7-Paper-Conference.pdf Flashattention-2: Faster attention with better parallelism and work partitioning . In International Conference on Learning Representations, volume 2024, pages 35549--35562

2024

[8] [8]

Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher R \'e . 2022. https://proceedings.neurips.cc/paper_files/paper/2022/file/67d57c32e20fd0a7a302cb81d36e40d5-Paper-Conference.pdf Flashattention: Fast and memory-efficient exact attention with io-awareness . Advances in neural information processing systems, 35:16344--16359

2022

[9] [9]

Alessio Devoto, Maximilian Jeblick, and Simon J \'e gou. 2025. https://arxiv.org/abs/2510.00636 Expected attention: Kv cache compression by estimating attention from future queries distribution . arXiv preprint arXiv:2510.00636

arXiv 2025

[10] [10]

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. 2025. https://arxiv.org/abs/2502.03805 Identify critical kv cache in llm inference from an output perturbation perspective . arXiv preprint arXiv:2502.03805

Pith/arXiv arXiv 2025

[11] [11]

Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. 2026. https://proceedings.neurips.cc/paper_files/paper/2025/file/a40ff56daab9f4808b1e18350c8a11ce-Paper-Conference.pdf Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference . Advances in Neural Information Processing Systems, 38:113152--113188

2026

[12]

Yu Fu, Zefan Cai, Abedelkadir Asi, Wayne Xiong, Yue Dong, and Wen Xiao. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/f649556471416b35e60ae0de7c1e3619-Paper-Conference.pdf Not all heads matter: A head-level kv cache compression method with integrated retrieval and reasoning. In International Conference on Learning Representations, volume ...

[13]

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/639a9a172c044fbb64175b5fad42e9a5-Paper-Conference.pdf Model tells you what to discard: Adaptive kv cache compression for llms. In International Conference on Learning Representations, volume 2024, pages 22975–22988

[14]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3 herd of models. arXiv preprint arXiv:2407.21783

[15]

Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. 2024. https://arxiv.org/abs/2404.06654 Ruler: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654

[16]

Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. 2023. https://arxiv.org/ab...

[17]

Tomáš Kočiský, Jonathan Schwarz, Phil Blunsom, Chris Dyer, Karl Moritz Hermann, Gábor Melis, and Edward Grefenstette. 2018. https://doi.org/10.1162/tacl_a_00023 The NarrativeQA reading comprehension challenge. Transactions of the Association for Computational Linguistics, 6:317–328

[18]

Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. 2024. https://doi.org/10.52202/079017-0722 Snapkv: Llm knows what you are looking for before generation. Advances in Neural Information Processing Systems, 37:22947–22970

[19]

Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. 2024. https://doi.org/10.52202/079017-4443 Minicache: Kv cache compression in depth dimension for large language models. Advances in Neural Information Processing Systems, 37:139997–140031

[20]

Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/a452a7c6c463e4ae8fbdc614c6e983e6-Paper-Conference.pdf Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time ...

[21]

Matanel Oren, Michael Hassid, Nir Yarden, Yossi Adi, and Roy Schwartz. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.1043 Transformers are multi-state RNNs. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18724–18741, Miami, Florida, USA. Association for Computational Linguistics

[22]

Myle Ott, Sergey Edunov, Alexei Baevski, Angela Fan, Sam Gross, Nathan Ng, David Grangier, and Michael Auli. 2019. https://doi.org/10.18653/v1/N19-4009 fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics (Demonstrations), pages 48...

[23]

Siyu Ren and Kenny Q Zhu. 2024. https://arxiv.org/abs/2402.06262 On the efficacy of eviction policy for key-value constrained generative language model inference. arXiv preprint arXiv:2402.06262

[24]

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, and 1 others. 2024. https://arxiv.org/abs/2408.00118 Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118

[25]

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, and 1 others. 2023a. https://arxiv.org/abs/2302.13971 Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971

[26]

Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, and 1 others. 2023b. https://arxiv.org/abs/2307.09288 Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

[27]

Zhongwei Wan, Xinjian Wu, Yu Zhang, Yi Xin, Chaofan Tao, Zhihong Zhu, Xin Wang, Siqi Luo, Jing Xiong, Longyue Wang, and 1 others. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/d862f7f5445255090de13b825b880d59-Paper-Conference.pdf D2o: Dynamic discriminative operations for efficient long-context inference of large language models. In Inter...

[28]

Yixuan Wang, Haoyu Qiao, Lujun Li, Qingfu Zhu, and Wanxiang Che. 2025. https://arxiv.org/abs/2508.16134 Commonkv: Compressing kv cache with cross-layer parameter sharing. arXiv preprint arXiv:2508.16134

[29]

Zheng Wang, Boxiao Jin, Zhongzhi Yu, and Minjia Zhang. 2024. https://arxiv.org/abs/2407.08454 Model tells you where to merge: Adaptive kv cache merging for llms on long-context tasks. arXiv preprint arXiv:2407.08454

[30]

Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Remi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, and 3 others. 2020. https://doi.org/10.18653/v1/2020.emnlp-demos.6 Transformers: Sta...

[31]

Haoyi Wu and Kewei Tu. 2024. https://doi.org/10.18653/v1/2024.acl-long.602 Layer-condensed KV cache for efficient inference of large language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 11175–11188, Bangkok, Thailand. Association for Computational Linguistics

[32]

Guangxuan Xiao, Jiaming Tang, Jingwei Zuo, Junxian Guo, Shang Yang, Haotian Tang, Yao Fu, and Song Han. 2025. https://proceedings.iclr.cc/paper_files/paper/2025/file/5c1ddd2e59df46fd2aa85c833b1b36ed-Paper-Conference.pdf Duoattention: Efficient long-context llm inference with retrieval and streaming heads. In International Conference on Learning Represent...

[33]

Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. 2024. https://proceedings.iclr.cc/paper_files/paper/2024/file/5e5fd18f863cbe6d8ae392a93fd271c9-Paper-Conference.pdf Efficient streaming language models with attention sinks. In International Conference on Learning Representations, volume 2024, pages 21875–21895

[34]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. https://arxiv.org/abs/2505.09388 Qwen3 technical report. arXiv preprint arXiv:2505.09388

[35]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. https://doi.org/10.18653/v1/D18-1259 HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380, Brussels...

[36]

Yuxin Zhang, Yuxuan Du, Gen Luo, Yunshan Zhong, Zhenyu Zhang, Shiwei Liu, and Rongrong Ji. 2024. https://openreview.net/forum?id=LCTmppB165 Cam: Cache merging for memory-efficient llms inference. In Forty-first international conference on machine learning

[37]

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, and 1 others. 2023. https://proceedings.neurips.cc/paper_files/paper/2023/file/6ceefa7b15572587b78ecfcebb2827f8-Paper-Conference.pdf H2o: Heavy-hitter oracle for efficient generative inference of large language model...