pith. machine review for the scientific record.

arxiv: 2605.06676 · v1 · submitted 2026-04-22 · 💻 cs.LG · cs.CL

Recognition: no theorem link

LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:54 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords KV cache compression · LLM long-context inference · differentiable eviction · attention head budgeting · token importance scoring · end-to-end optimization

The pith

LKV learns head-wise KV budgets and token importance scores end-to-end to compress LLM caches without heuristic rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that heuristic approaches to KV cache eviction misallocate memory because they rely on statistical priors or fixed attention patterns instead of task performance. LKV reformulates the entire compression step as a single differentiable optimization that jointly learns per-head budgets and intrinsic token importance. This produces near-lossless accuracy on LongBench while retaining only 15 percent of the original cache. The authors further show that the learned budgeting step accounts for most of the quality gain, not the token-selection rule itself.

Core claim

LKV integrates LKV-H, which learns task-optimized global budgets per attention head, with LKV-T, which computes query-independent importance scores for each KV token without ever materializing the full attention matrix; the resulting end-to-end system reaches state-of-the-art compressed performance on LongBench and RULER, with analysis attributing the largest fidelity improvements to the data-driven budget allocation.
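
To make the machinery concrete, here is a minimal sketch of what the eviction step could look like at inference time, assuming the learned artifacts reduce to a per-head retention ratio (in the spirit of LKV-H) and a per-token importance score (in the spirit of LKV-T). All tensor shapes, names, and the ragged output convention are assumptions for illustration, not the authors' implementation.

```python
import torch

def evict_kv_cache(keys, values, token_scores, head_ratios):
    """Hypothetical per-layer eviction using learned head-wise budgets and
    query-agnostic token scores. Shapes and names are illustrative assumptions.

    keys, values : [num_heads, seq_len, head_dim]  KV cache for one layer
    token_scores : [num_heads, seq_len]            intrinsic importance per token
    head_ratios  : [num_heads]                     learned retention ratio per head
    """
    num_heads, seq_len, _ = keys.shape
    kept_keys, kept_values = [], []
    for h in range(num_heads):
        # Head-wise budget: how many tokens this head may keep.
        budget = max(1, int(round(head_ratios[h].item() * seq_len)))
        # Rank tokens by precomputed scores; no attention matrix is built,
        # only a length-seq_len score vector per head is consulted.
        keep = torch.topk(token_scores[h], budget).indices.sort().values
        kept_keys.append(keys[h, keep])
        kept_values.append(values[h, keep])
    # Heads retain different numbers of tokens, so the compressed cache is ragged.
    return kept_keys, kept_values
```

Because each head keeps a different number of entries, a real implementation needs a ragged or paged cache layout; that bookkeeping is the practical cost of non-uniform budgets.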

What carries the argument

End-to-end differentiable optimization of head-wise budget allocations and intrinsic token importance scores that replace heuristic proxies.

If this is right

  • Compression ratios above 85 percent become practical for long-context inference while preserving task accuracy (see the rough memory arithmetic after this list).
  • Budget allocation across heads becomes the primary lever for quality, reducing reliance on attention-sink or recency heuristics.
  • The method eliminates the need to materialize attention matrices during eviction, lowering both memory and compute overhead.
  • Task-specific budget learning can be performed once and then applied at inference time to new sequences of similar length.
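
As a rough check on the memory side of these points, the storage numbers reported alongside Figure 5 (25.0 GB full cache vs. 3.75 GB at 15 percent retention for 200k tokens) line up with back-of-the-envelope KV arithmetic. The model configuration assumed below (32 layers, 8 grouped-query KV heads, head dimension 128, fp16) is my assumption about Llama-3.1-8B-Instruct, not something stated in this summary.

```python
# Back-of-the-envelope KV cache size, assuming Llama-3.1-8B-Instruct uses
# 32 layers, 8 KV heads (after GQA), head_dim 128, and fp16 (2 bytes per element).
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
tokens = 200_000

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # keys + values
full_gb = tokens * bytes_per_token / 1e9     # ~26.2 GB; the paper reports 25.0 GB
retained_gb = 0.15 * 25.0                    # 15% of the reported 25.0 GB = 3.75 GB
print(f"full ~= {full_gb:.1f} GB, retained ~= {retained_gb:.2f} GB")
```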

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same differentiable-budget approach could be applied to per-layer or per-model allocation decisions rather than only per-head.
  • If the learned policies prove robust, they might enable online adaptation of cache size during a single conversation without full retraining.
  • The dominance of budgeting over selection suggests that future memory work should treat allocation as a first-class learned parameter rather than a fixed hyper-parameter.

Load-bearing premise

The budgets and importance scores optimized on the training distribution will transfer to arbitrary new inputs and tasks without retraining.

What would settle it

Measure whether LKV still matches full-cache accuracy at 15 percent retention when tested on a long-context benchmark whose task distribution differs substantially from LongBench and RULER.
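
Assuming per-example scores can be collected under both settings, a minimal paired comparison of full-cache versus 15-percent-retention accuracy on the held-out benchmark would quantify the transfer gap. The helper below is a generic sketch, not part of the paper or any released code.

```python
from statistics import mean, stdev

def transfer_gap(full_cache_scores, compressed_scores):
    """Paired per-example scores (e.g., 0/1 correctness) for the same inputs,
    run once with the full KV cache and once at 15% retention on an
    out-of-distribution long-context benchmark. Returns the mean accuracy drop
    and its spread; near-lossless transfer keeps the mean drop close to zero."""
    drops = [f - c for f, c in zip(full_cache_scores, compressed_scores)]
    return mean(drops), (stdev(drops) if len(drops) > 1 else 0.0)

# Toy example:
print(transfer_gap([1, 1, 0, 1], [1, 0, 0, 1]))   # -> (0.25, 0.5)
```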

Figures

Figures reproduced from arXiv: 2605.06676 by Chao Wang, Di Huang, Enshuai Zhou, Jiaming Guo, Qi Guo, Rui Zhang, Xing Hu, Yifan Hao, Yunji Chen, Zidong Du.

Figure 1
Figure 1: KV budget allocation policies at 15% retention. (a) SnapKV: uniform. (b) PyramidKV: layer-wise decay. (c) DuoAttention: rigid binary classification (retrieval vs. streaming heads). (d) Ada-SnapKV: adaptive within layers but with uniform layer priors. (e) LKV (ours): learned global, fine-grained policy optimizing task objectives without rigid priors.
Figure 2
Figure 2: Overview of LKV. Left (LKV-H): learns global budget ratios r from head embeddings. Middle (LKV-T): performs differentiable, query-agnostic token selection via Soft-TopK. Right: end-to-end optimization via self-distillation against a frozen teacher.
Figure 3
Figure 3: Performance on LongBench across varying KV cache retention ratios. LKV shows robustness in low-resource regimes. †For Qwen3, an on-the-fly approximation is used due to the lack of official pre-computed patterns (see Appendix D.2).
Figure 5
Figure 5: Memory profiling on Llama-3.1-8B-Instruct (R = 0.15). The full cache exhibits rapid memory growth and crashes with OOM errors at 225k tokens, while LKV scales to 262k and beyond; with a 15% budget, KV storage at 200k tokens drops from 25.0 GB (full) to 3.75 GB (LKV).
Figure 6
Figure 6: Results under two settings. (a) Length scalability: performance across increasing context lengths (4k, 8k, 16k, 32k) at a fixed retention ratio (R = 0.15). (b) Compression robustness: performance across retention ratios R ∈ [0.1, 0.5] at a fixed context length of 16k.
read the original abstract

Long-context inference in Large Language Models (LLMs) is bottlenecked by the linear growth of Key-Value (KV) cache memory. Existing KV cache compression paradigms are fundamentally limited by heuristics: heuristic budgeting relies on statistical priors rather than task objectives, causing resource misallocation, while heuristic selection relies on coupled query-key interactions or static inductive biases (e.g., attention sinks). To address this limitation, we introduce LKV (Learned KV Eviction), which formulates KV compression as an end-to-end differentiable optimization problem. LKV integrates LKV-H to learn task-optimized global budgets, and LKV-T to derive intrinsic KV importance without materializing attention matrices. This design bypasses heuristic proxies, strictly aligning compression with task objectives. Extensive evaluations demonstrate that LKV achieves state-of-the-art performance on both LongBench and RULER benchmarks at high compression rates. In particular, on LongBench, LKV achieves near-lossless performance with only 15% KV cache retention. Crucially, our analysis identifies learned budgeting as the dominant driver of fidelity, demonstrating that data-driven allocation is essential to overcome the limitations of hand-crafted heuristics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes LKV, an end-to-end differentiable framework for KV cache eviction in LLMs. It introduces LKV-H to learn task-optimized head-wise global budgets and LKV-T to compute intrinsic token importance scores without materializing full attention matrices, replacing heuristic budgeting and selection. The central claims are SOTA results on LongBench and RULER at high compression ratios, with near-lossless performance at 15% KV retention on LongBench, and an analysis showing that learned budgeting (rather than token selection) is the dominant factor in preserving fidelity.

Significance. If the generalization and robustness claims hold, the work would meaningfully advance KV cache compression by demonstrating that data-driven, objective-aligned allocation can outperform hand-crafted heuristics at aggressive compression rates. The separation of budgeting from selection and the emphasis on end-to-end optimization provide a useful conceptual distinction. However, the current experimental support is only moderately strong, limiting immediate impact until ablations and transfer tests are added.

major comments (3)
  1. [Experimental Results] Experimental Results section: The abstract and main claims assert near-lossless performance at 15% retention on LongBench and identify learned budgeting as the dominant driver, yet no error bars, multiple random seeds, or statistical tests are reported. This makes it difficult to determine whether the SOTA margin is robust or sensitive to evaluation variance.
  2. [Experiments / Analysis] No dedicated generalization or transfer subsection: The central claim that LKV-H budgets capture task-invariant structure (rather than fitting the training mixture) is load-bearing for the assertion that data-driven allocation overcomes heuristic misallocation. Without explicit experiments training on one benchmark family and evaluating on held-out tasks or distributions, the dominance of learned budgeting remains unverified.
  3. [Method (LKV-T)] Method section describing LKV-T: The formulation claims to derive intrinsic importance scores without materializing attention matrices and to strictly align with task objectives. A concrete derivation or pseudocode showing how gradients flow through the differentiable selection (and how it avoids implicit heuristic biases) is needed to substantiate that it bypasses the limitations of prior query-key or sink-based methods.
minor comments (2)
  1. [Method] The paper should clarify the exact training objective and loss used to optimize the head-wise budget parameters, including any regularization terms that prevent degenerate allocations.
  2. [Figures/Tables] Figure captions and tables comparing against baselines would benefit from explicit retention ratios and model sizes for each method to enable direct apples-to-apples comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: Experimental Results section: The abstract and main claims assert near-lossless performance at 15% retention on LongBench and identify learned budgeting as the dominant driver, yet no error bars, multiple random seeds, or statistical tests are reported. This makes it difficult to determine whether the SOTA margin is robust or sensitive to evaluation variance.

    Authors: We agree that variance reporting strengthens the claims. In the revised manuscript we have rerun the primary LongBench evaluations using three random seeds and added error bars (standard deviation) to the main result tables. A short discussion of statistical significance via paired t-tests has also been included, confirming that the reported margins remain consistent. revision: yes

  2. Referee: No dedicated generalization or transfer subsection: The central claim that LKV-H budgets capture task-invariant structure (rather than fitting the training mixture) is load-bearing for the assertion that data-driven allocation overcomes heuristic misallocation. Without explicit experiments training on one benchmark family and evaluating on held-out tasks or distributions, the dominance of learned budgeting remains unverified.

    Authors: We acknowledge the need for explicit transfer evidence. We have added a new subsection 'Transferability of Learned Budgets' that trains LKV-H on a LongBench subset (excluding selected task families) and evaluates the resulting budgets on the held-out families plus the full RULER benchmark. The added results show only minor degradation, supporting that the budgets capture task-invariant structure. revision: yes

  3. Referee: Method section describing LKV-T: The formulation claims to derive intrinsic importance scores without materializing attention matrices and to strictly align with task objectives. A concrete derivation or pseudocode showing how gradients flow through the differentiable selection (and how it avoids implicit heuristic biases) is needed to substantiate that it bypasses the limitations of prior query-key or sink-based methods.

    Authors: We have expanded Section 3.2 with an explicit derivation of gradient flow through the Gumbel-softmax differentiable selection used by LKV-T. Pseudocode is now supplied as Algorithm 1 in the appendix, illustrating the forward and backward passes and confirming that importance scores are optimized directly against the task loss without attention-matrix materialization or static inductive biases. revision: yes
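
To make the gradient-flow point concrete in code, below is one common way hard token selection is made differentiable: a straight-through soft top-k mask. This is a generic sketch under my own assumptions; the paper's Figure 2 describes a Soft-TopK operator and the simulated rebuttal mentions Gumbel-softmax, so the authors' Algorithm 1 may differ in detail.

```python
import torch

def soft_topk_mask(scores, k, temperature=0.1):
    """Differentiable approximation of a hard top-k keep/evict mask.

    Forward pass: an exact binary mask keeping the k highest-scoring tokens.
    Backward pass: gradients flow through a smooth sigmoid relaxation
    (straight-through estimator), so token scores can be trained directly
    against a task or distillation loss.
    """
    threshold = torch.topk(scores, k).values[..., -1:]          # k-th largest score
    soft = torch.sigmoid((scores - threshold) / temperature)    # smooth relaxation
    hard = (scores >= threshold).float()                        # exact 0/1 mask
    return hard + (soft - soft.detach())                        # hard forward, soft backward

scores = torch.randn(1, 1024, requires_grad=True)
mask = soft_topk_mask(scores, k=154)          # ~15% of 1024 tokens
loss = (mask * scores).sum()                  # stand-in for a task/distillation loss
loss.backward()                               # gradients reach `scores` through the relaxed mask as well
```

The useful property is that the forward value is exactly the binary mask inference would use, while the backward pass sees a smooth function of the scores.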

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external benchmarks and independent optimization

full rationale

The paper formulates KV compression as differentiable optimization of head-wise budgets (LKV-H) and token importance (LKV-T) to align directly with task loss, then reports empirical results on standard external benchmarks (LongBench, RULER) that are independent of the training distribution and objective. The claim that learned budgeting is the dominant driver is presented as an outcome of comparative analysis and ablations rather than a definitional or self-referential reduction. No load-bearing self-citations, uniqueness theorems from prior author work, or fitted parameters renamed as predictions appear in the abstract or described chain. The method is held to external validation rather than its own constructions, with generalization to new inputs treated as an empirical question rather than assumed by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that KV importance can be derived differentiably from model weights alone and that task loss provides a sufficient signal for budget allocation; no new physical entities are postulated.

free parameters (1)
  • head-wise budget parameters
    Learned per-head allocation ratios optimized end-to-end against task loss; these are the primary fitted quantities (a minimal sum-constrained construction is sketched after this ledger).
axioms (1)
  • domain assumption: The KV cache eviction decision can be made differentiable with respect to final task loss without materializing full attention matrices.
    Invoked to justify LKV-T design.
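
To illustrate how such per-head budgets can be kept trainable while respecting a global retention constraint, here is one minimal parameterization that maps unconstrained per-head parameters to retention ratios averaging to the global budget. The softmax construction and the names below are my assumptions, not necessarily how LKV-H is parameterized.

```python
import torch

def head_budgets(logits, global_retention=0.15):
    """Map unconstrained per-head parameters to retention ratios whose mean
    equals the global budget R. A minimal illustrative construction.

    logits : [num_heads] learnable parameters (the fitted quantities above).
    """
    num_heads = logits.numel()
    # Softmax gives a distribution over heads; scaling by num_heads * R makes
    # the ratios average to the global retention R.
    ratios = torch.softmax(logits, dim=0) * num_heads * global_retention
    # Clamp so no head is asked to keep more than 100% of its tokens.
    return ratios.clamp(max=1.0)

logits = torch.zeros(32, requires_grad=True)   # e.g., 32 heads, uniform start
print(head_budgets(logits))                     # every head at 0.15 before training
```

Note that the clamp can pull the mean below the target when some heads saturate; a faithful implementation would re-normalize or solve for a threshold that keeps the global constraint exact.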

pith-pipeline@v0.9.0 · 5532 in / 1243 out tokens · 34177 ms · 2026-05-11T00:54:46.301223+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 8 canonical work pages

  1. Adnan, M., Arunkumar, A., Jain, G., Nair, P. J., Soloveychik, I., and Kamath, P. Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference, March 2024. URL https://arxiv.org/abs/2403.09054v2
  2. Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, December 2023.
  3. Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), August 2024. doi:10.18653/v1/2024.acl-long.172
  4. Bhaskar, A., Wettig, A., Gao, T., Dong, Y., and Chen, D. Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?, June 2025.
  5. Cai, Z., Zhang, Y., Gao, B., Liu, Y., Li, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Hu, J., and Xiao, W. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling, May 2025.
  6. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Re, C. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems, October 2022. URL https://openreview.net/forum?id=H4DqfPSibmx
  7. Devoto, A., Jeblick, M., and Jégou, S. Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution, October 2025.
  8. Feng, Y., Guo, H., Lv, J., Zhou, S. K., and Xie, X. Taming the Fragility of KV Cache Eviction in LLM Inference, October 2025a.
  9. Feng, Y., Lv, J., Cao, Y., Xie, X., and Zhou, S. K. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, October 2025b. URL https://openreview.net/forum?id=tcisuhGsQZ
  10. Feng, Y., Lv, J., Cao, Y., Xie, X., and Zhou, S. K. Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective, February 2025c.
  11. Fu, Y., Cai, Z., Asi, A., Xiong, W., Dong, Y., and Xiao, W. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning, October 2025.
  12. Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., and Gao, J. Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. In The Twelfth International Conference on Learning Representations, October 2023. URL https://openreview.net/forum?id=uNrFpDPMyo
  13. Gholami, A., Yao, Z., Kim, S., Hooper, C., Mahoney, M. W., and Keutzer, K. AI and Memory Wall. IEEE Micro, 44(3): 33-39, May 2024. ISSN 1937-4143. doi:10.1109/MM.2024.3373763
  14. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.
  15. Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., and Ginsburg, B. RULER: What's the Real Context Size of Your Long-Context Language Models? In First Conference on Language Modeling, August 2024. URL https://openreview.net/forum?id=kIoBbc76Sy
  16. Huang, Y., Yuan, B., Han, X., Xiao, C., and Liu, Z. Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices, January 2025.
  17. Jiang, H., Li, Y., Zhang, C., Wu, Q., Luo, X., Ahn, S., Han, Z., Abdi, A. H., Li, D., Lin, C.-Y., Yang, Y., and Qiu, L. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, October 2024.
  18. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23), pp. 611-626, New York, NY, USA, October 2023. Association for Computing Machinery.
  19. Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. SnapKV: LLM Knows What You are Looking for Before Generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, November 2024a. URL https://openreview.net/forum?id=poE54GOq2l
  20. Li, Z., Su, Y., and Collier, N. 500xCompressor: Generalized Prompt Compression for Large Language Models, August 2024b.
  21. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12: 157-173, 2024a. doi:10.1162/tacl_a_00638
  22. Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., and Shrivastava, A. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time. In Thirty-Seventh Conference on Neural Information Processing Systems, November 2023. URL https://openreview.net/forum?id=JZfg6wGi6g
  23. Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. In Proceedings of the 41st International Conference on Machine Learning, pp. 32332-32344. PMLR, July 2024b. URL https://proceedings.mlr.press/v235/liu24bz.html
  24. Pan, Z., Wu, Q., Jiang, H., Xia, M., Luo, X., Zhang, J., Lin, Q., Rühle, V., Yang, Y., Lin, C.-Y., Zhao, H. V., Qiu, L., and Zhang, D. LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression, March 2024.
  25. Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently Scaling Transformer Inference. Proceedings of Machine Learning and Systems, 5: 606-624, March 2023. URL https://proceedings.mlsys.org/paper_files/paper/2023/hash/c4be71ab8d24cdfb45e3d06dbfca2780-Abstract-mlsys2023.html
  26. Qin, Z., Cao, Y., Lin, M., Hu, W., Fan, S., Cheng, K., Lin, W., and Li, J. CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences, March 2025.
  27. Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., and Zhang, C. FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU. In Proceedings of the 40th International Conference on Machine Learning (ICML '23), volume 202, pp. 31094-31116, Honolulu, Hawaii, USA, July 2023. JMLR.org.
  28. Shutova, A., Malinovskii, V., Egiazarian, V., Kuznedelev, D., Mazur, D., Nikita, S., Ermakov, I., and Alistarh, D. Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models. In Forty-Second International Conference on Machine Learning, June 2025. URL https://openreview.net/forum?id=COowwJOAZi
  29. Skean, O., Arefin, M. R., Zhao, D., Patel, N. N., Naghiyev, J., LeCun, Y., and Shwartz-Ziv, R. Layer by Layer: Uncovering Hidden Representations in Language Models. In Forty-Second International Conference on Machine Learning, June 2025. URL https://openreview.net/forum?id=WGXb7UdvTX
  30. Su, J. Post-Softmax: Searching for Smooth Approximations of Top-k. Scientific Spaces (blog post in Chinese; title translated), September 2024. URL https://www.spaces.ac.cn/archives/10373
  31. Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., and Han, S. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference, August 2024.
  32. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is All You Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
  33. Wan, Z., Wu, X., Zhang, Y., Xin, Y., Tao, C., Zhu, Z., Wang, X., Luo, S., Xiong, J., Wang, L., and Zhang, M. D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models, March 2025.
  34. Wang, G., Upasani, S., Wu, C., Gandhi, D., Li, J., Hu, C., Li, B., and Thakker, U. LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference, March 2025.
  35. Wang, W. and Tu, Z. Rethinking the Value of Transformer Components. In Scott, D., Bel, N., and Zong, C. (eds.), Proceedings of the 28th International Conference on Computational Linguistics, pp. 6019-6029, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi:10.18653/v1/2020.coling-main.529
  36. Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient Streaming Language Models with Attention Sinks. In The Twelfth International Conference on Learning Representations, October 2023. URL https://openreview.net/forum?id=NG7sS51zVF
  37. Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., Fu, Y., and Han, S. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=cFu7ze7xUm
  38. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., et al. Qwen3 Technical Report, May 2025.
  39. Yang, D., Han, X., Gao, Y., Hu, Y., Zhang, S., and Zhao, H. PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference, June 2024.
  40. Zhang, F., Chen, B., Zhang, Y., Keung, J., Liu, J., Zan, D., Mao, Y., Lou, J.-G., and Chen, W. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2471-2484, Singapore, December 2023.
  41. Zhang, L., Song, J., Gao, A., Chen, J., Bao, C., and Ma, K. Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3712-3721, October 2019. doi:10.1109/ICCV.2019.00381
  42. Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Re, C., Barrett, C., Wang, Z., and Chen, B. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. In Thirty-Seventh Conference on Neural Information Processing Systems, November 2023b. URL https://openreview.net/forum?id=RkRrPp7GKO