pith. machine review for the scientific record.

arxiv: 2605.06676 · v1 · submitted 2026-04-22 · 💻 cs.LG · cs.CL

Recognition: no theorem link

LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 00:54 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords KV cache compression · LLM long-context inference · differentiable eviction · attention head budgeting · token importance scoring · end-to-end optimization

The pith

LKV learns head-wise KV budgets and token importance scores end-to-end to compress LLM caches without heuristic rules.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that heuristic approaches to KV cache eviction misallocate memory because they rely on statistical priors or fixed attention patterns instead of task performance. LKV reformulates the entire compression step as a single differentiable optimization that jointly learns per-head budgets and intrinsic token importance. This produces near-lossless accuracy on LongBench while retaining only 15 percent of the original cache. The authors further show that the learned budgeting step accounts for most of the quality gain, not the token-selection rule itself.

Core claim

LKV integrates LKV-H, which learns task-optimized global budgets per attention head, with LKV-T, which computes query-independent importance scores for each KV token without ever materializing the full attention matrix; the resulting end-to-end system reaches state-of-the-art compressed performance on LongBench and RULER, with analysis attributing the largest fidelity improvements to the data-driven budget allocation.
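
To make the machinery concrete, here is a minimal sketch of what the eviction step could look like at inference time, assuming the learned artifacts reduce to a per-head retention ratio (in the spirit of LKV-H) and a per-token importance score (in the spirit of LKV-T). All tensor shapes, names, and the ragged output convention are assumptions for illustration, not the authors' implementation.

```python
import torch

def evict_kv_cache(keys, values, token_scores, head_ratios):
    """Hypothetical per-layer eviction using learned head-wise budgets and
    query-agnostic token scores. Shapes and names are illustrative assumptions.

    keys, values : [num_heads, seq_len, head_dim]  KV cache for one layer
    token_scores : [num_heads, seq_len]            intrinsic importance per token
    head_ratios  : [num_heads]                     learned retention ratio per head
    """
    num_heads, seq_len, _ = keys.shape
    kept_keys, kept_values = [], []
    for h in range(num_heads):
        # Head-wise budget: how many tokens this head may keep.
        budget = max(1, int(round(head_ratios[h].item() * seq_len)))
        # Rank tokens by precomputed scores; no attention matrix is built,
        # only a length-seq_len score vector per head is consulted.
        keep = torch.topk(token_scores[h], budget).indices.sort().values
        kept_keys.append(keys[h, keep])
        kept_values.append(values[h, keep])
    # Heads retain different numbers of tokens, so the compressed cache is ragged.
    return kept_keys, kept_values
```

Because each head keeps a different number of entries, a real implementation needs a ragged or paged cache layout; that bookkeeping is the practical cost of non-uniform budgets.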

What carries the argument

End-to-end differentiable optimization of head-wise budget allocations and intrinsic token importance scores that replace heuristic proxies.

If this is right

  • Compression ratios above 85 percent become practical for long-context inference while preserving task accuracy (see the rough memory arithmetic after this list).
  • Budget allocation across heads becomes the primary lever for quality, reducing reliance on attention-sink or recency heuristics.
  • The method eliminates the need to materialize attention matrices during eviction, lowering both memory and compute overhead.
  • Task-specific budget learning can be performed once and then applied at inference time to new sequences of similar length.
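
As a rough check on the memory side of these points, the storage numbers reported alongside Figure 5 (25.0 GB full cache vs. 3.75 GB at 15 percent retention for 200k tokens) line up with back-of-the-envelope KV arithmetic. The model configuration assumed below (32 layers, 8 grouped-query KV heads, head dimension 128, fp16) is my assumption about Llama-3.1-8B-Instruct, not something stated in this summary.

```python
# Back-of-the-envelope KV cache size, assuming Llama-3.1-8B-Instruct uses
# 32 layers, 8 KV heads (after GQA), head_dim 128, and fp16 (2 bytes per element).
layers, kv_heads, head_dim, bytes_per_elem = 32, 8, 128, 2
tokens = 200_000

bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_elem   # keys + values
full_gb = tokens * bytes_per_token / 1e9     # ~26.2 GB; the paper reports 25.0 GB
retained_gb = 0.15 * 25.0                    # 15% of the reported 25.0 GB = 3.75 GB
print(f"full ~= {full_gb:.1f} GB, retained ~= {retained_gb:.2f} GB")
```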

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same differentiable-budget approach could be applied to per-layer or per-model allocation decisions rather than only per-head.
  • If the learned policies prove robust, they might enable online adaptation of cache size during a single conversation without full retraining.
  • The dominance of budgeting over selection suggests that future memory work should treat allocation as a first-class learned parameter rather than a fixed hyper-parameter.

Load-bearing premise

The budgets and importance scores optimized on the training distribution will transfer to arbitrary new inputs and tasks without retraining.

What would settle it

Measure whether LKV still matches full-cache accuracy at 15 percent retention when tested on a long-context benchmark whose task distribution differs substantially from LongBench and RULER.
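
Assuming per-example scores can be collected under both settings, a minimal paired comparison of full-cache versus 15-percent-retention accuracy on the held-out benchmark would quantify the transfer gap. The helper below is a generic sketch, not part of the paper or any released code.

```python
from statistics import mean, stdev

def transfer_gap(full_cache_scores, compressed_scores):
    """Paired per-example scores (e.g., 0/1 correctness) for the same inputs,
    run once with the full KV cache and once at 15% retention on an
    out-of-distribution long-context benchmark. Returns the mean accuracy drop
    and its spread; near-lossless transfer keeps the mean drop close to zero."""
    drops = [f - c for f, c in zip(full_cache_scores, compressed_scores)]
    return mean(drops), (stdev(drops) if len(drops) > 1 else 0.0)

# Toy example:
print(transfer_gap([1, 1, 0, 1], [1, 0, 0, 1]))   # -> (0.25, 0.5)
```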

Figures

Figures reproduced from arXiv: 2605.06676 by Chao Wang, Di Huang, Enshuai Zhou, Jiaming Guo, Qi Guo, Rui Zhang, Xing Hu, Yifan Hao, Yunji Chen, Zidong Du.

Figure 1
Figure 1: KV budget allocation policies at 15% retention. (a) SnapKV: uniform. (b) PyramidKV: layer-wise decay. (c) DuoAttention: rigid binary classification (retrieval vs. streaming heads). (d) Ada-SnapKV: adaptive within layers but with uniform layer priors. (e) LKV (ours): learned global, fine-grained policy optimizing task objectives without rigid priors.
Figure 2
Figure 2: Overview of LKV. Left (LKV-H): learns global budget ratios r from head embeddings. Middle (LKV-T): performs differentiable, query-agnostic token selection via Soft-TopK. Right: end-to-end optimization via self-distillation against a frozen teacher.
Figure 3
Figure 3: Performance on LongBench across varying KV cache retention ratios. LKV shows robustness in low-resource regimes. †For Qwen3, an on-the-fly approximation is used due to the lack of official pre-computed patterns (see Appendix D.2).
Figure 5
Figure 5: Memory profiling on Llama-3.1-8B-Instruct (R = 0.15). The full cache exhibits rapid memory growth and crashes with OOM errors at 225k tokens, while LKV scales to 262k and beyond; with a 15% budget, KV storage at 200k tokens drops from 25.0 GB (full) to 3.75 GB (LKV).
Figure 6
Figure 6: Results under two settings. (a) Length scalability: performance across increasing context lengths (4k, 8k, 16k, 32k) at a fixed retention ratio (R = 0.15). (b) Compression robustness: performance across retention ratios R ∈ [0.1, 0.5] at a fixed context length of 16k.
read the original abstract

Long-context inference in Large Language Models (LLMs) is bottlenecked by the linear growth of Key-Value (KV) cache memory. Existing KV cache compression paradigms are fundamentally limited by heuristics: heuristic budgeting relies on statistical priors rather than task objectives, causing resource misallocation, while heuristic selection relies on coupled query-key interactions or static inductive biases (e.g., attention sinks). To address this limitation, we introduce LKV (Learned KV Eviction), which formulates KV compression as an end-to-end differentiable optimization problem. LKV integrates LKV-H to learn task-optimized global budgets, and LKV-T to derive intrinsic KV importance without materializing attention matrices. This design bypasses heuristic proxies, strictly aligning compression with task objectives. Extensive evaluations demonstrate that LKV achieves state-of-the-art performance on both LongBench and RULER benchmarks at high compression rates. In particular, on LongBench, LKV achieves near-lossless performance with only 15% KV cache retention. Crucially, our analysis identifies learned budgeting as the dominant driver of fidelity, demonstrating that data-driven allocation is essential to overcome the limitations of hand-crafted heuristics.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes LKV, an end-to-end differentiable framework for KV cache eviction in LLMs. It introduces LKV-H to learn task-optimized head-wise global budgets and LKV-T to compute intrinsic token importance scores without materializing full attention matrices, replacing heuristic budgeting and selection. The central claims are SOTA results on LongBench and RULER at high compression ratios, with near-lossless performance at 15% KV retention on LongBench, and an analysis showing that learned budgeting (rather than token selection) is the dominant factor in preserving fidelity.

Significance. If the generalization and robustness claims hold, the work would meaningfully advance KV cache compression by demonstrating that data-driven, objective-aligned allocation can outperform hand-crafted heuristics at aggressive compression rates. The separation of budgeting from selection and the emphasis on end-to-end optimization provide a useful conceptual distinction. However, the current experimental support is only moderately strong, limiting immediate impact until ablations and transfer tests are added.

major comments (3)
  1. [Experimental Results] Experimental Results section: The abstract and main claims assert near-lossless performance at 15% retention on LongBench and identify learned budgeting as the dominant driver, yet no error bars, multiple random seeds, or statistical tests are reported. This makes it difficult to determine whether the SOTA margin is robust or sensitive to evaluation variance.
  2. [Experiments / Analysis] No dedicated generalization or transfer subsection: The central claim that LKV-H budgets capture task-invariant structure (rather than fitting the training mixture) is load-bearing for the assertion that data-driven allocation overcomes heuristic misallocation. Without explicit experiments training on one benchmark family and evaluating on held-out tasks or distributions, the dominance of learned budgeting remains unverified.
  3. [Method (LKV-T)] Method section describing LKV-T: The formulation claims to derive intrinsic importance scores without materializing attention matrices and to strictly align with task objectives. A concrete derivation or pseudocode showing how gradients flow through the differentiable selection (and how it avoids implicit heuristic biases) is needed to substantiate that it bypasses the limitations of prior query-key or sink-based methods.
minor comments (2)
  1. [Method] The paper should clarify the exact training objective and loss used to optimize the head-wise budget parameters, including any regularization terms that prevent degenerate allocations.
  2. [Figures/Tables] Figure captions and tables comparing against baselines would benefit from explicit retention ratios and model sizes for each method to enable direct apples-to-apples comparison.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment point by point below and indicate the revisions made to the manuscript.

read point-by-point responses
  1. Referee: Experimental Results section: The abstract and main claims assert near-lossless performance at 15% retention on LongBench and identify learned budgeting as the dominant driver, yet no error bars, multiple random seeds, or statistical tests are reported. This makes it difficult to determine whether the SOTA margin is robust or sensitive to evaluation variance.

    Authors: We agree that variance reporting strengthens the claims. In the revised manuscript we have rerun the primary LongBench evaluations using three random seeds and added error bars (standard deviation) to the main result tables. A short discussion of statistical significance via paired t-tests has also been included, confirming that the reported margins remain consistent. revision: yes

  2. Referee: No dedicated generalization or transfer subsection: The central claim that LKV-H budgets capture task-invariant structure (rather than fitting the training mixture) is load-bearing for the assertion that data-driven allocation overcomes heuristic misallocation. Without explicit experiments training on one benchmark family and evaluating on held-out tasks or distributions, the dominance of learned budgeting remains unverified.

    Authors: We acknowledge the need for explicit transfer evidence. We have added a new subsection 'Transferability of Learned Budgets' that trains LKV-H on a LongBench subset (excluding selected task families) and evaluates the resulting budgets on the held-out families plus the full RULER benchmark. The added results show only minor degradation, supporting that the budgets capture task-invariant structure. revision: yes

  3. Referee: Method section describing LKV-T: The formulation claims to derive intrinsic importance scores without materializing attention matrices and to strictly align with task objectives. A concrete derivation or pseudocode showing how gradients flow through the differentiable selection (and how it avoids implicit heuristic biases) is needed to substantiate that it bypasses the limitations of prior query-key or sink-based methods.

    Authors: We have expanded Section 3.2 with an explicit derivation of gradient flow through the Gumbel-softmax differentiable selection used by LKV-T. Pseudocode is now supplied as Algorithm 1 in the appendix, illustrating the forward and backward passes and confirming that importance scores are optimized directly against the task loss without attention-matrix materialization or static inductive biases. revision: yes
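
To make the gradient-flow point concrete in code, below is one common way hard token selection is made differentiable: a straight-through soft top-k mask. This is a generic sketch under my own assumptions; the paper's Figure 2 describes a Soft-TopK operator and the simulated rebuttal mentions Gumbel-softmax, so the authors' Algorithm 1 may differ in detail.

```python
import torch

def soft_topk_mask(scores, k, temperature=0.1):
    """Differentiable approximation of a hard top-k keep/evict mask.

    Forward pass: an exact binary mask keeping the k highest-scoring tokens.
    Backward pass: gradients flow through a smooth sigmoid relaxation
    (straight-through estimator), so token scores can be trained directly
    against a task or distillation loss.
    """
    threshold = torch.topk(scores, k).values[..., -1:]          # k-th largest score
    soft = torch.sigmoid((scores - threshold) / temperature)    # smooth relaxation
    hard = (scores >= threshold).float()                        # exact 0/1 mask
    return hard + (soft - soft.detach())                        # hard forward, soft backward

scores = torch.randn(1, 1024, requires_grad=True)
mask = soft_topk_mask(scores, k=154)          # ~15% of 1024 tokens
loss = (mask * scores).sum()                  # stand-in for a task/distillation loss
loss.backward()                               # gradients reach `scores` through the relaxed mask as well
```

The useful property is that the forward value is exactly the binary mask inference would use, while the backward pass sees a smooth function of the scores.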

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external benchmarks and independent optimization

full rationale

The paper formulates KV compression as differentiable optimization of head-wise budgets (LKV-H) and token importance (LKV-T) to align directly with task loss, then reports empirical results on standard external benchmarks (LongBench, RULER) that are independent of the training distribution and objective. The claim that learned budgeting is the dominant driver is presented as an outcome of comparative analysis and ablations rather than a definitional or self-referential reduction. No load-bearing self-citations, uniqueness theorems from prior author work, or fitted parameters renamed as predictions appear in the abstract or described chain. The method is held to external validation rather than its own constructions, with generalization to new inputs treated as an empirical question rather than assumed by construction.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the premise that KV importance can be derived differentiably from model weights alone and that task loss provides a sufficient signal for budget allocation; no new physical entities are postulated.

free parameters (1)
  • head-wise budget parameters
    Learned per-head allocation ratios optimized end-to-end against task loss; these are the primary fitted quantities (a minimal sum-constrained construction is sketched after this ledger).
axioms (1)
  • domain assumption: The KV cache eviction decision can be made differentiable with respect to final task loss without materializing full attention matrices.
    Invoked to justify LKV-T design.
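
To illustrate how such per-head budgets can be kept trainable while respecting a global retention constraint, here is one minimal parameterization that maps unconstrained per-head parameters to retention ratios averaging to the global budget. The softmax construction and the names below are my assumptions, not necessarily how LKV-H is parameterized.

```python
import torch

def head_budgets(logits, global_retention=0.15):
    """Map unconstrained per-head parameters to retention ratios whose mean
    equals the global budget R. A minimal illustrative construction.

    logits : [num_heads] learnable parameters (the fitted quantities above).
    """
    num_heads = logits.numel()
    # Softmax gives a distribution over heads; scaling by num_heads * R makes
    # the ratios average to the global retention R.
    ratios = torch.softmax(logits, dim=0) * num_heads * global_retention
    # Clamp so no head is asked to keep more than 100% of its tokens.
    return ratios.clamp(max=1.0)

logits = torch.zeros(32, requires_grad=True)   # e.g., 32 heads, uniform start
print(head_budgets(logits))                     # every head at 0.15 before training
```

Note that the clamp can pull the mean below the target when some heads saturate; a faithful implementation would re-normalize or solve for a threshold that keeps the global constraint exact.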

pith-pipeline@v0.9.0 · 5532 in / 1243 out tokens · 34177 ms · 2026-05-11T00:54:46.301223+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 8 canonical work pages

  1. Adnan, M., Arunkumar, A., Jain, G., Nair, P. J., Soloveychik, I., and Kamath, P. Keyformer: KV Cache Reduction through Key Tokens Selection for Efficient Generative Inference, March 2024. URL https://arxiv.org/abs/2403.09054v2
  2. Ainslie, J., Lee-Thorp, J., de Jong, M., Zemlyanskiy, Y., Lebrón, F., and Sanghai, S. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints, December 2023.
  3. Bai, Y., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y., Tang, J., and Li, J. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. In Ku, L.-W., Martins, A., and Srikumar, V. (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), August 2024. doi:10.18653/v1/2024.acl-long.172
  4. Bhaskar, A., Wettig, A., Gao, T., Dong, Y., and Chen, D. Cache Me If You Can: How Many KVs Do You Need for Effective Long-Context LMs?, June 2025.
  5. Cai, Z., Zhang, Y., Gao, B., Liu, Y., Li, Y., Liu, T., Lu, K., Xiong, W., Dong, Y., Hu, J., and Xiao, W. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling, May 2025.
  6. Dao, T., Fu, D. Y., Ermon, S., Rudra, A., and Re, C. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. In Advances in Neural Information Processing Systems, October 2022. URL https://openreview.net/forum?id=H4DqfPSibmx
  7. Devoto, A., Jeblick, M., and Jégou, S. Expected Attention: KV Cache Compression by Estimating Attention from Future Queries Distribution, October 2025.
  8. Feng, Y., Guo, H., Lv, J., Zhou, S. K., and Xie, X. Taming the Fragility of KV Cache Eviction in LLM Inference, October 2025a.
  9. Feng, Y., Lv, J., Cao, Y., Xie, X., and Zhou, S. K. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, October 2025b. URL https://openreview.net/forum?id=tcisuhGsQZ
  10. Feng, Y., Lv, J., Cao, Y., Xie, X., and Zhou, S. K. Identify Critical KV Cache in LLM Inference from an Output Perturbation Perspective, February 2025c.
  11. Fu, Y., Cai, Z., Asi, A., Xiong, W., Dong, Y., and Xiao, W. Not All Heads Matter: A Head-Level KV Cache Compression Method with Integrated Retrieval and Reasoning, October 2025.
  12. Ge, S., Zhang, Y., Liu, L., Zhang, M., Han, J., and Gao, J. Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs. In The Twelfth International Conference on Learning Representations, October 2023. URL https://openreview.net/forum?id=uNrFpDPMyo
  13. Gholami, A., Yao, Z., Kim, S., Hooper, C., Mahoney, M. W., and Keutzer, K. AI and Memory Wall. IEEE Micro, 44(3): 33-39, May 2024. ISSN 1937-4143. doi:10.1109/MM.2024.3373763
  14. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.
  15. Hsieh, C.-P., Sun, S., Kriman, S., Acharya, S., Rekesh, D., Jia, F., and Ginsburg, B. RULER: What's the Real Context Size of Your Long-Context Language Models? In First Conference on Language Modeling, August 2024. URL https://openreview.net/forum?id=kIoBbc76Sy
  16. Huang, Y., Yuan, B., Han, X., Xiao, C., and Liu, Z. Locret: Enhancing Eviction in Long-Context LLM Inference with Trained Retaining Heads on Consumer-Grade Devices, January 2025.
  17. Jiang, H., Li, Y., Zhang, C., Wu, Q., Luo, X., Ahn, S., Han, Z., Abdi, A. H., Li, D., Lin, C.-Y., Yang, Y., and Qiu, L. MInference 1.0: Accelerating Pre-filling for Long-Context LLMs via Dynamic Sparse Attention, October 2024.
  18. Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP '23), pp. 611-626, New York, NY, USA, October 2023. Association for Computing Machinery.
  19. Li, Y., Huang, Y., Yang, B., Venkitesh, B., Locatelli, A., Ye, H., Cai, T., Lewis, P., and Chen, D. SnapKV: LLM Knows What You are Looking for Before Generation. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, November 2024a. URL https://openreview.net/forum?id=poE54GOq2l
  20. Li, Z., Su, Y., and Collier, N. 500xCompressor: Generalized Prompt Compression for Large Language Models, August 2024b.
  21. Liu, N. F., Lin, K., Hewitt, J., Paranjape, A., Bevilacqua, M., Petroni, F., and Liang, P. Lost in the Middle: How Language Models Use Long Contexts. Transactions of the Association for Computational Linguistics, 12: 157-173, 2024a. doi:10.1162/tacl_a_00638
  22. Liu, Z., Desai, A., Liao, F., Wang, W., Xie, V., Xu, Z., Kyrillidis, A., and Shrivastava, A. Scissorhands: Exploiting the Persistence of Importance Hypothesis for LLM KV Cache Compression at Test Time. In Thirty-Seventh Conference on Neural Information Processing Systems, November 2023. URL https://openreview.net/forum?id=JZfg6wGi6g
  23. Liu, Z., Yuan, J., Jin, H., Zhong, S., Xu, Z., Braverman, V., Chen, B., and Hu, X. KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache. In Proceedings of the 41st International Conference on Machine Learning, pp. 32332-32344. PMLR, July 2024b. URL https://proceedings.mlr.press/v235/liu24bz.html
  24. Pan, Z., Wu, Q., Jiang, H., Xia, M., Luo, X., Zhang, J., Lin, Q., Rühle, V., Yang, Y., Lin, C.-Y., Zhao, H. V., Qiu, L., and Zhang, D. LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression, March 2024.
  25. Pope, R., Douglas, S., Chowdhery, A., Devlin, J., Bradbury, J., Heek, J., Xiao, K., Agrawal, S., and Dean, J. Efficiently Scaling Transformer Inference. Proceedings of Machine Learning and Systems, 5: 606-624, March 2023. URL https://proceedings.mlsys.org/paper_files/paper/2023/hash/c4be71ab8d24cdfb45e3d06dbfca2780-Abstract-mlsys2023.html
  26. Qin, Z., Cao, Y., Lin, M., Hu, W., Fan, S., Cheng, K., Lin, W., and Li, J. CAKE: Cascading and Adaptive KV Cache Eviction with Layer Preferences, March 2025.
  27. Sheng, Y., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., Ré, C., Stoica, I., and Zhang, C. FlexGen: High-throughput Generative Inference of Large Language Models with a Single GPU. In Proceedings of the 40th International Conference on Machine Learning (ICML '23), volume 202, pp. 31094-31116, Honolulu, Hawaii, USA, July 2023. JMLR.org.
  28. Shutova, A., Malinovskii, V., Egiazarian, V., Kuznedelev, D., Mazur, D., Nikita, S., Ermakov, I., and Alistarh, D. Cache Me If You Must: Adaptive Key-Value Quantization for Large Language Models. In Forty-Second International Conference on Machine Learning, June 2025. URL https://openreview.net/forum?id=COowwJOAZi
  29. Skean, O., Arefin, M. R., Zhao, D., Patel, N. N., Naghiyev, J., LeCun, Y., and Shwartz-Ziv, R. Layer by Layer: Uncovering Hidden Representations in Language Models. In Forty-Second International Conference on Machine Learning, June 2025. URL https://openreview.net/forum?id=WGXb7UdvTX
  30. Su, J. Post-Softmax: Searching for Smooth Approximations of Top-k. Scientific Spaces (blog post in Chinese; title translated), September 2024. URL https://www.spaces.ac.cn/archives/10373
  31. Tang, J., Zhao, Y., Zhu, K., Xiao, G., Kasikci, B., and Han, S. Quest: Query-Aware Sparsity for Efficient Long-Context LLM Inference, August 2024.
  32. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is All You Need. In Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
  33. Wan, Z., Wu, X., Zhang, Y., Xin, Y., Tao, C., Zhu, Z., Wang, X., Luo, S., Xiong, J., Wang, L., and Zhang, M. D2O: Dynamic Discriminative Operations for Efficient Long-Context Inference of Large Language Models, March 2025.
  34. Wang, G., Upasani, S., Wu, C., Gandhi, D., Li, J., Hu, C., Li, B., and Thakker, U. LLMs Know What to Drop: Self-Attention Guided KV Cache Eviction for Efficient Long-Context Inference, March 2025.
  35. Wang, W. and Tu, Z. Rethinking the Value of Transformer Components. In Scott, D., Bel, N., and Zong, C. (eds.), Proceedings of the 28th International Conference on Computational Linguistics, pp. 6019-6029, Barcelona, Spain (Online), December 2020. International Committee on Computational Linguistics. doi:10.18653/v1/2020.coling-main.529
  36. Xiao, G., Tian, Y., Chen, B., Han, S., and Lewis, M. Efficient Streaming Language Models with Attention Sinks. In The Twelfth International Conference on Learning Representations, October 2023. URL https://openreview.net/forum?id=NG7sS51zVF
  37. Xiao, G., Tang, J., Zuo, J., Guo, J., Yang, S., Tang, H., Fu, Y., and Han, S. DuoAttention: Efficient Long-Context LLM Inference with Retrieval and Streaming Heads. In The Thirteenth International Conference on Learning Representations, October 2024. URL https://openreview.net/forum?id=cFu7ze7xUm
  38. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., et al. Qwen3 Technical Report, May 2025.
  39. Yang, D., Han, X., Gao, Y., Hu, Y., Zhang, S., and Zhao, H. PyramidInfer: Pyramid KV Cache Compression for High-throughput LLM Inference, June 2024.
  40. Zhang, F., Chen, B., Zhang, Y., Keung, J., Liu, J., Zan, D., Mao, Y., Lou, J.-G., and Chen, W. RepoCoder: Repository-Level Code Completion Through Iterative Retrieval and Generation. In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2471-2484, Singapore, December 2023.
  41. Zhang, L., Song, J., Gao, A., Chen, J., Bao, C., and Ma, K. Be Your Own Teacher: Improve the Performance of Convolutional Neural Networks via Self Distillation. In 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 3712-3721, October 2019. doi:10.1109/ICCV.2019.00381
  42. Zhang, Z., Sheng, Y., Zhou, T., Chen, T., Zheng, L., Cai, R., Song, Z., Tian, Y., Re, C., Barrett, C., Wang, Z., and Chen, B. H2O: Heavy-Hitter Oracle for Efficient Generative Inference of Large Language Models. In Thirty-Seventh Conference on Neural Information Processing Systems, November 2023b. URL https://openreview.net/forum?id=RkRrPp7GKO