MomentKV: Closing the Directional Gap in KV Cache Eviction for Long-Context Inference

Binxu Li; Tian Lan; Yu Li

arxiv: 2606.01563 · v1 · pith:TWB5BOGInew · submitted 2026-06-01 · 💻 cs.LG

Pith reviewed 2026-06-28 16:05 UTC · model grok-4.3

classification 💻 cs.LG

keywords KV cache eviction · long-context inference · directional mismatch · moment statistics · attention approximation · Transformer memory · cache compression

The pith

MomentKV tracks compact moment statistics on evicted KV tokens to correct directional mismatch during long-context inference.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that KV cache eviction degrades model outputs mainly because retained and evicted tokens point in nearly orthogonal directions, so even tiny residual attention mass on the evicted set skews the overall direction. Existing methods already drive that residual mass close to zero yet still lose accuracy. MomentKV keeps four small statistics (token count, key mean, value mean, value-key covariance) for the evicted set. These statistics both guide which tokens to evict next and supply a closed-form first-order correction to the attention output at inference time, creating a loop that keeps the evicted geometry regular. Experiments on LongBench and RULER show consistent gains over prior eviction policies at every cache size, largest when compression is most aggressive.
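A toy calculation can make the directional argument concrete. The sketch below is not taken from the paper: it assumes a 2-D value space, a fixed evicted attention mass `eps`, and hypothetical vectors `o_retained`, `aligned`, and `orthogonal`. It compares the error of dropping the evicted set when its mean output is aligned with the retained output versus orthogonal to it.

```python
import math

def mix(eps, o_r, o_e):
    """Full attention output when a fraction eps of the softmax mass
    sits on evicted tokens and 1 - eps on retained tokens."""
    return [(1.0 - eps) * r + eps * e for r, e in zip(o_r, o_e)]

def l2_error(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

eps = 0.05                # hypothetical residual mass on the evicted set
o_retained = [1.0, 0.0]   # renormalized output over retained tokens only
aligned = [1.0, 0.0]      # evicted mean output pointing the same way
orthogonal = [0.0, 1.0]   # evicted mean output orthogonal to the retained one

# Eviction without correction returns o_retained, so the error is the
# distance between o_retained and the full mixture.
err_aligned = l2_error(mix(eps, o_retained, aligned), o_retained)
err_orthogonal = l2_error(mix(eps, o_retained, orthogonal), o_retained)

print(f"aligned evicted set:    error = {err_aligned:.4f}")
print(f"orthogonal evicted set: error = {err_orthogonal:.4f}")
```

With the same evicted mass, the aligned case incurs essentially no error while the orthogonal case incurs an error of `eps * sqrt(2)`, which is the sense in which residual mass alone does not determine the damage in this toy setting.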

Core claim

The central claim is that directional mismatch between retained and evicted token sets, rather than residual attention mass alone, is the dominant source of error in KV cache eviction. MomentKV maintains count, key mean, value mean, and value-key covariance over the evicted tokens; during eviction these moments identify tokens already captured by the summary, and during inference they yield a closed-form first-order approximation of the evicted contribution to attention, forming a mutually reinforcing loop that improves output quality at fixed cache budgets.

What carries the argument

Compact moment statistics (count, key mean, value mean, value-key covariance) maintained over the evicted token set, used both to enforce geometric regularity during eviction and to compute a closed-form correction during inference.

If this is right

At any fixed cache budget the method produces higher-quality outputs than eviction policies that only minimize residual attention mass.
The largest accuracy gains appear under aggressive compression ratios where directional mismatch is most pronounced.
The same moment statistics can be updated incrementally with constant memory overhead independent of the number of evicted tokens.
The eviction policy and the inference correction become mutually reinforcing, so better selection improves the correction and vice versa.
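The constant-memory claim in the list above can be sketched with a small accumulator. This is an illustrative sketch, not the authors' implementation: the class name `EvictedMoments`, its method names, and the choice to store raw sums (rather than running means) are assumptions, and keys and values are assumed to share one head dimension.

```python
class EvictedMoments:
    """Running sums over evicted (key, value) pairs.

    Storage is O(d^2) for head dimension d, independent of how many
    tokens have been evicted. Hypothetical layout, not the paper's code.
    """

    def __init__(self, dim):
        self.n = 0                      # number of evicted tokens
        self.sum_k = [0.0] * dim        # sum of evicted keys
        self.sum_v = [0.0] * dim        # sum of evicted values
        # sum_vk[a][b] accumulates value[a] * key[b] (outer product v k^T).
        self.sum_vk = [[0.0] * dim for _ in range(dim)]

    def evict(self, key, value):
        """Fold one evicted token into the summary; the token itself is dropped."""
        self.n += 1
        for b, kb in enumerate(key):
            self.sum_k[b] += kb
        for a, va in enumerate(value):
            self.sum_v[a] += va
            row = self.sum_vk[a]
            for b, kb in enumerate(key):
                row[b] += va * kb

    # The accessors below assume at least one token has been evicted.
    def key_mean(self):
        return [s / self.n for s in self.sum_k]

    def value_mean(self):
        return [s / self.n for s in self.sum_v]

    def centered_cov(self):
        """Centered value-key cross moment: sum_vk - sum_v sum_k^T / n."""
        return [[self.sum_vk[a][b] - self.sum_v[a] * self.sum_k[b] / self.n
                 for b in range(len(self.sum_k))]
                for a in range(len(self.sum_v))]
```

Each call to `evict` costs O(d^2) time and leaves the memory footprint unchanged, so the summary never needs to revisit tokens that have already been dropped.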

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may extend to other attention-based architectures if the first-order moment approximation remains stable under different head dimensions or position encodings.
Tracking higher-order moments could further reduce residual directional error if the first-order correction proves insufficient on some tasks.
Because the correction is closed-form, it could be fused into existing attention kernels with negligible added latency.

Load-bearing premise

The moment statistics supply a first-order approximation of the evicted attention output that stays accurate enough to reinforce the selective eviction policy across the full generation.
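This page does not show the paper's correction formula, so the following is only one plausible reading of "first-order approximation", written as a sketch. It assumes unscaled dot-product attention (no 1/sqrt(d) factor) and linearizes `exp(q·k_i)` around the evicted key mean; the function names and the toy data are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def exact_attention(q, keys, values):
    """Unscaled softmax attention over every (key, value) pair."""
    w = [math.exp(dot(q, k)) for k in keys]
    z = sum(w)
    return [sum(wi * v[a] for wi, v in zip(w, values)) / z
            for a in range(len(values[0]))]

def corrected_attention(q, kept_keys, kept_values, n_e, mu_k, mu_v, cov_vk):
    """Retained tokens exactly, evicted tokens via a first-order expansion.

    ASSUMPTION (not the paper's formula): exp(q.k_i) is linearized around
    the evicted key mean mu_k, so the evicted set contributes
      denominator ~ n_e * exp(q.mu_k)
      numerator   ~ exp(q.mu_k) * (n_e * mu_v + cov_vk @ q)
    with cov_vk[a][b] = sum_i (v_i - mu_v)[a] * (k_i - mu_k)[b].
    """
    w = [math.exp(dot(q, k)) for k in kept_keys]
    base = math.exp(dot(q, mu_k))
    z = sum(w) + n_e * base
    out = []
    for a in range(len(mu_v)):
        kept = sum(wi * v[a] for wi, v in zip(w, kept_values))
        evicted = base * (n_e * mu_v[a] + dot(cov_vk[a], q))
        out.append((kept + evicted) / z)
    return out

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Toy data: one retained token, two evicted tokens with nearby keys.
q = [1.0, 0.0]
kept_k, kept_v = [[2.0, 0.0]], [[1.0, 0.0]]
ev_k, ev_v = [[0.1, 0.0], [-0.1, 0.0]], [[0.0, 1.0], [0.0, 3.0]]

# The four statistics: count, key mean, value mean, value-key covariance.
n_e = len(ev_k)
mu_k = [sum(k[b] for k in ev_k) / n_e for b in range(2)]
mu_v = [sum(v[a] for v in ev_v) / n_e for a in range(2)]
cov_vk = [[sum((v[a] - mu_v[a]) * (k[b] - mu_k[b]) for k, v in zip(ev_k, ev_v))
           for b in range(2)] for a in range(2)]

full = exact_attention(q, kept_k + ev_k, kept_v + ev_v)
truncated = exact_attention(q, kept_k, kept_v)   # eviction, no correction
corrected = corrected_attention(q, kept_k, kept_v, n_e, mu_k, mu_v, cov_vk)

err_truncated = l2(truncated, full)
err_corrected = l2(corrected, full)
print(f"truncated error: {err_truncated:.4f}")
print(f"corrected error: {err_corrected:.4f}")
```

In this toy setup the evicted keys are tightly clustered, so the linearization is accurate and the corrected output lands much closer to full attention than the truncated one. Whether that stays true when evicted keys are spread out, or across a long generation, is the empirical question the premise above depends on.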

What would settle it

Measure whether the first-order moment-based correction error exceeds a small threshold on held-out long sequences; if the approximation error grows with sequence length or model scale while output degradation persists, the claim fails.

Figures

Figures reproduced from arXiv: 2606.01563 by Binxu Li, Tian Lan, Yu Li.

**Figure 1.** MOMENTKV maintains moment statistics over evicted tokens to jointly improve eviction decisions and correct the post-eviction attention output. When evicted KV pairs carry non-negligible attention mass, this renormalization shifts the attention output toward the subspace spanned by retained value vectors, introducing a systematic directional bias that better scoring cannot resolve (Choromanski et al., 2020;…

**Figure 2.** Eviction error analysis on LLaMA-3-8B with H2O selection at…

**Figure 3.** Empirical validation on LLaMA-3-8B at L=128 using the Qasper dataset. (a) Per-sample eviction loss: H2O vs MOMENTKV. (b) Layer-wise error and reduction percentage. (c) Cosine similarity between $f_E$ and $\hat{f}_E$. or revisiting of evicted tokens. At query time, the means and covariance are recovered via the identity $\tilde{S} = S - s_v s_k^\top / n_e$, and the total storage per head is $O(d^2)$, independent of context lengt…

**Figure 4.** LongBench avg. score across cache budgets on LLaMA-3.1-8B. The advantage is largest at small L and narrows as the budget grows.
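The identity quoted in the text captured with the Figure 3 caption can be checked by expanding the centered sum. The derivation below is a sketch under definitions this page does not state explicitly and that are assumed here: $s_k = \sum_i k_i$, $s_v = \sum_i v_i$, and $S = \sum_i v_i k_i^\top$ are running sums over the $n_e$ evicted tokens, and $\tilde{S}$ is the centered value-key cross moment.

```latex
\begin{aligned}
\mu_k &= s_k / n_e, \qquad \mu_v = s_v / n_e, \\
\tilde{S} &= \sum_{i=1}^{n_e} (v_i - \mu_v)(k_i - \mu_k)^\top \\
          &= \sum_{i} v_i k_i^\top \;-\; s_v \mu_k^\top \;-\; \mu_v s_k^\top \;+\; n_e\, \mu_v \mu_k^\top \\
          &= S \;-\; \frac{s_v s_k^\top}{n_e} \;-\; \frac{s_v s_k^\top}{n_e} \;+\; \frac{s_v s_k^\top}{n_e} \\
          &= S \;-\; \frac{s_v s_k^\top}{n_e}.
\end{aligned}
```

Under these assumptions only $n_e$, $s_k$, $s_v$, and $S$ need to be stored, which is consistent with the $O(d^2)$ per-head storage mentioned in the caption text.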

Original abstract

Autoregressive decoding in Transformer-based language models relies on the KV cache, whose memory footprint grows linearly with sequence length and becomes the primary bottleneck for long-context inference. KV cache eviction addresses this by retaining a fixed-size subset of key-value pairs and discarding the rest. We identify that a primary source of output degradation is not the residual attention mass on evicted tokens, which existing methods already minimize, but a directional mismatch between the retained and evicted token sets. Specifically, the evicted tokens in practice are often near-orthogonal to the retained ones. Thus, even a small evicted mass could have an oversized impact on the resulting direction distribution and amplify into substantial output error. This reveals a fundamental limit in existing strategies. To address this, we propose MomentKV, which maintains compact, small-size moment statistics over the evicted token set, including a count, key mean, value mean, and value-key covariance. During eviction, the moment statistics is leveraged to identify tokens already well aligned with and captured by the accumulated summary, keeping the evicted set geometrically regular. During inference, they yield a closed-form first-order approximation of the evicted attention output, forming a mutually reinforcing loop between selective eviction and accurate correction. On LongBench and RULER with LLaMA-3.1-8B-Instruct and Qwen3-4B-Instruct, MomentKV outperforms all baselines at every cache budget, with the largest gains under aggressive compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MomentKV flags directional mismatch as the real KV eviction issue and uses four moment stats for both selection and correction, with reported gains on benchmarks.

The main point is that this work claims existing KV eviction methods already cut residual attention mass but still lose performance because evicted tokens tend to be near-orthogonal to the retained set, skewing the output direction. MomentKV keeps a small set of moments (count, key mean, value mean, value-key covariance) over the evicted tokens to pick ones that stay geometrically aligned and to supply a closed-form first-order correction at inference time.

What stands out is the explicit diagnosis of the directional gap and the dual use of the same statistics for eviction and correction. The abstract positions this as distinct from prior approaches. On the results side, it reports consistent outperformance over baselines on LongBench and RULER using LLaMA-3.1-8B and Qwen3-4B at multiple cache budgets, with bigger margins under heavy compression.

The soft spot is that the abstract supplies no derivations, error bounds, or ablation details on how accurate the moment-based approximation actually is in practice. The mutually reinforcing loop between eviction and correction therefore rests on the experiments holding up; without those controls visible here, it is hard to judge robustness. The geometric observation itself looks independent of the stats, which is a plus.

This is for people working on inference-time memory reduction for long-context models. A reader already following KV cache papers would get a concrete new technique to test. It deserves peer review because the directional framing is clear, the method is simple to implement, and the benchmark claims are specific enough to evaluate.

Referee Report

2 major / 1 minor

Summary. The paper claims that KV cache eviction degrades output primarily due to directional mismatch (evicted tokens often near-orthogonal to retained ones) rather than residual attention mass on evicted tokens. It introduces MomentKV, which maintains compact moment statistics (count, key mean, value mean, value-key covariance) over the evicted set. These statistics guide selective eviction to keep the evicted set geometrically regular and enable a closed-form first-order approximation of the evicted attention output during inference, forming a mutually reinforcing loop. Experiments on LongBench and RULER with LLaMA-3.1-8B-Instruct and Qwen3-4B-Instruct show outperformance over all baselines at every cache budget, with largest gains under aggressive compression.

Significance. If the directional-mismatch diagnosis holds and the moment-based first-order correction proves sufficiently accurate, the work could address a previously untargeted geometric source of error in KV eviction, enabling more reliable aggressive compression for long-context inference without the accuracy cliffs seen in prior mass-minimization approaches.

major comments (2)

[Abstract] The central diagnosis that 'evicted tokens in practice are often near-orthogonal to the retained ones' and produce an 'oversized impact on the resulting direction distribution' is load-bearing for the claim of a 'fundamental limit in existing strategies,' yet the abstract supplies no quantitative support, geometric analysis, or controls to establish this orthogonality or its effect size.
[Abstract] The claim that the maintained moment statistics 'yield a closed-form first-order approximation of the evicted attention output' (forming the 'mutually reinforcing loop') is presented without the approximation formula, its derivation, or any error analysis, which is required to assess whether the approximation is accurate enough to support the proposed correction mechanism.

minor comments (1)

[Abstract] The phrasing 'the moment statistics is leveraged' contains a subject-verb agreement issue that should be corrected for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on the abstract. We agree that the abstract should better substantiate its central claims and will revise it accordingly while preserving its brevity.

Referee: [Abstract] The central diagnosis that 'evicted tokens in practice are often near-orthogonal to the retained ones' and produce an 'oversized impact on the resulting direction distribution' is load-bearing for the claim of a 'fundamental limit in existing strategies,' yet the abstract supplies no quantitative support, geometric analysis, or controls to establish this orthogonality or its effect size.

Authors: We acknowledge that the abstract presents the orthogonality diagnosis without supporting numbers or controls. The manuscript body contains the requested geometric analysis (cosine-similarity distributions and directional-impact ablations) that establishes the effect size. To make the abstract self-contained, we will insert a concise quantitative statement drawn from those results. Revision: yes.

Referee: [Abstract] The claim that the maintained moment statistics 'yield a closed-form first-order approximation of the evicted attention output' (forming the 'mutually reinforcing loop') is presented without the approximation formula, its derivation, or any error analysis, which is required to assess whether the approximation is accurate enough to support the proposed correction mechanism.

Authors: The abstract states the existence of the closed-form first-order approximation without displaying the formula or error bounds. The derivation (via moment-based linearization of the attention output) and accompanying error analysis appear in the main text. We will revise the abstract to include the compact approximation expression and a one-sentence reference to the bounded-error result. Revision: yes.

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

The paper identifies a directional mismatch in evicted KV tokens as the core issue and introduces moment statistics (count, means, covariance) to guide eviction and provide a closed-form first-order correction. No equations, fitted parameters, or self-citations are shown that reduce the claimed approximation, the mutually reinforcing loop, or the performance gains to a self-defined quantity or tautology. The geometric observation and moment-based design remain independent of the target outputs, with the accuracy of the approximation explicitly treated as an empirical matter rather than an internal necessity. This matches the default expectation of self-contained construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields minimal ledger entries; the method rests on a domain assumption about attention geometry and introduces moment statistics as the core addition without explicit free parameters or new physical entities.

axioms (1)

Domain assumption: Evicted tokens are often near-orthogonal to retained tokens, producing directional mismatch that amplifies output error beyond residual mass alone.
Stated explicitly in the abstract as the primary source of degradation.

pith-pipeline@v0.9.1-grok · 5789 in / 1305 out tokens · 39733 ms · 2026-06-28T16:05:44.049609+00:00 · methodology

Reference graph

Works this paper leans on

21 extracted references · 13 linked inside Pith

[1] Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Baobao Chang, Junjie Hu, and Wen Xiao. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling. ArXiv, abs/2406.02069.
[2] Krzysztof Choromanski, Valerii Likhosherstov, David Dohan, Xingyou Song, Andreea Gane, Tamas Sarlos, Peter Hawkins, Jared Davis, Afroz Mohiuddin, Lukasz Kaiser, et al. Rethinking attention with performers. arXiv preprint arXiv:2009.14794.
[3] Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691.
[4] Alessio Devoto, Yu Zhao, Simone Scardapane, and Pasquale Minervini. A simple and effective l2 norm-based strategy for kv cache compression. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp. 18476–18499, 2024.
[5] Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Ada-kv: Optimizing kv cache eviction by adaptive budget allocation for efficient llm inference. arXiv preprint arXiv:2407.11550.
[6] Yuan Feng, Junlin Lv, Yukun Cao, Xike Xie, and S Kevin Zhou. Identify critical kv cache in llm inference from an output perturbation perspective. arXiv preprint arXiv:2502.03805.
[7] Raghavv Goel, Junyoung Park, Mukul Gagrani, Dalton Jones, Matthew Morse, Harper Langston, Mingu Lee, and Chris Lott. Caote: Kv cache selection for llms via attention output error-based token eviction. arXiv preprint arXiv:2504.14051.
[8] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[9] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What's the real context size of your long-context language models? arXiv preprint arXiv:2404.06654.
[10] Shuowei Jin, Xueshen Liu, Qingzhao Zhang, and Z Morley Mao. Compute or load kv cache? why not both? arXiv preprint arXiv:2410.03065.
[11] Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, and Lei Chen. A survey on large language model acceleration based on kv cache management. arXiv preprint arXiv:2412.19442, 2024a. Yu Li, Tian Lan, and Zhengling Qi. When right meets wrong: Bilateral context conditioning with reward-confidence correction...
[12] Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr F. Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. ArXiv, abs/2404.14469, 2024b. URL https://api.semanticscholar.org/CorpusID:269303164. Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo X...
[13] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750.
[14] Piotr Nawrot, Adrian Łańcucki, Marcin Chochowski, David Tarjan, and Edoardo M Ponti. Dynamic memory compression: Retrofitting llms for accelerated inference. arXiv preprint arXiv:2403.09636.
[15] Tianyu Ruan and Shihua Zhang. Towards understanding how attention mechanism works in deep learning. arXiv preprint arXiv:2412.18288.
[16] Zheng Wang, Boxiao Jin, Zhongzhi Yu, and Minjia Zhang. Model tells you where to merge: Adaptive kv cache merging for llms on long-context tasks. ArXiv, abs/2407.08454. URL https://api.semanticscholar.org/CorpusID:271097687.
[17] Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453.
[18] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.
[19] Jiayi Yuan, Hongyi Liu, Shaochen Zhong, Yu-Neng Chuang, Songchen Li, Guanchu Wang, Duy Le, Hongye Jin, Vipin Chaudhary, Zhaozhuo Xu, et al. Kv cache compression, but what must we give in return? a comprehensive benchmark of long context capable approaches. In Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 4623–4648, 2024.
[20] Ruijie Zhang, Haozhe Liang, Da Chang, Li Hu, Fanqi Kong, Huaxiao Yin, and Yu Li. When does value-aware kv eviction help? a fixed-contract diagnostic for non-monotone cache compression. arXiv preprint arXiv:2605.08234.
[21] Zhenyu (Allen) Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark W. Barrett, Zhangyang Wang, and Beidi Chen. H2o: Heavy-hitter oracle for efficient generative inference of large language models. ArXiv, abs/2306.14048.
