pith. sign in

arxiv: 2502.03805 · v2 · pith:MN2EQJXQnew · submitted 2025-02-06 · 💻 cs.CL

CriticalKV: Optimizing KV Cache Eviction from an Output Perturbation Perspective

classification 💻 cs.CL
keywords cacheentriesperturbationalgorithmattentioncriticalevictionoutput
0
0 comments X
read the original abstract

Large language models have revolutionized natural language processing but face significant challenges of high storage and runtime costs, due to the transformer architecture's reliance on self-attention, particularly the large KV cache for long-sequence inference. Recent efforts to reduce KV cache size by pruning less critical entries based on attention weights remain empirical and lack formal grounding. This paper presents a formal study on identifying critical KV cache entries by analyzing attention output perturbation. Our analysis reveals that, beyond attention weights, the value states within KV entries and pretrained parameter matrices are also crucial. Based on this, we propose a perturbation-constrained selection algorithm that optimizes the worst-case output perturbation to identify critical entries. We demonstrate that our algorithm is a universal, plug-and-play enhancement that incurs negligible computational overhead. When integrated with three state-of-the-art cache eviction methods on three distinct LLMs, our algorithm significantly reduces the compression loss by more than \textit{half} on average across 29 datasets from the Ruler and LongBench benchmarks. Further perturbation analysis, at both the head and layer levels, confirms the principles underlying our effectiveness. This work offers a new, formally grounded perspective to cache eviction , opening promising avenues for future research. The code is publicly available at https://github.com/FFY0/DefensiveKV.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Head-Aware Key-Value Compression for Efficient Autoregressive Image Generation

    cs.CV 2026-05 conditional novelty 7.0

    HeadKV compresses KV cache for autoregressive image generation via head-aware budget allocation, early head-type identification from consistent patterns, and stratified token eviction.

  2. Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective

    cs.LG 2026-04 unverdicted novelty 7.0

    KV cache eviction is unified under an information capacity maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score based implementation that outperforms prior...

  3. RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache ...

  4. Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

    cs.CL 2026-05 unverdicted novelty 6.0

    LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

  5. When Does Value-Aware KV Eviction Help? A Fixed-Contract Diagnostic for Non-Monotone Cache Compression

    cs.LG 2026-05 unverdicted novelty 6.0

    A fixed-contract probe shows value-aware KV eviction recovers needed evidence in 72.6% of accuracy-improving cases on LongBench but only 32.4% otherwise, suggesting an order of recover evidence, rank value, then prese...

  6. Ada-KV: Optimizing KV Cache Eviction by Adaptive Budget Allocation for Efficient LLM Inference

    cs.CL 2024-07 accept novelty 6.0

    Ada-KV is the first head-wise adaptive KV cache budget allocator for LLMs, using a theoretical loss upper bound to allocate eviction differently per attention head and yielding higher quality than uniform methods on l...