LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction

· 2026 · cs.LG · arXiv 2605.06676

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Long-context inference in Large Language Models (LLMs) is bottlenecked by the linear growth of Key-Value (KV) cache memory. Existing KV cache compression paradigms are fundamentally limited by heuristics: heuristic budgeting relies on statistical priors rather than task objectives, causing resource misallocation, while heuristic selection relies on coupled query-key interactions or static inductive biases (e.g., attention sinks). To address this limitation, we introduce LKV (Learned KV Eviction), which formulates KV compression as an end-to-end differentiable optimization problem. LKV integrates LKV-H to learn task-optimized global budgets, and LKV-T to derive intrinsic KV importance without materializing attention matrices. This design bypasses heuristic proxies, strictly aligning compression with task objectives. Extensive evaluations demonstrate that LKV achieves state-of-the-art performance on both LongBench and RULER benchmarks at high compression rates. In particular, on LongBench, LKV achieves near-lossless performance with only 15% KV cache retention. Crucially, our analysis identifies learned budgeting as the dominant driver of fidelity, demonstrating that data-driven allocation is essential to overcome the limitations of hand-crafted heuristics.
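The abstract does not spell out how LKV-H or LKV-T work internally, so the following is only a minimal sketch of the general setting it describes: a global retention budget (here 15%) is split across attention heads, and each head then keeps its highest-scoring KV entries. The function names `allocate_budgets` and `select_tokens`, the fixed `head_weights`, and the toy `scores` are all hypothetical stand-ins; in LKV both the budgets and the importance scores would be learned end to end rather than supplied by hand.

```python
from typing import List


def allocate_budgets(head_weights: List[float], num_tokens: int,
                     retention: float) -> List[int]:
    """Split a global KV budget across heads in proportion to head_weights.

    head_weights is a hypothetical stand-in for learned per-head budgets.
    """
    if not 0.0 < retention <= 1.0:
        raise ValueError("retention must be in (0, 1]")
    num_heads = len(head_weights)
    # Total number of KV entries kept across all heads.
    total = int(round(retention * num_tokens * num_heads))
    weight_sum = sum(head_weights)
    # Proportional share per head, floored and capped at the sequence length.
    raw = [w * total / weight_sum for w in head_weights]
    budgets = [min(num_tokens, int(r)) for r in raw]
    # Hand out any remaining entries, largest fractional remainder first.
    leftover = total - sum(budgets)
    order = sorted(range(num_heads), key=lambda h: raw[h] - int(raw[h]),
                   reverse=True)
    while leftover > 0:
        for h in order:
            if leftover == 0:
                break
            if budgets[h] < num_tokens:
                budgets[h] += 1
                leftover -= 1
    return budgets


def select_tokens(scores: List[List[float]],
                  budgets: List[int]) -> List[List[int]]:
    """For each head, keep the indices of its top-k scoring tokens."""
    kept = []
    for head_scores, k in zip(scores, budgets):
        ranked = sorted(range(len(head_scores)),
                        key=lambda i: head_scores[i], reverse=True)
        kept.append(sorted(ranked[:k]))  # restore positional order
    return kept


# Toy example: 2 heads, 20 tokens, 15% overall retention.
num_tokens = 20
head_weights = [2.0, 1.0]                 # assumed, not learned
scores = [
    [float(i) for i in range(num_tokens)],               # head 0 favors late tokens
    [float(num_tokens - i) for i in range(num_tokens)],  # head 1 favors early tokens
]
budgets = allocate_budgets(head_weights, num_tokens, retention=0.15)
kept = select_tokens(scores, budgets)
```

The point of the sketch is that the per-head budgets need not be uniform: a head with a larger weight retains more entries while the overall retention rate stays fixed, which is the kind of allocation the abstract credits as the main driver of fidelity.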

representative citing papers

Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models

cs.CV · 2026-05-27 · unverdicted · novelty 6.0

DiffPrune reformulates visual token pruning as continuous control of token information using an Information Throttler with importance-conditioned variance-preserving noise, enabling fully differentiable learning of scores that are hard-thresholded at inference.

