LKV: End-to-End Learning of Head-wise Budgets and Token Selection for LLM KV Cache Eviction
1 Pith paper cites this work. Polarity classification is still indexing.
abstract
Long-context inference in Large Language Models (LLMs) is bottlenecked by the linear growth of Key-Value (KV) cache memory. Existing KV cache compression paradigms are fundamentally limited by heuristics: heuristic budgeting relies on statistical priors rather than task objectives, causing resource misallocation, while heuristic selection relies on coupled query-key interactions or static inductive biases (e.g., attention sinks). To address this limitation, we introduce LKV (Learned KV Eviction), which formulates KV compression as an end-to-end differentiable optimization problem. LKV integrates LKV-H to learn task-optimized global budgets, and LKV-T to derive intrinsic KV importance without materializing attention matrices. This design bypasses heuristic proxies, strictly aligning compression with task objectives. Extensive evaluations demonstrate that LKV achieves state-of-the-art performance on both LongBench and RULER benchmarks at high compression rates. In particular, on LongBench, LKV achieves near-lossless performance with only 15% KV cache retention. Crucially, our analysis identifies learned budgeting as the dominant driver of fidelity, demonstrating that data-driven allocation is essential to overcome the limitations of hand-crafted heuristics.
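As a rough illustration of the two decisions the abstract separates (how many KV entries each head may keep, and which tokens it keeps), the sketch below splits a global retention budget across heads and then keeps the top-scoring tokens per head. It is not the LKV method: the per-head logits, the importance scores, and the function names `budgets_from_logits` and `evict_kv` are hypothetical stand-ins for quantities that LKV-H and LKV-T would learn, and the softmax split is an assumption made only for this example.

```python
import numpy as np

def budgets_from_logits(head_logits, total_budget):
    """Split a global token budget across heads (assumed softmax split)."""
    w = np.exp(head_logits - np.max(head_logits))
    w = w / w.sum()
    raw = w * total_budget
    budgets = np.floor(raw).astype(int)
    # Hand the leftover slots to the heads with the largest fractional share.
    remainder = int(total_budget - budgets.sum())
    order = np.argsort(-(raw - budgets), kind="stable")
    budgets[order[:remainder]] += 1
    return budgets

def evict_kv(keys, values, scores, budgets):
    """Keep the highest-scoring tokens per head.

    keys, values: arrays of shape (heads, tokens, dim)
    scores:       array of shape (heads, tokens), hypothetical importance
    budgets:      one integer per head
    Returns a list of (kept_keys, kept_values, kept_indices) per head.
    """
    num_tokens = scores.shape[1]
    kept = []
    for h, b in enumerate(budgets):
        # A head cannot keep more tokens than exist, so clip its budget.
        b = int(min(max(int(b), 0), num_tokens))
        top = np.argsort(scores[h], kind="stable")[::-1][:b]
        idx = np.sort(top)  # restore original token order
        kept.append((keys[h, idx], values[h, idx], idx))
    return kept

if __name__ == "__main__":
    heads, tokens, dim = 4, 100, 8
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(heads, tokens, dim))
    values = rng.normal(size=(heads, tokens, dim))
    scores = rng.random(size=(heads, tokens))
    # 15% overall retention, matching the rate quoted in the abstract.
    total = int(0.15 * heads * tokens)
    budgets = budgets_from_logits(np.array([0.5, -0.2, 1.0, 0.0]), total)
    for h, (k, v, idx) in enumerate(evict_kv(keys, values, scores, budgets)):
        print(h, len(idx), k.shape)
```

Unequal logits give unequal per-head budgets that still sum to the global budget. Where a head's clipped budget falls below its share, the total kept is smaller than the global budget.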
fields
cs.CV 1
years
2026 1
verdicts
UNVERDICTED 1
representative citing papers
- Beyond Surrogate Gradients: Fully Differentiable Token Pruning for Vision-Language Models
DiffPrune reformulates visual token pruning as continuous control of token information using an Information Throttler with importance-conditioned variance-preserving noise, enabling fully differentiable learning of scores that are hard-thresholded at inference.