pith. machine review for the scientific record.

arxiv: 2502.01068 · v7 · submitted 2025-02-03 · 💻 cs.LG · cs.CL

Recognition: unknown

FastKV: Decoupling of Context Reduction and KV Cache Compression for Prefill-Decoding Acceleration

Authors on Pith: no claims yet
classification 💻 cs.LG cs.CL
keywords: prefill, fastkv, decoding, accuracy, cache, reduction, acceleration, budget
read the original abstract

While large language models (LLMs) excel at handling long-context sequences, they require substantial prefill computation and key-value (KV) cache, which can heavily burden computational efficiency and memory usage in both prefill and decoding stages. Recent works that compress KV caches with prefill acceleration reduce this cost but inadvertently tie the prefill compute reduction to the decoding KV budget. This coupling arises from overlooking the layer-dependent variation of critical context, often leading to accuracy degradation. To address this issue, we introduce FastKV, a KV cache compression framework designed to reduce latency in both prefill and decoding by leveraging the stabilization of token importance in later layers. FastKV performs full-context computation until a Token-Selective Propagation (TSP) layer, which forwards only the most informative tokens to subsequent layers. From these propagated tokens, FastKV independently selects salient KV entries for caching, thereby decoupling KV budget from the prefill compute reduction based on the TSP decision. This independent control of the TSP rate and KV retention rate enables flexible optimization of efficiency and accuracy. Experimental results show that FastKV achieves speedups of up to 1.82$\times$ in prefill and 2.87$\times$ in decoding compared to the full-context baseline, while matching the accuracy of the decoding-only baselines. Our code is available at https://github.com/dongwonjo/FastKV.
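To make the mechanism described in the abstract concrete, below is a minimal, hypothetical Python/PyTorch sketch of TSP-style prefill with an independently chosen KV-cache budget. It is not the authors' implementation (their code is linked above); the toy attention layer, the attention-sum scoring rule, and all names and parameters (toy_attention_layer, tsp_rate, kv_budget, etc.) are illustrative assumptions.

# Hypothetical sketch: full-context prefill up to a TSP layer, propagation of
# only the most informative tokens afterwards, and a separately controlled
# KV-cache budget. NOT the authors' code; all names here are assumptions.
import torch

def toy_attention_layer(hidden: torch.Tensor):
    """Single-head scaled dot-product attention over `hidden` [seq_len, d]."""
    d = hidden.shape[-1]
    q, k, v = hidden, hidden, hidden                     # toy: no learned projections
    attn = torch.softmax(q @ k.T / d ** 0.5, dim=-1)     # [seq_len, seq_len]
    return attn @ v, k, v, attn

def top_indices(scores: torch.Tensor, keep_ratio: float) -> torch.Tensor:
    """Indices of the highest-scoring tokens, restored to original order."""
    n = max(1, int(scores.numel() * keep_ratio))
    return torch.topk(scores, n).indices.sort().values

def prefill_with_tsp(hidden, num_layers=8, tsp_layer=4,
                     tsp_rate=0.25, kv_budget=0.10):
    """`tsp_rate` (prefill compute) and `kv_budget` (decoding memory) are set
    independently, mirroring the decoupling claimed in the abstract."""
    kv_cache = []
    for i in range(num_layers):
        hidden, k, v, attn = toy_attention_layer(hidden)
        token_scores = attn.sum(dim=0)                   # attention mass each token receives
        keep = top_indices(token_scores, kv_budget)      # salient KV entries to cache
        kv_cache.append((k[keep], v[keep]))
        if i == tsp_layer:                               # forward only informative tokens
            hidden = hidden[top_indices(token_scores, tsp_rate)]
    return hidden, kv_cache

context = torch.randn(1024, 64)                          # 1024-token dummy context
_, cache = prefill_with_tsp(context)
print([k.shape[0] for k, _ in cache])                    # cached KV entries per layer

In this sketch, layers after the TSP layer operate only on the propagated subset of tokens (cutting prefill compute), while the number of KV entries kept per layer is governed by the separate kv_budget knob, so the two rates can be tuned against each other for efficiency versus accuracy.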

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. OmniDrop: Layer-wise Token Pruning for Omni-modal LLMs via Query-Guidance

    cs.AI · 2026-05 · unverdicted · novelty 6.0

    OmniDrop is a training-free layer-wise token pruning framework for omni-modal LLMs that uses query guidance and temporal diversity to reduce prefill latency by up to 40% and memory by 14.7% while improving benchmark s...

  2. SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models

    cs.LG · 2026-04 · unverdicted · novelty 6.0

    SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.

  3. StructKV: Preserving the Structural Skeleton for Scalable Long-Context Inference

    cs.CL · 2026-04 · unverdicted · novelty 6.0

    StructKV compresses LLM KV caches by tracking global in-degree centrality across network depth and dynamically selecting compression layers to preserve long-range dependencies better than local pruning methods.