EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments

2025 · cs.CL · arXiv 2509.17396

1 Pith paper cites this work. Polarity classification is still indexing.



abstract

Modern large language models (LLMs) extend context lengths to millions of tokens, enabling coherent, personalized responses grounded in long conversational history. However, the Key-Value (KV) cache grows linearly with the extended dialogue history, causing the model's memory footprint to quickly exceed device limits. While recent KV cache compression methods attempt to reduce memory usage, most apply cache eviction after processing the entire context, incurring unbounded peak memory usage. Additionally, query-dependent eviction narrows the cache semantics to a single query, leading to failure cases in multi-turn conversations. In this paper, we introduce EpiCache, a training-free KV cache management framework for long conversational question answering (LongConvQA) under fixed memory budgets. EpiCache bounds cache growth through block-wise prefill and preserves topic-relevant context via episodic KV compression, which clusters conversation history into coherent episodes and performs episode-specific KV cache eviction. Across three LongConvQA benchmarks (LongMemEval, Realtalk, and LoCoMo), EpiCache improves accuracy by up to 30%, achieves near full-cache accuracy under 4-6x compression, and reduces latency and peak memory by up to 2.4x and 3.7x, respectively.
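The bounded-memory claim for block-wise prefill can be illustrated with a toy sketch. This is not the paper's implementation: the function name `blockwise_prefill`, the per-token `scores` list, and the "keep the top-scoring entries" eviction rule are all assumptions made for illustration. EpiCache's actual eviction is episode-specific, and the abstract does not spell out its scoring. The sketch only shows why evicting after each block, rather than after the whole context, keeps the peak cache size at or below `budget + block_size`.

```python
from typing import List, Tuple


def blockwise_prefill(
    scores: List[float], budget: int, block_size: int
) -> Tuple[List[int], int]:
    """Toy block-wise prefill (hypothetical helper, not the EpiCache API).

    scores[i] is an assumed importance score for token position i.
    Returns the kept token positions and the peak cache size observed.
    """
    cache: List[int] = []  # token positions currently held in the cache
    peak = 0
    for start in range(0, len(scores), block_size):
        end = min(start + block_size, len(scores))
        cache.extend(range(start, end))  # prefill one block of tokens
        peak = max(peak, len(cache))
        if len(cache) > budget:
            # Evict down to the budget, keeping the highest-scoring positions.
            cache.sort(key=lambda i: scores[i], reverse=True)
            cache = sorted(cache[:budget])  # restore positional order
    return cache, peak


if __name__ == "__main__":
    toy_scores = [0.1, 0.9, 0.3, 0.8, 0.2, 0.7, 0.4, 0.6]
    kept, peak = blockwise_prefill(toy_scores, budget=3, block_size=2)
    print(kept, peak)  # [1, 3, 5] 5
```

By contrast, evicting only after the full context has been processed would let the cache reach all 8 entries in this toy example before any compression happens, which is the unbounded-peak behaviour the abstract attributes to post-prefill eviction.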

representative citing papers

Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving

cs.LG · 2026-06-04 · unverdicted · novelty 7.0

Tangram makes non-uniform KV cache compression practical for LLM serving with deterministic budget allocation, head group paging, and ahead-of-time load balancing, achieving up to 2.6x throughput gains.

citing papers explorer


Tangram: Unlocking Non-Uniform KV Cache for Efficient Multi-turn LLM Serving

cs.LG · 2026-06-04 · unverdicted · none · ref 17 · internal anchor

Tangram makes non-uniform KV cache compression practical for LLM serving with deterministic budget allocation, head group paging, and ahead-of-time load balancing, achieving up to 2.6x throughput gains.
