arxiv: 2511.00868 · v2 · submitted 2025-11-02 · 💻 cs.LG

FlexiCache: Leveraging Temporal Stability of Attention Heads for Efficient KV Cache Management

Nazmul Takbir , Hamidreza Alikhani , Nikil Dutt , Sangeetha Abdu Jyothi This is my paper

Pith reviewed 2026-05-18 01:46 UTC · model grok-4.3

classification 💻 cs.LG

keywords KV cache managementattention headstemporal stabilityLLM servingGPU memory optimizationlong-context inferenceefficient inference

0 comments p. Extension

The pith

FlexiCache reduces LLM KV cache GPU memory by up to 70% by classifying attention heads according to how steadily they focus on the same critical tokens over time.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that attention heads in large language models differ markedly in the temporal stability of their critical tokens, with some heads consistently attending to the same tokens across generations while others shift their focus frequently. This difference supports a hierarchical KV cache strategy that keeps all pages from unstable heads in GPU memory but retains only the top-K pages from stable heads on the GPU and offloads the rest to host memory. Periodic reranking on stable heads brings newly important pages back to GPU without full recomputation. A sympathetic reader would care because the approach directly attacks the memory wall that limits long-context and long-generation serving, enabling higher throughput and lower latency on existing hardware while keeping output accuracy unchanged.

Core claim

FlexiCache classifies KV heads as stable or unstable based on the consistency of their attention to critical tokens. Unstable heads keep their entire KV cache resident in GPU memory. Stable heads keep only their current top-K KV pages on the GPU and offload the remainder to host memory, with periodic reranking to promote newly critical pages. This selective offloading exploits the observed variation in temporal stability across heads to shrink the overall GPU memory footprint for long-context requests without degrading model accuracy in long-generation scenarios.

What carries the argument

Classification of KV heads by temporal stability of their critical tokens, which drives differential cache retention: full GPU residency for unstable heads versus top-K retention plus offloading for stable heads, with periodic reranking to update the top pages.

If this is right

GPU memory footprint for long-context requests falls by up to 70 percent.
Offline serving throughput rises by factors of 1.38 to 1.55.
Online per-token latency drops by factors of 1.6 to 2.1.
Accuracy is preserved across long-context and long-generation workloads.
The system integrates on top of existing engines such as vLLM.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The stability classification could be recomputed at finer intervals or made adaptive to workload changes for further gains.
Combining this head-level offloading with token-level pruning or quantization might multiply the memory savings.
The same stability signal might apply to other attention-based architectures such as vision or multimodal models.
Lower memory use could allow larger batch sizes or longer contexts on a fixed GPU without accuracy trade-offs.

Load-bearing premise

The temporal stability of critical tokens varies enough across KV heads that stable and unstable groups can be reliably identified and their differing offloading policies will preserve accuracy.

What would settle it

A controlled run on long-context long-generation tasks in which offloading the non-top-K pages of heads labeled stable produces a clear drop in accuracy or coherence compared to keeping all pages on GPU would falsify the claim.

Figures

Figures reproduced from arXiv: 2511.00868 by Hamidreza Alikhani, Nazmul Takbir, Nikil Dutt, Sangeetha Abdu Jyothi.

**Figure 1.** Figure 1: Temporal stability patterns of KV heads. For Llama3.1-8B-Instruct layer 4. Some heads maintain high RCO across offsets, while others show persistently low values. 2.2 Quantifying Temporal Stability of KV Heads We begin by collecting the set of top-K page indices at each decode step for samples from several long-context, long-generation tasks in LongBench (Bai et al., 2024) and L-Eval (An et al., 2024). Co… view at source ↗

**Figure 2.** Figure 2: FlexiCache system architecture. At the worker, the top-K selector identifies the most relevant KV pages for each head, updating them at different frequencies based on head stability. The sparse decode kernel attends only to these selected pages. GPU memory stores the full KV cache of unstable heads and only the top-K pages of stable heads, with the rest in host memory. The block allocator manages this hier… view at source ↗

**Figure 3.** Figure 3: Hierarchical KV-cache placement. For a request with four logical KV pages running on a two-layer, two-head model. possible dot product between q and any key in that page: sp = X i max(qi · k min p,i , qi · k max p,i ) However, FlexiCache differs in two key ways. First, page scoring operates at different frequencies for stable and unstable heads: unstable heads are scored at every step, whereas stable head… view at source ↗

**Figure 5.** Figure 5: End-to-end throughput. FlexiCache consistently outperforms vLLM on both Llama-3.1-8B and Mistral-7B in token throughput, with gains increasing as output length grows. Similar improvements are observed for request throughput with an output length of 500. tings. For the workload, we randomly sample prompts from the L-Eval benchmark rather than using synthetic tokens, since the size of the promoted KV cache d… view at source ↗

**Figure 6.** Figure 6: Online serving. FlexiCache reduces mean TPOT across all arrival rates. By lowering per-request GPU memory usage, it delays TTFT degradation caused by queue buildup at high load, enabling higher sustained request rates while keeping tail TTFT lower. 5 10 15 20 25 30 35 40 Batch size 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 Latency (ms) Dense Decode Sparse Decode (1024) Top-K Selector (1024) Sparse Decode (20… view at source ↗

**Figure 8.** Figure 8: Benefit of Stability-Aware Reranking. Each input has 10k tokens; performance gains increase with larger batch sizes. rerank frequency of 16, unless otherwise specified. Decode Only Speedup [PITH_FULL_IMAGE:figures/full_fig_p009_8.png] view at source ↗

**Figure 9.** Figure 9: GPU Memory Savings. Over 70% memory savings is achieved at sequence lengths >20k with a token budget of 1024. memory savings asymptotically approach 75%, reaching about 70% at 20k tokens and beyond. 5 DISCUSSION AND FUTURE WORK Integration with Other Serving Optimizations. FlexiCache optimizes the decode phase, with up to 4× speedup when focusing on the decode kernel ( [PITH_FULL_IMAGE:figures/full_fig_p… view at source ↗

**Figure 10.** Figure 10: shows the end-to-end throughput gain for MistralSmall-24B-Instruct-2501 and Qwen2.5-32B-Instruct when executed on FlexiCache compared to the vLLM baseline. The workload consists of 100 random inputs from L-Eval, each with input lengths ranging from 10k to 30k tokens and an output length of 500. FlexiCache achieves a 1.37× improvement, indicating that performance gains extend to larger models as well. Mis… view at source ↗

read the original abstract

Large Language Model (LLM) serving is increasingly constrained by the growing size of the key-value (KV) cache, which scales with both context length and generation length. Prior work shows that attention is dominated by a small subset of critical tokens, yet existing systems struggle to exploit this efficiently without degrading accuracy, especially in long generation. We make a key observation: the temporal stability of these critical tokens varies significantly across KV heads: some heads consistently focus on the same tokens, while others shift frequently. Building on this insight, we introduce FlexiCache, a hierarchical KV-cache management system that leverages the temporal stability of KV heads to reduce GPU memory usage and computation overhead, while preserving model accuracy. FlexiCache classifies KV heads as stable or unstable: it retains all KV-cache pages from unstable heads in GPU memory, whereas for stable heads, it keeps only the top-K pages on the GPU and offloads the rest to host memory. By exploiting temporal stability, FlexiCache performs periodic reranking for stable heads to fetch newly promoted top pages. Implemented atop vLLM, FlexiCache reduces GPU memory footprint for long-context requests by up to 70%, improves offline serving throughput by 1.38-1.55x, and lowers online token latency by 1.6-2.1x, all while maintaining accuracy in long-context, long-generation scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FlexiCache splits KV heads by temporal stability to offload most pages from stable ones while keeping everything for unstable heads, with reranking to try to hold accuracy.

read the letter

FlexiCache's main move is to classify attention heads according to how much their focus on critical tokens changes over time. Stable heads get only their current top-K pages kept on GPU with periodic reranking to pull in new ones; unstable heads keep their full cache resident. The system is built on vLLM and the abstract reports up to 70% lower GPU memory for long-context requests, 1.38-1.55x offline throughput, and 1.6-2.1x lower online latency while claiming accuracy is preserved in long-generation settings. That head-level split and the resulting offload policy is the concrete new piece relative to earlier critical-token work. The implementation is practical and the reported serving metrics address a real production pain point. The numbers are specific enough to be worth checking against actual runs. The load-bearing assumption is that periodic reranking keeps the right pages in place for the stable heads without gradual drift that would hurt next-token quality. If stability varies by task or if the rerank interval is too loose, accuracy could erode quietly even while aggregate metrics look fine. The paper would benefit from clearer details on how heads are labeled stable versus unstable, sensitivity analysis on rerank frequency, and direct comparisons to uniform top-K policies. This is the kind of systems paper that inference and serving groups would read. It deserves a serious referee because the problem is central and the approach is straightforward to evaluate, even if the stability claim needs tighter evidence.

Referee Report

2 major / 1 minor

Summary. The paper introduces FlexiCache, a hierarchical KV-cache management system for LLMs that classifies attention heads as stable or unstable according to the temporal stability of their critical tokens. Unstable heads retain all KV-cache pages in GPU memory, while stable heads keep only the top-K pages on GPU (offloading the rest to host memory) and use periodic reranking to update the retained set. Implemented on vLLM, the system claims up to 70% reduction in GPU memory footprint for long-context requests, 1.38-1.55x higher offline serving throughput, and 1.6-2.1x lower online token latency, all while preserving accuracy in long-context, long-generation scenarios.

Significance. If the reported gains and accuracy preservation are substantiated by detailed experiments, FlexiCache would represent a practical contribution to efficient LLM serving by exploiting head-wise differences in temporal stability rather than applying uniform eviction policies. The approach could meaningfully extend feasible context and generation lengths under memory constraints. The work is credited for grounding the design in an empirical observation of stability variation across heads and for providing concrete serving metrics.

major comments (2)

Abstract: the abstract states concrete performance numbers (70% memory reduction, 1.38-1.55x throughput, 1.6-2.1x latency) and accuracy preservation but supplies no experimental details, baselines, datasets, model sizes, or error analysis. This absence prevents assessment of whether the central claims are supported.
Method section (description of stable-head handling): the periodic reranking of top-K pages for stable heads is presented as sufficient to avoid cumulative attention loss, yet no analysis, ablation, or measurement is given on how the rerank interval interacts with generation length to keep attention mass outside the current top-K negligible. This assumption is load-bearing for the simultaneous memory-reduction and accuracy claims in long-generation settings.

minor comments (1)

Clarify the exact procedure and threshold used to classify heads as stable versus unstable, including any sensitivity analysis on this hyperparameter.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential practical contribution of FlexiCache. We respond to each major comment below and indicate the changes we will make in the revised manuscript.

read point-by-point responses

Referee: Abstract: the abstract states concrete performance numbers (70% memory reduction, 1.38-1.55x throughput, 1.6-2.1x latency) and accuracy preservation but supplies no experimental details, baselines, datasets, model sizes, or error analysis. This absence prevents assessment of whether the central claims are supported.

Authors: We agree that the abstract would benefit from additional context to help readers evaluate the claims. The full experimental details—including the vLLM baseline, model sizes (Llama-3 family), datasets (LongBench and similar long-context tasks), generation lengths, and accuracy metrics—are provided in Sections 4 and 5. In the revision we will add a concise clause to the abstract referencing the evaluation setup (e.g., “evaluated on Llama-3 models with LongBench tasks and generation lengths up to several thousand tokens, preserving accuracy within 1 % of the baseline”). This addresses the concern while respecting abstract length limits. revision: yes
Referee: Method section (description of stable-head handling): the periodic reranking of top-K pages for stable heads is presented as sufficient to avoid cumulative attention loss, yet no analysis, ablation, or measurement is given on how the rerank interval interacts with generation length to keep attention mass outside the current top-K negligible. This assumption is load-bearing for the simultaneous memory-reduction and accuracy claims in long-generation settings.

Authors: We concur that a dedicated analysis of the rerank interval’s interaction with generation length would strengthen the paper. The current manuscript describes the periodic reranking mechanism and reports end-to-end accuracy preservation for long generations, but does not include a targeted ablation or measurement of attention mass outside the top-K set. We will add this analysis in the revised version, presenting measurements of retained attention mass for different rerank intervals across generation lengths up to 8 k tokens to confirm that the loss remains negligible and thereby support the joint memory and accuracy claims. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation drives practical system with measured validation

full rationale

The paper starts from a stated empirical observation that temporal stability of critical tokens varies across KV heads, then builds FlexiCache as a hierarchical management system that classifies heads, retains full pages for unstable ones and top-K for stable ones with periodic reranking, and reports concrete measured outcomes (up to 70% memory reduction, 1.38-1.55x throughput, 1.6-2.1x latency) while preserving accuracy in long-context scenarios. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs; accuracy claims rest on experimental benchmarking rather than self-referential definitions or self-citation chains. The derivation is therefore self-contained as an engineering contribution validated externally to its own assumptions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the empirical observation that temporal stability differs markedly across heads and on the assumption that selective offloading for stable heads does not degrade accuracy.

axioms (1)

domain assumption Temporal stability of critical tokens varies significantly across KV heads
This observation is presented as the foundation for classifying heads and deciding which pages to keep on GPU.

pith-pipeline@v0.9.0 · 5792 in / 1251 out tokens · 48026 ms · 2026-05-18T01:46:16.579226+00:00 · methodology

discussion (0)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training
cs.LG 2026-04 unverdicted novelty 7.0

Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.
An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
cs.LG 2026-05 unverdicted novelty 5.0

Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.

Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · cited by 2 Pith papers · 5 internal anchors

[1]

doi: 10.18653/v1/2024.acl-long.776

Association for Computational Linguistics. doi: 10.18653/v1/2024.acl-long.776. URL https: //aclanthology.org/2024.acl-long.776/. Bai, Y ., Lv, X., Zhang, J., Lyu, H., Tang, J., Huang, Z., Du, Z., Liu, X., Zeng, A., Hou, L., Dong, Y ., Tang, J., and Li, J. LongBench: A bilingual, multitask benchmark for long context understanding. In Ku, L.-W., Martins, A....

work page doi:10.18653/v1/2024.acl-long.776 2024
[2]

doi: 10.18653/v1/2024.acl-long.172

Association for Computational Linguis- tics. doi: 10.18653/v1/2024.acl-long.172. URL https: //aclanthology.org/2024.acl-long.172/. Bai, Y ., Zhang, J., Lv, X., Zheng, L., Zhu, S., Hou, L., Dong, Y ., Tang, J., and Li, J. Longwriter: Unleashing 10,000+ word generation from long context LLMs. In The Thirteenth International Conference on Learning Representations,

work page doi:10.18653/v1/2024.acl-long.172 2024
[3]

ISBN 9798400706981

Association for Comput- ing Machinery. ISBN 9798400706981. doi: 10.1145/ 3669940.3707267. URL https://doi.org/10. 1145/3669940.3707267. Cheng, Y ., Liu, Y ., Yao, J., An, Y ., Chen, X., Feng, S., Huang, Y ., Shen, S., Du, K., and Jiang, J. Lmcache: An efficient KV cache layer for enterprise-scale LLM inference. https://lmcache.ai/tech_report. pdf,

work page arXiv
[4]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

White paper. Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261,

work page internal anchor Pith review Pith/arXiv arXiv
[5]

Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al

Accessed: 2025-10-27. Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The llama 3 herd of models.arXiv e-prints, pp. arXiv–2407,

work page 2025
[6]

The Llama 3 Herd of Models

URL https: //openreview.net/forum?id=SuYO70ZxZX. Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Mixtral of Experts

Jiang, A. Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D. S., Casas, D. d. l., Hanna, E. B., Bressand, F., et al. Mixtral of experts.arXiv preprint arXiv:2401.04088, 2024a. Jiang, D., Liu, Y ., Liu, S., Zhao, J., Zhang, H., Gao, Z., Zhang, X., Li, J., and Xiong, H. From clip to dino: Visual encoders shout in multi-modal la...

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Neo: Saving gpu memory crisis with cpu offloading for online llm inference.arXiv preprint arXiv:2411.01142, 2024b

Jiang, X., Zhou, Y ., Cao, S., Stoica, I., and Yu, M. Neo: Saving gpu memory crisis with cpu offloading for online llm inference.arXiv preprint arXiv:2411.01142, 2024b. Kwon, W., Li, Z., Zhuang, S., Sheng, Y ., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., and Stoica, I. Efficient memory management for large language model serving with pagedattention. In...

work page arXiv
[9]

Qserve: W4a8kv4 quantization and system co-design for efficient llm serving

Lin*, Y ., Tang*, H., Yang*, S., Zhang, Z., Xiao, G., Gan, C., and Han, S. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving.arXiv preprint arXiv:2405.04532,

work page arXiv
[10]

Cachegen: Kv cache compression and streaming for fast large language model serving

Liu, Y ., Li, H., Cheng, Y ., Ray, S., Huang, Y ., Zhang, Q., Du, K., Yao, J., Lu, S., Ananthanarayanan, G., Maire, M., Hoffmann, H., Holtzman, A., and Jiang, J. Cachegen: Kv cache compression and streaming for fast large language model serving. InProceedings of the ACM SIGCOMM 2024 Conference, ACM SIGCOMM ’24, pp. 38–56, New York, NY , USA,

work page 2024
[11]

ISBN 9798400706141

Associa- tion for Computing Machinery. ISBN 9798400706141. doi: 10.1145/3651890.3672274. URL https://doi. org/10.1145/3651890.3672274. Sheng, Y ., Zheng, L., Yuan, B., Li, Z., Ryabinin, M., Chen, B., Liang, P., R´e, C., Stoica, I., and Zhang, C. Flexgen: high-throughput generative inference of large language models with a single gpu. InProceedings of the ...

work page doi:10.1145/3651890.3672274
[12]

Efficient Streaming Language Models with Attention Sinks

URL https: //openreview.net/forum?id=3A71qNKWAS. Xiao, G., Tian, Y ., Chen, B., Han, S., and Lewis, M. Ef- ficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Qwen3 Technical Report

URL https://openreview. net/forum?id=cFu7ze7xUm. Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025a. Yang, A., Yu, B., Li, C., Liu, D., Huang, F., Huang, H., Jiang, J., Tu, J., Zhang, J., Zhou, J., et al. Qwen2. 5- 1m technical report.arXiv pre...

work page internal anchor Pith review Pith/arXiv arXiv
[14]

For Mistral-7B- Instruct-v0.2

Cross-task overlap of unstable heads. For Mistral-7B- Instruct-v0.2. Values are the intersection size normalized by the unstable-head set size (|Ai ∩A j|/64). GovReport is from Long- Bench (Bai et al., 2024); the rest from L-Eval (An et al., 2024). Dataset Open- review Big- Patent Multi- News QM- Sum Gov- Report SP- ACE CU- AD Summ- Screen Openreview 1.00...

work page 2024
[15]

For Mistral- Small-24B-Instruct-2501

Cross-task overlap of unstable heads. For Mistral- Small-24B-Instruct-2501. Values are the intersection size nor- malized by the unstable-head set size (|Ai ∩A j|/80). GovReport is from LongBench (Bai et al., 2024); the rest from L-Eval (An et al., 2024). Dataset Open- review Big- Patent Multi- News QM- Sum Gov- Report SP- ACE CU- AD Summ- Screen Openrevi...

work page 2024
[16]

For Qwen2.5- 32B-Instruct

Cross-task overlap of unstable heads. For Qwen2.5- 32B-Instruct. Values are the intersection size normalized by the unstable-head set size (|Ai ∩A j|/128). GovReport is from Long- Bench (Bai et al., 2024); the rest from L-Eval (An et al., 2024). Dataset Open- review Big- Patent Multi- News QM- Sum Gov- Report SP- ACE CU- AD Summ- Screen Openreview 1.00 0....

work page 2024
[17]

L-Eval task statistics: average prompt and generation lengths, and promoted KV size (Llama-3.1-8B, token budget of 2048, 192 stable heads) Task Prompt # Tokens Generation # Tokens TopK-Delta MB LongFQA 5257 81 46.7 GovReport 6125 377 45.0 CUAD 24906 195 66.7 QMSum 15103 132 66.3 Multi-News 6002 367 43.6 Openreview 10084 390 55.2 BigPatent 6363 159 49.5 SP...

work page 2048