KVLink: Accelerating Large Language Models via Efficient KV Cache Reuse

Jingbo Yang, Bairu Hou, Wei Wei, Yujia Bao, Shiyu Chang · 2025 · arXiv 2502.16002

8 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 2 · method 1

citation-polarity summary

background 3

representative citing papers

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents

cs.AI · 2026-05-19 · unverdicted · novelty 6.0

PEEK maintains a constant-sized context map via a programmable cache policy to give LLM agents persistent orientation knowledge about recurring external contexts, yielding 6-34% gains and lower cost than prior prompt-learning methods.

Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention

cs.LG · 2026-04-22 · unverdicted · novelty 6.0

Gist Sparse Attention uses learnable gist compression tokens as both summaries and routing signals, then selectively unfolds relevant raw chunks for fine-grained attention, outperforming compression and sparse-attention baselines on LongBench and RAG tasks at 8x-32x compression.

SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing

cs.DB · 2026-04-16 · unverdicted · novelty 6.0

SAGE is a training-free context reduction method that converts attention signals from a small LLM into a differential relevance heatmap to select top units for downstream QA, achieving competitive accuracy at 10% token budget on benchmarks like QuALITY-hard.

TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing

cs.DC · 2026-04-03 · unverdicted · novelty 6.0

TokenDance scales multi-agent LLM serving to 2.7x more concurrent agents by collective KV cache reuse and block-sparse diff encoding that achieves 11-17x compression.

CacheClip: Accelerating RAG with Effective KV Cache Reuse

cs.LG · 2025-10-11 · unverdicted · novelty 6.0

CacheClip accelerates RAG prefill by up to 3.33x via auxiliary-model-guided selective KV recomputation while retaining 85-91% of full-attention quality on NIAH and LongBench.

HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

cs.DC · 2026-04-18 · unverdicted · novelty 5.0

HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and

HUOZIIME: An On-Device LLM-enhanced Input Method for Deep Personalization

cs.CL · 2026-03-23 · unverdicted · novelty 5.0

HUOZIIME is an on-device LLM-powered input method with post-training on synthesized data and hierarchical memory that achieves efficient execution and memory-driven personalization.

From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

cs.IR · 2025-04-22 · unverdicted · novelty 5.0

The paper surveys human memory categories, maps them to LLM memory, and proposes a new three-dimension (object, form, time) categorization into eight quadrants to organize existing work and highlight open problems.

citing papers explorer

PEEK: Context Map as an Orientation Cache for Long-Context LLM Agents · cs.AI · 2026-05-19 · unverdicted · none · ref 45
Forget, Then Recall: Learnable Compression and Selective Unfolding via Gist Sparse Attention · cs.LG · 2026-04-22 · unverdicted · none · ref 29
SAGE: Selective Attention-Guided Extraction for Token-Efficient Document Indexing · cs.DB · 2026-04-16 · unverdicted · none · ref 47
TokenDance: Scaling Multi-Agent LLM Serving via Collective KV Cache Sharing · cs.DC · 2026-04-03 · unverdicted · none · ref 39
CacheClip: Accelerating RAG with Effective KV Cache Reuse · cs.LG · 2025-10-11 · unverdicted · none · ref 25
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention · cs.DC · 2026-04-18 · unverdicted · none · ref 53
HUOZIIME: An On-Device LLM-enhanced Input Method for Deep Personalization · cs.CL · 2026-03-23 · unverdicted · none · ref 7
From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs · cs.IR · 2025-04-22 · unverdicted · none · ref 142
