Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

Syed Huma Shah (Duke University)

arxiv: 2605.27494 · v1 · pith:FS4IT757new · submitted 2026-05-26 · 💻 cs.CR · cs.AI· cs.CL· cs.IR· cs.LG

Grounded Cache Routing for Retrieval-Augmented Generation: When Is It Safe to Reuse an Answer?

Syed Huma Shah (Duke University) This is my paper

Pith reviewed 2026-06-29 17:13 UTC · model grok-4.3

classification 💻 cs.CR cs.AIcs.CLcs.IRcs.LG

keywords retrieval-augmented generationcache routingunsafe-served rateevidence validationRAG safetyanswer reuselexical supportcache safety

0 comments

The pith

GroundedCache admits a cached RAG answer only when four evidence gates hold, driving unsafe-served rate to 0 percent on HotpotQA and 1.5 percent on mtRAG drift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces GroundedCache as an evidence-validated router that decides when it is safe to reuse a cached answer in retrieval-augmented generation systems. Reuse occurs only if query similarity, retrieved-evidence overlap, source-version validity, and lexical or judge-based support of the cached answer by new evidence all hold at once. This framing shifts focus from maximizing hit rate to minimizing unsafe reuse in settings where evidence drifts or prompts are similar but answers differ. Across HotpotQA and mtRAG workloads the method reduces unsafe-served rate by factors of 3 to 34 while keeping p50 latency within 1.04-1.07 times a no-cache baseline. The lexical support gate emerges as the primary safety mechanism, with the others supplying low-cost defense in depth.

Core claim

GroundedCache is an evidence-validated cache router that admits a cached answer only when query similarity, retrieved-evidence overlap, source-version validity, and lexical (or judge-based) support of the cached answer by the freshly retrieved evidence all hold simultaneously. On a six-regime workload covering 12,000 generations with Qwen2.5-7B-Instruct, it drives unsafe-served rate to 0.0 percent on every HotpotQA regime versus 15-35 percent under naive caching and to 1.5 percent on mtRAG document drift versus 51.5 percent, producing a 34x reduction on the design-point adversarial regime and 3-10x reductions elsewhere, while end-to-end p50 latency remains 1.04-1.07x of a no-cache RAG baseli

What carries the argument

The four-gate evidence-validated router, with lexical (or judge-based) support of the cached answer by fresh evidence serving as the primary safety gate.

If this is right

RAG deployments can safely reuse cached answers in dynamic corpora without exposing users to incorrect responses at scale.
Unsafe-served rate becomes a practical operator metric for evaluating cache safety beyond hit rate.
The remaining three gates supply defense-in-depth at negligible added cost once lexical support is in place.
End-to-end latency overhead stays small enough that safety gains do not trade off against responsiveness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar gated validation could be applied to output caches outside RAG where evidence or context changes over time.
Replacing lexical support with a lightweight judge model might preserve safety while further reducing token cost.
The approach suggests a general pattern for safe reuse in any system that pairs retrieval with generative output.

Load-bearing premise

Lexical or judge-based support of the cached answer by the freshly retrieved evidence is a reliable indicator that the cached answer remains correct.

What would settle it

A query instance in which all four gates pass yet the cached answer is factually incorrect or mismatched to the current evidence.

Figures

Figures reproduced from arXiv: 2605.27494 by Syed Huma Shah (Duke University).

**Figure 1.** Figure 1: GroundedCache system architecture. A query first hits the stage-2 retrieval cache (byte-exact dedup); on a miss it falls back to FAISS vector search. The retrieved chunks are hashed into an evidence signature σ, and four gates (G1 query similarity, G2 evidence overlap, G3 version match, G4 lexical support) decide whether to serve a cached answer from the answer cache or fall back to the compression + vLLM … view at source ↗

**Figure 2.** Figure 2: Per-query decision flow inside the router. The four gates (G1–G4) form a short-circuiting [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗

**Figure 3.** Figure 3: HotpotQA. Per-regime answer-cache hit rate (solid) and unsafe-served rate (hatched) [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: HotpotQA. Safety/speedup Pareto: each marker is one (regime, variant) pair. [PITH_FULL_IMAGE:figures/full_fig_p009_4.png] view at source ↗

**Figure 5.** Figure 5: HotpotQA. Path mix under GroundedCache (full). Benign regimes (exact-repeat, paraphrase, long shared-doc) route a large share to the answer cache; adversarial regimes route to retrieval-cache + regenerate, paying retrieval savings but not risking false hits. Third, the answer-cache hit-rate cost on HotpotQA is borne almost entirely by G1. Every ablation that leaves G1 off keeps aHR≈0.41 on exact repeat; on… view at source ↗

**Figure 6.** Figure 6: mtRAG. Per-regime answer-cache hit rate (solid) and unsafe-served rate (hatched). Naive [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗

**Figure 7.** Figure 7: mtRAG. Safety/speedup Pareto under multi-turn traffic. [PITH_FULL_IMAGE:figures/full_fig_p012_7.png] view at source ↗

**Figure 8.** Figure 8: HotpotQA. Cumulative gate-stack USR reduction per regime. Adding G3 on top of [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗

**Figure 9.** Figure 9: mtRAG. The same cumulative gate-stack picture. The benign-looking regimes [PITH_FULL_IMAGE:figures/full_fig_p015_9.png] view at source ↗

read the original abstract

Modern retrieval-augmented generation(RAG) deployments increasingly rely on caching to reduce token cost and time-to-first-token(TTFT). Prefix-level KV reuse is now standard in serving stacks such as vLLM, and chunk-level and position-independent reuse have been pushed further by recent systems(RAGCache, TurboRAG, CacheBlend, EPIC, ContextPilot, PCR, LMCache). Output-level semantic answer caches, by contrast, remain fragile: similar prompts can map to different correct answers, retrieved evidence drifts as the corpus is updated, and adversarial collision attacks have been shown to hijack cached responses. We argue that the right framing for cached answer reuse is not how to reuse faster but when reuse is safe. We propose GroundedCache, an evidence-validated cache router that admits a cached answer only when 4 cheap gates simultaneously hold: query similarity, retrieved-evidence overlap, source-version validity, and lexical (or judge-based) support of the cached answer by the freshly retrieved evidence. We build a six-regime workload that stress-tests cache safety rather than only hit rate, and introduce an operator-facing metric, the unsafe-served rate (USR), fraction of all queries that received a wrong cached answer. Across 2 datasets and 12,000 real-LLM generations(Qwen2.5-7B-Instruct on vLLM with Automatic Prefix Caching), GroundedCache drives USR to 0.0% on every HotpotQA regime(vs. 15-35% under naive caching) and to 1.5% on mtRAG document drift(vs. 51.5%), a 34x reduction on the design-point adversarial regime and 3-10x reductions across the other mtRAG regimes, while end-to-end p50 latency stays within 1.04-1.07x of a no-cache RAG baseline. A per-gate ablation isolates the lexical support gate as the load-bearing safety mechanism on both datasets, with the remaining gates providing defense-in-depth at near-zero cost. We release the implementation, workload, and evaluation harness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GroundedCache gives a workable four-gate check that drops unsafe cached answers to near zero in tested RAG setups with almost no extra latency.

read the letter

The paper's core contribution is a router that only reuses a cached RAG answer when four cheap checks all pass: query similarity, evidence overlap, source version match, and lexical or LLM-judge support of the cached answer against fresh retrieval. They pair this with an unsafe-served-rate metric and a six-regime test suite that includes document drift and adversarial cases. On HotpotQA and mtRAG with 12k real generations from Qwen2.5-7B on vLLM, the system brings USR to 0% on HotpotQA regimes and 1.5% on the hardest mtRAG drift case, versus 15-51% for naive caching, while p50 latency stays within 1.04-1.07x of the no-cache baseline. The per-gate ablation identifies the support check as the main driver, and they release the code and harness.

The results are concrete and the engineering is straightforward. The workload design and open artifacts make the claims easier to inspect than most caching papers.

The load-bearing assumption is that passing the support gate means the cached answer is still correct for the current query. Lexical overlap or a judge match can hold while multi-hop reasoning or subtle evidence shifts make the answer wrong, and the paper does not show that the USR oracle is fully independent of the same judge family. Threshold choices and exact judge prompts are not visible in the abstract, so robustness across models or domains remains open. The latency numbers are narrow but the safety claims rest on that single gate.

This is aimed at teams already running production RAG with caching who need a safety dial rather than a new retrieval method. It is worth sending to referees because the experiments are real, the metric is useful, and the artifact release lets others test the assumption directly.

Referee Report

1 major / 0 minor

Summary. The manuscript proposes GroundedCache, an evidence-validated cache router for RAG systems that admits a cached answer only when four gates hold simultaneously: query similarity, retrieved-evidence overlap, source-version validity, and lexical (or judge-based) support of the cached answer by fresh evidence. Evaluated on HotpotQA and mtRAG across 12,000 real generations with Qwen2.5-7B-Instruct on vLLM, it reports USR reductions to 0.0% on all HotpotQA regimes (vs. 15-35% naive) and 1.5% on mtRAG document drift (vs. 51.5%), with p50 latency 1.04-1.07x of no-cache baseline; per-gate ablation identifies the support gate as load-bearing.

Significance. If the safety claims hold under independent verification, the work supplies a practical, low-overhead mechanism for safe output-level caching in production RAG, directly addressing collision and drift risks. The introduction of the USR metric, the six-regime stress-test workload, and the public release of code, workload, and harness are concrete strengths that support reproducibility and further research.

major comments (1)

[Abstract] Abstract (per-gate ablation paragraph): the claim that simultaneous satisfaction of the four gates (especially lexical/judge-based support) implies the cached answer remains correct is load-bearing, yet the manuscript provides no details on the exact support threshold, judge prompt, inter-annotator agreement, or the independent correctness oracle used to compute USR. Without these, it is impossible to rule out correlated failure modes between the support gate and the USR evaluator, particularly on multi-hop HotpotQA queries where evidence overlap can hold while the answer is still incorrect due to reasoning differences.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive comment on transparency and reproducibility. We agree that the abstract requires additional detail to support the load-bearing claim and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract (per-gate ablation paragraph): the claim that simultaneous satisfaction of the four gates (especially lexical/judge-based support) implies the cached answer remains correct is load-bearing, yet the manuscript provides no details on the exact support threshold, judge prompt, inter-annotator agreement, or the independent correctness oracle used to compute USR. Without these, it is impossible to rule out correlated failure modes between the support gate and the USR evaluator, particularly on multi-hop HotpotQA queries where evidence overlap can hold while the answer is still incorrect due to reasoning differences.

Authors: We agree the abstract paragraph is insufficiently detailed for independent verification. In the revised manuscript we will expand the abstract (and ensure the body already contains or will explicitly cross-reference) the exact lexical support threshold, the full judge prompt for the alternative variant, inter-annotator agreement statistics, and the independent correctness oracle (human annotation against ground truth on sampled queries) used to compute USR. On the correlated-failure concern, the primary support gate is lexical n-gram overlap, which is surface-form based and therefore orthogonal to the semantic correctness judgment underlying USR; the judge-based variant is presented only as an optional alternative. The query-similarity and evidence-overlap gates further constrain admission on multi-hop queries. We will add a short discussion of these independence properties and any residual risks. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical system evaluated on external benchmarks

full rationale

The paper introduces GroundedCache as a set of four explicitly defined gates (query similarity, evidence overlap, source validity, lexical/judge support) and measures its effect via the externally defined USR metric on HotpotQA and mtRAG workloads using real LLM generations. No equations, fitted parameters, self-citations as load-bearing premises, or renamings of known results appear in the provided text. The central safety claim is supported by direct ablation and comparison against naive caching on independent datasets rather than reducing to a definitional identity or self-referential fit.

Axiom & Free-Parameter Ledger

1 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the approach uses standard RAG components and four described gates; no new entities are postulated. Thresholds or similarity cutoffs for the gates are likely free parameters but are not quantified here.

free parameters (1)

gate thresholds (query similarity, evidence overlap, lexical support)
The four gates require operational cutoffs that must be chosen or tuned; abstract does not specify values or tuning procedure.

pith-pipeline@v0.9.1-grok · 5936 in / 1347 out tokens · 60802 ms · 2026-06-29T17:13:37.747577+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

31 extracted references · 26 canonical work pages · 2 internal anchors

[1]

https://docs.vllm.ai/en/latest/features/automatic_ prefix_caching.html, 2024

vLLM automatic prefix caching. https://docs.vllm.ai/en/latest/features/automatic_ prefix_caching.html, 2024. Accessed: 2026

2024
[2]

SGLang: Efficient execution of structured language model programs

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, et al. SGLang: Efficient execution of structured language model programs. InNeurIPS, 2024

2024
[3]

RAGCache: Efficient knowledge caching for retrieval-augmented generation.arXiv preprint arXiv:2404.12457, 2024

Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, and Xin Jin. RAGCache: Efficient knowledge caching for retrieval-augmented generation.arXiv preprint arXiv:2404.12457, 2024

work page arXiv 2024
[4]

TurboRAG: Accelerating retrieval-augmented generation with precomputed KV caches for chunked text.arXiv preprint arXiv:2410.07590, 2024

Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, and Yaohua Tang. TurboRAG: Accelerating retrieval-augmented generation with precomputed KV caches for chunked text.arXiv preprint arXiv:2410.07590, 2024

work page arXiv 2024
[5]

CacheBlend: Fast large language model serving for RAG with cached knowledge fusion.arXiv preprint arXiv:2405.16444, 2024

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. CacheBlend: Fast large language model serving for RAG with cached knowledge fusion.arXiv preprint arXiv:2405.16444, 2024

work page arXiv 2024
[6]

EPIC: Efficient position-independent caching for serving large language models.arXiv preprint arXiv:2410.15332, 2024

Junhao Hu, Wenrui Huang, Weidong Wang, Zhenwen Yuan, Tiancheng Xie, Zhixia Liu, Xusheng Liu, Tao Cui, Fei Liu, and Yizhou Cao. EPIC: Efficient position-independent caching for serving large language models.arXiv preprint arXiv:2410.15332, 2024

work page arXiv 2024
[7]

ContextPilot: Fast Long-Context Inference via Context Reuse

EfficientContext. ContextPilot: Fast long-context inference via context reuse.arXiv preprint arXiv:2511.03475, 2025. To appear, MLSys 2026

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

PCR: A prefetch-enhanced cache reuse system for low-latency RAG serving.arXiv preprint arXiv:2603.23049, 2026

Wenfeng Wang, Xiaofeng Hou, Peng Tang, et al. PCR: A prefetch-enhanced cache reuse system for low-latency RAG serving.arXiv preprint arXiv:2603.23049, 2026. 17

work page arXiv 2026
[9]

arXiv preprint arXiv:2510.09665 , year=

LMCache Team. LMCache: An efficient KV cache layer for enterprise-scale LLM inference. arXiv preprint arXiv:2510.09665, 2025

work page arXiv 2025
[10]

Leveraging approximate caching for faster retrieval-augmented generation

Shai Bergman et al. Leveraging approximate caching for faster retrieval-augmented generation. arXiv preprint arXiv:2503.05530, 2025. EuroMLSys 2025

work page arXiv 2025
[11]

QVCache: A query-aware vector cache.arXiv preprint arXiv:2602.02057, 2026

Anıl Eren G¨ o¸ cer et al. QVCache: A query-aware vector cache.arXiv preprint arXiv:2602.02057, 2026

work page arXiv 2026
[12]

GPTCache: An open-source semantic cache for LLM applications enabling faster answers and cost savings.Proceedings of the NLP-OSS Workshop, 2023

Fu Bang. GPTCache: An open-source semantic cache for LLM applications enabling faster answers and cost savings.Proceedings of the NLP-OSS Workshop, 2023

2023
[13]

ContextCache: Context-aware semantic cache for multi-turn queries in large language models

Jianxin Yan, Wangze Ni, Lei Chen, Xuemin Lin, Peng Cheng, Zhan Qin, and Kui Ren. ContextCache: Context-aware semantic cache for multi-turn queries in large language models. Proc. VLDB Endowment, 2025. arXiv:2506.22791

work page arXiv 2025
[14]

From similarity to vulnerability: Key collision attack on LLM semantic caching.arXiv preprint arXiv:2601.23088, 2026

Zhixiang Zhang, Zesen Liu, Yuchong Xie, Quanfeng Huang, and Dongdong She. From similarity to vulnerability: Key collision attack on LLM semantic caching.arXiv preprint arXiv:2601.23088, 2026

work page arXiv 2026
[15]

RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation.arXiv preprint arXiv:2310.04408, 2023

Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation.arXiv preprint arXiv:2310.04408, 2023

work page arXiv 2023
[16]

LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression.arXiv preprint arXiv:2310.06839, 2023

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression.arXiv preprint arXiv:2310.06839, 2023

work page arXiv 2023
[17]

Provence: Efficient and robust context pruning for retrieval-augmented generation.arXiv preprint arXiv:2501.16214, 2025

Nadezhda Chirkova, Thibault Formal, Vassilina Nikoulina, and St´ ephane Clinchant. Provence: Efficient and robust context pruning for retrieval-augmented generation.arXiv preprint arXiv:2501.16214, 2025. ICLR 2025

work page arXiv 2025
[18]

PISCO: Pretty simple compression for retrieval-augmented generation.arXiv preprint arXiv:2501.16075, 2025

Maxime Louis, Nadezhda Chirkova, Thibault Formal, and St´ ephane Clinchant. PISCO: Pretty simple compression for retrieval-augmented generation.arXiv preprint arXiv:2501.16075, 2025

work page arXiv 2025
[19]

OSCAR: Online soft compression and reranking.arXiv preprint arXiv:2504.07109, 2025

Maxime Louis, Nadezhda Chirkova, Thibault Formal, and St´ ephane Clinchant. OSCAR: Online soft compression and reranking.arXiv preprint arXiv:2504.07109, 2025

work page arXiv 2025
[20]

Rethinking soft compression in retrieval-augmented generation: A query- conditioned selector perspective.arXiv preprint arXiv:2602.15856, 2026

Anonymous. Rethinking soft compression in retrieval-augmented generation: A query- conditioned selector perspective.arXiv preprint arXiv:2602.15856, 2026. SeleCom

work page arXiv 2026
[21]

REFRAG: Rethinking RAG based decoding.arXiv preprint arXiv:2509.01092, 2025

Meta Superintelligence Labs. REFRAG: Rethinking RAG based decoding.arXiv preprint arXiv:2509.01092, 2025

work page arXiv 2025
[22]

Ragas: Automated Evaluation of Retrieval Augmented Generation

Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. RAGAS: Automated evaluation of retrieval augmented generation.arXiv preprint arXiv:2309.15217, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[23]

ARES: An au- tomated evaluation framework for retrieval-augmented generation systems.arXiv preprint arXiv:2311.09476, 2023

Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. ARES: An au- tomated evaluation framework for retrieval-augmented generation systems.arXiv preprint arXiv:2311.09476, 2023

work page arXiv 2023
[24]

Context: {c} \\ Question: {q} \\Answer:

Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang. RAGTruth: A hallucination corpus for developing trustworthy retrieval- augmented language models.arXiv preprint arXiv:2401.00396, 2024. 18

work page arXiv 2024
[25]

RAGBench: Explainable benchmark for retrieval-augmented generation systems.arXiv preprint arXiv:2407.11005, 2024

Robert Friel, Masha Belyi, and Atindriyo Sanyal. RAGBench: Explainable benchmark for retrieval-augmented generation systems.arXiv preprint arXiv:2407.11005, 2024

work page arXiv 2024
[26]

RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation.arXiv preprint arXiv:2408.08067, 2024

Dongyu Ru, Lin Qiu, Xiangkun Hu, Tianhang Zhang, Peng Shi, Shuaichen Chang, Cheng Jiayang, Cunxiang Wang, Siyuan Sun, Hang Li, Zheng Zhang, Binjie Wang, Jiarong Jiang, Tong He, Zhiguo Wang, Pengfei Liu, Yue Zhang, and Zheng Zhang. RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation.arXiv preprint arXiv:2408.08067, 2024

work page arXiv 2024
[27]

MTRAG: A multi-turn conversational benchmark for evaluating retrieval-augmented generation sys- tems.Transactions of the Association for Computational Linguistics, 2025

Yannis Katsis, Sara Rosenthal, Kshitij Fadnis, Chulaka Gunasekara, Young-Suk Lee, Lu- cian Popa, Vraj Shah, Huaiyu Zhu, Danish Contractor, and Marina Danilevsky. MTRAG: A multi-turn conversational benchmark for evaluating retrieval-augmented generation sys- tems.Transactions of the Association for Computational Linguistics, 2025. arXiv:2501.03468; benchma...

work page arXiv 2025
[28]

T 2-RAGBench: Text-and-table benchmark for evaluating retrieval-augmented generation.arXiv preprint arXiv:2506.12071, 2025

Anonymous. T 2-RAGBench: Text-and-table benchmark for evaluating retrieval-augmented generation.arXiv preprint arXiv:2506.12071, 2025. EACL 2026

work page arXiv 2025
[29]

Don’t do RAG: When cache-augmented generation is all you need for knowledge tasks.arXiv preprint arXiv:2412.15605, 2024

Brian J Chan, Chao-Ting Chen, Jui-Hung Cheng, and Hen-Hsen Huang. Don’t do RAG: When cache-augmented generation is all you need for knowledge tasks.arXiv preprint arXiv:2412.15605, 2024

work page arXiv 2024
[30]

MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. InNeurIPS, 2020

2020
[31]

Billion-scale similarity search with GPUs

Jeff Johnson, Matthijs Douze, and Herv´ e J´ egou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2017. 19

2017

[1] [1]

https://docs.vllm.ai/en/latest/features/automatic_ prefix_caching.html, 2024

vLLM automatic prefix caching. https://docs.vllm.ai/en/latest/features/automatic_ prefix_caching.html, 2024. Accessed: 2026

2024

[2] [2]

SGLang: Efficient execution of structured language model programs

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, et al. SGLang: Efficient execution of structured language model programs. InNeurIPS, 2024

2024

[3] [3]

RAGCache: Efficient knowledge caching for retrieval-augmented generation.arXiv preprint arXiv:2404.12457, 2024

Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Xin Liu, Xuanzhe Liu, and Xin Jin. RAGCache: Efficient knowledge caching for retrieval-augmented generation.arXiv preprint arXiv:2404.12457, 2024

work page arXiv 2024

[4] [4]

TurboRAG: Accelerating retrieval-augmented generation with precomputed KV caches for chunked text.arXiv preprint arXiv:2410.07590, 2024

Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, and Yaohua Tang. TurboRAG: Accelerating retrieval-augmented generation with precomputed KV caches for chunked text.arXiv preprint arXiv:2410.07590, 2024

work page arXiv 2024

[5] [5]

CacheBlend: Fast large language model serving for RAG with cached knowledge fusion.arXiv preprint arXiv:2405.16444, 2024

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. CacheBlend: Fast large language model serving for RAG with cached knowledge fusion.arXiv preprint arXiv:2405.16444, 2024

work page arXiv 2024

[6] [6]

EPIC: Efficient position-independent caching for serving large language models.arXiv preprint arXiv:2410.15332, 2024

Junhao Hu, Wenrui Huang, Weidong Wang, Zhenwen Yuan, Tiancheng Xie, Zhixia Liu, Xusheng Liu, Tao Cui, Fei Liu, and Yizhou Cao. EPIC: Efficient position-independent caching for serving large language models.arXiv preprint arXiv:2410.15332, 2024

work page arXiv 2024

[7] [7]

ContextPilot: Fast Long-Context Inference via Context Reuse

EfficientContext. ContextPilot: Fast long-context inference via context reuse.arXiv preprint arXiv:2511.03475, 2025. To appear, MLSys 2026

work page internal anchor Pith review Pith/arXiv arXiv 2025

[8] [8]

PCR: A prefetch-enhanced cache reuse system for low-latency RAG serving.arXiv preprint arXiv:2603.23049, 2026

Wenfeng Wang, Xiaofeng Hou, Peng Tang, et al. PCR: A prefetch-enhanced cache reuse system for low-latency RAG serving.arXiv preprint arXiv:2603.23049, 2026. 17

work page arXiv 2026

[9] [9]

arXiv preprint arXiv:2510.09665 , year=

LMCache Team. LMCache: An efficient KV cache layer for enterprise-scale LLM inference. arXiv preprint arXiv:2510.09665, 2025

work page arXiv 2025

[10] [10]

Leveraging approximate caching for faster retrieval-augmented generation

Shai Bergman et al. Leveraging approximate caching for faster retrieval-augmented generation. arXiv preprint arXiv:2503.05530, 2025. EuroMLSys 2025

work page arXiv 2025

[11] [11]

QVCache: A query-aware vector cache.arXiv preprint arXiv:2602.02057, 2026

Anıl Eren G¨ o¸ cer et al. QVCache: A query-aware vector cache.arXiv preprint arXiv:2602.02057, 2026

work page arXiv 2026

[12] [12]

GPTCache: An open-source semantic cache for LLM applications enabling faster answers and cost savings.Proceedings of the NLP-OSS Workshop, 2023

Fu Bang. GPTCache: An open-source semantic cache for LLM applications enabling faster answers and cost savings.Proceedings of the NLP-OSS Workshop, 2023

2023

[13] [13]

ContextCache: Context-aware semantic cache for multi-turn queries in large language models

Jianxin Yan, Wangze Ni, Lei Chen, Xuemin Lin, Peng Cheng, Zhan Qin, and Kui Ren. ContextCache: Context-aware semantic cache for multi-turn queries in large language models. Proc. VLDB Endowment, 2025. arXiv:2506.22791

work page arXiv 2025

[14] [14]

From similarity to vulnerability: Key collision attack on LLM semantic caching.arXiv preprint arXiv:2601.23088, 2026

Zhixiang Zhang, Zesen Liu, Yuchong Xie, Quanfeng Huang, and Dongdong She. From similarity to vulnerability: Key collision attack on LLM semantic caching.arXiv preprint arXiv:2601.23088, 2026

work page arXiv 2026

[15] [15]

RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation.arXiv preprint arXiv:2310.04408, 2023

Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: Improving retrieval-augmented LMs with compression and selective augmentation.arXiv preprint arXiv:2310.04408, 2023

work page arXiv 2023

[16] [16]

LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression.arXiv preprint arXiv:2310.06839, 2023

Huiqiang Jiang, Qianhui Wu, Xufang Luo, Dongsheng Li, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LongLLMLingua: Accelerating and enhancing LLMs in long context scenarios via prompt compression.arXiv preprint arXiv:2310.06839, 2023

work page arXiv 2023

[17] [17]

Provence: Efficient and robust context pruning for retrieval-augmented generation.arXiv preprint arXiv:2501.16214, 2025

Nadezhda Chirkova, Thibault Formal, Vassilina Nikoulina, and St´ ephane Clinchant. Provence: Efficient and robust context pruning for retrieval-augmented generation.arXiv preprint arXiv:2501.16214, 2025. ICLR 2025

work page arXiv 2025

[18] [18]

PISCO: Pretty simple compression for retrieval-augmented generation.arXiv preprint arXiv:2501.16075, 2025

Maxime Louis, Nadezhda Chirkova, Thibault Formal, and St´ ephane Clinchant. PISCO: Pretty simple compression for retrieval-augmented generation.arXiv preprint arXiv:2501.16075, 2025

work page arXiv 2025

[19] [19]

OSCAR: Online soft compression and reranking.arXiv preprint arXiv:2504.07109, 2025

Maxime Louis, Nadezhda Chirkova, Thibault Formal, and St´ ephane Clinchant. OSCAR: Online soft compression and reranking.arXiv preprint arXiv:2504.07109, 2025

work page arXiv 2025

[20] [20]

Rethinking soft compression in retrieval-augmented generation: A query- conditioned selector perspective.arXiv preprint arXiv:2602.15856, 2026

Anonymous. Rethinking soft compression in retrieval-augmented generation: A query- conditioned selector perspective.arXiv preprint arXiv:2602.15856, 2026. SeleCom

work page arXiv 2026

[21] [21]

REFRAG: Rethinking RAG based decoding.arXiv preprint arXiv:2509.01092, 2025

Meta Superintelligence Labs. REFRAG: Rethinking RAG based decoding.arXiv preprint arXiv:2509.01092, 2025

work page arXiv 2025

[22] [22]

Ragas: Automated Evaluation of Retrieval Augmented Generation

Shahul Es, Jithin James, Luis Espinosa-Anke, and Steven Schockaert. RAGAS: Automated evaluation of retrieval augmented generation.arXiv preprint arXiv:2309.15217, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[23] [23]

ARES: An au- tomated evaluation framework for retrieval-augmented generation systems.arXiv preprint arXiv:2311.09476, 2023

Jon Saad-Falcon, Omar Khattab, Christopher Potts, and Matei Zaharia. ARES: An au- tomated evaluation framework for retrieval-augmented generation systems.arXiv preprint arXiv:2311.09476, 2023

work page arXiv 2023

[24] [24]

Context: {c} \\ Question: {q} \\Answer:

Cheng Niu, Yuanhao Wu, Juno Zhu, Siliang Xu, Kashun Shum, Randy Zhong, Juntong Song, and Tong Zhang. RAGTruth: A hallucination corpus for developing trustworthy retrieval- augmented language models.arXiv preprint arXiv:2401.00396, 2024. 18

work page arXiv 2024

[25] [25]

RAGBench: Explainable benchmark for retrieval-augmented generation systems.arXiv preprint arXiv:2407.11005, 2024

Robert Friel, Masha Belyi, and Atindriyo Sanyal. RAGBench: Explainable benchmark for retrieval-augmented generation systems.arXiv preprint arXiv:2407.11005, 2024

work page arXiv 2024

[26] [26]

RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation.arXiv preprint arXiv:2408.08067, 2024

Dongyu Ru, Lin Qiu, Xiangkun Hu, Tianhang Zhang, Peng Shi, Shuaichen Chang, Cheng Jiayang, Cunxiang Wang, Siyuan Sun, Hang Li, Zheng Zhang, Binjie Wang, Jiarong Jiang, Tong He, Zhiguo Wang, Pengfei Liu, Yue Zhang, and Zheng Zhang. RAGChecker: A fine-grained framework for diagnosing retrieval-augmented generation.arXiv preprint arXiv:2408.08067, 2024

work page arXiv 2024

[27] [27]

MTRAG: A multi-turn conversational benchmark for evaluating retrieval-augmented generation sys- tems.Transactions of the Association for Computational Linguistics, 2025

Yannis Katsis, Sara Rosenthal, Kshitij Fadnis, Chulaka Gunasekara, Young-Suk Lee, Lu- cian Popa, Vraj Shah, Huaiyu Zhu, Danish Contractor, and Marina Danilevsky. MTRAG: A multi-turn conversational benchmark for evaluating retrieval-augmented generation sys- tems.Transactions of the Association for Computational Linguistics, 2025. arXiv:2501.03468; benchma...

work page arXiv 2025

[28] [28]

T 2-RAGBench: Text-and-table benchmark for evaluating retrieval-augmented generation.arXiv preprint arXiv:2506.12071, 2025

Anonymous. T 2-RAGBench: Text-and-table benchmark for evaluating retrieval-augmented generation.arXiv preprint arXiv:2506.12071, 2025. EACL 2026

work page arXiv 2025

[29] [29]

Don’t do RAG: When cache-augmented generation is all you need for knowledge tasks.arXiv preprint arXiv:2412.15605, 2024

Brian J Chan, Chao-Ting Chen, Jui-Hung Cheng, and Hen-Hsen Huang. Don’t do RAG: When cache-augmented generation is all you need for knowledge tasks.arXiv preprint arXiv:2412.15605, 2024

work page arXiv 2024

[30] [30]

MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers

Wenhui Wang, Furu Wei, Li Dong, Hangbo Bao, Nan Yang, and Ming Zhou. MiniLM: Deep self-attention distillation for task-agnostic compression of pre-trained transformers. InNeurIPS, 2020

2020

[31] [31]

Billion-scale similarity search with GPUs

Jeff Johnson, Matthijs Douze, and Herv´ e J´ egou. Billion-scale similarity search with GPUs. IEEE Transactions on Big Data, 2017. 19

2017