CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

Kaizhen Tan; Mingyuan Li; Rong Gu

arxiv: 2606.19667 · v1 · pith:VMHIMGWEnew · submitted 2026-06-18 · 💻 cs.CL

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

Kaizhen Tan , Rong Gu , Mingyuan Li This is my paper

Pith reviewed 2026-06-26 18:14 UTC · model grok-4.3

classification 💻 cs.CL

keywords RAGprefix cachingevidence orderingtime-to-first-tokenvLLMgrounded generationprompt scheduling

0 comments

The pith

CacheWeaver reorders retrieved evidence in RAG prompts using a prefix tree and greedy walk to maximize cache reuse and reduce median time-to-first-token by 20-33 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In RAG, retrieved evidence often overlaps between queries but arrives in different orders, so standard prefix caching in engines like vLLM cannot reuse the shared tokens. CacheWeaver maintains a prefix tree of past evidence sequences at the prompt layer and greedily selects an ordering that starts with the longest reusable prefix, without modifying the retrieved set or the serving engine. This produces 20-33 percent lower median TTFT across three vLLM configurations while leaving answer quality unchanged on QA tasks. The greedy policy recovers 97.5 percent of the gain available from an oracle ordering, showing that simple scheduling between retrieval and inference can capture most of the available locality.

Core claim

CacheWeaver keeps a prefix tree over recently served evidence sequences and uses a greedy walk to place the most reusable prefix first in the prompt. Across three vLLM configurations the method lowers median time-to-first-token by 20-33 percent relative to retrieval-order prefix caching, without hurting answer quality in QA tests. The greedy policy reaches 97.5 percent of the median TTFT gain from oracle ordering.

What carries the argument

A prefix tree over recently served evidence sequences paired with a greedy walk that selects the ordering maximizing the length of the reusable prefix.

If this is right

Evidence reordering at the prompt layer recovers most prefix cache locality without engine or retrieval changes.
Median TTFT drops 20-33 percent relative to retrieval-order caching.
Answer quality stays the same on the tested QA tasks.
A simple greedy policy captures 97.5 percent of the gain from perfect oracle ordering.
The retrieved evidence set and serving engine remain unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prefix-tree idea could be applied to track overlap in other token sequences, such as conversation history, if the serving engine exposes similar cache interfaces.
In very high query-rate regimes the cost of tree maintenance and walk computation might eventually offset the TTFT savings.
Retrieval modules could be co-designed to return evidence sets whose internal order is already closer to the greedy optimum.
Testing the method on evidence collections larger than those used in the QA experiments would show whether the prefix-tree size grows too large to remain lightweight.

Load-bearing premise

Maintaining and querying a prefix tree over recently served evidence sequences adds negligible overhead and the greedy walk reliably recovers most reusable prefix locality without changes to the retrieved evidence set or the serving engine.

What would settle it

Measure TTFT when evidence pieces are forced to have zero token overlap; if the 20-33 percent reduction disappears while tree-maintenance overhead remains, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2606.19667 by Kaizhen Tan, Mingyuan Li, Rong Gu.

**Figure 2.** Figure 2: Knowledge tree example. Dark nodes are cached; gray nodes are evicted or on other branches. Given candidate set {A, B, C, F, D}, the greedy walk follows root→A→B→C→D (the longest cached path), yielding a 4-document prefix hit; the remaining document (F) is appended in retrieval-rank order. following the single cached continuation yields the deepest reusable prefix. Sub-optimality can arise only when two c… view at source ↗

**Figure 3.** Figure 3: Median TTFT across strategies. Optimized or [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: TTFT improvement versus the trace genera [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗

read the original abstract

Retrieval-Augmented Generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token prefix. In grounded generation, however, adjacent queries may retrieve overlapping evidence in different orders, so set overlap does not become reusable prefix overlap. We present CacheWeaver, a lightweight prompt-layer method for cache-aware evidence ordering. The method keeps a prefix tree over recently served evidence sequences and uses a greedy walk to place the most reusable prefix first, while leaving the serving engine and retrieved evidence set unchanged. Across three vLLM configurations, the method lowers median time-to-first-token (TTFT) by about 20-33 percent relative to retrieval-order prefix caching, without hurting answer quality in our QA tests. The greedy policy reaches 97.5 percent of the median TTFT gain from oracle ordering, indicating that most reusable prefix locality can be recovered by a simple scheduling layer between retrieval and inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CacheWeaver adds a greedy prefix-tree scheduler that reorders RAG evidence for better vLLM cache reuse and reports 20-33% TTFT cuts, but the scheduler overhead stays unmeasured.

read the letter

CacheWeaver keeps a prefix tree over recently served evidence sequences and runs a greedy walk to put the longest reusable prefix first. This sits between retrieval and inference, leaves the engine and evidence set untouched, and delivers 20-33% lower median TTFT than plain retrieval order across three vLLM setups. The greedy policy captures 97.5% of the oracle gain while answer quality stays the same on their QA tests.

The concrete new piece is the maintained prefix tree plus the greedy ordering rule as a prompt-layer addition. It turns set overlap into actual prefix overlap without any change to the serving stack.

That is the useful part. Many deployed RAG systems already pay for prefix caching, and this exploits the ordering mismatch that retrieval usually ignores.

The soft spot is exactly the one the stress-test note flags. The paper treats tree construction, updates, and the greedy walk as negligible, yet supplies no separate timing for those steps. If the tree grows with workload history or chunks are long, the per-request cost could shrink or erase the reported TTFT savings. Without that breakdown the net benefit is still an assumption.

The abstract also skips experimental details, baselines, and any statistical checks, so the numbers cannot be fully judged from what is here.

This is for people who run grounded generation at scale on cached engines and want lower latency without retraining or engine changes. A practitioner could implement the scheduler and measure the overhead themselves.

It deserves a serious referee because the core mechanism is simple to reproduce and the efficiency claim is directly testable. I would send it to review.

Referee Report

2 major / 0 minor

Summary. The paper introduces CacheWeaver, a lightweight prompt-layer scheduler for grounded RAG that maintains a prefix tree over recently served evidence sequences and applies a greedy walk to reorder evidence chunks. This aims to maximize reusable token prefixes for vLLM-style prefix caching without altering the retrieval set or serving engine. The central empirical claim is a 20-33% reduction in median TTFT across three vLLM configurations relative to retrieval-order caching, with no degradation in QA answer quality and the greedy policy recovering 97.5% of the median gain achieved by an oracle ordering.

Significance. If the reported TTFT gains prove robust and the overhead remains negligible, the work offers a practical, low-effort improvement to inference efficiency in RAG pipelines by better exploiting existing prefix-caching hardware without requiring engine modifications or changes to retrieved evidence.

major comments (2)

[Abstract] Abstract: the 20-33% median TTFT reduction and 97.5% oracle-recovery figures are presented without any description of experimental setup (datasets, query count, model sizes, vLLM configurations, or statistical testing), which is required to assess whether the central performance claim is reproducible or confounded by workload specifics.
[Abstract] Abstract: the claim that prefix-tree construction, updates, and the greedy walk add negligible latency is stated but unsupported by any separate timing measurements or scaling analysis; because these operations run per-request before inference, their cost directly affects whether the reported net TTFT savings survive when history size or chunk length increases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on the abstract. We address each point below and will make the requested revisions to improve transparency.

read point-by-point responses

Referee: [Abstract] Abstract: the 20-33% median TTFT reduction and 97.5% oracle-recovery figures are presented without any description of experimental setup (datasets, query count, model sizes, vLLM configurations, or statistical testing), which is required to assess whether the central performance claim is reproducible or confounded by workload specifics.

Authors: We agree that the abstract would benefit from a concise indication of the experimental context. In the revised version we will append a short clause summarizing the evaluation (three standard QA datasets, hundreds of queries per configuration, 7B–70B models, and three vLLM KV-cache settings) while keeping the abstract within length limits. Full experimental details, including run counts and median reporting, remain in Section 4. revision: yes
Referee: [Abstract] Abstract: the claim that prefix-tree construction, updates, and the greedy walk add negligible latency is stated but unsupported by any separate timing measurements or scaling analysis; because these operations run per-request before inference, their cost directly affects whether the reported net TTFT savings survive when history size or chunk length increases.

Authors: The current manuscript characterizes the scheduler as lightweight but does not include isolated timing or scaling curves for the prefix-tree and greedy-walk steps. We will add a dedicated micro-benchmark subsection (with wall-clock measurements across history sizes 100–10k and chunk lengths 128–512) to the revised paper so that readers can verify the overhead remains small relative to the observed TTFT gains. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical heuristic with reported outcomes, no derivation chain

full rationale

The paper proposes CacheWeaver as a prompt-layer heuristic that maintains a prefix tree and applies a greedy walk to order evidence for better prefix caching in vLLM. The central claims are empirical TTFT reductions (20-33%) and greedy reaching 97.5% of oracle, evaluated on QA tests. No equations, fitted parameters, self-citations for uniqueness theorems, or ansatzes are present in the provided abstract or description. The method is self-contained as a scheduling layer whose performance is measured directly; results do not reduce to inputs by construction. This matches the default case of an honest non-finding for non-derivational work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; the method is described at the level of a heuristic scheduling policy.

pith-pipeline@v0.9.1-grok · 5710 in / 1045 out tokens · 23935 ms · 2026-06-26T18:14:36.740331+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

22 extracted references · 15 canonical work pages · 3 internal anchors

[4]

and Barrett, Clark W

Zheng, Lianmin and Yin, Liangsheng and Xie, Zhiqiang and Sun, Chuyue and Huang, Jeff and Yu, Cody Hao and Cao, Shiyi and Kozyrakis, Christos and Stoica, Ion and Gonzalez, Joseph E. and Barrett, Clark W. and Sheng, Ying , title =. Advances in Neural Information Processing Systems , volume =. 2024 , doi =

2024
[5]

Retrieval-Augmented Generation for Knowledge-Intensive

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K. Retrieval-Augmented Generation for Knowledge-Intensive. Advances in Neural Information Processing Systems , volume =. 2020 , pages =

2020
[8]

CoRR , volume =

Gao, Yunfan and Xiong, Yun and Gao, Xinyu and Jia, Kangxiang and Pan, Jinliu and Bi, Yuxi and Dai, Yi and Sun, Jiawei and Wang, Meng and Wang, Haofen , title =. CoRR , volume =. 2023 , doi =

2023
[9]

2025 , archivePrefix =

Yu, Dazhou and Bao, Riyang and Ning, Ruiyu and Peng, Jinghong and Mai, Gengchen and Zhao, Liang , title =. 2025 , archivePrefix =. 2502.18470 , doi =

arXiv 2025
[10]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , year =

Tang, Yihong and Wang, Zhaokai and Qu, Ao and Yan, Yihao and Wu, Zhaofeng and Zhuang, Dingyi and Kai, Jushi and Hou, Kebing and Guo, Xiaotong and Zhao, Jinhua and Zhao, Zhan and Ma, Wei , title =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , year =. doi:10.18653/v1/2024.emnlp-industry.104 , url =

work page doi:10.18653/v1/2024.emnlp-industry.104 2024
[15]

CoRR , volume =

Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Kexin and Y...

2024
[18]

Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. 2025. https://doi.org/10.1145/3725273 Cache-Craft : Managing chunk-caches for efficient retrieval-augmented generation . Proceedings of the ACM on Management of Data, 3(3):136:1--136:28

work page doi:10.1145/3725273 2025
[19]

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2023. https://doi.org/10.48550/arXiv.2312.10997 Retrieval-augmented generation for large language models: A survey . CoRR, abs/2312.10997

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.10997 2023
[20]

Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, and Luo Mai. 2026. https://doi.org/10.48550/arXiv.2511.03475 ContextPilot : Fast long-context inference via context reuse . In Proceedings of Machine Learning and Systems

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.03475 2026
[21]

Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. 2025. https://doi.org/10.1145/3768628 RAGCache : Efficient knowledge caching for retrieval-augmented generation . ACM Transactions on Computer Systems, 44(1):2:1--2:27

work page doi:10.1145/3768628 2025
[22]

T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/P17-1147 TriviaQA : A large scale distantly supervised challenge dataset for reading comprehension . In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601--1611, Vancouver, Canada. Asso...

work page doi:10.18653/v1/p17-1147 2017
[23]

and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. https://doi.org/10.1162/tacl_a_00276 Natural Questions : A benchm...

work page doi:10.1162/tacl_a_00276 2019
[24]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. https://doi.org/10.1145/3600006.3613165 Efficient memory management for large language model serving with PagedAttention . In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, pages 611--626. Assoc...

work page doi:10.1145/3600006.3613165 2023
[25]

u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K \"u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \"a schel, Sebastian Riedel, and Douwe Kiela. 2020. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html Retrieval-augmented generation for knowledge-intensive NLP ...

2020
[26]

and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024 a . https://doi.org/10.1162/tacl_a_00638 Lost in the middle: How language models use long contexts . Transactions of the Association for Computational Linguistics, 12:157--173

work page doi:10.1162/tacl_a_00638 2024
[27]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024 b . https://doi.org/10.1145/3651890.3672274 CacheGen : KV cache compression and streaming for fast large language model serving . In Proceedings of the A...

work page doi:10.1145/3651890.3672274 2024
[28]

Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, and Yaohua Tang. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.334 TurboRAG : Accelerating retrieval-augmented generation with precomputed KV caches for chunked text . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6588--6601, Suzhou, China. Association for...

work page doi:10.18653/v1/2025.emnlp-main.334 2025
[29]

vLLM Project . 2025. Automatic Prefix Caching . https://docs.vllm.ai/en/stable/features/automatic_prefix_caching/. Accessed: 2026-04-16

2025
[30]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 23 others. 2024. https://doi.org/10.48550/arXiv.2412.15115 Qwen2.5 Technical Report . CoRR, abs/2412.15115

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115 2024
[31]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. https://doi.org/10.18653/v1/D18-1259 HotpotQA : A dataset for diverse, explainable multi-hop question answering . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369--2380, Brussels, ...

work page doi:10.18653/v1/d18-1259 2018
[32]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. https://doi.org/10.1145/3689031.3696098 CacheBlend : Fast large language model serving for RAG with cached knowledge fusion . In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys '25, pages 94--109, Rotterdam...

work page doi:10.1145/3689031.3696098 2025
[33]

Gonzalez, Clark W

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, and Ying Sheng. 2024. https://doi.org/10.52202/079017-2000 SGLang : Efficient execution of structured language model programs . In Advances in Neural Information Processing Systems, volume 37

work page doi:10.52202/079017-2000 2024

[1] [4]

and Barrett, Clark W

Zheng, Lianmin and Yin, Liangsheng and Xie, Zhiqiang and Sun, Chuyue and Huang, Jeff and Yu, Cody Hao and Cao, Shiyi and Kozyrakis, Christos and Stoica, Ion and Gonzalez, Joseph E. and Barrett, Clark W. and Sheng, Ying , title =. Advances in Neural Information Processing Systems , volume =. 2024 , doi =

2024

[2] [5]

Retrieval-Augmented Generation for Knowledge-Intensive

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K. Retrieval-Augmented Generation for Knowledge-Intensive. Advances in Neural Information Processing Systems , volume =. 2020 , pages =

2020

[3] [8]

CoRR , volume =

Gao, Yunfan and Xiong, Yun and Gao, Xinyu and Jia, Kangxiang and Pan, Jinliu and Bi, Yuxi and Dai, Yi and Sun, Jiawei and Wang, Meng and Wang, Haofen , title =. CoRR , volume =. 2023 , doi =

2023

[4] [9]

2025 , archivePrefix =

Yu, Dazhou and Bao, Riyang and Ning, Ruiyu and Peng, Jinghong and Mai, Gengchen and Zhao, Liang , title =. 2025 , archivePrefix =. 2502.18470 , doi =

arXiv 2025

[5] [10]

Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , year =

Tang, Yihong and Wang, Zhaokai and Qu, Ao and Yan, Yihao and Wu, Zhaofeng and Zhuang, Dingyi and Kai, Jushi and Hou, Kebing and Guo, Xiaotong and Zhao, Jinhua and Zhao, Zhan and Ma, Wei , title =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , year =. doi:10.18653/v1/2024.emnlp-industry.104 , url =

work page doi:10.18653/v1/2024.emnlp-industry.104 2024

[6] [15]

CoRR , volume =

Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Kexin and Y...

2024

[7] [18]

Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. 2025. https://doi.org/10.1145/3725273 Cache-Craft : Managing chunk-caches for efficient retrieval-augmented generation . Proceedings of the ACM on Management of Data, 3(3):136:1--136:28

work page doi:10.1145/3725273 2025

[8] [19]

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2023. https://doi.org/10.48550/arXiv.2312.10997 Retrieval-augmented generation for large language models: A survey . CoRR, abs/2312.10997

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.10997 2023

[9] [20]

Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, and Luo Mai. 2026. https://doi.org/10.48550/arXiv.2511.03475 ContextPilot : Fast long-context inference via context reuse . In Proceedings of Machine Learning and Systems

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.03475 2026

[10] [21]

Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. 2025. https://doi.org/10.1145/3768628 RAGCache : Efficient knowledge caching for retrieval-augmented generation . ACM Transactions on Computer Systems, 44(1):2:1--2:27

work page doi:10.1145/3768628 2025

[11] [22]

T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/P17-1147 TriviaQA : A large scale distantly supervised challenge dataset for reading comprehension . In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601--1611, Vancouver, Canada. Asso...

work page doi:10.18653/v1/p17-1147 2017

[12] [23]

and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. https://doi.org/10.1162/tacl_a_00276 Natural Questions : A benchm...

work page doi:10.1162/tacl_a_00276 2019

[13] [24]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. https://doi.org/10.1145/3600006.3613165 Efficient memory management for large language model serving with PagedAttention . In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, pages 611--626. Assoc...

work page doi:10.1145/3600006.3613165 2023

[14] [25]

u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K \"u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \"a schel, Sebastian Riedel, and Douwe Kiela. 2020. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html Retrieval-augmented generation for knowledge-intensive NLP ...

2020

[15] [26]

and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024 a . https://doi.org/10.1162/tacl_a_00638 Lost in the middle: How language models use long contexts . Transactions of the Association for Computational Linguistics, 12:157--173

work page doi:10.1162/tacl_a_00638 2024

[16] [27]

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024 b . https://doi.org/10.1145/3651890.3672274 CacheGen : KV cache compression and streaming for fast large language model serving . In Proceedings of the A...

work page doi:10.1145/3651890.3672274 2024

[17] [28]

Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, and Yaohua Tang. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.334 TurboRAG : Accelerating retrieval-augmented generation with precomputed KV caches for chunked text . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6588--6601, Suzhou, China. Association for...

work page doi:10.18653/v1/2025.emnlp-main.334 2025

[18] [29]

vLLM Project . 2025. Automatic Prefix Caching . https://docs.vllm.ai/en/stable/features/automatic_prefix_caching/. Accessed: 2026-04-16

2025

[19] [30]

An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 23 others. 2024. https://doi.org/10.48550/arXiv.2412.15115 Qwen2.5 Technical Report . CoRR, abs/2412.15115

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2412.15115 2024

[20] [31]

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. https://doi.org/10.18653/v1/D18-1259 HotpotQA : A dataset for diverse, explainable multi-hop question answering . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369--2380, Brussels, ...

work page doi:10.18653/v1/d18-1259 2018

[21] [32]

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. https://doi.org/10.1145/3689031.3696098 CacheBlend : Fast large language model serving for RAG with cached knowledge fusion . In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys '25, pages 94--109, Rotterdam...

work page doi:10.1145/3689031.3696098 2025

[22] [33]

Gonzalez, Clark W

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, and Ying Sheng. 2024. https://doi.org/10.52202/079017-2000 SGLang : Efficient execution of structured language model programs . In Advances in Neural Information Processing Systems, volume 37

work page doi:10.52202/079017-2000 2024