pith. sign in

arxiv: 2606.19667 · v1 · pith:VMHIMGWEnew · submitted 2026-06-18 · 💻 cs.CL

CacheWeaver: Cache-Aware Evidence Ordering for Efficient Grounded RAG Inference

Pith reviewed 2026-06-26 18:14 UTC · model grok-4.3

classification 💻 cs.CL
keywords RAGprefix cachingevidence orderingtime-to-first-tokenvLLMgrounded generationprompt scheduling
0
0 comments X

The pith

CacheWeaver reorders retrieved evidence in RAG prompts using a prefix tree and greedy walk to maximize cache reuse and reduce median time-to-first-token by 20-33 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In RAG, retrieved evidence often overlaps between queries but arrives in different orders, so standard prefix caching in engines like vLLM cannot reuse the shared tokens. CacheWeaver maintains a prefix tree of past evidence sequences at the prompt layer and greedily selects an ordering that starts with the longest reusable prefix, without modifying the retrieved set or the serving engine. This produces 20-33 percent lower median TTFT across three vLLM configurations while leaving answer quality unchanged on QA tasks. The greedy policy recovers 97.5 percent of the gain available from an oracle ordering, showing that simple scheduling between retrieval and inference can capture most of the available locality.

Core claim

CacheWeaver keeps a prefix tree over recently served evidence sequences and uses a greedy walk to place the most reusable prefix first in the prompt. Across three vLLM configurations the method lowers median time-to-first-token by 20-33 percent relative to retrieval-order prefix caching, without hurting answer quality in QA tests. The greedy policy reaches 97.5 percent of the median TTFT gain from oracle ordering.

What carries the argument

A prefix tree over recently served evidence sequences paired with a greedy walk that selects the ordering maximizing the length of the reusable prefix.

If this is right

  • Evidence reordering at the prompt layer recovers most prefix cache locality without engine or retrieval changes.
  • Median TTFT drops 20-33 percent relative to retrieval-order caching.
  • Answer quality stays the same on the tested QA tasks.
  • A simple greedy policy captures 97.5 percent of the gain from perfect oracle ordering.
  • The retrieved evidence set and serving engine remain unchanged.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prefix-tree idea could be applied to track overlap in other token sequences, such as conversation history, if the serving engine exposes similar cache interfaces.
  • In very high query-rate regimes the cost of tree maintenance and walk computation might eventually offset the TTFT savings.
  • Retrieval modules could be co-designed to return evidence sets whose internal order is already closer to the greedy optimum.
  • Testing the method on evidence collections larger than those used in the QA experiments would show whether the prefix-tree size grows too large to remain lightweight.

Load-bearing premise

Maintaining and querying a prefix tree over recently served evidence sequences adds negligible overhead and the greedy walk reliably recovers most reusable prefix locality without changes to the retrieved evidence set or the serving engine.

What would settle it

Measure TTFT when evidence pieces are forced to have zero token overlap; if the 20-33 percent reduction disappears while tree-maintenance overhead remains, the central claim does not hold.

Figures

Figures reproduced from arXiv: 2606.19667 by Kaizhen Tan, Mingyuan Li, Rong Gu.

Figure 1
Figure 1. Figure 1: Prefix alignment in APC-based RAG serving. [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Knowledge tree example. Dark nodes are cached; gray nodes are evicted or on other branches. Given candidate set {A, B, C, F, D}, the greedy walk follows root→A→B→C→D (the longest cached path), yielding a 4-document prefix hit; the remaining docu￾ment (F) is appended in retrieval-rank order. following the single cached continuation yields the deepest reusable prefix. Sub-optimality can arise only when two c… view at source ↗
Figure 3
Figure 3. Figure 3: Median TTFT across strategies. Optimized or [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: TTFT improvement versus the trace genera [PITH_FULL_IMAGE:figures/full_fig_p006_4.png] view at source ↗
read the original abstract

Retrieval-Augmented Generation (RAG) improves factual grounding, but it also lengthens prompts and raises prefill cost. Prefix caching in serving engines such as vLLM reduces this cost only when requests share the same token prefix. In grounded generation, however, adjacent queries may retrieve overlapping evidence in different orders, so set overlap does not become reusable prefix overlap. We present CacheWeaver, a lightweight prompt-layer method for cache-aware evidence ordering. The method keeps a prefix tree over recently served evidence sequences and uses a greedy walk to place the most reusable prefix first, while leaving the serving engine and retrieved evidence set unchanged. Across three vLLM configurations, the method lowers median time-to-first-token (TTFT) by about 20-33 percent relative to retrieval-order prefix caching, without hurting answer quality in our QA tests. The greedy policy reaches 97.5 percent of the median TTFT gain from oracle ordering, indicating that most reusable prefix locality can be recovered by a simple scheduling layer between retrieval and inference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces CacheWeaver, a lightweight prompt-layer scheduler for grounded RAG that maintains a prefix tree over recently served evidence sequences and applies a greedy walk to reorder evidence chunks. This aims to maximize reusable token prefixes for vLLM-style prefix caching without altering the retrieval set or serving engine. The central empirical claim is a 20-33% reduction in median TTFT across three vLLM configurations relative to retrieval-order caching, with no degradation in QA answer quality and the greedy policy recovering 97.5% of the median gain achieved by an oracle ordering.

Significance. If the reported TTFT gains prove robust and the overhead remains negligible, the work offers a practical, low-effort improvement to inference efficiency in RAG pipelines by better exploiting existing prefix-caching hardware without requiring engine modifications or changes to retrieved evidence.

major comments (2)
  1. [Abstract] Abstract: the 20-33% median TTFT reduction and 97.5% oracle-recovery figures are presented without any description of experimental setup (datasets, query count, model sizes, vLLM configurations, or statistical testing), which is required to assess whether the central performance claim is reproducible or confounded by workload specifics.
  2. [Abstract] Abstract: the claim that prefix-tree construction, updates, and the greedy walk add negligible latency is stated but unsupported by any separate timing measurements or scaling analysis; because these operations run per-request before inference, their cost directly affects whether the reported net TTFT savings survive when history size or chunk length increases.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful comments on the abstract. We address each point below and will make the requested revisions to improve transparency.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the 20-33% median TTFT reduction and 97.5% oracle-recovery figures are presented without any description of experimental setup (datasets, query count, model sizes, vLLM configurations, or statistical testing), which is required to assess whether the central performance claim is reproducible or confounded by workload specifics.

    Authors: We agree that the abstract would benefit from a concise indication of the experimental context. In the revised version we will append a short clause summarizing the evaluation (three standard QA datasets, hundreds of queries per configuration, 7B–70B models, and three vLLM KV-cache settings) while keeping the abstract within length limits. Full experimental details, including run counts and median reporting, remain in Section 4. revision: yes

  2. Referee: [Abstract] Abstract: the claim that prefix-tree construction, updates, and the greedy walk add negligible latency is stated but unsupported by any separate timing measurements or scaling analysis; because these operations run per-request before inference, their cost directly affects whether the reported net TTFT savings survive when history size or chunk length increases.

    Authors: The current manuscript characterizes the scheduler as lightweight but does not include isolated timing or scaling curves for the prefix-tree and greedy-walk steps. We will add a dedicated micro-benchmark subsection (with wall-clock measurements across history sizes 100–10k and chunk lengths 128–512) to the revised paper so that readers can verify the overhead remains small relative to the observed TTFT gains. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical heuristic with reported outcomes, no derivation chain

full rationale

The paper proposes CacheWeaver as a prompt-layer heuristic that maintains a prefix tree and applies a greedy walk to order evidence for better prefix caching in vLLM. The central claims are empirical TTFT reductions (20-33%) and greedy reaching 97.5% of oracle, evaluated on QA tests. No equations, fitted parameters, self-citations for uniqueness theorems, or ansatzes are present in the provided abstract or description. The method is self-contained as a scheduling layer whose performance is measured directly; results do not reduce to inputs by construction. This matches the default case of an honest non-finding for non-derivational work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; the method is described at the level of a heuristic scheduling policy.

pith-pipeline@v0.9.1-grok · 5710 in / 1045 out tokens · 23935 ms · 2026-06-26T18:14:36.740331+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

22 extracted references · 15 canonical work pages · 3 internal anchors

  1. [4]

    and Barrett, Clark W

    Zheng, Lianmin and Yin, Liangsheng and Xie, Zhiqiang and Sun, Chuyue and Huang, Jeff and Yu, Cody Hao and Cao, Shiyi and Kozyrakis, Christos and Stoica, Ion and Gonzalez, Joseph E. and Barrett, Clark W. and Sheng, Ying , title =. Advances in Neural Information Processing Systems , volume =. 2024 , doi =

  2. [5]

    Retrieval-Augmented Generation for Knowledge-Intensive

    Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K. Retrieval-Augmented Generation for Knowledge-Intensive. Advances in Neural Information Processing Systems , volume =. 2020 , pages =

  3. [8]

    CoRR , volume =

    Gao, Yunfan and Xiong, Yun and Gao, Xinyu and Jia, Kangxiang and Pan, Jinliu and Bi, Yuxi and Dai, Yi and Sun, Jiawei and Wang, Meng and Wang, Haofen , title =. CoRR , volume =. 2023 , doi =

  4. [9]

    2025 , archivePrefix =

    Yu, Dazhou and Bao, Riyang and Ning, Ruiyu and Peng, Jinghong and Mai, Gengchen and Zhao, Liang , title =. 2025 , archivePrefix =. 2502.18470 , doi =

  5. [10]

    Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , year =

    Tang, Yihong and Wang, Zhaokai and Qu, Ao and Yan, Yihao and Wu, Zhaofeng and Zhuang, Dingyi and Kai, Jushi and Hou, Kebing and Guo, Xiaotong and Zhao, Jinhua and Zhao, Zhan and Ma, Wei , title =. Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track , year =. doi:10.18653/v1/2024.emnlp-industry.104 , url =

  6. [15]

    CoRR , volume =

    Yang, An and Yang, Baosong and Zhang, Beichen and Hui, Binyuan and Zheng, Bo and Yu, Bowen and Li, Chengyuan and Liu, Dayiheng and Huang, Fei and Wei, Haoran and Lin, Huan and Yang, Jian and Tu, Jianhong and Zhang, Jianwei and Yang, Jianxin and Yang, Jiaxi and Zhou, Jingren and Lin, Junyang and Dang, Kai and Lu, Keming and Bao, Keqin and Yang, Kexin and Y...

  7. [18]

    Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. 2025. https://doi.org/10.1145/3725273 Cache-Craft : Managing chunk-caches for efficient retrieval-augmented generation . Proceedings of the ACM on Management of Data, 3(3):136:1--136:28

  8. [19]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. 2023. https://doi.org/10.48550/arXiv.2312.10997 Retrieval-augmented generation for large language models: A survey . CoRR, abs/2312.10997

  9. [20]

    Yinsicheng Jiang, Yeqi Huang, Liang Cheng, Cheng Deng, Xuan Sun, and Luo Mai. 2026. https://doi.org/10.48550/arXiv.2511.03475 ContextPilot : Fast long-context inference via context reuse . In Proceedings of Machine Learning and Systems

  10. [21]

    Chao Jin, Zili Zhang, Xuanlin Jiang, Fangyue Liu, Shufan Liu, Xuanzhe Liu, and Xin Jin. 2025. https://doi.org/10.1145/3768628 RAGCache : Efficient knowledge caching for retrieval-augmented generation . ACM Transactions on Computer Systems, 44(1):2:1--2:27

  11. [22]

    T rivia QA : A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. 2017. https://doi.org/10.18653/v1/P17-1147 TriviaQA : A large scale distantly supervised challenge dataset for reading comprehension . In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601--1611, Vancouver, Canada. Asso...

  12. [23]

    and Uszkoreit, Jakob and Le, Quoc and Petrov, Slav

    Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. https://doi.org/10.1162/tacl_a_00276 Natural Questions : A benchm...

  13. [24]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. https://doi.org/10.1145/3600006.3613165 Efficient memory management for large language model serving with PagedAttention . In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP '23, pages 611--626. Assoc...

  14. [25]

    u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K \"u ttler, Mike Lewis, Wen-tau Yih, Tim Rockt \"a schel, Sebastian Riedel, and Douwe Kiela. 2020. https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html Retrieval-augmented generation for knowledge-intensive NLP ...

  15. [26]

    and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024 a . https://doi.org/10.1162/tacl_a_00638 Lost in the middle: How language models use long contexts . Transactions of the Association for Computational Linguistics, 12:157--173

  16. [27]

    Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024 b . https://doi.org/10.1145/3651890.3672274 CacheGen : KV cache compression and streaming for fast large language model serving . In Proceedings of the A...

  17. [28]

    Songshuo Lu, Hua Wang, Yutian Rong, Zhi Chen, and Yaohua Tang. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.334 TurboRAG : Accelerating retrieval-augmented generation with precomputed KV caches for chunked text . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 6588--6601, Suzhou, China. Association for...

  18. [29]

    vLLM Project . 2025. Automatic Prefix Caching . https://docs.vllm.ai/en/stable/features/automatic_prefix_caching/. Accessed: 2026-04-16

  19. [30]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, and 23 others. 2024. https://doi.org/10.48550/arXiv.2412.15115 Qwen2.5 Technical Report . CoRR, abs/2412.15115

  20. [31]

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. 2018. https://doi.org/10.18653/v1/D18-1259 HotpotQA : A dataset for diverse, explainable multi-hop question answering . In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369--2380, Brussels, ...

  21. [32]

    Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. 2025. https://doi.org/10.1145/3689031.3696098 CacheBlend : Fast large language model serving for RAG with cached knowledge fusion . In Proceedings of the Twentieth European Conference on Computer Systems, EuroSys '25, pages 94--109, Rotterdam...

  22. [33]

    Gonzalez, Clark W

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, and Ying Sheng. 2024. https://doi.org/10.52202/079017-2000 SGLang : Efficient execution of structured language model programs . In Advances in Neural Information Processing Systems, volume 37