Can I Buy Your KV Cache?

Luoyuan Zhang

arxiv: 2606.13361 · v1 · pith:GDQU5ZQXnew · submitted 2026-06-11 · 💻 cs.AI · cs.CE · cs.MA

Pith reviewed 2026-06-27 06:35 UTC · model grok-4.3

classification 💻 cs.AI · cs.CE · cs.MA

keywords KV cache · prefill · prompt caching · AI agents · compute savings · inference optimization · content delivery

The pith

Precomputing a document's KV cache once allows agents to load it and skip prefill with exact matching results and major compute savings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that AI agents waste computation by each recomputing the KV cache for the same documents during prefill. Instead, a publisher can compute the cache once and agents can buy the right to load it. This loading produces the same outputs as computing from scratch, with no accuracy loss. The compute cost drops dramatically, by 9 to 50 times on tested models, and provider-side hosting avoids expensive data transfer costs. This points to a system where caches are hosted like a content delivery network for AI prompts.

Core claim

We show that loading a precomputed KV cache and continuing generation matches prefilling the same document from scratch exactly, at both the token level for 24 greedy tokens and the logits level, incurring no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper than prefill, with the gap widening as length increases because attention scales with the square of length. Hosting the cache provider-side, as production prompt-caching does, eliminates egress costs that make shipping the incompressible cache uneconomical. A single reuse already pays back the prefill cost.

What carries the argument

The precomputed key-value cache for a document that can be loaded to bypass the prefill stage of model inference.

If this is right

Multiple agents accessing the same document share the cost of one prefill.
Serving one hot 3774-token document to 80M agents takes 49.7x less compute with reuse than with re-prefill (~$0.03M versus ~$1.5M).
The 0.1x cache-read tariff that APIs charge passes a 10x discount to users and still sits inside the measured ~50x compute saving.
The approach enables an agent-native prefill content delivery network.
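
The arithmetic behind these consequences can be checked in a few lines. The sketch below uses only the figures quoted from the abstract (~$1.5M re-prefill cost, 49.7x compute ratio, 80M agents, 0.1x cache-read tariff). The function names are hypothetical, and reading the tariff as a fraction of the re-prefill cost is an assumption of this sketch, not something the paper states.

```python
# Back-of-envelope check of the figures quoted from the abstract.
REPREFILL_COST_USD = 1.5e6   # quoted: ~$1.5M to re-prefill for every agent
COMPUTE_RATIO = 49.7         # quoted: reuse needs 49.7x less compute
AGENTS = 80_000_000          # quoted: 80M agents
CACHE_READ_TARIFF = 0.1      # quoted: 0.1x cache-read tariff


def reuse_cost(reprefill_cost: float, ratio: float) -> float:
    """Compute cost of serving everyone from the cache instead of re-prefilling."""
    return reprefill_cost / ratio


def tariff_covers_compute(tariff: float, ratio: float) -> bool:
    """True when the tariff fraction exceeds the reuse-compute fraction 1/ratio.

    Assumption: the tariff is applied to the same base as the prefill cost.
    """
    return tariff > 1.0 / ratio


reuse_usd = reuse_cost(REPREFILL_COST_USD, COMPUTE_RATIO)
per_agent_prefill_usd = REPREFILL_COST_USD / AGENTS

print(f"reuse compute for all agents: ${reuse_usd / 1e6:.3f}M")
print(f"prefill cost per agent:       ${per_agent_prefill_usd:.5f}")
print(f"0.1x tariff covers compute:   {tariff_covers_compute(CACHE_READ_TARIFF, COMPUTE_RATIO)}")
```

Under these assumptions the tariff clears the compute cost whenever it exceeds 1/49.7 (about 0.02) of the prefill price, which is the sense in which a 10x discount sits inside a ~50x saving.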

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Publishers could sell rights to precomputed caches as a new revenue stream.
Common knowledge bases in agent swarms could be pre-cached once for collective use.
Solving the open problems of lossless compression and payment layers would accelerate adoption.
The idea may extend to caching other intermediate states in long-running agent tasks.

Load-bearing premise

Provider-side hosting of the KV cache entirely removes egress costs and the precomputed cache remains usable across agents without additional overheads or compatibility issues.
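
A rough size estimate shows why this premise matters: if the cache had to be shipped, its byte count would set the egress bill. The sketch below is illustrative only. The layer count, KV-head count, and head dimension are assumed values for Qwen3-4B that should be verified against the model's published config, fp16 storage is assumed, and the $0.09/GB egress price is a hypothetical placeholder rather than a figure from the paper. Only the 3774-token length and the $1.5M / 80M-agent figures come from the abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KVShape:
    """Dimensions that determine KV cache size (all values here are assumptions)."""
    layers: int
    kv_heads: int
    head_dim: int
    bytes_per_value: int

    def bytes_for(self, tokens: int) -> int:
        # Factor 2: one key tensor and one value tensor per layer.
        return (2 * self.layers * self.kv_heads * self.head_dim
                * self.bytes_per_value * tokens)


# Assumed Qwen3-4B shape with fp16 values; verify before relying on it.
ASSUMED_SHAPE = KVShape(layers=36, kv_heads=8, head_dim=128, bytes_per_value=2)
ASSUMED_EGRESS_USD_PER_GB = 0.09   # hypothetical cloud egress price
TOKENS = 3774                      # document length quoted in the abstract
PREFILL_USD_PER_LOAD = 1.5e6 / 80e6  # quoted $1.5M spread over 80M agents

kv_bytes = ASSUMED_SHAPE.bytes_for(TOKENS)
egress_usd = kv_bytes / 1e9 * ASSUMED_EGRESS_USD_PER_GB

print(f"KV cache size:    {kv_bytes / 1e9:.3f} GB")
print(f"egress per load:  ${egress_usd:.4f}")
print(f"prefill per load: ${PREFILL_USD_PER_LOAD:.4f}")
print(f"shipping costs more than prefill: {egress_usd > PREFILL_USD_PER_LOAD}")
```

With these assumed numbers, one shipped load costs more in egress than the prefill it replaces, which is consistent with the abstract's argument for keeping the cache provider-side.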

What would settle it

An experiment where one agent prefills a document and saves the KV cache, a second agent loads it and generates 24 tokens, then checks if the tokens and logits exactly match those from a fresh prefill on the same document.
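
The shape of that experiment can be sketched with a toy model. The code below is not the paper's protocol: it uses a single attention layer in pure Python with random weights, so each key and value depends only on its own token, and it compares one output vector rather than 24 greedy tokens. All names (`prefill`, `decode_step`, `DIM`) are hypothetical. It shows only that a cache serialized by one party and loaded by another gives the same result as recomputing it.

```python
import json
import math
import random

DIM = 4  # toy hidden size (hypothetical)


def matvec(w, x):
    """Multiply a DIM x DIM matrix by a vector."""
    return [sum(a * b for a, b in zip(row, x)) for row in w]


def attend(q, keys, values):
    """Softmax attention of one query over every cached key/value pair."""
    scale = 1.0 / math.sqrt(len(q))
    scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [sum(w * v[i] for w, v in zip(weights, values)) / total
            for i in range(len(values[0]))]


def prefill(tokens, wk, wv):
    """Project every document token to a key and a value (the cache)."""
    return [matvec(wk, t) for t in tokens], [matvec(wv, t) for t in tokens]


def decode_step(token, cache, wq, wk, wv):
    """Attend from a new token over the cached document plus the token itself."""
    keys, values = cache
    keys = keys + [matvec(wk, token)]
    values = values + [matvec(wv, token)]
    return attend(matvec(wq, token), keys, values)


rng = random.Random(0)


def rand_vec():
    return [rng.uniform(-1.0, 1.0) for _ in range(DIM)]


wq = [rand_vec() for _ in range(DIM)]
wk = [rand_vec() for _ in range(DIM)]
wv = [rand_vec() for _ in range(DIM)]
document = [rand_vec() for _ in range(16)]
question = rand_vec()

# Publisher: prefill once and serialize the cache.
published = json.dumps(prefill(document, wk, wv))

# Agent A recomputes from scratch; agent B loads the published cache.
from_scratch = decode_step(question, prefill(document, wk, wv), wq, wk, wv)
from_cache = decode_step(question, json.loads(published), wq, wk, wv)

exact_match = from_scratch == from_cache
print("outputs identical:", exact_match)
```

The match is exact here because the JSON round trip preserves Python floats and both paths run the same operations in the same order. A real serving stack would also have to hold dtype, kernels, and positional encoding fixed for the same check to be meaningful.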

Figures

Figures reproduced from arXiv: 2606.13361 by Luoyuan Zhang.

**Figure 1.** Amortized per-call cost vs. reuse count N (log–log). The from-scratch cost is flat at C_prefill; KV-reuse falls as C_prefill/N + C_reuse toward a floor of C_reuse. Scaling […]
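
The amortization in the caption (prefill cost divided by N, plus the per-call reuse cost) is easy to tabulate. The sketch below is a minimal illustration with hypothetical function names. Costs are in arbitrary units with the reuse cost set to 1, so the prefill cost equals the quoted compute ratio. It assumes N counts every call, including the one that builds the cache.

```python
def amortized_cost(c_prefill: float, c_reuse: float, n: int) -> float:
    """Per-call cost when one prefill is shared across n calls."""
    if n < 1:
        raise ValueError("n must be at least 1")
    return c_prefill / n + c_reuse


def break_even_calls(c_prefill: float, c_reuse: float) -> int:
    """Smallest n at which amortized reuse is cheaper than prefilling every time."""
    if c_reuse >= c_prefill:
        raise ValueError("reuse must be cheaper than prefill")
    n = 1
    while amortized_cost(c_prefill, c_reuse, n) >= c_prefill:
        n += 1
    return n


# Reuse cost normalized to 1; prefill cost set to the quoted 9x and 49.7x ratios.
for ratio in (9.0, 49.7):
    n_star = break_even_calls(ratio, 1.0)
    print(f"ratio {ratio}: break-even at N = {n_star}")
    for n in (1, 2, 10, 1000):
        print(f"  N = {n:>4}: amortized cost = {amortized_cost(ratio, 1.0, n):.3f}")
```

At both ends of the quoted range the break-even falls at N = 2, that is, the first call after the one that built the cache. This matches the claim that a single reuse pays back the prefill, and the amortized cost then approaches the reuse floor as N grows.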

Original abstract

Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's KV cache market idea is simple and the exact-reuse claim plus cost numbers on Qwen3-4B are worth checking, but the provider-hosting economics rest on unquantified assumptions about overhead.

The main point is that precomputing a document's KV cache once and letting agents load it from a provider-side host could avoid repeated prefill work, with the paper showing exact token and logit match on Qwen3-4B and 9-50x compute savings that grow with length.

What is new is the explicit framing of this as a buyable, hostable cache with a CDN-style market layer between publishers and agents, plus the back-of-envelope economics for 80M agents on a 3774-token document. The verification that loading the cache produces identical greedy output and logits is a useful concrete check, and the contrast between shipping costs and hosted reuse is clear.

The soft spots are in the hosting step. The paper asserts that provider-side placement removes egress entirely and works across agents the way existing prompt caching does, but it does not measure per-load latency, KV format standardization across model versions, or the access-control overhead that a real multi-agent system would need. Those factors could shrink the claimed margin. The experiments are also limited to one model and one prefix length, so the scaling claim needs broader testing.

The work is aimed at people building LLM serving stacks and thinking about inference economics. A reader focused on systems or cost modeling would find the reuse numbers and the market sketch useful even if the full deployment details are missing.

It deserves peer review because the core technical claim is falsifiable and the cost model is grounded in measured savings rather than pure speculation. The open problems it flags (lossless compression and payment) are honest.

Referee Report

3 major / 1 minor

Summary. The paper claims that precomputing a document's KV cache once allows subsequent agents to load it and continue generation with results identical to prefilling from scratch (exact match on 24/24 greedy tokens and at the logits level, with no accuracy cost). On Qwen3-4B this yields 9-50x compute savings versus prefill (widening with length due to quadratic attention), and provider-side hosting modeled on production prompt caching eliminates egress costs that would otherwise make shipping uneconomical. It estimates large aggregate savings (e.g., $1.5M to $0.03M for one 3774-token document served to 80M agents at 49.7x) and frames the result as an agent-native prefill CDN, leaving lossless compression and cross-party payments as open problems.

Significance. If the exact KV-cache reuse and measured savings hold, the proposal could materially reduce redundant prefill work across agents sharing documents. The token-exact and logits-level equivalence, together with the concrete 9-50x compute ratios on Qwen3-4B, supplies a quantitative basis for the economic argument that a 10x user discount remains inside the measured envelope.

major comments (3)

[Abstract] The claim of exact matching (24/24 greedy tokens and logits-level equivalence) with 'no accuracy cost' on Qwen3-4B is presented without methods, number of trials, error bars, or full experimental protocol, which is load-bearing for the central empirical assertion that reuse is lossless.
[Abstract] The 49.7x compute saving and the $1.5M vs $0.03M cost comparison for 80M agents rest on measured prefill versus reuse costs for a 3774-token document; the manuscript does not detail the cost model, hardware assumptions, or how the length-dependent scaling is obtained, leaving the quantitative foundation of the CDN economics difficult to evaluate.
[Abstract] The assertion that provider-side hosting 'exactly as production prompt-caching works, removes egress entirely' while preserving cross-agent usability with 'no material extra cost' is made by contrast to shipping failure, but does not quantify loading latency, KV-format standardization across model versions, or authorization overhead; these factors are load-bearing for whether the claimed savings survive in a multi-agent setting.

minor comments (1)

[Abstract] The term 'agent-native prefill CDN' is introduced in the abstract without an explicit definition or prior reference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. The abstract is intentionally concise and summarizes results whose supporting methods, measurements, and analysis appear in the body of the manuscript. We address each point below and will revise the abstract to improve cross-references and add brief clarifying language where helpful.

Referee: [Abstract] The claim of exact matching (24/24 greedy tokens and logits-level equivalence) with 'no accuracy cost' on Qwen3-4B is presented without methods, number of trials, error bars, or full experimental protocol, which is load-bearing for the central empirical assertion that reuse is lossless.

Authors: The abstract reports the headline empirical result. The full protocol (including the exact verification procedure for token-by-token and logit-level identity under greedy decoding) is given in the Experiments section. Because the match is deterministic for a fixed model, prompt, and decoding strategy, multiple trials and error bars are not applicable; the reported 24/24 figure reflects exhaustive checking on the evaluated document. We will revise the abstract to include an explicit pointer to the Experiments section.

revision: yes

Referee: [Abstract] The 49.7x compute saving and the $1.5M vs $0.03M cost comparison for 80M agents rest on measured prefill versus reuse costs for a 3774-token document; the manuscript does not detail the cost model, hardware assumptions, or how the length-dependent scaling is obtained, leaving the quantitative foundation of the CDN economics difficult to evaluate.

Authors: The 49.7x ratio and the dollar figures are obtained directly from measured prefill versus reuse wall-clock and FLOP counts on Qwen3-4B for the 3774-token document; the length-dependent widening follows from the quadratic attention cost of prefill versus the constant cost of reuse. The translation from measured compute to the $1.5M / $0.03M example uses standard public cloud GPU-hour pricing, which is stated in the economic analysis. We will add a short parenthetical reference in the abstract to the measurement source and the quadratic scaling argument.

revision: yes

Referee: [Abstract] The assertion that provider-side hosting 'exactly as production prompt-caching works, removes egress entirely' while preserving cross-agent usability with 'no material extra cost' is made by contrast to shipping failure, but does not quantify loading latency, KV-format standardization across model versions, or authorization overhead; these factors are load-bearing for whether the claimed savings survive in a multi-agent setting.

Authors: The claim rests on the observation that production prompt-caching systems already perform provider-side KV loading without per-user egress. We agree that loading latency, format standardization, and authorization are important operational details for a multi-agent deployment and will expand the relevant discussion section to address them, including any latency numbers obtained in our own reuse experiments. The core economic envelope (49.7x compute reduction) remains unchanged by these factors.

revision: partial

Circularity Check

0 steps flagged

No circularity; central claims are direct empirical measurements with no reduction to inputs or self-citations

The paper asserts that KV-cache reuse is token-exact and logit-identical to fresh prefill, and quantifies 9-50x compute savings on Qwen3-4B, but these are presented as measured outcomes rather than derived from equations that collapse to fitted parameters. No self-citation load-bearing steps, uniqueness theorems, or ansatzes appear in the provided text. The provider-side hosting proposal is framed by analogy to existing prompt caching without invoking prior author work as justification. The derivation chain is therefore self-contained against external benchmarks and does not reduce any prediction to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entity

Review based solely on abstract; no explicit free parameters, axioms, or invented entities with independent evidence are detailed. The 9-50x factor and cost figures may embed unstated modeling choices.

invented entities (1)

agent-native prefill CDN (no independent evidence)
purpose: System for precomputing, hosting, and selling KV caches to agents
Introduced in the abstract as the resulting architecture

Reference graph

Works this paper leans on

9 extracted references · 5 linked inside Pith

[1] Shubham Agarwal, Sai Sundaresan, Subrata Mitra, Debabrata Mahapatra, Archit Gupta, Rounak Sharma, Nirmal Joshua Kapu, Tong Yu, and Shiv Saini. Cache-craft: Managing chunk-caches for efficient retrieval-augmented generation. arXiv preprint arXiv:2502.15734, 2025.

[2] Jianru Ding, Ryien Hosseini, Pouya Mahdi Gholami, Mingyuan Xiang, and Henry Hoffmann. Observation, not prediction: Conversation-level disaggregated scheduling for agentic serving. arXiv preprint arXiv:2606.01839, 2026.

[3] Mubarak Adetunji Ojewale. Netkv: Network-aware decode instance selection for disaggregated llm inference. arXiv preprint arXiv:2606.03910, 2026.

[4] Xiurui Pan, Endian Li, Qiao Li, Shengwen Liang, Yizhou Shan, Ke Zhou, Yingwei Luo, Xiaolin Wang, and Jie Zhang. Instinfer: In-storage attention offloading for cost-effective long-context llm inference. arXiv preprint arXiv:2409.04992, 2024.

[5] Rya Sanovar, Srikant Bharadwaj, Hritvik Taneja, and Moinuddin Qureshi. Sift: Selective-index for fast compute of rag prefill by exploiting attention invariance. arXiv preprint arXiv:2606.09441, 2026.

[6] Kun-Woo Shin, Jay H. Park, Moonwook Oh, Yohan Jo, Jaeyoung Do, and Sang-Won Lee. Matkv: Trading compute for flash storage in llm inference. arXiv preprint arXiv:2512.22195, 2025.

[7] Bin Yang, Qiuyu Leng, Jun Zeng, and Zhenhua Wu. Cacheclip: Accelerating rag with effective kv cache reuse. arXiv preprint arXiv:2510.10129, 2025.

[8] Pengju Yang. Spectrumkv: Per-token mixed-precision kv cache transfer for prefill-decode disaggregated llm serving. arXiv preprint arXiv:2606.08635, 2026.

[9] Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. Cacheblend: Fast large language model serving for rag with cached knowledge fusion. arXiv preprint arXiv:2405.16444, 2024.
