Can I Buy Your KV Cache?
Pith reviewed 2026-06-27 06:35 UTC · model grok-4.3
The pith
Precomputing a document's KV cache once lets every later agent load it and skip prefill, with token-exact results and 9-50x lower compute on Qwen3-4B.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We show that loading a precomputed KV cache and continuing generation matches prefilling the same document from scratch exactly, both at the token level (24 greedy tokens) and at the logits level, incurring no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper than prefill, and the gap widens with document length because prefill's attention cost scales with the square of the length. Hosting the cache provider-side, as production prompt-caching does, eliminates the egress costs that make shipping the nearly incompressible cache uneconomical. A single reuse already pays back the prefill cost.
What carries the argument
A document's precomputed key-value (KV) cache, which an agent loads to bypass the prefill stage of model inference.
If this is right
- Multiple agents accessing the same document share the cost of one prefill.
- Serving one hot 3774-token document to 80M agents costs about 49.7x less compute with cache reuse than with re-prefilling.
- The 0.1x cache-read tariff that APIs charge passes a 10x discount to users and still sits inside the measured ~50x compute saving.
- The approach enables an agent-native prefill content delivery network.
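The arithmetic behind these bullets can be checked with a short sketch. It uses only the figures quoted in the abstract (80M agents, ~$1.5M re-prefill cost, 49.7x compute ratio, 0.1x cache-read tariff). The variable names are illustrative, and treating the prefill list price as equal to its compute cost is an assumption made here for simplicity, not something the paper states.

```python
# Figures quoted in the abstract; everything derived below is plain arithmetic.
AGENTS = 80_000_000            # agents reading the same hot document
REPREFILL_COST_USD = 1.5e6     # total cost if every agent re-runs prefill
COMPUTE_RATIO = 49.7           # measured prefill / reuse compute ratio
TARIFF_FRACTION = 0.1          # cache-read tariff as a fraction of the prefill price

# ASSUMPTION (not from the paper): the prefill list price equals its compute cost.
reuse_cost_usd = REPREFILL_COST_USD / COMPUTE_RATIO
prefill_per_agent_usd = REPREFILL_COST_USD / AGENTS
reuse_per_agent_usd = reuse_cost_usd / AGENTS
tariff_revenue_usd = TARIFF_FRACTION * REPREFILL_COST_USD
cost_share_of_tariff = reuse_cost_usd / tariff_revenue_usd

print(f"reuse compute cost:      ${reuse_cost_usd / 1e6:.3f}M")
print(f"prefill cost per agent:  ${prefill_per_agent_usd:.5f}")
print(f"reuse cost per agent:    ${reuse_per_agent_usd:.6f}")
print(f"reuse cost / tariff:     {cost_share_of_tariff:.1%}")
```

Under that pricing assumption, reuse compute comes to roughly a fifth of what the 0.1x tariff collects, which is the sense in which the 10x discount sits inside the measured ~50x envelope.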
Where Pith is reading between the lines
- Publishers could sell rights to precomputed caches as a new revenue stream.
- Common knowledge bases in agent swarms could be pre-cached once for collective use.
- Solving the open problems of lossless compression and payment layers would accelerate adoption.
- The idea may extend to caching other intermediate states in long-running agent tasks.
Load-bearing premise
Provider-side hosting of the KV cache removes egress costs entirely, and the precomputed cache stays usable across agents without added overhead or compatibility problems (for example, loading latency, differing KV formats across model versions, or authorization checks).
What would settle it
An experiment in which one agent prefills a document and saves the KV cache, a second agent loads it and generates 24 tokens, and the resulting tokens and logits are checked for an exact match against those from a fresh prefill of the same document.
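A toy version of this check can be sketched in pure Python. This is not the paper's protocol: it uses a hypothetical single-layer, single-head attention model with random weights (the names `prefill`, `step`, and `generate` are invented for illustration), and a JSON round trip stands in for saving and loading the cache. It only shows the shape of the test: greedy tokens and logits from a loaded cache are compared for exact equality against a fresh prefill.

```python
import json
import math
import random

D, VOCAB = 8, 50  # toy hidden size and vocabulary (assumptions)
rng = random.Random(0)
EMB = [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(VOCAB)]
W = {name: [[rng.uniform(-1, 1) for _ in range(D)] for _ in range(D)]
     for name in ("q", "k", "v")}

def matvec(m, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in m]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def prefill(tokens):
    """Compute the toy KV cache: one key and one value vector per token."""
    return {"k": [matvec(W["k"], EMB[t]) for t in tokens],
            "v": [matvec(W["v"], EMB[t]) for t in tokens]}

def step(cache, token):
    """Append this token's K/V, attend over the whole cache, return logits."""
    x = EMB[token]
    cache["k"].append(matvec(W["k"], x))
    cache["v"].append(matvec(W["v"], x))
    q = matvec(W["q"], x)
    scores = [dot(q, k) / math.sqrt(D) for k in cache["k"]]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    out = [sum(w / total * v[i] for w, v in zip(weights, cache["v"]))
           for i in range(D)]
    return [dot(out, e) for e in EMB]

def generate(cache, first_token, n=24):
    """Greedy decoding: always pick the highest-logit token."""
    tokens, all_logits, token = [], [], first_token
    for _ in range(n):
        logits = step(cache, token)
        token = max(range(VOCAB), key=logits.__getitem__)
        tokens.append(token)
        all_logits.append(logits)
    return tokens, all_logits

doc_rng = random.Random(1)
document = [doc_rng.randrange(VOCAB) for _ in range(200)]
query_token = 7  # hypothetical first token fed after the document

# Agent A (publisher): prefill once and serialize the cache.
saved = json.dumps(prefill(document))

# Agent B: load the saved cache and continue, skipping prefill.
reuse_tokens, reuse_logits = generate(json.loads(saved), query_token)

# Baseline: prefill from scratch, then continue.
fresh_tokens, fresh_logits = generate(prefill(document), query_token)

tokens_match = reuse_tokens == fresh_tokens
logits_match = reuse_logits == fresh_logits
print(f"tokens match: {tokens_match}, logits match: {logits_match}")
```

In this toy both comparisons come out equal because the loaded cache is bit-identical to the recomputed one and the arithmetic runs in the same order. On real accelerators, exactness can also depend on kernel and batching choices, which is what the full experiment would need to pin down.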
Original abstract
Right now, across the world, AI agents are repeating the same absurd act: to read one document, they each recompute it from scratch. Every agent re-runs prefill, the most compute-intensive step a large model takes, over identical text, only to rebuild a key-value (KV) cache identical to the one the agent before it just built. The same answer, computed a million times. We make a proposal that is almost offensively simple: compute it once. Let a publisher precompute a document's KV cache, and let every other agent buy the right to load it and skip prefill. It works, and it is token-exact: loading a precomputed KV and continuing matches prefilling from scratch (24/24 greedy tokens, and at the logits level), with no accuracy cost. On Qwen3-4B, reuse is 9-50x cheaper in compute than prefill, and the gap widens with length (prefill's attention scales with L^2), so a single reuse already pays it back. Then the part that matters: where the KV lives. Shipping it fails, because KV is nearly incompressible, so per-load egress costs more than the prefill it saves. Hosting it provider-side, exactly as production prompt-caching works, removes egress entirely. The size of the prize is set by our measured compute saving: serving one hot 3774-token document to 80M agents costs ~$1.5M to re-prefill but only ~$0.03M of reuse compute (49.7x less). The 0.1x cache-read tariff APIs charge passes a 10x discount to users while sitting inside this measured envelope, so the 10x is a floor that the measured ~50x compute saving clears, and the gap to the physical ~50x is provider margin: millions of dollars per popular document. We frame the resulting agent-native prefill CDN and leave lossless KV compression and a cross-party payment layer as the open problems.
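The shipping-versus-hosting argument can be made concrete with a rough size estimate. The sketch below is not taken from the paper: the layer count, KV-head count, head dimension, 16-bit storage, and the egress price are all placeholder assumptions chosen for illustration (check the actual model config and your provider's pricing), while the document length, agent count, and re-prefill cost come from the abstract.

```python
# ASSUMPTIONS (not from the paper): illustrative model shape, dtype, and price.
N_LAYERS = 36             # hypothetical transformer layer count
N_KV_HEADS = 8            # hypothetical KV heads (grouped-query attention)
HEAD_DIM = 128            # hypothetical per-head dimension
BYTES_PER_VALUE = 2       # 16-bit storage
EGRESS_USD_PER_GB = 0.09  # hypothetical cloud egress price

# Figures quoted in the abstract.
DOC_TOKENS = 3774
AGENTS = 80_000_000
REPREFILL_COST_USD = 1.5e6

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_value):
    """One key and one value vector per layer and per KV head."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value

kv_bytes = DOC_TOKENS * kv_bytes_per_token(
    N_LAYERS, N_KV_HEADS, HEAD_DIM, BYTES_PER_VALUE)
kv_gb = kv_bytes / 1e9
egress_per_load_usd = kv_gb * EGRESS_USD_PER_GB
prefill_per_agent_usd = REPREFILL_COST_USD / AGENTS

print(f"KV cache size:          {kv_gb:.3f} GB")
print(f"egress cost per load:   ${egress_per_load_usd:.4f}")
print(f"prefill cost per agent: ${prefill_per_agent_usd:.5f}")
print(f"shipping costs more than prefill: "
      f"{egress_per_load_usd > prefill_per_agent_usd}")
```

With these placeholder values the cache is a few hundred megabytes and a single transfer costs more than the per-agent prefill it would replace, consistent with the abstract's claim. Different assumptions will move the numbers.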
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that precomputing a document's KV cache once allows subsequent agents to load it and continue generation with results identical to prefilling from scratch (exact match on 24/24 greedy tokens and at the logits level, with no accuracy cost). On Qwen3-4B this yields 9-50x compute savings versus prefill (widening with length due to quadratic attention), and provider-side hosting modeled on production prompt caching eliminates egress costs that would otherwise make shipping uneconomical. It estimates large aggregate savings (e.g., $1.5M to $0.03M for one 3774-token document served to 80M agents at 49.7x) and frames the result as an agent-native prefill CDN, leaving lossless compression and cross-party payments as open problems.
Significance. If the exact KV-cache reuse and measured savings hold, the proposal could materially reduce redundant prefill work across agents sharing documents. The token-exact and logits-level equivalence, together with the concrete 9-50x compute ratios on Qwen3-4B, supplies a quantitative basis for the economic argument that a 10x user discount remains inside the measured envelope.
major comments (3)
- [Abstract] The claim of exact matching (24/24 greedy tokens and logits-level equivalence) with 'no accuracy cost' on Qwen3-4B is presented without methods, number of trials, error bars, or a full experimental protocol. This claim is load-bearing for the central empirical assertion that reuse is lossless.
- [Abstract] The 49.7x compute saving and the $1.5M vs $0.03M cost comparison for 80M agents rest on measured prefill versus reuse costs for a 3774-token document. The manuscript does not detail the cost model, hardware assumptions, or how the length-dependent scaling is obtained, which leaves the quantitative foundation of the CDN economics difficult to evaluate.
- [Abstract] The assertion that provider-side hosting 'exactly as production prompt-caching works, removes egress entirely' while preserving cross-agent usability with 'no material extra cost' is made by contrast with the failure of shipping, but the manuscript does not quantify loading latency, KV-format standardization across model versions, or authorization overhead. These factors are load-bearing for whether the claimed savings survive in a multi-agent setting.
minor comments (1)
- [Abstract] The term 'agent-native prefill CDN' is introduced in the abstract without an explicit definition or prior reference.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. The abstract is intentionally concise and summarizes results whose supporting methods, measurements, and analysis appear in the body of the manuscript. We address each point below and will revise the abstract to improve cross-references and add brief clarifying language where helpful.
- Referee: [Abstract] The claim of exact matching (24/24 greedy tokens and logits-level equivalence) with 'no accuracy cost' on Qwen3-4B is presented without methods, number of trials, error bars, or full experimental protocol, which is load-bearing for the central empirical assertion that reuse is lossless.
Authors: The abstract reports the headline empirical result. The full protocol (including the exact verification procedure for token-by-token and logit-level identity under greedy decoding) is given in the Experiments section. Because the match is deterministic for a fixed model, prompt, and decoding strategy, multiple trials and error bars are not applicable; the reported 24/24 figure reflects exhaustive checking on the evaluated document. We will revise the abstract to include an explicit pointer to the Experiments section.
Revision: yes
- Referee: [Abstract] The 49.7x compute saving and the $1.5M vs $0.03M cost comparison for 80M agents rest on measured prefill versus reuse costs for a 3774-token document; the manuscript does not detail the cost model, hardware assumptions, or how the length-dependent scaling is obtained, leaving the quantitative foundation of the CDN economics difficult to evaluate.
Authors: The 49.7x ratio and the dollar figures are obtained directly from measured prefill versus reuse wall-clock and FLOP counts on Qwen3-4B for the 3774-token document; the length-dependent widening follows from the quadratic attention cost of prefill versus the constant cost of reuse. The translation from measured compute to the $1.5M / $0.03M example uses standard public cloud GPU-hour pricing, which is stated in the economic analysis. We will add a short parenthetical reference in the abstract to the measurement source and the quadratic scaling argument.
Revision: yes
- Referee: [Abstract] The assertion that provider-side hosting 'exactly as production prompt-caching works, removes egress entirely' while preserving cross-agent usability with 'no material extra cost' is made by contrast to shipping failure, but does not quantify loading latency, KV-format standardization across model versions, or authorization overhead; these factors are load-bearing for whether the claimed savings survive in a multi-agent setting.
Authors: The claim rests on the observation that production prompt-caching systems already perform provider-side KV loading without per-user egress. We agree that loading latency, format standardization, and authorization are important operational details for a multi-agent deployment and will expand the relevant discussion section to address them, including any latency numbers obtained in our own reuse experiments. The core economic envelope (49.7x compute reduction) remains unchanged by these factors.
Revision: partial
Circularity Check
No circularity: the central claims are direct empirical measurements, with no reduction to inputs or self-citations.
The paper asserts that KV-cache reuse is token-exact and logit-identical to fresh prefill, and it quantifies 9-50x compute savings on Qwen3-4B. These are presented as measured outcomes, not as results derived from equations that collapse to fitted parameters. No load-bearing self-citations, uniqueness theorems, or ansatzes appear in the provided text. The provider-side hosting proposal is framed by analogy to existing prompt caching, without invoking prior work by the authors as justification. The derivation chain is therefore self-contained against external benchmarks and does not reduce any prediction to its own inputs by construction.
Axiom & Free-Parameter Ledger
invented entities (1)
- agent-native prefill CDN: no independent evidence