Pith · machine review for the scientific record

arxiv: 2604.20021 · v1 · submitted 2026-04-21 · 💻 cs.LG · cs.CL

Recognition: unknown

Continuous Semantic Caching for Low-Cost LLM Serving


Pith reviewed 2026-05-10 02:29 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords: semantic caching · LLM response caching · continuous query space · online learning · regret bounds · kernel ridge regression · epsilon-net · cost optimization

The pith

A new framework allows semantic caching of LLM responses in continuous query spaces with provable sublinear regret.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a rigorous theoretical framework for caching LLM responses when queries exist in an infinite continuous embedding space rather than a known discrete set. It introduces dynamic epsilon-net discretization paired with kernel ridge regression to measure uncertainty in cost estimates and to extend observations from served queries to semantically similar ones. Online algorithms built on this foundation adapt the cache over time while controlling the cost of changes, and the analysis shows sublinear regret relative to the best possible continuous policy. This matters because it removes the need to predefine all possible queries, enabling efficient reuse of responses as the range of user inputs grows without bound.

Core claim

By combining dynamic ε-net discretization with Kernel Ridge Regression, the system quantifies uncertainty in continuous query spaces and generalizes partial feedback across semantic neighborhoods. This supports both offline and online algorithms that reduce switching costs, with the online version achieving a sublinear regret bound against an optimal continuous oracle that reduces to prior discrete-model bounds.
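The paper's own construction is not reproduced in this review, but the discretization idea can be sketched. A greedy ε-net keeps a query embedding as a new center only if it is more than ε away from every existing center, so the centers simultaneously cover the stream and stay well separated. The names and parameters below are illustrative, not taken from the paper.

```python
import numpy as np

def greedy_epsilon_net(points, eps):
    """Greedy epsilon-net over a stream of embeddings: every point ends up
    within eps of some center, and centers stay pairwise more than eps apart."""
    centers = []
    for p in points:
        if all(np.linalg.norm(p - c) > eps for c in centers):
            centers.append(p)
    return np.array(centers)

# 200 two-dimensional "query embeddings" collapse to a handful of centers.
rng = np.random.default_rng(0)
queries = rng.uniform(0.0, 5.0, size=(200, 2))
centers = greedy_epsilon_net(queries, eps=1.0)
```

The covering and packing properties hold by construction; in the dynamic setting described by the paper, the net would grow only when genuinely novel query regions appear in the stream.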

What carries the argument

dynamic ε-net discretization coupled with Kernel Ridge Regression, which adaptively discretizes the continuous embedding space to quantify estimation uncertainty and generalize cost feedback across nearby queries
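The KRR machinery described here can be sketched as follows: a posterior mean predicts the cost of an unseen query from costs observed at served queries, and a width σ(z) quantifies how far that query sits from the observed data. This is a generic kernelized-regression sketch under an assumed RBF kernel, not the paper's implementation.

```python
import numpy as np

def rbf(A, B, ls=1.0):
    """RBF kernel matrix between rows of A and rows of B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * ls**2))

def krr_posterior(Z, y, Zq, lam=0.1):
    """Mean cost estimate and uncertainty width at query points Zq:
    sigma(z)^2 = k(z, z) - k_z^T (K + lam*I)^{-1} k_z."""
    K_inv = np.linalg.inv(rbf(Z, Z) + lam * np.eye(len(Z)))
    kq = rbf(Zq, Z)
    mean = kq @ K_inv @ y
    var = rbf(Zq, Zq).diagonal() - np.einsum("ij,jk,ik->i", kq, K_inv, kq)
    return mean, np.sqrt(np.maximum(var, 0.0))

# Costs observed at three served queries generalize to a nearby query (z=1.1)
# with low uncertainty, while a far-away query (z=10) stays highly uncertain.
Z = np.array([[0.0], [1.0], [2.0]])
y = np.array([1.0, 2.0, 3.0])
mean, sigma = krr_posterior(Z, y, np.array([[1.1], [10.0]]))
```

The shrinking σ near served queries is what lets partial cost feedback propagate across a semantic neighborhood instead of requiring an observation for every distinct query.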

If this is right

  • The online algorithm achieves a sublinear regret bound against an optimal continuous oracle.
  • The regret bound reduces to existing bounds when the query space is discrete.
  • Empirical evaluations show the framework approximates the continuous optimal cache while lowering computational and switching overhead.
  • Partial feedback on query costs can be generalized across semantic neighborhoods without full observation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the discretization approach scales, it could be adapted to continuous spaces in related problems like dynamic resource allocation for model inference.
  • The generalization via kernels implies that embedding similarity can substitute for exact query matches in caching decisions.
  • Real deployment would benefit from testing whether user query streams naturally form low-dimensional manifolds that simplify the epsilon-net construction.
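The last bullet is directly testable. One hedged way to do it, not drawn from the paper: estimate the intrinsic dimension of the query-embedding stream with the two-nearest-neighbor estimator (the MLE form d ≈ N / Σ ln(r₂/r₁)); a value far below the ambient embedding dimension would support the low-dimensional-manifold hypothesis.

```python
import numpy as np

def twonn_dimension(X):
    """Two-NN intrinsic-dimension estimate: d = N / sum(ln(r2 / r1)),
    where r1, r2 are each point's first and second nearest-neighbor distances."""
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    np.fill_diagonal(D, np.inf)
    r = np.sort(D, axis=1)[:, :2]
    return len(X) / np.log(r[:, 1] / r[:, 0]).sum()

# A 2-D sheet embedded in a 10-D ambient space reports dimension near 2,
# even though the raw embedding vectors are 10-dimensional.
rng = np.random.default_rng(1)
X = np.zeros((800, 10))
X[:, :2] = rng.uniform(size=(800, 2))
```

A low estimate on real query streams would mean the ε-net's covering number grows with the intrinsic, not ambient, dimension, which is the regime where the construction stays cheap.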

Load-bearing premise

Real-world LLM queries lie in an infinite continuous embedding space where dynamic ε-net discretization with Kernel Ridge Regression can quantify uncertainty and generalize partial feedback across semantic neighborhoods.

What would settle it

Running the online algorithm on a long sequence of real or simulated continuous LLM queries and observing linear growth in cumulative regret instead of sublinear growth would falsify the theoretical bound.
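A concrete version of this falsification test, sketched here rather than taken from the paper: fit the growth exponent of cumulative regret on a log-log scale. An exponent near 1 indicates linear growth and would contradict the bound; an exponent clearly below 1 is consistent with it.

```python
import numpy as np

def regret_growth_exponent(cum_regret, burn_in=10):
    """Slope of log(cumulative regret) vs log(t); < 1 means sublinear growth."""
    t = np.arange(1, len(cum_regret) + 1)
    slope, _ = np.polyfit(np.log(t[burn_in:]), np.log(cum_regret[burn_in:]), 1)
    return slope

# Synthetic curves standing in for a real regret trace:
t = np.arange(1, 5001)
sublinear = np.sqrt(t)   # O(sqrt(T)) regret, exponent ~ 0.5: consistent
linear = 0.3 * t         # linear regret, exponent ~ 1.0: would falsify
```

On a real trace the exponent is only meaningful once transient early rounds are excluded, hence the burn-in parameter.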

Figures

Figures reproduced from arXiv: 2604.20021 by Baran Atalar, Carlee Joe-Wong, Jinhang Zuo, Siwei Wang, Wei Chen, Xutong Liu.

Figure 1. Continuous semantic caching system for low-cost …
Figure 2. (a) Variation of the suboptimality gap versus stream length in the offline setting; (b) sensitivity analysis showing the …
Figure 3. PCA on the synthetic dataset; UMAP projection of NQ vs TriviaQA clustering and stream arrivals.
Figure 4. UMAP on the NQ+TriviaQA dataset …
Figure 5. Cache size vs. suboptimality gap ablation for the …
Figure 7. (a) Variation of the suboptimality gap versus stream length in the offline setting; (b) sensitivity analysis showing the …
Figure 8. Distribution of pairwise distances among stream …
Figure 10. Nearest-neighbor distance distribution within the …
Figure 14. CDF of the token length distribution for the syn…
Figure 12. Cost prediction error of the KRR model …
Figure 19. Nearest-neighbor distance distribution in embed…
Figure 20. Nearest-neighbor distance CDF in embedding …
Figure 16. Token length CDF for Natural Questions, TriviaQA, …
Original abstract

As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference costs and latency. Existing caching frameworks have proposed to decide which query responses to cache by assuming a finite, known universe of discrete queries and learning their serving costs and arrival probabilities. As LLMs' pool of users and queries expands, however, such an assumption becomes increasingly untenable: real-world LLM queries reside in an infinite, continuous embedding space. In this paper, we establish the first rigorous theoretical framework for semantic LLM response caching in continuous query space under uncertainty. To bridge the gap between discrete optimization and continuous representation spaces, we introduce dynamic $\epsilon$-net discretization coupled with Kernel Ridge Regression. This design enables the system to formally quantify estimation uncertainty and generalize partial feedback on LLM query costs across continuous semantic query neighborhoods. We develop both offline learning and online adaptive algorithms optimized to reduce switching costs incurred by changing the cached responses. We prove that our online algorithm achieves a sublinear regret bound against an optimal continuous oracle, which reduces to existing bounds for discrete query models. Extensive empirical evaluations demonstrate that our framework approximates the continuous optimal cache well while also reducing computational and switching overhead compared to existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces the first rigorous theoretical framework for semantic LLM response caching in continuous, infinite embedding spaces. It bridges discrete optimization with continuous representations via dynamic ε-net discretization combined with Kernel Ridge Regression (KRR) to quantify uncertainty and generalize partial cost feedback across semantic neighborhoods. Offline and online adaptive algorithms are developed to minimize switching costs when updating the cache; the online algorithm is proven to achieve sublinear regret against an optimal continuous oracle, with the bound reducing to existing discrete-query results. Empirical evaluations are claimed to show that the framework approximates the continuous optimum while lowering computational and switching overhead relative to prior methods.

Significance. If the central regret analysis holds, the work would supply the first formal sublinear-regret guarantee for continuous semantic caching, directly addressing the scalability limits of discrete-query assumptions as LLM user bases grow. The reduction to known discrete bounds and the use of KRR for neighborhood generalization are technically attractive strengths. However, the result's generality hinges on unstated conditions on the cost function, so the practical significance for arbitrary LLM workloads remains conditional on verification of those conditions.

major comments (1)
  1. [Abstract and theoretical results section] The claimed sublinear regret bound for the online algorithm against the continuous oracle is obtained via dynamic ε-net discretization + KRR. This analysis requires that the unknown query-cost function lie in the RKHS of the chosen kernel with norm bounded independently of the discretization parameter ε; without this, the covering numbers of the ε-net can grow with query diversity or dimension and the regret ceases to be sublinear. The manuscript does not state or verify this bounded-norm condition, yet the abstract asserts the bound 'reduces to existing bounds for discrete query models' without supplying the argument that the continuous-to-discrete reduction preserves sublinearity absent the extra assumption.
minor comments (1)
  1. The abstract refers to 'extensive empirical evaluations' and 'reducing computational and switching overhead compared to existing methods' but supplies no dataset descriptions, baseline implementations, or quantitative tables in the provided text; adding these details would strengthen the empirical claims without altering the theoretical contribution.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our manuscript. We address the major comment point by point below and will incorporate the necessary clarifications in the revised version.

Point-by-point responses
  1. Referee: [Abstract and theoretical results section] The claimed sublinear regret bound for the online algorithm against the continuous oracle is obtained via dynamic ε-net discretization + KRR. This analysis requires that the unknown query-cost function lie in the RKHS of the chosen kernel with norm bounded independently of the discretization parameter ε; without this, the covering numbers of the ε-net can grow with query diversity or dimension and the regret ceases to be sublinear. The manuscript does not state or verify this bounded-norm condition, yet the abstract asserts the bound 'reduces to existing bounds for discrete query models' without supplying the argument that the continuous-to-discrete reduction preserves sublinearity absent the extra assumption.

    Authors: We agree that the sublinear regret bound relies on the query-cost function belonging to the RKHS with a norm bounded independently of ε; this is a standard condition in kernel methods to ensure controlled covering numbers and sublinear regret. While our theoretical analysis implicitly uses this assumption from the KRR literature, we acknowledge that it is not explicitly stated in the abstract or theoretical results section, and the reduction argument is not detailed there. In the revised manuscript, we will explicitly state the bounded-norm condition and add a concise argument demonstrating that the continuous-to-discrete reduction preserves sublinearity under this assumption, thereby clarifying how the bound reduces to existing discrete-query results. revision: yes
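For readers, the condition this exchange turns on can be written out. The notation below is the standard kernelized-bandit form (in the style of Chowdhury and Gopalan, 2017) and is assumed rather than copied from the paper: the cost function f must have RKHS norm bounded by a constant B that does not depend on the discretization parameter ε.

```latex
% Bounded-norm condition (assumed; standard in kernelized-bandit analyses):
\|f\|_{\mathcal{H}_k} \le B, \qquad B \text{ independent of } \epsilon .

% It yields an any-time confidence band around the KRR estimate \mu_t,
% with posterior width \sigma_t and information gain \gamma_t:
|f(z) - \mu_t(z)| \le \beta_t \, \sigma_t(z), \qquad
\beta_t = B + R\sqrt{2\left(\gamma_t + 1 + \ln(1/\delta)\right)} .

% Summing the widths over T rounds gives the sublinear scaling at stake:
\sum_{t=1}^{T} \beta_t \, \sigma_{t-1}(z_t)
  = O\!\left(\sqrt{T\,\gamma_T}\,\big(B + \sqrt{\gamma_T}\big)\right).
```

If B instead grew with the ε-net's covering number, β_t would inflate with query diversity and the final sum would no longer be sublinear in T, which is precisely the referee's concern.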

Circularity Check

0 steps flagged

No significant circularity; derivation extends discrete models via standard techniques without self-referential reduction

full rationale

The paper's central result is a sublinear regret bound for an online algorithm in continuous embedding space, derived from dynamic ε-net discretization combined with Kernel Ridge Regression to quantify uncertainty and generalize feedback. This construction reduces to discrete bounds by design of the discretization, but the continuous case adds independent content through RKHS-based neighborhood generalization and handling of switching costs. No steps match the enumerated circularity patterns: the regret analysis is not self-definitional, does not rename a fitted parameter as a prediction, and does not rely on load-bearing self-citations or imported uniqueness theorems. The framework applies standard online learning and kernel methods to the new continuous setting under stated assumptions, remaining self-contained against external benchmarks in discrete caching literature.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review based on abstract only; full parameter lists, proof assumptions, and any invented constructs are unavailable.

axioms (2)
  • domain assumption LLM queries reside in an infinite, continuous embedding space
    Explicitly stated as the reason existing discrete models become untenable.
  • ad hoc to paper Dynamic ε-net discretization coupled with Kernel Ridge Regression quantifies uncertainty and generalizes partial feedback across neighborhoods
    Introduced in the abstract as the core bridge between discrete optimization and continuous spaces.

pith-pipeline@v0.9.0 · 5536 in / 1252 out tokens · 37749 ms · 2026-05-10T02:29:26.250008+00:00 · methodology

