Continuous Semantic Caching for Low-Cost LLM Serving
Pith reviewed 2026-05-10 02:29 UTC · model grok-4.3
The pith
A new framework allows semantic caching of LLM responses in continuous query spaces with provable sublinear regret.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By combining dynamic ε-net discretization with Kernel Ridge Regression, the system quantifies uncertainty in continuous query spaces and generalizes partial feedback across semantic neighborhoods. This supports both offline and online algorithms that reduce switching costs; the online version achieves, against an optimal continuous oracle, a sublinear regret bound that reduces to prior discrete-model bounds.
What carries the argument
Dynamic ε-net discretization coupled with Kernel Ridge Regression, which adaptively discretizes the continuous embedding space to quantify estimation uncertainty and generalize cost feedback across nearby queries.
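The discretization step rests on a plain greedy ε-cover of the embedding space. A minimal sketch (the function name, the Gaussian stand-in embeddings, and the `eps` value are illustrative assumptions, not the paper's construction):

```python
import numpy as np

def build_eps_net(embeddings: np.ndarray, eps: float) -> list:
    """Greedy eps-cover: a point becomes a new center only if it lies
    farther than eps from every center chosen so far."""
    centers = []
    for i, x in enumerate(embeddings):
        if not centers or min(np.linalg.norm(x - embeddings[c]) for c in centers) > eps:
            centers.append(i)
    return centers

rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 8))   # stand-in for query embeddings
net = build_eps_net(pts, eps=4.0)
```

Shrinking `eps` enlarges the net; the regret analysis hinges on this covering number staying controlled as ε decreases.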
If this is right
- The online algorithm achieves a sublinear regret bound against an optimal continuous oracle.
- The regret bound reduces to existing bounds when the query space is discrete.
- Empirical evaluations show the framework approximates the continuous optimal cache while lowering computational and switching overhead.
- Partial feedback on query costs can be generalized across semantic neighborhoods without full observation.
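The last bullet, generalizing partial cost feedback to nearby queries, can be sketched with a toy KRR posterior. The kernel choice, bandwidth `gamma`, ridge `lam`, and one-dimensional "embeddings" are all illustrative assumptions, not the paper's estimator:

```python
import numpy as np

def rbf(A, B, gamma=0.5):
    # Gaussian (RBF) kernel matrix between the rows of A and B
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_predict(X, y, Xq, lam=0.1, gamma=0.5):
    """Kernel ridge mean plus a GP-style uncertainty width at queries Xq."""
    M = rbf(X, X, gamma) + lam * np.eye(len(X))
    Kq = rbf(Xq, X, gamma)
    mean = Kq @ np.linalg.solve(M, y)
    # predictive variance shrinks near observed queries
    var = 1.0 - np.einsum('ij,jk,ik->i', Kq, np.linalg.inv(M), Kq)
    return mean, np.sqrt(np.maximum(var, 0.0))

X = np.array([[0.0], [1.0], [2.0]])   # embeddings of observed queries
y = np.array([1.0, 2.0, 3.0])         # their observed serving costs
mean, width = krr_predict(X, y, np.array([[0.05], [5.0]]))
```

The width term is what lets an optimistic caching rule act on queries it has never seen exactly: it is small near observed queries and grows with semantic distance.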
Where Pith is reading between the lines
- If the discretization approach scales, it could be adapted to continuous spaces in related problems like dynamic resource allocation for model inference.
- The generalization via kernels implies that embedding similarity can substitute for exact query matches in caching decisions.
- Real deployment would benefit from testing whether user query streams naturally form low-dimensional manifolds that simplify the ε-net construction.
Load-bearing premise
Real-world LLM queries lie in an infinite continuous embedding space where dynamic ε-net discretization with Kernel Ridge Regression can quantify uncertainty and generalize partial feedback across semantic neighborhoods.
What would settle it
Running the online algorithm on a long sequence of real or simulated continuous LLM queries and observing linear growth in cumulative regret instead of sublinear growth would falsify the theoretical bound.
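Such a falsification check is easy to instrument: fit the growth exponent of cumulative regret on a log-log scale. A sketch, assuming a cumulative-regret trace is available (names are illustrative):

```python
import numpy as np

def regret_growth_exponent(cum_regret: np.ndarray) -> float:
    """Slope of log R_t against log t: close to 1 means linear growth
    (the bound is falsified); clearly below 1 is consistent with
    sublinear regret."""
    t = np.arange(1, len(cum_regret) + 1)
    slope, _ = np.polyfit(np.log(t), np.log(cum_regret), 1)
    return slope

t = np.arange(1, 10_001)
sublinear_exp = regret_growth_exponent(np.sqrt(t))   # an O(sqrt(T)) trace
linear_exp = regret_growth_exponent(0.3 * t)         # constant per-round regret
```

On synthetic traces the exponent recovers 0.5 for a square-root curve and 1.0 for a linear one, so a fitted exponent near 1 on real query streams would contradict the theorem.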
Original abstract
As Large Language Models (LLMs) become increasingly popular, caching responses so that they can be reused by users with semantically similar queries has become a vital strategy for reducing inference costs and latency. Existing caching frameworks have proposed to decide which query responses to cache by assuming a finite, known universe of discrete queries and learning their serving costs and arrival probabilities. As LLMs' pool of users and queries expands, however, such an assumption becomes increasingly untenable: real-world LLM queries reside in an infinite, continuous embedding space. In this paper, we establish the first rigorous theoretical framework for semantic LLM response caching in continuous query space under uncertainty. To bridge the gap between discrete optimization and continuous representation spaces, we introduce dynamic $\epsilon$-net discretization coupled with Kernel Ridge Regression. This design enables the system to formally quantify estimation uncertainty and generalize partial feedback on LLM query costs across continuous semantic query neighborhoods. We develop both offline learning and online adaptive algorithms optimized to reduce switching costs incurred by changing the cached responses. We prove that our online algorithm achieves a sublinear regret bound against an optimal continuous oracle, which reduces to existing bounds for discrete query models. Extensive empirical evaluations demonstrate that our framework approximates the continuous optimal cache well while also reducing computational and switching overhead compared to existing methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces the first rigorous theoretical framework for semantic LLM response caching in continuous, infinite embedding spaces. It bridges discrete optimization with continuous representations via dynamic ε-net discretization combined with Kernel Ridge Regression (KRR) to quantify uncertainty and generalize partial cost feedback across semantic neighborhoods. Offline and online adaptive algorithms are developed to minimize switching costs when updating the cache; the online algorithm is proven to achieve sublinear regret against an optimal continuous oracle, with the bound reducing to existing discrete-query results. Empirical evaluations are claimed to show that the framework approximates the continuous optimum while lowering computational and switching overhead relative to prior methods.
Significance. If the central regret analysis holds, the work would supply the first formal sublinear-regret guarantee for continuous semantic caching, directly addressing the scalability limits of discrete-query assumptions as LLM user bases grow. The reduction to known discrete bounds and the use of KRR for neighborhood generalization are technically attractive strengths. However, the result's generality hinges on unstated conditions on the cost function, so the practical significance for arbitrary LLM workloads remains conditional on verification of those conditions.
major comments (1)
- [Abstract and theoretical results section] Abstract and theoretical results section: the claimed sublinear regret bound for the online algorithm against the continuous oracle is obtained via dynamic ε-net discretization + KRR. This analysis requires that the unknown query-cost function lie in the RKHS of the chosen kernel with norm bounded independently of the discretization parameter ε; without this, the covering numbers of the ε-net can grow with query diversity or dimension and the regret ceases to be sublinear. The manuscript does not state or verify this bounded-norm condition, yet the abstract asserts the bound 'reduces to existing bounds for discrete query models' without supplying the argument that the continuous-to-discrete reduction preserves sublinearity absent the extra assumption.
minor comments (1)
- The abstract refers to 'extensive empirical evaluations' and 'reducing computational and switching overhead compared to existing methods' but supplies no dataset descriptions, baseline implementations, or quantitative tables in the provided text; adding these details would strengthen the empirical claims without altering the theoretical contribution.
Simulated Author's Rebuttal
We thank the referee for their careful reading and constructive feedback on our manuscript. We address the major comment point by point below and will incorporate the necessary clarifications in the revised version.
Point-by-point responses
-
Referee: [Abstract and theoretical results section] Abstract and theoretical results section: the claimed sublinear regret bound for the online algorithm against the continuous oracle is obtained via dynamic ε-net discretization + KRR. This analysis requires that the unknown query-cost function lie in the RKHS of the chosen kernel with norm bounded independently of the discretization parameter ε; without this, the covering numbers of the ε-net can grow with query diversity or dimension and the regret ceases to be sublinear. The manuscript does not state or verify this bounded-norm condition, yet the abstract asserts the bound 'reduces to existing bounds for discrete query models' without supplying the argument that the continuous-to-discrete reduction preserves sublinearity absent the extra assumption.
Authors: We agree that the sublinear regret bound relies on the query-cost function belonging to the RKHS with a norm bounded independently of ε; this is a standard condition in kernel methods to ensure controlled covering numbers and sublinear regret. While our theoretical analysis implicitly uses this assumption from the KRR literature, we acknowledge that it is not explicitly stated in the abstract or theoretical results section, and the reduction argument is not detailed there. In the revised manuscript, we will explicitly state the bounded-norm condition and add a concise argument demonstrating that the continuous-to-discrete reduction preserves sublinearity under this assumption, thereby clarifying how the bound reduces to existing discrete-query results. revision: yes
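For concreteness, the condition under discussion and the shape of bound it typically yields can be stated as follows; this is a generic formulation from the kernelized-bandit literature, not quoted from the manuscript:

```latex
% Bounded-norm condition: the unknown query-cost function c lies in the
% RKHS \mathcal{H}_k of the chosen kernel k, with norm bounded
% independently of the discretization parameter \epsilon:
\lVert c \rVert_{\mathcal{H}_k} \le B, \qquad B \text{ independent of } \epsilon.
% Under such a condition, kernelized regret bounds typically take the form
R_T = \tilde{O}\!\left( B \sqrt{T \, \gamma_T} \right),
% where \gamma_T is the maximal information gain of k after T rounds;
% sublinearity requires \gamma_T = o(T), which is what controlled
% covering numbers of the \epsilon-net are meant to secure.
```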
Circularity Check
No significant circularity; derivation extends discrete models via standard techniques without self-referential reduction
Full rationale
The paper's central result is a sublinear regret bound for an online algorithm in continuous embedding space, derived from dynamic ε-net discretization combined with Kernel Ridge Regression to quantify uncertainty and generalize feedback. This construction reduces to discrete bounds by design of the discretization, but the continuous case adds independent content through RKHS-based neighborhood generalization and handling of switching costs. No steps match the enumerated circularity patterns: the regret analysis is not self-definitional, does not rename a fitted parameter as a prediction, and does not rely on load-bearing self-citations or imported uniqueness theorems. The framework applies standard online learning and kernel methods to the new continuous setting under stated assumptions, remaining self-contained against external benchmarks in discrete caching literature.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: LLM queries reside in an infinite, continuous embedding space.
- Ad hoc to this paper: dynamic ε-net discretization coupled with Kernel Ridge Regression quantifies uncertainty and generalizes partial feedback across neighborhoods.
Reference graph
Works this paper leans on
-
[1]
Baran Atalar, Xutong Liu, Jinhang Zuo, Siwei Wang, Wei Chen, and Carlee Joe-Wong. 2026. Continuous Semantic Caching for Low-Cost LLM Serving. Technical Report. Carnegie Mellon University. https://research.ece.cmu.edu/lions/Papers/Semantic_LLM_Caching.pdf
2026
-
[2]
Hyokyung Bahn. 2005. Web cache management based on the expected cost of web objects. Information and Software Technology 47, 9 (2005), 609–621
2005
-
[3]
Fu Bang. 2023. GPTCache: An Open-Source Semantic Cache for LLM Applications Enabling Faster Answers and Cost Savings. Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023) (2023)
2023
-
[4]
Aaron Chatterji, Thomas Cunningham, David J Deming, Zoe Hitzig, Christopher Ong, Carl Yan Shan, and Kevin Wadman. 2025. How people use ChatGPT. Technical Report. National Bureau of Economic Research
2025
-
[5]
Wei Chen, Yajun Wang, and Yang Yuan. 2013. Combinatorial multi-armed bandit: General framework and applications. In ICML. PMLR, 151–159
2013
- [6]
- [7]
-
[8]
Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer. 2022. GPT3.int8(): 8-bit matrix multiplication for transformers at scale. Advances in Neural Information Processing Systems 35 (2022), 30318–30332
2022
- [9]
-
[10]
In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. 2023. Prompt Cache: Modular Attention Reuse for Low-Latency Inference. ArXiv abs/2311.04934 (2023)
2023
-
[11]
Victor P Il’ev. 2001. An approximation guarantee of the greedy descent algorithm for minimizing a supermodular set function. Discrete Applied Mathematics 114, 1-3 (2001), 131–146
2001
-
[12]
Mandar Joshi, Eunsol Choi, Daniel Weld, and Luke Zettlemoyer. 2017. TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the ACL, Regina Barzilay and Min-Yen Kan (Eds.). ACL, Vancouver, Canada, 1601–1611. doi:10.18653/v1/P17-1147
2017
-
[13]
Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: A Benchmark for Question Answering Research. Transactions of the Association for Computational Linguistics
2019
-
[14]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles (2023)
2023
-
[15]
Donghee Lee, Jongmoo Choi, Jong-Hun Kim, Sam H Noh, Sang Lyul Min, Yookun Cho, and Chong Sang Kim. 2001. LRFU: A spectrum of policies that subsumes the least recently used and least frequently used policies. IEEE Transactions on Computers 50, 12 (2001), 1352–1361
2001
-
[16]
Yaniv Leviathan, Matan Kalman, and Yossi Matias. 2023. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning. PMLR, 19274–19286
2023
-
[17]
Jiaxing Li, Chi Xu, Feng Wang, Isaac M von Riedemann, Cong Zhang, and Jiangchuan Liu. 2024. Scalm: Towards semantic caching for automated chat services with large language models. In 2024 IEEE/ACM 32nd IWQoS. IEEE, 1–10
2024
- [18]
- [19]
- [20]
-
[21]
Georgios S. Paschos, Apostolos Destounis, Luigi Vigneri, and George Iosifidis
-
[22]
Learning to Cache With No Regrets. CoRR (2019). arXiv:1904.09849
- [23]
- [24]
-
[25]
Nils Reimers and Iryna Gurevych. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the EMNLP 2019 Conference. ACL
2019
-
[26]
Siddharth Samsi, Dan Zhao, Joseph McDonald, Baolin Li, Adam Michaleas, Michael Jones, William Bergeron, Jeremy Kepner, Devesh Tiwari, and Vijay Gadepally. 2023. From words to watts: Benchmarking the energy costs of large language model inference. In 2023 IEEE High Performance Extreme Computing Conference (HPEC). IEEE, 1–9
2023
-
[27]
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Daniel Y. Fu, Zhiqiang Xie, Beidi Chen, Clark W. Barrett, Joseph Gonzalez, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. 2023. High-throughput Generative Inference of Large Language Models with a Single GPU. In ICML
2023
- [28]
- [29]
- [30]
-
[31]
Tsachy Weissman, Erik Ordentlich, Gadiel Seroussi, Sergio Verdu, and Marcelo J Weinberger. 2003. Inequalities for the L1 deviation of the empirical distribution. Hewlett-Packard Labs, Tech. Rep. (2003)
2003
-
[32]
Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. 2023. Smoothquant: Accurate and efficient post-training quantization for large language models. In ICML. PMLR, 38087–38099
2023
- [33]
-
[34]
Banghua Zhu, Ying Sheng, Lianmin Zheng, Clark Barrett, Michael Jordan, and Jiantao Jiao. 2023. Towards Optimal Caching and Model Selection for Large Model Inference. Advances in Neural Information Processing Systems 36 (2023), 59062–59094
2023
- [35]
- [36]