KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
Pith reviewed 2026-05-14 17:39 UTC · model grok-4.3
The pith
KVServe uses service-aware adaptive KV cache compression to cut latency bottlenecks in disaggregated LLM serving
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
KVServe unifies KV compression into a modular strategy space that supports new components and cross-method recomposition. A Bayesian Profiling Engine searches this space and distills a 3D Pareto candidate set, cutting offline search overhead by 50x. A Service-Aware Online Controller then pairs an analytical latency model with a lightweight bandit to choose profiles under constraints and correct offline-to-online gaps. Integrated with vLLM, the system delivers up to 9.13x JCT speedup in PD-separated serving and 32.8x TTFT reduction in KV-disaggregated serving.
What carries the argument
The Service-Aware Online Controller that fuses an analytical latency model with a bandit algorithm to select compression profiles from the Pareto set while adapting to live service conditions and fixing model mismatch.
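The paper does not publish the controller's interfaces, but the pattern it describes, an analytical latency prior corrected by a lightweight bandit, can be sketched as an epsilon-greedy loop over candidate profiles. Everything below is an illustrative assumption: the profile fields, the toy latency model, and the SLO check are hypothetical, not KVServe's actual API.

```python
import random

def predicted_latency(profile, bandwidth_gbps):
    """Toy analytical model: transfer time shrinks with compression ratio."""
    kv_bytes = profile["kv_mb"] * 1e6 / profile["compression_ratio"]
    transfer_s = kv_bytes * 8 / (bandwidth_gbps * 1e9)
    return transfer_s + profile["codec_overhead_s"]

def select_profile(profiles, estimates, bandwidth_gbps, slo_s, eps=0.1):
    """Pick a profile expected to meet the SLO; explore with probability eps.

    Unseen profiles fall back to the analytical model's prediction, so the
    model acts as the bandit's prior and online observations correct it.
    """
    expected = lambda p: estimates.get(p["name"],
                                       predicted_latency(p, bandwidth_gbps))
    feasible = [p for p in profiles if expected(p) <= slo_s]
    pool = feasible or profiles  # fall back if nothing fits the SLO
    if random.random() < eps:
        return random.choice(pool)
    return min(pool, key=expected)

def update(estimates, counts, profile, observed_s):
    """Running-mean update blends the observed latency into the estimate."""
    n = counts.get(profile["name"], 0) + 1
    prev = estimates.get(profile["name"], observed_s)
    estimates[profile["name"]] = prev + (observed_s - prev) / n
    counts[profile["name"]] = n
```

A production controller would need budgeted exploration and per-service state, but this shows the division of labor the review describes: the model proposes, the bandit corrects.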
If this is right
- In PD-separated serving the system can achieve up to 9.13x speedup in job completion time by adapting KV transfers.
- In KV-disaggregated serving the system can achieve up to 32.8x reduction in time-to-first-token by compressing the explicit KV payload.
- The same controller can enforce different quality-latency trade-offs when SLO budgets vary across services.
- Offline search cost drops 50x, making it practical to refresh the candidate set when models or networks change.
- The framework integrates directly into existing engines such as vLLM and works across models, GPUs, and networks.
Where Pith is reading between the lines
- The same modular space and controller pattern could be applied to other large state objects that cross network boundaries, such as activation checkpoints in training.
- In multi-tenant clusters the bandit could be extended to learn preferences across concurrent services rather than single-service adaptation.
- Hardware accelerators could embed lightweight versions of the latency model to make profile selection even faster at the NIC or GPU level.
- The 3D Pareto representation might be reused for other compression decisions, such as quantization or pruning, inside the same serving pipeline.
Load-bearing premise
The analytical latency model together with the bandit controller will pick compression profiles that match real performance in live deployments, even when offline profiling differs from online conditions, and without unacceptable quality loss under changing SLO budgets.
What would settle it
Run a production-like trace with shifting workloads and bandwidth on the same hardware, then measure whether the controller-chosen profiles deliver the claimed JCT or TTFT gains and whether output quality stays inside the target SLO window, or whether mismatch forces either a slowdown or a quality violation.
Original abstract
LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns KV into an explicit payload crossing network and storage boundaries, making KV a dominant end-to-end bottleneck. Existing KV compression are typically static runtime configurations, despite production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present \emph{KVServe}, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving: KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing $50\times$ offline search overhead; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and correct offline-to-online mismatch. Integrated into vLLM and evaluated across datasets, models, GPUs and networks, KVServe achieves up to $9.13\times$ JCT speedup in PD-separated serving and up to $32.8\times$ TTFT reduction in KV-disaggregated serving.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces KVServe, a service-aware adaptive KV cache compression framework for disaggregated LLM serving (PD separation and KV disaggregation). It unifies compression into a modular strategy space, uses a Bayesian Profiling Engine to distill a 3D Pareto candidate set (reducing offline search by 50×), and deploys a Service-Aware Online Controller combining an analytical latency model with a lightweight bandit to select profiles while correcting offline-to-online mismatch. Integrated into vLLM, it reports up to 9.13× JCT speedup in PD-separated serving and 32.8× TTFT reduction in KV-disaggregated serving across datasets, models, GPUs, and networks.
Significance. If the end-to-end speedups hold under realistic workload shifts and SLO variation, the work would meaningfully advance communication-efficient disaggregated inference by replacing static KV compression with adaptive, service-context-aware selection. The modular strategy space and Bayesian profiling are practical contributions that could be reused beyond the specific controller.
Major comments (2)
- [Abstract and §4] Abstract and §4 (Service-Aware Online Controller): the central 9.13× JCT and 32.8× TTFT claims rest on the analytical latency model plus bandit reliably correcting offline-to-online drift, yet no quantitative bound on model prediction error, bandit regret, or sensitivity to bandwidth/SLO changes is reported; without such bounds the reported gains could be trace-specific rather than robust.
- [Evaluation section] Evaluation section: the abstract states results across datasets, models, GPUs and networks, but the manuscript provides insufficient detail on experimental controls, error bars, exact workload mixes, and how compression quality is measured under varying SLO budgets, leaving the performance claims only moderately supported.
Minor comments (2)
- [§3] Clarify the exact definition of the 3D Pareto set (latency, quality, bandwidth) and how the bandit exploration budget is chosen in practice.
- [Figures 5-8] Figure captions and axis labels should explicitly state the network bandwidth ranges and SLO budgets used in each experiment.
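The first minor comment asks for the exact definition of the 3D Pareto set. Assuming the three axes are latency, quality loss, and bandwidth used, all minimized (an assumption inferred from the comment, not the paper's definition), the distillation step amounts to keeping the non-dominated candidates:

```python
def dominates(a, b):
    """a dominates b if it is no worse on every axis and strictly better on one.

    All axes are assumed minimized: (latency_s, quality_loss, bandwidth_gbps).
    """
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def pareto_set(candidates):
    """Keep only candidates not dominated by any other candidate."""
    return [c for c in candidates
            if not any(dominates(o, c) for o in candidates if o is not c)]
```

Whatever the paper's precise axes turn out to be, stating this dominance relation explicitly in §3 would answer the comment.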
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. The two major comments identify areas where additional analysis and exposition would strengthen the robustness and reproducibility of our claims. We address each point below and will revise the manuscript accordingly.
Point-by-point responses
Referee: [Abstract and §4] Abstract and §4 (Service-Aware Online Controller): the central 9.13× JCT and 32.8× TTFT claims rest on the analytical latency model plus bandit reliably correcting offline-to-online drift, yet no quantitative bound on model prediction error, bandit regret, or sensitivity to bandwidth/SLO changes is reported; without such bounds the reported gains could be trace-specific rather than robust.
Authors: We agree that quantitative characterization of the analytical model's prediction error and the bandit's regret, together with sensitivity analysis under bandwidth and SLO variation, would better substantiate that the reported speedups are robust rather than trace-specific. In the revised version we will add to §4: (i) measured L1 prediction error statistics across the evaluated bandwidth range, (ii) cumulative regret curves for the online bandit under both stationary and shifting workloads, and (iii) sensitivity plots showing JCT/TTFT variation when bandwidth and SLO budgets are perturbed by ±20%. These additions will be supported by new experiments that reuse the same profiling engine and controller already described. revision: yes
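The cumulative-regret curves the authors promise in (ii) could be computed from a per-step latency log; the sketch below assumes a hypothetical log format in which each step records every candidate profile's latency plus the controller's choice, and measures regret against the best fixed profile in hindsight:

```python
def cumulative_regret(latency_log):
    """latency_log: list of dicts mapping profile name -> observed latency (s),
    plus a 'chosen' key naming the profile the controller picked that step.
    Returns the cumulative regret curve against the best fixed profile."""
    names = [k for k in latency_log[0] if k != "chosen"]
    # Best fixed profile in hindsight: lowest total latency over the trace.
    totals = {n: sum(step[n] for step in latency_log) for n in names}
    best = min(totals, key=totals.get)
    regret, curve = 0.0, []
    for step in latency_log:
        regret += step[step["chosen"]] - step[best]
        curve.append(regret)
    return curve
```

A flat curve would support the robustness claim; a curve that keeps growing under workload shifts would indicate the bandit is not correcting the offline-to-online gap.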
Referee: [Evaluation section] Evaluation section: the abstract states results across datasets, models, GPUs and networks, but the manuscript provides insufficient detail on experimental controls, error bars, exact workload mixes, and how compression quality is measured under varying SLO budgets, leaving the performance claims only moderately supported.
Authors: We acknowledge that the current Evaluation section lacks sufficient methodological detail. In the revision we will expand it to report: (i) error bars computed from at least five independent runs with different random seeds, (ii) exact workload parameters (Poisson arrival rates, request-length distributions, and SLO/quality budgets for each experiment), (iii) the precise definition and measurement procedure for compression quality (perplexity delta and token-level accuracy) under each SLO budget, and (iv) a table enumerating the hardware/network configurations and the controls used to isolate the effect of the online controller. These clarifications will be placed in a new subsection on experimental methodology. revision: yes
Circularity Check
No circularity in derivation chain
Full rationale
The paper presents an empirical systems framework (Bayesian profiling + analytical latency model + bandit controller) integrated into vLLM. No equations or claims reduce by construction to fitted inputs, self-citations, or renamed prior results. The 3D Pareto set and online selection are described as engineering components whose performance is measured externally rather than derived tautologically from their own definitions. Central speedups are reported from end-to-end experiments across datasets, models, and networks, not from self-referential math.
Axiom & Free-Parameter Ledger
Axioms (1)
- domain assumption An analytical latency model can predict end-to-end performance sufficiently well to guide online decisions across varying bandwidth and SLO conditions.