pith. machine review for the scientific record.

arxiv: 2605.13734 · v1 · submitted 2026-05-13 · 💻 cs.DC · cs.AI · cs.NI

Recognition: no theorem link

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 17:39 UTC · model grok-4.3

classification 💻 cs.DC cs.AI cs.NI
keywords KV cache compression · disaggregated LLM serving · adaptive compression · service-aware systems · PD separation · KV disaggregation · Bayesian profiling · bandit controller

The pith

KVServe uses service-aware adaptive KV cache compression to cut latency bottlenecks in disaggregated LLM serving

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that static KV compression choices become inefficient when production conditions like workload mix, network bandwidth, and quality budgets shift over time in disaggregated setups. KVServe builds a single modular space of compression strategies, runs a Bayesian profiler to extract a compact Pareto set of candidates with 50x less offline effort, and runs an online controller that combines a latency model with a bandit to pick profiles on the fly while respecting SLOs. If this works, KV transfers stop dominating end-to-end time in PD-separated and KV-disaggregated deployments, letting the same hardware handle higher request rates without quality loss. The approach treats compression not as a fixed knob but as a controllable payload that the serving system can adjust to current service context.
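
To see the scale of the bottleneck being targeted, a rough back-of-envelope example with illustrative numbers (not taken from the paper): a 4 GB KV payload pushed over a 10 Gb/s link (about 1.25 GB/s) takes roughly 4 / 1.25 ≈ 3.2 s to transfer, while at a 5x compression ratio the same payload moves in about 0.64 s plus (de)compression time. Whether that trade pays off depends on the current bandwidth, the quality budget, and the compression cost, which is exactly the service context KVServe adapts to.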

Core claim

KVServe (1) unifies KV compression into a modular strategy space that supports new components and cross-method recomposition; (2) applies a Bayesian Profiling Engine to search the space and distill a 3D Pareto candidate set, cutting offline search overhead by 50x; and (3) runs a Service-Aware Online Controller that pairs an analytical latency model with a lightweight bandit to choose profiles under constraints and correct offline-to-online gaps. Integrated with vLLM, this delivers up to 9.13x JCT speedup in PD-separated serving and 32.8x TTFT reduction in KV-disaggregated serving.
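
For intuition, a minimal sketch of what distilling a 3D Pareto candidate set can look like. The three axes (latency, quality loss, compression ratio), the Profile fields, and the dominance rule are illustrative assumptions, not the paper's exact formulation.

    from dataclasses import dataclass

    @dataclass(frozen=True)
    class Profile:
        name: str
        latency_ms: float         # offline-profiled (de)compression cost
        quality_loss: float       # e.g. perplexity delta vs. uncompressed
        compression_ratio: float  # payload shrink factor

    def dominates(a: Profile, b: Profile) -> bool:
        # a dominates b if it is no worse on every axis and strictly better on one
        no_worse = (a.latency_ms <= b.latency_ms
                    and a.quality_loss <= b.quality_loss
                    and a.compression_ratio >= b.compression_ratio)
        better = (a.latency_ms < b.latency_ms
                  or a.quality_loss < b.quality_loss
                  or a.compression_ratio > b.compression_ratio)
        return no_worse and better

    def pareto_candidates(profiles: list[Profile]) -> list[Profile]:
        # keep only non-dominated strategies; the online controller
        # would then choose among this reduced candidate set
        return [p for p in profiles
                if not any(dominates(q, p) for q in profiles if q is not p)]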

What carries the argument

The Service-Aware Online Controller that fuses an analytical latency model with a bandit algorithm to select compression profiles from the Pareto set while adapting to live service conditions and fixing model mismatch.
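
The paper describes this controller only at the level summarized above, so the following is a speculative sketch of the pattern it names rather than the authors' implementation: an analytical estimate of compressed-transfer latency ranks the Pareto candidates (reusing the hypothetical Profile type from the sketch above), a quality budget filters them, and a per-profile residual learned online (here with a simple UCB-style exploration bonus) corrects offline-to-online mismatch. The class name, the analytical form, and the residual/bonus shapes are all assumptions.

    import math
    from collections import defaultdict

    class ServiceAwareController:
        """Sketch: combine an analytical latency estimate with an online,
        bandit-style residual correction to pick a profile under a quality SLO."""

        def __init__(self, candidates, quality_budget, explore=1.0):
            self.candidates = candidates          # Pareto Profile objects (see above)
            self.quality_budget = quality_budget  # max tolerated quality loss
            self.explore = explore                # exploration weight
            self.pulls = defaultdict(int)         # times each profile was chosen
            self.residual = defaultdict(float)    # mean observed-minus-predicted latency

        def predict_ms(self, p, kv_bytes, bandwidth_bps):
            # assumed analytical form: compressed transfer time + profile's fixed cost
            return 1e3 * kv_bytes / (p.compression_ratio * bandwidth_bps) + p.latency_ms

        def select(self, kv_bytes, bandwidth_bps):
            feasible = [p for p in self.candidates
                        if p.quality_loss <= self.quality_budget]
            total = sum(self.pulls[p.name] for p in feasible) + 1
            def score(p):
                est = self.predict_ms(p, kv_bytes, bandwidth_bps) + self.residual[p.name]
                bonus = self.explore * math.sqrt(math.log(total + 1) / (self.pulls[p.name] + 1))
                return est - bonus               # optimism: rarely tried profiles look cheaper
            return min(feasible, key=score)

        def update(self, p, kv_bytes, bandwidth_bps, observed_ms):
            # running mean of the model's error is what corrects offline-to-online drift
            err = observed_ms - self.predict_ms(p, kv_bytes, bandwidth_bps)
            self.pulls[p.name] += 1
            self.residual[p.name] += (err - self.residual[p.name]) / self.pulls[p.name]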

If this is right

  • In PD-separated serving the system can achieve up to 9.13x reduction in job completion time by adapting KV transfers.
  • In KV-disaggregated serving the system can achieve up to 32.8x reduction in time-to-first-token by compressing the explicit KV payload.
  • The same controller can enforce different quality-latency trade-offs when SLO budgets vary across services.
  • Offline search cost drops 50x, making it practical to refresh the candidate set when models or networks change.
  • The framework integrates directly into existing engines such as vLLM and works across models, GPUs, and networks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same modular space and controller pattern could be applied to other large state objects that cross network boundaries, such as activation checkpoints in training.
  • In multi-tenant clusters the bandit could be extended to learn preferences across concurrent services rather than single-service adaptation.
  • Hardware accelerators could embed lightweight versions of the latency model to make profile selection even faster at the NIC or GPU level.
  • The 3D Pareto representation might be reused for other compression decisions, such as quantization or pruning, inside the same serving pipeline.

Load-bearing premise

The analytical latency model together with the bandit controller will pick compression profiles that match real performance in live deployments even when offline profiling differs from online conditions and without unacceptable quality loss under changing SLO budgets.

What would settle it

Run a production-like trace with shifting workloads and bandwidth on the same hardware, measure whether the controller-chosen profiles deliver the claimed JCT or TTFT gains and whether output quality stays inside the target SLO window, or whether mismatch forces either slowdown or quality violation.
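
One way to operationalize that test, sketched under the same assumed interfaces as the controller above: replay a recorded trace of requests with shifting bandwidth and SLO budgets, let the controller pick a profile per request, and compare total latency and SLO misses against a single fixed profile. The trace format and metrics here are placeholders, not the paper's evaluation harness.

    def replay(trace, controller, static_profile):
        # trace: iterable of (kv_bytes, bandwidth_bps, slo_ms, measure) tuples,
        # where measure(profile) returns the latency actually observed for that
        # profile on this request (from a testbed or a recorded measurement)
        adaptive_ms = static_ms = 0.0
        slo_misses = 0
        for kv_bytes, bandwidth_bps, slo_ms, measure in trace:
            chosen = controller.select(kv_bytes, bandwidth_bps)
            observed = measure(chosen)
            controller.update(chosen, kv_bytes, bandwidth_bps, observed)
            adaptive_ms += observed
            static_ms += measure(static_profile)   # fixed-profile baseline
            slo_misses += observed > slo_ms
        return adaptive_ms, static_ms, slo_misses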

Figures

Figures reproduced from arXiv: 2605.13734 by Bing Lu, Dejun Luo, Dingwen Tao, Guangming Tan, Hairui Zhao, Jinyang Liu, Wenjing Huang, Xingchen Liu, Xinyang Ma, Yida Gu, Zedong Liu, Zheng Wei.

Figure 1: Time breakdown under PD-separated serving.
Figure 2: Architecture of disaggregated serving system.
Figure 4: KV latency across effective bandwidths (left) and …
Figure 5: Left: Search space size under different granularities.
Figure 6: Overview Architecture of KVServe.
Figure 8: Profiling Efficiency and Ranking Consistency.
Figure 7: The Unified KV Cache Compression Pipeline.
Figure 10: The 3D Pareto Frontier of the Strategy Spaces.
Figure 9: Prediction and Pruning Process Visualization.
Figure 11: Candidate set generation and bandit-based residual …
Figure 12: End-to-End Performance across Hardware and Workloads. Top row evaluates JCT scalability across hardware tiers; …
Figure 14: TTFT in Prefix Caching.
Figure 15: Latency Breakdown across Inference Stages.
Figure 16: Offline and Online Ablation Studies.
Original abstract

LLMs are widely adopted in production, pushing inference systems to their limits. Disaggregated LLM serving (e.g., PD separation and KV state disaggregation) improves scalability and cost efficiency, but it also turns KV into an explicit payload crossing network and storage boundaries, making KV a dominant end-to-end bottleneck. Existing KV compression methods are typically static runtime configurations, even though the production service context varies over time in workload mix, bandwidth, and SLO/quality budgets. As a result, a fixed choice can be suboptimal or even increase latency. We present \emph{KVServe}, the first service-aware and adaptive KV communication compression framework for disaggregated LLM serving: KVServe (1) unifies KV compression into a modular strategy space with new components and cross-method recomposition; (2) introduces a Bayesian Profiling Engine that efficiently searches this space and distills a 3D Pareto candidate set, reducing offline search overhead by $50\times$; and (3) deploys a Service-Aware Online Controller that combines an analytical latency model with a lightweight bandit to select profiles under constraints and correct offline-to-online mismatch. Integrated into vLLM and evaluated across datasets, models, GPUs, and networks, KVServe achieves up to $9.13\times$ JCT speedup in PD-separated serving and up to $32.8\times$ TTFT reduction in KV-disaggregated serving.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces KVServe, a service-aware adaptive KV cache compression framework for disaggregated LLM serving (PD separation and KV disaggregation). It unifies compression into a modular strategy space, uses a Bayesian Profiling Engine to distill a 3D Pareto candidate set (reducing offline search by 50×), and deploys a Service-Aware Online Controller combining an analytical latency model with a lightweight bandit to select profiles while correcting offline-to-online mismatch. Integrated into vLLM, it reports up to 9.13× JCT speedup in PD-separated serving and 32.8× TTFT reduction in KV-disaggregated serving across datasets, models, GPUs, and networks.

Significance. If the end-to-end speedups hold under realistic workload shifts and SLO variation, the work would meaningfully advance communication-efficient disaggregated inference by replacing static KV compression with adaptive, service-context-aware selection. The modular strategy space and Bayesian profiling are practical contributions that could be reused beyond the specific controller.

major comments (2)
  1. [Abstract and §4, Service-Aware Online Controller] The central 9.13× JCT and 32.8× TTFT claims rest on the analytical latency model plus bandit reliably correcting offline-to-online drift, yet no quantitative bound on model prediction error, bandit regret, or sensitivity to bandwidth/SLO changes is reported; without such bounds the reported gains could be trace-specific rather than robust.
  2. [Evaluation section] The abstract states results across datasets, models, GPUs, and networks, but the manuscript provides insufficient detail on experimental controls, error bars, exact workload mixes, and how compression quality is measured under varying SLO budgets, leaving the performance claims only moderately supported.
minor comments (2)
  1. [§3] Clarify the exact definition of the 3D Pareto set (latency, quality, bandwidth) and how the bandit exploration budget is chosen in practice.
  2. [Figures 5-8] Figure captions and axis labels should explicitly state the network bandwidth ranges and SLO budgets used in each experiment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. The two major comments identify areas where additional analysis and exposition would strengthen the robustness and reproducibility of our claims. We address each point below and will revise the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract and §4, Service-Aware Online Controller] The central 9.13× JCT and 32.8× TTFT claims rest on the analytical latency model plus bandit reliably correcting offline-to-online drift, yet no quantitative bound on model prediction error, bandit regret, or sensitivity to bandwidth/SLO changes is reported; without such bounds the reported gains could be trace-specific rather than robust.

    Authors: We agree that quantitative characterization of the analytical model's prediction error and the bandit's regret, together with sensitivity analysis under bandwidth and SLO variation, would better substantiate that the reported speedups are robust rather than trace-specific. In the revised version we will add to §4: (i) measured L1 prediction error statistics across the evaluated bandwidth range, (ii) cumulative regret curves for the online bandit under both stationary and shifting workloads, and (iii) sensitivity plots showing JCT/TTFT variation when bandwidth and SLO budgets are perturbed by ±20 %. These additions will be supported by new experiments that reuse the same profiling engine and controller already described. revision: yes

  2. Referee: [Evaluation section] The abstract states results across datasets, models, GPUs, and networks, but the manuscript provides insufficient detail on experimental controls, error bars, exact workload mixes, and how compression quality is measured under varying SLO budgets, leaving the performance claims only moderately supported.

    Authors: We acknowledge that the current Evaluation section lacks sufficient methodological detail. In the revision we will expand it to report: (i) error bars computed from at least five independent runs with different random seeds, (ii) exact workload parameters (Poisson arrival rates, request-length distributions, and SLO/quality budgets for each experiment), (iii) the precise definition and measurement procedure for compression quality (perplexity delta and token-level accuracy) under each SLO budget, and (iv) a table enumerating the hardware/network configurations and the controls used to isolate the effect of the online controller. These clarifications will be placed in a new subsection on experimental methodology. revision: yes

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper presents an empirical systems framework (Bayesian profiling + analytical latency model + bandit controller) integrated into vLLM. No equations or claims reduce by construction to fitted inputs, self-citations, or renamed prior results. The 3D Pareto set and online selection are described as engineering components whose performance is measured externally rather than derived tautologically from their own definitions. Central speedups are reported from end-to-end experiments across datasets, models, and networks, not from self-referential math.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on domain assumptions about workload variability and the accuracy of the latency model; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption An analytical latency model can predict end-to-end performance sufficiently well to guide online decisions across varying bandwidth and SLO conditions.
    Invoked by the Service-Aware Online Controller to select profiles and correct mismatches.

pith-pipeline@v0.9.0 · 5589 in / 1226 out tokens · 65245 ms · 2026-05-14T17:39:05.257613+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 23 canonical work pages · 8 internal anchors
