CrossPool: Efficient Multi-LLM Serving for Cold MoE Models through KV-Cache and Weight Disaggregation
Pith reviewed 2026-06-29 05:27 UTC · model grok-4.3
The pith
Disaggregating weights and KV-cache into separate GPU pools cuts P99 TBT up to 10.4x for cold MoE models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CrossPool separates FFN weights into a consolidated pool and KV-cache into a dynamic pool. The weights pool holds stable model parameters across cold MoE instances while the KV-cache pool provisions only the current aggregate demand. A KV-cache planner and virtualizer manage allocation, a layer-wise pipeline scheduler overlaps hidden-state movement, and persistent kernels with control lowering cut CPU-GPU overhead, allowing attention to stay local to the KV pool.
What carries the argument
The two-pool disaggregation of weights and KV-cache, managed by a KV-cache planner, virtualizer, and layer-wise pipeline scheduler that hides transfers.
If this is right
- KV-cache capacity can be shared across models using total active demand instead of per-model peaks.
- Attention computation stays local to the KV-cache pool, exposing more replicated capacity even at low concurrency.
- Long-context requests become feasible without pre-reserving worst-case memory per model.
- GPU memory utilization rises for the bursty, sparse request patterns typical of cold models.
Where Pith is reading between the lines
- The same separation could help dense models when KV demand fluctuates across instances.
- Persistent kernels and lowered control overhead may reduce latency in other inference serving stacks.
- Measuring exact transfer volume versus memory savings at different concurrency levels would reveal the operating range where the design wins.
Load-bearing premise
The cost of moving hidden states between the two pools plus the scheduler overhead stays small enough not to erase the gains from higher memory utilization under low-concurrency loads.
What would settle it
Run a low-concurrency workload with many cold MoE models and measure whether P99 TBT and throughput improve over a monolithic baseline; if they do not, the disaggregation benefit is offset by transfer costs.
Figures
read the original abstract
Emerging LLM services increasingly host many sparse MoE models, yet most models receive sparse requests and remain cold. This creates a GPU memory problem: model weights are stable and model-determined, while KV-cache is transient and demand-determined. Because cold models rarely reach peak KV-cache demand at the same time, reserving worst-case KV capacity per model wastes memory; a shared KV-cache pool can instead provision aggregate active demand. However, KV-cache sharing is not sufficient when weights and KV-cache remain in a monolithic GPU memory pool. Static weights compete with dynamic KV-cache, and KV-head-limited attention under cold, low-concurrency traffic exposes only a fraction of replicated KV capacity, leading to low GPU memory utilization and weak long-context support. We present CrossPool, a serving engine for cold MoE models that separates FFN weights and KV-cache into two GPU memory pools: a weights pool that consolidates FFN weights across cold models, and a KV-cache pool that dynamically serves active requests while keeping attention local to KV-cache. CrossPool combines a KV-cache planner and virtualizer, a layer-wise pipeline scheduler that hides hidden-state transfers, and persistent kernels with control lowering to reduce CPU-GPU control overhead. With efficient GPU memory pooling, CrossPool underpins bursty long-context requests and outperforms the state-of-the-art kvcached-based multi-LLM serving system, reducing P99 TBT by up to 10.4x.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CrossPool, a multi-LLM serving engine for cold MoE models that disaggregates FFN weights into a consolidated weights pool and KV-cache into a separate dynamic pool. It introduces a KV-cache planner/virtualizer, a layer-wise pipeline scheduler to overlap hidden-state transfers, and persistent kernels with control lowering. The central empirical claim is that this design outperforms the state-of-the-art KV-cache-based multi-LLM serving system, reducing P99 TBT by up to 10.4x while better supporting bursty long-context requests under low-concurrency traffic.
Significance. If the performance results hold under representative workloads, the disaggregation approach could meaningfully improve GPU memory utilization for sparse MoE deployments where monolithic pooling wastes capacity on worst-case KV reservations. The engineering focus on hiding cross-pool transfers via scheduling is a practical contribution to systems for cold models.
major comments (1)
- [Evaluation] Evaluation section (and associated figures/tables reporting the 10.4x P99 TBT result): the central claim that disaggregation delivers net gains rests on the layer-wise pipeline scheduler successfully masking hidden-state transfer latency. The manuscript should include an explicit ablation or per-layer timing breakdown (e.g., transfer time vs. FFN/attention compute time) under the low-concurrency, sparse-request workloads that define the motivating case; without this, it is not possible to confirm that the reported speedup is not offset by cross-pool movement costs.
minor comments (1)
- [Abstract] Abstract: experimental setup details (workload traces, hardware configuration, baseline implementation, and overhead measurements) are absent, making it difficult for readers to assess the 10.4x claim without immediately turning to the full evaluation section.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our evaluation. We address the concern about substantiating the layer-wise pipeline scheduler's effectiveness below.
read point-by-point responses
-
Referee: [Evaluation] Evaluation section (and associated figures/tables reporting the 10.4x P99 TBT result): the central claim that disaggregation delivers net gains rests on the layer-wise pipeline scheduler successfully masking hidden-state transfer latency. The manuscript should include an explicit ablation or per-layer timing breakdown (e.g., transfer time vs. FFN/attention compute time) under the low-concurrency, sparse-request workloads that define the motivating case; without this, it is not possible to confirm that the reported speedup is not offset by cross-pool movement costs.
Authors: We agree that an explicit ablation study and per-layer timing breakdown would provide stronger evidence that the reported speedups are not offset by transfer costs. In the revised manuscript we will add a dedicated subsection (and associated figure) presenting per-layer measurements of hidden-state transfer time versus FFN/attention compute time, plus an ablation comparing the full system against a variant without the layer-wise scheduler, all evaluated under the low-concurrency, bursty long-context workloads that motivate the work. revision: yes
Circularity Check
No circularity; empirical systems paper with no self-referential derivations
full rationale
The paper describes a serving architecture (disaggregated weights/KV pools, KV planner, layer-wise pipeline scheduler, persistent kernels) and supports its performance claims (up to 10.4x P99 TBT reduction) via experimental comparisons against baselines. No equations, fitted parameters, or derivations appear that reduce to their own inputs by construction. No load-bearing self-citations or uniqueness theorems are invoked. The central claims rest on measured overheads and utilization gains under cold MoE workloads, which are externally falsifiable via replication on the described hardware.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Joshua Ainslie, James Lee-Thorp, Michiel de Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. 2023. GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Houda Bouamor, Juan Pino, ...
-
[2]
Alibaba Cloud. 2026. Alibaba Cloud Model Studio.https://modelstudio. alibabacloud.com. Accessed: 2026-05-08
2026
-
[3]
anon8231489123. 2023. ShareGPT Vicuna unfiltered. https://huggingface.co/datasets/anon8231489123/ShareGPT_ Vicuna_unfiltered. Accessed: 2026-05-08
2023
-
[4]
Anthropic. 2026. Claude Code.https://claude.com/product/claude- code. Accessed: 2026-05-08
2026
-
[5]
Yushi Bai, Xin Lv, Jiajie Zhang, Yuze He, Ji Qi, Lei Hou, Jie Tang, Yuxiao Dong, and Juanzi Li. 2024. LongAlign: A Recipe for Long Context Alignment of Large Language Models. InEMNLP (Findings) (Findings of ACL). Association for Computational Linguistics, 1376–1395
2024
-
[6]
ByteDance. 2026. Volcano Engine.https://www.volcengine.com. Ac- cessed: 2026-05-08
2026
-
[7]
Yihua Cheng, Yuhan Liu, Jiayi Yao, Yuwei An, Xiaokun Chen, Shaoting Feng, Yuyang Huang, Samuel Shen, Kuntai Du, and Junchen Jiang
-
[8]
LMCache: An Efficient KV Cache Layer for Enterprise-Scale LLM Inference.CoRRabs/2510.09665 (2025)
arXiv 2025
-
[9]
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Tianle Li, et al . 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* Chat- GPT Quality.https://lmsys.org/blog/2023-03-30-vicuna/. Accessed: 2026-05-08
2023
-
[10]
Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Y. Wu, Zhenda Xie, Y. K. Li, Panpan Huang, Fuli Luo, Chong Ruan, Zhifang Sui, and Wen- feng Liang. 2024. DeepSeekMoE: Towards Ultimate Expert Specializa- tion in Mixture-of-Experts Language Models. InACL (1). Association for Computational Lingui...
2024
-
[11]
Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré
-
[12]
InNeurIPS
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. InNeurIPS
-
[13]
DeepInfra. 2026. DeepInfra.https://deepinfra.com. Accessed: 2026- 05-08
2026
-
[14]
DeepSeek-AI. 2024. DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model.CoRRabs/2405.04434 (2024). arXiv:2405.04434 doi:10.48550/ARXIV.2405.04434
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2405.04434 2024
-
[15]
DeepSeek-AI. 2024. DeepSeek-V3 Technical Report.CoRR abs/2412.19437 (2024)
Pith/arXiv arXiv 2024
-
[16]
DeepSeek-AI. 2026. DeepSeek.https://chat.deepseek.com. Accessed: 2026-05-08
2026
-
[17]
DeepSeek-AI. 2026. DeepSeek-V4: Towards Highly Efficient Million- Token Context Intelligence.https://huggingface.co/deepseek-ai/ DeepSeek-V4-Pro/blob/main/DeepSeek_V4.pdf. Technical Report
2026
-
[18]
Hui Dong and Marvin K Nakayama. 2018. A tutorial on quantile estimation via Monte Carlo. InInternational Conference on Monte Carlo and Quasi-Monte Carlo Methods in Scientific Computing. Springer, 3– 30
2018
-
[19]
Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. 2024. MuxServe: Flexible Spatial-Temporal Multiplexing for Multiple LLM Serving. InICML (Proceedings of Machine Learning Research). PMLR / OpenReview.net, 11905–11917
2024
-
[20]
William Fedus, Barret Zoph, and Noam Shazeer. 2022. Switch Trans- formers: Scaling to Trillion Parameter Models with Simple and Effi- cient Sparsity.J. Mach. Learn. Res.23 (2022), 120:1–120:39.https: //jmlr.org/papers/v23/21-0998.html 8 Conference’17, July 2017, Washington, DC, USA
2022
-
[21]
Shiwei Gao, Qing Wang, Shaoxun Zeng, Youyou Lu, and Jiwu Shu
-
[22]
InUSENIX ATC
Weaver: Efficient Multi-LLM Serving with Attention Offloading. InUSENIX ATC. USENIX Association, 587–595
-
[23]
Google. 2026. Gemini.https://gemini.google.com. Accessed: 2026-05- 08
2026
-
[24]
StepFun Inc. 2025. Step-3 is Large yet Affordable: Model-system Co- design for Cost-effective Decoding.CoRRabs/2507.19427 (2025)
arXiv 2025
-
[25]
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de Las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guil- laume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Tev...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.04088 2024
-
[26]
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica
-
[27]
Efficient Memory Management for Large Language Model Serv- ing with PagedAttention. InSOSP. ACM, 611–626
-
[28]
Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. 2024. InfiniGen: Efficient Generative Inference of Large Language Models with Dynamic KV Cache Management. InOSDI. USENIX Association, 155–172
2024
-
[29]
Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. 2021. GShard: Scaling Giant Models with Conditional Computation and Automatic Sharding. In9th International Confer- ence on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. OpenReview.net.https://o...
2021
-
[30]
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Anantha- narayanan, Michael Maire, Henry Hoffmann, Ari Holtzman, and Junchen Jiang. 2024. CacheGen: KV Cache Compression and Streaming for Fast Large Language Model Serving. InSIGCOMM. ACM, 38–56
2024
-
[31]
NVIDIA. 2025. NVSHMEM: GPU Programming Interface for Scalable Communication.https://docs.nvidia.com/nvshmem. Accessed: 2026- 05-08
2025
-
[32]
NVIDIA. 2026. CUDA Virtual Memory Management (VMM).https:// docs.nvidia.com/cuda/cuda-driver-api/group__CUDA__VA.html. Ac- cessed: 2026-05-08
2026
-
[33]
OpenAI. 2026. ChatGPT.https://chatgpt.com. Accessed: 2026-05-08
2026
-
[34]
OpenAI. 2026. Codex.https://openai.com/codex. Accessed: 2026-05-08
2026
-
[35]
OpenClaw Contributors. 2026. OpenClaw.https://openclaw.ai. Ac- cessed: 2026-05-08
2026
-
[36]
OpenRouter. 2026. OpenRouter.https://openrouter.ai. Accessed: 2026-05-08
2026
-
[37]
Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. 2025. Mooncake: Trading More Storage for Less Computation - A KVCache-centric Architecture for Serving LLM Chatbot. InFAST. USENIX Association, 155–170
2025
-
[38]
Noam Shazeer. 2019. Fast Transformer Decoding: One Write-Head is All You Need.CoRRabs/1911.02150 (2019)
Pith/arXiv arXiv 2019
-
[39]
Qwen Team. 2025. Qwen3 Technical Report.CoRRabs/2505.09388 (2025)
Pith/arXiv arXiv 2025
-
[40]
Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. 2025. Aegaeon: Effective GPU Pooling for Concurrent LLM Serving on the Market. In SOSP. ACM, 1030–1045
2025
-
[41]
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. 2025. FlashInfer: Efficient and Cus- tomizable Attention Engine for LLM Inference Serving. InMLSys. OpenReview.net/mlsys.org
2025
-
[42]
Shan Yu, Yifan Qiao, Mingyuan Ma, Yangmin Li, Shuo Yang, Xinyuan Tong, Yang Wang, Zhiqiang Xie, Yuwei An, Shiyi Cao, Ke Bao, Deepak Vij, Xiaoning Ding, Yichen Wang, Qingda Lu, Zhong Wang, Gao Gao, Harry Xu, Junyi Shu, Jiarong Xing, and Ying Sheng. 2026. Chimera: Cost-Efficient Multi-LLM Serving via GPU Memory Ballooning. In USENIX OSDI. USENIX Association
2026
-
[43]
Gonzalez, Clark W
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark W. Barrett, and Ying Sheng. 2024. SGLang: Ef- ficient Execution of Structured Language Model Programs. InNeurIPS
2024
-
[44]
Zhipu AI and Tsinghua University. 2024. LongAlign-10k.https:// huggingface.co/datasets/zai-org/LongAlign-10k
2024
-
[45]
Ruidong Zhu, Ziheng Jiang, Chao Jin, Peng Wu, Cesar A. Stuardo, Dongyang Wang, Xinlei Zhang, Huaping Zhou, Haoran Wei, Yang Cheng, Jianzhe Xiao, Xinyi Zhang, Lingjun Liu, Haibin Lin, Li-Wen Chang, Jianxi Ye, Xiao Yu, Xuanzhe Liu, Xin Jin, and Xin Liu. 2025. MegaScale-Infer: Efficient Mixture-of-Experts Model Serving with Disaggregated Expert Parallelism. ...
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.