Recognition: unknown
Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter
Pith reviewed 2026-05-10 09:58 UTC · model grok-4.3
The pith
Prefill-as-a-Service lets hybrid-attention models run prefill and decode in separate datacenters by moving compact KVCache over ordinary Ethernet.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For next-generation models whose hybrid attention already reduces KVCache size, a PrfaaS architecture that offloads long-context prefill to remote compute-dense clusters, transfers the compact cache over commodity networks, and applies selective offloading plus bandwidth- and cache-aware placement removes the need for prefill and decode to share a single low-latency fabric. The resulting heterogeneous deployment delivers 54 percent higher serving throughput, 64 percent lower P90 TTFT, and roughly 15 percent throughput gain at equal cost compared with a conventional homogeneous PD baseline, all while consuming modest cross-datacenter bandwidth.
What carries the argument
Prefill-as-a-Service (PrfaaS) architecture, which pairs model-side KVCache reduction with selective offloading, bandwidth-aware scheduling, and cache-aware request placement to enable reliable KVCache movement across loosely coupled clusters.
If this is right
- Prefill and decode capacity can be scaled independently across different accelerator types and datacenters.
- Heterogeneous hardware no longer requires a shared high-bandwidth RDMA fabric.
- Long-context requests can be routed to remote prefill clusters without collapsing overall utilization.
- Total cost of ownership drops by roughly 15 percent while meeting the same latency targets.
- KVCache traffic stays modest enough that ordinary Ethernet links suffice.
Where Pith is reading between the lines
- The same selective-offload logic could be applied to geo-distributed serving where latency between regions is even higher.
- Dynamic rebalancing of prefill capacity based on measured prefix-cache hit rates might further reduce cross-site traffic.
- If hybrid attention continues to shrink KVCache, the same architecture could support prefill offload to entirely different cloud providers.
Load-bearing premise
That selective offloading combined with bandwidth-aware scheduling and cache-aware placement will prevent congestion, unstable queues, and wasted capacity when workloads are bursty, request lengths are skewed, prefix caches are uneven, and inter-cluster bandwidth fluctuates.
What would settle it
A controlled run on production-like traffic that shows sustained high queueing latency or under-utilization once inter-cluster bandwidth drops below the level assumed in the case study would falsify the claim that the mechanisms keep the system stable.
Figures
read the original abstract
Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still determined by KVCache transfer. In conventional dense-attention models, prefill generates huge KVCache traffics that keep prefill and decode tightly coupled within a single high-bandwidth network domain, limiting heterogeneous deployment and resource elasticity. Recent hybrid-attention architectures substantially reduce KVCache size, making cross-cluster KVCache transport increasingly plausible. However, smaller KVCache alone does not make heterogeneous cross-datacenter PD serving practical: real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A naive design that fully externalizes prefill can therefore still suffer from congestion, unstable queueing, and poor utilization. We present Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. Rather than treating reduced KVCache as sufficient, PrfaaS combines model-side KV efficiency with system-side selective offloading, bandwidth-aware scheduling, and cache-aware request placement. This design removes the requirement that heterogeneous accelerators share the same low-latency RDMA fabric, enabling independent scaling of prefill and decode capacity across loosely coupled clusters. In a case study using an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment achieves 54% higher serving throughput and 64% lower P90 TTFT than a homogeneous PD baseline, with approximately 15% throughput gain at equal cost, while consuming only modest cross-datacenter bandwidth.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Prefill-as-a-Service (PrfaaS), a cross-datacenter LLM serving architecture that exploits hybrid-attention models to shrink KVCache sizes, allowing selective offloading of long-context prefill to remote compute-dense clusters with KVCache transfer over commodity Ethernet. It augments this with bandwidth-aware scheduling and cache-aware request placement to mitigate bursty traffic, skewed request lengths, uneven prefix caches, and fluctuating inter-cluster bandwidth. In a case study with an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment is reported to deliver 54% higher serving throughput and 64% lower P90 TTFT than a homogeneous PD baseline, plus ~15% throughput gain at equal cost, while using only modest cross-datacenter bandwidth.
Significance. If the empirical results can be substantiated with detailed methodology, the work would be significant for enabling elastic, heterogeneous scaling of prefill and decode resources across loosely coupled datacenters without high-bandwidth RDMA fabrics. This is particularly relevant for next-generation hybrid-attention models and could improve cost-efficiency and resource utilization in large-scale serving systems.
major comments (1)
- [Case study] The case study reports headline performance numbers (54% throughput, 64% P90 TTFT reduction, 15% equal-cost gain) but provides no description of the workload traces, request arrival process, length distribution, prefix cache hit rates, baseline configurations, measurement methodology, or the precise bandwidth fluctuation model (mean, variance, correlation time). This is load-bearing for the central claim because the abstract itself flags bursty traffic, skewed lengths, uneven caches, and fluctuating bandwidth as conditions that would cause congestion and poor utilization in a naive design; without these details or ablations isolating the scheduling components, the robustness of the reported gains cannot be assessed.
minor comments (1)
- The abstract and title use informal phrasing (e.g., 'Could Go Cross-Datacenter'); a more precise title and abstract would better suit journal standards.
Simulated Author's Rebuttal
We appreciate the referee's thorough review and positive assessment of the potential impact of our work on cross-datacenter serving for hybrid-attention models. We have carefully considered the major comment regarding the case study and have revised the manuscript to incorporate additional methodological details as requested.
read point-by-point responses
-
Referee: The case study reports headline performance numbers (54% throughput, 64% P90 TTFT reduction, 15% equal-cost gain) but provides no description of the workload traces, request arrival process, length distribution, prefix cache hit rates, baseline configurations, measurement methodology, or the precise bandwidth fluctuation model (mean, variance, correlation time). This is load-bearing for the central claim because the abstract itself flags bursty traffic, skewed lengths, uneven caches, and fluctuating bandwidth as conditions that would cause congestion and poor utilization in a naive design; without these details or ablations isolating the scheduling components, the robustness of the reported gains cannot be assessed.
Authors: We agree with the referee that these details are critical for evaluating the robustness of PrfaaS under the challenging conditions described. The original manuscript included a high-level overview of the case study but did not provide the full level of detail needed. In the revised manuscript, we have added an expanded 'Evaluation Methodology' subsection that describes: the workload traces derived from anonymized production logs exhibiting bursty patterns; the request arrival process modeled as a Poisson process with time-varying rates to simulate bursts; the request length distribution following a heavy-tailed distribution with parameters matching observed data; average prefix cache hit rates of approximately 35% with variations; the homogeneous PD baseline configuration using identical accelerator types for prefill and decode; the measurement methodology involving both simulation and hardware validation for throughput (tokens/s) and P90 TTFT; and the bandwidth fluctuation model as a stochastic process with specified mean, variance, and correlation time. Furthermore, we have included new ablation studies that isolate the effects of bandwidth-aware scheduling and cache-aware request placement, showing how they contribute to the reported gains by mitigating congestion and improving utilization. We believe these additions fully address the concern and allow independent assessment of the results. revision: yes
Circularity Check
No circularity: empirical system results with no derivation chain
full rationale
The paper presents Prefill-as-a-Service as a system architecture combining selective offloading, bandwidth-aware scheduling, and cache-aware placement, then reports empirical throughput and latency numbers from a case study on an internal 1T hybrid model. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The performance claims (54% higher throughput, 64% lower P90 TTFT) are stated as measured outcomes of the implemented heterogeneous deployment rather than quantities obtained by algebraic reduction or self-referential definition. The evaluation is therefore self-contained against external benchmarks and receives the default non-circularity finding.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Hybrid-attention architectures substantially reduce KVCache size relative to dense-attention models
- domain assumption Real workloads are bursty, with highly skewed request lengths, unevenly distributed prefix caches, and fluctuating inter-cluster bandwidth
Forward citations
Cited by 6 Pith papers
-
KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
-
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
MASPrism attributes failures in LLM multi-agent executions by extracting token-level negative log-likelihood and attention weights from a small model's prefill pass, then ranking candidates with a second prefill, achi...
-
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5...
-
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
SplitZip is a new GPU-friendly lossless compressor for KV cache tensors that exploits exponent redundancy to achieve over 600 GB/s compression throughput and up to 1.32x faster transfers in disaggregated LLM serving.
-
PreFT: Prefill-only finetuning for efficient inference
Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.
-
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
SplitZip delivers a GPU-friendly lossless KV-cache compressor using an offline top-16 exponent codebook plus escape stream, achieving 613 GB/s compression and 2182 GB/s decompression throughput with up to 1.32x end-to...
Reference graph
Works this paper leans on
-
[1]
Think fast: A tensor streaming processor (tsp) for accelerating deep learning workloads
Dennis Abts, Jonathan Ross, Jonathan Sparling, Mark Wong-VanHaren, Max Baker, Tom Hawkins, Andrew Bell, John Thompson, Temesghen Kahsai, Garrin Kimmell, et al. Think fast: A tensor streaming processor (tsp) for accelerating deep learning workloads. In2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 145–158. IEEE, 2020
2020
-
[2]
gpt-oss-120b & gpt-oss-20b Model Card
Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025
work page internal anchor Pith review arXiv 2025
-
[3]
Ring-2.5-1t.https://github.com/inclusionAI/Ring-V2.5, 2026
Inclusion AI. Ring-2.5-1t.https://github.com/inclusionAI/Ring-V2.5, 2026
2026
-
[4]
Gqa: Training generalized multi-query transformer models from multi-head checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023
2023
-
[5]
Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020
work page internal anchor Pith review arXiv 2004
-
[6]
Generating Long Sequences with Sparse Transformers
Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers.arXiv preprint arXiv:1904.10509, 2019
work page internal anchor Pith review arXiv 1904
-
[7]
Sglang.https://github.com/sgl-project/sglang, 2026
LMSYS Corp. Sglang.https://github.com/sgl-project/sglang, 2026
2026
-
[8]
What is a language processing unit? https://groq.com/blog/the-groq-lpu-exp lained, 2025
Groq. What is a language processing unit? https://groq.com/blog/the-groq-lpu-exp lained, 2025
2025
-
[9]
Xuan He, Zequan Fang, Jinzhao Lian, Danny HK Tsang, Baosen Zhang, and Yize Chen. Freesh: Fair, resource-and energy-efficient scheduling for llm serving on heterogeneous gpus.arXiv preprint arXiv:2511.00807, 2025. 14
-
[10]
Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024
2024
-
[11]
Step 3.5 flash: Open frontier-level intelligence with 11b active parameters
Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao, Bo Dong, Bojun Wang, Boyu Chen, Brian Li, Buyun Ma, et al. Step 3.5 flash: Open frontier-level intelligence with 11b active parameters.arXiv preprint arXiv:2602.10604, 2026
-
[12]
DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model
Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024
work page internal anchor Pith review arXiv 2024
-
[13]
Cachegen: Kv cache compression and streaming for fast large language model serving
Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. Cachegen: Kv cache compression and streaming for fast large language model serving. InProceedings of the ACM SIGCOMM 2024 Conference, pages 38–56, 2024
2024
-
[14]
Kivi: A tuning-free asymmetric 2bit quantization for kv cache
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. In International Conference on Machine Learning, pages 32332–32344. PMLR, 2024
2024
-
[15]
Helix: Serving large language models over heterogeneous gpus and network via max-flow
Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. Helix: Serving large language models over heterogeneous gpus and network via max-flow. In Proceedings of the 30th ACM International Conference on Architectural Support for Program- ming Languages and Operating Systems, V olume 1, pages 586–602, 2025
2025
-
[16]
Minimax m2.5: Built for real-world productivity
Minimax. Minimax m2.5: Built for real-world productivity. https://www.minimax.io/new s/minimax-m25, 2026
2026
-
[17]
Hetis: Serving llms in heterogeneous gpu clusters with fine-grained and dynamic parallelism
Zizhao Mo, Jianxiong Liao, Huanle Xu, Zhi Zhou, and Chengzhong Xu. Hetis: Serving llms in heterogeneous gpu clusters with fine-grained and dynamic parallelism. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1710–1724, 2025
2025
-
[18]
Nemotron 3 Nano: Open, efficient mixture-of-experts hybrid Mamba-Transformer model for Agentic reasoning, 2025
NVIDIA. Nemotron 3 Nano: Open, efficient mixture-of-experts hybrid Mamba-Transformer model for Agentic reasoning, 2025. Technical report
2025
-
[19]
Nvidia rubin cpx accelerates inference performance and efficiency for 1m+ token context workloads
NVIDIA. Nvidia rubin cpx accelerates inference performance and efficiency for 1m+ token context workloads. https://developer.nvidia.com/blog/nvidia-rubin-cpx-accel erates-inference-performance-and-efficiency-for-1m-token-context-workl oads/, 2025
2025
-
[20]
Dynamo.https://github.com/ai-dynamo/dynamo, 2026
NVIDIA. Dynamo.https://github.com/ai-dynamo/dynamo, 2026
2026
-
[21]
Splitwise: Efficient generative llm inference using phase splitting
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024
2024
-
[22]
Mooncake: A kvcache-centric disaggregated architecture for llm serving.ACM Transactions on Storage, 2024
Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. Mooncake: A kvcache-centric disaggregated architecture for llm serving.ACM Transactions on Storage, 2024
2024
-
[23]
Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models, 2024
Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models, 2024
2024
-
[24]
Dynamollm: Designing llm inference clusters for performance and energy efficiency
Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. Dynamollm: Designing llm inference clusters for performance and energy efficiency. In2025 IEEE Inter- national Symposium on High Performance Computer Architecture (HPCA), pages 1348–1362. IEEE, 2025
2025
-
[25]
Taalas hc1.https://taalas.com/products, 2025
Taalas. Taalas hc1.https://taalas.com/products, 2025. 15
2025
-
[26]
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture.arXiv preprint arXiv:2510.26692, 2025
work page internal anchor Pith review arXiv 2025
-
[27]
Qwen3.5: Towards native multimodal agents
Qwen Team. Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog?id=qw en3.5, 2026
2026
-
[28]
vllm.https://github.com/vllm-project/vllm, 2026
vLLM Team. vllm.https://github.com/vllm-project/vllm, 2026
2026
-
[29]
Hybrid models as first-class citizens in vLLM
vLLM Team at IBM. Hybrid models as first-class citizens in vLLM. https://pytorch.or g/blog/hybrid-models-as-first-class-citizens-in-vllm/ , 2025. PyTorch Blog, November 2025
2025
-
[30]
Jiahao Wang, Weiyu Xie, Mingxing Zhang, Boxing Zhang, Jianwei Dong, Yuening Zhu, Chen Lin, Jinqi Tang, Yaochen Han, Zhiyuan Ai, et al. From prefix cache to fusion rag cache: Accelerating llm inference in retrieval-augmented generation.arXiv preprint arXiv:2601.12904, 2026
-
[31]
MiMo-V2-Flash Technical Report
Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026
work page internal anchor Pith review arXiv 2026
-
[32]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[33]
Gated delta networks: Improving mamba2 with delta rule
Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. InInternational Conference on Learning Representations (ICLR), 2025
2025
-
[34]
Cacheblend: Fast large language model serving for rag with cached knowledge fusion
Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. Cacheblend: Fast large language model serving for rag with cached knowledge fusion. InProceedings of the twentieth European conference on computer systems, pages 94–109, 2025
2025
-
[35]
H2o: Heavy-hitter oracle for efficient generative inference of large language models.Advances in Neural Information Processing Systems, 36:34661–34710, 2023
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models.Advances in Neural Information Processing Systems, 36:34661–34710, 2023
2023
-
[36]
Llm-pq: Serving llm on heterogeneous clusters with phase-aware partition and adaptive quantization
Juntao Zhao, Borui Wan, Chuan Wu, Yanghua Peng, and Haibin Lin. Llm-pq: Serving llm on heterogeneous clusters with phase-aware partition and adaptive quantization. InProceed- ings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 460–462, 2024
2024
-
[37]
DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024. 16
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.