arxiv: 2604.15039 · v2 · submitted 2026-04-16 · 💻 cs.DC

Recognition: unknown

Prefill-as-a-Service: KVCache of Next-Generation Models Could Go Cross-Datacenter

Ruoyu Qin , Weiran He , Yaoyu Wang , Zheming Li , Xinran Xu , Yongwei Wu , Weimin Zheng , Mingxing Zhang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 09:58 UTC · model grok-4.3

classification 💻 cs.DC

keywords prefill-decode disaggregationKVCache transfercross-datacenter servinghybrid attentionLLM inferenceheterogeneous deploymentbandwidth-aware schedulingcache-aware placement

0 comments

The pith

Prefill-as-a-Service lets hybrid-attention models run prefill and decode in separate datacenters by moving compact KVCache over ordinary Ethernet.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that recent hybrid-attention models shrink KVCache enough to make cross-datacenter prefill-decode disaggregation workable. PrfaaS selectively ships only long-context prefill work to standalone compute clusters and returns the resulting cache to local decode clusters. System mechanisms for bandwidth-aware scheduling and cache-aware placement keep the approach stable under realistic bursty traffic and variable network conditions. A production-scale case study reports 54 percent higher throughput and 64 percent lower tail time-to-first-token than a same-cost homogeneous baseline while using only modest cross-site bandwidth.

Core claim

For next-generation models whose hybrid attention already reduces KVCache size, a PrfaaS architecture that offloads long-context prefill to remote compute-dense clusters, transfers the compact cache over commodity networks, and applies selective offloading plus bandwidth- and cache-aware placement removes the need for prefill and decode to share a single low-latency fabric. The resulting heterogeneous deployment delivers 54 percent higher serving throughput, 64 percent lower P90 TTFT, and roughly 15 percent throughput gain at equal cost compared with a conventional homogeneous PD baseline, all while consuming modest cross-datacenter bandwidth.

What carries the argument

Prefill-as-a-Service (PrfaaS) architecture, which pairs model-side KVCache reduction with selective offloading, bandwidth-aware scheduling, and cache-aware request placement to enable reliable KVCache movement across loosely coupled clusters.

If this is right

Prefill and decode capacity can be scaled independently across different accelerator types and datacenters.
Heterogeneous hardware no longer requires a shared high-bandwidth RDMA fabric.
Long-context requests can be routed to remote prefill clusters without collapsing overall utilization.
Total cost of ownership drops by roughly 15 percent while meeting the same latency targets.
KVCache traffic stays modest enough that ordinary Ethernet links suffice.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selective-offload logic could be applied to geo-distributed serving where latency between regions is even higher.
Dynamic rebalancing of prefill capacity based on measured prefix-cache hit rates might further reduce cross-site traffic.
If hybrid attention continues to shrink KVCache, the same architecture could support prefill offload to entirely different cloud providers.

Load-bearing premise

That selective offloading combined with bandwidth-aware scheduling and cache-aware placement will prevent congestion, unstable queues, and wasted capacity when workloads are bursty, request lengths are skewed, prefix caches are uneven, and inter-cluster bandwidth fluctuates.

What would settle it

A controlled run on production-like traffic that shows sustained high queueing latency or under-utilization once inter-cluster bandwidth drops below the level assumed in the case study would falsify the claim that the mechanisms keep the system stable.

Figures

Figures reproduced from arXiv: 2604.15039 by Mingxing Zhang, Ruoyu Qin, Weimin Zheng, Weiran He, Xinran Xu, Yaoyu Wang, Yongwei Wu, Zheming Li.

**Figure 2.** Figure 2: KV throughput of MiniMax-M2.5 on an 8×H200 instance at various input lengths. Mechanism Prefill Latency KV Throughput GQA High High MLA High Low Sparse Attention Low High SWA Low Low Linear Attention Low Low [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Deployment topology of the PrfaaS-PD architecture. [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗

**Figure 4.** Figure 4: Hybrid prefix cache pool. Linear states and full-attention KVCache are managed by [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Illustration of the grid search process for the two optimization variables. (a) fixes [PITH_FULL_IMAGE:figures/full_fig_p011_5.png] view at source ↗

read the original abstract

Prefill-decode (PD) disaggregation has become the standard architecture for large-scale LLM serving, but in practice its deployment boundary is still determined by KVCache transfer. In conventional dense-attention models, prefill generates huge KVCache traffics that keep prefill and decode tightly coupled within a single high-bandwidth network domain, limiting heterogeneous deployment and resource elasticity. Recent hybrid-attention architectures substantially reduce KVCache size, making cross-cluster KVCache transport increasingly plausible. However, smaller KVCache alone does not make heterogeneous cross-datacenter PD serving practical: real workloads remain bursty, request lengths are highly skewed, prefix caches are unevenly distributed, and inter-cluster bandwidth fluctuates. A naive design that fully externalizes prefill can therefore still suffer from congestion, unstable queueing, and poor utilization. We present Prefill-as-a-Service (PrfaaS), a cross-datacenter serving architecture that selectively offloads long-context prefill to standalone, compute-dense prefill clusters and transfers the resulting KVCache over commodity Ethernet to local PD clusters for decode. Rather than treating reduced KVCache as sufficient, PrfaaS combines model-side KV efficiency with system-side selective offloading, bandwidth-aware scheduling, and cache-aware request placement. This design removes the requirement that heterogeneous accelerators share the same low-latency RDMA fabric, enabling independent scaling of prefill and decode capacity across loosely coupled clusters. In a case study using an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment achieves 54% higher serving throughput and 64% lower P90 TTFT than a homogeneous PD baseline, with approximately 15% throughput gain at equal cost, while consuming only modest cross-datacenter bandwidth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PrfaaS shows how hybrid attention's smaller KVCache plus selective offloading and scheduling can plausibly split prefill and decode across datacenters, but the 54% throughput claim rests on an internal case study whose workload and bandwidth details are not visible enough to judge robustness.

read the letter

The main thing here is that this paper takes the smaller KVCache from hybrid-attention models and adds selective offloading, bandwidth-aware scheduling, and cache-aware placement to make cross-datacenter prefill-decode separation practical instead of just theoretically possible. That combination is the actual new piece beyond earlier single-domain PD disaggregation work. It directly tackles the bursty traffic, skewed lengths, uneven prefixes, and fluctuating bandwidth that the abstract itself calls out as remaining problems even after KVCache shrinks. The case study on their internal 1T model reports concrete gains over a homogeneous baseline, including higher throughput, lower P90 TTFT, and some cost efficiency at modest cross-datacenter bandwidth cost. That is useful evidence that the idea can be implemented and can move the needle in at least one setting. The soft spot is the evaluation. The abstract gives headline numbers but does not describe the request arrival process, length distribution, prefix hit rates, or the bandwidth trace used, nor any ablation that isolates the scheduling components. If the tests ran under stable high-bandwidth or non-bursty conditions, the reported improvements do not yet demonstrate that the system handles the exact failure modes it claims to solve. The stress-test concern about congestion avoidance therefore lands until those details appear. This is for systems people working on large-scale LLM inference who are thinking about heterogeneous or multi-cluster deployments. It is coherent on its own terms and engages the right prior work, so it deserves a serious referee even though the experiments will need more transparency and controls.

Referee Report

1 major / 1 minor

Summary. The paper proposes Prefill-as-a-Service (PrfaaS), a cross-datacenter LLM serving architecture that exploits hybrid-attention models to shrink KVCache sizes, allowing selective offloading of long-context prefill to remote compute-dense clusters with KVCache transfer over commodity Ethernet. It augments this with bandwidth-aware scheduling and cache-aware request placement to mitigate bursty traffic, skewed request lengths, uneven prefix caches, and fluctuating inter-cluster bandwidth. In a case study with an internal 1T-parameter hybrid model, a PrfaaS-augmented heterogeneous deployment is reported to deliver 54% higher serving throughput and 64% lower P90 TTFT than a homogeneous PD baseline, plus ~15% throughput gain at equal cost, while using only modest cross-datacenter bandwidth.

Significance. If the empirical results can be substantiated with detailed methodology, the work would be significant for enabling elastic, heterogeneous scaling of prefill and decode resources across loosely coupled datacenters without high-bandwidth RDMA fabrics. This is particularly relevant for next-generation hybrid-attention models and could improve cost-efficiency and resource utilization in large-scale serving systems.

major comments (1)

[Case study] The case study reports headline performance numbers (54% throughput, 64% P90 TTFT reduction, 15% equal-cost gain) but provides no description of the workload traces, request arrival process, length distribution, prefix cache hit rates, baseline configurations, measurement methodology, or the precise bandwidth fluctuation model (mean, variance, correlation time). This is load-bearing for the central claim because the abstract itself flags bursty traffic, skewed lengths, uneven caches, and fluctuating bandwidth as conditions that would cause congestion and poor utilization in a naive design; without these details or ablations isolating the scheduling components, the robustness of the reported gains cannot be assessed.

minor comments (1)

The abstract and title use informal phrasing (e.g., 'Could Go Cross-Datacenter'); a more precise title and abstract would better suit journal standards.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We appreciate the referee's thorough review and positive assessment of the potential impact of our work on cross-datacenter serving for hybrid-attention models. We have carefully considered the major comment regarding the case study and have revised the manuscript to incorporate additional methodological details as requested.

read point-by-point responses

Referee: The case study reports headline performance numbers (54% throughput, 64% P90 TTFT reduction, 15% equal-cost gain) but provides no description of the workload traces, request arrival process, length distribution, prefix cache hit rates, baseline configurations, measurement methodology, or the precise bandwidth fluctuation model (mean, variance, correlation time). This is load-bearing for the central claim because the abstract itself flags bursty traffic, skewed lengths, uneven caches, and fluctuating bandwidth as conditions that would cause congestion and poor utilization in a naive design; without these details or ablations isolating the scheduling components, the robustness of the reported gains cannot be assessed.

Authors: We agree with the referee that these details are critical for evaluating the robustness of PrfaaS under the challenging conditions described. The original manuscript included a high-level overview of the case study but did not provide the full level of detail needed. In the revised manuscript, we have added an expanded 'Evaluation Methodology' subsection that describes: the workload traces derived from anonymized production logs exhibiting bursty patterns; the request arrival process modeled as a Poisson process with time-varying rates to simulate bursts; the request length distribution following a heavy-tailed distribution with parameters matching observed data; average prefix cache hit rates of approximately 35% with variations; the homogeneous PD baseline configuration using identical accelerator types for prefill and decode; the measurement methodology involving both simulation and hardware validation for throughput (tokens/s) and P90 TTFT; and the bandwidth fluctuation model as a stochastic process with specified mean, variance, and correlation time. Furthermore, we have included new ablation studies that isolate the effects of bandwidth-aware scheduling and cache-aware request placement, showing how they contribute to the reported gains by mitigating congestion and improving utilization. We believe these additions fully address the concern and allow independent assessment of the results. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical system results with no derivation chain

full rationale

The paper presents Prefill-as-a-Service as a system architecture combining selective offloading, bandwidth-aware scheduling, and cache-aware placement, then reports empirical throughput and latency numbers from a case study on an internal 1T hybrid model. No equations, fitted parameters, predictions derived from inputs, or load-bearing self-citations appear in the provided text. The performance claims (54% higher throughput, 64% lower P90 TTFT) are stated as measured outcomes of the implemented heterogeneous deployment rather than quantities obtained by algebraic reduction or self-referential definition. The evaluation is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about model properties and workload behavior that are stated but not derived or measured in the provided abstract.

axioms (2)

domain assumption Hybrid-attention architectures substantially reduce KVCache size relative to dense-attention models
Presented as the key enabler that makes cross-cluster KVCache transport plausible.
domain assumption Real workloads are bursty, with highly skewed request lengths, unevenly distributed prefix caches, and fluctuating inter-cluster bandwidth
Cited as the reasons a naive full-externalization design would suffer congestion and poor utilization.

pith-pipeline@v0.9.0 · 5642 in / 1578 out tokens · 71649 ms · 2026-05-10T09:58:20.266212+00:00 · methodology

discussion (0)

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
cs.DC 2026-05 conditional novelty 7.0

KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
cs.SE 2026-05 unverdicted novelty 7.0

MASPrism attributes failures in LLM multi-agent executions by extracting token-level negative log-likelihood and attention weights from a small model's prefill pass, then ranking candidates with a second prefill, achi...
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
cs.SE 2026-05 unverdicted novelty 7.0

MASPrism attributes failures in multi-agent systems by ranking candidates from prefill-stage NLL and attention signals of a 0.6B SLM, beating baselines by up to 33.41% Top-1 accuracy and proprietary LLMs by up to 89.5...
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
cs.DC 2026-05 unverdicted novelty 7.0

SplitZip is a new GPU-friendly lossless compressor for KV cache tensors that exploits exponent redundancy to achieve over 600 GB/s compression throughput and up to 1.32x faster transfers in disaggregated LLM serving.
PreFT: Prefill-only finetuning for efficient inference
cs.LG 2026-05 accept novelty 6.0

Prefill-only adaptation of LLMs yields 1.9x higher throughput for 512 adapters on Llama 3.1 70B with near-parity performance on RL tasks and recoverable loss on SFT.
SplitZip: Ultra Fast Lossless KV Compression for Disaggregated LLM Serving
cs.DC 2026-05 unverdicted novelty 6.0

SplitZip delivers a GPU-friendly lossless KV-cache compressor using an offline top-16 exponent codebook plus escape stream, achieving 613 GB/s compression and 2182 GB/s decompression throughput with up to 1.32x end-to...

Reference graph

Works this paper leans on

37 extracted references · 10 canonical work pages · cited by 4 Pith papers · 7 internal anchors

[1]

Think fast: A tensor streaming processor (tsp) for accelerating deep learning workloads

Dennis Abts, Jonathan Ross, Jonathan Sparling, Mark Wong-VanHaren, Max Baker, Tom Hawkins, Andrew Bell, John Thompson, Temesghen Kahsai, Garrin Kimmell, et al. Think fast: A tensor streaming processor (tsp) for accelerating deep learning workloads. In2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), pages 145–158. IEEE, 2020

2020
[2]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

work page internal anchor Pith review arXiv 2025
[3]

Ring-2.5-1t.https://github.com/inclusionAI/Ring-V2.5, 2026

Inclusion AI. Ring-2.5-1t.https://github.com/inclusionAI/Ring-V2.5, 2026

2026
[4]

Gqa: Training generalized multi-query transformer models from multi-head checkpoints

Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. Gqa: Training generalized multi-query transformer models from multi-head checkpoints. InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023

2023
[5]

Longformer: The Long-Document Transformer

Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv preprint arXiv:2004.05150, 2020

work page internal anchor Pith review arXiv 2004
[6]

Generating Long Sequences with Sparse Transformers

Rewon Child, Scott Gray, Alec Radford, and Ilya Sutskever. Generating long sequences with sparse transformers.arXiv preprint arXiv:1904.10509, 2019

work page internal anchor Pith review arXiv 1904
[7]

Sglang.https://github.com/sgl-project/sglang, 2026

LMSYS Corp. Sglang.https://github.com/sgl-project/sglang, 2026

2026
[8]

What is a language processing unit? https://groq.com/blog/the-groq-lpu-exp lained, 2025

Groq. What is a language processing unit? https://groq.com/blog/the-groq-lpu-exp lained, 2025

2025
[9]

Freesh: Fair, resource-and energy-efficient scheduling for llm serving on heterogeneous gpus.arXiv preprint arXiv:2511.00807, 2025

Xuan He, Zequan Fang, Jinzhao Lian, Danny HK Tsang, Baosen Zhang, and Yize Chen. Freesh: Fair, resource-and energy-efficient scheduling for llm serving on heterogeneous gpus.arXiv preprint arXiv:2511.00807, 2025. 14

work page arXiv 2025
[10]

Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

2024
[11]

Step 3.5 flash: Open frontier-level intelligence with 11b active parameters

Ailin Huang, Ang Li, Aobo Kong, Bin Wang, Binxing Jiao, Bo Dong, Bojun Wang, Boyu Chen, Brian Li, Buyun Ma, et al. Step 3.5 flash: Open frontier-level intelligence with 11b active parameters.arXiv preprint arXiv:2602.10604, 2026

work page arXiv 2026
[12]

DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model

Aixin Liu, Bei Feng, Bin Wang, Bingxuan Wang, Bo Liu, Chenggang Zhao, Chengqi Dengr, Chong Ruan, Damai Dai, Daya Guo, et al. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model.arXiv preprint arXiv:2405.04434, 2024

work page internal anchor Pith review arXiv 2024
[13]

Cachegen: Kv cache compression and streaming for fast large language model serving

Yuhan Liu, Hanchen Li, Yihua Cheng, Siddhant Ray, Yuyang Huang, Qizheng Zhang, Kuntai Du, Jiayi Yao, Shan Lu, Ganesh Ananthanarayanan, et al. Cachegen: Kv cache compression and streaming for fast large language model serving. InProceedings of the ACM SIGCOMM 2024 Conference, pages 38–56, 2024

2024
[14]

Kivi: A tuning-free asymmetric 2bit quantization for kv cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. In International Conference on Machine Learning, pages 32332–32344. PMLR, 2024

2024
[15]

Helix: Serving large language models over heterogeneous gpus and network via max-flow

Yixuan Mei, Yonghao Zhuang, Xupeng Miao, Juncheng Yang, Zhihao Jia, and Rashmi Vinayak. Helix: Serving large language models over heterogeneous gpus and network via max-flow. In Proceedings of the 30th ACM International Conference on Architectural Support for Program- ming Languages and Operating Systems, V olume 1, pages 586–602, 2025

2025
[16]

Minimax m2.5: Built for real-world productivity

Minimax. Minimax m2.5: Built for real-world productivity. https://www.minimax.io/new s/minimax-m25, 2026

2026
[17]

Hetis: Serving llms in heterogeneous gpu clusters with fine-grained and dynamic parallelism

Zizhao Mo, Jianxiong Liao, Huanle Xu, Zhi Zhou, and Chengzhong Xu. Hetis: Serving llms in heterogeneous gpu clusters with fine-grained and dynamic parallelism. InProceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1710–1724, 2025

2025
[18]

Nemotron 3 Nano: Open, efficient mixture-of-experts hybrid Mamba-Transformer model for Agentic reasoning, 2025

NVIDIA. Nemotron 3 Nano: Open, efficient mixture-of-experts hybrid Mamba-Transformer model for Agentic reasoning, 2025. Technical report

2025
[19]

Nvidia rubin cpx accelerates inference performance and efficiency for 1m+ token context workloads

NVIDIA. Nvidia rubin cpx accelerates inference performance and efficiency for 1m+ token context workloads. https://developer.nvidia.com/blog/nvidia-rubin-cpx-accel erates-inference-performance-and-efficiency-for-1m-token-context-workl oads/, 2025

2025
[20]

Dynamo.https://github.com/ai-dynamo/dynamo, 2026

NVIDIA. Dynamo.https://github.com/ai-dynamo/dynamo, 2026

2026
[21]

Splitwise: Efficient generative llm inference using phase splitting

Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In2024 ACM/IEEE 51st Annual International Symposium on Computer Architecture (ISCA), pages 118–132. IEEE, 2024

2024
[22]

Mooncake: A kvcache-centric disaggregated architecture for llm serving.ACM Transactions on Storage, 2024

Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, et al. Mooncake: A kvcache-centric disaggregated architecture for llm serving.ACM Transactions on Storage, 2024

2024
[23]

Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models, 2024

Zhen Qin, Weigao Sun, Dong Li, Xuyang Shen, Weixuan Sun, and Yiran Zhong. Lightning attention-2: A free lunch for handling unlimited sequence lengths in large language models, 2024

2024
[24]

Dynamollm: Designing llm inference clusters for performance and energy efficiency

Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Josep Torrellas, and Esha Choukse. Dynamollm: Designing llm inference clusters for performance and energy efficiency. In2025 IEEE Inter- national Symposium on High Performance Computer Architecture (HPCA), pages 1348–1362. IEEE, 2025

2025
[25]

Taalas hc1.https://taalas.com/products, 2025

Taalas. Taalas hc1.https://taalas.com/products, 2025. 15

2025
[26]

Kimi Linear: An Expressive, Efficient Attention Architecture

Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture.arXiv preprint arXiv:2510.26692, 2025

work page internal anchor Pith review arXiv 2025
[27]

Qwen3.5: Towards native multimodal agents

Qwen Team. Qwen3.5: Towards native multimodal agents. https://qwen.ai/blog?id=qw en3.5, 2026

2026
[28]

vllm.https://github.com/vllm-project/vllm, 2026

vLLM Team. vllm.https://github.com/vllm-project/vllm, 2026

2026
[29]

Hybrid models as first-class citizens in vLLM

vLLM Team at IBM. Hybrid models as first-class citizens in vLLM. https://pytorch.or g/blog/hybrid-models-as-first-class-citizens-in-vllm/ , 2025. PyTorch Blog, November 2025

2025
[30]

From prefix cache to fusion rag cache: Accelerating llm inference in retrieval-augmented generation.arXiv preprint arXiv:2601.12904, 2026

Jiahao Wang, Weiyu Xie, Mingxing Zhang, Boxing Zhang, Jianwei Dong, Yuening Zhu, Chen Lin, Jinqi Tang, Yaochen Han, Zhiyuan Ai, et al. From prefix cache to fusion rag cache: Accelerating llm inference in retrieval-augmented generation.arXiv preprint arXiv:2601.12904, 2026

work page arXiv 2026
[31]

MiMo-V2-Flash Technical Report

Bangjun Xiao, Bingquan Xia, Bo Yang, Bofei Gao, Bowen Shen, Chen Zhang, Chenhong He, Chiheng Lou, Fuli Luo, Gang Wang, et al. Mimo-v2-flash technical report.arXiv preprint arXiv:2601.02780, 2026

work page internal anchor Pith review arXiv 2026
[32]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[33]

Gated delta networks: Improving mamba2 with delta rule

Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving mamba2 with delta rule. InInternational Conference on Learning Representations (ICLR), 2025

2025
[34]

Cacheblend: Fast large language model serving for rag with cached knowledge fusion

Jiayi Yao, Hanchen Li, Yuhan Liu, Siddhant Ray, Yihua Cheng, Qizheng Zhang, Kuntai Du, Shan Lu, and Junchen Jiang. Cacheblend: Fast large language model serving for rag with cached knowledge fusion. InProceedings of the twentieth European conference on computer systems, pages 94–109, 2025

2025
[35]

H2o: Heavy-hitter oracle for efficient generative inference of large language models.Advances in Neural Information Processing Systems, 36:34661–34710, 2023

Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models.Advances in Neural Information Processing Systems, 36:34661–34710, 2023

2023
[36]

Llm-pq: Serving llm on heterogeneous clusters with phase-aware partition and adaptive quantization

Juntao Zhao, Borui Wan, Chuan Wu, Yanghua Peng, and Haibin Lin. Llm-pq: Serving llm on heterogeneous clusters with phase-aware partition and adaptive quantization. InProceed- ings of the 29th ACM SIGPLAN Annual Symposium on Principles and Practice of Parallel Programming, pages 460–462, 2024

2024
[37]

DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024. 16

2024