pith. machine review for the scientific record.

arxiv: 2605.08151 · v2 · submitted 2026-05-04 · 💻 cs.DC · cs.AI

Recognition: no theorem link

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

Feiyu Zhang, Jincheng Xie, Qi Xiao, Wen Hu, Yawen Ling, Yu Zheng, Zhongyi Huang

Pith reviewed 2026-05-13 06:57 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords speculative decoding · LLM inference serving · multi-tenant cloud systems · resource-efficient deployment · parallel token generation · remote drafter models · throughput optimization

The pith

Reusing underutilized smaller models as remote drafters speeds large-model inference by running draft generation in parallel with verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to accelerate serving of large language models in cloud systems by tapping into the spare capacity of many smaller, lightly loaded models. It does this through a form of speculative decoding where the smaller models produce candidate tokens at the same time the large model checks them, with a hybrid approach that mixes sequential and parallel steps to keep the overlap useful. This matters because typical deployments have long-tailed demand, so the large models stay busy while smaller ones sit idle; capturing that idle capacity could raise overall system output without new hardware. The design adds scheduling rules to protect the smaller models' regular work and prompt compression to keep the drafts fast. Experiments across model pairs and workloads confirm higher throughput for the large models and only small effects on the smaller ones' own performance.
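To make the draft–verify overlap concrete, here is a minimal sketch in Python with toy stand-ins for both models; the function names (`remote_draft`, `verify`), the chunk length, and the acceptance rule are illustrative assumptions, not the paper's API. The drafter speculatively produces the next chunk while the target verifies the current one, and the optimistic draft is discarded whenever part of a chunk is rejected.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def remote_draft(context, k=4):
    # Stand-in for a lightly loaded tail model proposing the next k tokens.
    return [f"<draft:{len(context) + i}>" for i in range(k)]

def verify(context, draft):
    # Stand-in for the large target model: accept a prefix of the draft chunk
    # (toy rule -- sometimes the last drafted token is rejected).
    n_accept = random.randint(len(draft) - 1, len(draft))
    return draft[:n_accept]

def generate(prompt_tokens, rounds=4):
    tokens = list(prompt_tokens)
    with ThreadPoolExecutor(max_workers=1) as drafter:
        pending = drafter.submit(remote_draft, tokens)
        for _ in range(rounds):
            draft = pending.result()
            # Optimistically draft the *next* chunk while the target verifies
            # this one -- the overlap the parallel mode relies on.
            pending = drafter.submit(remote_draft, tokens + draft)
            accepted = verify(tokens, draft)
            tokens += accepted
            if len(accepted) < len(draft):
                # Verification rejected part of the chunk, so the optimistic
                # draft was built on a bad prefix; discard it and redraft.
                pending.cancel()
                pending = drafter.submit(remote_draft, tokens)
    return tokens

print(generate(["<prompt>"]))
```

The wasted drafts on the rejection path are exactly the cost the hybrid strategy weighs when deciding whether running the two sides in parallel is worth it.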

Core claim

The central claim is that draft generation on remote tail models and target verification on large models can be made to overlap effectively in multi-tenant settings by combining a hybrid ordinary-parallel speculative strategy, a throughput-derived decision threshold, speculative priority scheduling, and draft-side prompt compression, yielding higher large-model throughput with only minor interference to tail-model native workloads.

What carries the argument

A hybrid ordinary-parallel speculative decoding strategy that decides when to run draft and verification steps concurrently, guided by a threshold derived from throughput analysis; speculative priority scheduling preserves the draft-target overlap, and draft-side prompt compression shortens draft latency.
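The paper derives its switching threshold from a throughput analysis that is not reproduced here. As a rough illustration only, under an assumed cost model (the `waste` weight and all timings below are hypothetical), solving "parallel round cheaper than ordinary round" for the full-chunk acceptance probability yields a threshold that the serving loop can compare against its observed acceptance rate.

```python
def acceptance_threshold(t_draft, t_verify, waste):
    """Toy model: an ordinary round costs t_draft + t_verify; a parallel round
    costs max(t_draft, t_verify) if the whole chunk is accepted, otherwise the
    optimistic draft is wasted and we pay t_draft + t_verify plus
    waste * t_draft of discarded drafter time. Parallel wins when the observed
    full-acceptance probability exceeds the value returned here."""
    saved_by_overlap = min(t_draft, t_verify)  # (t_d + t_v) - max(t_d, t_v)
    denom = saved_by_overlap + waste * t_draft
    return (waste * t_draft) / denom if denom > 0 else 1.0

def choose_mode(observed_p_full, t_draft, t_verify, waste=0.5):
    threshold = acceptance_threshold(t_draft, t_verify, waste)
    return "parallel" if observed_p_full > threshold else "ordinary"

# Example: 30 ms remote draft step, 12 ms verify step, 60% of chunks fully
# accepted -> threshold ~0.56, so the parallel mode is chosen.
print(choose_mode(observed_p_full=0.60, t_draft=0.030, t_verify=0.012))
```

In SPECTRE the actual decision also has to respect the tail model's own load, which is what the speculative priority scheduling is there to protect.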

If this is right

  • Large models handle more requests per unit of hardware time than standard autoregressive decoding.
  • The gains from speculative methods increase further when remote drafters are added.
  • Smaller models continue to meet most of their own workload demands with only minor added delay.
  • The same techniques apply across different model size pairs, batch sizes, and long-context tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be extended by letting the system pair drafter and target models dynamically as load patterns shift across a cluster.
  • Similar remote-drafting patterns might reduce waste in other distributed inference settings that mix model sizes.
  • Lower interference could support more aggressive sharing of compute resources among unrelated services on the same hardware.

Load-bearing premise

Draft generation on remote smaller models and verification on the large model can maintain useful overlap under realistic mixed traffic without unacceptable slowdown for the smaller models' regular users.

What would settle it

In a live multi-tenant cluster with fluctuating loads, measuring whether large-model throughput falls back to or below the non-speculative baseline, or whether smaller-model response times rise by more than a small fraction of their native values.
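As a hedged sketch of those two measurements (the field names and the 5% slack are illustrative, not taken from the paper), the check could be scripted over request logs as follows.

```python
def throughput(records):
    # records: list of (output_tokens, wall_seconds) per completed request
    toks = sum(t for t, _ in records)
    secs = sum(s for _, s in records)
    return toks / secs if secs else 0.0

def median(xs):
    xs = sorted(xs)
    mid = len(xs) // 2
    return xs[mid] if len(xs) % 2 else (xs[mid - 1] + xs[mid]) / 2

def claim_fails(large_spec, large_base, tail_native_lat, tail_shared_lat, slack=0.05):
    """True if either failure condition is observed: speculative large-model
    throughput at or below the autoregressive baseline, or median tail-model
    latency inflated by more than `slack` while also serving as a drafter."""
    gain_gone = throughput(large_spec) <= throughput(large_base)
    tail_hurt = median(tail_shared_lat) > (1 + slack) * median(tail_native_lat)
    return gain_gone or tail_hurt
```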

read the original abstract

LLM serving platforms are increasingly deployed as multi-model cloud systems, where user demand is often long-tailed: a few popular large models receive most requests, while many smaller tail models remain underutilized. We propose SPECTRE (Parallel SPECulative Decoding with a Multi-Tenant REmote Drafter), a serving framework that reuses underutilized tail-model services as remote drafters for heavily loaded large-model services through speculative decoding. SPECTRE enables draft generation and target-side verification to run in parallel, and makes such parallelism effective through three techniques: a hybrid ordinary-parallel speculative decoding strategy guided by a threshold derived from throughput analysis, speculative priority scheduling to preserve draft-target overlap under multi-tenant traffic, and draft-side prompt compression to reduce draft latency. We implement SPECTRE in SGLang and evaluate it across multiple draft-target model pairs, reasoning benchmarks, real-world long-context workloads, and a wide range of batch sizes. Results show that SPECTRE consistently improves large-model serving throughput while causing only minor interference to the native workloads of tail-model services. In large-model deployments, including Qwen3-235B-A22B with TP=8, SPECTRE achieves up to 2.28× speedup over autoregressive decoding and up to an additional 66% relative improvement over the strongest speculative decoding baselines. Talk is cheap, we show you the code: https://github.com/sgl-project/sglang/pull/22272.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents SPECTRE, a serving framework that reuses underutilized tail-model services as remote drafters for large-model inference via speculative decoding in multi-tenant LLM platforms. It proposes a hybrid ordinary-parallel speculative decoding strategy guided by a throughput-derived threshold, speculative priority scheduling to maintain overlap, and draft-side prompt compression. Implemented in SGLang, evaluations across draft-target pairs, reasoning benchmarks, long-context workloads, and batch sizes report up to 2.28× speedup over autoregressive decoding and up to a 66% relative gain over the strongest speculative decoding baselines (e.g., for Qwen3-235B-A22B with TP=8), with only minor interference to tail-model native workloads.

Significance. If the results hold, SPECTRE could meaningfully improve resource efficiency in cloud LLM deployments by exploiting long-tailed model popularity, increasing throughput for popular models without extra hardware. The open-source code release and implementation in SGLang are strengths for reproducibility and adoption.

major comments (1)
  1. [§5] §5 (Experimental Evaluation): The central claims of consistent speedups and effective parallelism rest on the hybrid strategy and scheduling preserving draft-target overlap under multi-tenant traffic. The reported results use real-world long-context workloads and a range of batch sizes, but do not include explicit characterization of traffic burstiness, priority conflicts, or concurrent native tail-model loads; this leaves the weakest assumption (overlap without unacceptable interference) insufficiently tested and risks overstatement of the 2.28× and 66% gains.
minor comments (2)
  1. [Abstract] Abstract and §5: The claim of 'only minor interference' to tail models lacks quantitative metrics (e.g., throughput degradation percentages or latency histograms) or references to specific figures/tables showing the effect sizes.
  2. [§5] §5: Speedup numbers (2.28×, 66%) should report error bars, standard deviations, or run-to-run variability to support the 'consistently improves' claim across workloads.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of SPECTRE's significance for improving resource efficiency in multi-tenant LLM deployments and for the constructive feedback on the experimental evaluation. We address the major comment point by point below.

read point-by-point responses
  1. Referee: [§5] §5 (Experimental Evaluation): The central claims of consistent speedups and effective parallelism rest on the hybrid strategy and scheduling preserving draft-target overlap under multi-tenant traffic. The reported results use real-world long-context workloads and a range of batch sizes, but do not include explicit characterization of traffic burstiness, priority conflicts, or concurrent native tail-model loads; this leaves the weakest assumption (overlap without unacceptable interference) insufficiently tested and risks overstatement of the 2.28× and 66% gains.

    Authors: We appreciate the referee's emphasis on rigorously validating the overlap assumption. Our evaluations already incorporate real-world long-context workloads (which exhibit natural variability in arrivals and lengths) across a wide range of batch sizes, and we explicitly quantify interference to native tail-model workloads, showing only minor impact while achieving the reported speedups. The hybrid ordinary-parallel strategy and speculative priority scheduling are specifically designed to preserve overlap under contention. That said, we agree that controlled characterization of burstiness, priority conflicts, and varying concurrent loads would strengthen the claims. In the revised manuscript, we will add a dedicated subsection to §5 with new experiments using synthetic bursty arrival patterns (e.g., varying Poisson rates and burst factors) and controlled concurrent tail-model loads, demonstrating that the speedups and low interference hold under these conditions. revision: yes
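The rebuttal does not specify the arrival process in detail; a minimal sketch of one common way to synthesize bursty traffic for such a test is an on/off Poisson process that alternates between a high and a low rate (all parameters below are illustrative assumptions, not the authors' setup).

```python
import random

def bursty_arrivals(duration_s, base_rate=2.0, burst_rate=20.0,
                    burst_len_s=5.0, quiet_len_s=25.0, seed=0):
    """Request arrival times (seconds) from an on/off Poisson process:
    `burst_rate` req/s during bursts, `base_rate` req/s otherwise."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while t < duration_s:
        in_burst = (t % (burst_len_s + quiet_len_s)) < burst_len_s
        rate = burst_rate if in_burst else base_rate
        t += rng.expovariate(rate)  # exponential inter-arrival gap
        if t < duration_s:
            arrivals.append(t)
    return arrivals

# e.g. replay these timestamps against the serving endpoint while the tail
# model also handles its own native workload.
print(len(bursty_arrivals(60.0)))
```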

Circularity Check

0 steps flagged

No significant circularity in SPECTRE's performance claims

full rationale

The paper presents an implemented serving system evaluated on real hardware with released code. The hybrid ordinary-parallel strategy uses a threshold derived from throughput analysis (independent of final measured speedups), speculative priority scheduling, and prompt compression. Reported gains (2.28× over autoregressive, 66% over baselines) are empirical outcomes from benchmarks and workloads, not quantities forced by definition, fitted inputs renamed as predictions, or self-citation chains. No load-bearing step reduces to its own inputs by construction; the claims rest on external measurements rather than on the system's own definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions from speculative decoding and multi-tenant serving; the throughput-derived threshold is the main free parameter introduced.

free parameters (1)
  • hybrid threshold
    Derived from throughput analysis to decide between ordinary and parallel speculative modes.
axioms (1)
  • domain assumption: Remote tail-model services can act as effective drafters with acceptable interference under multi-tenant load.
    Invoked in the design of speculative priority scheduling and parallel execution.

pith-pipeline@v0.9.0 · 5603 in / 1219 out tokens · 60227 ms · 2026-05-13T06:57:42.030603+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 7 internal anchors

  1. [1]

    Sarathi: Efficient LLM inference by piggybacking decodes with chunked prefills

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. ArXiv, abs/2308.16369,

  2. [2]

    URL https://api.semanticscholar.org/CorpusID:261395577

  3. [3]

    Splitwise: Efficient generative llm inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In Proceedings of the 51st Annual International Symposium on Computer Architecture, ISCA ’24, pages 118–132. IEEE Press, 2025. ISBN 9798350326581. doi: 10.1109/ISCA59077.2024.00019...

  4. [4]

    Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing

    Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh. Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing. In 2022 USENIX Annual Technical Conference (USENIX ATC 22), pages 199–216, Carlsbad, CA, July 2022. USENIX Association. ISBN 978-1-939133-29-53. URL https://www.usenix.org/c...

  5. [5]

    Aegaeon: Effective gpu pooling for concurrent llm serving on the market

    Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. Aegaeon: Effective gpu pooling for concurrent llm serving on the market. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, SOSP ’25, pages 1030–1045, New York, NY, USA, 2025. Association for Computing Mac...

  6. [6]

    Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

    Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong, Li Huan, and Cong Wang. Double: Breaking the acceleration limit via double retrieval speculative parallelism, 2026. URL https://arxiv.org/abs/2601.05524

  7. [7]

    PEARL: Parallel speculative decoding with adaptive draft length

    Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, and Xiao Sun. PEARL: Parallel speculative decoding with adaptive draft length. In The Thirteenth International Conference on Learning Representations,

  8. [8]

    URL https://openreview.net/forum?id=QOXrVMiHGK

  9. [9]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020. URL https://arxiv.org/abs/1909.08053

  10. [10]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF

  11. [11]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  12. [12]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

  13. [13]

    NVIDIA H200 GPU

    NVIDIA. NVIDIA H200 GPU. https://www.nvidia.com/en-us/data-center/h200/ , 2023

  14. [14]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168

  15. [15]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi

  16. [16]

    LongBench: A bilingual, multitask benchmark for long context understanding. doi: 10.18653/v1/2024.acl-long.172

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for ...

  17. [17]

    LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual...

  18. [18]

    Michelangelo: Long context evaluations beyond haystacks via latent structure queries, 2024

    Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, and Kate Olszewsk...

  19. [19]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

  20. [20]

    EAGLE-3: Scaling up inference acceleration of large language models via training-time test

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-3: Scaling up inference acceleration of large language models via training-time test. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4exx1hUffq

  21. [21]

    Minedraft: A framework for batch parallel speculative decoding, 2026

    Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, and Bryan Kian Hsiang Low. Minedraft: A framework for batch parallel speculative decoding, 2026. URL https://arxiv.org/abs/2603.18016

  22. [22]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, pages 611–626, New York, NY, USA, 2023. Association for Computing Machin...

  23. [23]

    SGLang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: efficient execution of structured language model programs. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, N...

  24. [24]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling, 2023. URL https://arxiv.org/abs/2302.01318

  25. [25]

    Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation

    Heming Xia, Tao Ge, Peiyi Wang, Si-Qing Chen, Furu Wei, and Zhifang Sui. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3909–3925, Singapore, December 2023. Association for Computa...

  26. [26]

    Decoding speculative decoding

    Minghao Yan, Saurabh Agarwal, and Shivaram Venkataraman. Decoding speculative decoding. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6460–6473, Albuquerque, New Mexic...

  27. [27]

    Inference-cost-aware dynamic tree construction for efficient inference in large language models

    Yinrong Hong, Zhiquan Tan, and Kai Hu. Inference-cost-aware dynamic tree construction for efficient inference in large language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=iaWyRYthFf

  28. [28]

    RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

    Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang, and Hai Zhao. Racer: Retrieval-augmented contextual rapid speculative decoding, 2026. URL https://arxiv.org/abs/2604.14885

  29. [29]

    Eagle: speculative sampling requires rethinking feature uncertainty

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: speculative sampling requires rethinking feature uncertainty. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  30. [30]

    EAGLE-2: Faster inference of language models with dynamic draft trees

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-2: Faster inference of language models with dynamic draft trees. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7421–7432, Miami, Florida, USA, nov 2024. Association for Computational Li...

  31. [31]

    SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM ...

  32. [32]

    Medusa: Simple LLM inference acceleration framework with multiple decoding heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  33. [33]

    Speculative speculative decoding

    Tanishq Kumar, Tri Dao, and Avner May. Speculative speculative decoding. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=aL1Wnml9Ef

  34. [34]

    Distserve: disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, OSDI’24, USA, 2024. USENIX Association. ISBN 978-1-939133-40-3

  35. [35]

    Llumnix: dynamic scheduling for large language model serving

    Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: dynamic scheduling for large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, OSDI’24, USA, 2024. USENIX Association. ISBN 978-1-939133-40-3

  36. [36]

    Windserve: Efficient phase-disaggregated llm serving with stream-based dynamic scheduling

    Jingqi Feng, Yukai Huang, Rui Zhang, Sicheng Liang, Ming Yan, and Jie Wu. Windserve: Efficient phase-disaggregated llm serving with stream-based dynamic scheduling. In Proceedings of the 52nd Annual International Symposium on Computer Architecture, ISCA ’25, pages 1283–1295, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400712616....

  37. [37]

    Bullet: Boosting gpu utilization for llm serving via dynamic spatial-temporal orchestration

    Zejia Lin, Hongxin Xu, Guanyi Chen, Zhiguang Chen, Yutong Lu, and Xianwei Zhang. Bullet: Boosting gpu utilization for llm serving via dynamic spatial-temporal orchestration. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’26, pages 290–306, New York, NY, ...

  38. [38]

    Muxserve: flexible spatial-temporal multiplexing for multiple llm serving

    Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. Muxserve: flexible spatial-temporal multiplexing for multiple llm serving. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  39. [39]

    Mooncake: A kvcache-centric disaggregated architecture for llm serving

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: A kvcache-centric disaggregated architecture for llm serving. ACM Trans. Storage, November 2025. ISSN 1553-3077. doi: 10.1145/3773772. URL https://doi.org/10.1145/3773772. Just Accepted

  40. [40]

    Infinigen: efficient generative inference of large language models with dynamic kv cache management

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. Infinigen: efficient generative inference of large language models with dynamic kv cache management. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, OSDI’24, USA, 2024. USENIX Association. ISBN 978-1-939133-40-3

  41. [41]

    Weaver: efficient multi-llm serving with attention offloading

    Shiwei Gao, Qing Wang, Shaoxun Zeng, Youyou Lu, and Jiwu Shu. Weaver: efficient multi-llm serving with attention offloading. In Proceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC ’25, USA, 2025. USENIX Association. ISBN 978-1-939133-48-9

  42. [42]

    Prism: Unleashing gpu sharing for cost-efficient multi-llm serving, 2025

    Shan Yu, Jiarong Xing, Yifan Qiao, Mingyuan Ma, Yangmin Li, Yang Wang, Shuo Yang, Zhiqiang Xie, Shiyi Cao, Ke Bao, Ion Stoica, Harry Xu, and Ying Sheng. Prism: Unleashing gpu sharing for cost-efficient multi-llm serving, 2025. URL https://arxiv.org/abs/2505.04021