pith. machine review for the scientific record.

arxiv: 2605.08151 · v2 · submitted 2026-05-04 · 💻 cs.DC · cs.AI

Recognition: no theorem link

SPECTRE: Hybrid Ordinary-Parallel Speculative Serving for Resource-Efficient LLM Inference

Feiyu Zhang, Jincheng Xie, Qi Xiao, Wen Hu, Yawen Ling, Yu Zheng, Zhongyi Huang

Pith reviewed 2026-05-13 06:57 UTC · model grok-4.3

classification 💻 cs.DC cs.AI
keywords speculative decoding · LLM inference serving · multi-tenant cloud systems · resource-efficient deployment · parallel token generation · remote drafter models · throughput optimization

The pith

Reusing underutilized smaller models as remote drafters speeds large-model inference by running draft generation in parallel with verification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to accelerate serving of large language models in cloud systems by tapping into the spare capacity of many smaller, lightly loaded models. It does this through a form of speculative decoding where the smaller models produce candidate tokens at the same time the large model checks them, with a hybrid approach that mixes sequential and parallel steps to keep the overlap useful. This matters because typical deployments have long-tailed demand, so the large models stay busy while smaller ones sit idle; capturing that idle capacity could raise overall system output without new hardware. The design adds scheduling rules to protect the smaller models' regular work and prompt compression to keep the drafts fast. Experiments across model pairs and workloads confirm higher throughput for the large models and only small effects on the smaller ones' own performance.
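To make the draft–verify overlap concrete, here is a minimal sketch in Python with toy stand-ins for both models; the function names (`remote_draft`, `verify`), the chunk length, and the acceptance rule are illustrative assumptions, not the paper's API. The drafter speculatively produces the next chunk while the target verifies the current one, and the optimistic draft is discarded whenever part of a chunk is rejected.

```python
import random
from concurrent.futures import ThreadPoolExecutor

def remote_draft(context, k=4):
    # Stand-in for a lightly loaded tail model proposing the next k tokens.
    return [f"<draft:{len(context) + i}>" for i in range(k)]

def verify(context, draft):
    # Stand-in for the large target model: accept a prefix of the draft chunk
    # (toy rule -- sometimes the last drafted token is rejected).
    n_accept = random.randint(len(draft) - 1, len(draft))
    return draft[:n_accept]

def generate(prompt_tokens, rounds=4):
    tokens = list(prompt_tokens)
    with ThreadPoolExecutor(max_workers=1) as drafter:
        pending = drafter.submit(remote_draft, tokens)
        for _ in range(rounds):
            draft = pending.result()
            # Optimistically draft the *next* chunk while the target verifies
            # this one -- the overlap the parallel mode relies on.
            pending = drafter.submit(remote_draft, tokens + draft)
            accepted = verify(tokens, draft)
            tokens += accepted
            if len(accepted) < len(draft):
                # Verification rejected part of the chunk, so the optimistic
                # draft was built on a bad prefix; discard it and redraft.
                pending.cancel()
                pending = drafter.submit(remote_draft, tokens)
    return tokens

print(generate(["<prompt>"]))
```

The wasted drafts on the rejection path are exactly the cost the hybrid strategy weighs when deciding whether running the two sides in parallel is worth it.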

Core claim

The central claim is that draft generation on remote tail models and target verification on large models can be made to overlap effectively in multi-tenant settings by combining a hybrid ordinary-parallel speculative strategy, a throughput-derived decision threshold, speculative priority scheduling, and draft-side prompt compression, yielding higher large-model throughput with only minor interference to tail-model native workloads.

What carries the argument

A hybrid ordinary-parallel speculative decoding strategy that decides when to run draft and verification steps concurrently, guided by a threshold derived from throughput analysis; speculative priority scheduling preserves the draft-target overlap, and draft-side prompt compression shortens draft latency.
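The paper derives its switching threshold from a throughput analysis that is not reproduced here. As a rough illustration only, under an assumed cost model (the `waste` weight and all timings below are hypothetical), solving "parallel round cheaper than ordinary round" for the full-chunk acceptance probability yields a threshold that the serving loop can compare against its observed acceptance rate.

```python
def acceptance_threshold(t_draft, t_verify, waste):
    """Toy model: an ordinary round costs t_draft + t_verify; a parallel round
    costs max(t_draft, t_verify) if the whole chunk is accepted, otherwise the
    optimistic draft is wasted and we pay t_draft + t_verify plus
    waste * t_draft of discarded drafter time. Parallel wins when the observed
    full-acceptance probability exceeds the value returned here."""
    saved_by_overlap = min(t_draft, t_verify)  # (t_d + t_v) - max(t_d, t_v)
    denom = saved_by_overlap + waste * t_draft
    return (waste * t_draft) / denom if denom > 0 else 1.0

def choose_mode(observed_p_full, t_draft, t_verify, waste=0.5):
    threshold = acceptance_threshold(t_draft, t_verify, waste)
    return "parallel" if observed_p_full > threshold else "ordinary"

# Example: 30 ms remote draft step, 12 ms verify step, 60% of chunks fully
# accepted -> threshold ~0.56, so the parallel mode is chosen.
print(choose_mode(observed_p_full=0.60, t_draft=0.030, t_verify=0.012))
```

In SPECTRE the actual decision also has to respect the tail model's own load, which is what the speculative priority scheduling is there to protect.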

If this is right

  • Large models handle more requests per unit of hardware time than standard autoregressive decoding.
  • The gains from speculative methods increase further when remote drafters are added.
  • Smaller models continue to meet most of their own workload demands with only minor added delay.
  • The same techniques apply across different model size pairs, batch sizes, and long-context tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be extended by letting the system pair drafter and target models dynamically as load patterns shift across a cluster.
  • Similar remote-drafting patterns might reduce waste in other distributed inference settings that mix model sizes.
  • Lower interference could support more aggressive sharing of compute resources among unrelated services on the same hardware.

Load-bearing premise

Draft generation on remote smaller models and verification on the large model can maintain useful overlap under realistic mixed traffic without unacceptable slowdown for the smaller models' regular users.

What would settle it

In a live multi-tenant cluster with fluctuating loads, measuring whether large-model throughput falls back to or below the non-speculative baseline, or whether smaller-model response times rise by more than a small fraction of their native values.
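As a hedged sketch of those two measurements (the field names and the 5% slack are illustrative, not taken from the paper), the check could be scripted over request logs as follows.

```python
def throughput(records):
    # records: list of (output_tokens, wall_seconds) per completed request
    toks = sum(t for t, _ in records)
    secs = sum(s for _, s in records)
    return toks / secs if secs else 0.0

def median(xs):
    xs = sorted(xs)
    mid = len(xs) // 2
    return xs[mid] if len(xs) % 2 else (xs[mid - 1] + xs[mid]) / 2

def claim_fails(large_spec, large_base, tail_native_lat, tail_shared_lat, slack=0.05):
    """True if either failure condition is observed: speculative large-model
    throughput at or below the autoregressive baseline, or median tail-model
    latency inflated by more than `slack` while also serving as a drafter."""
    gain_gone = throughput(large_spec) <= throughput(large_base)
    tail_hurt = median(tail_shared_lat) > (1 + slack) * median(tail_native_lat)
    return gain_gone or tail_hurt
```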

read the original abstract

LLM serving platforms are increasingly deployed as multi-model cloud systems, where user demand is often long-tailed: a few popular large models receive most requests, while many smaller tail models remain underutilized. We propose SPECTRE (Parallel SPECulative Decoding with a Multi-Tenant REmote Drafter), a serving framework that reuses underutilized tail-model services as remote drafters for heavily loaded large-model services through speculative decoding. SPECTRE enables draft generation and target-side verification to run in parallel, and makes such parallelism effective through three techniques: a hybrid ordinary-parallel speculative decoding strategy guided by a threshold derived from throughput analysis, speculative priority scheduling to preserve draft-target overlap under multi-tenant traffic, and draft-side prompt compression to reduce draft latency. We implement SPECTRE in SGLang and evaluate it across multiple draft-target model pairs, reasoning benchmarks, real-world long-context workloads, and a wide range of batch sizes. Results show that SPECTRE consistently improves large-model serving throughput while causing only minor interference to the native workloads of tail-model services. In large-model deployments, including Qwen3-235B-A22B with TP=8, SPECTRE achieves up to 2.28× speedup over autoregressive decoding and up to an additional 66% relative improvement over the strongest speculative decoding baselines. Talk is cheap, we show you the code: https://github.com/sgl-project/sglang/pull/22272.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents SPECTRE, a serving framework that reuses underutilized tail-model services as remote drafters for large-model inference via speculative decoding in multi-tenant LLM platforms. It proposes a hybrid ordinary-parallel speculative decoding strategy guided by a throughput-derived threshold, speculative priority scheduling to maintain overlap, and draft-side prompt compression. Implemented in SGLang, evaluations across draft-target pairs, reasoning benchmarks, long-context workloads, and batch sizes report up to 2.28× speedup over autoregressive decoding and up to a 66% relative gain over the strongest speculative decoding baselines (e.g., for Qwen3-235B-A22B with TP=8), with only minor interference to tail-model native workloads.

Significance. If the results hold, SPECTRE could meaningfully improve resource efficiency in cloud LLM deployments by exploiting long-tailed model popularity, increasing throughput for popular models without extra hardware. The open-source code release and implementation in SGLang are strengths for reproducibility and adoption.

major comments (1)
  1. [§5] §5 (Experimental Evaluation): The central claims of consistent speedups and effective parallelism rest on the hybrid strategy and scheduling preserving draft-target overlap under multi-tenant traffic. The reported results use real-world long-context workloads and a range of batch sizes, but do not include explicit characterization of traffic burstiness, priority conflicts, or concurrent native tail-model loads; this leaves the weakest assumption (overlap without unacceptable interference) insufficiently tested and risks overstatement of the 2.28× and 66% gains.
minor comments (2)
  1. [Abstract] Abstract and §5: The claim of 'only minor interference' to tail models lacks quantitative metrics (e.g., throughput degradation percentages or latency histograms) or references to specific figures/tables showing the effect sizes.
  2. [§5] §5: Speedup numbers (2.28×, 66%) should report error bars, standard deviations, or run-to-run variability to support the 'consistently improves' claim across workloads.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of SPECTRE's significance for improving resource efficiency in multi-tenant LLM deployments and for the constructive feedback on the experimental evaluation. We address the major comment point by point below.

read point-by-point responses
  1. Referee: [§5] §5 (Experimental Evaluation): The central claims of consistent speedups and effective parallelism rest on the hybrid strategy and scheduling preserving draft-target overlap under multi-tenant traffic. The reported results use real-world long-context workloads and a range of batch sizes, but do not include explicit characterization of traffic burstiness, priority conflicts, or concurrent native tail-model loads; this leaves the weakest assumption (overlap without unacceptable interference) insufficiently tested and risks overstatement of the 2.28× and 66% gains.

    Authors: We appreciate the referee's emphasis on rigorously validating the overlap assumption. Our evaluations already incorporate real-world long-context workloads (which exhibit natural variability in arrivals and lengths) across a wide range of batch sizes, and we explicitly quantify interference to native tail-model workloads, showing only minor impact while achieving the reported speedups. The hybrid ordinary-parallel strategy and speculative priority scheduling are specifically designed to preserve overlap under contention. That said, we agree that controlled characterization of burstiness, priority conflicts, and varying concurrent loads would strengthen the claims. In the revised manuscript, we will add a dedicated subsection to §5 with new experiments using synthetic bursty arrival patterns (e.g., varying Poisson rates and burst factors) and controlled concurrent tail-model loads, demonstrating that the speedups and low interference hold under these conditions. revision: yes
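The rebuttal does not specify the arrival process in detail; a minimal sketch of one common way to synthesize bursty traffic for such a test is an on/off Poisson process that alternates between a high and a low rate (all parameters below are illustrative assumptions, not the authors' setup).

```python
import random

def bursty_arrivals(duration_s, base_rate=2.0, burst_rate=20.0,
                    burst_len_s=5.0, quiet_len_s=25.0, seed=0):
    """Request arrival times (seconds) from an on/off Poisson process:
    `burst_rate` req/s during bursts, `base_rate` req/s otherwise."""
    rng = random.Random(seed)
    t, arrivals = 0.0, []
    while t < duration_s:
        in_burst = (t % (burst_len_s + quiet_len_s)) < burst_len_s
        rate = burst_rate if in_burst else base_rate
        t += rng.expovariate(rate)  # exponential inter-arrival gap
        if t < duration_s:
            arrivals.append(t)
    return arrivals

# e.g. replay these timestamps against the serving endpoint while the tail
# model also handles its own native workload.
print(len(bursty_arrivals(60.0)))
```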

Circularity Check

0 steps flagged

No significant circularity in SPECTRE's performance claims

full rationale

The paper presents an implemented serving system evaluated on real hardware with released code. The hybrid ordinary-parallel strategy uses a threshold derived from throughput analysis (independent of final measured speedups), speculative priority scheduling, and prompt compression. Reported gains (2.28× over autoregressive, 66% over baselines) are empirical outcomes from benchmarks and workloads, not quantities forced by definition, fitted inputs renamed as predictions, or self-citation chains. No load-bearing step reduces to its own inputs by construction; the claims rest on external measurements rather than on the system's own definitions.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on standard assumptions from speculative decoding and multi-tenant serving; the throughput-derived threshold is the main free parameter introduced.

free parameters (1)
  • hybrid threshold
    Derived from throughput analysis to decide between ordinary and parallel speculative modes.
axioms (1)
  • domain assumption: Remote tail-model services can act as effective drafters with acceptable interference under multi-tenant load.
    Invoked in the design of speculative priority scheduling and parallel execution.

pith-pipeline@v0.9.0 · 5603 in / 1219 out tokens · 60227 ms · 2026-05-13T06:57:42.030603+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 7 internal anchors

  1. [1]

    Sarathi: Efficient LLM inference by piggybacking decodes with chunked prefills

    Amey Agrawal, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav S. Gulavani, and Ramachandran Ramjee. Sarathi: Efficient llm inference by piggybacking decodes with chunked prefills. ArXiv, abs/2308.16369,

  2. [2]

    URL https://api.semanticscholar.org/CorpusID:261395577

  3. [3]

    Splitwise: Efficient generative llm inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative llm inference using phase splitting. In Proceedings of the 51st Annual International Symposium on Computer Architecture, ISCA ’24, pages 118–132. IEEE Press, 2025. ISBN 9798350326581. doi: 10.1109/ISCA59077.2024.00019...

  4. [4]

    Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing

    Seungbeom Choi, Sunho Lee, Yeonjae Kim, Jongse Park, Youngjin Kwon, and Jaehyuk Huh. Serving heterogeneous machine learning models on Multi-GPU servers with Spatio-Temporal sharing. In 2022 USENIX Annual Technical Conference (USENIX ATC 22), pages 199–216, Carlsbad, CA, July 2022. USENIX Association. ISBN 978-1-939133-29-53. URL https://www.usenix.org/c...

  5. [5]

    Aegaeon: Effective gpu pooling for concurrent llm serving on the market

    Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. Aegaeon: Effective gpu pooling for concurrent llm serving on the market. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, SOSP ’25, pages 1030–1045, New York, NY, USA, 2025. Association for Computing Mac...

  6. [6]

    Double: Breaking the Acceleration Limit via Double Retrieval Speculative Parallelism

    Yuhao Shen, Tianyu Liu, Junyi Shen, Jinyang Wu, Quan Kong, Li Huan, and Cong Wang. Double: Breaking the acceleration limit via double retrieval speculative parallelism, 2026. URL https://arxiv.org/abs/2601.05524

  7. [7]

    PEARL: Parallel speculative decoding with adaptive draft length

    Tianyu Liu, Yun Li, Qitan Lv, Kai Liu, Jianchen Zhu, Winston Hu, and Xiao Sun. PEARL: Parallel speculative decoding with adaptive draft length. In The Thirteenth International Conference on Learning Representations,

  8. [8]

    URL https://openreview.net/forum?id=QOXrVMiHGK

  9. [9]

    Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

    Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism, 2020. URL https://arxiv.org/abs/1909.08053

  10. [10]

    Efficient streaming language models with attention sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NG7sS51zVF

  11. [11]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  12. [12]

    DeepSeek-V3 Technical Report

    DeepSeek-AI, Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Daya Guo, Dejian Yang, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Haowei Zhang, Honghui Ding, Huaj...

  13. [13]

    NVIDIA H200 GPU

    NVIDIA. NVIDIA H200 GPU. https://www.nvidia.com/en-us/data-center/h200/ , 2023

  14. [14]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168

  15. [15]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi

  16. [16]

    LongBench: A bilingual, multitask benchmark for long context understanding. doi: 10.18653/v1/2024.acl-long.172

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench: A bilingual, multitask benchmark for long context understanding. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors, Proceedings of the 62nd Annual Meeting of the Association for ...

  17. [17]

    LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks

    Yushi Bai, Shangqing Tu, Jiajie Zhang, Hao Peng, Xiaozhi Wang, Xin Lv, Shulin Cao, Jiazheng Xu, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. LongBench v2: Towards deeper understanding and reasoning on realistic long-context multitasks. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual...

  18. [18]

    Michelangelo: Long context evaluations beyond haystacks via latent structure queries, 2024

    Kiran Vodrahalli, Santiago Ontanon, Nilesh Tripuraneni, Kelvin Xu, Sanil Jain, Rakesh Shivanna, Jeffrey Hui, Nishanth Dikkala, Mehran Kazemi, Bahare Fatemi, Rohan Anil, Ethan Dyer, Siamak Shakeri, Roopali Vij, Harsh Mehta, Vinay Ramasesh, Quoc Le, Ed Chi, Yifeng Lu, Orhan Firat, Angeliki Lazaridou, Jean-Baptiste Lespiau, Nithya Attaluri, and Kate Olszewsk...

  19. [19]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In Proceedings of the 40th International Conference on Machine Learning, ICML’23. JMLR.org, 2023

  20. [20]

    EAGLE-3: Scaling up inference acceleration of large language models via training-time test

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-3: Scaling up inference acceleration of large language models via training-time test. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=4exx1hUffq

  21. [21]

    Minedraft: A framework for batch parallel speculative decoding, 2026

    Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, and Bryan Kian Hsiang Low. Minedraft: A framework for batch parallel speculative decoding, 2026. URL https://arxiv.org/abs/2603.18016

  22. [22]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, SOSP ’23, pages 611–626, New York, NY, USA, 2023. Association for Computing Machin...

  23. [23]

    SGLang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Sglang: efficient execution of structured language model programs. In Proceedings of the 38th International Conference on Neural Information Processing Systems, NIPS ’24, Red Hook, N...

  24. [24]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling, 2023. URL https://arxiv.org/abs/2302.01318

  25. [25]

    Speculative Decoding: Exploiting Speculative Execution for Accelerating Seq2seq Generation

    Heming Xia, Tao Ge, Peiyi Wang, Si-Qing Chen, Furu Wei, and Zhifang Sui. Speculative decoding: Exploiting speculative execution for accelerating seq2seq generation. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Findings of the Association for Computational Linguistics: EMNLP 2023, pages 3909–3925, Singapore, December 2023. Association for Computa...

  26. [26]

    Decoding speculative decoding

    Minghao Yan, Saurabh Agarwal, and Shivaram Venkataraman. Decoding speculative decoding. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6460–6473, Albuquerque, New Mexic...

  27. [27]

    Inference-cost-aware dynamic tree construction for efficient inference in large language models

    Yinrong Hong, Zhiquan Tan, and Kai Hu. Inference-cost-aware dynamic tree construction for efficient inference in large language models. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=iaWyRYthFf

  28. [28]

    RACER: Retrieval-Augmented Contextual Rapid Speculative Decoding

    Zihong Zhang, Zuchao Li, Lefei Zhang, Ping Wang, and Hai Zhao. Racer: Retrieval-augmented contextual rapid speculative decoding, 2026. URL https://arxiv.org/abs/2604.14885

  29. [29]

    Eagle: speculative sampling requires rethinking feature uncertainty

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. Eagle: speculative sampling requires rethinking feature uncertainty. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  30. [30]

    EAGLE-2: Faster inference of language models with dynamic draft trees

    Yuhui Li, Fangyun Wei, Chao Zhang, and Hongyang Zhang. EAGLE-2: Faster inference of language models with dynamic draft trees. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 7421–7432, Miami, Florida, USA, nov 2024. Association for Computational Li...

  31. [31]

    SpecInfer: Accelerating Large Language Model Serving with Tree-based Speculative Inference and Verification

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Zhengxin Zhang, Rae Ying Yee Wong, Alan Zhu, Lijie Yang, Xiaoxiang Shi, Chunan Shi, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating large language model serving with tree-based speculative inference and verification. In Proceedings of the 29th ACM ...

  32. [32]

    Medusa: Simple LLM inference acceleration framework with multiple decoding heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D. Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  33. [33]

    Speculative speculative decoding

    Tanishq Kumar, Tri Dao, and Avner May. Speculative speculative decoding. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=aL1Wnml9Ef

  34. [34]

    Distserve: disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. Distserve: disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, OSDI’24, USA, 2024. USENIX Association. ISBN 978-1-939133-40-3

  35. [35]

    Llumnix: dynamic scheduling for large language model serving

    Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: dynamic scheduling for large language model serving. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, OSDI’24, USA, 2024. USENIX Association. ISBN 978-1-939133-40-3

  36. [36]

    Windserve: Efficient phase-disaggregated llm serving with stream-based dynamic scheduling

    Jingqi Feng, Yukai Huang, Rui Zhang, Sicheng Liang, Ming Yan, and Jie Wu. Windserve: Efficient phase-disaggregated llm serving with stream-based dynamic scheduling. In Proceedings of the 52nd Annual International Symposium on Computer Architecture, ISCA ’25, pages 1283–1295, New York, NY, USA, 2025. Association for Computing Machinery. ISBN 9798400712616....

  37. [37]

    Bullet: Boosting gpu utilization for llm serving via dynamic spatial-temporal orchestration

    Zejia Lin, Hongxin Xu, Guanyi Chen, Zhiguang Chen, Yutong Lu, and Xianwei Zhang. Bullet: Boosting gpu utilization for llm serving via dynamic spatial-temporal orchestration. In Proceedings of the 31st ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 2, ASPLOS ’26, pages 290–306, New York, NY, ...

  38. [38]

    Muxserve: flexible spatial-temporal multiplexing for multiple llm serving

    Jiangfei Duan, Runyu Lu, Haojie Duanmu, Xiuhong Li, Xingcheng Zhang, Dahua Lin, Ion Stoica, and Hao Zhang. Muxserve: flexible spatial-temporal multiplexing for multiple llm serving. In Proceedings of the 41st International Conference on Machine Learning, ICML’24. JMLR.org, 2024

  39. [39]

    Mooncake: A kvcache-centric disaggregated architecture for llm serving

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Heyi Tang, Feng Ren, Teng Ma, Shangming Cai, Yineng Zhang, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: A kvcache-centric disaggregated architecture for llm serving. ACM Trans. Storage, November 2025. ISSN 1553-3077. doi: 10.1145/3773772. URL https://doi.org/10.1145/3773772. Just Accepted

  40. [40]

    Infinigen: efficient generative inference of large language models with dynamic kv cache management

    Wonbeom Lee, Jungi Lee, Junghwan Seo, and Jaewoong Sim. Infinigen: efficient generative inference of large language models with dynamic kv cache management. In Proceedings of the 18th USENIX Conference on Operating Systems Design and Implementation, OSDI’24, USA, 2024. USENIX Association. ISBN 978-1-939133-40-3

  41. [41]

    Weaver: efficient multi-llm serving with attention offloading

    Shiwei Gao, Qing Wang, Shaoxun Zeng, Youyou Lu, and Jiwu Shu. Weaver: efficient multi-llm serving with attention offloading. In Proceedings of the 2025 USENIX Conference on Usenix Annual Technical Conference, USENIX ATC ’25, USA, 2025. USENIX Association. ISBN 978-1-939133-48-9

  42. [42]

    Prism: Unleashing gpu sharing for cost-efficient multi-llm serving, 2025

    Shan Yu, Jiarong Xing, Yifan Qiao, Mingyuan Ma, Yangmin Li, Yang Wang, Shuo Yang, Zhiqiang Xie, Shiyi Cao, Ke Bao, Ion Stoica, Harry Xu, and Ying Sheng. Prism: Unleashing gpu sharing for cost-efficient multi-llm serving, 2025. URL https://arxiv.org/abs/2505.04021