
arxiv: 2606.19271 · v1 · pith:D3KWWPVFnew · submitted 2026-06-17 · 💻 cs.DC

TurboServe: Serving Streaming Video Generation Efficiently and Economically

Pith reviewed 2026-06-26 19:21 UTC · model grok-4.3

classification 💻 cs.DC
keywords streaming video generation · serving system · GPU scheduling · autoscaling · online scheduling · latency optimization · cost efficiency

The pith

TurboServe reduces worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average, relative to baseline serving configurations, for streaming video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Streaming video generation requires serving systems to handle long-lived user sessions that produce video chunks progressively under strict latency targets. The paper establishes that these workloads introduce heterogeneity in session durations and in user demand over time, which standard serving approaches fail to manage efficiently. TurboServe addresses this by formulating serving as an online scheduling problem that jointly optimizes session placement across GPUs and the number of GPUs provisioned. Its closed-loop algorithm uses migration to rebalance load and autoscaling to match provisioning to demand, supported by batching, offloading, and migration mechanisms. Sympathetic readers would care because these changes make real-time interactive video generation feasible at lower cost on shared hardware.

Core claim

The central claim is that a closed-loop scheduling algorithm coordinating migration-aware placement and load-driven autoscaling can reduce worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average compared to baseline configurations. This is achieved by treating streaming video generation as an online scheduling problem in multi-GPU environments and implementing coalesced chunk processing, GPU-CPU offloading, and NCCL-based GPU-GPU migration to support the scheduling decisions at runtime. The evaluation uses real-world production traces across multiple model sizes and clusters up to 64 GPUs.
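To make the migration-aware placement idea concrete, the following Python sketch shows one greedy rebalancing pass. It is an illustration under stated assumptions, not the paper's algorithm: the `Gpu` class, the `rebalance` function, the per-session cost values, and the `migration_cost` threshold are all hypothetical, and per-GPU load is approximated as the sum of per-session chunk costs.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    gpu_id: int
    # session_id -> estimated per-chunk cost (hypothetical unit, e.g. ms)
    sessions: dict = field(default_factory=dict)

    @property
    def load(self) -> float:
        return sum(self.sessions.values())


def rebalance(gpus, migration_cost=0.0, max_moves=10):
    """Greedy pass: repeatedly move one session from the most-loaded GPU
    to the least-loaded GPU while that lowers the hot GPU's load by more
    than `migration_cost`. Returns the list of (session, src, dst) moves."""
    moves = []
    for _ in range(max_moves):
        hot = max(gpus, key=lambda g: g.load)
        cold = min(gpus, key=lambda g: g.load)
        gap = hot.load - cold.load
        # Only sessions smaller than the gap can shrink the pair's peak.
        candidates = [(sid, c) for sid, c in hot.sessions.items() if 0 < c < gap]
        if not candidates:
            break
        # Pick the session that minimizes the peak load of the (hot, cold) pair.
        sid, cost = min(
            candidates,
            key=lambda sc: max(hot.load - sc[1], cold.load + sc[1]),
        )
        new_peak = max(hot.load - cost, cold.load + cost)
        if hot.load - new_peak <= migration_cost:
            break  # the gain does not justify paying for a migration
        del hot.sessions[sid]
        cold.sessions[sid] = cost
        moves.append((sid, hot.gpu_id, cold.gpu_id))
    return moves
```

The `migration_cost` threshold is the "migration-aware" part of this sketch: a move is skipped when the load it removes from the hot GPU is no larger than the assumed price of moving session state. A real system would measure that price rather than take it as a constant.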

What carries the argument

The closed-loop scheduling algorithm consisting of a migration-aware placement controller and a load-driven autoscaling controller.
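The autoscaling half of such a loop can be sketched in the same spirit. The Python below is a hedged illustration, not TurboServe's controller: `target_gpus`, `simulate`, the `sessions_per_gpu` capacity, and the `scale_down_margin` hysteresis parameter are assumptions introduced here. It scales up as soon as demand exceeds capacity and scales down only when demand fits with some headroom, which avoids oscillating on small fluctuations.

```python
import math


def target_gpus(active_sessions, sessions_per_gpu, current,
                min_gpus=1, max_gpus=64, scale_down_margin=0.2):
    """Return the GPU count to provision for the next control interval."""
    needed = math.ceil(active_sessions / sessions_per_gpu)
    needed = min(max(needed, min_gpus), max_gpus)
    if needed > current:
        return needed  # scale up immediately to protect latency
    # Scale down conservatively: keep headroom above current demand.
    relaxed = math.ceil(active_sessions * (1 + scale_down_margin) / sessions_per_gpu)
    relaxed = min(max(relaxed, min_gpus), max_gpus)
    return min(current, relaxed)


def simulate(demand_trace, sessions_per_gpu, start_gpus=1):
    """Run the controller over a list of active-session counts,
    one entry per control interval, and return the GPU count per interval."""
    plan, current = [], start_gpus
    for active in demand_trace:
        current = target_gpus(active, sessions_per_gpu, current)
        plan.append(current)
    return plan
```

Summing the returned plan gives GPU-intervals consumed, which can be compared against a static allocation sized for peak demand. The specific thresholds are placeholders; the abstract does not state how the paper's controller maps load to a GPU budget.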

Load-bearing premise

That the chosen baselines represent standard serving configurations and that the evaluated traces include sufficient variation to demonstrate gains in typical deployments.

What would settle it

Measuring the latency and cost metrics on a new set of traces with more extreme session duration differences or demand fluctuations than those used in the paper.

Original abstract

Streaming video generation is emerging as a new serving workload in which users interact with long-lived sessions that generate video progressively, chunk by chunk. Unlike offline video generation or typical LLM serving, streaming video generation must preserve session state across active and idle periods, repeatedly schedule ongoing sessions, and deliver each chunk under a tight latency target. This creates two key serving challenges in multi-user, multi-GPU environments: session duration heterogeneity, where long-running sessions make placement decisions suboptimal over time, and temporal user-demand heterogeneity, where the number of active sessions fluctuates sharply across bursts and idle periods. We present TurboServe, the first serving system designed specifically for streaming video generation workloads. TurboServe formulates serving as an online scheduling problem that jointly coordinates session placement and GPU provisioning. Its closed-loop scheduling algorithm combines a migration-aware placement controller, which rebalances sessions across GPUs to reduce the maximum per-chunk latency, with a load-driven autoscaling controller, which adapts the GPU budget to workload variation for improved cost efficiency. To support these decisions at runtime, TurboServe implements coalesced chunk processing for batching concurrent active sessions on the same GPU, GPU-CPU offloading for session suspension and resumption, and NCCL-based GPU-GPU migration for online rebalancing. We evaluate TurboServe on real-world production traces from Shengshu Technology across multiple model sizes and GPU clusters with up to 64 NVIDIA B300 GPUs. Compared with baseline serving configurations, TurboServe reduces worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average. Our code is publicly available at https://github.com/shengshu-ai/TurboServe.
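The three runtime mechanisms named in the abstract can be pictured with a toy bookkeeping model. The Python sketch below is purely illustrative and assumes nothing about TurboServe's code: `SessionStore` and its methods are hypothetical names, session state is an opaque Python object rather than GPU memory, and no actual device transfer or NCCL call takes place. It only tracks where each session's state lives and which sessions can share a batch.

```python
from collections import defaultdict


class SessionStore:
    """Toy model of session residency: on a GPU (active) or on CPU (suspended)."""

    def __init__(self):
        self.gpu_state = defaultdict(dict)  # gpu_id -> {session_id: state}
        self.cpu_state = {}                 # session_id -> state

    def admit(self, sid, gpu_id, state):
        self.gpu_state[gpu_id][sid] = state

    def suspend(self, sid, gpu_id):
        # Stand-in for GPU-CPU offloading when a session goes idle.
        self.cpu_state[sid] = self.gpu_state[gpu_id].pop(sid)

    def resume(self, sid, gpu_id):
        # The session may resume on a different GPU than it left.
        self.gpu_state[gpu_id][sid] = self.cpu_state.pop(sid)

    def migrate(self, sid, src_gpu, dst_gpu):
        # Stand-in for GPU-GPU migration of a resident session.
        self.gpu_state[dst_gpu][sid] = self.gpu_state[src_gpu].pop(sid)

    def coalesce(self, gpu_id, requesting):
        """Sessions that requested a chunk and are resident on `gpu_id`;
        these would be processed together as one batch."""
        resident = self.gpu_state[gpu_id]
        return [sid for sid in sorted(requesting) if sid in resident]
```

In this model a suspended session frees its GPU slot without losing state, which is the property the abstract attributes to offloading.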

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper presents TurboServe as the first serving system for streaming video generation workloads, which involve long-lived interactive sessions generating video chunk-by-chunk. It formulates serving as an online scheduling problem and proposes a closed-loop algorithm with a migration-aware placement controller and a load-driven autoscaling controller. These are supported at runtime by coalesced chunk processing, GPU-CPU offloading, and NCCL-based GPU-GPU migration. The system is evaluated on real-world production traces from Shengshu Technology across multiple model sizes and clusters of up to 64 NVIDIA B300 GPUs, claiming 37.5% reduction in worst-case per-chunk latency and 37.2% reduction in total GPU operating cost versus baseline serving configurations. Code is released publicly.

Significance. If the performance claims hold under rigorous validation, this work would be a meaningful contribution to distributed systems for emerging AI serving workloads. Streaming video generation introduces distinct challenges around persistent session state, duration heterogeneity, and bursty demand that differ from both offline generation and standard LLM inference; a dedicated system addressing them could improve efficiency in production video AI services. The public code release supports reproducibility and is a strength.

major comments (1)
  1. [Evaluation] Evaluation section: the reported 37.5% worst-case per-chunk latency reduction and 37.2% GPU cost reduction cannot be verified or attributed to the closed-loop scheduler because the manuscript supplies no description of the baseline serving configurations (e.g., static round-robin vs. no-migration vs. fixed GPU count), no statistics on the production traces (session-duration distribution, burst/idle ratios, heterogeneity metrics), and no details on the 64-GPU experimental configuration or statistical significance testing. These omissions make the central empirical claim load-bearing but unsupported.
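The trace statistics the referee asks for are cheap to compute once sessions are available as start and end times. The Python sketch below shows one possible way, under assumptions: sessions are `(start, end)` pairs on an integer time grid, and the peak-to-mean concurrency ratio is used here as a stand-in for a "burst/idle ratio", which the report does not define. None of the function names come from the paper.

```python
from statistics import mean


def duration_cdf(sessions):
    """Empirical CDF of session durations as (duration, cumulative fraction)."""
    durations = sorted(end - start for start, end in sessions)
    n = len(durations)
    return [(d, (i + 1) / n) for i, d in enumerate(durations)]


def concurrency(sessions, t_end, step=1):
    """Number of sessions alive at each sampled time in [0, t_end)."""
    return [
        sum(1 for start, end in sessions if start <= t < end)
        for t in range(0, t_end, step)
    ]


def peak_to_mean(series):
    """Burstiness proxy: peak concurrency divided by mean concurrency."""
    m = mean(series)
    return max(series) / m if m else float("inf")
```

Reporting the duration CDF alongside a burstiness measure like this would let readers judge whether the evaluated traces are more or less heterogeneous than their own workloads.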

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the evaluation. We agree that the central performance claims require additional supporting details to be verifiable and attributable to the proposed scheduler. We will revise the manuscript to address this.

Point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the reported 37.5% worst-case per-chunk latency reduction and 37.2% GPU cost reduction cannot be verified or attributed to the closed-loop scheduler because the manuscript supplies no description of the baseline serving configurations (e.g., static round-robin vs. no-migration vs. fixed GPU count), no statistics on the production traces (session-duration distribution, burst/idle ratios, heterogeneity metrics), and no details on the 64-GPU experimental configuration or statistical significance testing. These omissions make the central empirical claim load-bearing but unsupported.

    Authors: We acknowledge the validity of this observation. The current manuscript does not provide explicit descriptions of the baseline configurations, detailed trace statistics, full experimental setup parameters for the 64-GPU runs, or statistical significance tests. In the revised version we will add: (1) precise definitions of each baseline (including placement policy, migration usage, and GPU provisioning), (2) summary statistics for the Shengshu production traces (session duration CDF, burst/idle ratios, and heterogeneity measures), (3) hardware and software configuration details for the 64-GPU cluster experiments, and (4) p-values or confidence intervals for the reported latency and cost reductions. These additions will enable readers to verify and attribute the gains to the closed-loop algorithm.

    Revision: yes
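One way the promised confidence intervals could be produced is a percentile bootstrap over repeated runs. The Python sketch below is a hedged example, not the authors' procedure: `relative_reduction` and `bootstrap_ci` are hypothetical helpers, the inputs are assumed to be per-run measurements (for example, worst-case per-chunk latency for each run), and baseline and treated runs are resampled independently.

```python
import random
from statistics import mean


def relative_reduction(baseline, treated):
    """Fractional reduction of the treated mean relative to the baseline mean."""
    return 1.0 - mean(treated) / mean(baseline)


def bootstrap_ci(baseline, treated, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the relative reduction."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    stats = []
    for _ in range(n_boot):
        b = rng.choices(baseline, k=len(baseline))  # resample with replacement
        t = rng.choices(treated, k=len(treated))
        stats.append(relative_reduction(b, t))
    stats.sort()
    lo = stats[int(alpha / 2 * n_boot)]
    hi = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi
```

An interval that excludes zero would support attributing a reduction to the system rather than to run-to-run noise. With only a handful of runs per configuration the interval will be wide, which is itself worth reporting.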

Circularity Check

0 steps flagged

No circularity; empirical system evaluation on external traces

Full rationale

The paper presents TurboServe as a serving system with closed-loop scheduling, evaluated via experiments on real-world production traces from Shengshu Technology across model sizes and up to 64 GPUs. Reported gains (37.5% latency, 37.2% cost) are direct experimental outcomes versus baselines, with no derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps. The work is self-contained against external benchmarks and traces.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities.


