TurboServe: Serving Streaming Video Generation Efficiently and Economically

Fangcheng Fu; Haotong Bao; Haoxu Wang; Jianfei Chen; Jintao Zhang; Jun Zhu; Kai Jiang; Youhe Jiang

arxiv: 2606.19271 · v1 · pith:D3KWWPVFnew · submitted 2026-06-17 · 💻 cs.DC

Youhe Jiang, Haoxu Wang, Haotong Bao, Kai Jiang, Jianfei Chen, Jun Zhu, Fangcheng Fu, Jintao Zhang

Pith reviewed 2026-06-26 19:21 UTC · model grok-4.3

keywords: streaming video generation, serving system, GPU scheduling, autoscaling, online scheduling, latency optimization, cost efficiency

The pith

TurboServe reduces worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average, relative to baseline serving configurations, for streaming video generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Streaming video generation requires serving systems to handle long-lived user sessions that produce video chunks progressively under strict latency targets. The paper establishes that these workloads introduce heterogeneity in session durations and in user demand over time, which standard serving approaches fail to manage efficiently. TurboServe addresses this by formulating serving as an online scheduling problem that jointly optimizes session placement across GPUs and the number of GPUs provisioned. Its closed-loop algorithm uses migration to rebalance load and autoscaling to match provisioning to demand, supported by batching, offloading, and migration mechanisms. Sympathetic readers would care because these changes make real-time interactive video generation feasible at lower cost on shared hardware.

Core claim

The central claim is that a closed-loop scheduling algorithm coordinating migration-aware placement and load-driven autoscaling can reduce worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average compared to baseline configurations. This is achieved by treating streaming video generation as an online scheduling problem in multi-GPU environments and implementing coalesced chunk processing, GPU-CPU offloading, and NCCL-based GPU-GPU migration to support the scheduling decisions at runtime. The evaluation uses real-world production traces across multiple model sizes and clusters up to 64 GPUs.
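As a rough illustration of what migration-aware placement could look like, the sketch below greedily moves one session at a time from the most loaded GPU to the least loaded one, and only when the drop in peak load exceeds a migration penalty. This is not the paper's algorithm: the `Gpu` class, the additive load model, `migration_penalty`, and `max_moves` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """Hypothetical GPU record: maps session id -> estimated per-chunk cost (ms)."""
    name: str
    sessions: dict[str, float] = field(default_factory=dict)

    @property
    def load(self) -> float:
        # Assumption: per-chunk latency on a GPU grows with the summed session cost.
        return sum(self.sessions.values())


def rebalance(gpus: list[Gpu], migration_penalty: float, max_moves: int = 8):
    """Greedily migrate sessions to lower the peak load; returns (session, src, dst) moves."""
    moves = []
    for _ in range(max_moves):
        src = max(gpus, key=lambda g: g.load)
        dst = min(gpus, key=lambda g: g.load)
        best = None  # (session id, reduction in peak load over the src/dst pair)
        for sid, cost in src.sessions.items():
            new_peak = max(src.load - cost, dst.load + cost)
            gain = src.load - new_peak
            # Migrate only if the gain outweighs the (assumed) cost of moving state.
            if gain > migration_penalty and (best is None or gain > best[1]):
                best = (sid, gain)
        if best is None:
            break  # no migration is worth its penalty
        sid = best[0]
        dst.sessions[sid] = src.sessions.pop(sid)
        moves.append((sid, src.name, dst.name))
    return moves
```

The penalty term is the part that makes the sketch "migration-aware": without it, the loop would shuffle sessions for arbitrarily small improvements.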

What carries the argument

The closed-loop scheduling algorithm consisting of a migration-aware placement controller and a load-driven autoscaling controller.
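The autoscaling half of such a loop can be sketched in a few lines. The version below is a simple threshold rule with hysteresis, offered only as an assumption-laden example: `sessions_per_gpu`, `headroom`, and `scale_down_margin` are hypothetical parameters, and the paper's controller may work quite differently.

```python
import math


def target_gpu_count(active_sessions: int, sessions_per_gpu: int, current: int,
                     min_gpus: int = 1, max_gpus: int = 64,
                     headroom: float = 0.25, scale_down_margin: int = 1) -> int:
    """Return the GPU budget for the next control interval (illustrative rule only)."""
    # Provision for current demand plus a safety headroom.
    needed = math.ceil(active_sessions * (1.0 + headroom) / sessions_per_gpu)
    needed = max(min_gpus, min(max_gpus, needed))
    if needed > current:
        return needed  # scale up at once to protect per-chunk latency
    if current - needed > scale_down_margin:
        return current - 1  # release one GPU per interval to avoid oscillation
    return current
```

Scaling up eagerly and down slowly is one common way to trade a little cost for fewer latency violations during bursts; a closed-loop design would run this alongside placement at every control interval.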

Load-bearing premise

That the chosen baselines represent standard serving configurations and that the evaluated traces include sufficient variation to demonstrate gains in typical deployments.

What would settle it

Measuring the latency and cost metrics on a new set of traces with more extreme session duration differences or demand fluctuations than those used in the paper.

Original abstract

Streaming video generation is emerging as a new serving workload in which users interact with long-lived sessions that generate video progressively, chunk by chunk. Unlike offline video generation or typical LLM serving, streaming video generation must preserve session state across active and idle periods, repeatedly schedule ongoing sessions, and deliver each chunk under a tight latency target. This creates two key serving challenges in multi-user, multi-GPU environments: session duration heterogeneity, where long-running sessions make placement decisions suboptimal over time, and temporal user-demand heterogeneity, where the number of active sessions fluctuates sharply across bursts and idle periods. We present TurboServe, the first serving system designed specifically for streaming video generation workloads. TurboServe formulates serving as an online scheduling problem that jointly coordinates session placement and GPU provisioning. Its closed-loop scheduling algorithm combines a migration-aware placement controller, which rebalances sessions across GPUs to reduce the maximum per-chunk latency, with a load-driven autoscaling controller, which adapts the GPU budget to workload variation for improved cost efficiency. To support these decisions at runtime, TurboServe implements coalesced chunk processing for batching concurrent active sessions on the same GPU, GPU-CPU offloading for session suspension and resumption, and NCCL-based GPU-GPU migration for online rebalancing. We evaluate TurboServe on real-world production traces from Shengshu Technology across multiple model sizes and GPU clusters with up to 64 NVIDIA B300 GPUs. Compared with baseline serving configurations, TurboServe reduces worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average. Our code is publicly available at https://github.com/shengshu-ai/TurboServe.
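To make "coalesced chunk processing" concrete, the sketch below groups pending chunk requests by the GPU that hosts each session and splits each group into batches. It is a guess at the general shape of the mechanism, not TurboServe's implementation; `coalesce`, the `(session_id, gpu_id)` request format, and `max_batch` are hypothetical.

```python
from collections import defaultdict


def coalesce(requests, max_batch):
    """Group (session_id, gpu_id) chunk requests into per-GPU batches of at most max_batch."""
    by_gpu = defaultdict(list)
    for session_id, gpu_id in requests:
        by_gpu[gpu_id].append(session_id)
    # Each inner list is one batch that could be run as a single forward pass.
    return {
        gpu_id: [ids[i:i + max_batch] for i in range(0, len(ids), max_batch)]
        for gpu_id, ids in by_gpu.items()
    }
```

Idle sessions never appear in `requests`, which is where the abstract's GPU-CPU offloading would come in: their state can be parked off-GPU until the next chunk is requested.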

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TurboServe targets a real gap in serving long-lived video generation sessions with joint placement and autoscaling, but the abstract leaves the roughly 37% gains hard to verify without baseline and trace details.

TurboServe claims to be the first system built for streaming video generation workloads, where sessions stay alive across chunks and must meet per-chunk latency targets while handling session-duration and demand heterogeneity. The core idea is a closed-loop scheduler that pairs migration-aware placement with load-driven autoscaling, backed by coalesced chunk processing, GPU-CPU offloading, and NCCL migration.

The paper does a clean job naming the workload differences from LLM or offline video serving and then giving concrete runtime mechanisms to support the scheduler. Evaluating on real production traces from Shengshu Technology across model sizes and up to 64 B300 GPUs, plus releasing the code, gives it more grounding than many systems abstracts.

The soft spot is exactly what the stress-test note flags: the abstract states the 37.5% latency and 37.2% cost reductions but supplies no description of the baseline serving configurations, no trace statistics on session lengths or burst patterns, and no mention of how the 64-GPU runs were set up or whether results include variance. Without those, it is difficult to tell whether the gains come from the new techniques or from comparing against static or poorly tuned setups. The full paper needs to close that gap for the numbers to land.

This is a systems paper aimed at people running multi-GPU clusters for interactive generative workloads. A reader who cares about practical scheduling for stateful sessions would find the placement and autoscaling logic useful even if they end up re-implementing parts.

I would send it for peer review. The workload is timely, the approach is straightforward, and the public code lets referees check the claims directly.

Referee Report

1 major / 0 minor

Summary. The paper presents TurboServe as the first serving system for streaming video generation workloads, which involve long-lived interactive sessions generating video chunk-by-chunk. It formulates serving as an online scheduling problem and proposes a closed-loop algorithm with a migration-aware placement controller and a load-driven autoscaling controller. These are supported at runtime by coalesced chunk processing, GPU-CPU offloading, and NCCL-based GPU-GPU migration. The system is evaluated on real-world production traces from Shengshu Technology across multiple model sizes and clusters of up to 64 NVIDIA B300 GPUs, claiming 37.5% reduction in worst-case per-chunk latency and 37.2% reduction in total GPU operating cost versus baseline serving configurations. Code is released publicly.

Significance. If the performance claims hold under rigorous validation, this work would be a meaningful contribution to distributed systems for emerging AI serving workloads. Streaming video generation introduces distinct challenges around persistent session state, duration heterogeneity, and bursty demand that differ from both offline generation and standard LLM inference; a dedicated system addressing them could improve efficiency in production video AI services. The public code release supports reproducibility and is a strength.

major comments (1)

[Evaluation] Evaluation section: the reported 37.5% worst-case per-chunk latency reduction and 37.2% GPU cost reduction cannot be verified or attributed to the closed-loop scheduler because the manuscript supplies no description of the baseline serving configurations (e.g., static round-robin vs. no-migration vs. fixed GPU count), no statistics on the production traces (session-duration distribution, burst/idle ratios, heterogeneity metrics), and no details on the 64-GPU experimental configuration or statistical significance testing. These omissions make the central empirical claim load-bearing but unsupported.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the evaluation. We agree that the central performance claims require additional supporting details to be verifiable and attributable to the proposed scheduler. We will revise the manuscript to address this.

Referee: [Evaluation] Evaluation section: the reported 37.5% worst-case per-chunk latency reduction and 37.2% GPU cost reduction cannot be verified or attributed to the closed-loop scheduler because the manuscript supplies no description of the baseline serving configurations (e.g., static round-robin vs. no-migration vs. fixed GPU count), no statistics on the production traces (session-duration distribution, burst/idle ratios, heterogeneity metrics), and no details on the 64-GPU experimental configuration or statistical significance testing. These omissions make the central empirical claim load-bearing but unsupported.

Authors: We acknowledge the validity of this observation. The current manuscript does not provide explicit descriptions of the baseline configurations, detailed trace statistics, full experimental setup parameters for the 64-GPU runs, or statistical significance tests. In the revised version we will add: (1) precise definitions of each baseline (including placement policy, migration usage, and GPU provisioning), (2) summary statistics for the Shengshu production traces (session duration CDF, burst/idle ratios, and heterogeneity measures), (3) hardware and software configuration details for the 64-GPU cluster experiments, and (4) p-values or confidence intervals for the reported latency and cost reductions. These additions will enable readers to verify and attribute the gains to the closed-loop algorithm.

Revision: yes
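One way the promised confidence intervals might be computed is a percentile bootstrap over per-trace (baseline, TurboServe) measurement pairs. The sketch below is only an example of that general technique; the function names, the pairing of measurements, and the defaults are assumptions, and no measurements from the paper are used.

```python
import random
import statistics


def relative_reduction(baseline: float, treated: float) -> float:
    """Fractional reduction of `treated` relative to `baseline`."""
    return (baseline - treated) / baseline


def bootstrap_ci(pairs, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean relative reduction.

    `pairs` is a list of (baseline, treated) values, one pair per trace or run.
    Returns (mean, lower, upper).
    """
    rng = random.Random(seed)
    reductions = [relative_reduction(b, t) for b, t in pairs]
    means = []
    for _ in range(n_resamples):
        # Resample the per-pair reductions with replacement.
        sample = [rng.choice(reductions) for _ in reductions]
        means.append(statistics.fmean(sample))
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(reductions), lower, upper
```

Calling `bootstrap_ci` on worst-case per-chunk latencies, and again on GPU operating costs, would give an interval around each headline percentage instead of a bare average.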

Circularity Check

0 steps flagged

No circularity; empirical system evaluation on external traces

The paper presents TurboServe as a serving system with closed-loop scheduling, evaluated via experiments on real-world production traces from Shengshu Technology across model sizes and up to 64 GPUs. Reported gains (37.5% latency, 37.2% cost) are direct experimental outcomes versus baselines, with no derivation chain, equations, fitted parameters renamed as predictions, or self-citation load-bearing steps. The work is self-contained against external benchmarks and traces.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract provides no information on free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

57 extracted references · 23 canonical work pages · 8 internal anchors

[1]

Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve

Amey Agrawal, Nitin Kedia, Ashish Panwar, Jayashree Mohan, Nipun Kwatra, Bhargav Gulavani, Alexey Tumanov, and Ramachandran Ramjee. Taming Throughput-Latency tradeoff in LLM inference with Sarathi-Serve. In 18th USENIX symposium on operating systems design and implementation (OSDI 24), pages 117–134, 2024

2024
[2]

Amazon ec2 p5 instances

Amazon Web Services. Amazon ec2 p5 instances. https://aws.amazon.com/ec2/instance-types/p5/, 2026

2026
[3]

Video generation models as world simulators

Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

2024
[4]

Genie: Generative interactive environments

Jake Bruce, Michael D Dennis, Ashley Edwards, Jack Parker-Holder, Yuge Shi, Edward Hughes, Matthew Lai, Aditi Mavalankar, Richie Steigerwald, Chris Apps, et al. Genie: Generative interactive environments. In Forty-first International Conference on Machine Learning, 2024

2024
[5]

Flashserve: Cost-efficient serverless inference scheduling for large language models via tiered memory management and predictive autoscaling

Bolin Chen. Flashserve: Cost-efficient serverless inference scheduling for large language models via tiered memory management and predictive autoscaling. In Proceedings of the 2025 6th International Conference on Computer Science and Management Technology, pages 1291–1298, 2025

2025
[6]

Gpurdma: Gpu-side library for high performance networking from gpu kernels

Feras Daoud, Amir Watad, and Mark Silberstein. Gpurdma: Gpu-side library for high performance networking from gpu kernels. In Proceedings of the 6th international Workshop on Runtime and Operating Systems for Supercomputers, pages 1–8, 2016

2016
[7]

Quasar: Resource-efficient and qos-aware cluster management

Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-efficient and qos-aware cluster management. ACM Sigplan Notices, 49(4):127–144, 2014

2014
[8]

xdit: an inference engine for diffusion transformers (dits) with massive parallelism

Jiarui Fang, Jinzhe Pan, Xibo Sun, Aoyu Li, and Jiannan Wang. xdit: an inference engine for diffusion transformers (dits) with massive parallelism. arXiv preprint arXiv:2411.01738, 2024

work page arXiv 2024
[9]

Fastvideo: A unified framework for accelerated video generation

FastVideo Team. Fastvideo: A unified framework for accelerated video generation. https://haoailab.com/blogs/fastvideo/, 2025

2025
[10]

Streamdiffusionv2: A streaming system for dynamic and interactive video generation

Tianrui Feng, Zhi Li, Shuo Yang, Haocheng Xi, Muyang Li, Xiuyu Li, Lvmin Zhang, Keting Yang, Kelly Peng, Song Han, et al. Streamdiffusionv2: A streaming system for dynamic and interactive video generation. arXiv preprint arXiv:2511.07399, 2025

work page arXiv 2025
[11]

ServerlessLLM: Low-Latency serverless inference for large language models

Yao Fu, Leyang Xue, Yeqi Huang, Andrei-Octavian Brabete, Dmitrii Ustiugov, Yuvraj Patel, and Luo Mai. ServerlessLLM: Low-Latency serverless inference for large language models. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 135–153, 2024

2024
[12]

Efficient multi-round llm inference over disaggregated serving

Wenhao He, Youhe Jiang, Penghao Zhao, Quanqing Xu, Eiko Yoneki, Bin Cui, and Fangcheng Fu. Efficient multi-round llm inference over disaggregated serving. arXiv preprint arXiv:2602.14516, 2026

work page arXiv 2026
[13]

Streamingt2v: Consistent, dynamic, and extendable long video generation from text

Roberto Henschel, Levon Khachatryan, Hayk Poghosyan, Daniil Hayrapetyan, Vahram Tadevosyan, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Streamingt2v: Consistent, dynamic, and extendable long video generation from text. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2568–2577, 2025

2025
[14]

Computer simulation using particles

Roger W Hockney and James W Eastwood. Computer simulation using particles. crc Press, 2021

2021
[15]

Self forcing: Bridging the train-test gap in autoregressive video diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. Advances in Neural Information Processing Systems, 38:167283–167308, 2026

2026
[16]

Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency

Team HunyuanWorld. Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency. arXiv preprint, 2025

2025
[17]

Hexgen: Generative inference of large language model over heterogeneous environment

Youhe Jiang, Ran Yan, Xiaozhe Yao, Yang Zhou, Beidi Chen, and Binhang Yuan. Hexgen: Generative inference of large language model over heterogeneous environment. arXiv preprint arXiv:2311.11514, 2023

work page arXiv 2023
[18]

Demystifying cost-efficiency in llm serving over heterogeneous gpus

Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Guoliang He, Xupeng Miao, Ana Klimovic, Bin Cui, Binhang Yuan, and Eiko Yoneki. Demystifying cost-efficiency in llm serving over heterogeneous gpus. arXiv preprint arXiv:2502.00722, 2025

work page arXiv 2025
[19]

Thunderserve: High-performance and cost-efficient llm serving in cloud environments

Youhe Jiang, Fangcheng Fu, Xiaozhe Yao, Taiyi Wang, Bin Cui, Ana Klimovic, and Eiko Yoneki. Thunderserve: High-performance and cost-efficient llm serving in cloud environments. Proceedings of Machine Learning and Systems, 7, 2025

2025
[20]

Cascadia: An efficient cascade serving system for large language models

Youhe Jiang, Fangcheng Fu, Wanru Zhao, Stephan Rabanser, Jintao Zhang, Nicholas D Lane, and Binhang Yuan. Cascadia: An efficient cascade serving system for large language models. arXiv preprint arXiv:2506.04203, 2025

work page arXiv 2025
[21]

Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment

Youhe Jiang, Ran Yan, and Binhang Yuan. Hexgen-2: Disaggregated generative inference of llms in heterogeneous environment. arXiv preprint arXiv:2502.07903, 2025

work page arXiv 2025
[22]

OServe: Accelerating LLM Serving via Spatial-Temporal Workload Orchestration

Youhe Jiang, Fangcheng Fu, Taiyi Wang, Guoliang He, and Eiko Yoneki. Oserve: Accelerating llm serving via spatial-temporal workload orchestration. arXiv preprint arXiv:2602.12151, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[23]

Boute: Cost-efficient llm serving with heterogeneous llms and gpus via multi-objective bayesian optimization

Youhe Jiang, Fangcheng Fu, and Eiko Yoneki. Boute: Cost-efficient llm serving with heterogeneous llms and gpus via multi-objective bayesian optimization. arXiv preprint arXiv:2602.10729, 2026

work page arXiv 2026
[24]

HunyuanVideo: A Systematic Framework For Large Video Generative Models

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models. arXiv preprint arXiv:2412.03603, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Horizontal pod autoscaler

Kubernetes. Horizontal pod autoscaler. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/, 2024

2024
[26]

Efficient memory management for large language model serving with pagedattention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023

2023
[27]

Taming the chaos: Coordinated autoscaling for heterogeneous and disaggregated llm inference

Rongzhi Li, Ruogu Du, Zefang Chu, Sida Zhao, Chunlei Han, Zuocheng Shi, Yiwen Shao, Huanle Han, Long Huang, Zherui Liu, et al. Taming the chaos: Coordinated autoscaling for heterogeneous and disaggregated llm inference. arXiv preprint arXiv:2508.19559, 2025

work page arXiv 2025
[28]

AlpaServe: Statistical multiplexing with model parallelism for deep learning serving

Zhuohan Li, Lianmin Zheng, Yinmin Zhong, Vincent Liu, Ying Sheng, Xin Jin, Yanping Huang, Zhifeng Chen, Hao Zhang, Joseph E Gonzalez, et al. AlpaServe: Statistical multiplexing with model parallelism for deep learning serving. In 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23), pages 663–679, 2023

2023
[29]

Looking backward: Streaming video-to-video translation with feature banks

Feng Liang, Akio Kodaira, Chenfeng Xu, Masayoshi Tomizuka, Kurt Keutzer, and Diana Marculescu. Looking backward: Streaming video-to-video translation with feature banks. In International Conference on Learning Representations, volume 2025, pages 46425–46445, 2025

2025
[30]

Skyserve: Serving ai models across regions and clouds with spot instances

Ziming Mao, Tian Xia, Zhanghao Wu, Wei-Lin Chiang, Tyler Griggs, Romil Bhardwaj, Zongheng Yang, Scott Shenker, and Ion Stoica. Skyserve: Serving ai models across regions and clouds with spot instances. In Proceedings of the Twentieth European Conference on Computer Systems, pages 159–175, 2025

2025
[31]

Video diffusion models: A survey

Andrew Melnik, Michal Ljubljanac, Cong Lu, Qi Yan, Weiming Ren, and Helge Ritter. Video diffusion models: A survey. arXiv preprint arXiv:2405.03150, 2024

work page arXiv 2024
[32]

Gpudirect rdma

NVIDIA. Gpudirect rdma. https://docs.nvidia.com/cuda/gpudirect-rdma/, 2026

2026
[33]

Nvidia inference xfer library (nixl)

NVIDIA. Nvidia inference xfer library (nixl). https://github.com/ai-dynamo/nixl, 2026

2026
[34]

Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql

You Peng, Youhe Jiang, Wenqi Jiang, Chen Wang, and Binhang Yuan. Hexgen-flow: Optimizing llm inference request scheduling for agentic text-to-sql. arXiv preprint arXiv:2505.05286, 2025

work page arXiv 2025
[35]

Movie Gen: A Cast of Media Foundation Models

Adam Polyak, Amit Zohar, Andrew Brown, Andros Tjandra, Animesh Sinha, Ann Lee, Apoorv Vyas, Bowen Shi, Chih-Yao Ma, Ching-Yao Chuang, et al. Movie gen: A cast of media foundation models. arXiv preprint arXiv:2410.13720, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[36]

Auto-scaling web applications in clouds: A taxonomy and survey

Chenhao Qu, Rodrigo N Calheiros, and Rajkumar Buyya. Auto-scaling web applications in clouds: A taxonomy and survey. ACM Computing Surveys (CSUR), 51(4):1–33, 2018

2018
[37]

Llumnix: Dynamic scheduling for large language model serving

Biao Sun, Ziming Huang, Hanyu Zhao, Wencong Xiao, Xinyi Zhang, Yong Li, and Wei Lin. Llumnix: Dynamic scheduling for large language model serving. In 18th USENIX symposium on operating systems design and implementation (OSDI 24), pages 173–191, 2024

2024
[38]

Parallax: Efficient llm inference service over decentralized environment

Chris Tong, Youhe Jiang, Gufeng Chen, Tianyi Zhao, Sibian Lu, Wenjie Qu, Eric Yang, Lynn Ai, and Binhang Yuan. Parallax: Efficient llm inference service over decentralized environment. arXiv preprint arXiv:2509.26182, 2025

work page arXiv 2025
[39]

Diffusion models are real-time game engines

Dani Valevski, Yaniv Leviathan, Moab Arar, and Shlomi Fruchter. Diffusion models are real-time game engines. In International Conference on Learning Representations, volume 2025, pages 73754–73776, 2025

2025
[40]

Veo 3 technical report

Veo Team, Google DeepMind. Veo 3 technical report. Technical report, Google DeepMind, 2025. URL https://storage.googleapis.com/deepmind-media/veo/Veo-3-Tech-Report.pdf

2025
[41]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Fast Distributed Inference Serving for Large Language Models

Bingyang Wu, Yinmin Zhong, Zili Zhang, Shengyu Liu, Fangyue Liu, Yuanhang Sun, Gang Huang, Xuanzhe Liu, and Xin Jin. Fast distributed inference serving for large language models. arXiv preprint arXiv:2305.05920, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[43]

Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism

Bingyang Wu, Shengyu Liu, Yinmin Zhong, Peng Sun, Xuanzhe Liu, and Xin Jin. Loongserve: Efficiently serving long-context large language models with elastic sequence parallelism. In Proceedings of the ACM SIGOPS 30th Symposium on Operating Systems Principles, pages 640–654, 2024

2024
[44]

Tridentserve: A stage-level serving system for diffusion pipelines

Yifei Xia, Fangcheng Fu, Hao Yuan, Hanke Zhang, Xupeng Miao, Yijun Liu, Suhan Ling, Jie Jiang, and Bin Cui. Tridentserve: A stage-level serving system for diffusion pipelines. arXiv preprint arXiv:2510.02838, 2025

work page arXiv 2025
[45]

Aegaeon: Effective gpu pooling for concurrent llm serving on the market

Yuxing Xiang, Xue Li, Kun Qian, Yufan Yang, Diwen Zhu, Wenyuan Yu, Ennan Zhai, Xuanzhe Liu, Xin Jin, and Jingren Zhou. Aegaeon: Effective gpu pooling for concurrent llm serving on the market. In Proceedings of the ACM SIGOPS 31st Symposium on Operating Systems Principles, pages 1030–1045, 2025

2025
[46]

FSA: An Alternative Efficient Implementation of Native Sparse Attention Kernel

Ran Yan, Youhe Jiang, Zhuoming Chen, Haohui Mai, Beidi Chen, and Binhang Yuan. Fsa: An alternative efficient implementation of native sparse attention kernel. arXiv preprint arXiv:2508.18224, 2025

work page internal anchor Pith review arXiv 2025
[47]

LongLive: Real-time Interactive Long Video Generation

Shuai Yang, Wei Huang, Ruihang Chu, Yicheng Xiao, Yuyang Zhao, Xianbang Wang, Muyang Li, Enze Xie, Yingcong Chen, Yao Lu, et al. Longlive: Real-time interactive long video generation. arXiv preprint arXiv:2509.22622, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[48]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In International Conference on Learning Representations, volume 2025, pages 83048–83077, 2025

2025
[49]

How SwissAI uses OpenTela for scalable LLM serving

Xiaozhe Yao. How SwissAI uses OpenTela for scalable LLM serving. Xiaozhe Yao (Blog), March 2026. URL https://about.yao.sh/posts/opentela-swissai/. Accessed: 2026-03-16

2026
[50]

Flashinfer: Efficient and customizable attention engine for llm inference serving

Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, et al. Flashinfer: Efficient and customizable attention engine for llm inference serving. Proceedings of Machine Learning and Systems, 7, 2025

2025
[51]

vLLM-Omni: Fully Disaggregated Serving for Any-to-Any Multimodal Models

Peiqi Yin, Jiangyun Zhu, Han Gao, Chenguang Zheng, Yongxiang Huang, Taichang Zhou, Ruirui Yang, Weizhi Liu, Weiqing Chen, Canlin Guo, et al. vllm-omni: Fully disaggregated serving for any-to-any multimodal models. arXiv preprint arXiv:2602.02204, 2026

work page arXiv 2026
[52]

Orca: A distributed serving system for Transformer-Based generative models

Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX symposium on operating systems design and implementation (OSDI 22), pages 521–538, 2022

2022
[53]

Turbodiffusion: Accelerating video diffusion models by 100-200 times

Jintao Zhang, Kaiwen Zheng, Kai Jiang, Haoxu Wang, Ion Stoica, Joseph E Gonzalez, Jianfei Chen, and Jun Zhu. Turbodiffusion: Accelerating video diffusion models by 100-200 times. arXiv preprint arXiv:2512.16093, 2025

work page arXiv 2025
[54]

LMDeploy Accelerates Mixed-Precision LLM Inference with TurboMind

Li Zhang, Youhe Jiang, Guoliang He, Xin Chen, Han Lv, Qian Yao, Fangcheng Fu, and Kai Chen. Efficient mixed-precision large language model inference with turbomind. arXiv preprint arXiv:2508.15601, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Sglang: Efficient execution of structured language model programs

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems, 37:62557–62583, 2024

2024
[56]

DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving

Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI 24), pages 193–210, 2024

2024
[57]

Replace. Commit the update $(\lambda(t), \hat{\rho}(t)) \leftarrow (\lambda_\ell(t), \rho^*_\ell(t))$, which takes effect in subsequent autoscaling decisions. Profiling case study. We illustrate the offline profiling and the resulting volatility-to-parameter mapping on a representative trace family consisting of $L = 10$ segments of monotonically increasing volatility, generated by progressively scaling...

Table 5: Workload characteristics of the ten profiling...
