TurboServe: Serving Streaming Video Generation Efficiently and Economically
Pith reviewed 2026-06-26 19:21 UTC · model grok-4.3
The pith
TurboServe reduces worst-case per-chunk latency by 37.5% and GPU operating costs by 37.2% for streaming video generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a closed-loop scheduling algorithm coordinating migration-aware placement and load-driven autoscaling can reduce worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average compared to baseline configurations. This is achieved by treating streaming video generation as an online scheduling problem in multi-GPU environments and implementing coalesced chunk processing, GPU-CPU offloading, and NCCL-based GPU-GPU migration to support the scheduling decisions at runtime. The evaluation uses real-world production traces across multiple model sizes and clusters up to 64 GPUs.
What carries the argument
The closed-loop scheduling algorithm consisting of a migration-aware placement controller and a load-driven autoscaling controller.
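As a rough illustration of how two such controllers might be composed into one loop, the Python sketch below pairs a greedy rebalancer with a threshold autoscaler. It is not the paper's algorithm. The load model (per-GPU load equals the number of resident sessions), the `migration_cost` penalty, the per-GPU `capacity`, and the `low`/`high` utilization thresholds are all assumptions chosen for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Cluster:
    # GPU id -> ids of sessions placed on it. Hypothetical bookkeeping only;
    # the sketch assumes the cluster always holds at least one GPU.
    gpus: dict[int, list[str]] = field(default_factory=dict)

    def load(self, gpu: int) -> int:
        return len(self.gpus[gpu])

    def max_load(self) -> int:
        return max((len(s) for s in self.gpus.values()), default=0)


def rebalance(cluster: Cluster, migration_cost: int = 1) -> list[tuple[str, int, int]]:
    """Greedy placement step: move sessions from the most loaded GPU to the
    least loaded one while the load gap exceeds the assumed migration cost."""
    moves = []
    min_gap = max(migration_cost, 1)  # a gap of 1 can never be improved
    while len(cluster.gpus) > 1:
        hot = max(cluster.gpus, key=cluster.load)
        cold = min(cluster.gpus, key=cluster.load)
        if cluster.load(hot) - cluster.load(cold) <= min_gap:
            break
        session = cluster.gpus[hot].pop()
        cluster.gpus[cold].append(session)
        moves.append((session, hot, cold))
    return moves


def autoscale(cluster: Cluster, capacity: int, low: float = 0.3, high: float = 0.8) -> int:
    """Return +1, -1, or 0 GPUs based on assumed utilization thresholds."""
    total = sum(len(s) for s in cluster.gpus.values())
    utilization = total / (capacity * len(cluster.gpus))
    if utilization > high:
        return 1
    if utilization < low and len(cluster.gpus) > 1:
        return -1
    return 0


def control_step(cluster: Cluster, capacity: int = 4) -> list[tuple[str, int, int]]:
    """One closed-loop iteration: adjust the GPU budget, then rebalance."""
    delta = autoscale(cluster, capacity)
    if delta > 0:
        cluster.gpus[max(cluster.gpus) + 1] = []
    elif delta < 0:
        # Drain the least loaded GPU and hand its sessions to the others.
        victim = min(cluster.gpus, key=cluster.load)
        for session in cluster.gpus.pop(victim):
            target = min(cluster.gpus, key=cluster.load)
            cluster.gpus[target].append(session)
    return rebalance(cluster)
```

Running the autoscaler before the rebalancer means a newly added GPU receives sessions in the same step. A real system would also have to weigh the time a migration takes against the latency it saves, which this sketch collapses into a single integer.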
Load-bearing premise
That the chosen baselines represent standard serving configurations and that the evaluated traces include sufficient variation to demonstrate gains in typical deployments.
What would settle it
Measuring the latency and cost metrics on a new set of traces with more extreme session duration differences or demand fluctuations than those used in the paper.
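One way to build such stress traces, sketched here under stated assumptions, is to synthesize sessions whose duration tail and arrival rate can be tuned, then quantify how extreme each trace is before replaying it. The generator, the Pareto duration model, and both metrics below are illustrative choices, not taken from the paper or its traces.

```python
import random
import statistics


def synth_trace(n_sessions, tail_alpha, mean_gap, seed=0):
    """Return (start, end) pairs for a synthetic trace.

    Durations are Pareto-distributed (assumption): a smaller tail_alpha gives
    a heavier tail, i.e. more extreme long-running sessions. Inter-arrival
    gaps are exponential with the given mean.
    """
    rng = random.Random(seed)
    now, trace = 0.0, []
    for _ in range(n_sessions):
        now += rng.expovariate(1.0 / mean_gap)
        duration = rng.paretovariate(tail_alpha)
        trace.append((now, now + duration))
    return trace


def duration_cv(trace):
    """Coefficient of variation of session durations (heterogeneity proxy)."""
    durations = [end - start for start, end in trace]
    return statistics.pstdev(durations) / statistics.fmean(durations)


def peak_to_mean_concurrency(trace):
    """Sweep start/end events and compare peak concurrency to its time average."""
    events = sorted([(s, 1) for s, _ in trace] + [(e, -1) for _, e in trace])
    active = peak = 0
    area, prev = 0.0, events[0][0]
    for time, delta in events:
        area += active * (time - prev)
        prev = time
        active += delta
        peak = max(peak, active)
    span = events[-1][0] - events[0][0]
    return peak / (area / span)


if __name__ == "__main__":
    for alpha in (5.0, 1.2):
        trace = synth_trace(2000, tail_alpha=alpha, mean_gap=0.5)
        print(alpha, round(duration_cv(trace), 2),
              round(peak_to_mean_concurrency(trace), 2))
```

Replaying traces with progressively heavier tails against both the system and its baselines would show whether the reported gains persist, grow, or shrink as heterogeneity increases.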
Original abstract
Streaming video generation is emerging as a new serving workload in which users interact with long-lived sessions that generate video progressively, chunk by chunk. Unlike offline video generation or typical LLM serving, streaming video generation must preserve session state across active and idle periods, repeatedly schedule ongoing sessions, and deliver each chunk under a tight latency target. This creates two key serving challenges in multi-user, multi-GPU environments: session duration heterogeneity, where long-running sessions make placement decisions suboptimal over time, and temporal user-demand heterogeneity, where the number of active sessions fluctuates sharply across bursts and idle periods. We present TurboServe, the first serving system designed specifically for streaming video generation workloads. TurboServe formulates serving as an online scheduling problem that jointly coordinates session placement and GPU provisioning. Its closed-loop scheduling algorithm combines a migration-aware placement controller, which rebalances sessions across GPUs to reduce the maximum per-chunk latency, with a load-driven autoscaling controller, which adapts the GPU budget to workload variation for improved cost efficiency. To support these decisions at runtime, TurboServe implements coalesced chunk processing for batching concurrent active sessions on the same GPU, GPU-CPU offloading for session suspension and resumption, and NCCL-based GPU-GPU migration for online rebalancing. We evaluate TurboServe on real-world production traces from Shengshu Technology across multiple model sizes and GPU clusters with up to 64 NVIDIA B300 GPUs. Compared with baseline serving configurations, TurboServe reduces worst-case per-chunk latency by 37.5% and total GPU operating cost by 37.2% on average. Our code is publicly available at https://github.com/shengshu-ai/TurboServe.
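The three runtime mechanisms named in the abstract can be pictured as bookkeeping over session residency. The sketch below is a pure-Python model under assumptions: the `Session` fields, the `max_batch` limit, and the function names are hypothetical, and no tensor movement is modeled (the actual GPU-CPU copies and NCCL transfers are reduced to field updates).

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Residency(Enum):
    GPU = "gpu"  # session state resident on a GPU
    CPU = "cpu"  # suspended: state offloaded to host memory


@dataclass
class Session:
    sid: str
    gpu: Optional[int]
    residency: Residency = Residency.GPU
    wants_chunk: bool = False  # True while the user is waiting for a chunk


def suspend(session: Session) -> None:
    """GPU-CPU offloading: an idle session gives up its GPU slot."""
    session.residency = Residency.CPU
    session.gpu = None
    session.wants_chunk = False


def resume(session: Session, gpu: int) -> None:
    """Bring an offloaded session back onto a GPU chosen by the scheduler."""
    session.residency = Residency.GPU
    session.gpu = gpu


def migrate(session: Session, dst_gpu: int) -> None:
    """GPU-GPU migration: only GPU-resident sessions can be moved directly."""
    if session.residency is not Residency.GPU:
        raise ValueError("session must be GPU-resident to migrate")
    session.gpu = dst_gpu


def coalesce(sessions, max_batch: int):
    """Coalesced chunk processing: group sessions that need a chunk by GPU,
    splitting each group into batches of at most max_batch sessions."""
    per_gpu = defaultdict(list)
    for s in sessions:
        if s.residency is Residency.GPU and s.wants_chunk:
            per_gpu[s.gpu].append(s.sid)
    return {
        gpu: [sids[i:i + max_batch] for i in range(0, len(sids), max_batch)]
        for gpu, sids in per_gpu.items()
    }
```

In this model a suspended session disappears from every batch until it is resumed, which is the property that lets an autoscaler release GPUs during idle periods without discarding session state.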
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents TurboServe as the first serving system for streaming video generation workloads, which involve long-lived interactive sessions generating video chunk-by-chunk. It formulates serving as an online scheduling problem and proposes a closed-loop algorithm with a migration-aware placement controller and a load-driven autoscaling controller. These are supported at runtime by coalesced chunk processing, GPU-CPU offloading, and NCCL-based GPU-GPU migration. The system is evaluated on real-world production traces from Shengshu Technology across multiple model sizes and clusters of up to 64 NVIDIA B300 GPUs, claiming 37.5% reduction in worst-case per-chunk latency and 37.2% reduction in total GPU operating cost versus baseline serving configurations. Code is released publicly.
Significance. If the performance claims hold under rigorous validation, this work would be a meaningful contribution to distributed systems for emerging AI serving workloads. Streaming video generation introduces distinct challenges around persistent session state, duration heterogeneity, and bursty demand that differ from both offline generation and standard LLM inference; a dedicated system addressing them could improve efficiency in production video AI services. The public code release supports reproducibility and is a strength.
major comments (1)
- [Evaluation] Evaluation section: the reported 37.5% worst-case per-chunk latency reduction and 37.2% GPU cost reduction cannot be verified or attributed to the closed-loop scheduler because the manuscript supplies no description of the baseline serving configurations (e.g., static round-robin vs. no-migration vs. fixed GPU count), no statistics on the production traces (session-duration distribution, burst/idle ratios, heterogeneity metrics), and no details on the 64-GPU experimental configuration or statistical significance testing. These omissions make the central empirical claim load-bearing but unsupported.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the evaluation. We agree that the central performance claims require additional supporting details to be verifiable and attributable to the proposed scheduler. We will revise the manuscript to address this.
Point-by-point responses
-
Referee: [Evaluation] Evaluation section: the reported 37.5% worst-case per-chunk latency reduction and 37.2% GPU cost reduction cannot be verified or attributed to the closed-loop scheduler because the manuscript supplies no description of the baseline serving configurations (e.g., static round-robin vs. no-migration vs. fixed GPU count), no statistics on the production traces (session-duration distribution, burst/idle ratios, heterogeneity metrics), and no details on the 64-GPU experimental configuration or statistical significance testing. These omissions make the central empirical claim load-bearing but unsupported.
Authors: We acknowledge the validity of this observation. The current manuscript does not provide explicit descriptions of the baseline configurations, detailed trace statistics, full experimental setup parameters for the 64-GPU runs, or statistical significance tests. In the revised version we will add: (1) precise definitions of each baseline (including placement policy, migration usage, and GPU provisioning), (2) summary statistics for the Shengshu production traces (session duration CDF, burst/idle ratios, and heterogeneity measures), (3) hardware and software configuration details for the 64-GPU cluster experiments, and (4) p-values or confidence intervals for the reported latency and cost reductions. These additions will enable readers to verify and attribute the gains to the closed-loop algorithm.
Revision: yes
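The trace statistics and confidence intervals promised here can be produced with standard tooling. The sketch below uses only the Python standard library and assumes hypothetical inputs: a list of session durations, a series of sampled concurrency counts, and paired per-run measurements for a baseline and the system under test. The burst threshold and the percentile bootstrap are illustrative choices, not the authors' stated method.

```python
import random
import statistics


def empirical_cdf(values):
    """Return sorted (value, cumulative fraction) pairs."""
    ordered = sorted(values)
    n = len(ordered)
    return [(v, (i + 1) / n) for i, v in enumerate(ordered)]


def burst_idle_ratio(active_counts, threshold):
    """Ratio of sampling intervals above `threshold` (burst) to the rest (idle)."""
    burst = sum(1 for c in active_counts if c > threshold)
    idle = len(active_counts) - burst
    return burst / idle if idle else float("inf")


def bootstrap_ci(baseline, treated, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean paired relative reduction."""
    rng = random.Random(seed)
    reductions = [(b - t) / b for b, t in zip(baseline, treated)]
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(reductions, k=len(reductions))
        means.append(statistics.fmean(sample))
    means.sort()
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Placeholder numbers for illustration only; not data from the paper.
    base = [820.0, 790.0, 905.0, 760.0, 880.0]
    ours = [510.0, 495.0, 560.0, 480.0, 545.0]
    print(bootstrap_ci(base, ours))
```

An interval that excludes zero across model sizes and cluster scales would support attributing the reductions to the scheduler rather than to run-to-run variance.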
Circularity Check
No circularity; empirical system evaluation on external traces
Full rationale
The paper presents TurboServe as a serving system with closed-loop scheduling, evaluated experimentally on real-world production traces from Shengshu Technology across multiple model sizes and clusters of up to 64 GPUs. The reported gains (a 37.5% reduction in worst-case per-chunk latency and a 37.2% reduction in GPU operating cost) are direct experimental outcomes measured against baselines. They do not rest on a derivation chain, equations, fitted parameters renamed as predictions, or load-bearing self-citations. The evaluation is grounded in external traces and baselines rather than in quantities the paper defines for itself.