M*: A Modular, Extensible, Serving System for Multimodal Models
Pith reviewed 2026-06-27 09:58 UTC · model grok-4.3
The pith
M* represents composite multimodal models as Walk Graphs to serve diverse architectures with lower latency and higher throughput than prior systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
M* represents models as dataflow graphs and processes requests spanning diverse modalities as traversals over these graphs. The Walk Graph abstraction supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. This allows M* to concisely capture composite models from a broad range of families and yields average 20% lower end-to-end latency than vLLM-Omni on text-to-image tasks with BAGEL, up to 2.9x lower real-time factor and 2.7x higher throughput on text-to-speech with Qwen3-Omni, and up to 12.5x better performance than the V-JEPA 2-AC baseline on robotic planning.
What carries the argument
The Walk Graph, an abstraction that represents composite models as traversable dataflow graphs to enable modular composition, flexible placement, and model-agnostic optimizations.
If this is right
- Average 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL.
- Up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni.
- Up to 12.5x better performance than the V-JEPA 2-AC rollout baseline for robotic planning.
- Enables serving of complex models with minimal developer effort across unified multimodal, omni, speech-language, and vision-language-action families.
Where Pith is reading between the lines
- The graph-based approach could reduce the need for separate serving stacks when new modalities are added to an existing model.
- Cluster operators might achieve higher utilization by letting the Walk Graph decide component placement dynamically.
- If the abstraction scales, it could shorten the time from model training to production deployment for composite architectures.
Load-bearing premise
The Walk Graph can concisely capture composite models from a broad range of families while supporting model-agnostic optimizations and flexible cluster placement without prohibitive overhead or model-specific adaptations.
What would settle it
Deploying M* on a previously untested composite model family and measuring whether end-to-end latency exceeds that of a specialized baseline by more than the reported margins.
Figures
read the original abstract
We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited to accommodate this new architectural diversity. Here we present M*, a universal serving system for efficient serving of composite AI models. M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it can concisely capture composite models from a broad range of families. We instantiate M* on representative models and find that it achieves, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, while delivering up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Thus, our work paves the road towards more efficient serving of complex models with minimal developer effort.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces M*, a universal serving system for composite multimodal models (e.g., unified multimodal, omni, speech-language, vision-language-action, and world models). It represents models as dataflow graphs via the Walk Graph abstraction, which supports arbitrary component composition, flexible cluster placement, and model-agnostic optimizations in a distributed runtime. Evaluations on BAGEL (text-to-image), Qwen3-Omni (text-to-speech), and robotic planning tasks report average 20% lower end-to-end latency vs. vLLM-Omni, up to 2.9x lower real-time factor and 2.7x higher throughput, and up to 12.5x improvement over the V-JEPA 2-AC baseline.
Significance. If the Walk Graph abstraction generalizes across model families without prohibitive overhead and the reported speedups prove reproducible, the work could meaningfully advance efficient serving of emerging composite multimodal architectures with reduced developer effort. The identification of limitations in existing narrow-assumption frameworks is a useful framing.
major comments (2)
- Abstract: performance numbers (20% latency reduction, 2.9x/2.7x TTS gains, 12.5x planning improvement) are stated without any experimental protocol, baseline configurations, hardware details, or error bars, so the central empirical claims cannot be assessed for soundness.
- Walk Graph section (core abstraction): the claim that the abstraction 'concisely capture[s] composite models from a broad range of families' while remaining model-agnostic and low-overhead is load-bearing for the paper's generality argument, yet evaluation is confined to three specific models (BAGEL, Qwen3-Omni, robotic planning) with no additional evidence or overhead analysis provided for broader applicability.
minor comments (1)
- Add explicit pseudocode or formal definition of the Walk Graph traversal and placement algorithms to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the presentation of empirical claims and the generality argument. We address each major comment below.
read point-by-point responses
-
Referee: Abstract: performance numbers (20% latency reduction, 2.9x/2.7x TTS gains, 12.5x planning improvement) are stated without any experimental protocol, baseline configurations, hardware details, or error bars, so the central empirical claims cannot be assessed for soundness.
Authors: We agree this is a valid concern for assessability. In the revised manuscript we will expand the abstract with a concise statement of the hardware platform (8x NVIDIA A100-80GB), the primary baselines (vLLM-Omni and V-JEPA 2-AC), and a note that full protocols, configurations, and error bars (reported over 5 runs) appear in Sections 5–6. Abstract length limits preclude full experimental protocols, but the added context will allow readers to locate the supporting details. revision: yes
-
Referee: Walk Graph section (core abstraction): the claim that the abstraction 'concisely capture[s] composite models from a broad range of families' while remaining model-agnostic and low-overhead is load-bearing for the paper's generality argument, yet evaluation is confined to three specific models (BAGEL, Qwen3-Omni, robotic planning) with no additional evidence or overhead analysis provided for broader applicability.
Authors: The three workloads were chosen as representatives of distinct families (unified multimodal, omni/speech-language, and vision-language-action/world models). Section 3 already provides the formal definition and construction rules that are intentionally model-agnostic. To strengthen the generality claim we will add (1) explicit overhead measurements for Walk Graph construction and traversal (time and memory) on the evaluated models and (2) a short discussion subsection that maps the abstraction to two additional families (pure vision-language and diffusion-based world models) using the same construction rules. These additions rely on existing design material rather than new experiments. revision: partial
Circularity Check
No significant circularity
full rationale
The paper is a systems description of a serving framework whose central claims are empirical performance improvements measured against named external baselines (vLLM-Omni, V-JEPA 2-AC). No equations, fitted parameters, or derivation chains appear; the Walk Graph is presented as an engineering abstraction whose generality is asserted via implementation on representative models rather than proven by internal reduction. No self-citation load-bearing steps or ansatz smuggling are detectable from the supplied text.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Existing model serving frameworks were built on narrow assumptions about model structure
- domain assumption Composite models can be represented as dataflow graphs that support arbitrary composition and model-agnostic optimizations
invented entities (1)
-
Walk Graph
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Cosmos world foundation model platform for physical AI
Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, S...
Pith/arXiv arXiv 2025
-
[2]
Diffusion for world modeling: Visual details matter in Atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024
Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in Atari.Advances in Neural Information Processing Systems, 37:58757–58791, 2024
2024
-
[3]
Seed-tts: A family of high-quality versatile speech generation models, 2024
Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, J...
Pith/arXiv arXiv 2024
-
[4]
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiao...
Pith/arXiv arXiv 2025
-
[5]
Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025
Pith/arXiv arXiv 2025
-
[6]
Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y . Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Lucy X...
Pith/arXiv arXiv 2025
-
[7]
Food-101 – mining discriminative components with random forests
Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. InEuropean Conference on Computer Vision, 2014
2014
-
[8]
Orpheus-TTS: Towards human-sounding speech
Canopy Labs. Orpheus-TTS: Towards human-sounding speech. GitHub repository, 2025. URL https: //github.com/canopyai/Orpheus-TTS
2025
-
[9]
Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025
Pith/arXiv arXiv 2025
-
[10]
Emerging properties in unified multimodal pretraining
Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025
Pith/arXiv arXiv 2025
-
[11]
Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, and Jingren Zhou. CosyV oice 2: Scalable streaming speech synthesis with large language models.arXiv preprint arXiv:2412.10117, 2024
Pith/arXiv arXiv 2024
-
[12]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. InProceedings of the 41st Internatio...
2024
-
[13]
Jiarui Fang and Shangchun Zhao. USP: A unified sequence parallelism approach for long context generative ai.arXiv preprint arXiv:2405.07719, 2024
arXiv 2024
-
[14]
Jiarui Fang, Jinzhe Pan, Xibo Sun, Aoyu Li, and Jiannan Wang. xdit: an inference engine for diffusion transformers (dits) with massive parallelism.arXiv preprint arXiv:2411.01738, 2024. 12
arXiv 2024
-
[15]
Pipefusion: Patch-level pipeline parallelism for diffusion transformers inference
Jiarui Fang, Jinzhe Pan, Aoyu Li, Xibo Sun, and WANG Jiannan. Pipefusion: Patch-level pipeline parallelism for diffusion transformers inference. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=5xwyxupsLL
2025
-
[16]
Mastering diverse domains through world models
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023
Pith/arXiv arXiv 2023
-
[17]
O’Reilly Media, Inc
Pieter Hintjens.ZeroMQ: messaging for many applications. " O’Reilly Media, Inc.", 2013
2013
-
[18]
Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598, 2022
Pith/arXiv arXiv 2022
-
[19]
Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, et al. Step-Audio: Unified understanding and generation in intelligent speech interaction.arXiv preprint arXiv:2502.11946, 2025
Pith/arXiv arXiv 2025
-
[20]
Vbench: Comprehensive benchmark suite for video generative models
Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024
2024
-
[21]
Keisuke Kamahori, Wei-Tzu Lee, Atindra Jha, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. V oxServe: Streaming-centric serving system for speech language models.arXiv preprint arXiv:2602.00269, 2026
arXiv 2026
-
[22]
Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...
Pith/arXiv arXiv 2024
-
[23]
Efficient memory management for large language model serving with PagedAttention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. InProceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626, 2023
2023
-
[24]
FlashDrive: Flash Vision-Language- Action Inference For Autonomous Driving.ES-Reasoning Workshop at ICLR, 2026
Zekai Li, Yihao Liang, Hongfei Zhang, Jian Chen, and Zhijian Liu. FlashDrive: Flash Vision-Language- Action Inference For Autonomous Driving.ES-Reasoning Workshop at ICLR, 2026
2026
-
[25]
Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models.arXiv preprint arXiv:2411.04996, 2024
Pith/arXiv arXiv 2024
-
[26]
Flow straight and fast: Learning to generate and transfer data with rectified flow
Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. InInternational Conference on Learning Representations (ICLR), 2023
2023
-
[27]
Ma, Jae-Won Chung, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, and Mosharaf Chowd- hury
Jeff J. Ma, Jae-Won Chung, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, and Mosharaf Chowd- hury. Cornserve: Efficiently serving any-to-any multimodal models.arXiv preprint arXiv:2512.14098, 2025
Pith/arXiv arXiv 2025
-
[28]
Splitwise: Efficient generative LLM inference using phase splitting
Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. InProceedings of the 51st Annual International Symposium on Computer Architecture (ISCA), 2024
2024
-
[29]
Scalable diffusion models with transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023. 13
2023
-
[30]
Mooncake: A KVCache-centric disaggregated architecture for LLM serving
Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: A KVCache-centric disaggregated architecture for LLM serving. In Proceedings of the 23rd USENIX Conference on File and Storage Technologies (FAST), 2025
2025
-
[31]
ModServe: Modality- and stage-aware resource disaggregation for scalable multimodal model serving
Haoran Qiu, Anish Biswas, Zihan Zhao, Jayashree Mohan, Alind Khare, Esha Choukse, Íñigo Goiri, Zeyu Zhang, Haiying Shen, Chetan Bansal, Ramachandran Ramjee, and Rodrigo Fonseca. ModServe: Modality- and stage-aware resource disaggregation for scalable multimodal model serving. InProceedings of the ACM Symposium on Cloud Computing (SoCC), 2025
2025
-
[32]
Robust speech recognition via large-scale weak supervision
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. InInternational conference on machine learning, pages 28492–28518. PMLR, 2023
2023
-
[33]
SGLang-Omni: High-performance multi-stage pipeline framework for omni models
sgl-project. SGLang-Omni: High-performance multi-stage pipeline framework for omni models. GitHub repository, 2026. URLhttps://github.com/sgl-project/sglang-omni. Accessed 2026-03-20
2026
-
[34]
Gursimran Singh, Xinglu Wang, Yifan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, et al. Efficiently serving large multimodal models using epd disaggregation.arXiv preprint arXiv:2501.05460, 2025
arXiv 2025
-
[35]
Snac: Multi-scale neural audio codec.arXiv preprint arXiv:2410.14411, 2024
Hubert Siuzdak, Florian Grötschla, and Luca A Lanzendörfer. Snac: Multi-scale neural audio codec.arXiv preprint arXiv:2410.14411, 2024
arXiv 2024
-
[36]
Xibo Sun, Jiarui Fang, Aoyu Li, and Jinzhe Pan. Unveiling redundancy in diffusion transformers (dits): A systematic study.arXiv preprint arXiv:2411.13588, 2024
arXiv 2024
-
[37]
Inferix Team, Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, et al. Inferix: A block-diffusion based next-generation inference engine for world simulation.arXiv preprint arXiv:2511.20714, 2025
Pith/arXiv arXiv 2025
-
[38]
Diffusers: State-of-the-art diffusion models.https://github.com/huggingface/diffusers, 2022
Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models.https://github.com/huggingface/diffusers, 2022
2022
-
[39]
Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail.arXiv preprint arXiv:2511.00088, 2025
Pith/arXiv arXiv 2025
-
[40]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing.arXiv preprint arXiv:1910.03771, 2019
Pith/arXiv arXiv 1910
-
[41]
Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...
2020
-
[42]
Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025
Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025
Pith/arXiv arXiv 2025
-
[43]
Qwen2.5-omni technical report.arXiv preprint arXiv:2503.20215, 2025
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report.arXiv preprint arXiv:2503.20215, 2025
Pith/arXiv arXiv 2025
-
[44]
Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025
Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen3-omni technical report.arXiv preprint arXiv:2509.17765, 2025
Pith/arXiv arXiv 2025
-
[45]
FlashInfer: Efficient and customizable attention engine for LLM inference serving
Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. FlashInfer: Efficient and customizable attention engine for LLM inference serving. InProceedings of Machine Learning and Systems (MLSys), 2025
2025
-
[46]
Peiqi Yin, Jiangyun Zhu, Han Gao, Chenguang Zheng, Yongxiang Huang, Taichang Zhou, Ruirui Yang, Weizhi Liu, Weiqing Chen, Canlin Guo, et al. vllm-omni: Fully disaggregated serving for any-to-any multimodal models.arXiv preprint arXiv:2602.02204, 2026. 14
arXiv 2026
-
[47]
Orca: A distributed serving system for Transformer-based generative models
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-based generative models. InProceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 521–538, 2022
2022
-
[48]
Vsa: Faster video diffusion with trainable sparse attention.arXiv preprint arXiv:2505.13389, 2025
Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, and Hao Zhang. Vsa: Faster video diffusion with trainable sparse attention.arXiv preprint arXiv:2505.13389, 2025
arXiv 2025
-
[49]
Fast video generation with sliding tile attention.arXiv preprint arXiv:2502.04507, 2025
Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, and Hao Zhang. Fast video generation with sliding tile attention.arXiv preprint arXiv:2502.04507, 2025
arXiv 2025
-
[50]
Gonzalez, Clark Barrett, and Ying Sheng
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs.Advances in Neural Information Processing Systems, 37:62557–62583, 2024
2024
-
[51]
cudagraph-incompatible
Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024. Appendix A Two More Walk Graphs at a Glance The same fou...
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.