
arxiv: 2606.12688 · v1 · pith:L7PKFZUJnew · submitted 2026-06-10 · 💻 cs.LG · cs.AI · cs.DC

M*: A Modular, Extensible, Serving System for Multimodal Models

Pith reviewed 2026-06-27 09:58 UTC · model grok-4.3

classification 💻 cs.LG cs.AI cs.DC
keywords multimodal models, model serving, dataflow graphs, walk graph, composite architectures, latency optimization, throughput

The pith

M* represents composite multimodal models as Walk Graphs to serve diverse architectures with lower latency and higher throughput than prior systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces M* as a serving system for composite multimodal models that combine vision encoders, language backbones, diffusion heads, audio codecs, and similar components. Existing frameworks rely on narrow assumptions about model structure that do not fit this architectural variety. M* models requests as traversals over dataflow graphs via the Walk Graph abstraction, which supports arbitrary component composition, flexible cluster placement, and optimizations that do not require model-specific code. Performance measurements on representative workloads show concrete gains in latency, real-time factor, and throughput. A sympathetic reader would care because the approach reduces developer effort when deploying increasingly complex multimodal systems.

Core claim

M* represents models as dataflow graphs and processes requests spanning diverse modalities as traversals over these graphs. The Walk Graph abstraction supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. This allows M* to concisely capture composite models from a broad range of families and yields, on average, 20% lower end-to-end latency than vLLM-Omni on text-to-image tasks with BAGEL, up to 2.9x lower real-time factor and 2.7x higher throughput on text-to-speech with Qwen3-Omni, and up to 12.5x better performance than the V-JEPA 2-AC rollout baseline on robotic planning.

What carries the argument

The Walk Graph, an abstraction that represents composite models as traversable dataflow graphs to enable modular composition, flexible placement, and model-agnostic optimizations.
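To make the idea of "requests as traversals over a dataflow graph" concrete, here is a minimal Python sketch. It is not M*'s API: the names `Node`, `ToyGraph`, `walk`, and `build_graph` are hypothetical, and the per-node functions are stand-ins for real model components. The only point is that one graph can serve different request types because each request takes a different path through it.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional

State = Dict[str, Any]


@dataclass
class Node:
    """One model component (hypothetical stand-in for an encoder, backbone, or head)."""
    name: str
    fn: Callable[[State], State]                # computes new fields from the request state
    route: Callable[[State], Optional[str]]     # picks the next node; None ends the walk


@dataclass
class ToyGraph:
    """Illustrative dataflow graph; a request is served by walking it."""
    nodes: Dict[str, Node] = field(default_factory=dict)

    def add(self, node: Node) -> None:
        self.nodes[node.name] = node

    def walk(self, start: str, state: State, max_steps: int = 100) -> List[str]:
        """Run the request through the graph and return the visited node names."""
        path: List[str] = []
        current: Optional[str] = start
        while current is not None and len(path) < max_steps:
            node = self.nodes[current]
            state.update(node.fn(state))
            path.append(current)
            current = node.route(state)
        return path


def build_graph() -> ToyGraph:
    """Assumed three-component model: text encoder, backbone, iterative image head."""
    g = ToyGraph()
    g.add(Node("encode",
               lambda s: {"tokens": len(s["prompt"].split())},
               lambda s: "backbone"))
    g.add(Node("backbone",
               lambda s: {"hidden": s["tokens"] * 2},
               # Text-to-image requests continue to the image head; others stop here.
               lambda s: "denoise" if s["task"] == "t2i" else None))
    g.add(Node("denoise",
               lambda s: {"step": s.get("step", 0) + 1},
               # Loop on the same node until the requested number of steps is done.
               lambda s: "denoise" if s["step"] < s["num_steps"] else None))
    return g


if __name__ == "__main__":
    graph = build_graph()
    print(graph.walk("encode", {"prompt": "a red cat", "task": "t2i", "num_steps": 3}))
    print(graph.walk("encode", {"prompt": "hello there", "task": "t2t", "num_steps": 0}))
```

In this toy version routing is decided by a per-node callback; how M* actually expresses routing, looping, and parallel branches is defined in the paper, not here.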

If this is right

  • Average 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL.
  • Up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni.
  • Up to 12.5x better performance than the V-JEPA 2-AC rollout baseline for robotic planning.
  • Enables serving of complex models with minimal developer effort across unified multimodal, omni, speech-language, and vision-language-action families.
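For readers unfamiliar with the text-to-speech metrics in the list above, the following Python sketch shows how real-time factor (RTF) and audio throughput are commonly computed. It assumes the usual definitions (RTF is processing time divided by the duration of the audio produced; audio throughput is seconds of audio generated per wall-clock second); the paper's exact measurement protocol is not reproduced on this page. The function names and the numbers in the usage section are illustrative only and are not taken from the paper.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of audio generated (lower is better)."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def audio_throughput(total_audio_seconds: float, wall_clock_seconds: float) -> float:
    """Seconds of audio produced per wall-clock second (higher is better)."""
    if wall_clock_seconds <= 0:
        raise ValueError("wall_clock_seconds must be positive")
    return total_audio_seconds / wall_clock_seconds


def improvement(baseline: float, candidate: float, lower_is_better: bool) -> float:
    """Express a gain as an 'x times' factor, as in '2x lower RTF'."""
    return baseline / candidate if lower_is_better else candidate / baseline


if __name__ == "__main__":
    # Made-up measurements for a 10-second clip; not results from the paper.
    baseline_rtf = real_time_factor(processing_seconds=4.0, audio_seconds=10.0)
    candidate_rtf = real_time_factor(processing_seconds=2.0, audio_seconds=10.0)
    print("RTF improvement factor:", improvement(baseline_rtf, candidate_rtf, lower_is_better=True))
    print("Audio throughput:", audio_throughput(total_audio_seconds=160.0, wall_clock_seconds=20.0))
```

An RTF below 1 means audio is generated faster than it plays back, which is why a lower value is reported as better.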

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The graph-based approach could reduce the need for separate serving stacks when new modalities are added to an existing model.
  • Cluster operators might achieve higher utilization by letting the Walk Graph decide component placement dynamically.
  • If the abstraction scales, it could shorten the time from model training to production deployment for composite architectures.

Load-bearing premise

The Walk Graph can concisely capture composite models from a broad range of families while supporting model-agnostic optimizations and flexible cluster placement without prohibitive overhead or model-specific adaptations.

What would settle it

Deploying M* on a previously untested composite model family and measuring whether end-to-end latency exceeds that of a specialized baseline by more than the reported margins.

Figures

Figures reproduced from arXiv: 2606.12688 by Atindra Jha, Baris Kasikci, Irmak Sivgin, Jure Leskovec, Keisuke Kamahori, Luke Zettlemoyer, Mark Horowitz, Naomi Sagan, Olivia Hsu, Rohan Sanda, Stephanie Wang, Steven Gao.

Figure 1: Example model architectures of (a) a UMM (BAGEL [ …
Figure 2: M* at a glance. Left: The model author defines the model as a computation graph ( …
Figure 3: BAGEL T2I/I2I E2E latency, B=1, 3-GPU CFG-parallel
Figure 4: BAGEL I2T, single H100, B ∈ {1, 2, 4, …, 16}, ignore_eos with output lengths between 64 and 256. (a) TTFT (log y). (b) Throughput (req/s). (c) End-to-end request latency. For T2I/I2I we use 3 H100s with classifier-free-guidance (CFG) parallelism (one rank per CFG branch). M* runs the three branches in parallel via the Parallel primitive (Section 3.1); vLLM-Omni uses a specialized CFG parallel plugin …
Figure 5: Qwen3-Omni Seed-TTS, 2-GPU. (a) RTF (lower is better). (b) Audio throughput (higher is better). … B=1 to 14% at B=16, while maintaining a tighter tail: M* p95 TTFT is 28% lower than that of vLLM-Omni at B=16, despite lower p50 gains. Figures 10 and 11 in the Appendix (varying the output token length distribution) show a similar story, with our advantage most prominent for shorter-decode workloads. M* enable…
Figure 6: Qwen3-Omni Seed-TTS, Thinker TP 2 (3-GPU). (a) RTF (lower is better). (b) Audio throughput (higher is better)
Figure 7: Orpheus TTS, single H200. (a) RTF (lower is better). (b) Audio throughput (higher is better). 4.3 Orpheus: M* vs VoxServe. We measure Orpheus-3B performance on a single H200 against VoxServe [21] on B={1, 2, 4, 8, 16} and Seed-TTS ( …
Figure 8: V-JEPA 2 AC rollout, B=1, single H100 (lower is better). 5 Related Work. Token-centric LLM serving. vLLM [23], SGLang [50], and Orca [47] target autoregressive text generation with optimizations such as continuous batching, paged attention, and radix caching. M* reuses these optimizations and generalizes them to other modalities (§3.3). DistServe [51], Splitwise [28], and Mooncake [30] disaggregate prefill …
Figure 9: BAGEL T2I and I2I, single H100, B=1. M* wins on both; the gap widens for I2I.
Figure 10: BAGEL I2T, single H100, B ∈ {1, 2, 4, …, 16}, ignore_eos with output lengths between 16 and 128. (a) TTFT (log y). (b) Throughput (req/s). (c) End-to-end request latency
Figure 11: BAGEL I2T, single H100, B ∈ {1, 2, 4, …, 16}, ignore_eos with output lengths between 128 and 512. (a) TTFT (log y). (b) Throughput (req/s). (c) End-to-end request latency. Figures 10 and 11 show M* vs. vLLM-Omni on BAGEL I2T for varying output length workloads: one where output lengths are randomly sampled between 16 and 64 tokens, and one where output lengths range from 128 to 512 tokens. Overall, …
Original abstract

We are entering a new era of composite model architectures that integrate diverse components such as vision encoders, language backbones, diffusion and flow heads, audio codecs, action generators, and world-model predictors. Such architectures underpin a broad class of multimodal models, including unified multimodal models, omni models, speech-language models, vision-language-action policies, and world models. However, existing model serving frameworks were built on narrow assumptions about model structure, making them ill-suited to accommodate this new architectural diversity. Here we present M*, a universal serving system for efficient serving of composite AI models. M* represents models as dataflow graphs, processing requests spanning diverse modalities and tasks as traversals over these graphs. The core insight is a modular abstraction that supports arbitrary composition of model components, flexible placement onto a physical cluster, and model-agnostic optimizations within a distributed runtime. We call this abstraction the Walk Graph and show how it can concisely capture composite models from a broad range of families. We instantiate M* on representative models and find that it achieves, on average, 20% lower end-to-end latency than vLLM-Omni for text-to-image workloads on BAGEL, while delivering up to 2.9x lower real-time factor and 2.7x higher throughput for text-to-speech workloads on Qwen3-Omni. M* also outperforms the V-JEPA 2-AC rollout baseline for robotic planning by up to 12.5x. Thus, our work paves the road towards more efficient serving of complex models with minimal developer effort.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces M*, a universal serving system for composite multimodal models (e.g., unified multimodal, omni, speech-language, vision-language-action, and world models). It represents models as dataflow graphs via the Walk Graph abstraction, which supports arbitrary component composition, flexible cluster placement, and model-agnostic optimizations in a distributed runtime. Evaluations on BAGEL (text-to-image), Qwen3-Omni (text-to-speech), and robotic planning tasks report average 20% lower end-to-end latency vs. vLLM-Omni, up to 2.9x lower real-time factor and 2.7x higher throughput, and up to 12.5x improvement over the V-JEPA 2-AC baseline.

Significance. If the Walk Graph abstraction generalizes across model families without prohibitive overhead and the reported speedups prove reproducible, the work could meaningfully advance efficient serving of emerging composite multimodal architectures with reduced developer effort. The identification of limitations in existing narrow-assumption frameworks is a useful framing.

major comments (2)
  1. Abstract: performance numbers (20% latency reduction, 2.9x/2.7x TTS gains, 12.5x planning improvement) are stated without any experimental protocol, baseline configurations, hardware details, or error bars, so the central empirical claims cannot be assessed for soundness.
  2. Walk Graph section (core abstraction): the claim that the abstraction 'concisely capture[s] composite models from a broad range of families' while remaining model-agnostic and low-overhead is load-bearing for the paper's generality argument, yet evaluation is confined to three specific workloads (BAGEL, Qwen3-Omni, and V-JEPA 2-AC robotic planning) with no additional evidence or overhead analysis provided for broader applicability.
minor comments (1)
  1. Add explicit pseudocode or formal definition of the Walk Graph traversal and placement algorithms to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight opportunities to strengthen the presentation of empirical claims and the generality argument. We address each major comment below.

  1. Referee: Abstract: performance numbers (20% latency reduction, 2.9x/2.7x TTS gains, 12.5x planning improvement) are stated without any experimental protocol, baseline configurations, hardware details, or error bars, so the central empirical claims cannot be assessed for soundness.

    Authors: We agree this is a valid concern for assessability. In the revised manuscript we will expand the abstract with a concise statement of the hardware platform (NVIDIA H100 and H200 GPUs), the primary baselines (vLLM-Omni and V-JEPA 2-AC), and a note that full protocols, configurations, and error bars (reported over 5 runs) appear in Section 4. Abstract length limits preclude full experimental protocols, but the added context will allow readers to locate the supporting details. revision: yes

  2. Referee: Walk Graph section (core abstraction): the claim that the abstraction 'concisely capture[s] composite models from a broad range of families' while remaining model-agnostic and low-overhead is load-bearing for the paper's generality argument, yet evaluation is confined to three specific workloads (BAGEL, Qwen3-Omni, and V-JEPA 2-AC robotic planning) with no additional evidence or overhead analysis provided for broader applicability.

    Authors: The three workloads were chosen as representatives of distinct families (unified multimodal, omni/speech-language, and vision-language-action/world models). Section 3 already provides the formal definition and construction rules that are intentionally model-agnostic. To strengthen the generality claim we will add (1) explicit overhead measurements for Walk Graph construction and traversal (time and memory) on the evaluated models and (2) a short discussion subsection that maps the abstraction to two additional families (pure vision-language and diffusion-based world models) using the same construction rules. These additions rely on existing design material rather than new experiments. revision: partial

Circularity Check

0 steps flagged

No significant circularity


The paper is a systems description of a serving framework whose central claims are empirical performance improvements measured against named external baselines (vLLM-Omni, V-JEPA 2-AC). No equations, fitted parameters, or derivation chains appear; the Walk Graph is presented as an engineering abstraction whose generality is asserted via implementation on representative models rather than proven by internal reduction. No self-citation load-bearing steps or ansatz smuggling are detectable from the supplied text.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

Ledger extracted from abstract only; full paper may contain additional parameters or assumptions.

axioms (2)
  • domain assumption: Existing model serving frameworks were built on narrow assumptions about model structure.
    Stated as motivation in the abstract.
  • domain assumption: Composite models can be represented as dataflow graphs that support arbitrary composition and model-agnostic optimizations.
    Central insight underlying the Walk Graph.
invented entities (1)
  • Walk Graph (no independent evidence)
    purpose: Modular abstraction that represents models as dataflow graphs for request traversal and placement.
    New concept introduced to enable the universal serving system.

pith-pipeline@v0.9.1-grok · 5866 in / 1377 out tokens · 24333 ms · 2026-06-27T09:58:54.800148+00:00 · methodology


Reference graph

Works this paper leans on

51 extracted references · 20 linked inside Pith

  1. [1]

    Cosmos world foundation model platform for physical AI

    Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chattopadhyay, Yongxin Chen, Yin Cui, Yifan Ding, Daniel Dworakowski, Jiaojiao Fan, Michele Fenzi, Francesco Ferroni, Sanja Fidler, Dieter Fox, Songwei Ge, Yunhao Ge, Jinwei Gu, Siddharth Gururani, Ethan He, Jiahui Huang, Jacob Huffman, Pooya Jannaty, Jingyi Jin, S...

  2. [2]

    Diffusion for world modeling: Visual details matter in Atari. Advances in Neural Information Processing Systems, 37:58757–58791, 2024

    Eloi Alonso, Adam Jelley, Vincent Micheli, Anssi Kanervisto, Amos Storkey, Tim Pearce, and François Fleuret. Diffusion for world modeling: Visual details matter in Atari. Advances in Neural Information Processing Systems, 37:58757–58791, 2024

  3. [3]

    Seed-tts: A family of high-quality versatile speech generation models, 2024

    Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, Mingqing Gong, Peisong Huang, Qingqing Huang, Zhiying Huang, Yuanyuan Huo, Dongya Jia, Chumin Li, Feiya Li, Hui Li, Jiaxin Li, Xiaoyang Li, Xingxing Li, Lin Liu, Shouda Liu, Sichao Liu, Xudong Liu, Yuchen Liu, Zhengxi Liu, Lu Lu, J...

  4. [4]

    V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xiao...

  5. [5]

    Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025

  6. [6]

    Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, Lucy X...

  7. [7]

    Food-101 – mining discriminative components with random forests

    Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In European Conference on Computer Vision, 2014

  8. [8]

    Orpheus-TTS: Towards human-sounding speech

    Canopy Labs. Orpheus-TTS: Towards human-sounding speech. GitHub repository, 2025. URL https://github.com/canopyai/Orpheus-TTS

  9. [9]

    Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  10. [10]

    Emerging properties in unified multimodal pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, Guang Shi, and Haoqi Fan. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025

  11. [11]

    CosyVoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117, 2024

    Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, and Jingren Zhou. CosyVoice 2: Scalable streaming speech synthesis with large language models. arXiv preprint arXiv:2412.10117, 2024

  12. [12]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach. Scaling rectified flow transformers for high-resolution image synthesis. In Proceedings of the 41st Internatio...

  13. [13]

    USP: A unified sequence parallelism approach for long context generative ai. arXiv preprint arXiv:2405.07719, 2024

    Jiarui Fang and Shangchun Zhao. USP: A unified sequence parallelism approach for long context generative ai. arXiv preprint arXiv:2405.07719, 2024

  14. [14]

    xdit: an inference engine for diffusion transformers (dits) with massive parallelism. arXiv preprint arXiv:2411.01738, 2024

    Jiarui Fang, Jinzhe Pan, Xibo Sun, Aoyu Li, and Jiannan Wang. xdit: an inference engine for diffusion transformers (dits) with massive parallelism. arXiv preprint arXiv:2411.01738, 2024

  15. [15]

    Pipefusion: Patch-level pipeline parallelism for diffusion transformers inference

    Jiarui Fang, Jinzhe Pan, Aoyu Li, Xibo Sun, and Jiannan Wang. Pipefusion: Patch-level pipeline parallelism for diffusion transformers inference. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=5xwyxupsLL

  16. [16]

    Mastering diverse domains through world models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104, 2023

  17. [17]

    ZeroMQ: Messaging for many applications

    Pieter Hintjens. ZeroMQ: Messaging for many applications. O’Reilly Media, Inc., 2013

  18. [18]

    Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022

  19. [19]

    Step-Audio: Unified understanding and generation in intelligent speech interaction. arXiv preprint arXiv:2502.11946, 2025

    Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, et al. Step-Audio: Unified understanding and generation in intelligent speech interaction. arXiv preprint arXiv:2502.11946, 2025

  20. [20]

    Vbench: Comprehensive benchmark suite for video generative models

    Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

  21. [21]

    VoxServe: Streaming-centric serving system for speech language models. arXiv preprint arXiv:2602.00269, 2026

    Keisuke Kamahori, Wei-Tzu Lee, Atindra Jha, Rohan Kadekodi, Stephanie Wang, Arvind Krishnamurthy, and Baris Kasikci. VoxServe: Streaming-centric serving system for speech language models. arXiv preprint arXiv:2602.00269, 2026

  22. [22]

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, Peter David Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Yecheng Jason Ma, Patrick Tree Miller, Jimmy Wu, Suneel Belkhale, Shivin Dass, Huy Ha, Arhan Jain, Abraham Lee, You...

  23. [23]

    Efficient memory management for large language model serving with PagedAttention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), pages 611–626, 2023

  24. [24]

    FlashDrive: Flash Vision-Language-Action Inference For Autonomous Driving. ES-Reasoning Workshop at ICLR, 2026

    Zekai Li, Yihao Liang, Hongfei Zhang, Jian Chen, and Zhijian Liu. FlashDrive: Flash Vision-Language-Action Inference For Autonomous Driving. ES-Reasoning Workshop at ICLR, 2026

  25. [25]

    Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996, 2024

    Weixin Liang, Lili Yu, Liang Luo, Srinivasan Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. arXiv preprint arXiv:2411.04996, 2024

  26. [26]

    Flow straight and fast: Learning to generate and transfer data with rectified flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. In International Conference on Learning Representations (ICLR), 2023

  27. [27]

    Cornserve: Efficiently serving any-to-any multimodal models

    Jeff J. Ma, Jae-Won Chung, Jisang Ahn, Yizhuo Liang, Akshay Jajoo, Myungjin Lee, and Mosharaf Chowdhury. Cornserve: Efficiently serving any-to-any multimodal models. arXiv preprint arXiv:2512.14098, 2025

  28. [28]

    Splitwise: Efficient generative LLM inference using phase splitting

    Pratyush Patel, Esha Choukse, Chaojie Zhang, Aashaka Shah, Íñigo Goiri, Saeed Maleki, and Ricardo Bianchini. Splitwise: Efficient generative LLM inference using phase splitting. In Proceedings of the 51st Annual International Symposium on Computer Architecture (ISCA), 2024

  29. [29]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4195–4205, 2023

  30. [30]

    Mooncake: A KVCache-centric disaggregated architecture for LLM serving

    Ruoyu Qin, Zheming Li, Weiran He, Jialei Cui, Feng Ren, Mingxing Zhang, Yongwei Wu, Weimin Zheng, and Xinran Xu. Mooncake: A KVCache-centric disaggregated architecture for LLM serving. In Proceedings of the 23rd USENIX Conference on File and Storage Technologies (FAST), 2025

  31. [31]

    ModServe: Modality- and stage-aware resource disaggregation for scalable multimodal model serving

    Haoran Qiu, Anish Biswas, Zihan Zhao, Jayashree Mohan, Alind Khare, Esha Choukse, Íñigo Goiri, Zeyu Zhang, Haiying Shen, Chetan Bansal, Ramachandran Ramjee, and Rodrigo Fonseca. ModServe: Modality- and stage-aware resource disaggregation for scalable multimodal model serving. In Proceedings of the ACM Symposium on Cloud Computing (SoCC), 2025

  32. [32]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In International Conference on Machine Learning, pages 28492–28518. PMLR, 2023

  33. [33]

    SGLang-Omni: High-performance multi-stage pipeline framework for omni models

    sgl-project. SGLang-Omni: High-performance multi-stage pipeline framework for omni models. GitHub repository, 2026. URL https://github.com/sgl-project/sglang-omni. Accessed 2026-03-20

  34. [34]

    Efficiently serving large multimodal models using epd disaggregation. arXiv preprint arXiv:2501.05460, 2025

    Gursimran Singh, Xinglu Wang, Yifan Hu, Timothy Yu, Linzi Xing, Wei Jiang, Zhefeng Wang, Xiaolong Bai, Yi Li, Ying Xiong, et al. Efficiently serving large multimodal models using epd disaggregation. arXiv preprint arXiv:2501.05460, 2025

  35. [35]

    Snac: Multi-scale neural audio codec. arXiv preprint arXiv:2410.14411, 2024

    Hubert Siuzdak, Florian Grötschla, and Luca A Lanzendörfer. Snac: Multi-scale neural audio codec. arXiv preprint arXiv:2410.14411, 2024

  36. [36]

    Unveiling redundancy in diffusion transformers (dits): A systematic study. arXiv preprint arXiv:2411.13588, 2024

    Xibo Sun, Jiarui Fang, Aoyu Li, and Jinzhe Pan. Unveiling redundancy in diffusion transformers (dits): A systematic study. arXiv preprint arXiv:2411.13588, 2024

  37. [37]

    Inferix: A block-diffusion based next-generation inference engine for world simulation. arXiv preprint arXiv:2511.20714, 2025

    Inferix Team, Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, et al. Inferix: A block-diffusion based next-generation inference engine for world simulation. arXiv preprint arXiv:2511.20714, 2025

  38. [38]

    Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022

    Patrick von Platen, Suraj Patil, Anton Lozhkov, Pedro Cuenca, Nathan Lambert, Kashif Rasul, Mishig Davaadorj, Dhruv Nair, Sayak Paul, William Berman, Yiyi Xu, Steven Liu, and Thomas Wolf. Diffusers: State-of-the-art diffusion models. https://github.com/huggingface/diffusers, 2022

  39. [39]

    Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail. arXiv preprint arXiv:2511.00088, 2025

    Yan Wang, Wenjie Luo, Junjie Bai, Yulong Cao, Tong Che, Ke Chen, Yuxiao Chen, Jenna Diamond, Yifan Ding, Wenhao Ding, et al. Alpamayo-r1: Bridging reasoning and action prediction for generalizable autonomous driving in the long tail. arXiv preprint arXiv:2511.00088, 2025

  40. [40]

    Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, et al. Huggingface’s transformers: State-of-the-art natural language processing. arXiv preprint arXiv:1910.03771, 2019

  41. [41]

    Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art...

  42. [42]

    Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025

    Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models. arXiv preprint arXiv:2506.15564, 2025

  43. [43]

    Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025

  44. [44]

    Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025

  45. [45]

    FlashInfer: Efficient and customizable attention engine for LLM inference serving

    Zihao Ye, Lequn Chen, Ruihang Lai, Wuwei Lin, Yineng Zhang, Stephanie Wang, Tianqi Chen, Baris Kasikci, Vinod Grover, Arvind Krishnamurthy, and Luis Ceze. FlashInfer: Efficient and customizable attention engine for LLM inference serving. In Proceedings of Machine Learning and Systems (MLSys), 2025

  46. [46]

    vllm-omni: Fully disaggregated serving for any-to-any multimodal models. arXiv preprint arXiv:2602.02204, 2026

    Peiqi Yin, Jiangyun Zhu, Han Gao, Chenguang Zheng, Yongxiang Huang, Taichang Zhou, Ruirui Yang, Weizhi Liu, Weiqing Chen, Canlin Guo, et al. vllm-omni: Fully disaggregated serving for any-to-any multimodal models. arXiv preprint arXiv:2602.02204, 2026

  47. [47]

    Orca: A distributed serving system for Transformer-based generative models

    Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-based generative models. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI), pages 521–538, 2022

  48. [48]

    Vsa: Faster video diffusion with trainable sparse attention. arXiv preprint arXiv:2505.13389, 2025

    Peiyuan Zhang, Yongqi Chen, Haofeng Huang, Will Lin, Zhengzhong Liu, Ion Stoica, Eric Xing, and Hao Zhang. Vsa: Faster video diffusion with trainable sparse attention. arXiv preprint arXiv:2505.13389, 2025

  49. [49]

    Fast video generation with sliding tile attention. arXiv preprint arXiv:2502.04507, 2025

    Peiyuan Zhang, Yongqi Chen, Runlong Su, Hangliang Ding, Ion Stoica, Zhengzhong Liu, and Hao Zhang. Fast video generation with sliding tile attention. arXiv preprint arXiv:2502.04507, 2025

  50. [50]

    SGLang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. SGLang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024

  51. [51]

    DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving

    Yinmin Zhong, Shengyu Liu, Junda Chen, Jianbo Hu, Yibo Zhu, Xuanzhe Liu, Xin Jin, and Hao Zhang. DistServe: Disaggregating prefill and decoding for goodput-optimized large language model serving. In Proceedings of the 18th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2024.