pith. machine review for the scientific record.

arxiv: 2604.08121 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links


Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Chao Qu, Hao Li, Haoyu Pan, Jia Gong, Li Xu, Luozheng Qin, Qian Qiao, Tianjiao Li, Zhiyu Tan

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video generation · video understanding · unified multimodal models · diffusion models · flow matching · mixture of experts · bidirectional training

The pith

A diffusion video generator can be extended into a unified model for both creating videos and understanding them through flow matching and staged training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Rather than starting with an understanding model and bolting on expensive video generation, the work begins with a video diffusion generator and adds understanding capabilities on top. A single training process handles continuous flow matching for video and discrete flow matching for text, while a modality-driven mixture-of-experts structure adds lightweight text layers without disturbing the core generative blocks. Bidirectional training first reconstructs input prompts to recall text-video links, then fine-tunes on detailed captions to form shared discriminative representations. The result is competitive performance on both generation quality metrics and understanding tasks such as video question answering.
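
To make "one training process" concrete, the sketch below pairs continuous flow matching on video latents with masked discrete flow matching on text tokens in a single joint step. It is a minimal toy written for this review, not the paper's implementation: the class `UnifiedBlock`, the head names, the toy dimensions, and the exact loss form are assumptions made here.

```python
# Minimal sketch (not the paper's code): one joint step of continuous flow
# matching on video latents and masked discrete flow matching on text tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, V = 64, 1000           # latent width and text vocabulary size (toy values)
MASK_ID = V               # extra index used as the [MASK] token for discrete flow

class UnifiedBlock(nn.Module):                      # hypothetical toy backbone
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.text_embed = nn.Embedding(V + 1, D)    # +1 for the mask token
        self.video_head = nn.Linear(D, D)           # predicts the continuous velocity
        self.text_head = nn.Linear(D, V)            # predicts logits over text tokens

    def forward(self, video_latents, text_tokens):
        h = torch.cat([video_latents, self.text_embed(text_tokens)], dim=1)
        h = self.backbone(h)
        n_vid = video_latents.shape[1]
        return self.video_head(h[:, :n_vid]), self.text_head(h[:, n_vid:])

def unified_flow_step(model, video, text):
    # Continuous flow matching: interpolate noise -> data, regress the velocity.
    noise = torch.randn_like(video)
    t = torch.rand(video.shape[0], 1, 1)
    noisy_video = (1 - t) * noise + t * video
    # Discrete flow matching: mask a random fraction of tokens, predict the originals.
    rate = 0.5 + 0.5 * torch.rand(text.shape[0], 1)
    masked = torch.where(torch.rand(text.shape) < rate,
                         torch.full_like(text, MASK_ID), text)
    v_pred, logits = model(noisy_video, masked)
    loss_video = F.mse_loss(v_pred, video - noise)          # velocity target x1 - x0
    is_masked = masked == MASK_ID
    loss_text = F.cross_entropy(logits[is_masked], text[is_masked])
    return loss_video + loss_text                           # one shared update

model = UnifiedBlock()
loss = unified_flow_step(model, torch.randn(2, 16, D), torch.randint(0, V, (2, 8)))
loss.backward()
```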

Core claim

Uni-ViGU unifies video generation and understanding by taking a diffusion-based video generator as the foundation: a unified flow method performs continuous flow matching for video and discrete flow matching for text in one process, a modality-driven MoE framework augments Transformer blocks with lightweight text layers while preserving generative priors, and bidirectional training (Knowledge Recall followed by Capability Refinement) repurposes generation knowledge into shared representations. The model thereby achieves competitive results on both tasks, validating generation-centric architectures as a scalable path toward unified multimodal intelligence.

What carries the argument

The bidirectional training mechanism of Knowledge Recall (reconstructing input prompts to leverage learned correspondences) followed by Capability Refinement (fine-tuning on detailed captions), which converts generation priors into discriminative shared representations while keeping a unified flow method and MoE structure to support both modalities.
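
As a schematic of that two-stage schedule, the sketch below shows only how the text target changes between stages; the sample fields (`prompt`, `detailed_caption`) and the stub `train_text_branch` are hypothetical names introduced for this review, and the actual discrete-flow text loss is elided to a print statement.

```python
# Schematic of the bidirectional training stages (illustrative, not the paper's code).
from dataclasses import dataclass

@dataclass
class Sample:
    video_id: str
    prompt: str            # short prompt already paired with the video during generation pretraining
    detailed_caption: str  # richer caption used only in the second stage

def text_target(sample: Sample, stage: str) -> str:
    if stage == "knowledge_recall":
        # Stage 1: reconstruct the original prompt from the video, re-using
        # the text-video correspondences the generator already encodes.
        return sample.prompt
    if stage == "capability_refinement":
        # Stage 2: fine-tune on detailed captions to build discriminative
        # shared representations for understanding tasks.
        return sample.detailed_caption
    raise ValueError(f"unknown stage: {stage}")

def train_text_branch(sample: Sample, stage: str) -> None:
    target = text_target(sample, stage)
    # Placeholder for the discrete-flow text loss on `target`, conditioned on
    # the video pathway (which stays close to its generative pretraining).
    print(f"[{stage}] {sample.video_id} -> text target: {target!r}")

data = [Sample("vid_001", "a dog surfing",
               "a golden retriever balances on a shortboard as a small wave breaks at sunset")]
for stage in ("knowledge_recall", "capability_refinement"):
    for sample in data:
        train_text_branch(sample, stage)
```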

If this is right

  • Video generation and understanding can share one set of model weights and training compute rather than requiring separate large systems.
  • The higher computational cost of video generation can be addressed at the foundation rather than added later.
  • Generation-first designs become viable foundations for broader multimodal systems that handle both creation and comprehension.
  • The same flow-matching and MoE additions could support text-to-video and video-to-text within a single forward pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pattern of starting from a generator and adding recall-plus-refinement stages might transfer to unifying image generation with image understanding.
  • Training efficiency could improve if pre-trained video generators are reused as bases instead of training understanding models from scratch.
  • Limits may appear when scaling to much longer videos or more complex multi-step reasoning tasks that go beyond the current caption-based refinement.

Load-bearing premise

The two-stage bidirectional training can add strong understanding ability without causing substantial loss in the model's original video generation quality.

What would settle it

A clear drop in standard video generation metrics such as FVD or FID after the full bidirectional training, or understanding performance that remains well below specialized models on video QA and captioning benchmarks.

Figures

Figures reproduced from arXiv: 2604.08121 by Chao Qu, Hao Li, Haoyu Pan, Jia Gong, Li Xu, Luozheng Qin, Qian Qiao, Tianjiao Li, Zhiyu Tan.

Figure 1. The DiT architecture of Wan and Uni-ViGU: … spatial and temporal dependencies within the video features, while cross-attention injects semantic …
Figure 2. Overview of the Uni-ViGU framework. We formulate unified multimodal generation via a uni…
Figure 3. Qualitative results on video-text joint generation.
read the original abstract

Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes Uni-ViGU, a framework that unifies video generation and understanding by extending a diffusion-based video generator rather than an understanding-centric MLLM. It introduces a unified flow method performing continuous flow matching on video and discrete flow matching on text in a single process, a modality-driven MoE-based architecture that augments Transformer blocks with lightweight text-generation layers while preserving generative priors, and a bidirectional training procedure consisting of a Knowledge Recall stage (prompt reconstruction to leverage text-video correspondences) followed by a Capability Refinement stage (fine-tuning on detailed captions to obtain discriminative shared representations). The central claim is that this generation-centric approach achieves competitive performance on both video generation and understanding tasks without substantial degradation of generation quality, thereby validating generation-first architectures as a scalable route to unified multimodal video intelligence.

Significance. If the quantitative results hold, the work would be significant because it inverts the dominant paradigm of extending understanding models to generation and instead starts from the harder generation task, which incurs higher compute. The combination of unified flow matching and modality-driven MoE is a coherent engineering contribution, and the bidirectional training mechanism offers a concrete recipe for repurposing generation priors. Explicit credit is due for releasing code and a project page, which supports reproducibility. The result, if substantiated, would strengthen the case that generation-centric backbones can serve as foundations for unified video models.

major comments (2)
  1. [§3.3] §3.3 (Bidirectional Training Mechanism): The claim that Capability Refinement establishes discriminative shared representations without degrading generation quality is load-bearing for the central thesis, yet the manuscript provides no ablation measuring generation metrics (e.g., FVD or FID) immediately before versus after the refinement stage on the same model checkpoint. This omission prevents verification that the bidirectional procedure satisfies the “without substantial degradation” assumption stated in the abstract.
  2. [§4.1–4.2] §4.1–4.2 (Experimental Results): The abstract asserts “competitive performance,” but the reported tables lack direct head-to-head comparisons against recent unified or generation-first baselines (e.g., the latest video diffusion models and multimodal LLMs) on the same video-generation and video-understanding benchmarks with identical evaluation protocols. Without these numbers and statistical significance tests, the empirical support for the generation-centric unification claim remains incomplete.
minor comments (3)
  1. [Eq. (5)] The notation in the unified flow-matching objective (Eq. 5) re-uses the symbol t for both continuous video time and discrete text step; a clarifying sentence or subscript would remove ambiguity (one possible form is sketched after this list).
  2. [Figure 3] Figure 3 (MoE architecture diagram) does not label the routing weights or the expert activation pattern; adding these annotations would improve readability.
  3. [§2] The related-work section omits recent flow-matching video papers published after 2023; a brief citation update would strengthen context.
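
For illustration only, one way the disambiguation suggested in minor comment 1 could read, writing t for the continuous video time and s for the discrete text corruption level; this is a schematic form assumed by this review, not the paper's actual Eq. (5):

```latex
% Schematic only: t = continuous video time, s = discrete text corruption level.
\mathcal{L}_{\mathrm{unified}}
  \;=\; \mathbb{E}_{t}\,\bigl\|\, v_\theta(x_t,\, t,\, y) - (x_1 - x_0) \,\bigr\|^{2}
  \;+\; \lambda\, \mathbb{E}_{s}\,\bigl[ -\log p_\theta\bigl(y \mid \tilde{y}_s,\, s,\, x_1\bigr) \bigr],
  \qquad x_t = (1-t)\, x_0 + t\, x_1 .
```

Here x_0 is noise, x_1 the clean video latent, y the text sequence, and \tilde{y}_s its partially masked version; carrying distinct symbols t and s through the objective is exactly the clarification the comment asks for.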

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which helps us improve the clarity and rigor of our claims. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [§3.3] §3.3 (Bidirectional Training Mechanism): The claim that Capability Refinement establishes discriminative shared representations without degrading generation quality is load-bearing for the central thesis, yet the manuscript provides no ablation measuring generation metrics (e.g., FVD or FID) immediately before versus after the refinement stage on the same model checkpoint. This omission prevents verification that the bidirectional procedure satisfies the “without substantial degradation” assumption stated in the abstract.

    Authors: We agree that an explicit before/after ablation on generation metrics is necessary to substantiate the central claim. In the revised manuscript we will add this ablation, evaluating FVD, FID, and related metrics on the identical checkpoint immediately prior to and following the Capability Refinement stage. This will directly verify that generation quality is preserved while discriminative capabilities are acquired. revision: yes

  2. Referee: [§4.1–4.2] §4.1–4.2 (Experimental Results): The abstract asserts “competitive performance,” but the reported tables lack direct head-to-head comparisons against recent unified or generation-first baselines (e.g., the latest video diffusion models and multimodal LLMs) on the same video-generation and video-understanding benchmarks with identical evaluation protocols. Without these numbers and statistical significance tests, the empirical support for the generation-centric unification claim remains incomplete.

    Authors: We acknowledge the value of more comprehensive head-to-head comparisons and statistical testing. In the revision we will expand Tables 1–4 (and associated text) to include additional recent unified and generation-first baselines, ensuring identical evaluation protocols are followed wherever the original papers report compatible numbers. We will also report standard deviations across multiple runs and conduct basic significance testing to strengthen the empirical support for competitive performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper's core contributions consist of a proposed unified flow-matching method (continuous for video, discrete for text), a modality-driven MoE augmentation of Transformer blocks, and a two-stage bidirectional training procedure (Knowledge Recall via prompt reconstruction followed by Capability Refinement on captions). These are presented as engineering extensions of existing flow-matching and MoE techniques rather than derived results. No equations reduce a claimed prediction to a fitted parameter by construction, no load-bearing uniqueness theorems are imported via self-citation, and no ansatz is smuggled through prior work. The experimental claims of competitive performance on generation and understanding tasks rest on independent benchmarks and do not collapse into the architectural definitions themselves. The derivation chain therefore contains independent content and leaves its claims open to external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim rests on the effectiveness of newly introduced components (unified flow method and modality-driven MoE) and the assumption that bidirectional training transfers generative knowledge to understanding tasks; these are presented without independent external validation beyond standard diffusion practices.

invented entities (2)
  • unified flow method · no independent evidence
    purpose: performs continuous flow matching for video and discrete flow matching for text within a single process
    Core new component enabling coherent multimodal generation.
  • modality-driven MoE-based framework · no independent evidence
    purpose: augments Transformer blocks with lightweight layers for text generation while preserving generative priors
    Allows addition of text capabilities without harming video generation strength (see the sketch below).
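
As a rough illustration of the second entry, the sketch below shows one way a modality-routed block could look: video tokens keep a frozen FFN inherited from the generator while text tokens pass through a newly added lightweight FFN. Class and argument names are hypothetical, routing is a simple modality mask rather than a learned router, and nothing here is taken from the paper's code.

```python
# Illustrative modality-driven MoE block (not the paper's implementation).
import torch
import torch.nn as nn

class ModalityMoEBlock(nn.Module):
    def __init__(self, dim: int = 64, text_hidden: int = 128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Expert inherited from the video generator; frozen to preserve generative priors.
        self.video_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        for p in self.video_ffn.parameters():
            p.requires_grad = False
        # Newly added lightweight expert that handles text tokens.
        self.text_ffn = nn.Sequential(nn.Linear(dim, text_hidden), nn.GELU(), nn.Linear(text_hidden, dim))

    def forward(self, x: torch.Tensor, is_text: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) mixed video+text tokens; is_text: (batch, seq) boolean modality mask.
        h = x + self.attn(self.norm1(x), self.norm1(x), self.norm1(x), need_weights=False)[0]
        z = self.norm2(h)
        # For clarity both experts run on every token and the mask selects the output;
        # a real MoE layer would dispatch tokens to avoid the wasted compute.
        return h + torch.where(is_text.unsqueeze(-1), self.text_ffn(z), self.video_ffn(z))

block = ModalityMoEBlock()
tokens = torch.randn(2, 24, 64)                      # e.g. 16 video + 8 text tokens per sample
is_text = torch.arange(24).expand(2, 24) >= 16       # last 8 positions are text
print(block(tokens, is_text).shape)                  # torch.Size([2, 24, 64])
```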

pith-pipeline@v0.9.0 · 5524 in / 1290 out tokens · 69880 ms · 2026-05-10T18:29:49.136200+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

52 extracted references · 22 canonical work pages · 16 internal anchors

  1. [1]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

  2. [2]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

  3. [4]

    Uni-cot: Towards unified chain-of-thought reasoning across text and vision

    Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, and Hao Li. Uni-cot: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606, 2025

  4. [5]

    Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International conference on machine learning, pages 23318–23340. PMLR, 2022

  5. [6]

    One transformer fits all distributions in multi-modal diffusion at scale

    Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale. In International Conference on Machine Learning, pages 1692–1717. PMLR, 2023

  6. [7]

    Unified-io: A unified model for vision, language, and multi-modal tasks

    Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations

  7. [8]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  8. [9]

    Image as a foreign language: Beit pretraining for vision and vision-language tasks

    Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175–19186, 2023

  9. [10]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024

  10. [11]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  11. [12]

    Nextstep-1: Toward autoregressive image generation with continuous tokens at scale

    NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, et al. Nextstep-1: Toward autoregressive image generation with continuous tokens at scale. arXiv preprint arXiv:2508.10711, 2025

  12. [13]

    Controlar: Controllable image generation with autoregressive models

    Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Controlar: Controllable image generation with autoregressive models. In International Conference on Learning Representations, 2025

  13. [14]

    Incorporating reinforced adversarial learning in autoregressive image generation

    Kenan E Ak, Ning Xu, Zhe Lin, and Yilin Wang. Incorporating reinforced adversarial learning in autoregressive image generation. In European conference on computer vision, pages 18–34. Springer, 2020

  14. [15]

    Metamorph: Multimodal understanding and generation via instruction tuning

    Shengbang Tong, David Fan, Jiachen Li, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17001–17012, 2025

  15. [16]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

  16. [17]

    Omni-video 2: Scaling mllm-conditioned diffusion for unified video generation and editing

    Hao Yang, Zhiyu Tan, Jia Gong, Luozheng Qin, Hesen Chen, Xiaomeng Yang, Yuqing Sun, Yuetan Lin, Mengping Yang, and Hao Li. Omni-video 2: Scaling mllm-conditioned diffusion for unified video generation and editing. arXiv preprint arXiv:2602.08820, 2026

  17. [18]

    Transfer between Modalities with MetaQueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025

  18. [19]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024

  19. [20]

    Llama 3 model card

    AI@Meta. Llama 3 model card. 2024

  20. [21]

    Textbooks Are All You Need II: phi-1.5 technical report

    Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023

  21. [22]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025

  22. [23]

    Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models

    Weixin Liang, Lili Yu, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. Transactions on Machine Learning Research, 2025

  23. [24]

    Flux.1 [dev]

    Black Forest Labs. Flux.1 [dev]. https://huggingface.co/black-forest-labs/FLUX.1-dev, 2024

  24. [25]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  25. [26]

    Infants have rich visual categories in ventrotemporal cortex at 2 months of age

    Cliona O’Doherty, Áine T Dineen, Anna Truzzi, Graham King, Lorijn Zaadnoordijk, Keelin Harrison, Enna-Louise D’Arcy, Jessica White, Chiara Caldinelli, Tamrin Holloway, et al. Infants have rich visual categories in ventrotemporal cortex at 2 months of age. Nature Neuroscience, pages 1–10, 2026

  26. [27]

    Learning words’ sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information

    H Henny Yeung and Janet F Werker. Learning words’ sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information. Cognition, 113(2):234–243, 2009

  27. [28]

    Infant speech perception and cognitive skills as predictors of later vocabulary

    Yuanyuan Wang, Amanda Seidl, and Alejandrina Cristia. Infant speech perception and cognitive skills as predictors of later vocabulary. Infant Behavior and Development, 62:101524, 2021

  28. [29]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

  29. [30]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in neural information processing systems, 35:8633–8646, 2022

  30. [31]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations

  31. [32]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  32. [33]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  33. [34]

    Scaling diffusion language models via adaptation from autoregressive models

    Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models. In The Thirteenth International Conference on Learning Representations

  34. [35]

    Diffuseq: Sequence to sequence text generation with diffusion models

    Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. In The Eleventh International Conference on Learning Representations

  35. [36]

    Flow matching guide and code

    Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code, 2024

  36. [37]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations

  37. [38]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

  38. [39]

    Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining

    Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. arXiv preprint arXiv:2304.09151, 2023

  39. [40]

    Structured denoising diffusion models in discrete state-spaces

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34:17981–17993, 2021

  40. [41]

    Diffusion-lm improves controllable text generation

    Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. Advances in neural information processing systems, 35:4328–4343, 2022

  41. [42]

    Adaptive mixtures of local experts

    Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991

  42. [43]

    Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models

    Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1280–1297, 2024

  43. [44]

    Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding

    Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, and Yu-Gang Jiang. Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3600–3610, 2025

  44. [45]

    Omni-diffusion: Unified multimodal understanding and generation with masked discrete diffusion

    Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, and Chaoyou Fu. Omni-diffusion: Unified multimodal understanding and generation with masked discrete diffusion. arXiv preprint arXiv:2603.06577, 2026

  45. [46]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

  46. [47]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  47. [48]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

  48. [49]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  49. [50]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  50. [51]

    Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023

  51. [52]

    Simple and effective masked diffusion language models

    Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024

  52. [53]

    Large Language Diffusion Models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025