pith. machine review for the scientific record.

arxiv: 2604.08121 · v1 · submitted 2026-04-09 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links


Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator

Chao Qu, Hao Li, Haoyu Pan, Jia Gong, Li Xu, Luozheng Qin, Qian Qiao, Tianjiao Li, Zhiyu Tan

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:29 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords video generation · video understanding · unified multimodal models · diffusion models · flow matching · mixture of experts · bidirectional training

The pith

A diffusion video generator can be extended into a unified model for both creating videos and understanding them through flow matching and staged training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Rather than starting with an understanding model and bolting on expensive video generation, the work begins with a video diffusion generator and adds understanding capabilities on top. A single training process handles continuous flow matching for video and discrete flow matching for text, while a modality-driven mixture-of-experts structure adds lightweight text layers without disturbing the core generative blocks. Bidirectional training first reconstructs input prompts to recall text-video links, then fine-tunes on detailed captions to form shared discriminative representations. The result is competitive performance on both generation quality metrics and understanding tasks such as video question answering.
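
To make "one training process" concrete, the sketch below pairs continuous flow matching on video latents with masked discrete flow matching on text tokens in a single joint step. It is a minimal toy written for this review, not the paper's implementation: the class `UnifiedBlock`, the head names, the toy dimensions, and the exact loss form are assumptions made here.

```python
# Minimal sketch (not the paper's code): one joint step of continuous flow
# matching on video latents and masked discrete flow matching on text tokens.
import torch
import torch.nn as nn
import torch.nn.functional as F

D, V = 64, 1000           # latent width and text vocabulary size (toy values)
MASK_ID = V               # extra index used as the [MASK] token for discrete flow

class UnifiedBlock(nn.Module):                      # hypothetical toy backbone
    def __init__(self):
        super().__init__()
        layer = nn.TransformerEncoderLayer(D, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.text_embed = nn.Embedding(V + 1, D)    # +1 for the mask token
        self.video_head = nn.Linear(D, D)           # predicts the continuous velocity
        self.text_head = nn.Linear(D, V)            # predicts logits over text tokens

    def forward(self, video_latents, text_tokens):
        h = torch.cat([video_latents, self.text_embed(text_tokens)], dim=1)
        h = self.backbone(h)
        n_vid = video_latents.shape[1]
        return self.video_head(h[:, :n_vid]), self.text_head(h[:, n_vid:])

def unified_flow_step(model, video, text):
    # Continuous flow matching: interpolate noise -> data, regress the velocity.
    noise = torch.randn_like(video)
    t = torch.rand(video.shape[0], 1, 1)
    noisy_video = (1 - t) * noise + t * video
    # Discrete flow matching: mask a random fraction of tokens, predict the originals.
    rate = 0.5 + 0.5 * torch.rand(text.shape[0], 1)
    masked = torch.where(torch.rand(text.shape) < rate,
                         torch.full_like(text, MASK_ID), text)
    v_pred, logits = model(noisy_video, masked)
    loss_video = F.mse_loss(v_pred, video - noise)          # velocity target x1 - x0
    is_masked = masked == MASK_ID
    loss_text = F.cross_entropy(logits[is_masked], text[is_masked])
    return loss_video + loss_text                           # one shared update

model = UnifiedBlock()
loss = unified_flow_step(model, torch.randn(2, 16, D), torch.randint(0, V, (2, 8)))
loss.backward()
```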

Core claim

Uni-ViGU unifies video generation and understanding by taking a diffusion-based video generator as the foundation: a unified flow method performs continuous flow matching for video and discrete flow matching for text in one process, a modality-driven MoE framework augments Transformer blocks with lightweight text layers while preserving generative priors, and bidirectional training (Knowledge Recall followed by Capability Refinement) repurposes generation knowledge into shared representations. The model thereby achieves competitive results on both tasks, validating generation-centric architectures as a scalable path toward unified multimodal intelligence.

What carries the argument

The bidirectional training mechanism of Knowledge Recall (reconstructing input prompts to leverage learned correspondences) followed by Capability Refinement (fine-tuning on detailed captions), which converts generation priors into discriminative shared representations while keeping a unified flow method and MoE structure to support both modalities.
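
As a schematic of that two-stage schedule, the sketch below shows only how the text target changes between stages; the sample fields (`prompt`, `detailed_caption`) and the stub `train_text_branch` are hypothetical names introduced for this review, and the actual discrete-flow text loss is elided to a print statement.

```python
# Schematic of the bidirectional training stages (illustrative, not the paper's code).
from dataclasses import dataclass

@dataclass
class Sample:
    video_id: str
    prompt: str            # short prompt already paired with the video during generation pretraining
    detailed_caption: str  # richer caption used only in the second stage

def text_target(sample: Sample, stage: str) -> str:
    if stage == "knowledge_recall":
        # Stage 1: reconstruct the original prompt from the video, re-using
        # the text-video correspondences the generator already encodes.
        return sample.prompt
    if stage == "capability_refinement":
        # Stage 2: fine-tune on detailed captions to build discriminative
        # shared representations for understanding tasks.
        return sample.detailed_caption
    raise ValueError(f"unknown stage: {stage}")

def train_text_branch(sample: Sample, stage: str) -> None:
    target = text_target(sample, stage)
    # Placeholder for the discrete-flow text loss on `target`, conditioned on
    # the video pathway (which stays close to its generative pretraining).
    print(f"[{stage}] {sample.video_id} -> text target: {target!r}")

data = [Sample("vid_001", "a dog surfing",
               "a golden retriever balances on a shortboard as a small wave breaks at sunset")]
for stage in ("knowledge_recall", "capability_refinement"):
    for sample in data:
        train_text_branch(sample, stage)
```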

If this is right

  • Video generation and understanding can share one set of model weights and training compute rather than requiring separate large systems.
  • The higher computational cost of video generation can be addressed at the foundation rather than added later.
  • Generation-first designs become viable foundations for broader multimodal systems that handle both creation and comprehension.
  • The same flow-matching and MoE additions could support text-to-video and video-to-text within a single forward pass.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same pattern of starting from a generator and adding recall-plus-refinement stages might transfer to unifying image generation with image understanding.
  • Training efficiency could improve if pre-trained video generators are reused as bases instead of training understanding models from scratch.
  • Limits may appear when scaling to much longer videos or more complex multi-step reasoning tasks that go beyond the current caption-based refinement.

Load-bearing premise

The two-stage bidirectional training can add strong understanding ability without causing substantial loss in the model's original video generation quality.

What would settle it

A clear drop in standard video generation metrics such as FVD or FID after the full bidirectional training, or understanding performance that remains well below specialized models on video QA and captioning benchmarks.

Figures

Figures reproduced from arXiv: 2604.08121 by Chao Qu, Hao Li, Haoyu Pan, Jia Gong, Li Xu, Luozheng Qin, Qian Qiao, Tianjiao Li, Zhiyu Tan.

Figure 1. The DiT architecture of Wan and Uni-ViGU: … spatial and temporal dependencies within the video features, while cross-attention injects semantic …
Figure 2. Overview of the Uni-ViGU framework. We formulate unified multimodal generation via a uni…
Figure 3. Qualitative results on video-text joint generation.
read the original abstract

Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript proposes Uni-ViGU, a framework that unifies video generation and understanding by extending a diffusion-based video generator rather than an understanding-centric MLLM. It introduces a unified flow method performing continuous flow matching on video and discrete flow matching on text in a single process, a modality-driven MoE-based architecture that augments Transformer blocks with lightweight text-generation layers while preserving generative priors, and a bidirectional training procedure consisting of a Knowledge Recall stage (prompt reconstruction to leverage text-video correspondences) followed by a Capability Refinement stage (fine-tuning on detailed captions to obtain discriminative shared representations). The central claim is that this generation-centric approach achieves competitive performance on both video generation and understanding tasks without substantial degradation of generation quality, thereby validating generation-first architectures as a scalable route to unified multimodal video intelligence.

Significance. If the quantitative results hold, the work would be significant because it inverts the dominant paradigm of extending understanding models to generation and instead starts from the harder generation task, which incurs higher compute. The combination of unified flow matching and modality-driven MoE is a coherent engineering contribution, and the bidirectional training mechanism offers a concrete recipe for repurposing generation priors. Explicit credit is due for releasing code and a project page, which supports reproducibility. The result, if substantiated, would strengthen the case that generation-centric backbones can serve as foundations for unified video models.

major comments (2)
  1. [§3.3] §3.3 (Bidirectional Training Mechanism): The claim that Capability Refinement establishes discriminative shared representations without degrading generation quality is load-bearing for the central thesis, yet the manuscript provides no ablation measuring generation metrics (e.g., FVD or FID) immediately before versus after the refinement stage on the same model checkpoint. This omission prevents verification that the bidirectional procedure satisfies the “without substantial degradation” assumption stated in the abstract.
  2. [§4.1–4.2] §4.1–4.2 (Experimental Results): The abstract asserts “competitive performance,” but the reported tables lack direct head-to-head comparisons against recent unified or generation-first baselines (e.g., the latest video diffusion models and multimodal LLMs) on the same video-generation and video-understanding benchmarks with identical evaluation protocols. Without these numbers and statistical significance tests, the empirical support for the generation-centric unification claim remains incomplete.
minor comments (3)
  1. [Eq. (5)] The notation in the unified flow-matching objective (Eq. 5) re-uses the symbol t for both continuous video time and discrete text step; a clarifying sentence or subscript would remove ambiguity (one possible form is sketched after this list).
  2. [Figure 3] Figure 3 (MoE architecture diagram) does not label the routing weights or the expert activation pattern; adding these annotations would improve readability.
  3. [§2] The related-work section omits recent flow-matching video papers published after 2023; a brief citation update would strengthen context.
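
For illustration only, one way the disambiguation suggested in minor comment 1 could read, writing t for the continuous video time and s for the discrete text corruption level; this is a schematic form assumed by this review, not the paper's actual Eq. (5):

```latex
% Schematic only: t = continuous video time, s = discrete text corruption level.
\mathcal{L}_{\mathrm{unified}}
  \;=\; \mathbb{E}_{t}\,\bigl\|\, v_\theta(x_t,\, t,\, y) - (x_1 - x_0) \,\bigr\|^{2}
  \;+\; \lambda\, \mathbb{E}_{s}\,\bigl[ -\log p_\theta\bigl(y \mid \tilde{y}_s,\, s,\, x_1\bigr) \bigr],
  \qquad x_t = (1-t)\, x_0 + t\, x_1 .
```

Here x_0 is noise, x_1 the clean video latent, y the text sequence, and \tilde{y}_s its partially masked version; carrying distinct symbols t and s through the objective is exactly the clarification the comment asks for.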

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which helps us improve the clarity and rigor of our claims. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [§3.3] §3.3 (Bidirectional Training Mechanism): The claim that Capability Refinement establishes discriminative shared representations without degrading generation quality is load-bearing for the central thesis, yet the manuscript provides no ablation measuring generation metrics (e.g., FVD or FID) immediately before versus after the refinement stage on the same model checkpoint. This omission prevents verification that the bidirectional procedure satisfies the “without substantial degradation” assumption stated in the abstract.

    Authors: We agree that an explicit before/after ablation on generation metrics is necessary to substantiate the central claim. In the revised manuscript we will add this ablation, evaluating FVD, FID, and related metrics on the identical checkpoint immediately prior to and following the Capability Refinement stage. This will directly verify that generation quality is preserved while discriminative capabilities are acquired. revision: yes

  2. Referee: [§4.1–4.2] §4.1–4.2 (Experimental Results): The abstract asserts “competitive performance,” but the reported tables lack direct head-to-head comparisons against recent unified or generation-first baselines (e.g., the latest video diffusion models and multimodal LLMs) on the same video-generation and video-understanding benchmarks with identical evaluation protocols. Without these numbers and statistical significance tests, the empirical support for the generation-centric unification claim remains incomplete.

    Authors: We acknowledge the value of more comprehensive head-to-head comparisons and statistical testing. In the revision we will expand Tables 1–4 (and associated text) to include additional recent unified and generation-first baselines, ensuring identical evaluation protocols are followed wherever the original papers report compatible numbers. We will also report standard deviations across multiple runs and conduct basic significance testing to strengthen the empirical support for competitive performance. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain

full rationale

The paper's core contributions consist of a proposed unified flow-matching method (continuous for video, discrete for text), a modality-driven MoE augmentation of Transformer blocks, and a two-stage bidirectional training procedure (Knowledge Recall via prompt reconstruction followed by Capability Refinement on captions). These are presented as engineering extensions of existing flow-matching and MoE techniques rather than derived results. No equations reduce a claimed prediction to a fitted parameter by construction, no load-bearing uniqueness theorems are imported via self-citation, and no ansatz is smuggled through prior work. The experimental claims of competitive performance on generation and understanding tasks rest on independent benchmarks and do not collapse into the architectural definitions themselves. The derivation chain therefore contains independent content and leaves its claims open to external validation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim rests on the effectiveness of newly introduced components (unified flow method and modality-driven MoE) and the assumption that bidirectional training transfers generative knowledge to understanding tasks; these are presented without independent external validation beyond standard diffusion practices.

invented entities (2)
  • unified flow method · no independent evidence
    purpose: performs continuous flow matching for video and discrete flow matching for text within a single process
    Core new component enabling coherent multimodal generation.
  • modality-driven MoE-based framework · no independent evidence
    purpose: augments Transformer blocks with lightweight layers for text generation while preserving generative priors
    Allows addition of text capabilities without harming video generation strength (see the sketch below).
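
As a rough illustration of the second entry, the sketch below shows one way a modality-routed block could look: video tokens keep a frozen FFN inherited from the generator while text tokens pass through a newly added lightweight FFN. Class and argument names are hypothetical, routing is a simple modality mask rather than a learned router, and nothing here is taken from the paper's code.

```python
# Illustrative modality-driven MoE block (not the paper's implementation).
import torch
import torch.nn as nn

class ModalityMoEBlock(nn.Module):
    def __init__(self, dim: int = 64, text_hidden: int = 128):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        # Expert inherited from the video generator; frozen to preserve generative priors.
        self.video_ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        for p in self.video_ffn.parameters():
            p.requires_grad = False
        # Newly added lightweight expert that handles text tokens.
        self.text_ffn = nn.Sequential(nn.Linear(dim, text_hidden), nn.GELU(), nn.Linear(text_hidden, dim))

    def forward(self, x: torch.Tensor, is_text: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, dim) mixed video+text tokens; is_text: (batch, seq) boolean modality mask.
        h = x + self.attn(self.norm1(x), self.norm1(x), self.norm1(x), need_weights=False)[0]
        z = self.norm2(h)
        # For clarity both experts run on every token and the mask selects the output;
        # a real MoE layer would dispatch tokens to avoid the wasted compute.
        return h + torch.where(is_text.unsqueeze(-1), self.text_ffn(z), self.video_ffn(z))

block = ModalityMoEBlock()
tokens = torch.randn(2, 24, 64)                      # e.g. 16 video + 8 text tokens per sample
is_text = torch.arange(24).expand(2, 24) >= 16       # last 8 positions are text
print(block(tokens, is_text).shape)                  # torch.Size([2, 24, 64])
```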

pith-pipeline@v0.9.0 · 5524 in / 1290 out tokens · 69880 ms · 2026-05-10T18:29:49.136200+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

52 extracted references · 22 canonical work pages · 16 internal anchors

  1. [1]

    Janus: Decoupling visual encoding for unified multimodal understanding and generation

    Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

  2. [2]

    Show-o: One Single Transformer to Unify Multimodal Understanding and Generation

    Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024

  3. [4]

    Uni-cot: Towards unified chain-of-thought reasoning across text and vision

    Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, and Hao Li. Uni-cot: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606, 2025

  4. [5]

    Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework

    Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International conference on machine learning, pages 23318–23340. PMLR, 2022

  5. [6]

    One transformer fits all distributions in multi-modal diffusion at scale

    Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale. In International Conference on Machine Learning, pages 1692–1717. PMLR, 2023

  6. [7]

    Unified-io: A unified model for vision, language, and multi-modal tasks

    Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-io: A unified model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations

  7. [8]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

  8. [9]

    Image as a foreign language: Beit pretraining for vision and vision-language tasks

    Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: Beit pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175–19186, 2023

  9. [10]

    Chameleon: Mixed-Modal Early-Fusion Foundation Models

    Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024

  10. [11]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025

  11. [12]

    Nextstep-1: Toward autoregressive image generation with continuous tokens at scale

    NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, et al. Nextstep-1: Toward autoregressive image generation with continuous tokens at scale. arXiv preprint arXiv:2508.10711, 2025

  12. [13]

    Controlar: Controllable image generation with autoregressive models

    Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. Controlar: Controllable image generation with autoregressive models. In International Conference on Learning Representations, 2025

  13. [14]

    Incorporating reinforced adversarial learning in autoregressive image generation

    Kenan E Ak, Ning Xu, Zhe Lin, and Yilin Wang. Incorporating reinforced adversarial learning in autoregressive image generation. In European conference on computer vision, pages 18–34. Springer, 2020

  14. [15]

    Metamorph: Multimodal understanding and generation via instruction tuning

    Shengbang Tong, David Fan, Jiachen Li, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17001–17012, 2025

  15. [16]

    BLIP3-o: A Family of Fully Open Unified Multimodal Models-Architecture, Training and Dataset

    Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025

  16. [17]

    Omni-video 2: Scaling mllm-conditioned diffusion for unified video generation and editing

    Hao Yang, Zhiyu Tan, Jia Gong, Luozheng Qin, Hesen Chen, Xiaomeng Yang, Yuqing Sun, Yuetan Lin, Mengping Yang, and Hao Li. Omni-video 2: Scaling mllm-conditioned diffusion for unified video generation and editing. arXiv preprint arXiv:2602.08820, 2026

  17. [18]

    Transfer between Modalities with MetaQueries

    Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries. arXiv preprint arXiv:2504.06256, 2025

  18. [19]

    Qwen2.5-Coder Technical Report

    Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024

  19. [20]

    Llama 3 model card

    AI@Meta. Llama 3 model card. 2024

  20. [21]

    Textbooks Are All You Need II: phi-1.5 technical report

    Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need ii: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023

  21. [22]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025

  22. [23]

    Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models

    Weixin Liang, Lili Yu, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-transformers: A sparse and scalable architecture for multi-modal foundation models. Transactions on Machine Learning Research, 2025

  23. [24]

    Flux.1 [dev]

    Black Forest Labs. Flux.1 [dev]. https://huggingface.co/black-forest-labs/FLUX.1-dev, 2024

  24. [25]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025

  25. [26]

    Infants have rich visual categories in ventrotemporal cortex at 2 months of age

    Cliona O’Doherty, Áine T Dineen, Anna Truzzi, Graham King, Lorijn Zaadnoordijk, Keelin Harrison, Enna-Louise D’Arcy, Jessica White, Chiara Caldinelli, Tamrin Holloway, et al. Infants have rich visual categories in ventrotemporal cortex at 2 months of age. Nature Neuroscience, pages 1–10, 2026

  26. [27]

    Learning words’ sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information

    H Henny Yeung and Janet F Werker. Learning words’ sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information. Cognition, 113(2):234–243, 2009

  27. [28]

    Infant speech perception and cognitive skills as predictors of later vocabulary

    Yuanyuan Wang, Amanda Seidl, and Alejandrina Cristia. Infant speech perception and cognitive skills as predictors of later vocabulary. Infant Behavior and Development, 62:101524, 2021

  28. [29]

    Open-Sora: Democratizing Efficient Video Production for All

    Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

  29. [30]

    Video diffusion models

    Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J Fleet. Video diffusion models. Advances in neural information processing systems, 35:8633–8646, 2022

  30. [31]

    Cogvideox: Text-to-video diffusion models with an expert transformer

    Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations

  31. [32]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023

  32. [33]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025

  33. [34]

    Scaling diffusion language models via adaptation from autoregressive models

    Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models. In The Thirteenth International Conference on Learning Representations

  34. [35]

    Diffuseq: Sequence to sequence text generation with diffusion models

    Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. Diffuseq: Sequence to sequence text generation with diffusion models. In The Eleventh International Conference on Learning Representations

  35. [36]

    Flow matching guide and code

    Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code, 2024

  36. [37]

    Flow matching for generative modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations

  37. [38]

    Scaling rectified flow transformers for high-resolution image synthesis

    Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first international conference on machine learning, 2024

  38. [39]

    Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining

    Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. Unimax: Fairer and more effective language sampling for large-scale multilingual pretraining. arXiv preprint arXiv:2304.09151, 2023

  39. [40]

    Structured denoising diffusion models in discrete state-spaces

    Jacob Austin, Daniel D Johnson, Jonathan Ho, Daniel Tarlow, and Rianne Van Den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in neural information processing systems, 34:17981–17993, 2021

  40. [41]

    Diffusion-lm improves controllable text generation

    Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S Liang, and Tatsunori B Hashimoto. Diffusion-lm improves controllable text generation. Advances in neural information processing systems, 35:4328–4343, 2022

  41. [42]

    Adaptive mixtures of local experts

    Robert A Jacobs, Michael I Jordan, Steven J Nowlan, and Geoffrey E Hinton. Adaptive mixtures of local experts. Neural computation, 3(1):79–87, 1991

  42. [43]

    Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models

    Damai Dai, Chengqi Deng, Chenggang Zhao, RX Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. Deepseekmoe: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1280–1297, 2024

  43. [44]

    Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding

    Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, and Yu-Gang Jiang. Unitoken: Harmonizing multimodal understanding and generation through unified visual encoding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3600–3610, 2025

  44. [45]

    Omni-diffusion: Unified multimodal understanding and generation with masked discrete diffusion

    Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, and Chaoyou Fu. Omni-diffusion: Unified multimodal understanding and generation with masked discrete diffusion. arXiv preprint arXiv:2603.06577, 2026

  45. [46]

    Align your latents: High-resolution video synthesis with latent diffusion models

    Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 22563–22575, 2023

  46. [47]

    Scalable diffusion models with transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

  47. [48]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024

  48. [49]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022

  49. [50]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022

  50. [51]

    Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

    Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023

  51. [52]

    Simple and effective masked diffusion language models

    Subham S Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024

  52. [53]

    Large Language Diffusion Models

    Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025