Uni-ViGU: Towards Unified Video Generation and Understanding via A Diffusion-Based Video Generator
Recognition: 2 Lean theorem links
Pith reviewed 2026-05-10 18:29 UTC · model grok-4.3
The pith
A diffusion video generator can be extended into a unified model for both creating videos and understanding them through flow matching and staged training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Uni-ViGU unifies video generation and understanding by taking a diffusion-based video generator as the foundation. It applies a unified flow method that performs continuous flow matching for video and discrete flow matching for text in one process, augments Transformer blocks with a modality-driven MoE framework that preserves generative priors, and uses bidirectional training (Knowledge Recall followed by Capability Refinement) to repurpose generation knowledge into shared representations. On this basis it achieves competitive results on both tasks, which the authors present as validating generation-centric architectures as a scalable path toward unified multimodal intelligence.
What carries the argument
The bidirectional training mechanism: Knowledge Recall (reconstructing input prompts to exploit learned text-video correspondences) followed by Capability Refinement (fine-tuning on detailed captions). This converts generation priors into discriminative shared representations, while the unified flow method and MoE structure continue to support both modalities.
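To make the unified flow method concrete, here is a minimal sketch of the two training-pair constructions the abstract describes: rectified-flow-style continuous flow matching for video latents and masked discrete flow matching for text tokens. All function names, the linear interpolation path, and the mask-and-predict corruption scheme are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def continuous_fm_pair(x1, rng):
    """One training pair for continuous flow matching (rectified-flow style):
    sample t ~ U(0,1), interpolate from Gaussian noise x0 to data x1, and
    return the interpolated input together with the velocity target x1 - x0."""
    x0 = rng.standard_normal(x1.shape)   # Gaussian source sample
    t = rng.uniform()                    # scalar flow time for this sample
    x_t = (1.0 - t) * x0 + t * x1        # linear probability path
    v_target = x1 - x0                   # constant velocity the model regresses
    return x_t, t, v_target

def discrete_fm_corrupt(tokens, mask_id, rng):
    """One corruption step for discrete (masked) flow matching over text:
    sample a mask ratio s ~ U(0,1) and replace that fraction of tokens with
    a dedicated MASK id; the model learns to predict the original tokens."""
    s = rng.uniform()
    mask = rng.uniform(size=len(tokens)) < s
    corrupted = np.where(mask, mask_id, tokens)
    return corrupted, mask
```

Under this reading, "one process" means a single model is trained on the sum of a velocity-regression loss over video pairs and a masked-token prediction loss over text corruptions.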
If this is right
- Video generation and understanding can share one set of model weights and training compute rather than requiring separate large systems.
- The higher computational cost of video generation can be addressed at the foundation rather than added later.
- Generation-first designs become viable foundations for broader multimodal systems that handle both creation and comprehension.
- The same flow-matching and MoE additions could support text-to-video and video-to-text within a single forward pass.
Where Pith is reading between the lines
- The same pattern of starting from a generator and adding recall-plus-refinement stages might transfer to unifying image generation with image understanding.
- Training efficiency could improve if pre-trained video generators are reused as bases instead of training understanding models from scratch.
- Limits may appear when scaling to much longer videos or more complex multi-step reasoning tasks that go beyond the current caption-based refinement.
Load-bearing premise
The two-stage bidirectional training can add strong understanding ability without causing substantial loss in the model's original video generation quality.
What would settle it
Evidence against the claim would be either a clear drop in standard video generation metrics such as FVD or FID after the full bidirectional training, or understanding performance that remains well below specialized models on video QA and captioning benchmarks.
Original abstract
Unified multimodal models integrating visual understanding and generation face a fundamental challenge: visual generation incurs substantially higher computational costs than understanding, particularly for video. This imbalance motivates us to invert the conventional paradigm: rather than extending understanding-centric MLLMs to support generation, we propose Uni-ViGU, a framework that unifies video generation and understanding by extending a video generator as the foundation. We introduce a unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process, enabling coherent multimodal generation. We further propose a modality-driven MoE-based framework that augments Transformer blocks with lightweight layers for text generation while preserving generative priors. To repurpose generation knowledge for understanding, we design a bidirectional training mechanism with two stages: Knowledge Recall reconstructs input prompts to leverage learned text-video correspondences, while Capability Refinement fine-tunes on detailed captions to establish discriminative shared representations. Experiments demonstrate that Uni-ViGU achieves competitive performance on both video generation and understanding, validating generation-centric architectures as a scalable path toward unified multimodal intelligence. Project Page and Code: https://fr0zencrane.github.io/uni-vigu-page/.
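The modality-driven MoE design described in the abstract can be sketched as hard routing by modality: each token is dispatched to a per-modality feed-forward expert by a modality flag rather than a learned router, so the pretrained video expert (and its generative prior) is left untouched while a lightweight text expert is added. The names and the two-expert setup below are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def modality_routed_ffn(h, modality, video_ffn, text_ffn):
    """Route each token's hidden state to a per-modality expert.

    h:        (num_tokens, dim) hidden states
    modality: (num_tokens,) flags, 0 = video token, 1 = text token
    """
    out = np.empty_like(h)
    vid = modality == 0
    out[vid] = video_ffn(h[vid])    # pretrained generative path, preserved
    out[~vid] = text_ffn(h[~vid])   # new lightweight text-generation path
    return out
```

For example, with `video_ffn = lambda x: x * 2.0` and `text_ffn = lambda x: x - 1.0`, video tokens pass only through the first expert and text tokens only through the second; neither expert's parameters receive gradients from the other modality's tokens.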
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Uni-ViGU, a framework that unifies video generation and understanding by extending a diffusion-based video generator rather than an understanding-centric MLLM. It introduces a unified flow method performing continuous flow matching on video and discrete flow matching on text in a single process, a modality-driven MoE-based architecture that augments Transformer blocks with lightweight text-generation layers while preserving generative priors, and a bidirectional training procedure consisting of a Knowledge Recall stage (prompt reconstruction to leverage text-video correspondences) followed by a Capability Refinement stage (fine-tuning on detailed captions to obtain discriminative shared representations). The central claim is that this generation-centric approach achieves competitive performance on both video generation and understanding tasks without substantial degradation of generation quality, thereby validating generation-first architectures as a scalable route to unified multimodal video intelligence.
Significance. If the quantitative results hold, the work would be significant because it inverts the dominant paradigm of extending understanding models to generation and instead starts from the harder generation task, which incurs higher compute. The combination of unified flow matching and modality-driven MoE is a coherent engineering contribution, and the bidirectional training mechanism offers a concrete recipe for repurposing generation priors. Explicit credit is due for releasing code and a project page, which supports reproducibility. The result, if substantiated, would strengthen the case that generation-centric backbones can serve as foundations for unified video models.
major comments (2)
- [§3.3] §3.3 (Bidirectional Training Mechanism): The claim that Capability Refinement establishes discriminative shared representations without degrading generation quality is load-bearing for the central thesis, yet the manuscript provides no ablation measuring generation metrics (e.g., FVD or FID) immediately before versus after the refinement stage on the same model checkpoint. This omission prevents verification that the bidirectional procedure satisfies the “without substantial degradation” assumption stated in the abstract.
- [§4.1–4.2] §4.1–4.2 (Experimental Results): The abstract asserts “competitive performance,” but the reported tables lack direct head-to-head comparisons against recent unified or generation-first baselines (e.g., the latest video diffusion models and multimodal LLMs) on the same video-generation and video-understanding benchmarks with identical evaluation protocols. Without these numbers and statistical significance tests, the empirical support for the generation-centric unification claim remains incomplete.
minor comments (3)
- [Eq. (5)] The notation in the unified flow-matching objective (Eq. 5) re-uses the symbol t for both continuous video time and discrete text step; a clarifying sentence or subscript would remove ambiguity.
- [Figure 3] Figure 3 (MoE architecture diagram) does not label the routing weights or the expert activation pattern; adding these annotations would improve readability.
- [§2] The related-work section omits recent flow-matching video papers published after 2023; a brief citation update would strengthen context.
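On the Eq. (5) notation point, one way the ambiguity could be resolved is to subscript the two time variables. The following is a sketch assuming the objective is a weighted sum of the two flow-matching losses; the subscripts, the weight λ, and the exact loss forms are illustrative, not the paper's actual Eq. (5).

```latex
% t_c \in [0,1]: continuous flow time for video; t_d: discrete flow step for text.
\mathcal{L} =
\mathbb{E}_{t_c,\,x_0,\,x_1}
  \bigl\| v_\theta(x_{t_c},\, t_c) - (x_1 - x_0) \bigr\|^2
+ \lambda\,
\mathbb{E}_{t_d,\,y}
  \bigl[ -\log p_\theta\bigl(y \mid \tilde{y}_{t_d},\, t_d\bigr) \bigr]
```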
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which helps us improve the clarity and rigor of our claims. We address each major comment below and commit to revisions that directly respond to the concerns raised.
Point-by-point responses
-
Referee: [§3.3] §3.3 (Bidirectional Training Mechanism): The claim that Capability Refinement establishes discriminative shared representations without degrading generation quality is load-bearing for the central thesis, yet the manuscript provides no ablation measuring generation metrics (e.g., FVD or FID) immediately before versus after the refinement stage on the same model checkpoint. This omission prevents verification that the bidirectional procedure satisfies the “without substantial degradation” assumption stated in the abstract.
Authors: We agree that an explicit before/after ablation on generation metrics is necessary to substantiate the central claim. In the revised manuscript we will add this ablation, evaluating FVD, FID, and related metrics on the identical checkpoint immediately prior to and following the Capability Refinement stage. This will directly verify that generation quality is preserved while discriminative capabilities are acquired. revision: yes
-
Referee: [§4.1–4.2] §4.1–4.2 (Experimental Results): The abstract asserts “competitive performance,” but the reported tables lack direct head-to-head comparisons against recent unified or generation-first baselines (e.g., the latest video diffusion models and multimodal LLMs) on the same video-generation and video-understanding benchmarks with identical evaluation protocols. Without these numbers and statistical significance tests, the empirical support for the generation-centric unification claim remains incomplete.
Authors: We acknowledge the value of more comprehensive head-to-head comparisons and statistical testing. In the revision we will expand Tables 1–4 (and associated text) to include additional recent unified and generation-first baselines, ensuring identical evaluation protocols are followed wherever the original papers report compatible numbers. We will also report standard deviations across multiple runs and conduct basic significance testing to strengthen the empirical support for competitive performance. revision: yes
Circularity Check
No significant circularity detected in derivation chain
full rationale
The paper's core contributions consist of a proposed unified flow-matching method (continuous for video, discrete for text), a modality-driven MoE augmentation of Transformer blocks, and a two-stage bidirectional training procedure (Knowledge Recall via prompt reconstruction followed by Capability Refinement on captions). These are presented as engineering extensions of existing flow-matching and MoE techniques rather than derived results. No equations reduce a claimed prediction to a fitted parameter by construction, no load-bearing uniqueness theorems are imported via self-citation, and no ansatz is smuggled through prior work. The experimental claims of competitive performance on generation and understanding tasks rest on independent benchmarks and do not collapse into the architectural definitions themselves. The derivation chain therefore contains independent content and remains open to external validation.
Axiom & Free-Parameter Ledger
invented entities (2)
- unified flow method: no independent evidence
- modality-driven MoE-based framework: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear · matched claim: "unified flow method that performs continuous flow matching for video and discrete flow matching for text within a single process"
- IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear · matched claim: "modality-driven MoE-based framework that augments Transformer blocks with lightweight layers"
Reference graph
Works this paper leans on
- [1] Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025.
- [2] Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation. arXiv preprint arXiv:2408.12528, 2024.
- [4] Luozheng Qin, Jia Gong, Yuqing Sun, Tianjiao Li, Mengping Yang, Xiaomeng Yang, Chao Qu, Zhiyu Tan, and Hao Li. Uni-CoT: Towards unified chain-of-thought reasoning across text and vision. arXiv preprint arXiv:2508.05606, 2025.
- [5] Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. OFA: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR, 2022.
- [6] Fan Bao, Shen Nie, Kaiwen Xue, Chongxuan Li, Shi Pu, Yaole Wang, Gang Yue, Yue Cao, Hang Su, and Jun Zhu. One transformer fits all distributions in multi-modal diffusion at scale. In International Conference on Machine Learning, pages 1692–1717. PMLR, 2023.
- [7] Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Unified-IO: A unified model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations, 2023.
- [8] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. Gemini: A family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023.
- [9] Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. Image as a foreign language: BEiT pretraining for vision and vision-language tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19175–19186, 2023.
- [10] Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models. arXiv preprint arXiv:2405.09818, 2024.
- [11] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-Pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.
- [12] NextStep Team, Chunrui Han, Guopeng Li, Jingwei Wu, Quan Sun, Yan Cai, Yuang Peng, Zheng Ge, Deyu Zhou, Haomiao Tang, et al. NextStep-1: Toward autoregressive image generation with continuous tokens at scale. arXiv preprint arXiv:2508.10711, 2025.
- [13] Zongming Li, Tianheng Cheng, Shoufa Chen, Peize Sun, Haocheng Shen, Longjin Ran, Xiaoxin Chen, Wenyu Liu, and Xinggang Wang. ControlAR: Controllable image generation with autoregressive models. In International Conference on Learning Representations, 2025.
- [14] Kenan E. Ak, Ning Xu, Zhe Lin, and Yilin Wang. Incorporating reinforced adversarial learning in autoregressive image generation. In European Conference on Computer Vision, pages 18–34. Springer, 2020.
- [15] Shengbang Tong, David Fan, Jiachen Li, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. MetaMorph: Multimodal understanding and generation via instruction tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17001–17012, 2025.
- [16] Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. BLIP3-o: A family of fully open unified multimodal models: architecture, training and dataset. arXiv preprint arXiv:2505.09568, 2025.
- [17] Hao Yang, Zhiyu Tan, Jia Gong, Luozheng Qin, Hesen Chen, Xiaomeng Yang, Yuqing Sun, Yuetan Lin, Mengping Yang, and Hao Li. Omni-Video 2: Scaling MLLM-conditioned diffusion for unified video generation and editing. arXiv preprint arXiv:2602.08820, 2026.
- [18] Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with MetaQueries. arXiv preprint arXiv:2504.06256, 2025.
- [19] Binyuan Hui, Jian Yang, Zeyu Cui, Jiaxi Yang, Dayiheng Liu, Lei Zhang, Tianyu Liu, Jiajun Zhang, Bowen Yu, Keming Lu, et al. Qwen2.5-Coder technical report. arXiv preprint arXiv:2409.12186, 2024.
- [20] AI@Meta. Llama 3 model card. 2024.
- [21] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks are all you need II: phi-1.5 technical report. arXiv preprint arXiv:2309.05463, 2023.
- [22] Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025.
- [23] Weixin Liang, Lili Yu, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen-tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-Transformers: A sparse and scalable architecture for multi-modal foundation models. Transactions on Machine Learning Research, 2025.
- [24] Black Forest Labs. FLUX.1 [dev]. https://huggingface.co/black-forest-labs/FLUX.1-dev, 2024.
- [25] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.
- [26] Cliona O'Doherty, Áine T. Dineen, Anna Truzzi, Graham King, Lorijn Zaadnoordijk, Keelin Harrison, Enna-Louise D'Arcy, Jessica White, Chiara Caldinelli, Tamrin Holloway, et al. Infants have rich visual categories in ventrotemporal cortex at 2 months of age. Nature Neuroscience, pages 1–10, 2026.
- [27] H. Henny Yeung and Janet F. Werker. Learning words' sounds before learning how words sound: 9-month-olds use distinct objects as cues to categorize speech information. Cognition, 113(2):234–243, 2009.
- [28] Yuanyuan Wang, Amanda Seidl, and Alejandrina Cristia. Infant speech perception and cognitive skills as predictors of later vocabulary. Infant Behavior and Development, 62:101524, 2021.
- [29] Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-Sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024.
- [30] Jonathan Ho, Tim Salimans, Alexey Gritsenko, William Chan, Mohammad Norouzi, and David J. Fleet. Video diffusion models. Advances in Neural Information Processing Systems, 35:8633–8646, 2022.
- [31] Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. CogVideoX: Text-to-video diffusion models with an expert transformer. In The Thirteenth International Conference on Learning Representations, 2025.
- [32] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
- [33] Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
- [34] Shansan Gong, Shivam Agarwal, Yizhe Zhang, Jiacheng Ye, Lin Zheng, Mukai Li, Chenxin An, Peilin Zhao, Wei Bi, Jiawei Han, et al. Scaling diffusion language models via adaptation from autoregressive models. In The Thirteenth International Conference on Learning Representations, 2025.
- [35] Shansan Gong, Mukai Li, Jiangtao Feng, Zhiyong Wu, and Lingpeng Kong. DiffuSeq: Sequence to sequence text generation with diffusion models. In The Eleventh International Conference on Learning Representations, 2023.
- [36] Yaron Lipman, Marton Havasi, Peter Holderrieth, Neta Shaul, Matt Le, Brian Karrer, Ricky T. Q. Chen, David Lopez-Paz, Heli Ben-Hamu, and Itai Gat. Flow matching guide and code, 2024.
- [37] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations, 2023.
- [38] Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024.
- [39] Hyung Won Chung, Noah Constant, Xavier Garcia, Adam Roberts, Yi Tay, Sharan Narang, and Orhan Firat. UniMax: Fairer and more effective language sampling for large-scale multilingual pretraining. arXiv preprint arXiv:2304.09151, 2023.
- [40] Jacob Austin, Daniel D. Johnson, Jonathan Ho, Daniel Tarlow, and Rianne van den Berg. Structured denoising diffusion models in discrete state-spaces. Advances in Neural Information Processing Systems, 34:17981–17993, 2021.
- [41] Xiang Li, John Thickstun, Ishaan Gulrajani, Percy S. Liang, and Tatsunori B. Hashimoto. Diffusion-LM improves controllable text generation. Advances in Neural Information Processing Systems, 35:4328–4343, 2022.
- [42] Robert A. Jacobs, Michael I. Jordan, Steven J. Nowlan, and Geoffrey E. Hinton. Adaptive mixtures of local experts. Neural Computation, 3(1):79–87, 1991.
- [43] Damai Dai, Chengqi Deng, Chenggang Zhao, R. X. Xu, Huazuo Gao, Deli Chen, Jiashi Li, Wangding Zeng, Xingkai Yu, Yu Wu, et al. DeepSeekMoE: Towards ultimate expert specialization in mixture-of-experts language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1280–1297, 2024.
- [44] Yang Jiao, Haibo Qiu, Zequn Jie, Shaoxiang Chen, Jingjing Chen, Lin Ma, and Yu-Gang Jiang. UniToken: Harmonizing multimodal understanding and generation through unified visual encoding. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 3600–3610, 2025.
- [45] Lijiang Li, Zuwei Long, Yunhang Shen, Heting Gao, Haoyu Cao, Xing Sun, Caifeng Shan, Ran He, and Chaoyou Fu. Omni-Diffusion: Unified multimodal understanding and generation with masked discrete diffusion. arXiv preprint arXiv:2603.06577, 2026.
- [46] Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanja Fidler, and Karsten Kreis. Align your latents: High-resolution video synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22563–22575, 2023.
- [47] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
- [48] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Leo Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, et al. Video generation models as world simulators. OpenAI Blog, 1(8):1, 2024.
- [49] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022.
- [50] Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022.
- [51] Aaron Lou, Chenlin Meng, and Stefano Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834, 2023.
- [52] Subham S. Sahoo, Marianne Arriola, Yair Schiff, Aaron Gokaslan, Edgar Marroquin, Justin T. Chiu, Alexander Rush, and Volodymyr Kuleshov. Simple and effective masked diffusion language models. Advances in Neural Information Processing Systems, 37:130136–130184, 2024.
- [53] Shen Nie, Fengqi Zhu, Zebin You, Xiaolu Zhang, Jingyang Ou, Jun Hu, Jun Zhou, Yankai Lin, Ji-Rong Wen, and Chongxuan Li. Large language diffusion models. arXiv preprint arXiv:2502.09992, 2025.