Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Jiazheng Xing, Hangjie Yuan, Lingling Cai, Xinyu Liu, Yujie Wei, Fei Du, Hai Ci, Tao Feng, Jiasheng Tang, Weihua Chen, Fan Wang, Yong Liu

arxiv: 2605.31603 · v1 · pith:SJLAKAN5new · submitted 2026-05-29 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-28 22:54 UTC · model grok-4.3

keywords: video generation, unified models, latent space bridging, frequency handoff, reasoning-driven synthesis, VBench, VR-Bench

The pith

A two-stage framework trains only a lightweight generator, then at inference hands generation off to a pretrained high-capacity generator via progressive frequency bridging in a shared latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Connector-based video unified models cannot afford to train large high-fidelity generators inside the unified loop, so visual quality stays limited. Lumos-Nexus trains only a lightweight generator aligned to the understanding block so it learns reasoning-driven semantic control, then at inference applies Unified Progressive Frequency Bridging to hand generation over to a frozen high-capacity pretrained generator inside the same homogeneous latent space. The handoff produces coarse-to-fine refinement that raises visual realism and temporal coherence on VBench while preserving the reasoning performance measured on the new VR-Bench. The design therefore separates the cost of learning semantic control from the cost of rendering high-fidelity output.

Core claim

Lumos-Nexus is a training-efficient unified video generation framework that uses a two-stage design. During training only a lightweight generator is aligned with the understanding block to acquire reasoning-driven semantic control. At inference Unified Progressive Frequency Bridging progressively hands generation to a high-capacity pretrained generator inside the shared homogeneous latent space, enabling coarse-to-fine refinement that yields high-fidelity videos without loss of reasoning quality. The paper also introduces VR-Bench to measure a model’s ability to translate inferred intent into coherent, semantically aligned video.

What carries the argument

Unified Progressive Frequency Bridging (UPFB): a progressive handoff mechanism that transfers generation from a lightweight trained generator to a high-capacity pretrained generator inside a shared homogeneous latent space.
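
The source does not give UPFB's equations, so the following is only a toy sketch of what a progressive frequency handoff could look like. It rests on assumptions made here, not on the paper: both generators emit latents of the same shape (reduced to a 1D list of floats), the lightweight latent supplies every frequency band at or below a cutoff, the high-capacity latent supplies the rest, and the cutoff shrinks step by step. The names `dft`, `idft`, `bridge`, and `cutoff_schedule` are hypothetical.

```python
import cmath


def dft(signal):
    """Naive discrete Fourier transform of a list of floats."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft; returns the real part of each sample."""
    n = len(spectrum)
    return [
        (sum(spectrum[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n).real
        for t in range(n)
    ]


def bridge(z_light, z_heavy, cutoff):
    """Keep bands with frequency <= cutoff from z_light, the rest from z_heavy.

    Assumption: both latents live in the same space and have equal length.
    """
    n = len(z_light)
    spec_light, spec_heavy = dft(z_light), dft(z_heavy)
    mixed = []
    for k in range(n):
        freq = min(k, n - k)  # distance from DC, so the mask stays symmetric
        mixed.append(spec_light[k] if freq <= cutoff else spec_heavy[k])
    return idft(mixed)


def cutoff_schedule(n, steps):
    """Hypothetical linear schedule over steps >= 2.

    cutoff == n // 2 keeps every band from the light latent;
    cutoff == -1 keeps none, so the heavy latent has fully taken over.
    """
    top = n // 2
    return [top - round((top + 1) * s / (steps - 1)) for s in range(steps)]
```

With this schedule the first step reproduces the lightweight latent and the last step reproduces the high-capacity latent; intermediate steps keep the coarse (low-frequency) structure from the former while the latter fills in finer bands. Whether the actual method blends in this direction, or per generation step at all, is not stated in the source.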

If this is right

High-fidelity video output becomes reachable without including the large generator in the full training loop.
Reasoning capabilities acquired on the lightweight model survive the frequency-bridging stage.
VBench realism and temporal coherence improve while VR-Bench scores remain competitive.
New benchmarks like VR-Bench become usable to test intent-to-video translation separately from pixel quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same lightweight-training-plus-handoff pattern could be tried on image or audio unified models where full joint training is also prohibitive.
If the latent-space homogeneity holds across different generator families, the method might reduce the need for ever-larger unified training runs.
The separation of semantic-control training from fidelity rendering suggests a general route to scaling reasoning-driven generation without proportional compute growth.

Load-bearing premise

The homogeneous latent space shared by the lightweight and high-capacity generators allows UPFB to transfer control without losing the reasoning-driven semantic signals learned during lightweight training.

What would settle it

Run the same prompts on VR-Bench with and without the UPFB handoff; if semantic alignment or intent translation scores drop after the handoff, the claim that reasoning quality is preserved fails.
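
A minimal sketch of that check, assuming per-prompt VR-Bench scores are available as two equal-length lists, one without the handoff and one with it, in the same prompt order. The function name `paired_sign_flip_test` is hypothetical, and the sign-flip permutation test is one common choice for paired data, not something the source prescribes.

```python
import random
from statistics import mean


def paired_sign_flip_test(before, after, n_resamples=10_000, seed=0):
    """Return (mean_diff, p_value) for a one-sided test that scores dropped.

    before, after: per-prompt scores for the same prompts, same order.
    p_value estimates how often a mean drop at least this large would arise
    if the handoff had no systematic effect (difference signs are random).
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two non-empty lists of equal length")
    diffs = [a - b for a, b in zip(after, before)]
    observed = mean(diffs)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        flipped = mean(d if rng.random() < 0.5 else -d for d in diffs)
        if flipped <= observed:
            hits += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return observed, (hits + 1) / (n_resamples + 1)
```

A negative `mean_diff` together with a small `p_value` would count against the claim that reasoning quality survives the handoff on that prompt set; a `mean_diff` near zero would be consistent with it.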

Original abstract

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Lumos-Nexus tries to cut training cost in unified video models by training a small generator, then bridging to a big one at inference via UPFB in a shared latent space. It also adds a new reasoning benchmark, but the abstract supplies no ablations or checks on whether the handoff preserves control.

The main thing here is a two-stage setup: train only the lightweight generator against the understanding block, then at inference use Unified Progressive Frequency Bridging to hand off generation to a high-capacity pretrained model inside the same latent space. They also introduce VR-Bench to measure how well models turn inferred intent into coherent video.

What is actually new is the specific UPFB handoff mechanism and the VR-Bench benchmark itself. The two-stage split is a direct response to the compute problem of folding large generators into unified training, and the shared-latent-space idea is a clean way to aim for both reasoning fidelity and visual quality.

The paper does a reasonable job naming the practical bottleneck and sketching a workflow that keeps training cheap while promising high-fidelity output. The intent to evaluate both visual metrics on VBench and reasoning on the new benchmark is the right framing.

The soft spots are clear and central. The whole claim rests on the latent space being homogeneous enough for the progressive handoff to work without losing semantic control from the reasoning block. The abstract states the design but shows no ablations that isolate UPFB, no distribution-matching numbers, and no reasoning scores before versus after the bridge. Without those, it is impossible to know whether the reported gains on VBench come at the expense of the VR-Bench performance or whether the homogeneity assumption holds. The stress-test concern about possible degradation in reasoning control therefore lands.

This is for people working on scalable video generation and multimodal unified models who care about training efficiency. A reader who wants to see whether the frequency-bridging trick actually delivers would get value from the full experiments.

It deserves peer review because the problem is real and the proposed split is novel enough to test, even though the current evidence is too thin to judge the central mechanism.

Referee Report

2 major / 2 minor

Summary. The paper proposes Lumos-Nexus, a two-stage framework for efficient video unified models. Stage 1 aligns a lightweight generator with an understanding block during training to acquire reasoning-driven semantic control. Stage 2 introduces Unified Progressive Frequency Bridging (UPFB) at inference to progressively hand off generation to a high-capacity pretrained generator within a claimed homogeneous shared latent space, enabling coarse-to-fine refinement for higher visual fidelity without compromising reasoning quality. The work also introduces VR-Bench to evaluate translation of inferred intent into coherent, semantically aligned video, and reports substantial gains in visual realism and temporal coherence on VBench alongside strong reasoning performance on VR-Bench.

Significance. If the central claims on latent homogeneity and lossless handoff hold, the framework would meaningfully reduce the computational barrier to incorporating high-fidelity generators into unified video models while preserving instruction-grounded reasoning, and VR-Bench would provide a needed evaluation resource for this capability. The two-stage separation and frequency-bridging mechanism are conceptually attractive for scaling.

major comments (2)

[Abstract / Method (UPFB description)] The abstract and method description assert that the homogeneous latent space plus UPFB permits progressive handoff 'without compromising reasoning quality,' yet no ablations isolate UPFB's contribution to VR-Bench scores, no before-vs-after reasoning metrics are reported, and no quantitative evidence (e.g., distribution matching, frequency alignment statistics, or higher-order moment comparisons) is supplied to establish latent homogeneity between the lightweight and pretrained generators.
[Experiments section] The central performance claims on VBench and VR-Bench rest on experiments whose details (baselines, data selection rules, error bars, number of runs, or statistical significance) are not provided in the manuscript, making it impossible to assess whether the reported gains are robust or could be explained by differences in training compute or data.
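
One of the checks named in the first major comment, a higher-order moment comparison, could be sketched as below. It assumes latent activations from each generator have been flattened into plain lists of floats. `moments` and `moment_gap` are hypothetical helper names, and the source implies no threshold for what gap would count as "homogeneous enough".

```python
from statistics import fmean, pstdev


def moments(values):
    """Return (mean, std, skewness, excess kurtosis) using population formulas."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    if sigma == 0:
        raise ValueError("constant input has undefined skewness and kurtosis")
    standardized = [(v - mu) / sigma for v in values]
    skew = fmean(z ** 3 for z in standardized)
    kurt = fmean(z ** 4 for z in standardized) - 3.0
    return mu, sigma, skew, kurt


def moment_gap(latents_a, latents_b):
    """Absolute difference of each of the four moments between two latent samples."""
    return tuple(abs(a - b) for a, b in zip(moments(latents_a), moments(latents_b)))
```

Matching moments are a necessary sign of similar distributions, not a sufficient one, so a small gap here would support the homogeneity assumption without proving it.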

minor comments (2)

[Method] Notation for the shared latent space and the precise definition of 'homogeneous' should be formalized with an equation or explicit metric rather than left descriptive.
[VR-Bench introduction] The VR-Bench construction (task categories, evaluation protocol, human vs. automatic scoring) needs a dedicated subsection with examples to allow reproducibility.
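
Purely as an illustration of the kind of formalization the first minor comment asks for, one possible definition is sketched below. Every symbol in it is an assumption introduced here and does not appear in the source: $\mathcal{E}$ is a shared video encoder, $G_s$ and $G_l$ are the lightweight and high-capacity generators, $p_s^{(t)}$ and $p_l^{(t)}$ are the distributions of their intermediate latents at generation step $t$, $D$ is some divergence (for example a Fréchet distance), and $\varepsilon$ is a tolerance.

```latex
% Hypothetical definition, not taken from the paper.
% Both generators operate on latents produced by the same encoder:
z = \mathcal{E}(x), \qquad z \in \mathbb{R}^{C \times T \times H \times W}

% The latent space is called homogeneous at tolerance \varepsilon when the
% two generators' intermediate latent distributions stay close at every step:
\max_{t} \; D\!\left(p_s^{(t)},\, p_l^{(t)}\right) \le \varepsilon
```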

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional evidence and experimental details would strengthen the manuscript. We address each major comment below and outline the revisions we will make.

Referee: [Abstract / Method (UPFB description)] The abstract and method description assert that the homogeneous latent space plus UPFB permits progressive handoff 'without compromising reasoning quality,' yet no ablations isolate UPFB's contribution to VR-Bench scores, no before-vs-after reasoning metrics are reported, and no quantitative evidence (e.g., distribution matching, frequency alignment statistics, or higher-order moment comparisons) is supplied to establish latent homogeneity between the lightweight and pretrained generators.

Authors: We agree that the current manuscript lacks explicit ablations and quantitative support for the homogeneity claim and UPFB's effect on reasoning metrics. The framework design assumes a shared latent space enables lossless handoff, but additional evidence is warranted. In the revised version, we will add: (i) ablations isolating UPFB's contribution to VR-Bench scores, (ii) before-vs-after reasoning metrics on VR-Bench, and (iii) quantitative analyses including distribution matching, frequency alignment statistics, and higher-order moment comparisons between the latent representations of the lightweight and pretrained generators.

Revision: yes

Referee: [Experiments section] The central performance claims on VBench and VR-Bench rest on experiments whose details (baselines, data selection rules, error bars, number of runs, or statistical significance) are not provided in the manuscript, making it impossible to assess whether the reported gains are robust or could be explained by differences in training compute or data.

Authors: We acknowledge that the manuscript omits these critical experimental details. In the revised manuscript, we will expand the Experiments section to fully specify the baselines, data selection rules, error bars, number of runs, and statistical significance tests. This will enable readers to evaluate the robustness of the reported gains on VBench and VR-Bench.

Revision: yes
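
The error-bar reporting promised in this response could be produced along the following lines. This is a sketch under the assumption that each independent run yields one scalar benchmark score; `summarize_runs` and `format_with_error` are hypothetical names, and the source does not say how many runs or which error statistic the authors will use.

```python
from math import sqrt
from statistics import mean, stdev


def summarize_runs(scores):
    """Return (mean, standard_error) over independent runs.

    Uses the sample standard deviation (n - 1 denominator), so at least
    two runs are required.
    """
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate an error bar")
    return mean(scores), stdev(scores) / sqrt(len(scores))


def format_with_error(scores, digits=2):
    """Render 'mean ± standard error' as a string for a results table."""
    m, se = summarize_runs(scores)
    return f"{m:.{digits}f} ± {se:.{digits}f}"
```

Reporting the standard error alongside the number of runs lets a reader judge whether a gap between two systems is larger than run-to-run noise.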

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper presents a two-stage training/inference design for Lumos-Nexus that aligns a lightweight generator with an understanding block and uses UPFB for handoff to a pretrained generator in a shared latent space. No load-bearing claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The homogeneous latent space and UPFB are introduced as architectural choices whose effects are evaluated via external benchmarks (VBench, VR-Bench); no equations or predictions are shown to be equivalent to their own inputs. This is the normal non-circular case for a methods paper whose central results rest on empirical evaluation rather than internal re-derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that a homogeneous latent space exists and supports seamless progressive bridging between the lightweight and high-capacity generators; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

Domain assumption: A homogeneous latent space exists between the lightweight generator and the high-capacity pretrained generator that permits progressive frequency bridging without degrading reasoning quality.
This premise is required for the UPFB inference stage described in the abstract.

Reference graph

Works this paper leans on

62 extracted references · 29 linked inside Pith

[1]

Ming-omni: A unified multimodal model for perception and generation.arXiv preprint arXiv:2506.09344, 2025

Inclusion AI, Biao Gong, Cheng Zou, Chuanyang Zheng, Chunluan Zhou, Canxiang Yan, Chunxiang Jin, Chunjie Shen, Dandan Zheng, Fudong Wang, et al. Ming-omni: A unified multimodal model for perception and generation.arXiv preprint arXiv:2506.09344, 2025

arXiv 2025
[2]

Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Pith/arXiv arXiv 2023
[3]

Videocrafter2: Overcoming data limitations for high-quality video diffusion models

Haoxin Chen, Yong Zhang, Xiaodong Cun, Menghan Xia, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter2: Overcoming data limitations for high-quality video diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7310–7320, 2024

2024
[4]

Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025

Jiuhai Chen, Zhiyang Xu, Xichen Pan, Yushi Hu, Can Qin, Tom Goldstein, Lifu Huang, Tianyi Zhou, Saining Xie, Silvio Savarese, et al. Blip3-o: A family of fully open unified multimodal models-architecture, training and dataset.arXiv preprint arXiv:2505.09568, 2025

Pith/arXiv arXiv 2025
[5]

Univid: Unifying vision tasks with pre-trained video generation models.arXiv preprint arXiv:2509.21760, 2025

Lan Chen, Yuchao Gu, and Qi Mao. Univid: Unifying vision tasks with pre-trained video generation models.arXiv preprint arXiv:2509.21760, 2025. 11

arXiv 2025
[6]

Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

Pith/arXiv arXiv 2025
[7]

Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining.arXiv preprint arXiv:2505.14683, 2025

Pith/arXiv arXiv 2025
[8]

Autoregressive video generation without vector quantization.arXiv preprint arXiv:2412.14169, 2024

Haoge Deng, Ting Pan, Haiwen Diao, Zhengxiong Luo, Yufeng Cui, Huchuan Lu, Shiguang Shan, Yonggang Qi, and Xinlong Wang. Autoregressive video generation without vector quantization.arXiv preprint arXiv:2412.14169, 2024

Pith/arXiv arXiv 2024
[9]

Dreamllm: Synergistic multimodal comprehension and creation.arXiv preprint arXiv:2309.11499, 2023

Runpei Dong, Chunrui Han, Yuang Peng, Zekun Qi, Zheng Ge, Jinrong Yang, Liang Zhao, Jianjian Sun, Hongyu Zhou, Haoran Wei, et al. Dreamllm: Synergistic multimodal comprehension and creation.arXiv preprint arXiv:2309.11499, 2023

arXiv 2023
[10]

Scaling rectified flow transformers for high-resolution image synthesis

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. 2024

2024
[12]

Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprint arXiv:2404.14396, 2024

Yuying Ge, Sijie Zhao, Jinguo Zhu, Yixiao Ge, Kun Yi, Lin Song, Chen Li, Xiaohan Ding, and Ying Shan. Seed-x: Multimodal models with unified multi-granularity comprehension and generation.arXiv preprint arXiv:2404.14396, 2024

Pith/arXiv arXiv 2024
[13]

Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. Geneval: An object-focused framework for evaluating text-to-image alignment.Advances in Neural Information Processing Systems, 36:52132–52152, 2023

2023
[14]

Google DeepMind. Veo 3.1. https://deepmind.google/models/veo/, 10 2025

2025
[15]

Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

Yoav HaCohen, Nisan Chiprut, Benny Brazowski, Daniel Shalem, Dudu Moshe, Eitan Richardson, Eran Levin, Guy Shiran, Nir Zabari, Ori Gordon, et al. Ltx-video: Realtime video latent diffusion.arXiv preprint arXiv:2501.00103, 2024

Pith/arXiv arXiv 2024
[16]

Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

Wenyi Hong, Ming Ding, Wendi Zheng, Xinghan Liu, and Jie Tang. Cogvideo: Large-scale pretraining for text-to-video generation via transformers.arXiv preprint arXiv:2205.15868, 2022

Pith/arXiv arXiv 2022
[17]

Vbench: Comprehensive benchmark suite for video generative models

Ziqi Huang, Yinan He, Jiashuo Yu, Fan Zhang, Chenyang Si, Yuming Jiang, Yuanhan Zhang, Tianxing Wu, Qingyang Jin, Nattapol Chanpaisit, et al. Vbench: Comprehensive benchmark suite for video generative models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21807–21818, 2024

2024
[18]

Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954, 2024

Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling.arXiv preprint arXiv:2410.05954, 2024

arXiv 2024
[19]

A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

Maurice G Kendall. A new measure of rank correlation.Biometrika, 30(1-2):81–93, 1938

1938
[20]

Text2video-zero: Text-to-image diffusion models are zero-shot video generators

Levon Khachatryan, Andranik Movsisyan, Vahram Tadevosyan, Roberto Henschel, Zhangyang Wang, Shant Navasardyan, and Humphrey Shi. Text2video-zero: Text-to-image diffusion models are zero-shot video generators. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 15954–15964, 2023

2023
[21]

Videopoet: A large language model for zero-shot video generation.arXiv preprint arXiv:2312.14125, 2023

Dan Kondratyuk, Lijun Yu, Xiuye Gu, José Lezama, Jonathan Huang, Grant Schindler, Rachel Hornung, Vighnesh Birodkar, Jimmy Yan, Ming-Chang Chiu, et al. Videopoet: A large language model for zero-shot video generation.arXiv preprint arXiv:2312.14125, 2023. 12

Pith/arXiv arXiv 2023
[22]

Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Weijie Kong, Qi Tian, Zijian Zhang, Rox Min, Zuozhuo Dai, Jin Zhou, Jiangfeng Xiong, Xin Li, Bo Wu, Jianwei Zhang, et al. Hunyuanvideo: A systematic framework for large video generative models.arXiv preprint arXiv:2412.03603, 2024

Pith/arXiv arXiv 2024
[23]

Kling 2.6

Kuaishou. Kling 2.6. https://www.kling26.com/, 12 2025

2025
[24]

Flux.https://github.com/black-forest-labs/flux, 2024

Black Forest Labs. Flux.https://github.com/black-forest-labs/flux, 2024

2024
[25]

Arlon: Boosting diffusion transformers with autoregressive models for long video generation.arXiv preprint arXiv:2410.20502, 2024

Zongyi Li, Shujie Hu, Shujie Liu, Long Zhou, Jeongsoo Choi, Lingwei Meng, Xun Guo, Jinyu Li, Hefei Ling, and Furu Wei. Arlon: Boosting diffusion transformers with autoregressive models for long video generation.arXiv preprint arXiv:2410.20502, 2024

arXiv 2024
[26]

World model on million-length video and language with blockwise ringattention.arXiv preprint arXiv:2402.08268, 2024

Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with blockwise ringattention.arXiv preprint arXiv:2402.08268, 2024

Pith/arXiv arXiv 2024
[27]

Mardini: Masked autoregressive diffusion for video generation at scale.arXiv preprint arXiv:2410.20280, 2024

Haozhe Liu, Shikun Liu, Zijian Zhou, Mengmeng Xu, Yanping Xie, Xiao Han, Juan C Pérez, Ding Liu, Kumara Kahatapitiya, Menglin Jia, et al. Mardini: Masked autoregressive diffusion for video generation at scale.arXiv preprint arXiv:2410.20280, 2024

arXiv 2024
[28]

Vdt: General- purpose video diffusion transformers via mask modeling.arXiv preprint arXiv:2305.13311, 2023

Haoyu Lu, Guoxing Yang, Nanyi Fei, Yuqi Huo, Zhiwu Lu, Ping Luo, and Mingyu Ding. Vdt: General- purpose video diffusion transformers via mask modeling.arXiv preprint arXiv:2305.13311, 2023

arXiv 2023
[29]

Univid: The open-source unified video model.arXiv preprint arXiv:2509.24200, 2025

Jiabin Luo, Junhui Lin, Zeyu Zhang, Biao Wu, Meng Fang, Ling Chen, and Hao Tang. Univid: The open-source unified video model.arXiv preprint arXiv:2509.24200, 2025

arXiv 2025
[30]

Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

Xin Ma, Yaohui Wang, Gengyun Jia, Xinyuan Chen, Ziwei Liu, Yuan-Fang Li, Cunjian Chen, and Yu Qiao. Latte: Latent diffusion transformer for video generation.arXiv preprint arXiv:2401.03048, 2024

Pith/arXiv arXiv 2024
[31]

Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation

Yiyang Ma, Xingchao Liu, Xiaokang Chen, Wen Liu, Chengyue Wu, Zhiyu Wu, Zizheng Pan, Zhenda Xie, Haowei Zhang, Xingkai Yu, et al. Janusflow: Harmonizing autoregression and rectified flow for unified multimodal understanding and generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 7739–7751, 2025

2025
[32]

Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

Xichen Pan, Satya Narayan Shukla, Aashu Singh, Zhuokai Zhao, Shlok Kumar Mishra, Jialiang Wang, Zhiyang Xu, Jiuhai Chen, Kunpeng Li, Felix Juefei-Xu, et al. Transfer between modalities with metaqueries.arXiv preprint arXiv:2504.06256, 2025

Pith/arXiv arXiv 2025
[33]

SDXL: Improving latent diffusion models for high-resolution image synthesis

Dustin Podell, Zion English, Kyle Lacey, Andreas Blattmann, Tim Dockhorn, Jonas Müller, Joe Penna, and Robin Rombach. SDXL: Improving latent diffusion models for high-resolution image synthesis. In ICLR, 2024

2024
[34]

Skyreels-a1: Expressive portrait animation in video diffusion transformers.arXiv preprint arXiv:2502.10841, 2025

Di Qiu, Zhengcong Fei, Rui Wang, Jialin Bai, Changqian Yu, Mingyuan Fan, Guibin Chen, and Xiang Wen. Skyreels-a1: Expressive portrait animation in video diffusion transformers.arXiv preprint arXiv:2502.10841, 2025

arXiv 2025
[35]

Hierarchical text-conditional image generation with CLIP latents.arXiv preprint arXiv:2204.06125, 2022

Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents.arXiv preprint arXiv:2204.06125, 2022

Pith/arXiv arXiv 2022
[36]

High-resolution image synthesis with latent diffusion models

Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022

2022
[37]

Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792, 2022

Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-a-video: Text-to-video generation without text-video data.arXiv preprint arXiv:2209.14792, 2022

Pith/arXiv arXiv 2022
[38]

Autore- gressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

Peize Sun, Yi Jiang, Shoufa Chen, Shilong Zhang, Bingyue Peng, Ping Luo, and Zehuan Yuan. Autore- gressive model beats diffusion: Llama for scalable image generation.arXiv preprint arXiv:2406.06525, 2024

Pith/arXiv arXiv 2024
[39]

Omni-video: Democratizing unified video understanding and generation.arXiv preprint arXiv:2507.06119, 2025

Zhiyu Tan, Hao Yang, Luozheng Qin, Jia Gong, Mengping Yang, and Hao Li. Omni-video: Democratizing unified video understanding and generation.arXiv preprint arXiv:2507.06119, 2025. 13

arXiv 2025
[40]

Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

Chameleon Team. Chameleon: Mixed-modal early-fusion foundation models.arXiv preprint arXiv:2405.09818, 2024

Pith/arXiv arXiv 2024
[41]

Longcat-video technical report.arXiv preprint arXiv:2510.22200, 2025

Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, et al. Longcat-video technical report.arXiv preprint arXiv:2510.22200, 2025

arXiv 2025
[42]

Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. Magi-1: Autoregressive video generation at scale.arXiv preprint arXiv:2505.13211, 2025

Pith/arXiv arXiv 2025
[43]

Metamorph: Multimodal understanding and generation via instruction tuning

Shengbang Tong, David Fan, Jiachen Li, Yunyang Xiong, Xinlei Chen, Koustuv Sinha, Michael Rabbat, Yann LeCun, Saining Xie, and Zhuang Liu. Metamorph: Multimodal understanding and generation via instruction tuning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 17001–17012, 2025

2025
[44]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, Jianyuan Zeng, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[45]

Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023

Jiuniu Wang, Hangjie Yuan, Dayou Chen, Yingya Zhang, Xiang Wang, and Shiwei Zhang. Modelscope text-to-video technical report.arXiv preprint arXiv:2308.06571, 2023

Pith/arXiv arXiv 2023
[46]

Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

Xinlong Wang, Xiaosong Zhang, Zhengxiong Luo, Quan Sun, Yufeng Cui, Jinsheng Wang, Fan Zhang, Yueze Wang, Zhen Li, Qiying Yu, et al. Emu3: Next-token prediction is all you need.arXiv preprint arXiv:2409.18869, 2024

Pith/arXiv arXiv 2024
[47]

Loong: Generating minute-level long videos with autoregressive language models.arXiv preprint arXiv:2410.02757, 2024

Yuqing Wang, Tianwei Xiong, Daquan Zhou, Zhijie Lin, Yang Zhao, Bingyi Kang, Jiashi Feng, and Xihui Liu. Loong: Generating minute-level long videos with autoregressive language models.arXiv preprint arXiv:2410.02757, 2024

arXiv 2024
[48]

Univideo: Unified understanding, generation, and editing for videos.arXiv preprint arXiv:2510.08377, 2025

Cong Wei, Quande Liu, Zixuan Ye, Qiulin Wang, Xintao Wang, Pengfei Wan, Kun Gai, and Wenhu Chen. Univideo: Unified understanding, generation, and editing for videos.arXiv preprint arXiv:2510.08377, 2025

arXiv 2025
[49]

Janus: Decoupling visual encoding for unified multimodal understanding and generation

Chengyue Wu, Xiaokang Chen, Zhiyu Wu, Yiyang Ma, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, Chong Ruan, et al. Janus: Decoupling visual encoding for unified multimodal understanding and generation. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12966–12977, 2025

2025
[50]

Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation

Jay Zhangjie Wu, Yixiao Ge, Xintao Wang, Stan Weixian Lei, Yuchao Gu, Yufei Shi, Wynne Hsu, Ying Shan, Xiaohu Qie, and Mike Zheng Shou. Tune-a-video: One-shot tuning of image diffusion models for text-to-video generation. InProceedings of the IEEE/CVF international conference on computer vision, pages 7623–7633, 2023

2023
[51]

Next-gpt: Any-to-any multimodal llm

Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. Next-gpt: Any-to-any multimodal llm. InForty-first International Conference on Machine Learning, 2024

2024
[53]

Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

Jinheng Xie, Weijia Mao, Zechen Bai, David Junhao Zhang, Weihao Wang, Kevin Qinghong Lin, Yuchao Gu, Zhijie Chen, Zhenheng Yang, and Mike Zheng Shou. Show-o: One single transformer to unify multimodal understanding and generation.arXiv preprint arXiv:2408.12528, 2024

Pith/arXiv arXiv 2024
[54]

Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

Jinheng Xie, Zhenheng Yang, and Mike Zheng Shou. Show-o2: Improved native unified multimodal models.arXiv preprint arXiv:2506.15564, 2025

Pith/arXiv arXiv 2025
[55]

Lumosx: Relate any identities with their attributes for personalized video generation

Jiazheng Xing, Fei Du, Hangjie Yuan, Pengwei Liu, Hongbin Xu, Hai Ci, Ruigang Niu, Weihua Chen, Fan Wang, and Yong Liu. Lumosx: Relate any identities with their attributes for personalized video generation. In The Fourteenth International Conference on Learning Representations, 2026

[56]

Videogpt: Video generation using vq-vae and transformers

Wilson Yan, Yunzhi Zhang, Pieter Abbeel, and Aravind Srinivas. Videogpt: Video generation using vq-vae and transformers. arXiv preprint arXiv:2104.10157, 2021

[57]

Qwen3 technical report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[58]

Mmada: Multimodal large diffusion language models

Ling Yang, Ye Tian, Bowen Li, Xinchen Zhang, Ke Shen, Yunhai Tong, and Mengdi Wang. Mmada: Multimodal large diffusion language models. arXiv preprint arXiv:2505.15809, 2025

[59]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

[60]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22963–22974, 2025

[61]

Lumos-1: On autoregressive video generation from a unified model perspective

Hangjie Yuan, Weihua Chen, Jun Cen, Hu Yu, Jingyun Liang, Shuning Chang, Zhihui Lin, Tao Feng, Pengwei Liu, Jiazheng Xing, et al. Lumos-1: On autoregressive video generation from a unified model perspective. arXiv preprint arXiv:2507.08801, 2025

[62]

Show-1: Marrying pixel and latent diffusion models for text-to-video generation

David Junhao Zhang, Jay Zhangjie Wu, Jia-Wei Liu, Rui Zhao, Lingmin Ran, Yuchao Gu, Difei Gao, and Mike Zheng Shou. Show-1: Marrying pixel and latent diffusion models for text-to-video generation. International Journal of Computer Vision, 133(4):1879–1893, 2025

[63]

Open-sora: Democratizing efficient video production for all

Zangwei Zheng, Xiangyu Peng, Tianji Yang, Chenhui Shen, Shenggui Li, Hongxin Liu, Yukun Zhou, Tianyi Li, and Yang You. Open-sora: Democratizing efficient video production for all. arXiv preprint arXiv:2412.20404, 2024

[64]

Transfusion: Predict the next token and diffuse images with one multi-modal model

Chunting Zhou, Lili Yu, Arun Babu, Kushal Tirumala, Michihiro Yasunaga, Leonid Shamis, Jacob Kahn, Xuezhe Ma, Luke Zettlemoyer, and Omer Levy. Transfusion: Predict the next token and diffuse images with one multi-modal model. arXiv preprint arXiv:2408.11039, 2024
