arxiv: 2605.31603 · v1 · pith:SJLAKAN5new · submitted 2026-05-29 · 💻 cs.CV · cs.AI

Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models

Pith reviewed 2026-06-28 22:54 UTC · model grok-4.3

keywords: video generation · unified models · latent space bridging · frequency handoff · reasoning-driven synthesis · VBench · VR-Bench

The pith

A two-stage framework trains only a lightweight generator then hands off to a pretrained high-capacity one via progressive frequency bridging in shared latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Connector-based video unified models cannot afford to train large high-fidelity generators inside the unified loop, so visual quality stays limited. Lumos-Nexus trains only a lightweight generator aligned to the understanding block so it learns reasoning-driven semantic control, then at inference applies Unified Progressive Frequency Bridging to hand generation over to a frozen high-capacity pretrained generator inside the same homogeneous latent space. The handoff produces coarse-to-fine refinement that raises visual realism and temporal coherence on VBench while preserving the reasoning performance measured on the new VR-Bench. The design therefore separates the cost of learning semantic control from the cost of rendering high-fidelity output.

Core claim

Lumos-Nexus is a training-efficient unified video generation framework that uses a two-stage design. During training only a lightweight generator is aligned with the understanding block to acquire reasoning-driven semantic control. At inference Unified Progressive Frequency Bridging progressively hands generation to a high-capacity pretrained generator inside the shared homogeneous latent space, enabling coarse-to-fine refinement that yields high-fidelity videos without loss of reasoning quality. The paper also introduces VR-Bench to measure a model’s ability to translate inferred intent into coherent, semantically aligned video.

What carries the argument

Unified Progressive Frequency Bridging (UPFB): a progressive handoff mechanism that transfers generation from a lightweight trained generator to a high-capacity pretrained generator inside a shared homogeneous latent space.
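The abstract does not spell out how UPFB operates, so the following is only a minimal sketch of one way a frequency-domain handoff between two generators that share a latent space could look. Everything in it is an assumption made for illustration: the function names (`low_pass_mask`, `blend_frequencies`, `handoff_cutoffs`), the radial low-pass mask, the linear cutoff schedule, and the random arrays standing in for latents. It is not the authors' implementation. A real sampler would presumably apply such a blend to the current latents at each denoising step rather than to two fixed arrays.

```python
import numpy as np


def low_pass_mask(shape, cutoff):
    """Boolean mask that is True where the normalized frequency radius <= cutoff."""
    # np.fft.fftfreq returns frequencies in cycles per sample, in [-0.5, 0.5).
    axes = [np.fft.fftfreq(n) for n in shape]
    grids = np.meshgrid(*axes, indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    return radius <= cutoff


def blend_frequencies(z_light, z_heavy, cutoff):
    """Keep the band at or below `cutoff` from z_light, the rest from z_heavy."""
    if z_light.shape != z_heavy.shape:
        raise ValueError("both latents must share one shape (same latent space)")
    mask = low_pass_mask(z_light.shape, cutoff)
    mixed = np.where(mask, np.fft.fftn(z_light), np.fft.fftn(z_heavy))
    # The mask is symmetric under frequency negation, so the inverse transform
    # is real up to rounding error; .real drops that residue.
    return np.fft.ifftn(mixed).real


def handoff_cutoffs(num_steps, start=0.9, end=0.05):
    """Hypothetical linear schedule: the lightweight band shrinks each step."""
    return np.linspace(start, end, num_steps)


# Toy latents shaped (frames, height, width); random stand-ins for real latents.
rng = np.random.default_rng(0)
z_light = rng.standard_normal((4, 8, 8))
z_heavy = rng.standard_normal((4, 8, 8))
for cutoff in handoff_cutoffs(4):
    blended = blend_frequencies(z_light, z_heavy, cutoff)
    share = low_pass_mask(z_light.shape, cutoff).mean()
    print(f"cutoff={cutoff:.2f}  lightweight share of spectrum={share:.2f}")
```

In this sketch the lightweight generator always keeps the lowest frequencies (coarse layout) while the high-capacity generator takes over a growing part of the spectrum (fine detail). The blend is only meaningful if both latents live in the same space, which is the homogeneity premise the paper relies on.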

If this is right

  • High-fidelity video output becomes reachable without including the large generator in the full training loop.
  • Reasoning capabilities acquired on the lightweight model survive the frequency-bridging stage.
  • VBench realism and temporal coherence improve while VR-Bench scores remain competitive.
  • New benchmarks like VR-Bench become usable to test intent-to-video translation separately from pixel quality.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same lightweight-training-plus-handoff pattern could be tried on image or audio unified models where full joint training is also prohibitive.
  • If the latent-space homogeneity holds across different generator families, the method might reduce the need for ever-larger unified training runs.
  • The separation of semantic-control training from fidelity rendering suggests a general route to scaling reasoning-driven generation without proportional compute growth.

Load-bearing premise

The homogeneous latent space shared by the lightweight and high-capacity generators allows UPFB to transfer control without losing the reasoning-driven semantic signals learned during lightweight training.

What would settle it

Run the same prompts on VR-Bench with and without the UPFB handoff; if semantic alignment or intent translation scores drop after the handoff, the claim that reasoning quality is preserved fails.
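The with-versus-without comparison described above is a paired design, since each prompt is scored twice. A minimal sketch of how such a comparison could be summarized follows, assuming per-prompt VR-Bench scores are available as plain numbers. The function name `paired_bootstrap_ci`, the bootstrap settings, and the toy scores are hypothetical. VR-Bench's actual scoring protocol is not described in the abstract.

```python
import numpy as np


def paired_bootstrap_ci(scores_without, scores_with, n_boot=10_000, alpha=0.05, seed=0):
    """Mean per-prompt change (with - without) and a percentile bootstrap CI."""
    before = np.asarray(scores_without, dtype=float)
    after = np.asarray(scores_with, dtype=float)
    if before.ndim != 1 or before.shape != after.shape or before.size == 0:
        raise ValueError("expected two non-empty 1-D score lists of equal length")
    diffs = after - before  # one difference per prompt keeps the pairing intact
    rng = np.random.default_rng(seed)
    # Resample prompts with replacement; each row is one bootstrap replicate.
    idx = rng.integers(0, diffs.size, size=(n_boot, diffs.size))
    boot_means = diffs[idx].mean(axis=1)
    lo, hi = np.quantile(boot_means, [alpha / 2, 1 - alpha / 2])
    return diffs.mean(), lo, hi


# Made-up scores for five prompts, purely to show the call.
without_handoff = [0.71, 0.64, 0.80, 0.58, 0.69]
with_handoff = [0.70, 0.66, 0.79, 0.57, 0.70]
mean_change, ci_low, ci_high = paired_bootstrap_ci(without_handoff, with_handoff)
print(f"mean change = {mean_change:+.3f}, 95% CI = [{ci_low:+.3f}, {ci_high:+.3f}]")
```

Under this reading, an interval lying entirely below zero would count against the claim that reasoning quality is preserved, while an interval straddling zero would be consistent with it, subject to the number of prompts.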

Original abstract

Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Lumos-Nexus, a two-stage framework for efficient video unified models. Stage 1 aligns a lightweight generator with an understanding block during training to acquire reasoning-driven semantic control. Stage 2 introduces Unified Progressive Frequency Bridging (UPFB) at inference to progressively hand off generation to a high-capacity pretrained generator within a claimed homogeneous shared latent space, enabling coarse-to-fine refinement for higher visual fidelity without compromising reasoning quality. The work also introduces VR-Bench to evaluate translation of inferred intent into coherent, semantically aligned video, and reports substantial gains in visual realism and temporal coherence on VBench alongside strong reasoning performance on VR-Bench.

Significance. If the central claims on latent homogeneity and lossless handoff hold, the framework would meaningfully reduce the computational barrier to incorporating high-fidelity generators into unified video models while preserving instruction-grounded reasoning, and VR-Bench would provide a needed evaluation resource for this capability. The two-stage separation and frequency-bridging mechanism are conceptually attractive for scaling.

major comments (2)
  1. [Abstract / Method (UPFB description)] The abstract and method description assert that the homogeneous latent space plus UPFB permits progressive handoff 'without compromising reasoning quality,' yet no ablations isolate UPFB's contribution to VR-Bench scores, no before-vs-after reasoning metrics are reported, and no quantitative evidence (e.g., distribution matching, frequency alignment statistics, or higher-order moment comparisons) is supplied to establish latent homogeneity between the lightweight and pretrained generators.
  2. [Experiments section] The central performance claims on VBench and VR-Bench rest on experiments whose details (baselines, data selection rules, error bars, number of runs, or statistical significance) are not provided in the manuscript, making it impossible to assess whether the reported gains are robust or could be explained by differences in training compute or data.
minor comments (2)
  1. [Method] Notation for the shared latent space and the precise definition of 'homogeneous' should be formalized with an equation or explicit metric rather than left descriptive.
  2. [VR-Bench introduction] The VR-Bench construction (task categories, evaluation protocol, human vs. automatic scoring) needs a dedicated subsection with examples to allow reproducibility.
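The referee's request for quantitative homogeneity evidence could be met in several ways. The sketch below shows one simple diagnostic, a comparison of the first four moments of two sets of latents. It is an illustration only: the helper names (`standardized_moments`, `moment_gaps`) and the synthetic arrays are assumptions, and matching moments is a necessary rather than sufficient sign that two generators share a latent distribution.

```python
import numpy as np


def standardized_moments(x):
    """Mean, variance, skewness, and excess kurtosis of all values in x."""
    values = np.asarray(x, dtype=float).ravel()
    mean = values.mean()
    std = values.std()
    if std == 0:
        raise ValueError("constant input has undefined skewness and kurtosis")
    z = (values - mean) / std
    return {
        "mean": mean,
        "variance": std ** 2,
        "skewness": (z ** 3).mean(),
        "excess_kurtosis": (z ** 4).mean() - 3.0,
    }


def moment_gaps(latents_a, latents_b):
    """Absolute gap between the two sets for each of the four moments."""
    moments_a = standardized_moments(latents_a)
    moments_b = standardized_moments(latents_b)
    return {name: abs(moments_a[name] - moments_b[name]) for name in moments_a}


# Synthetic stand-ins for latents from the lightweight and pretrained generators.
rng = np.random.default_rng(0)
latents_light = rng.standard_normal((16, 4, 8, 8))
latents_heavy = rng.standard_normal((16, 4, 8, 8)) * 1.1
for name, gap in moment_gaps(latents_light, latents_heavy).items():
    print(f"{name}: gap = {gap:.4f}")
```

A fuller analysis would also compare frequency content and use a proper two-sample distance, as the referee's list of examples suggests.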

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key areas where additional evidence and experimental details would strengthen the manuscript. We address each major comment below and outline the revisions we will make.

Point-by-point responses
  1. Referee: [Abstract / Method (UPFB description)] The abstract and method description assert that the homogeneous latent space plus UPFB permits progressive handoff 'without compromising reasoning quality,' yet no ablations isolate UPFB's contribution to VR-Bench scores, no before-vs-after reasoning metrics are reported, and no quantitative evidence (e.g., distribution matching, frequency alignment statistics, or higher-order moment comparisons) is supplied to establish latent homogeneity between the lightweight and pretrained generators.

    Authors: We agree that the current manuscript lacks explicit ablations and quantitative support for the homogeneity claim and UPFB's effect on reasoning metrics. The framework design assumes a shared latent space enables lossless handoff, but additional evidence is warranted. In the revised version, we will add: (i) ablations isolating UPFB's contribution to VR-Bench scores, (ii) before-vs-after reasoning metrics on VR-Bench, and (iii) quantitative analyses including distribution matching, frequency alignment statistics, and higher-order moment comparisons between the latent representations of the lightweight and pretrained generators.
    Revision: yes

  2. Referee: [Experiments section] The central performance claims on VBench and VR-Bench rest on experiments whose details (baselines, data selection rules, error bars, number of runs, or statistical significance) are not provided in the manuscript, making it impossible to assess whether the reported gains are robust or could be explained by differences in training compute or data.

    Authors: We acknowledge that the manuscript omits these critical experimental details. In the revised manuscript, we will expand the Experiments section to fully specify the baselines, data selection rules, error bars, number of runs, and statistical significance tests. This will enable readers to evaluate the robustness of the reported gains on VBench and VR-Bench.
    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper presents a two-stage training/inference design for Lumos-Nexus that aligns a lightweight generator with an understanding block and uses UPFB for handoff to a pretrained generator in a shared latent space. No load-bearing claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The homogeneous latent space and UPFB are introduced as architectural choices whose effects are evaluated via external benchmarks (VBench, VR-Bench); no equations or predictions are shown to be equivalent to their own inputs. This is the normal non-circular case for a methods paper whose central results rest on empirical evaluation rather than internal re-derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that a homogeneous latent space exists and supports seamless progressive bridging between the lightweight and high-capacity generators; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • Domain assumption: a homogeneous latent space exists between the lightweight generator and the high-capacity pretrained generator that permits progressive frequency bridging without degrading reasoning quality.
    This premise is required for the UPFB inference stage described in the abstract.


