Lumos-Nexus: Efficient Frequency Bridging with Homogeneous Latent Space for Video Unified Models
Pith reviewed 2026-06-28 22:54 UTC · model grok-4.3
The pith
A two-stage framework trains only a lightweight generator, then at inference hands generation off to a pretrained high-capacity one via progressive frequency bridging in a shared latent space.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Lumos-Nexus is a training-efficient unified video generation framework that uses a two-stage design. During training only a lightweight generator is aligned with the understanding block to acquire reasoning-driven semantic control. At inference Unified Progressive Frequency Bridging progressively hands generation to a high-capacity pretrained generator inside the shared homogeneous latent space, enabling coarse-to-fine refinement that yields high-fidelity videos without loss of reasoning quality. The paper also introduces VR-Bench to measure a model’s ability to translate inferred intent into coherent, semantically aligned video.
What carries the argument
Unified Progressive Frequency Bridging (UPFB): a progressive handoff mechanism that transfers generation from a lightweight trained generator to a high-capacity pretrained generator inside a shared homogeneous latent space.
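To make the handoff idea concrete, here is a minimal sketch of one way a frequency-domain handoff between two generators could work, assuming both produce latents of identical shape. It is not the paper's UPFB implementation, which this page does not specify: the function names, the radial low-pass mask, and the linear cutoff schedule are all illustrative assumptions. Low spatial frequencies are taken from the lightweight generator's latent and high frequencies from the high-capacity generator's latent; shrinking the cutoff over steps gives the high-capacity model a growing share of the spectrum.

```python
import numpy as np


def radial_lowpass_mask(shape, cutoff):
    """Boolean mask over a 2D FFT grid: True where frequency radius <= cutoff.

    np.fft.fftfreq returns frequencies in cycles/sample within [-0.5, 0.5),
    so the largest radius on a 2D grid is about 0.707.
    """
    grids = np.meshgrid(*[np.fft.fftfreq(n) for n in shape], indexing="ij")
    radius = np.sqrt(sum(g ** 2 for g in grids))
    return radius <= cutoff


def frequency_blend(z_light, z_heavy, cutoff):
    """Keep low frequencies of z_light and high frequencies of z_heavy.

    Assumption: both latents live in the same space and share a shape
    (..., H, W). The FFT runs over the last two (spatial) axes only.
    """
    if z_light.shape != z_heavy.shape:
        raise ValueError("latents must share a shape")
    mask = radial_lowpass_mask(z_light.shape[-2:], cutoff)
    f_light = np.fft.fftn(z_light, axes=(-2, -1))
    f_heavy = np.fft.fftn(z_heavy, axes=(-2, -1))
    mixed = np.where(mask, f_light, f_heavy)  # mask broadcasts over leading axes
    # The mask is symmetric under k -> -k, so the result is real up to rounding.
    return np.fft.ifftn(mixed, axes=(-2, -1)).real


def cutoff_schedule(num_steps, start=0.75, end=0.0):
    """Hypothetical linear schedule: the lightweight model's band shrinks over steps."""
    return np.linspace(start, end, num_steps)
```

In a sampling loop one would call frequency_blend at each denoising step with the next value from cutoff_schedule. How the actual method interleaves this with the two denoisers is not stated in the source.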
If this is right
- High-fidelity video output becomes reachable without including the large generator in the full training loop.
- Reasoning capabilities acquired on the lightweight model survive the frequency-bridging stage.
- VBench realism and temporal coherence improve while VR-Bench scores remain competitive.
- Benchmarks such as VR-Bench can test intent-to-video translation separately from pixel quality.
Where Pith is reading between the lines
- The same lightweight-training-plus-handoff pattern could be tried on image or audio unified models where full joint training is also prohibitive.
- If the latent-space homogeneity holds across different generator families, the method might reduce the need for ever-larger unified training runs.
- The separation of semantic-control training from fidelity rendering suggests a general route to scaling reasoning-driven generation without proportional compute growth.
Load-bearing premise
The homogeneous latent space shared by the lightweight and high-capacity generators allows UPFB to transfer control without losing the reasoning-driven semantic signals learned during lightweight training.
What would settle it
Run the same prompts on VR-Bench with and without the UPFB handoff; if semantic alignment or intent translation scores drop after the handoff, the claim that reasoning quality is preserved fails.
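The comparison described above is a paired design: each prompt is scored once without and once with the handoff. A minimal sketch of how the per-prompt differences could be summarized follows, using a percentile bootstrap. The function name and the score lists are hypothetical, and VR-Bench's actual scoring interface is not described on this page.

```python
import random
from statistics import mean


def paired_bootstrap_ci(before, after, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-prompt difference (after - before).

    `before` and `after` hold one score per prompt, in the same prompt order.
    Returns (mean_difference, ci_low, ci_high).
    """
    if len(before) != len(after) or not before:
        raise ValueError("need two equal-length, non-empty score lists")
    diffs = [a - b for a, b in zip(after, before)]
    rng = random.Random(seed)
    n = len(diffs)
    # Resample prompts with replacement and record the mean difference each time.
    means = sorted(mean(rng.choices(diffs, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return mean(diffs), lo, hi
```

If the whole interval lies below zero, the scores dropped after the handoff. An interval that straddles zero does not by itself show that reasoning quality is preserved; it only fails to detect a change at that sample size.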
Original abstract
Connector-based video unified models have demonstrated strong capability in instruction-grounded video synthesis, but integrating a large high-fidelity generator into the unified training loop is computationally prohibitive, limiting achievable visual quality. We therefore propose Lumos-Nexus, a training-efficient unified video generation framework that facilitates the development of strong reasoning-driven generation capabilities while significantly enhancing visual fidelity. Lumos-Nexus adopts a two-stage design: 1) During training, only a lightweight generator is aligned with the understanding block to learn to take in reasoning-driven semantic control. 2) During inference, we introduce Unified Progressive Frequency Bridging (UPFB) to progressively hand off generation to a high-capacity pretrained generator in the shared latent space, enabling coarse-to-fine refinement and producing high-fidelity videos without compromising reasoning quality. To fill the gap in reasoning-driven video generation benchmarks, we introduce VR-Bench, which assesses a model's capability to translate inferred intent into coherent and semantically aligned video content. Extensive experiments demonstrate that Lumos-Nexus achieves substantial gains in visual realism and temporal coherence on VBench, while exhibiting strong reasoning-based generative performance on VR-Bench. Code and models are available at https://jiazheng-xing.github.io/nexus-lumos-home/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Lumos-Nexus, a two-stage framework for efficient video unified models. Stage 1 aligns a lightweight generator with an understanding block during training to acquire reasoning-driven semantic control. Stage 2 introduces Unified Progressive Frequency Bridging (UPFB) at inference to progressively hand off generation to a high-capacity pretrained generator within a claimed homogeneous shared latent space, enabling coarse-to-fine refinement for higher visual fidelity without compromising reasoning quality. The work also introduces VR-Bench to evaluate translation of inferred intent into coherent, semantically aligned video, and reports substantial gains in visual realism and temporal coherence on VBench alongside strong reasoning performance on VR-Bench.
Significance. If the central claims on latent homogeneity and lossless handoff hold, the framework would meaningfully reduce the computational barrier to incorporating high-fidelity generators into unified video models while preserving instruction-grounded reasoning, and VR-Bench would provide a needed evaluation resource for this capability. The two-stage separation and frequency-bridging mechanism are conceptually attractive for scaling.
major comments (2)
- [Abstract / Method (UPFB description)] The abstract and method description assert that the homogeneous latent space plus UPFB permits progressive handoff 'without compromising reasoning quality,' yet no ablations isolate UPFB's contribution to VR-Bench scores, no before-vs-after reasoning metrics are reported, and no quantitative evidence (e.g., distribution matching, frequency alignment statistics, or higher-order moment comparisons) is supplied to establish latent homogeneity between the lightweight and pretrained generators.
- [Experiments section] The central performance claims on VBench and VR-Bench rest on experiments whose details (baselines, data selection rules, error bars, number of runs, or statistical significance) are not provided in the manuscript, making it impossible to assess whether the reported gains are robust or could be explained by differences in training compute or data.
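The evidence the first major comment asks for (higher-order moment comparisons and frequency alignment statistics) can be sketched as follows. This is an illustration of the kind of diagnostic meant, not an analysis from the paper: the function names and the binning scheme are assumptions, and the latents would have to come from running both generators on matched inputs.

```python
import numpy as np


def standardized_moments(x):
    """Mean, variance, skewness, and excess kurtosis of a flattened array.

    Assumes the array is not constant (variance > 0).
    """
    x = np.asarray(x, dtype=np.float64).ravel()
    mu = x.mean()
    var = x.var()
    z = (x - mu) / np.sqrt(var)
    return {
        "mean": mu,
        "var": var,
        "skew": (z ** 3).mean(),
        "excess_kurtosis": (z ** 4).mean() - 3.0,
    }


def radial_power_spectrum(latent, num_bins=8):
    """Average spatial power per radial-frequency bin.

    `latent` has shape (..., H, W); power is averaged over all leading axes.
    """
    latent = np.asarray(latent, dtype=np.float64)
    h, w = latent.shape[-2:]
    power = np.abs(np.fft.fft2(latent)) ** 2  # fft2 acts on the last two axes
    power = power.reshape(-1, h, w).mean(axis=0)
    fy, fx = np.meshgrid(np.fft.fftfreq(h), np.fft.fftfreq(w), indexing="ij")
    radius = np.sqrt(fx ** 2 + fy ** 2)
    edges = np.linspace(0.0, radius.max(), num_bins + 1)
    idx = np.clip(np.digitize(radius, edges) - 1, 0, num_bins - 1)
    sums = np.bincount(idx.ravel(), weights=power.ravel(), minlength=num_bins)
    counts = np.bincount(idx.ravel(), minlength=num_bins)
    return sums / np.maximum(counts, 1)  # empty bins report 0
```

Comparing the two generators' outputs bin by bin (for example, as a ratio of spectra) would show which frequency bands agree and which diverge, which is the information a frequency-based handoff depends on.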
minor comments (2)
- [Method] Notation for the shared latent space and the precise definition of 'homogeneous' should be formalized with an equation or explicit metric rather than left descriptive.
- [VR-Bench introduction] The VR-Bench construction (task categories, evaluation protocol, human vs. automatic scoring) needs a dedicated subsection with examples to allow reproducibility.
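As an illustration of what the first minor comment is asking for, one possible formalization of "homogeneous" is sketched below. The notation is assumed here and does not come from the paper; the Fréchet distance between Gaussian fits is one standard choice of metric among several that would serve.

```latex
% Assumed notation (not from the paper):
%   \mathcal{Z} \subset \mathbb{R}^{C \times T \times H \times W}: the shared latent space
%   p_{\ell}^{t}, p_{h}^{t}: distributions over \mathcal{Z} of the latents produced by the
%   lightweight and high-capacity generators at noise level t
% Homogeneity up to tolerance \varepsilon under a divergence D:
\[
  \sup_{t \in [0,1]} D\!\left(p_{\ell}^{t},\, p_{h}^{t}\right) \le \varepsilon
\]
% One concrete choice of D: the squared Frechet distance between Gaussian fits
% (\mu_{\ell}, \Sigma_{\ell}) and (\mu_{h}, \Sigma_{h}) of the two latent sets.
\[
  d_F^{2} = \lVert \mu_{\ell} - \mu_{h} \rVert_2^{2}
  + \operatorname{Tr}\!\left( \Sigma_{\ell} + \Sigma_{h}
  - 2\left(\Sigma_{\ell}\Sigma_{h}\right)^{1/2} \right)
\]
```

Reporting such a quantity per noise level would turn the descriptive claim into a measurable one.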
Simulated Authors' Rebuttal
We thank the referee for the constructive feedback, which identifies key areas where additional evidence and experimental details would strengthen the manuscript. We address each major comment below and outline the revisions we will make.
Point-by-point responses
- Referee: [Abstract / Method (UPFB description)] The abstract and method description assert that the homogeneous latent space plus UPFB permits progressive handoff 'without compromising reasoning quality,' yet no ablations isolate UPFB's contribution to VR-Bench scores, no before-vs-after reasoning metrics are reported, and no quantitative evidence (e.g., distribution matching, frequency alignment statistics, or higher-order moment comparisons) is supplied to establish latent homogeneity between the lightweight and pretrained generators.
Authors: We agree that the current manuscript lacks explicit ablations and quantitative support for the homogeneity claim and UPFB's effect on reasoning metrics. The framework design assumes a shared latent space enables lossless handoff, but additional evidence is warranted. In the revised version, we will add: (i) ablations isolating UPFB's contribution to VR-Bench scores, (ii) before-vs-after reasoning metrics on VR-Bench, and (iii) quantitative analyses including distribution matching, frequency alignment statistics, and higher-order moment comparisons between the latent representations of the lightweight and pretrained generators. revision: yes
- Referee: [Experiments section] The central performance claims on VBench and VR-Bench rest on experiments whose details (baselines, data selection rules, error bars, number of runs, or statistical significance) are not provided in the manuscript, making it impossible to assess whether the reported gains are robust or could be explained by differences in training compute or data.
Authors: We acknowledge that the manuscript omits these critical experimental details. In the revised manuscript, we will expand the Experiments section to fully specify the baselines, data selection rules, error bars, number of runs, and statistical significance tests. This will enable readers to evaluate the robustness of the reported gains on VBench and VR-Bench. revision: yes
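The promised error bars and significance tests could take a form like the sketch below, which summarizes a metric over independent runs and compares two systems with a two-sided permutation test. The function names are hypothetical, and nothing here reflects the authors' actual evaluation protocol.

```python
import random
from statistics import mean, stdev


def summarize_runs(scores):
    """Mean and sample standard deviation over independent runs (needs >= 2 runs)."""
    return mean(scores), stdev(scores)


def permutation_p_value(runs_a, runs_b, n_permutations=10_000, seed=0):
    """Two-sided permutation test on the difference in mean score.

    Shuffles the pooled run scores and counts how often a random split
    produces a gap at least as large as the observed one.
    """
    observed = abs(mean(runs_a) - mean(runs_b))
    pooled = list(runs_a) + list(runs_b)
    n_a = len(runs_a)
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        gap = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if gap >= observed - 1e-12:  # small tolerance for float rounding
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_permutations + 1)
```

With only a handful of runs per system the smallest attainable p-value is limited by the number of distinct splits, so the number of runs should be reported alongside the result.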
Circularity Check
No significant circularity; derivation is self-contained
Full rationale
The paper presents a two-stage training/inference design for Lumos-Nexus that aligns a lightweight generator with an understanding block and uses UPFB for handoff to a pretrained generator in a shared latent space. No load-bearing claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The homogeneous latent space and UPFB are introduced as architectural choices whose effects are evaluated via external benchmarks (VBench, VR-Bench); no equations or predictions are shown to be equivalent to their own inputs. This is the normal non-circular case for a methods paper whose central results rest on empirical evaluation rather than internal re-derivation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A homogeneous latent space exists between the lightweight generator and the high-capacity pretrained generator that permits progressive frequency bridging without degrading reasoning quality.