Qwen-RobotWorld Technical Report: Unifying Embodied World Modeling through Language-Conditioned Video Generation

Anzhe Chen; Chenfei Wu; Chenxu Lv; Dayiheng Liu; Deqing Li; Gengze Zhou; Hale Yin; Haoqi Yuan; Haoyang Li; Jiahao Li

arxiv: 2606.17030 · v3 · pith:6VFQJYVMnew · submitted 2026-06-15 · 💻 cs.CV

Jie Zhang; Xiaoyue Chen; Anzhe Chen; Dayiheng Liu; Deqing Li; Gengze Zhou; Hale Yin; Haoqi Yuan; Haoyang Li; Jiahao Li; Jiazhao Zhang; Jingren Zhou; Kaiyuan Gao; Kun Yan; Lihan Jiang; Ningyuan Tang; Pei Lin; Qihang Peng; Shengming Yin; Tianhe Wu; Tianyi Yan; Xiao Xu; Yan Shu; Yanran Zhang; Ye Wang; Yi Wang; Yilei Chen; Yixian Xu; Yiyang Huang; Yuxiang Chen; Zekai Zhang; Zhendong Wang; Zixing Lei; Zhixuan Liang; Zihao Liu; Zikai Zhou; Chenxu Lv; Xiong-Hui Chen; Chenfei Wu

Pith reviewed 2026-06-27 04:15 UTC · model grok-4.3

classification 💻 cs.CV

keywords embodied world modeling · language-conditioned video generation · video world models · robotics · diffusion transformers · synthetic data generation · multi-embodiment learning

The pith

Natural language serves as the unified action interface for a video model that generates physically grounded future trajectories across robotic manipulation, driving, navigation, and human-to-robot tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Qwen-RobotWorld, a language-conditioned video world model that takes current observations and language instructions to predict future visual scenes. It unifies modeling for multiple embodied domains through a shared language interface rather than domain-specific action formats. The approach rests on a double-stream diffusion transformer that links frozen vision-language semantics to video latents, a corpus of 8.6 million video-text pairs spanning more than 20 embodiments, and a two-stage curriculum that first builds general priors and then adds embodied specialization. The reported results are first-place overall rankings on EWMBench and DreamGen Bench and better scores than all open-source models on WorldModelBench and PBench, with additional zero-shot analyses on the RoboTwin-IF benchmark offered as support for generalization and multi-view consistency. This setup is presented as enabling synthetic data for policy training, virtual test environments, and language-based planning signals for control.

Core claim

A single language-conditioned video generation model can function as an embodied world model by predicting physically grounded future visual trajectories from current observations, using natural language as the common action representation across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer.

What carries the argument

Double-Stream MMDiT with MLLM Action Encoding, which couples frozen Qwen2.5-VL semantics to video-VAE latents via layer-wise joint attention in a 60-layer diffusion transformer, together with the Embodied World Knowledge corpus of 8.6M video-text pairs and the General+Expert Progressive Curriculum.
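
To make "layer-wise joint attention" in a double-stream transformer concrete, here is a minimal single-head, single-layer sketch in plain Python. It assumes the common MMDiT pattern in which each stream keeps its own query/key/value projections while attention runs over the concatenated token sequence. The function names, the tiny dimensions, and the random weights are illustrative assumptions, not the paper's implementation; normalization, residual connections, MLPs, and timestep modulation are omitted.

```python
import math
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def rand_matrix(rows, cols, rng):
    """Small random matrix used as a stand-in for learned weights or features."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

def joint_attention(text_tokens, video_tokens, text_proj, video_proj):
    """One joint-attention step over two streams (hypothetical sketch).

    Each stream has its own (Wq, Wk, Wv) projections, but attention is computed
    over the concatenated sequence, so every token attends to both modalities.
    """
    q, k, v = [], [], []
    for tokens, (wq, wk, wv) in ((text_tokens, text_proj), (video_tokens, video_proj)):
        q += matmul(tokens, wq)
        k += matmul(tokens, wk)
        v += matmul(tokens, wv)
    scale = 1.0 / math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]          # transpose keys
    scores = matmul(q, k_t)
    weights = [softmax([s * scale for s in row]) for row in scores]
    out = matmul(weights, v)
    n_text = len(text_tokens)
    return out[:n_text], out[n_text:], weights    # split back into the two streams

rng = random.Random(0)
dim = 4
text = rand_matrix(3, dim, rng)    # 3 instruction tokens (stand-in for frozen MLLM features)
video = rand_matrix(5, dim, rng)   # 5 video-latent tokens (stand-in for video-VAE latents)
text_proj = tuple(rand_matrix(dim, dim, rng) for _ in range(3))
video_proj = tuple(rand_matrix(dim, dim, rng) for _ in range(3))
text_out, video_out, weights = joint_attention(text, video, text_proj, video_proj)
```

Each row of `weights` spans all eight tokens, which is the sense in which the language and video streams are coupled at every layer rather than through a single cross-attention bottleneck.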

If this is right

The model can generate synthetic video data to augment policy training in multiple robot domains.
It supplies scalable virtual environments for evaluating policies without real hardware.
Language instructions can serve as planning signals that guide downstream robot control.
Zero-shot generalization across embodiments is supported by the shared language interface and curriculum.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the unification holds, language may replace specialized low-level action spaces when training policies for heterogeneous robot hardware.
Closed-loop integration with real robot controllers would be a direct next test of whether the predicted trajectories remain useful under feedback.
Extending the same architecture to longer time horizons or multi-agent interactions could be checked on existing benchmarks without new data collection.

Load-bearing premise

The combination of the 8.6M video-text corpus and the double-stream coupling of semantics with video latents produces accurate physical trajectories without any additional domain-specific mechanisms.

What would settle it

Demonstration that the generated trajectories systematically violate basic physical constraints, such as object penetration or incorrect motion under gravity, on a held-out multi-embodiment test set would falsify the claim of physically grounded prediction.
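
One way such a falsification test could be scored is a per-trajectory violation rate. The sketch below assumes that some upstream tracker (not specified in the paper) has already converted each generated frame into a pair of axis-aligned 3D bounding boxes, for example a gripper and a table. The `Box` type, the 5 mm tolerance, and the example coordinates are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Box:
    """Axis-aligned 3D bounding box given by min and max corners (metres)."""
    lo: tuple
    hi: tuple

def penetration_depth(a, b):
    """Smallest per-axis overlap of two boxes; 0.0 if they do not overlap."""
    overlaps = [min(a.hi[i], b.hi[i]) - max(a.lo[i], b.lo[i]) for i in range(3)]
    return min(overlaps) if all(o > 0 for o in overlaps) else 0.0

def violation_rate(trajectories, tolerance=0.005):
    """Fraction of trajectories with any frame whose boxes interpenetrate beyond tolerance."""
    def violates(frames):
        return any(penetration_depth(a, b) > tolerance for a, b in frames)
    flagged = sum(1 for frames in trajectories if violates(frames))
    return flagged / len(trajectories)

# Illustrative inputs only: a table whose top surface is at z = 0.
table = Box((-1.0, -1.0, -0.05), (1.0, 1.0, 0.0))
clean = [
    (Box((0.0, 0.0, 0.10), (0.05, 0.05, 0.20)), table),   # gripper above the table
    (Box((0.0, 0.0, 0.00), (0.05, 0.05, 0.10)), table),   # gripper resting on the surface
]
bad = [
    (Box((0.0, 0.0, 0.10), (0.05, 0.05, 0.20)), table),
    (Box((0.0, 0.0, -0.03), (0.05, 0.05, 0.07)), table),  # gripper sunk 3 cm into the table
]
rate = violation_rate([clean, bad])
```

With these two toy trajectories, one of them contains a frame that penetrates the table by 3 cm, so `rate` is 0.5. A real evaluation would also need a gravity or dynamics check, which this sketch does not attempt.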

Figures

Figures reproduced from arXiv: 2606.17030 by Anzhe Chen, Chenfei Wu, Chenxu Lv, Dayiheng Liu, Deqing Li, Gengze Zhou, Hale Yin, Haoqi Yuan, Haoyang Li, Jiahao Li, Jiazhao Zhang, Jie Zhang, Jingren Zhou, Kaiyuan Gao, Kun Yan, Lihan Jiang, Ningyuan Tang, Pei Lin, Qihang Peng, Shengming Yin, Tianhe Wu, Tianyi Yan, Xiao Xu, Xiaoyue Chen, Xiong-Hui Chen, Yanran Zhang, Yan Shu, Ye Wang, Yilei Chen, Yi Wang, Yixian Xu, Yiyang Huang, Yuxiang Chen, Zekai Zhang, Zhendong Wang, Zhixuan Liang, Zihao Liu, Zikai Zhou, Zixing Lei.

**Figure 2.** Overview of the unified data processing pipeline. Stage 1 (Raw Data Collection) collects …

**Figure 3.** Overview of our video generation architecture with 60-layer double-stream MMDiT backbone.

**Figure 4.** Scene2Robot: multi-segment conditioning for cross-embodiment video synthesis. The input sequence is organized as three contiguous segments: scene condition (F frames), robot reference (F frames), and generation (F frames). An index-based mechanism assigns condition tokens to timestep t = 0 and excludes them from loss computation, so only the generation segment is trainable. Joint attention at every MMDiT …

**Figure 5.** Fine-grained language grounding. (a) Contrastive: each pair of columns shares an identical initial frame (colored border); only the highlighted keyword differs between the two instructions. Pair 1: target object identity. Pair 2: destination. Pair 3: action type. In every case the generated motion is precisely grounded to the discriminating keyword. (b) Complex: two examples requiring multi-step execution …

**Figure 6.** Generalization across embodiments, tasks, and viewpoints. (A) Cross-embodiment: one instruction drives four morphologies (single-arm, dual-arm, humanoid, dexterous hand); three key frames per cell. (B) Cross-task × cross-environment: initial frame (orange border) followed by four generated frames across four tasks. (C) Multi-view: main and wrist cameras jointly generated from the same episode as (B, row 1)…

**Figure 7.** Zero-shot qualitative comparison on language–action alignment and multi-view coherence. Side-by-side grids under identical conditioning (same initial frame(s), prompt, and camera layout), comparing QWEN-ROBOTWORLD against LVP and Cosmos2.5-14B.

**Figure 8.** RoboTwin-IF zero-shot qualitative cases. The benchmark is built on the RoboTwin simulator with newly constructed complex tasks.

**Figure 9.** Human-to-robot transfer. Across eight target embodiments, each row compares a human demonstration (left) with the synthesized robot execution (right) for the same task, using five uniformly sampled frames per video. The generated trajectories preserve task intent while adapting motion to embodiment-specific kinematic constraints.

**Figure 10.** Mobility generation. Paired columns with five rows: (left) Autonomous driving episodes from Bench2Drive, NVIDIA PhysicalAI-AD, Sekai, and Waymo; (right) Egocentric indoor navigation from VLNVerse with language-guided first-person traversal. Each episode uses five uniformly sampled frames.
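
The index-based conditioning described in the Figure 4 caption can be sketched as follows. This is a toy illustration under stated assumptions: one scalar stands in for each frame's latent tokens, `t = 0` is treated as the noise-free end of the schedule, and the helper names `build_segment_plan` and `masked_mse` are hypothetical, not taken from the paper.

```python
def build_segment_plan(frames_per_segment, sampled_t):
    """Per-frame timesteps and loss mask for [scene | robot reference | generation].

    The two condition segments are pinned to timestep 0 and masked out of the
    loss; only the generation segment gets the sampled timestep and a mask of 1.
    """
    n_cond = 2 * frames_per_segment
    timesteps = [0.0] * n_cond + [sampled_t] * frames_per_segment
    loss_mask = [0.0] * n_cond + [1.0] * frames_per_segment
    return timesteps, loss_mask

def masked_mse(pred, target, loss_mask):
    """Mean squared error averaged only over frames whose mask is 1."""
    total = sum(m * (p - t) ** 2 for p, t, m in zip(pred, target, loss_mask))
    return total / sum(loss_mask)

# F = 2 frames per segment, so 4 condition frames and 2 generated frames.
timesteps, loss_mask = build_segment_plan(frames_per_segment=2, sampled_t=0.7)

# Large errors on the condition frames are ignored; only the last two frames count.
pred = [9.0, 9.0, 9.0, 9.0, 1.0, 3.0]
target = [0.0, 0.0, 0.0, 0.0, 0.0, 1.0]
loss = masked_mse(pred, target, loss_mask)
```

Here `loss` is (1² + 2²) / 2 = 2.5, independent of the values on the four condition frames, which mirrors the caption's statement that only the generation segment is trainable.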

Original abstract

We introduce Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This unified formulation provides three promising application directions: synthetic data generation for policy training augmentation, scalable virtual environments for policy evaluation, and language-guided planning signals for downstream robot control. This is achieved through a three-part design: a) Double-Stream MMDiT with MLLM Action Encoding, where a 60-layer double-stream diffusion transformer couples frozen Qwen2.5-VL semantics with video-VAE latents through layer-wise joint attention; b) Embodied World Knowledge (EWK), an 8.6M video-text corpus (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories; and c) General+Expert Progressive Curriculum, a two-stage training strategy that first learns general visual priors and then injects embodied specialization under a shared language interface. Extensive results show strong competitiveness: ranks 1st overall on EWMBench and DreamGen Bench, outperforms all open-source models on WorldModelBench and PBench. Additional zero-shot analyses on RoboTwin-IF benchmark further support robust generalization and multi-view consistency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

Qwen-RobotWorld scales up language-conditioned video generation for robots with a double-stream MMDiT, an 8.6M embodied corpus, and progressive training, claiming top benchmark spots, but supplies almost no evaluation protocol details.

The core contribution is a single model that takes current video frames plus language instructions and generates future trajectories across manipulation, driving, navigation, and human transfer tasks. It does this with a 60-layer double-stream diffusion transformer that mixes frozen Qwen2.5-VL features into video latents via joint attention, trained first on general visual data then on the new 8.6M video-text Embodied World Knowledge corpus that maps actions across 20+ embodiments.

The scale of the corpus and the two-stage curriculum are concrete additions. Collecting action-language pairs at that volume and using progressive specialization under a shared interface is a reasonable engineering step beyond prior single-domain video world models. The reported first-place rankings on EWMBench and DreamGen Bench, and the lead over open-source models on WorldModelBench and PBench, are consistent with that setup if the training ran as described.

The main weakness is that the text gives benchmark numbers without describing how the tests were run, which baselines were reimplemented, what statistical checks were applied, or where the model fails. That gap makes it impossible to separate architecture gains from data volume or tuning effort. The physical accuracy claim rests on the data and architecture alone, with no additional mechanisms shown.

This work is for groups already building or evaluating video-based world models for robotics. Readers who need large embodied video datasets or examples of MLLM-diffusion coupling will find usable pieces. The effort is large enough and the unification idea concrete enough that it should go to peer review so the evaluation details can be checked.

Referee Report

2 major / 2 minor

Summary. The paper introduces Qwen-RobotWorld, a language-conditioned video world model for embodied intelligence. With natural language as a unified action interface, it predicts physically grounded future visual trajectories from current observations across robotic manipulation, autonomous driving, indoor navigation, and human-to-robot transfer. This is achieved via a three-part design: (a) Double-Stream MMDiT with MLLM Action Encoding (60-layer diffusion transformer coupling frozen Qwen2.5-VL semantics with video-VAE latents via layer-wise joint attention), (b) Embodied World Knowledge (EWK) corpus of 8.6M video-text pairs (200M+ frames) with action-language mapping over 20+ embodiments and 500+ action categories, and (c) General+Expert Progressive Curriculum (two-stage training for general visual priors then embodied specialization). The model claims 1st overall on EWMBench and DreamGen Bench, outperforming all open-source models on WorldModelBench and PBench, plus zero-shot generalization on RoboTwin-IF, with applications in synthetic data generation, virtual environments, and language-guided planning.

Significance. If the benchmark rankings and generalization claims hold under rigorous verification, the work would be significant for providing a unified language-conditioned world model spanning multiple embodied domains and embodiments. The scale of the EWK corpus, the Double-Stream MMDiT architecture, and the progressive curriculum represent concrete engineering contributions that could support downstream uses in policy training and evaluation. The cross-embodiment coverage and zero-shot analyses are particular strengths worth highlighting if substantiated.

major comments (2)

[Abstract] Abstract and results sections: the manuscript states top rankings on EWMBench, DreamGen Bench, WorldModelBench, and PBench but supplies no information on evaluation protocols, baseline implementations, statistical tests, number of runs, or failure modes. These details are load-bearing for the central performance claims and must be added for the results to be verifiable.
[Abstract] The weakest assumption (physical grounding and cross-embodiment generalization from the 8.6M corpus and Double-Stream MMDiT) is not directly tested or falsified in the provided text; if the full manuscript contains only benchmark rankings without ablation on physical accuracy metrics or embodiment-specific controls, this remains an unaddressed risk to the unification claim.

minor comments (2)

[Abstract] The abstract mentions "extensive results" and "additional zero-shot analyses" but does not reference specific tables, figures, or sections containing the quantitative data; adding explicit pointers would improve readability.
[§3] Notation for the Double-Stream MMDiT (e.g., exact definition of layer-wise joint attention and how frozen Qwen2.5-VL outputs are injected) could be clarified with a diagram or equation if not already present in §3.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on verifiability and the strength of our unification claims. We address each major comment below and commit to revisions that strengthen the manuscript without altering its core contributions.

Referee: [Abstract] Abstract and results sections: the manuscript states top rankings on EWMBench, DreamGen Bench, WorldModelBench, and PBench but supplies no information on evaluation protocols, baseline implementations, statistical tests, number of runs, or failure modes. These details are load-bearing for the central performance claims and must be added for the results to be verifiable.

Authors: We agree that these details are essential for verifiability. The revised manuscript will include an expanded Experiments section with: (1) full evaluation protocols for each benchmark, (2) descriptions of baseline implementations and reproduction steps, (3) statistical tests and confidence intervals, (4) the number of runs per experiment, and (5) a dedicated failure-mode analysis. These additions will be placed before the main results tables.

revision: yes

Referee: [Abstract] The weakest assumption (physical grounding and cross-embodiment generalization from the 8.6M corpus and Double-Stream MMDiT) is not directly tested or falsified in the provided text; if the full manuscript contains only benchmark rankings without ablation on physical accuracy metrics or embodiment-specific controls, this remains an unaddressed risk to the unification claim.

Authors: The benchmarks already embed physical-grounding metrics (e.g., trajectory consistency under dynamics, collision avoidance, and embodiment transfer success) and cross-embodiment splits. Nevertheless, we acknowledge the value of explicit controls. The revision will add a new subsection with (a) quantitative physical-accuracy ablations (physics-violation rates, dynamics consistency scores) and (b) embodiment-specific controls that isolate the contribution of the shared language interface versus domain-specific data. These will directly test the unification hypothesis.

revision: yes
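
The confidence intervals and multiple runs that the rebuttal commits to could be reported with a percentile bootstrap over per-run scores. The sketch below is one standard way to do that, not a procedure described in the paper. The function name `bootstrap_ci` and the five score values are placeholders made up for illustration and are not results from any benchmark.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of per-run scores."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))  # resample with replacement
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return statistics.fmean(scores), lo, hi

# Placeholder per-run scores, purely illustrative (NOT reported results).
scores = [0.61, 0.58, 0.64, 0.60, 0.63]
mean, lo, hi = bootstrap_ci(scores)
print(f"mean={mean:.3f}, 95% CI=[{lo:.3f}, {hi:.3f}]")
```

Reporting the interval alongside the mean makes it possible to tell whether a first-place ranking is separated from the runner-up by more than run-to-run noise.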

Circularity Check

0 steps flagged

No significant circularity

The paper introduces an architecture (Double-Stream MMDiT), a data corpus (EWK 8.6M video-text pairs), and a training curriculum, then reports external benchmark rankings (EWMBench, DreamGen Bench, WorldModelBench, PBench). No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. Claims rest on benchmark comparisons rather than any internal reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is abstract-only; the model rests on pretrained Qwen2.5-VL and video-VAE components plus the new corpus and curriculum. No explicit free parameters or invented entities are stated.

axioms (1)

Domain assumption: Video-VAE latents preserve sufficient physical information for trajectory prediction when coupled with MLLM semantics. Invoked in the description of the Double-Stream MMDiT design.

pith-pipeline@v0.9.1-grok · 5928 in / 1423 out tokens · 55642 ms · 2026-06-27T04:15:57.216083+00:00 · methodology

Reference graph

Works this paper leans on

296 extracted references · 94 linked inside Pith


[20] [21]

arXiv:2406.06462 , year=

VCR: Visual Caption Restoration , author=. arXiv:2406.06462 , year=

arXiv

[21] [22]

2024 , journal=

MMT-Bench: A Comprehensive Multimodal Benchmark for Evaluating Large Vision-Language Models Towards Multitask AGI , author=. 2024 , journal=

2024

[22] [23]

MMBench: Is Your Multi-modal Model an All-around Player? , year =

Yuan Liu and Haodong Duan and Yuanhan Zhang, Bo Li and Songyang Zhang and Wangbo Zhao and Yike Yuan and Jiaqi Wang and Conghui He and Ziwei Liu and Kai Chen and Dahua Lin , journal =. MMBench: Is Your Multi-modal Model an All-around Player? , year =

[23] [24]

2023 , journal=

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination & Visual Illusion in Large Vision-Language Models , author=. 2023 , journal=

2023

[24] [25]

Grok-1.5 vision preview , year =

[25] [26]

Grok-2 Beta Release , year =

[26] [27]

arXiv:2403.20330 , year=

Are We on the Right Way for Evaluating Large Vision-Language Models? , author=. arXiv:2403.20330 , year=

Pith/arXiv arXiv

[27] [28]

ICML , year=

Mm-vet: Evaluating large multimodal models for integrated capabilities , author=. ICML , year=

[28] [29]

NeurIPS , year=

Flamingo: a visual language model for few-shot learning , author=. NeurIPS , year=

[29] [30]

Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

Vila: On pre-training for visual language models , author=. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition , pages=

[30] [31]

arXiv preprint arXiv:2406.08418 , year=

OmniCorpus: An Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text , author=. arXiv preprint arXiv:2406.08418 , year=

arXiv

[31] [32]

and Stoica, Ion and Xing, Eric P

Chiang, Wei-Lin and Li, Zhuohan and Lin, Zi and Sheng, Ying and Wu, Zhanghao and Zhang, Hao and Zheng, Lianmin and Zhuang, Siyuan and Zhuang, Yonghao and Gonzalez, Joseph E. and Stoica, Ion and Xing, Eric P. , year =. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90\ url =

[32] [33]

arXiv:2309.17425 , year=

Data filtering networks , author=. arXiv:2309.17425 , year=

arXiv

[33] [34]

arXiv:2306.09265 , year=

LVLM-eHub: A Comprehensive Evaluation Benchmark for Large Vision-Language Models , author=. arXiv:2306.09265 , year=

arXiv

[34] [35]

NeurIPS , year=

Language models are few-shot learners , author=. NeurIPS , year=

[35] [36]

arXiv:2304.08485 , year=

Visual instruction tuning , author=. arXiv:2304.08485 , year=

Pith/arXiv arXiv

[36] [37]

arXiv:2304.10592 , year=

Minigpt-4: Enhancing vision-language understanding with advanced large language models , author=. arXiv:2304.10592 , year=

Pith/arXiv arXiv

[37] [38]

arXiv:2309.17421 , year=

The dawn of lmms: Preliminary explorations with gpt-4v (ision) , author=. arXiv:2309.17421 , year=

Pith/arXiv arXiv

[38] [39]

Our World in Data , year =

Hannah Ritchie and Veronika Samborska and Max Roser , title =. Our World in Data , year =

[39] [43]

arXiv preprint arXiv:2307.06281 , year=

Mmbench: Is your multi-modal model an all-around player? , author=. arXiv preprint arXiv:2307.06281 , year=

Pith/arXiv arXiv

[40] [44]

2024 , eprint=

Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding , author=. 2024 , eprint=

2024

[41] [45]

arXiv:2311.16502 , year=

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi , author=. arXiv:2311.16502 , year=

Pith/arXiv arXiv

[42] [46]

arXiv:2303.16199 , year=

Llama-adapter: Efficient fine-tuning of language models with zero-init attention , author=. arXiv:2303.16199 , year=

Pith/arXiv arXiv

[43] [47]

Manning and Stefano Ermon and Chelsea Finn , editor =

Rafael Rafailov and Archit Sharma and Eric Mitchell and Christopher D. Manning and Stefano Ermon and Chelsea Finn , editor =. Direct Preference Optimization: Your Language Model is Secretly a Reward Model , booktitle =. 2023 , url =

2023

[44] [48]

arXiv:2305.06500 , year=

Instructblip: Towards general-purpose vision-language models with instruction tuning , author=. arXiv:2305.06500 , year=

Pith/arXiv arXiv

[45] [49]

arXiv:2304.14178 , year=

mplug-owl: Modularization empowers large language models with multimodality , author=. arXiv:2304.14178 , year=

Pith/arXiv arXiv

[46] [50]

arXiv:2304.15010 , year=

Llama-adapter v2: Parameter-efficient visual instruction model , author=. arXiv:2304.15010 , year=

Pith/arXiv arXiv

[47] [51]

arXiv:2305.03726 , year=

Otter: A multi-modal model with in-context instruction tuning , author=. arXiv:2305.03726 , year=

Pith/arXiv arXiv

[48] [52]

arXiv:2305.16355 , year=

Pandagpt: One model to instruction-follow them all , author=. arXiv:2305.16355 , year=

Pith/arXiv arXiv

[49] [53]

arXiv:2305.12223 , year=

What Makes for Good Visual Tokenizers for Large Language Models? , author=. arXiv:2305.12223 , year=

arXiv

[50] [54]

arXiv:2305.10355 , year=

Evaluating object hallucination in large vision-language models , author=. arXiv:2305.10355 , year=

Pith/arXiv arXiv

[51] [55]

arXiv:1504.00325 , year=

Microsoft coco captions: Data collection and evaluation server , author=. arXiv:1504.00325 , year=

Pith/arXiv arXiv

[52] [56]

International journal of computer vision , volume=

Visual genome: Connecting language and vision using crowdsourced dense image annotations , author=. International journal of computer vision , volume=. 2017 , publisher=

2017

[53] [57]

arXiv:2210.08402 , year=

Laion-5b: An open large-scale dataset for training next generation image-text models , author=. arXiv:2210.08402 , year=

Pith/arXiv arXiv

[54] [58]

, author=

Laion coco: 600m synthetic captions from laion2b-en. , author=. https://laion.ai/blog/laion-coco/ , year=

[55] [59]

arXiv:2304.14108 , year=

DataComp: In search of the next generation of multimodal datasets , author=. arXiv:2304.14108 , year=

arXiv

[56] [60]

2022 , url =

COYO-700M: Image-Text Pair Dataset , author =. 2022 , url =

2022

[57] [61]

CVPR , year=

Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts , author=. CVPR , year=

[58] [62]

ACL , year=

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning , author=. ACL , year=

[59] [63]

NeurIPS , year=

Im2text: Describing images using 1 million captioned photographs , author=. NeurIPS , year=

[60] [64]

arXiv:2209.06794 , year=

Pali: A jointly-scaled multilingual language-image model , author=. arXiv:2209.06794 , year=

Pith/arXiv arXiv

[61] [65]

ICML , year=

Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework , author=. ICML , year=

[62] [66]

NeurIPS , year=

Training language models to follow instructions with human feedback , author=. NeurIPS , year=

[63] [67]

arXiv:2305.10403 , year=

Palm 2 technical report , author=. arXiv:2305.10403 , year=

Pith/arXiv arXiv

[64] [68]

ICCV , year=

nocaps: novel object captioning at scale , author=. ICCV , year=

[65] [69]

CVPR , year=

Making the v in vqa matter: Elevating the role of image understanding in visual question answering , author=. CVPR , year=

[66] [70]

ECCV , year=

Textcaps: a dataset for image captioning with reading comprehension , author=. ECCV , year=

[67] [71]

CVPR , year=

Imagenet: A large-scale hierarchical image database , author=. CVPR , year=

[68] [72]

arXiv:2307.16125 , year=

SEED-Bench: Benchmarking Multimodal LLMs with Generative Comprehension , author=. arXiv:2307.16125 , year=

Pith/arXiv arXiv

[69] [73]

arXiv:2306.13394 , year=

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models , author=. arXiv:2306.13394 , year=

Pith/arXiv arXiv

[70] [74]

arXiv:2205.01068 , year=

Opt: Open pre-trained transformer language models , author=. arXiv:2205.01068 , year=

Pith/arXiv arXiv

[71] [75]

arXiv:2306.05685 , year=

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena , author=. arXiv:2306.05685 , year=

Pith/arXiv arXiv

[72] [76]

OpenAI blog , year=

Language models are unsupervised multitask learners , author=. OpenAI blog , year=

[73] [77]

arXiv:1810.04805 , year=

Bert: Pre-training of deep bidirectional transformers for language understanding , author=. arXiv:1810.04805 , year=

Pith/arXiv arXiv

[74] [78]

JMLR , year=

Exploring the limits of transfer learning with a unified text-to-text transformer , author=. JMLR , year=

[75] [79]

CVPR , year=

Masked autoencoders are scalable vision learners , author=. CVPR , year=

[76] [80]

arXiv:2106.08254 , year=

Beit: Bert pre-training of image transformers , author=. arXiv:2106.08254 , year=

Pith/arXiv arXiv

[77] [81]

arXiv:2212.04408 , year=

OFASys: A Multi-Modal Multi-Task Learning System for Building Generalist Models , author=. arXiv:2212.04408 , year=

arXiv

[78] [82]

ICML , year=

Generative pretraining from pixels , author=. ICML , year=

[79] [83]

arXiv:2302.14045 , year=

Language is not all you need: Aligning perception with language models , author=. arXiv:2302.14045 , year=

Pith/arXiv arXiv

[80] [84]

ECCV , year=

Microsoft coco: Common objects in context , author=. ECCV , year=