pith. machine review for the scientific record.

arxiv: 2605.00080 · v1 · submitted 2026-04-30 · 💻 cs.RO · cs.CV

Recognition: unknown

World Model for Robot Learning: A Comprehensive Survey

Bohan Hou, Gen Li, Haoran Geng, Jiajun Wu, Jianfei Yang, Jindou Jia, Jitendra Malik, Marc Pollefeys, Oier Mees, Philip Torr, Pieter Abbeel, Sicong Leng, Tatsuya Harada, Tuo An, Xinying Guo, Yanjie Ze, Yilun Du, Zhuang Liu

Pith reviewed 2026-05-09 20:34 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords world models · robot learning · predictive modeling · embodied AI · reinforcement learning · video generation · navigation · benchmarks

The pith

World models act as predictive representations that help robots learn, plan, and simulate their interactions with the world.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish a clear framework for understanding world models in robot learning by systematically reviewing their various architectures and roles. It shows how these models couple with policies for better decision making, function as internal simulators for training, and evolve with video generation techniques. Readers would care because the field is growing fast but scattered, and this brings together insights on applications in navigation and benchmarks to guide future work. If the review holds, it can accelerate progress by identifying gaps in predictive modeling for embodied systems.

Core claim

World models, predictive representations of how environments evolve under actions, have become a central component of robot learning. They support policy learning, planning, simulation, evaluation, and data generation, and they have advanced rapidly with the rise of foundation models and large-scale video generation. The survey reviews couplings with robot policies, their use as learned simulators for reinforcement learning and evaluation, the progression of robotic video world models from imagination-based generation to controllable, structured, and foundation-scale formulations, connections to navigation and autonomous driving, and representative datasets, benchmarks, and evaluation protocols. It also highlights major challenges and future directions for predictive modeling in embodied agents.

What carries the argument

World models as predictive representations of environment evolution under actions, serving to unify policy learning, simulation, and planning across robot applications.
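
To make this premise concrete, the sketch below (not taken from the paper) shows the loop it implies: a one-step predictive model f(state, action) rolled forward to score candidate action sequences, so planning happens in imagination rather than on the robot. The toy dynamics, the distance-to-goal reward, and names such as WorldModel and plan are illustrative assumptions, not an interface the survey defines.

    import numpy as np

    class WorldModel:
        """Stand-in for a learned predictive model: s_next = f(s, a)."""
        def predict(self, state: np.ndarray, action: np.ndarray) -> np.ndarray:
            # Toy dynamics: the state drifts toward the commanded action.
            return state + 0.1 * (action - state)

    def plan(model, state, goal, horizon=10, n_candidates=256, seed=0):
        """Random-shooting planner: score imagined rollouts, return the best first action."""
        rng = np.random.default_rng(seed)
        best_action, best_return = None, -np.inf
        for _ in range(n_candidates):
            seq = rng.uniform(-1.0, 1.0, size=(horizon, state.shape[0]))
            s, total = state.copy(), 0.0
            for a in seq:
                s = model.predict(s, a)              # imagined transition, no real-world step
                total += -np.linalg.norm(s - goal)   # closer to the goal = higher return
            if total > best_return:
                best_return, best_action = total, seq[0]
        return best_action

    first_action = plan(WorldModel(), state=np.zeros(3), goal=np.ones(3))

The same predictive core can back policy learning (training on imagined rollouts) or simulation (standing in for the environment), which is the unification the survey describes.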

If this is right

  • World models integrated with policies enable more efficient robot learning without constant real-world interaction.
  • These models can replace or augment traditional simulators for reinforcement learning and performance evaluation (a minimal sketch follows this list).
  • Video-based world models allow for generating structured and controllable data to train robotic systems.
  • Applications extend to enhancing navigation systems and autonomous driving through better environmental prediction.
  • Benchmarks and datasets provide standardized ways to measure and compare world model performance in embodied tasks.
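
A hedged sketch of the second bullet above, assuming a gym-style interface: wrap the learned dynamics as an environment so a policy can be trained or evaluated without touching the real robot. LearnedDynamics, the reward function, and WorldModelEnv are hypothetical placeholders, not an API from the survey or its references.

    import numpy as np

    class LearnedDynamics:
        """Placeholder for a trained predictive network."""
        def step(self, state, action):
            return state + 0.05 * action

    class WorldModelEnv:
        """Gym-like environment backed by a world model instead of a physical robot."""
        def __init__(self, dynamics, reward_fn, init_state, max_steps=50):
            self.dynamics, self.reward_fn = dynamics, reward_fn
            self.init_state, self.max_steps = init_state, max_steps

        def reset(self):
            self.state, self.t = self.init_state.copy(), 0
            return self.state

        def step(self, action):
            self.state = self.dynamics.step(self.state, action)
            self.t += 1
            reward = self.reward_fn(self.state, action)
            done = self.t >= self.max_steps
            return self.state, reward, done, {}

    def evaluate(policy, env, episodes=10):
        """Average return of a policy inside the imagined environment."""
        returns = []
        for _ in range(episodes):
            state, done, total = env.reset(), False, 0.0
            while not done:
                state, reward, done, _ = env.step(policy(state))
                total += reward
            returns.append(total)
        return float(np.mean(returns))

    env = WorldModelEnv(LearnedDynamics(), lambda s, a: -np.linalg.norm(s), np.ones(3))
    score = evaluate(lambda s: -s, env)

How faithfully such an imagined environment tracks the real one is exactly what the benchmarks and evaluation protocols in the last bullet are meant to measure.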

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Advancing these models might allow robots to handle more complex, long-term tasks by anticipating future states accurately.
  • Links between video foundation models and robot-specific world models could create more general-purpose predictive systems.
  • Addressing the challenges highlighted may require new evaluation methods that test prediction in real physical settings.
  • Maintaining the associated repository could help the community track rapid developments in this area.

Load-bearing premise

The assumption that the current literature on world models is fragmented enough to benefit from one unifying survey that covers architectures, roles, and domains comprehensively.

What would settle it

Whether any significant body of recent work on world models in robotics falls outside the categories and resources reviewed in the survey and its updates.

read the original abstract

World models, which are predictive representations of how environments evolve under actions, have become a central component of robot learning. They support policy learning, planning, simulation, evaluation, data generation, and have advanced rapidly with the rise of foundation models and large-scale video generation. However, the literature remains fragmented across architectures, functional roles, and embodied application domains. To address this gap, we present a comprehensive review of world models from a robot-learning perspective. We examine how world models are coupled with robot policies, how they serve as learned simulators for reinforcement learning and evaluation, and how robotic video world models have progressed from imagination-based generation to controllable, structured, and foundation-scale formulations. We further connect these ideas to navigation and autonomous driving, and summarize representative datasets, benchmarks, and evaluation protocols. Overall, this survey systematically reviews the rapidly growing literature on world models for robot learning, clarifies key paradigms and applications, and highlights major challenges and future directions for predictive modeling in embodied agents. To facilitate continued access to newly emerging works, benchmarks, and resources, we will maintain and regularly update the accompanying GitHub repository alongside this survey.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents a survey on world models for robot learning, claiming to systematically review the fragmented literature by examining couplings with robot policies, use as learned simulators for RL and evaluation, progress in robotic video world models (from imagination-based to controllable and foundation-scale), connections to navigation and autonomous driving, and representative datasets, benchmarks, and evaluation protocols. It highlights major challenges and future directions for predictive modeling in embodied agents and commits to maintaining an accompanying GitHub repository for updates.

Significance. A thorough synthesis of this rapidly evolving area could clarify paradigms across architectures and domains, helping researchers navigate connections between predictive models and embodied applications while identifying open challenges. The GitHub maintenance plan is a positive step for ongoing utility. However, without demonstrated coverage of the claimed scope, the survey's ability to close the fragmentation gap remains unverified.

major comments (1)
  1. [Abstract/Introduction] The central claim that the survey 'systematically reviews the rapidly growing literature' and 'addresses this gap' by examining policy couplings, learned simulators, video models, navigation, driving, datasets, and benchmarks is load-bearing but unsupported. No literature search methodology, databases, search terms, date range, inclusion/exclusion criteria, or PRISMA-style accounting of screened versus included papers is described, leaving representativeness and completeness unassessable.
minor comments (1)
  1. [Abstract] The commitment to updating the GitHub repository is noted positively but should include an explicit link in the paper and a description of what resources (e.g., paper lists, benchmarks) it will contain.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful and constructive feedback on our survey. We agree that greater transparency regarding the literature review process will strengthen the manuscript and better support our claims of systematic coverage. We address the major comment below and will incorporate the suggested revisions.

read point-by-point responses
  1. Referee: [Abstract/Introduction] The central claim that the survey 'systematically reviews the rapidly growing literature' and 'addresses this gap' by examining policy couplings, learned simulators, video models, navigation, driving, datasets, and benchmarks is load-bearing but unsupported. No literature search methodology, databases, search terms, date range, inclusion/exclusion criteria, or PRISMA-style accounting of screened versus included papers is described, leaving representativeness and completeness unassessable.

    Authors: We agree that explicitly describing the literature search methodology would improve the survey's transparency and allow readers to better assess its scope and completeness. Although our review draws from an extensive examination of the literature across key venues and repositories (including arXiv, NeurIPS, ICML, CoRL, RSS, and IEEE journals), we did not include a formal methodology section in the initial submission. In the revised manuscript, we will add a dedicated subsection titled 'Literature Review Methodology' in the Introduction. This section will detail: (1) the databases and sources searched (Google Scholar, arXiv, conference proceedings from 2018–2024), (2) primary search terms and keywords (e.g., 'world models', 'robot learning', 'predictive world models', 'video prediction for robotics'), (3) inclusion criteria (papers focusing on predictive models in embodied agents, excluding purely theoretical or non-robotic applications), and (4) an approximate accounting of the number of papers initially screened versus those included in the final survey (approximately 250 papers reviewed, with 120+ cited). We believe this addition will substantiate our claims of systematic coverage without altering the core contributions. We also note that for rapidly evolving fields like this, surveys often combine systematic search with expert curation, which we have done here. revision: yes
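
    For illustration only, the protocol promised above could be recorded as a small machine-checkable summary.
    Field values are taken from the rebuttal text; the dictionary layout itself is a hypothetical convention,
    not anything in the paper.

    protocol = {
        "sources": ["Google Scholar", "arXiv", "conference proceedings 2018-2024"],
        "search_terms": ["world models", "robot learning",
                         "predictive world models", "video prediction for robotics"],
        "inclusion": "predictive models in embodied agents",
        "exclusion": "purely theoretical or non-robotic applications",
        "screened": 250,    # approximate count reviewed, per the rebuttal
        "included": 120,    # "120+ cited"
    }

    # A PRISMA-style accounting should never include more papers than it screened.
    assert protocol["included"] <= protocol["screened"]
    print(f"inclusion rate ~ {protocol['included'] / protocol['screened']:.0%}")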

Circularity Check

0 steps flagged

No circularity: survey contains no derivations or self-referential reductions

full rationale

The paper is a literature survey with no equations, fitted parameters, predictions, or derivation chains. Its central claim is that it systematically reviews fragmented world-model literature for robot learning by examining couplings with policies, simulators, video models, navigation, and benchmarks. This rests on the authors' curation of cited works rather than any internal mathematical reduction or self-citation that forces the result by construction. No steps match the enumerated circularity patterns; the review is self-contained as an external synthesis without load-bearing self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a literature survey, the central claim rests on the authors' selection, interpretation, and synthesis of prior publications rather than new postulates, fitted parameters, or invented entities. No free parameters, axioms, or invented entities are introduced.

pith-pipeline@v0.9.0 · 5558 in / 1113 out tokens · 36313 ms · 2026-05-09T20:34:39.886038+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

75 extracted references · 72 canonical work pages · 13 internal anchors

  1. [1]

    World Simulation with Video Foundation Models for Physical AI

    Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical AI.arXiv preprint arXiv:2511.00062,

  2. [2]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mahmoud Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew J Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Yanpin Tao, Pascal Vincent, and Nicolas Ballas. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985,

  3. [3]

    Walk Through Paintings: Egocentric World Models from Internet Priors

    Anurag Bagchi, Zhipeng Bao, Homanga Bharadhwaj, Yu-Xiong Wang, Pavel Tokmakov, and Martial Hebert. Walk through paintings: Egocentric world models from internet priors. arXiv preprint arXiv:2601.15284,

  4. [4]

    Motus: A Unified Latent Action World Model

    Hongzhe Bi, Hengkai Tan, Shenghao Xie, Zeyuan Wang, Shuhe Huang, Haitian Liu, Ruowen Zhao, Yao Feng, Chendong Xiang, Yinze Rong, et al. Motus: A unified latent action world model.arXiv preprint arXiv:2512.13030,

  5. [5]

    Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

    Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models.arXiv preprint arXiv:2310.10639,

  6. [6]

    π0: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164,

  7. [7]

    AgiBot World Colosseo: A Large-scale Manipulation Platform for Scalable and Intelligent Embodied Systems

    Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems. arXiv preprint arXiv:2503.06669, 2025a. Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing ...

  8. [8]

    In-N-On: Scaling Egocentric Manipulation with In-the-Wild and On-Task Data

    Xiongyi Cai, Ri-Zhao Qiu, Geng Chen, Lai Wei, Isabella Liu, Tianshu Huang, Xuxin Cheng, and Xiaolong Wang. In-N-On: Scaling egocentric manipulation with in-the-wild and on-task data. arXiv preprint arXiv:2511.15704,

  9. [9]

    RynnVLA-002: A Unified Vision-Language-Action and World Model

    Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, et al. Rynnvla-002: A unified vision-language-action and world model.arXiv preprint arXiv:2511.17502,

  10. [10]

    Large Video Planner Enables Generalizable Robot Control

    Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, et al. Large video planner enables generalizable robot control.arXiv preprint arXiv:2512.15840, 2025a. Delong Chen, Tejaswi Kasarla, Yejin Bang, Mustafa Shukor, Willy Chung, Jade Yu, Allen Bolourchi, Theo Moutakann...

  11. [11]

    arXiv preprint arXiv:2509.22642

    Cheng Chi, Zhenjia Xu, Siyuan Feng, Eric Cousineau, Yilun Du, Benjamin Burchfiel, Russ Tedrake, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion.The International Journal of Robotics Research, 44(10-11):1684–1704, 2025a. Xiaowei Chi, Chun-Kai Fan, Hengyuan Zhang, Xingqun Qi, Rongyu Zhang, Anthony Chen, Chi-Min Chan, Wei X...

  12. [12]

    The Nature of Explanation

    K.J.W. Craik. The nature of explanation. Cambridge University Press,

  13. [13]

    Ronghao Dang, Jiayan Guo, Bohan Hou, Sicong Leng, Kehan Li, Xin Li, Jiangpin Liu, Yunxuan Mao, Zhikai Wang, Yuqian Yuan, Minghao Zhu, Xiao Lin, Yang Bai, Qian Jiang, Yaxi Zhao, Minghua Zeng, Junlong Gao, Yuming Jiang, Jun Cen, Siteng Huang, Liuyi Wang, Wenqiao Zhang, Chengju Liu, Jianfei Yang, Shijian Lu, and Deli Zhao. RynnBrain: Open...

  14. [14]

    Rethinking Video Generation Model for the Embodied World

    Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and Daquan Zhou. Rethinking video generation model for the embodied world.arXiv preprint arXiv:2601.15282,

  15. [15]

    Hoi! - A Multimodal Dataset for Force-Grounded, Cross-View Articulated Manipulation

    Tim Engelbracht, René Zurbrügg, Matteo Wohlrapp, Martin Büchner, Abhinav Valada, Marc Pollefeys, Hermann Blum, and Zuria Bauer. Hoi!: A multimodal dataset for force-grounded, cross-view articulated manipulation.arXiv preprint arXiv:2512.04884,

  16. [16]

    Wow, Wo, Val! A Comprehensive Embodied World Model Evaluation Turing Test

    Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, et al. Wow, wo, val! a comprehensive embodied world model evaluation turing test.arXiv preprint arXiv:2601.04137,

  17. [17]

    RH20T: A robotic dataset for learning diverse skills in one-shot

    Hao-Shu Fang, Hongjie Fang, Zhenyu Tang, Jirong Liu, Junbo Wang, Haoyi Zhu, and Cewu Lu. RH20T: A robotic dataset for learning diverse skills in one-shot. InRSS 2023 Workshop on Learning for Task and Motion Planning,

  18. [18]

    Vidar: Embodied Video Diffusion Model for Generalist Manipulation

    Yao Feng, Hengkai Tan, Xinyi Mao, Chendong Xiang, Guodong Liu, Shuhe Huang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist manipulation.arXiv preprint arXiv:2507.12898,

  19. [19]

    VITA: Vision-to-Action Flow Matching Policy

    Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, and Iman Soltani. VITA: Vision-to-action flow matching policy.arXiv preprint arXiv:2507.13231, 2025a. Qiyue Gao, Xinyu Pi, Kevin Liu, Junrong Chen, Ruolan Yang, Xinqi Huang, Xinyu Fang, Lu Sun, Gautham Kishore, Bo Ai, et al. Do vision-language models have inter...

  20. [20]

    Say, Dream, and Act: Learning Video World Models for Instruction-Driven Robot Manipulation

    Songen Gu, Yunuo Cai, Tianyu Wang, Simo Wu, and Yanwei Fu. Say, dream, and act: Learning video world models for instruction-driven robot manipulation. arXiv preprint arXiv:2602.10717,

  21. [21]

    Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation.arXiv preprint arXiv:2510.10125,

  22. [22]

    VLAW: Iterative Co-Improvement of Vision-Language-Action Policy and World Model

    Yanjiang Guo, Tony Lee, Lucy Xiaoyang Shi, Jianyu Chen, Percy Liang, and Chelsea Finn. VLAW: Iterative co-improvement of vision-language-action policy and world model.arXiv preprint arXiv:2602.12063, 2026a. Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-world: A controllable generative world model for robot manipulation. InInternatio...

  23. [23]

    Visuo-Tactile World Models

    Carolina Higuera, Sergio Arnaud, Byron Boots, Mustafa Mukadam, Francois Robert Hogan, and Franziska Meier. Visuo-tactile world models.arXiv preprint arXiv:2602.06001,

  24. [24]

    Robomind 2.0: A Multimodal, Bimanual Mobile Manipulation Dataset for Generalizable Embodied Intelligence

    Chengkai Hou, Kun Wu, Jiaming Liu, Zhengping Che, Di Wu, Fei Liao, Guangrun Li, Jingyang He, Qiuxuan Feng, Zhao Jin, et al. Robomind 2.0: A multimodal, bimanual mobile manipulation dataset for generalizable embodied intelligence.arXiv preprint arXiv:2512.24653,

  25. [25]

    GAIA-1: A Generative World Model for Autonomous Driving

    Anthony Hu, Lloyd Russell, Hudson Yeo, Zak Murez, George Fedoseev, Alex Kendall, Jamie Shotton, and Gianluca Corrado. Gaia-1: A generative world model for autonomous driving.arXiv preprint arXiv:2309.17080,

  26. [26]

    BagelVLA: Enhancing Long-Horizon Manipulation via Interleaved Vision-Language-Action Generation

    Yucheng Hu, Jianke Zhang, Yuanfei Luo, Yanjiang Guo, Xiaoyu Chen, Xinshu Sun, Kun Feng, Qingzhou Lu, Sheng Chen, Yangang Zhang, et al. BagelVLA: Enhancing long-horizon manipulation via interleaved vision-language-action generation.arXiv preprint arXiv:2602.09849,

  27. [27]

    VISTAv2: World Imagination for Indoor Vision-and-Language Navigation

    Yanjia Huang, Xianshun Jiang, Xiangbo Gao, Mingyang Wu, and Zhengzhong Tu. VISTAv2: World imagination for indoor vision-and-language navigation.arXiv preprint arXiv:2512.00041, 2025b. Yanjia Huang, Mingyang Wu, Renjie Li, and Zhengzhong Tu. Vista: Generative visual imagination for vision-and- language navigation.arXiv preprint arXiv:2505.07868, 2025c. Phy...

  28. [28]

    DROID: A large-scale in-the-wild robot manipulation dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. InRSS 2024 Workshop: Data Generation for Robotics,

  29. [29]

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163,

  30. [30]

    A Humanoid Visual-Tactile-Action Dataset for Contact-Rich Manipulation

    Eunju Kwon, Seungwon Oh, In-Chang Baek, Yucheon Park, Gyungbo Kim, JaeYoung Moon, Yunho Choi, and Kyung- Joong Kim. A humanoid visual-tactile-action dataset for contact-rich manipulation.arXiv preprint arXiv:2510.25725,

  31. [31]

    Mask2IV: Interaction-Centric Video Generation via Mask Trajectories

    Gen Li, Bo Zhao, Jianfei Yang, and Laura Sevilla-Lara. Mask2iv: Interaction-centric video generation via mask trajectories.arXiv preprint arXiv:2510.03135, 2025a. Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, et al. Vla-rft: Vision-language-action reinforcement fine-tuning...

  32. [32]

    Video Generators Are Robot Policies

    Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025a. Weixin Liang, LILI YU, Liang Luo, Srini Iyer, Ning Dong, Chunting Zhou, Gargi Ghosh, Mike Lewis, Wen tau Yih, Luke Zettlemoyer, and Xi Victoria Lin. Mixture-of-Transformers: A...

  33. [33]

    Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model

    Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning. Advances in Neural Information Processing Systems, 36:44776–44791, 2023a. Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et a...

  34. [34]

    World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

    Qingtao Liu, Yu Cui, Zhengnan Sun, Gaofeng Li, Jiming Chen, and Qi Ye. VTDexmanip: A dataset and benchmark for visual-tactile pretraining and dexterous manipulation with reinforcement learning. InInternational Conference on Learning Representations, 2025c. Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, and Mike Zheng Shou. World-VLA-Loop: Closed-loop ...

  35. [35]

    Being-H0: Vision-Language-Action Pretraining from Large-Scale Human Videos

    Hao Luo, Yicheng Feng, Wanpeng Zhang, Sipeng Zheng, Ye Wang, Haoqi Yuan, Jiazheng Liu, Chaoyi Xu, Qin Jin, and Zongqing Lu. Being-h0: Vision-language-action pretraining from large-scale human videos.arXiv preprint arXiv:2507.15597,

  36. [36]

    Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

    Hao Luo, Ye Wang, Wanpeng Zhang, Sipeng Zheng, Ziheng Xi, Chaoyi Xu, Haiweng Xu, Haoqi Yuan, Chi Zhang, Yiqing Wang, et al. Being-H0.5: Scaling human-centric robot learning for cross-embodiment generalization.arXiv preprint arXiv:2601.12993,

  37. [37]

    F1: A Vision-Language-Action Model Bridging Understanding and Generation to Actions

    Qi Lv, Weijie Kong, Hao Li, Jia Zeng, Zherui Qiu, Delin Qu, Haoming Song, Qizhi Chen, Xiang Deng, and Jiangmiao Pang. F1: A vision-language-action model bridging understanding and generation to actions.arXiv preprint arXiv:2509.06951,

  38. [38]

    LDA-1B: Scaling Latent Dynamics Action Model via Universal Embodied Data Ingestion

    Jiangran Lyu, Kai Liu, Xuheng Zhang, Haoran Liao, Yusen Feng, Wenxuan Zhu, Tingrui Shen, Jiayi Chen, Jiazhao Zhang, Yifei Dong, et al. LDA-1B: Scaling latent dynamics action model via universal embodied data ingestion. arXiv preprint arXiv:2602.12215,

  39. [39]

    Dit4dit: Jointly Modeling Video Dynamics and Actions for Generalizable Robot Control

    Teli Ma, Jia Zheng, Zifan Wang, Chunli Jiang, Andy Cui, Junwei Liang, and Shuo Yang. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448,

  40. [40]

    LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

    Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWorldModel: Stable End-to-End joint-embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312,

  41. [41]

    Robot Learning from a Physical World Model

    Jiageng Mao, Sicheng He, Hao-Ning Wu, Yang You, Shuyang Sun, Zhicheng Wang, Yanan Bao, Huizhong Chen, Leonidas Guibas, Vitor Guizilini, et al. Robot learning from a physical world model.arXiv preprint arXiv:2511.07416,

  42. [42]

    TC-IDM: Grounding Video Generation for Executable Zero-Shot Robot Motion

    Weishi Mi, Yong Bao, Xiaowei Chi, Xiaozhu Ju, Zhiyuan Qin, Kuangzhi Ge, Kai Tang, Peidong Jia, Shanghang Zhang, and Jian Tang. TC-IDM: Grounding video generation for executable zero-shot robot motion.arXiv preprint arXiv:2601.18323,

  43. [43]

    JEPA-VLA: Video Predictive Embedding Is Needed for VLA Models

    Shangchen Miao, Ningya Feng, Jialong Wu, Ye Lin, Xu He, Dong Li, and Mingsheng Long. JEPA-VLA: Video predictive embedding is needed for VLA models.arXiv preprint arXiv:2602.11832,

  44. [44]

    V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

    Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mahmoud Assran, Koustuv Sinha, Michael Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-JEPA 2.1: Unlocking dense features in video self-supervised learning. arXiv preprint arXiv:2603.14482,

  45. [45]

    mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs

    Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. Mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692,

  46. [46]

    WorldGym: World Model as an Environment for Policy Evaluation

    Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, and Sherry Yang. WorldGym: World model as an environment for policy evaluation.arXiv preprint arXiv:2506.00613,

  47. [47]

    MV-UMI: A Scalable Multi-View Interface for Cross-Embodiment Learning

    Omar Rayyan, John Abanes, Mahmoud Hafez, Anthony Tzes, and Fares Abu-Dakka. MV-UMI: A scalable multi-view interface for cross-embodiment learning.arXiv preprint arXiv:2509.18757,

  48. [48]

    WorldArena: A Unified Benchmark for Evaluating Perception and Functional Utility of Embodied World Models

    Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, et al. WorldArena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971,

  49. [49]

    Cross-View World Models

    Rishabh Sharma, Gijs Hogervorst, Wayne Mackey, David Heeger, and Stefano Martiniani. Cross-view world models. In ICLR 2026 the 2nd Workshop on World Models: Understanding, Modelling and Scaling. Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. VideoVLA: Video generators can be generalizable robot ...

  50. [50]

    HALO: A Unified Vision-Language-Action Model for Embodied Multimodal Chain-of-Thought Reasoning

    Quanxin Shou, Fangqi Zhu, Shawn Chen, Puxin Yan, Zhengyang Yan, Yikun Miao, Xiaoyi Pang, Zicong Hong, Ruikai Shi, Hao Huang, et al. HALO: A unified vision-language-action model for embodied multimodal chain-of-thought reasoning.arXiv preprint arXiv:2602.21157,

  51. [51]

    Denoising Diffusion Implicit Models

    Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. In International Conference on Learning Representations,

  52. [52]

    World Guidance: World Modeling in Condition Space for Action Generation

    Yue Su, Sijin Chen, Haixin Shi, Mingyu Liu, Zhengshen Zhang, Ningyuan Huang, Weiheng Zhong, Zhengbang Zhu, Yuxiao Liu, and Xihui Liu. World guidance: World modeling in condition space for action generation.arXiv preprint arXiv:2602.22010,

  53. [53]

    VLA-JEPA: Enhancing Vision-Language-Action Model with Latent World Model

    Jingwen Sun, Wenyao Zhang, Zekun Qi, Shaojie Ren, Zezhi Liu, Hanxin Zhu, Guangzhong Sun, Xin Jin, and Zhibo Chen. VLA-JEPA: Enhancing vision-language-action model with latent world model.arXiv preprint arXiv:2602.10098,

  54. [54]

    Evaluating Gemini Robotics Policies in a Veo World Simulator

    Gemini Robotics Team, Krzysztof Choromanski, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Abhishek Jindal, Thomas Kipf, Sean Kirmani, Isabel Leal, Fangchen Liu, Anirudha Majumdar, Andrew Marmon, Carolina Parada, Yulia Rubanova, Dhruv Shah, Vikas Sindhwani, Jie Tan, Fei Xia, Ted Xiao, Sherry Yang, Wenhao Yu, and Allan Zhou. Evaluating gemini robot...

  55. [55]

    GigaWorld-0: World Models as Data Engine to Empower Embodied AI

    GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, et al. Gigaworld-0: World models as data engine to empower embodied AI.arXiv preprint arXiv:2511.19861, 2025b. Unitree. UnifoLM-WMA-0: A world-model-action (WMA) framework under UnifoLM family,

  56. [56]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,

  57. [57]

    RoboVIP: Multi-View Video Generation with Visual Identity Prompting Augments Robot Manipulation

    Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, et al. RoboVIP: Multi-view video generation with visual identity prompting augments robot manipulation.arXiv preprint arXiv:2601.05241, 2026a. Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, and Kui Jia. EVA: Aligning vi...

  58. [58]

    Freetacman: Robot-Free Visuo-Tactile Data Collection System for Contact-Rich Manipulation

    Longyan Wu, Checheng Yu, Jieji Ren, Li Chen, Yufei Jiang, Ran Huang, Guoying Gu, and Hongyang Li. Freetacman: Robot-free visuo-tactile data collection system for contact-rich manipulation.arXiv preprint arXiv:2506.01941,

  59. [59]

    A Pragmatic VLA Foundation Model

    Wei Wu, Fan Lu, Yunnan Wang, Shuai Yang, Shi Liu, Fangjing Wang, Qian Zhu, He Sun, Yong Wang, Shuailei Ma, et al. A pragmatic VLA foundation model.arXiv preprint arXiv:2601.18692, 2026a. Xiao-Ming Wu, Bin Fan, Kang Liao, Jian-Jian Jiang, Runze Yang, Yihang Luo, Zhonghua Wu, Wei-Shi Zheng, and Chen Change Loy. VLANeXt: Recipes for building strong VLA model...

  60. [60]

    UniDrive-WM: Unified Understanding, Planning and Generation World Model for Autonomous Driving

    Zhexiao Xiong, Xin Ye, Burhan Yaman, Sheng Cheng, Yiren Lu, Jingru Luo, Nathan Jacobs, and Liu Ren. UniDrive-WM: Unified understanding, planning and generation world model for autonomous driving. arXiv preprint arXiv:2601.04453,

  61. [61]

    RISE: Self-Improving Robot Policy with Compositional World Model

    Fuxiang Yang, Donglin Di, Lulu Tang, Xuancheng Zhang, Lei Fan, Hao Li, Wei Chen, Tonghua Su, and Baorui Ma. Chain of world: World model thinking in latent motion. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026a. Jiazhi Yang, Kunyang Lin, Jinwei Li, Wencong Zhang, Tianwei Lin, Longyan Wu, Zhizhong Su, Hao Zhao, Ya- Qin Zhang, Li Che...

  62. [62]

    GigaWorld-Policy: An Efficient Action-Centered World-Action Model

    Sherry Yang, Yilun Du, Seyed Kamyar Seyed Ghasemipour, Jonathan Tompson, Leslie Pack Kaelbling, Dale Schu- urmans, and Pieter Abbeel. Learning interactive real-world simulators. InInternational Conference on Learning Representations, 2024a. Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang...

  63. [63]

    Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    Tianyuan Yuan, Zibin Dong, Yicheng Liu, and Hang Zhao. Fast-WAM: Do world action models need test-time future imagination?arXiv preprint arXiv:2603.16666,

  64. [64]

    EWMBench: Evaluating Scene, Motion, and Semantic Quality in Embodied World Models

    Hu Yue, Siyuan Huang, Yue Liao, Shengcong Chen, Pengfei Zhou, Liliang Chen, Maoqing Yao, and Guanghui Ren. Ewmbench: Evaluating scene, motion, and semantic quality in embodied world models.arXiv preprint arXiv:2505.09694,

  65. [65]

    Twist2: Scalable, Portable, and Holistic Humanoid Data Collection System

    Yanjie Ze, Siheng Zhao, Weizhuo Wang, Angjoo Kanazawa, Rocky Duan, Pieter Abbeel, Guanya Shi, Jiajun Wu, and C Karen Liu. Twist2: Scalable, portable, and holistic humanoid data collection system.arXiv preprint arXiv:2511.02832,

  66. [66]

    Activeumi: Robotic Manipulation with Active Perception from Robot-Free Human Demonstrations

    Qiyuan Zeng, Chengmeng Li, Jude St John, Zhongyi Zhou, Junjie Wen, Guorui Feng, Yichen Zhu, and Yi Xu. Activeumi: Robotic manipulation with active perception from robot-free human demonstrations.arXiv preprint arXiv:2510.01607,

  67. [67]

    Affordance-Based Robot Manipulation with Flow Matching

    Fan Zhang and Michael Gienger. Affordance-based robot manipulation with flow matching.arXiv preprint arXiv:2409.01083,

  68. [68]

    Sparse Video Generation Propels Real-World Beyond-the-View Vision-Language Navigation

    Hai Zhang, Siqi Liang, Li Chen, Yuxian Li, Yukuan Xu, Yichao Zhong, Fu Zhang, and Hongyang Li. Sparse video generation propels real-world beyond-the-view vision-language navigation.arXiv preprint arXiv:2602.05827,

  69. [69]

    World-in-World: World Models in a Closed-Loop World

    Jiahan Zhang, Muqing Jiang, Nanru Dai, Taiming Lu, Arda Uzunoglu, Shunchi Zhang, Yana Wei, Jiahao Wang, Vishal M Patel, Paul Pu Liang, et al. World-in-world: World models in a closed-loop world.arXiv preprint arXiv:2510.18135, 2025a. Jiahui Zhang, Ze Huang, Chun Gu, Zipei Ma, and Li Zhang. Reinforcing action policies by prophesying.arXiv preprint arXiv:25...

  70. [70]

    FRAPPE: Infusing World Modeling into Generalist Policies via Multiple Future Representation Alignment

    Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, XinQiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, and Xin Jin. DreamVLA: A vision-language-action model dreamed with comprehensive world knowledge. 2025e. Han Zhao, Jingbo Wang, Wenxuan Song, Shuai Chen, Yang Liu, Yan Wang, Haoang Li, and Donglin Wang. FRAPPE: ...

  71. [71]

    Humanoid Everyday: A Comprehensive Robotic Dataset for Open-World Humanoid Manipulation

    Zhenyu Zhao, Hongyi Jing, Xiawei Liu, Jiageng Mao, Abha Jha, Hanwen Yang, Rong Xue, Sergey Zakharor, Vitor Guizilini, and Yue Wang. Humanoid everyday: A comprehensive robotic dataset for open-world humanoid manipulation.arXiv preprint arXiv:2510.08807,

  72. [72]

    Tesseract: Learning 4D Embodied World Models

    Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, and Chuang Gan. Tesseract: Learning 4D embodied world models.arXiv preprint arXiv:2504.20995,

  73. [73]

    OmniVTA: Visuo-Tactile World Modeling for Contact-Rich Robotic Manipulation

    Yuhang Zheng, Songen Gu, Weize Li, Yupeng Zheng, Yujie Zang, Shuai Tian, Xiang Li, Ce Hao, Chen Gao, Si Liu, et al. OmniVTA: Visuo-tactile world modeling for contact-rich robotic manipulation.arXiv preprint arXiv:2603.19201,

  74. [74]

    DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

    Yang Zhou, Hao Shao, Letian Wang, Zhuofan Zong, Hongsheng Li, and Steven L Waslander. DrivingGen: A comprehensive benchmark for generative video world models in autonomous driving.arXiv preprint arXiv:2601.01528,

  75. [75]

    Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

    Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets.arXiv preprint arXiv:2504.02792, 2025a. Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. IRASim: A fine-grained world model for robot manipu...