pith. machine review for the scientific record.

arxiv: 2605.12090 · v1 · submitted 2026-05-12 · 💻 cs.RO · cs.CL · cs.CV

Recognition: no theorem link

World Action Models: The Next Frontier in Embodied AI

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 04:56 UTC · model grok-4.3

classification 💻 cs.RO · cs.CL · cs.CV
keywords World Action Models · Embodied AI · Vision-Language-Action · World Models · Robot Learning · Foundation Models · Taxonomy · Action Generation

The pith

World Action Models unify world dynamics prediction with action generation in embodied AI.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines an emerging class of embodied foundation models called World Action Models that go beyond reactive vision-language-action mappings by jointly predicting how the physical environment will evolve under possible actions. It traces the roots of this idea through separate lines of world-model and policy research, then organizes the scattered existing work into a clear taxonomy of cascaded pipelines versus joint architectures. The survey also catalogs the data sources that train these models and the evaluation criteria used to test them. A reader would care because the shift from immediate reaction to explicit forward simulation could enable more reliable long-horizon behavior in robots and other physical agents.

Core claim

The authors formally define World Action Models as embodied foundation models that unify predictive state modeling with action generation by targeting a joint distribution over future states and actions rather than actions alone. They disambiguate this paradigm from prior concepts, trace its origins in VLA and world-model literature, and organize methods into Cascaded WAMs and Joint WAMs with further splits by generation modality, conditioning, and decoding strategy. The paper further synthesizes the supporting data ecosystem and emerging evaluation protocols centered on visual fidelity, physical commonsense, and action plausibility.
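
In the notation the simulated rebuttal below spells out, the contrast with a standard VLA can be written directly (an editorial rendering of the definition, not an equation reproduced from the paper):

    standard VLA:        p(a_t | o_{1:t}, g)
    World Action Model:  p(s_{t+1:T}, a_{t:T} | o_{1:t}, a_{1:t-1}, g)

where s denotes future states, a actions, o observations, and g the task goal.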

What carries the argument

World Action Models (WAMs), which integrate predictive models of environment dynamics directly into the action-generation process to produce joint distributions over future states and actions.

If this is right

  • Architectural choices can be compared systematically by whether world prediction is cascaded before action generation or trained jointly with it (see the sketch after this list).
  • Training draws on a mix of robot teleoperation, human egocentric video, simulation, and internet-scale data to scale beyond narrow robot datasets.
  • Evaluation now requires separate checks on predicted state accuracy, physical plausibility, and final action correctness.
  • Open challenges center on computational cost of forward prediction during real-time control and on scaling the joint modeling objective.
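
A minimal sketch of the two layouts, with hypothetical interfaces (class and method names are illustrative, not drawn from any cited system):

    # Hypothetical skeletons contrasting the two WAM families the survey names.
    # Both assume injected components and elide training entirely.

    class CascadedWAM:
        """World prediction runs first; a separate decoder turns the rollout into actions."""
        def __init__(self, world_model, action_decoder):
            self.world_model = world_model        # predicts how the scene evolves
            self.action_decoder = action_decoder  # maps predicted futures to actions

        def act(self, observation, goal):
            predicted_future = self.world_model.rollout(observation, goal)
            return self.action_decoder(observation, predicted_future, goal)

    class JointWAM:
        """A single model emits future states and actions in one generative pass."""
        def __init__(self, joint_model):
            self.joint_model = joint_model        # trained on the joint objective

        def act(self, observation, goal):
            future_states, actions = self.joint_model.sample(observation, goal)
            return actions  # predicted states are a training and planning byproduct

The cascaded form makes the extra forward-prediction cost explicit at inference time, which is the real-time-control concern flagged among the open challenges.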

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The explicit dynamics component may improve zero-shot transfer to new environments by letting the model simulate interventions it has never seen executed.
  • WAM-style training objectives could be combined with classical planning loops to produce hybrid systems that search over predicted futures before committing to actions.
  • If the taxonomy proves useful, future papers may adopt the cascaded-versus-joint distinction as a standard way to position new methods.

Load-bearing premise

Explicitly modeling how the world changes under an agent's interventions will produce meaningfully better embodied policies than learning direct reactive mappings from current observations to actions.

What would settle it

A controlled comparison on long-horizon robotic manipulation benchmarks in which agents built as World Action Models are measured against matched Vision-Language-Action baselines for task success rate and sample efficiency; absence of consistent gains would indicate that the added predictive component does not deliver the expected benefit.

read the original abstract

Vision-Language-Action (VLA) models have achieved strong semantic generalization for embodied policy learning, yet they learn reactive observation-to-action mappings without explicitly modeling how the physical world evolves under intervention. A growing body of work addresses this limitation by integrating world models, predictive models of environment dynamics, into the action generation pipeline. We term this emerging paradigm World Action Models (WAMs): embodied foundation models that unify predictive state modeling with action generation, targeting a joint distribution over future states and actions rather than actions alone. However, the literature remains fragmented across architectures, learning objectives, and application scenarios, lacking a unified conceptual framework. We formally define WAMs and disambiguate them from related concepts, and trace the foundations and early integration of VLA and world model research that gave rise to this paradigm. We organize existing methods into a structured taxonomy of Cascaded and Joint WAMs, with further subdivision by generation modality, conditioning mechanism, and action decoding strategy. We systematically analyze the data ecosystem fueling WAMs development, spanning robot teleoperation, portable human demonstrations, simulation, and internet-scale egocentric video, and synthesize emerging evaluation protocols organized around visual fidelity, physical commonsense, and action plausibility. Overall, this survey provides the first systematic account of the WAMs landscape, clarifies key architectural paradigms and their trade-offs, and identifies open challenges and future opportunities for this rapidly evolving field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The manuscript introduces World Action Models (WAMs) as an emerging paradigm of embodied foundation models that integrate predictive world models with vision-language-action (VLA) policies. It formally defines WAMs as targeting a joint distribution over future states and actions (rather than actions alone), disambiguates them from related concepts, traces the historical integration of VLA and world-model research, organizes existing methods into a taxonomy of Cascaded versus Joint WAMs (with further subdivisions by generation modality, conditioning mechanism, and action decoding), analyzes the supporting data ecosystem (teleoperation, human demonstrations, simulation, egocentric video), synthesizes evaluation protocols focused on visual fidelity, physical commonsense, and action plausibility, and outlines open challenges.

Significance. If the taxonomy and disambiguation hold, the survey supplies the first systematic conceptual framework for a rapidly fragmenting intersection of world models and embodied policies. This organization of architectural paradigms, data sources, and evaluation axes could reduce duplication of effort and clarify trade-offs, thereby accelerating research on non-reactive, dynamics-aware action generation.

major comments (2)
  1. [Definition and disambiguation] The load-bearing definition of WAMs (abstract and §2) as models that 'target a joint distribution over future states and actions' is stated at a high level but lacks an explicit probabilistic formulation or side-by-side comparison with the conditional action distribution learned by standard VLAs; without this, borderline architectures risk inconsistent classification under the proposed Cascaded/Joint split.
  2. [Taxonomy of Cascaded and Joint WAMs] The taxonomy (§3) subdivides Joint WAMs by modality, conditioning, and decoding strategy, yet supplies no explicit decision criteria, pseudocode, or worked examples of how a given method is assigned to a leaf category; this weakens the claim that the taxonomy clarifies trade-offs and may leave hybrid or emerging methods unclassifiable.
minor comments (3)
  1. [Data ecosystem] The data-ecosystem section would be strengthened by a summary table listing scale, diversity, and annotation characteristics of the cited sources (teleoperation, egocentric video, etc.) to support the assertion that they collectively fuel WAM development.
  2. [Evaluation protocols] Evaluation protocols are synthesized around three axes, but the manuscript would benefit from an explicit table or figure mapping existing benchmarks to those axes and to the taxonomy branches, improving usability for readers.
  3. [Foundations and related work] A small number of citations appear to be missing for recent VLA baselines that already incorporate limited forward prediction; adding them would strengthen the claim of literature fragmentation.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and positive recommendation for minor revision. The feedback identifies opportunities to strengthen the formal grounding and practical utility of the proposed framework. We respond to each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Definition and disambiguation] The load-bearing definition of WAMs (abstract and §2) as models that 'target a joint distribution over future states and actions' is stated at a high level but lacks an explicit probabilistic formulation or side-by-side comparison with the conditional action distribution learned by standard VLAs; without this, borderline architectures risk inconsistent classification under the proposed Cascaded/Joint split.

    Authors: We agree that an explicit probabilistic formulation would reduce ambiguity. In the revised manuscript we will expand §2 with the following formalization: a WAM targets p(s_{t+1:T}, a_{t:T} | o_{1:t}, a_{1:t-1}, g) where s denotes future states, a actions, o observations and g the goal, contrasting it directly with standard VLAs that model only the conditional p(a_t | o_{1:t}, g). A side-by-side comparison table will be added, and borderline cases (e.g., models that predict states only for planning but decode actions separately) will be discussed with classification rules. These additions will be placed immediately after the current high-level definition. revision: yes

  2. Referee: [Taxonomy of Cascaded and Joint WAMs] The taxonomy (§3) subdivides Joint WAMs by modality, conditioning, and decoding strategy, yet supplies no explicit decision criteria, pseudocode, or worked examples of how a given method is assigned to a leaf category; this weakens the claim that the taxonomy clarifies trade-offs and may leave hybrid or emerging methods unclassifiable.

    Authors: We acknowledge that the taxonomy would benefit from operational criteria. The revised §3 will include (i) a decision flowchart with explicit rules (e.g., “if state and action tokens are generated by a single autoregressive pass then classify as Joint; if a separate world-model module is queried before action decoding then Cascaded”), (ii) pseudocode for the classification procedure, and (iii) worked examples for three representative papers (one Cascaded, two Joint with different sub-branches) showing the exact assignment logic. These additions will also note how hybrid methods can be annotated with multiple labels when appropriate. revision: yes
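
The decision rule quoted in (i) can be made concrete with a small runnable sketch; the field names on the method profile are hypothetical, and the authors' revised flowchart may carve the space differently:

    # A sketch of the Cascaded/Joint assignment rule described in the response above.
    from dataclasses import dataclass

    @dataclass
    class MethodProfile:
        single_pass_states_and_actions: bool       # one generative pass emits both
        world_model_queried_before_decoding: bool  # separate world model consulted first

    def classify_wam(m: MethodProfile) -> list:
        labels = []
        if m.single_pass_states_and_actions:
            labels.append("Joint")
        if m.world_model_queried_before_decoding:
            labels.append("Cascaded")
        # Hybrids keep both labels, as the authors propose; a method with neither
        # property reads as a reactive VLA outside the WAM taxonomy.
        return labels or ["Reactive VLA (not a WAM)"]

    # Example: a pipeline that queries a video world model, then decodes actions.
    print(classify_wam(MethodProfile(False, True)))   # ['Cascaded']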

Circularity Check

0 steps flagged

No significant circularity: survey paper with no derivations or fitted quantities

full rationale

This is a survey paper whose contribution is a proposed taxonomy and definition for World Action Models (WAMs) drawn from existing literature. The abstract and structure contain no equations, theorems, parameter fits, predictions, or derivations that could reduce to inputs by construction. Claims about unifying predictive state modeling with action generation are definitional and organizational, supported by external citations rather than self-referential loops. No load-bearing self-citations, ansatzes, or renamings of known results appear in the provided content.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces a conceptual framework and taxonomy but does not introduce or rely on new free parameters, mathematical axioms, or invented physical entities; all content synthesizes prior embodied AI research.

pith-pipeline@v0.9.0 · 5603 in / 1112 out tokens · 55825 ms · 2026-05-13T04:56:35.872308+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

299 extracted references · 299 canonical work pages · 68 internal anchors
