How Should World Models Be Evaluated for Embodied Decision-Making? A Decision-Making-Centric Position

Haoxiang Ren; Haoxin Lin; Shiyuan Zhang; Yang Yu; Yifei Sheng

arxiv: 2606.15032 · v2 · pith:4CA65QLCnew · submitted 2026-06-13 · 💻 cs.LG


Pith reviewed 2026-06-30 10:28 UTC · model grok-4.3

classification 💻 cs.LG

keywords world models, embodied decision-making, evaluation ladder, interventional reasoning, policy optimization, video prediction, closed-loop rollout, distribution shift


The pith

World models for embodied decision-making must be judged by their support for interventional reasoning and policy optimization, not video realism.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys recent work on world models and identifies a pattern where evaluation metrics often fail to match the decision-making claims made about the models. It argues that for embodied settings the decisive tests involve whether the model enables accurate policy evaluation, planning, and optimization under interventions, distribution shifts, and long rollouts. To organize this, the authors introduce an L0-L7 ladder of criteria that forms an evidential hierarchy progressing from visual checks to measurable policy gains. A sympathetic reader would care because this reframing could reduce wasted effort on models that look good but do not help actual control.

Core claim

For models presented as world models for embodied decision-making, the more decisive issue is not whether the model generates visually convincing videos, but whether it supports reliable interventional reasoning, policy evaluation, planning, and policy optimization under intervention, policy-induced distribution shift, and long-horizon rollout. The survey organizes evidence using an L0-L7 ladder spanning visual plausibility to policy optimization utility, foregrounding interventional action fidelity, closed-loop rollout validity, reward/value prediction, policy-ranking agreement, optimization lift, model exploitability, and uncertainty calibration.

What carries the argument

The L0-L7 ladder, an evidential hierarchy of evaluation criteria that cuts across visual, predictive, and interventional axes rather than producing a single scalar score.

If this is right

Interventional action fidelity and closed-loop rollout validity become required tests for any decision-making claim.
Reward and value prediction accuracy must be measured alongside perceptual similarity.
Policy-ranking agreement and measured optimization lift become direct evidence of utility.
Model exploitability and uncertainty calibration under shift indicate whether the model can be trusted for planning.
A minimal feasible reporting set applies even in real-robot experiments.
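Two of the quantities named above, policy-ranking agreement and optimization lift, can be made concrete with a small sketch. The text shown here does not say which estimators the paper has in mind, so Kendall's tau and a plain difference of returns are assumptions chosen for illustration, the function names are hypothetical, and every number is made up.

```python
from itertools import combinations


def kendall_tau(model_returns, real_returns):
    """Kendall tau-a between model-estimated and real returns, one entry per policy."""
    if len(model_returns) != len(real_returns) or len(model_returns) < 2:
        raise ValueError("need two equally long lists with at least two policies")
    concordant = discordant = 0
    for i, j in combinations(range(len(model_returns)), 2):
        # The sign of the product says whether both sources order the pair the same way.
        s = (model_returns[i] - model_returns[j]) * (real_returns[i] - real_returns[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(model_returns) * (len(model_returns) - 1) // 2
    return (concordant - discordant) / n_pairs


def optimization_lift(real_return_before, real_return_after):
    """Change in real-environment return after optimizing the policy inside the model."""
    return real_return_after - real_return_before


# Hypothetical returns for four candidate policies (illustrative numbers only).
model_est = [0.62, 0.40, 0.75, 0.10]  # returns estimated by rolling out in the world model
real_eval = [0.55, 0.45, 0.70, 0.20]  # returns measured in the real environment

print("ranking agreement:", kendall_tau(model_est, real_eval))
print("optimization lift:", optimization_lift(0.55, 0.70))
```

A tau near 1 means the model orders policies the way the real environment does, even when its absolute return estimates are off. The lift is measured in the real environment, so a model that the optimizer exploits would show up as a small or negative value here regardless of how the rollouts look.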

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The ladder could be used to re-score existing published world models and expose claim-evidence gaps across the literature.
Simulation-to-real transfer studies might test whether ladder level predicts transfer success more reliably than video quality.
Extending the hierarchy to multi-step reward shaping or safety constraints would address additional embodied requirements.
If the ordering holds, benchmark suites could drop low-level metrics once higher-level evidence is present.

Load-bearing premise

That the L0-L7 levels form a reliable hierarchy in which higher levels are strictly more decisive for embodied decision-making utility than lower levels, without needing separate validation of the ordering.

What would settle it

A controlled comparison in which models that pass low ladder levels but fail high ones still produce superior real-robot policy performance, or in which visual metrics alone correlate more strongly with policy success than the full ladder, would falsify the claimed hierarchy.
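The correlation half of that test can be sketched as follows. This is a minimal illustration, not a procedure taken from the paper: it assumes each model has a visual score, a highest ladder level passed, and a measured real-robot success rate, and it uses Spearman rank correlation as a stand-in for "correlates more strongly". The data and variable names are hypothetical.

```python
def ranks(values):
    """Average 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (vx * vy) ** 0.5


# Hypothetical results for five world models (illustrative numbers only).
visual_score = [0.9, 0.8, 0.7, 0.6, 0.5]       # e.g. a perceptual-similarity score
ladder_level = [2, 5, 3, 6, 4]                 # highest ladder level each model passes
real_success = [0.30, 0.70, 0.40, 0.80, 0.55]  # measured real-robot success rate

print("visual vs success:", spearman(visual_score, real_success))
print("ladder vs success:", spearman(ladder_level, real_success))
```

In this made-up data the ladder level tracks success and the visual score does not; the hierarchy would be in trouble if real measurements showed the opposite pattern. With only a handful of models such correlations are weak evidence, which is one reason the text asks for a controlled comparison.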

Original abstract

World models have become a central abstraction in modern AI. The term now refers to several different objects: action-conditioned environment models, latent imagination models, future-video predictors, interactive neural simulators, latent predictive representations, and synthetic-data engines. Evaluation has broadened along with the term. Recent papers measure video realism, perceptual similarity, instruction following, physical plausibility, policy ranking, executability, planning success, and downstream policy improvement. This produces both metric diversity and a recurring problem of claim/evidence mismatch: papers sometimes make a stronger claim about what their model is useful for than their evaluation can establish. This paper surveys the recent literature and argues that, for models presented as world models for embodied decision-making, the more decisive issue is not whether the model generates visually convincing videos, but whether it supports reliable interventional reasoning, policy evaluation, planning, and policy optimization under intervention, policy-induced distribution shift, and long-horizon rollout. We organize the survey using an L0--L7 ladder spanning visual plausibility to policy optimization utility, noting that the levels cut across several orthogonal axes and so form an evidential hierarchy rather than a single scalar. The framework foregrounds interventional action fidelity, closed-loop rollout validity, reward/value prediction, policy-ranking agreement, optimization lift, model exploitability, and uncertainty calibration, with a minimal feasible reporting set for real-robot settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

This position paper surveys world-model evaluation metrics and organizes them into an L0-L7 ladder, but asserts an evidential hierarchy without validation or new experiments.


The main takeaway is that the paper flags a real mismatch between what papers claim about their world models and what their evaluations actually show, then proposes an L0-L7 ladder to prioritize interventional and policy-level metrics over visual ones for embodied decision-making.

The survey itself pulls together a range of existing metrics—video realism, physical plausibility, policy ranking, optimization lift, and uncertainty calibration—and groups them into levels that cut across different axes. That organizational step is the clearest new element; it gives people a shared way to talk about what their tests actually support rather than treating everything as interchangeable. The suggestion of a minimal reporting set for real-robot work is also practical and directly addresses the claim-evidence gap the authors identify.

The soft spot is that the ladder is presented as forming a strict evidential hierarchy, with higher levels treated as more decisive, yet the text offers no derivation, counter-example checks, or empirical mapping to justify why, say, L5 policy-ranking agreement should always outrank L4 closed-loop rollout across settings. No new experiments test whether the ordering holds under policy-induced shift or long horizons, so the position rests on logical re-arrangement of prior literature. That leaves the load-bearing ordering unvalidated and limits how far the position can move practice without follow-up work.

This is for researchers working on world models, simulators, or planning in robotics and embodied AI who need to decide what to measure. Readers already wrestling with evaluation choices will find the survey useful as a reference and discussion starter. The topic is central enough and the survey grounded enough in the cited literature that it deserves a serious referee, even if the hierarchy needs more grounding in review.

Referee Report

2 major / 2 minor

Summary. The paper surveys recent literature on world model evaluation in embodied AI and identifies recurring claim/evidence mismatches. It argues that for models positioned as world models for embodied decision-making, evaluation should prioritize support for interventional reasoning, policy evaluation, planning, and policy optimization under intervention, distribution shift, and long-horizon rollout over lower-level metrics such as visual plausibility. The central contribution is an L0--L7 ladder that organizes metrics across visual, planning, and optimization axes into an evidential hierarchy rather than a single scalar, with a proposed minimal reporting set for real-robot settings.

Significance. If the proposed hierarchy is valid, the paper would provide a valuable organizing framework that helps the field move beyond video realism metrics toward decision-making utility. The survey of orthogonal evaluation axes (visual fidelity, closed-loop validity, reward prediction, policy ranking, optimization lift) and explicit foregrounding of interventional action fidelity and uncertainty calibration are constructive contributions that could reduce mismatches between model claims and supporting evidence.

major comments (2)

[Abstract] Abstract: The assertion that the L0--L7 levels 'form an evidential hierarchy rather than a single scalar' because they 'cut across several orthogonal axes' is presented as an organizing principle without formal derivation, counter-example analysis, or empirical mapping. No argument is given showing why, e.g., L5 policy-ranking agreement is strictly more decisive than L4 closed-loop rollout validity for all embodied settings, nor why the ordering is robust to policy-induced distribution shift. This ordering is load-bearing for the central position.
[Introduction / Survey sections] The paper identifies claim/evidence mismatch in prior work but supplies no quantitative analysis or new experiments demonstrating that the L0--L7 hierarchy resolves the mismatch; the argument rests entirely on logical re-organization of existing practices. This leaves the claim that the ladder 'can be applied without additional empirical validation of its ordering' unsupported.

minor comments (2)

[Ladder description] The definitions of each level (L0 through L7) would benefit from explicit cross-references to the specific metrics or papers that instantiate them, to improve traceability.
[Real-robot reporting section] Notation for the 'minimal feasible reporting set' is introduced without a compact table or checklist format, which reduces immediate usability for practitioners.
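One way the checklist format asked for in the last comment could look is sketched below. The fields are the seven criteria the abstract foregrounds; the paper's actual minimal feasible reporting set is not reproduced on this page, so the class name, the field names, and the choice of a single float per criterion are all assumptions made for illustration.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class EvaluationReport:
    """Hypothetical report: one optional score per criterion; None means not reported."""
    interventional_action_fidelity: Optional[float] = None
    closed_loop_rollout_validity: Optional[float] = None
    reward_value_prediction: Optional[float] = None
    policy_ranking_agreement: Optional[float] = None
    optimization_lift: Optional[float] = None
    model_exploitability: Optional[float] = None
    uncertainty_calibration: Optional[float] = None

    def missing(self):
        """Names of criteria for which no evidence was reported."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def as_checklist(self):
        """Render the report as a plain-text checklist, one criterion per line."""
        lines = []
        for f in fields(self):
            value = getattr(self, f.name)
            mark = "x" if value is not None else " "
            shown = "not reported" if value is None else f"{value:.2f}"
            lines.append(f"[{mark}] {f.name.replace('_', ' ')}: {shown}")
        return "\n".join(lines)


# Illustrative usage with made-up numbers.
report = EvaluationReport(policy_ranking_agreement=0.8, optimization_lift=0.15)
print(report.as_checklist())
print("missing:", report.missing())
```

Keeping unreported criteria visible as empty boxes, rather than omitting them, makes the gap between a paper's claim and its evidence readable at a glance.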

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the justification needed for the proposed L0-L7 framework. We respond to each major comment below and outline planned revisions to address the concerns about formal grounding and empirical support.


Referee: [Abstract] Abstract: The assertion that the L0--L7 levels 'form an evidential hierarchy rather than a single scalar' because they 'cut across several orthogonal axes' is presented as an organizing principle without formal derivation, counter-example analysis, or empirical mapping. No argument is given showing why, e.g., L5 policy-ranking agreement is strictly more decisive than L4 closed-loop rollout validity for all embodied settings, nor why the ordering is robust to policy-induced distribution shift. This ordering is load-bearing for the central position.

Authors: The hierarchy is derived from the increasing strength of evidence required to support interventional and optimization claims in embodied decision-making, following principles from causal inference (interventional evidence supersedes observational) and RL evaluation (policy optimization utility is the ultimate test). The levels cut across visual, planning, and optimization axes, so they are not collapsed into a scalar. We do not claim the ordering is strict or invariant across all settings. We will add a dedicated subsection providing the rationale for each transition, discussing counterexamples (e.g., tasks where closed-loop validity alone is decisive), and addressing robustness to policy-induced shifts. This will make the load-bearing assumptions explicit. revision: yes
Referee: [Introduction / Survey sections] The paper identifies claim/evidence mismatch in prior work but supplies no quantitative analysis or new experiments demonstrating that the L0--L7 hierarchy resolves the mismatch; the argument rests entirely on logical re-organization of existing practices. This leaves the claim that the ladder 'can be applied without additional empirical validation of its ordering' unsupported.

Authors: The manuscript is a survey and position paper whose contribution is the identification of mismatches via literature review and the proposal of a structured reporting framework. No new experiments are presented because the work reorganizes existing practices rather than validating a metric. We agree the statement about applying the ladder without empirical validation of its ordering is unsupported and will remove or qualify it. The revised text will present the ladder as a conceptual guideline that encourages higher-evidence evaluations, while noting that empirical testing of the ordering is a direction for future work. revision: yes

Circularity Check

0 steps flagged

No circularity; L0-L7 framework constructed from external literature survey without self-referential reduction


The paper is a position/survey piece that organizes existing evaluation practices into an L0-L7 ladder asserted to form an evidential hierarchy. No equations, fitted parameters, or predictions appear in the provided text. The hierarchy is introduced by surveying external literature rather than by any internal derivation, self-citation chain, or definitional loop that would make a claimed result equivalent to its inputs by construction. The central claim therefore remains independent of any self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that interventional reasoning and policy optimization are the primary intended uses of world models in embodied settings; no free parameters or invented entities are introduced.

axioms (1)

domain assumption: World models for embodied decision-making are primarily intended to support interventional reasoning, planning, and policy optimization.
This premise defines the scope of the L0-L7 ladder and is stated in the abstract as the context for the evaluation critique.

pith-pipeline@v0.9.1-grok · 5789 in / 1263 out tokens · 39487 ms · 2026-06-30T10:28:36.903043+00:00 · methodology


Reference graph

Works this paper leans on

76 extracted references · 57 canonical work pages · 19 internal anchors

[1]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muck- ley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xi...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Dream to manipulate: Compositional world models empowering robot imitation learning with imagina- tion

Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, and Efstratios Gavves. Dream to manipulate: Compositional world models empowering robot imitation learning with imagina- tion. InInternational Conference on Learning Representations, 2025. URLhttps://arxiv.org/abs/2412. 14957

2025
[3]

Persistent robot world models: Stabiliz- ing multi-step rollouts via reinforcement learning, 2026

Jai Bardhan, Patrik Drozdik, Josef Sivic, and Vladimir Petrik. Persistent robot world models: Stabiliz- ing multi-step rollouts via reinforcement learning, 2026. URLhttps://arxiv.org/abs/2603.25685. arXiv:2603.25685

work page arXiv 2026
[4]

Kao, Adina Williams, Michael Rabbat, and Em- manuel Dupoux

Florian Bordes, Quentin Garrido, Justine T. Kao, Adina Williams, Michael Rabbat, and Emmanuel Dupoux. IntPhys 2: Benchmarking intuitive physics understanding in complex synthetic environments, 2025. URL https://arxiv.org/abs/2506.09849. arXiv:2506.09849

work page arXiv 2025
[5]

Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard, and Abhinav Valada

Akshay L. Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard, and Abhinav Valada. DiW A: Diffusion policy adaptation with world models, 2025. URLhttps://arxiv.org/abs/2508. 03645. arXiv:2508.03645

work page arXiv 2025
[6]

WorldPrediction: A benchmark for high- level world modeling and long-horizon procedural planning

Delong Chen, Willy Chung, Yejin Bang, Ziwei Ji, and Pascale Fung. WorldPrediction: A benchmark for high- level world modeling and long-horizon procedural planning. InICML World Models Workshop, 2025. URL https://openreview.net/forum?id=3GuGN0bacr. 22 How Should World Models Be Evaluated for Embodied Decision-Making?

2025
[7]

Policy-conditioned en- vironment models are more generalizable

Ruifeng Chen, Xiong-Hui Chen, Yihao Sun, Siyuan Xiao, Minhui Li, and Yang Yu. Policy-conditioned en- vironment models are more generalizable. InInternational Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=g9mYBdooPA

2024
[8]

Adversarial counterfactual environment model learning

Xiong-Hui Chen, Yang Yu, Zheng-Mao Zhu, Zhihua Yu, Zhenjun Chen, Chenghe Wang, Yinan Wu, Hongqiu Wu, Rong-Jun Qin, Ruijin Ding, and Fangsheng Huang. Adversarial counterfactual environment model learning. InAdvances in Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id= rHAX0LRwk8

2023
[9]

ABot- PhysWorld: Interactive world foundation model for robotic manipulation with physics alignment, 2026

Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, and Mu Xu. ABot-PhysWorld: Interactive world foundation model for robotic manipulation with physics alignment, 2026. URLhttps://arxiv.org/ abs/2603.23376. arXiv:2603.23376

work page arXiv 2026
[10]

arXiv preprint arXiv:2410.15461 (2024) 16 K

Xiaowei Chi, Hengyuan Zhang, Chun-Kai Fan, Xingqun Qi, Rongyu Zhang, Anthony Chen, Chi min Chan, Wei Xue, Wenhan Luo, Shanghang Zhang, and Yike Guo. EV A: An embodied world model for future video anticipation, 2024. URLhttps://arxiv.org/abs/2410.15461. arXiv:2410.15461

work page arXiv 2024
[11]

Rethinking video generation model for the embodied world, 2026

Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and Daquan Zhou. Rethinking video generation model for the embodied world, 2026. URLhttps://arxiv.org/abs/ 2601.15282. arXiv:2601.15282

work page arXiv 2026
[12]

arXiv preprint arXiv:2310.10625 (2023)

Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning. InInternational Conference on Learning Representations, 2024. URLhttps://arxiv.org/abs/ 2310.10625

work page arXiv 2024
[13]

WorldScore: A unified evaluation benchmark for world generation

Haoyi Duan, Hong-Xing Yu, Sirui Chen, Fei-Fei Li, and Jiajun Wu. WorldScore: A unified evaluation benchmark for world generation. InInternational Conference on Computer Vision, 2025. URLhttps://arxiv.org/abs/ 2504.00983

work page arXiv 2025
[14]

Wow, wo, val! a comprehensive embodied world model evaluation turing test.arXiv preprint arXiv:2601.04137, 2026

Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, Weishi Mi, Qingpo Wuwu, Peidong Jia, Yulin Luo, Kevin Zhang, Zhiyuan Qin, Yong Dai, Sirui Han, Yike Guo, Shanghang Zhang, and Jian Tang. Wow, wo, val! a comprehensive embodied world model evaluation turing test, 2026. URLhttps://arxiv.o...

work page arXiv 2026
[15]

Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Ammar Rizvi, and Justine T. Kao. CausalVQA: A physically grounded causal reasoning benchmark for video models, 2025. URLhttps://arxiv.org/abs/ 2506.09943. arXiv:2506.09943

work page arXiv 2025
[16]

Learning video generation for robotic manipulation with collaborative trajectory control

Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, and Dahua Lin. Learning video generation for robotic manipulation with collaborative trajectory control. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=OeDwYtp8n1

2026
[17]

Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jan- naty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abbe...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[18]

Evaluating gemini robotics policies in a veo world simulator, 2025

Gemini Robotics Team, Krzysztof Choromanski, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Ab- hishek Jindal, Thomas Kipf, Sean Kirmani, Isabel Leal, Fangchen Liu, Anirudha Majumdar, Andrew Mar- mon, Carolina Parada, Yulia Rubanova, Dhruv Shah, Vikas Sindhwani, Jie Tan, Fei Xia, Ted Xiao, Sherry Yang, Wenhao Yu, and Allan Zhou. Evaluating gemini r...

work page arXiv 2025
[19]

GigaBrain-0.5M*: A VLA that learns from world model-based reinforcement learning, 2026

GigaBrain Team, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, Lv Feng, Mingming Yu, Peng Li, Qiuping Deng, Tianze Liu, Xinyu Zhou, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yifei Nie, Yilong Li, Yukun Zhou, Yun Ye, Zhichao Liu, and Zheng Zhu. GigaBrain-0.5M*: A VLA that learns from world model-based reinfo...

work page arXiv 2026
[20]

GigaWorld-0: World models as data engine to empower embodied AI, 2025

GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, and Zheng Zhu. GigaWorld-0: World models as data engine to em...

work page arXiv 2025
[21]

The value equivalence principle for model-based reinforcement learning

Christopher Grimm, André Barreto, Satinder Singh, and David Silver. The value equivalence principle for model-based reinforcement learning. InAdvances in Neural Information Process- ing Systems, volume 33, 2020. URLhttps://proceedings.neurips.cc/paper/2020/hash/ 3bb585ea00014b0e3ebe4c6dd165a358-Abstract.html. arXiv:2011.03506

work page arXiv 2020
[22]

Proper value equiva- lence

Christopher Grimm, André Barreto, Gregory Farquhar, David Silver, and Satinder Singh. Proper value equiva- lence. InAdvances in Neural Information Processing Systems, volume 34, 2021. URLhttps://proceedings. neurips.cc/paper/2021/hash/00ac8ed3b4327bdd4ebbebcb2ba10a00-Abstract.html

2021
[23]

VLAW: Iterative co-improvement of vision-language-action policy and world model, 2026

Yanjiang Guo, Tony Lee, Lucy Xiaoyang Shi, Jianyu Chen, Percy Liang, and Chelsea Finn. VLAW: Iterative co-improvement of vision-language-action policy and world model, 2026. URLhttps://arxiv.org/abs/ 2602.12063. arXiv:2602.12063

work page arXiv 2026
[24]

Ctrl-World: A Controllable Generative World Model for Robot Manipulation

Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-World: A controllable generative world model for robot manipulation. InInternational Conference on Learning Representations, 2026. URL https://arxiv.org/abs/2510.10125

work page internal anchor Pith review Pith/arXiv arXiv 2026
[25]

World Models

David Ha and Jürgen Schmidhuber. World models, 2018. URLhttps://arxiv.org/abs/1803.10122. arXiv:1803.10122

work page internal anchor Pith review Pith/arXiv arXiv 2018
[26]

Mastering Diverse Domains through World Models

Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models, 2023. URLhttps://arxiv.org/abs/2301.04104. arXiv:2301.04104

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

Vid2World: Crafting video diffusion models to interactive world models

Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting video diffusion models to interactive world models. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=pFyzqbUiF9

2026
[28]

DreamGen: Unlocking Generalization in Robot Learning through Video World Models

Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zh...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

When to trust your model: model-based policy optimization

Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. InAdvances in Neural Information Processing Systems, 2019. URLhttps://arxiv.org/abs/ 1906.08253

work page arXiv 2019
[30]

Model gradi- ent: unified model and policy learning in model-based reinforcement learning.Frontiers of Computer Science, 18:184339, 2024

Chengxing Jia, Fuxiang Zhang, Tian Xu, Jing-Cheng Pang, Zongzhang Zhang, and Yang Yu. Model gradi- ent: unified model and policy learning in model-based reinforcement learning.Frontiers of Computer Science, 18:184339, 2024

2024
[31]

RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

Feng Jiang, Yang Chen, Kyle Xu, Yuchen Liu, Haifeng Wang, Zhenhao Shen, Jasper Lu, Shengze Huang, Yuanfei Wang, Chen Xie, and Ruihai Wu. RoboWM-Bench: A benchmark for evaluating world models in robotic manipulation, 2026. URLhttps://arxiv.org/abs/2604.19092. arXiv:2604.19092

work page internal anchor Pith review Pith/arXiv arXiv 2026
[32]

World4RL: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation, 2025

Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dong- bin Zhao. World4RL: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation, 2025. URLhttps://arxiv.org/abs/2509.19080. arXiv:2509.19080

work page arXiv 2025
[33]

WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL

Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, Yu Wang, Haoran Li, Chao Yu, and Dongbin Zhao. WoVR: World models as reliable simulators for post-training VLA policies with RL, 2026. URLhttps://arxiv.org/abs/2602. 13977. arXiv:2602.13977

work page internal anchor Pith review Pith/arXiv arXiv 2026
[34]

A shortcut-aware video-QA benchmark for physical understanding via minimal video pairs.Transactions on Machine Learning Research, 2025

Benno Krojer, Mojtaba Komeili, Candace Ross, Quentin Garrido, Koustuv Sinha, Nicolas Ballas, and Mahmoud Assran. A shortcut-aware video-QA benchmark for physical understanding via minimal video pairs.Transactions on Machine Learning Research, 2025. URLhttps://arxiv.org/abs/2506.09987

work page arXiv 2025
[35]

Objective mismatch in model-based rein- forcement learning

Nathan Lambert, Brandon Amos, Omry Yadan, and Roberto Calandra. Objective mismatch in model-based rein- forcement learning. InProceedings of the 2nd Conference on Learning for Dynamics and Control (L4DC), vol- ume 120 ofProceedings of Machine Learning Research, pages 761–770, 2020. URLhttps://proceedings. mlr.press/v120/lambert20a.html. arXiv:2002.04523

work page arXiv 2020
[36]

Gonzalez, Ion Stoica, Song Han, and Yao Lu

Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E. Gonzalez, Ion Stoica, Song Han, and Yao Lu. WorldModelBench: Judging video gener- ation models as world models, 2025. URLhttps://arxiv.org/abs/2502.20694. arXiv:2502.20694. 24 How Should World Models Be Evaluated for Embodied Decision-Making?

work page arXiv 2025
[37]

VLA-RFT: Vision-language-action reinforcement fine-tuning with verified rewards in world simulators, 2025

Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, and Weihua Su. VLA-RFT: Vision-language-action reinforcement fine-tuning with verified rewards in world simulators, 2025. URLhttps://arxiv.org/abs/2510.00406. arXiv:2510.00406

work page arXiv 2025
[38]

dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

Yaxuan Li, Zhongyi Zhou, Yefei Chen, Yaokai Xue, and Yichen Zhu. dWorldEval: Scalable robotic pol- icy evaluation via discrete diffusion world model, 2026. URLhttps://arxiv.org/abs/2604.22152. arXiv:2604.22152

work page internal anchor Pith review Pith/arXiv arXiv 2026
[39]

Dreamitate: Real-world visuomotor policy learning via video generation

Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl V ondrick. Dreamitate: Real-world visuomotor policy learning via video generation. InConference on Robot Learning, 2024. URLhttps://arxiv.org/abs/2406.16862

work page arXiv 2024
[40]

Genie envisioner: A unified world foundation platform for robotic manipulation

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, and Guanghui Ren. Genie envisioner: A unified world foundation platform for robotic manipulation. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=fHLtSxDFKC

2026
[41]

ADM-v2: Pursuing full-horizon roll-out in dynamics models for offline policy learning and evaluation

Haoxin Lin, Siyuan Xiao, Yi-Chen Li, Zhilong Zhang, Yihao Sun, Chengxing Jia, and Yang Yu. ADM-v2: Pursuing full-horizon roll-out in dynamics models for offline policy learning and evaluation. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=ICbXEwqpga

2026
[42]

World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, and Mike Zheng Shou. World-VLA-Loop: Closed- loop learning of video world model and VLA policy, 2026. URLhttps://arxiv.org/abs/2602.06508. arXiv:2602.06508

work page internal anchor Pith review Pith/arXiv arXiv 2026
[43]

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels, 2026. URLhttps://arxiv.org/abs/2603. 19312. arXiv:2603.19312

[44] Amir massoud Farahmand, André Barreto, and Daniel Nikovski. Value-aware loss function for model-based reinforcement learning. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 of Proceedings of Machine Learning Research, pages 1486–1494, 2017. URL https://proceedings.mlr.press/v54/farahmand17a.html

[45] Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-JEPA 2.1: Unlocking dense features in video self-supervised learning. URL https://arxiv.org/abs/2603.14482. arXiv:2603.14482

[47] NVIDIA. PBench: A physical AI benchmark for world models, 2025. URL https://research.nvidia.com/labs/cosmos-lab/pbench/

[48] NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jin... 2025
[49] Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, Lei Bai, Wanli Ouyang, and Ruimao Zhang. WorldSimBench: Towards video generation models as world simulators, 2024. URL https://arxiv.org/abs/2410.18072. arXiv:2410.18072

[50] Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, and Sherry Yang. WorldGym: World model as an environment for policy evaluation, 2025. URL https://arxiv.org/abs/2506.00613. arXiv:2506.00613

[51] Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, Chen Gao, Wei Wu, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong Tian, Tat-Seng Chua, Wenwu Zhu, and Yong Li. WorldArena: A unified benchmark for evaluating perception and functional utility of embodied world models. arXiv preprint arXiv:2602.08971, 2026

[52] Ansh Kumar Sharma, Yixiang Sun, Ninghao Lu, Yunzhe Zhang, Jiarao Liu, and Sherry Yang. World-Gymnast: Training robots with reinforcement learning in a world model, 2026. URL https://arxiv.org/abs/2602.02454. arXiv:2602.02454

[53] Wei-Cheng Tseng, Jinwei Gu, Qinsheng Zhang, Hanzi Mao, Ming-Yu Liu, Florian Shkurti, and Lin Yen-Chen. Scalable policy evaluation with video world models, 2025. URL https://arxiv.org/abs/2511.11520. arXiv:2511.11520

[54] Keyon Vafa, Justin Y. Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan. Evaluating the world model implicit in a generative model, 2024. URL https://arxiv.org/abs/2406.03689. arXiv:2406.03689

[55] Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, and Jiangmiao Pang. RoboVIP: Multi-view video generation with visual identity prompting augments robot manipulation, 2026. URL https://arxiv.org/abs/2601.05241. arXiv:2601.05241

[56] Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, and Kui Jia. EVA: Aligning video world models with executable robot actions via inverse dynamics rewards, 2026. URL https://arxiv.org/abs/2603.17808. arXiv:2603.17808

[57] Yixuan Wang, Rhythm Syed, Fangyu Wu, Mengchao Zhang, Aykut Onol, Jose Barreiros, Hooshang Nayyeri, Tony Dear, Huan Zhang, and Yunzhu Li. Interactive world simulator for robot policy training and evaluation. URL https://arxiv.org/abs/2603.08546. arXiv:2603.08546

[59] Archana Warrier, Dat Nguyen, Michelangelo Naim, Moksh Jain, Yichao Liang, Karen Schroeder, Cambridge Yang, Joshua B. Tenenbaum, Sebastian Vollmer, Kevin Ellis, and Zenna Tavares. Benchmarking world-model learning with environment-level queries, 2025. URL https://arxiv.org/abs/2510.19788. arXiv:2510.19788

[60] Ran Wei, Nathan Lambert, Anthony McDonald, Alfredo Garcia, and Roberto Calandra. A unified view on solving objective mismatch in model-based reinforcement learning. Transactions on Machine Learning Research. URL https://openreview.net/forum?id=tQVZgvXhZb. arXiv:2310.06253

[62] Philipp Wu, Alejandro Escontrela, Danijar Hafner, Ken Goldberg, and Pieter Abbeel. DayDreamer: World models for physical robot learning. In Conference on Robot Learning, 2023. URL https://arxiv.org/abs/2206.14176

[63] Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-Env: Leveraging world model as a virtual environment for VLA post-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. URL https://arxiv.org/abs/2509.24948

[64] Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, and Ziwei Liu. Kinema4D: Kinematic 4d world modeling for spatiotemporal embodied simulation, 2026. URL https://arxiv.org/abs/2603.16669. arXiv:2603.16669

[65] Jiazhi Yang, Kunyang Lin, Jinwei Li, Wencong Zhang, Tianwei Lin, Longyan Wu, Zhizhong Su, Hao Zhao, Ya-Qin Zhang, Li Chen, Ping Luo, Xiangyu Yue, and Hongyang Li. RISE: Self-improving robot policy with compositional world model, 2026. URL https://arxiv.org/abs/2602.11075. arXiv:2602.11075

[66] Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Soumajit Majumder, Ziyuan Liu, Gitta Kutyniok, and Abhinav Valada. RoboEnvision: A long-horizon video generation model for multi-task robot manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2025. URL https://arxiv.org/abs/2506.22007

[67] Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.06114

[68] Zhaoyang Yang, Yurun Jin, Lizhe Qi, Cong Huang, and Kai Chen. EA-WM: Event-aware generative world model with structured kinematic-to-visual action fields, 2026. URL https://arxiv.org/abs/2605.06192. arXiv:2605.06192

[69] Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M. Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, and Anirudha Majumdar. PlayWorld: Learning robot world models from autonomous play, 2026. URL https://arxiv.org/abs/2603.09030. arXiv:2603.09030

[70] Hu Yue, Siyuan Huang, Yue Liao, Shengcong Chen, Pengfei Zhou, Liliang Chen, Maoqing Yao, and Guanghui Ren. EWMBench: Evaluating scene, motion, and semantic quality in embodied world models, 2025. URL https://arxiv.org/abs/2505.09694. arXiv:2505.09694

[71] Jiahui Zhang, Ze Huang, Chun Gu, Zipei Ma, and Li Zhang. ProphRL: Reinforcing action policies by prophesying, 2025. URL https://arxiv.org/abs/2511.20633. arXiv:2511.20633

[72] Zhilong Zhang, Ruifeng Chen, Junyin Ye, Yihao Sun, Pengyuan Wang, Jingcheng Pang, Kaiyuan Li, Tianshuo Liu, Haoxin Lin, Yang Yu, and Zhi-Hua Zhou. WHALE: Towards generalizable and scalable world models for embodied decision-making, 2024. URL https://arxiv.org/abs/2411.05619. arXiv:2411.05619

[73] Zhilong Zhang, Haoxiang Ren, Yihao Sun, Yifei Sheng, Haonan Wang, Haoxin Lin, Zhichao Wu, Pierre-Luc Bacon, and Yang Yu. Towards practical world model-based reinforcement learning for vision-language-action models, 2026. URL https://openreview.net/forum?id=gB1yFEd106. ICLR 2026 World Models Workshop

[74] Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. RoboDreamer: Learning compositional world models for robot imagination. In International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2404.12377

[75] Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. IRASim: A fine-grained world model for robot manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9834–9844, 2025. URL https://openaccess.thecvf.com/content/ICCV2025/html/Zhu_IRASim_A_Fine-Grained_World_Model_for_Robot_Manipulation_ICCV_202...

[76] Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, and Song Guo. WMPO: World model-based policy optimization for vision-language-action models. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=qE2FyvRvuF

[39] [39]

Dreamitate: Real-world visuomotor policy learning via video generation

Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl V ondrick. Dreamitate: Real-world visuomotor policy learning via video generation. InConference on Robot Learning, 2024. URLhttps://arxiv.org/abs/2406.16862

work page arXiv 2024

[40] [40]

Genie envisioner: A unified world foundation platform for robotic manipulation

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, and Guanghui Ren. Genie envisioner: A unified world foundation platform for robotic manipulation. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=fHLtSxDFKC

2026

[41] [41]

ADM-v2: Pursuing full-horizon roll-out in dynamics models for offline policy learning and evaluation

Haoxin Lin, Siyuan Xiao, Yi-Chen Li, Zhilong Zhang, Yihao Sun, Chengxing Jia, and Yang Yu. ADM-v2: Pursuing full-horizon roll-out in dynamics models for offline policy learning and evaluation. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=ICbXEwqpga

2026

[42] [42]

World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, and Mike Zheng Shou. World-VLA-Loop: Closed- loop learning of video world model and VLA policy, 2026. URLhttps://arxiv.org/abs/2602.06508. arXiv:2602.06508

work page internal anchor Pith review Pith/arXiv arXiv 2026

[43] [43]

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels, 2026. URLhttps://arxiv.org/abs/2603. 19312. arXiv:2603.19312

work page internal anchor Pith review Pith/arXiv arXiv 2026

[44] [44]

Value-aware loss function for model-based reinforcement learning

Amir massoud Farahmand, André Barreto, and Daniel Nikovski. Value-aware loss function for model-based reinforcement learning. InProceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 ofProceedings of Machine Learning Research, pages 1486–1494, 2017. URL https://proceedings.mlr.press/v54/farahmand17a.html

2017

[45] [45]

V-JEPA 2.1: Unlocking dense features in video self-supervised learning,

Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-JEPA 2.1: Unlocking dense features in video self-supervised learning,

[46] [46]

V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

URLhttps://arxiv.org/abs/2603.14482. arXiv:2603.14482

work page internal anchor Pith review Pith/arXiv arXiv

[47] [47]

PBench: A physical AI benchmark for world models, 2025

NVIDIA. PBench: A physical AI benchmark for world models, 2025. URLhttps://research.nvidia. com/labs/cosmos-lab/pbench/

2025

[48] [48]

NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jin...

2025

[49] [49]

Worldsimbench: Towards video generation models as world simulators.arXiv preprint arXiv:2410.18072, 2024

Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, Lei Bai, Wanli Ouyang, and Ruimao Zhang. WorldSimBench: Towards video generation models as world simulators, 2024. URLhttps://arxiv.org/abs/2410.18072. arXiv:2410.18072

work page arXiv 2024

[50] [50]

Worldgym: World model as an environment for policy evaluation.arXiv preprint arXiv:2506.00613, 2025

Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, and Sherry Yang. WorldGym: World model as an environment for policy evaluation, 2025. URLhttps://arxiv.org/abs/ 2506.00613. arXiv:2506.00613

work page arXiv 2025

[51] [51]

Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, Chen Gao, Wei Wu, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong 25 How Should World Models Be Evaluated for Embodied Decision-Making? Tian, Tat-Seng Chua, Wenwu Zhu, and Yong Li. WorldArena: A unified benchmark for eval...

work page arXiv 2026

[52] [52]

World-gymnast: Training robots with reinforcement learning in a world model.arXiv preprint arXiv:2602.02454,

Ansh Kumar Sharma, Yixiang Sun, Ninghao Lu, Yunzhe Zhang, Jiarao Liu, and Sherry Yang. World-Gymnast: Training robots with reinforcement learning in a world model, 2026. URLhttps://arxiv.org/abs/2602. 02454. arXiv:2602.02454

work page arXiv 2026

[53] Wei-Cheng Tseng, Jinwei Gu, Qinsheng Zhang, Hanzi Mao, Ming-Yu Liu, Florian Shkurti, and Lin Yen-Chen. Scalable policy evaluation with video world models, 2025. URL https://arxiv.org/abs/2511.11520. arXiv:2511.11520

[54] Keyon Vafa, Justin Y. Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan. Evaluating the world model implicit in a generative model, 2024. URL https://arxiv.org/abs/2406.03689. arXiv:2406.03689

[55] Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, and Jiangmiao Pang. RoboVIP: Multi-view video generation with visual identity prompting augments robot manipulation, 2026. URL https://arxiv.org/abs/2601.05241. arXiv:2601.05241

[56] Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, and Kui Jia. EVA: Aligning video world models with executable robot actions via inverse dynamics rewards, 2026. URL https://arxiv.org/abs/2603.17808. arXiv:2603.17808

[57] Yixuan Wang, Rhythm Syed, Fangyu Wu, Mengchao Zhang, Aykut Onol, Jose Barreiros, Hooshang Nayyeri, Tony Dear, Huan Zhang, and Yunzhu Li. Interactive world simulator for robot policy training and evaluation. URL https://arxiv.org/abs/2603.08546. arXiv:2603.08546

[59] Archana Warrier, Dat Nguyen, Michelangelo Naim, Moksh Jain, Yichao Liang, Karen Schroeder, Cambridge Yang, Joshua B. Tenenbaum, Sebastian Vollmer, Kevin Ellis, and Zenna Tavares. Benchmarking world-model learning with environment-level queries, 2025. URL https://arxiv.org/abs/2510.19788. arXiv:2510.19788

[60] Ran Wei, Nathan Lambert, Anthony McDonald, Alfredo Garcia, and Roberto Calandra. A unified view on solving objective mismatch in model-based reinforcement learning. Transactions on Machine Learning Research. URL https://openreview.net/forum?id=tQVZgvXhZb. arXiv:2310.06253

[62] Philipp Wu, Alejandro Escontrela, Danijar Hafner, Ken Goldberg, and Pieter Abbeel. DayDreamer: World models for physical robot learning. In Conference on Robot Learning, 2023. URL https://arxiv.org/abs/2206.14176

[63] Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-Env: Leveraging world model as a virtual environment for VLA post-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. URL https://arxiv.org/abs/2509.24948

[64] Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, and Ziwei Liu. Kinema4D: Kinematic 4d world modeling for spatiotemporal embodied simulation, 2026. URL https://arxiv.org/abs/2603.16669. arXiv:2603.16669

[65] Jiazhi Yang, Kunyang Lin, Jinwei Li, Wencong Zhang, Tianwei Lin, Longyan Wu, Zhizhong Su, Hao Zhao, Ya-Qin Zhang, Li Chen, Ping Luo, Xiangyu Yue, and Hongyang Li. RISE: Self-improving robot policy with compositional world model, 2026. URL https://arxiv.org/abs/2602.11075. arXiv:2602.11075

[66] Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Soumajit Majumder, Ziyuan Liu, Gitta Kutyniok, and Abhinav Valada. RoboEnvision: A long-horizon video generation model for multi-task robot manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2025. URL https://arxiv.org/abs/2506.22007

[67] Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.06114

[68] Zhaoyang Yang, Yurun Jin, Lizhe Qi, Cong Huang, and Kai Chen. EA-WM: Event-aware generative world model with structured kinematic-to-visual action fields, 2026. URL https://arxiv.org/abs/2605.06192. arXiv:2605.06192

[69] Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M. Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, and Anirudha Majumdar. PlayWorld: Learning robot world models from autonomous play, 2026. URL https://arxiv.org/abs/2603.09030. arXiv:2603.09030

[70] Hu Yue, Siyuan Huang, Yue Liao, Shengcong Chen, Pengfei Zhou, Liliang Chen, Maoqing Yao, and Guanghui Ren. EWMBench: Evaluating scene, motion, and semantic quality in embodied world models, 2025. URL https://arxiv.org/abs/2505.09694. arXiv:2505.09694

[71] Jiahui Zhang, Ze Huang, Chun Gu, Zipei Ma, and Li Zhang. ProphRL: Reinforcing action policies by prophesying, 2025. URL https://arxiv.org/abs/2511.20633. arXiv:2511.20633

[72] Zhilong Zhang, Ruifeng Chen, Junyin Ye, Yihao Sun, Pengyuan Wang, Jingcheng Pang, Kaiyuan Li, Tianshuo Liu, Haoxin Lin, Yang Yu, and Zhi-Hua Zhou. WHALE: Towards generalizable and scalable world models for embodied decision-making, 2024. URL https://arxiv.org/abs/2411.05619. arXiv:2411.05619

[73] Zhilong Zhang, Haoxiang Ren, Yihao Sun, Yifei Sheng, Haonan Wang, Haoxin Lin, Zhichao Wu, Pierre-Luc Bacon, and Yang Yu. Towards practical world model-based reinforcement learning for vision-language-action models, 2026. URL https://openreview.net/forum?id=gB1yFEd106. ICLR 2026 World Models Workshop

[74] Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. RoboDreamer: Learning compositional world models for robot imagination. In International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2404.12377

[75] Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. IRASim: A fine-grained world model for robot manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9834–9844, 2025. URL https://openaccess.thecvf.com/content/ICCV2025/html/Zhu_IRASim_A_Fine-Grained_World_Model_for_Robot_Manipulation_ICCV_202...

[76] Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, and Song Guo. WMPO: World model-based policy optimization for vision-language-action models. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=qE2FyvRvuF.