How Should World Models Be Evaluated for Embodied Decision-Making? A Decision-Making-Centric Position
Pith reviewed 2026-06-30 10:28 UTC · model grok-4.3
The pith
World models for embodied decision-making must be judged by their support for interventional reasoning and policy optimization, not video realism.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
For models presented as world models for embodied decision-making, the more decisive issue is not whether the model generates visually convincing videos, but whether it supports reliable interventional reasoning, policy evaluation, planning, and policy optimization under intervention, policy-induced distribution shift, and long-horizon rollout. The survey organizes evidence using an L0-L7 ladder spanning visual plausibility to policy optimization utility, foregrounding interventional action fidelity, closed-loop rollout validity, reward/value prediction, policy-ranking agreement, optimization lift, model exploitability, and uncertainty calibration.
What carries the argument
The L0-L7 ladder, an evidential hierarchy of evaluation criteria that cuts across visual, predictive, and interventional axes rather than producing a single scalar score.
If this is right
- Interventional action fidelity and closed-loop rollout validity become required tests for any decision-making claim.
- Reward and value prediction accuracy must be measured alongside perceptual similarity.
- Policy-ranking agreement and measured optimization lift become direct evidence of utility.
- Model exploitability and uncertainty calibration under shift indicate whether the model can be trusted for planning.
- A minimal feasible reporting set applies even in real-robot experiments.
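Two of the quantities named in the list above, policy-ranking agreement and optimization lift, can be made concrete with a short sketch. This is an illustration only, not the paper's procedure: the choice of Kendall's tau as the agreement measure, the function names, and every number below are assumptions introduced here.

```python
from itertools import combinations


def kendall_tau(predicted, actual):
    """Kendall tau-a between two equal-length score lists.

    Tied pairs count as neither concordant nor discordant.
    """
    if len(predicted) != len(actual) or len(predicted) < 2:
        raise ValueError("need two equal-length lists with at least two policies")
    concordant = discordant = 0
    for i, j in combinations(range(len(predicted)), 2):
        sign = (predicted[i] - predicted[j]) * (actual[i] - actual[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(predicted) * (len(predicted) - 1) // 2
    return (concordant - discordant) / n_pairs


def optimization_lift(baseline_return, optimized_return):
    """Real-environment return gained by a policy optimized inside the model."""
    return optimized_return - baseline_return


# Hypothetical returns for four candidate policies (illustration only).
model_returns = [0.62, 0.48, 0.71, 0.30]  # predicted by world-model rollouts
real_returns = [0.50, 0.55, 0.66, 0.21]   # measured on the real system

tau = kendall_tau(model_returns, real_returns)
lift = optimization_lift(baseline_return=0.50, optimized_return=0.66)
print(f"policy-ranking agreement (Kendall tau): {tau:.2f}")
print(f"optimization lift: {lift:+.2f}")
```

A rank statistic is used here because policy selection depends on the ordering of candidate policies, not on the absolute accuracy of predicted returns. Other agreement measures would fit the same role.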
Where Pith is reading between the lines
- The ladder could be used to re-score existing published world models and expose claim-evidence gaps across the literature.
- Simulation-to-real transfer studies might test whether ladder level predicts transfer success more reliably than video quality.
- Extending the hierarchy to multi-step reward shaping or safety constraints would address additional embodied requirements.
- If the ordering holds, benchmark suites could drop low-level metrics once higher-level evidence is present.
Load-bearing premise
That the L0-L7 levels form a reliable hierarchy in which higher levels are strictly more decisive for embodied decision-making utility than lower levels, without needing separate validation of the ordering.
What would settle it
A controlled comparison in which models that pass high ladder levels but fail low ones still produce superior real-robot policy performance, or in which visual metrics alone correlate more strongly with policy success than the full ladder, would falsify the claimed hierarchy.
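The second condition above, visual metrics correlating more strongly with policy success than the ladder does, could be checked along the following lines. This is a sketch under stated assumptions: summarizing each model by the highest ladder level it has evidence for is a simplification introduced here (the paper describes the ladder as not reducible to a single scalar), Spearman correlation is one possible choice of statistic, and all scores are hypothetical.

```python
def ranks(values):
    """Average 1-based ranks, so tied values share the same rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            result[order[k]] = avg
        pos = end + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        raise ValueError("correlation undefined for constant input")
    return cov / (vx * vy) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


def hierarchy_is_challenged(visual_scores, ladder_levels, policy_success):
    """True if the visual metric alone tracks policy success more closely
    than the (assumed scalar) ladder level does."""
    return spearman(visual_scores, policy_success) > spearman(ladder_levels, policy_success)


# Hypothetical scores for five world models (illustration only).
visual_scores = [0.9, 0.8, 0.7, 0.6, 0.5]   # e.g. a perceptual-quality score
ladder_levels = [2, 3, 5, 6, 4]             # highest level with evidence (assumed summary)
policy_success = [0.2, 0.3, 0.6, 0.7, 0.5]  # real-robot success rate

print(hierarchy_is_challenged(visual_scores, ladder_levels, policy_success))
```

With only a handful of models the comparison is weak evidence either way. A real study would need enough models, and some estimate of uncertainty on both correlations, before drawing a conclusion.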
Original abstract
World models have become a central abstraction in modern AI. The term now refers to several different objects: action-conditioned environment models, latent imagination models, future-video predictors, interactive neural simulators, latent predictive representations, and synthetic-data engines. Evaluation has broadened along with the term. Recent papers measure video realism, perceptual similarity, instruction following, physical plausibility, policy ranking, executability, planning success, and downstream policy improvement. This produces both metric diversity and a recurring problem of claim/evidence mismatch: papers sometimes make a stronger claim about what their model is useful for than their evaluation can establish. This paper surveys the recent literature and argues that, for models presented as world models for embodied decision-making, the more decisive issue is not whether the model generates visually convincing videos, but whether it supports reliable interventional reasoning, policy evaluation, planning, and policy optimization under intervention, policy-induced distribution shift, and long-horizon rollout. We organize the survey using an L0-L7 ladder spanning visual plausibility to policy optimization utility, noting that the levels cut across several orthogonal axes and so form an evidential hierarchy rather than a single scalar. The framework foregrounds interventional action fidelity, closed-loop rollout validity, reward/value prediction, policy-ranking agreement, optimization lift, model exploitability, and uncertainty calibration, with a minimal feasible reporting set for real-robot settings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys recent literature on world model evaluation in embodied AI and identifies recurring claim/evidence mismatches. It argues that for models positioned as world models for embodied decision-making, evaluation should prioritize support for interventional reasoning, policy evaluation, planning, and policy optimization under intervention, distribution shift, and long-horizon rollout over lower-level metrics such as visual plausibility. The central contribution is an L0-L7 ladder that organizes metrics across visual, planning, and optimization axes into an evidential hierarchy rather than a single scalar, with a proposed minimal reporting set for real-robot settings.
Significance. If the proposed hierarchy is valid, the paper would provide a valuable organizing framework that helps the field move beyond video realism metrics toward decision-making utility. The survey of orthogonal evaluation axes (visual fidelity, closed-loop validity, reward prediction, policy ranking, optimization lift) and explicit foregrounding of interventional action fidelity and uncertainty calibration are constructive contributions that could reduce mismatches between model claims and supporting evidence.
Major comments (2)
- [Abstract] The assertion that the L0-L7 levels 'form an evidential hierarchy rather than a single scalar' because they 'cut across several orthogonal axes' is presented as an organizing principle without formal derivation, counter-example analysis, or empirical mapping. No argument is given showing why, e.g., L5 policy-ranking agreement is strictly more decisive than L4 closed-loop rollout validity for all embodied settings, nor why the ordering is robust to policy-induced distribution shift. This ordering is load-bearing for the central position.
- [Introduction / Survey sections] The paper identifies claim/evidence mismatch in prior work but supplies no quantitative analysis or new experiments demonstrating that the L0-L7 hierarchy resolves the mismatch; the argument rests entirely on logical re-organization of existing practices. This leaves the claim that the ladder 'can be applied without additional empirical validation of its ordering' unsupported.
Minor comments (2)
- [Ladder description] The definitions of each level (L0 through L7) would benefit from explicit cross-references to the specific metrics or papers that instantiate them, to improve traceability.
- [Real-robot reporting section] Notation for the 'minimal feasible reporting set' is introduced without a compact table or checklist format, which reduces immediate usability for practitioners.
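A checklist of the kind the second minor comment asks for might look like the sketch below. The paper's actual minimal reporting set is not reproduced on this page, so the field list is an assumption: it simply reuses the seven criteria the abstract says the framework foregrounds. The class name, field names, and example values are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class ReportingChecklist:
    """Hypothetical record with one slot per criterion named in the abstract.

    A value of None means the quantity was not reported.
    """
    interventional_action_fidelity: Optional[float] = None
    closed_loop_rollout_validity: Optional[float] = None
    reward_value_prediction: Optional[float] = None
    policy_ranking_agreement: Optional[float] = None
    optimization_lift: Optional[float] = None
    model_exploitability: Optional[float] = None
    uncertainty_calibration: Optional[float] = None

    def missing(self):
        """Names of criteria that have no reported value."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Illustrative values only; they do not come from any paper.
report = ReportingChecklist(policy_ranking_agreement=0.67, optimization_lift=0.16)
print("not reported:", ", ".join(report.missing()))
```

Listing the unreported items explicitly keeps the gap between a paper's claim and its evidence visible, which is the usability point the referee raises.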
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the justification needed for the proposed L0-L7 framework. We respond to each major comment below and outline planned revisions to address the concerns about formal grounding and empirical support.
Point-by-point responses
- Referee: [Abstract] The assertion that the L0-L7 levels 'form an evidential hierarchy rather than a single scalar' because they 'cut across several orthogonal axes' is presented as an organizing principle without formal derivation, counter-example analysis, or empirical mapping. No argument is given showing why, e.g., L5 policy-ranking agreement is strictly more decisive than L4 closed-loop rollout validity for all embodied settings, nor why the ordering is robust to policy-induced distribution shift. This ordering is load-bearing for the central position.
  Authors: The hierarchy is derived from the increasing strength of evidence required to support interventional and optimization claims in embodied decision-making, following principles from causal inference (interventional evidence supersedes observational) and RL evaluation (policy optimization utility is the ultimate test). The levels cut across visual, planning, and optimization axes, so they are not collapsed into a scalar. We do not claim the ordering is strict or invariant across all settings. We will add a dedicated subsection providing the rationale for each transition, discussing counterexamples (e.g., tasks where closed-loop validity alone is decisive), and addressing robustness to policy-induced shifts. This will make the load-bearing assumptions explicit.
  Revision: yes
- Referee: [Introduction / Survey sections] The paper identifies claim/evidence mismatch in prior work but supplies no quantitative analysis or new experiments demonstrating that the L0-L7 hierarchy resolves the mismatch; the argument rests entirely on logical re-organization of existing practices. This leaves the claim that the ladder 'can be applied without additional empirical validation of its ordering' unsupported.
  Authors: The manuscript is a survey and position paper whose contribution is the identification of mismatches via literature review and the proposal of a structured reporting framework. No new experiments are presented because the work reorganizes existing practices rather than validating a metric. We agree the statement about applying the ladder without empirical validation of its ordering is unsupported and will remove or qualify it. The revised text will present the ladder as a conceptual guideline that encourages higher-evidence evaluations, while noting that empirical testing of the ordering is a direction for future work.
  Revision: yes
Circularity Check
No circularity: the L0-L7 framework is constructed from an external literature survey, without self-referential reduction.
Full rationale
The paper is a position/survey piece that organizes existing evaluation practices into an L0-L7 ladder asserted to form an evidential hierarchy. No equations, fitted parameters, or predictions appear in the provided text. The hierarchy is introduced by surveying external literature rather than by any internal derivation, self-citation chain, or definitional loop that would make a claimed result equivalent to its inputs by construction. The central claim therefore remains independent of any self-referential step.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: World models for embodied decision-making are primarily intended to support interventional reasoning, planning, and policy optimization.
Reference graph
Works this paper leans on
-
[1]
V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning
Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muck- ley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xi...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[2]
Dream to manipulate: Compositional world models empowering robot imitation learning with imagina- tion
Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, and Efstratios Gavves. Dream to manipulate: Compositional world models empowering robot imitation learning with imagina- tion. InInternational Conference on Learning Representations, 2025. URLhttps://arxiv.org/abs/2412. 14957
2025
-
[3]
Persistent robot world models: Stabiliz- ing multi-step rollouts via reinforcement learning, 2026
Jai Bardhan, Patrik Drozdik, Josef Sivic, and Vladimir Petrik. Persistent robot world models: Stabiliz- ing multi-step rollouts via reinforcement learning, 2026. URLhttps://arxiv.org/abs/2603.25685. arXiv:2603.25685
-
[4]
Kao, Adina Williams, Michael Rabbat, and Em- manuel Dupoux
Florian Bordes, Quentin Garrido, Justine T. Kao, Adina Williams, Michael Rabbat, and Emmanuel Dupoux. IntPhys 2: Benchmarking intuitive physics understanding in complex synthetic environments, 2025. URL https://arxiv.org/abs/2506.09849. arXiv:2506.09849
-
[5]
Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard, and Abhinav Valada
Akshay L. Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard, and Abhinav Valada. DiW A: Diffusion policy adaptation with world models, 2025. URLhttps://arxiv.org/abs/2508. 03645. arXiv:2508.03645
-
[6]
WorldPrediction: A benchmark for high- level world modeling and long-horizon procedural planning
Delong Chen, Willy Chung, Yejin Bang, Ziwei Ji, and Pascale Fung. WorldPrediction: A benchmark for high- level world modeling and long-horizon procedural planning. InICML World Models Workshop, 2025. URL https://openreview.net/forum?id=3GuGN0bacr. 22 How Should World Models Be Evaluated for Embodied Decision-Making?
2025
-
[7]
Policy-conditioned en- vironment models are more generalizable
Ruifeng Chen, Xiong-Hui Chen, Yihao Sun, Siyuan Xiao, Minhui Li, and Yang Yu. Policy-conditioned en- vironment models are more generalizable. InInternational Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=g9mYBdooPA
2024
-
[8]
Adversarial counterfactual environment model learning
Xiong-Hui Chen, Yang Yu, Zheng-Mao Zhu, Zhihua Yu, Zhenjun Chen, Chenghe Wang, Yinan Wu, Hongqiu Wu, Rong-Jun Qin, Ruijin Ding, and Fangsheng Huang. Adversarial counterfactual environment model learning. InAdvances in Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id= rHAX0LRwk8
2023
-
[9]
Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, and Mu Xu. ABot-PhysWorld: Interactive world foundation model for robotic manipulation with physics alignment, 2026. URLhttps://arxiv.org/ abs/2603.23376. arXiv:2603.23376
-
[10]
arXiv preprint arXiv:2410.15461 (2024) 16 K
Xiaowei Chi, Hengyuan Zhang, Chun-Kai Fan, Xingqun Qi, Rongyu Zhang, Anthony Chen, Chi min Chan, Wei Xue, Wenhan Luo, Shanghang Zhang, and Yike Guo. EV A: An embodied world model for future video anticipation, 2024. URLhttps://arxiv.org/abs/2410.15461. arXiv:2410.15461
-
[11]
Rethinking video generation model for the embodied world, 2026
Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and Daquan Zhou. Rethinking video generation model for the embodied world, 2026. URLhttps://arxiv.org/abs/ 2601.15282. arXiv:2601.15282
-
[12]
arXiv preprint arXiv:2310.10625 (2023)
Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning. InInternational Conference on Learning Representations, 2024. URLhttps://arxiv.org/abs/ 2310.10625
-
[13]
WorldScore: A unified evaluation benchmark for world generation
Haoyi Duan, Hong-Xing Yu, Sirui Chen, Fei-Fei Li, and Jiajun Wu. WorldScore: A unified evaluation benchmark for world generation. InInternational Conference on Computer Vision, 2025. URLhttps://arxiv.org/abs/ 2504.00983
-
[14]
Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, Weishi Mi, Qingpo Wuwu, Peidong Jia, Yulin Luo, Kevin Zhang, Zhiyuan Qin, Yong Dai, Sirui Han, Yike Guo, Shanghang Zhang, and Jian Tang. Wow, wo, val! a comprehensive embodied world model evaluation turing test, 2026. URLhttps://arxiv.o...
- [15]
-
[16]
Learning video generation for robotic manipulation with collaborative trajectory control
Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, and Dahua Lin. Learning video generation for robotic manipulation with collaborative trajectory control. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=OeDwYtp8n1
2026
-
[17]
Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jan- naty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abbe...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[18]
Evaluating gemini robotics policies in a veo world simulator, 2025
Gemini Robotics Team, Krzysztof Choromanski, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Ab- hishek Jindal, Thomas Kipf, Sean Kirmani, Isabel Leal, Fangchen Liu, Anirudha Majumdar, Andrew Mar- mon, Carolina Parada, Yulia Rubanova, Dhruv Shah, Vikas Sindhwani, Jie Tan, Fei Xia, Ted Xiao, Sherry Yang, Wenhao Yu, and Allan Zhou. Evaluating gemini r...
-
[19]
GigaBrain-0.5M*: A VLA that learns from world model-based reinforcement learning, 2026
GigaBrain Team, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, Lv Feng, Mingming Yu, Peng Li, Qiuping Deng, Tianze Liu, Xinyu Zhou, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yifei Nie, Yilong Li, Yukun Zhou, Yun Ye, Zhichao Liu, and Zheng Zhu. GigaBrain-0.5M*: A VLA that learns from world model-based reinfo...
-
[20]
GigaWorld-0: World models as data engine to empower embodied AI, 2025
GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, and Zheng Zhu. GigaWorld-0: World models as data engine to em...
-
[21]
The value equivalence principle for model-based reinforcement learning
Christopher Grimm, André Barreto, Satinder Singh, and David Silver. The value equivalence principle for model-based reinforcement learning. InAdvances in Neural Information Process- ing Systems, volume 33, 2020. URLhttps://proceedings.neurips.cc/paper/2020/hash/ 3bb585ea00014b0e3ebe4c6dd165a358-Abstract.html. arXiv:2011.03506
-
[22]
Proper value equiva- lence
Christopher Grimm, André Barreto, Gregory Farquhar, David Silver, and Satinder Singh. Proper value equiva- lence. InAdvances in Neural Information Processing Systems, volume 34, 2021. URLhttps://proceedings. neurips.cc/paper/2021/hash/00ac8ed3b4327bdd4ebbebcb2ba10a00-Abstract.html
2021
-
[23]
VLAW: Iterative co-improvement of vision-language-action policy and world model, 2026
Yanjiang Guo, Tony Lee, Lucy Xiaoyang Shi, Jianyu Chen, Percy Liang, and Chelsea Finn. VLAW: Iterative co-improvement of vision-language-action policy and world model, 2026. URLhttps://arxiv.org/abs/ 2602.12063. arXiv:2602.12063
-
[24]
Ctrl-World: A Controllable Generative World Model for Robot Manipulation
Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-World: A controllable generative world model for robot manipulation. InInternational Conference on Learning Representations, 2026. URL https://arxiv.org/abs/2510.10125
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[25]
David Ha and Jürgen Schmidhuber. World models, 2018. URLhttps://arxiv.org/abs/1803.10122. arXiv:1803.10122
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[26]
Mastering Diverse Domains through World Models
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models, 2023. URLhttps://arxiv.org/abs/2301.04104. arXiv:2301.04104
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Vid2World: Crafting video diffusion models to interactive world models
Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting video diffusion models to interactive world models. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=pFyzqbUiF9
2026
-
[28]
DreamGen: Unlocking Generalization in Robot Learning through Video World Models
Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zh...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[29]
When to trust your model: model-based policy optimization
Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. InAdvances in Neural Information Processing Systems, 2019. URLhttps://arxiv.org/abs/ 1906.08253
-
[30]
Model gradi- ent: unified model and policy learning in model-based reinforcement learning.Frontiers of Computer Science, 18:184339, 2024
Chengxing Jia, Fuxiang Zhang, Tian Xu, Jing-Cheng Pang, Zongzhang Zhang, and Yang Yu. Model gradi- ent: unified model and policy learning in model-based reinforcement learning.Frontiers of Computer Science, 18:184339, 2024
2024
-
[31]
RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
Feng Jiang, Yang Chen, Kyle Xu, Yuchen Liu, Haifeng Wang, Zhenhao Shen, Jasper Lu, Shengze Huang, Yuanfei Wang, Chen Xie, and Ruihai Wu. RoboWM-Bench: A benchmark for evaluating world models in robotic manipulation, 2026. URLhttps://arxiv.org/abs/2604.19092. arXiv:2604.19092
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[32]
Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dong- bin Zhao. World4RL: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation, 2025. URLhttps://arxiv.org/abs/2509.19080. arXiv:2509.19080
-
[33]
WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL
Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, Yu Wang, Haoran Li, Chao Yu, and Dongbin Zhao. WoVR: World models as reliable simulators for post-training VLA policies with RL, 2026. URLhttps://arxiv.org/abs/2602. 13977. arXiv:2602.13977
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[34]
Benno Krojer, Mojtaba Komeili, Candace Ross, Quentin Garrido, Koustuv Sinha, Nicolas Ballas, and Mahmoud Assran. A shortcut-aware video-QA benchmark for physical understanding via minimal video pairs.Transactions on Machine Learning Research, 2025. URLhttps://arxiv.org/abs/2506.09987
-
[35]
Objective mismatch in model-based rein- forcement learning
Nathan Lambert, Brandon Amos, Omry Yadan, and Roberto Calandra. Objective mismatch in model-based rein- forcement learning. InProceedings of the 2nd Conference on Learning for Dynamics and Control (L4DC), vol- ume 120 ofProceedings of Machine Learning Research, pages 761–770, 2020. URLhttps://proceedings. mlr.press/v120/lambert20a.html. arXiv:2002.04523
-
[36]
Gonzalez, Ion Stoica, Song Han, and Yao Lu
Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E. Gonzalez, Ion Stoica, Song Han, and Yao Lu. WorldModelBench: Judging video gener- ation models as world models, 2025. URLhttps://arxiv.org/abs/2502.20694. arXiv:2502.20694. 24 How Should World Models Be Evaluated for Embodied Decision-Making?
-
[37]
Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, and Weihua Su. VLA-RFT: Vision-language-action reinforcement fine-tuning with verified rewards in world simulators, 2025. URLhttps://arxiv.org/abs/2510.00406. arXiv:2510.00406
-
[38]
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
Yaxuan Li, Zhongyi Zhou, Yefei Chen, Yaokai Xue, and Yichen Zhu. dWorldEval: Scalable robotic pol- icy evaluation via discrete diffusion world model, 2026. URLhttps://arxiv.org/abs/2604.22152. arXiv:2604.22152
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[39]
Dreamitate: Real-world visuomotor policy learning via video generation
Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl V ondrick. Dreamitate: Real-world visuomotor policy learning via video generation. InConference on Robot Learning, 2024. URLhttps://arxiv.org/abs/2406.16862
-
[40]
Genie envisioner: A unified world foundation platform for robotic manipulation
Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, and Guanghui Ren. Genie envisioner: A unified world foundation platform for robotic manipulation. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=fHLtSxDFKC
2026
-
[41]
ADM-v2: Pursuing full-horizon roll-out in dynamics models for offline policy learning and evaluation
Haoxin Lin, Siyuan Xiao, Yi-Chen Li, Zhilong Zhang, Yihao Sun, Chengxing Jia, and Yang Yu. ADM-v2: Pursuing full-horizon roll-out in dynamics models for offline policy learning and evaluation. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=ICbXEwqpga
2026
-
[42]
World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy
Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, and Mike Zheng Shou. World-VLA-Loop: Closed- loop learning of video world model and VLA policy, 2026. URLhttps://arxiv.org/abs/2602.06508. arXiv:2602.06508
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[43]
LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels
Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels, 2026. URLhttps://arxiv.org/abs/2603. 19312. arXiv:2603.19312
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[44]
Value-aware loss function for model-based reinforcement learning
Amir massoud Farahmand, André Barreto, and Daniel Nikovski. Value-aware loss function for model-based reinforcement learning. InProceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 ofProceedings of Machine Learning Research, pages 1486–1494, 2017. URL https://proceedings.mlr.press/v54/farahmand17a.html
2017
-
[45]
V-JEPA 2.1: Unlocking dense features in video self-supervised learning,
Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-JEPA 2.1: Unlocking dense features in video self-supervised learning,
-
[46]
V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning
URLhttps://arxiv.org/abs/2603.14482. arXiv:2603.14482
work page internal anchor Pith review Pith/arXiv arXiv
-
[47]
PBench: A physical AI benchmark for world models, 2025
NVIDIA. PBench: A physical AI benchmark for world models, 2025. URLhttps://research.nvidia. com/labs/cosmos-lab/pbench/
2025
-
[48]
NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jin...
2025
-
[49]
Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, Lei Bai, Wanli Ouyang, and Ruimao Zhang. WorldSimBench: Towards video generation models as world simulators, 2024. URLhttps://arxiv.org/abs/2410.18072. arXiv:2410.18072
-
[50]
Worldgym: World model as an environment for policy evaluation.arXiv preprint arXiv:2506.00613, 2025
Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, and Sherry Yang. WorldGym: World model as an environment for policy evaluation, 2025. URLhttps://arxiv.org/abs/ 2506.00613. arXiv:2506.00613
-
[51]
Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, Chen Gao, Wei Wu, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong 25 How Should World Models Be Evaluated for Embodied Decision-Making? Tian, Tat-Seng Chua, Wenwu Zhu, and Yong Li. WorldArena: A unified benchmark for eval...
-
[52]
Ansh Kumar Sharma, Yixiang Sun, Ninghao Lu, Yunzhe Zhang, Jiarao Liu, and Sherry Yang. World-Gymnast: Training robots with reinforcement learning in a world model, 2026. URLhttps://arxiv.org/abs/2602. 02454. arXiv:2602.02454
-
[53]
Scalable policy evaluation with video world models, 2025
Wei-Cheng Tseng, Jinwei Gu, Qinsheng Zhang, Hanzi Mao, Ming-Yu Liu, Florian Shkurti, and Lin Yen-Chen. Scalable policy evaluation with video world models, 2025. URLhttps://arxiv.org/abs/2511.11520. arXiv:2511.11520
-
[54]
Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan
Keyon Vafa, Justin Y . Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan. Evaluating the world model implicit in a generative model, 2024. URLhttps://arxiv.org/abs/2406.03689. arXiv:2406.03689
-
[55]
Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, and Jiangmiao Pang. RoboVIP: Multi-view video generation with visual identity prompting augments robot manipulation, 2026. URLhttps://arxiv.org/abs/2601.05241. arXiv:2601.05241
-
[56]
EV A: Aligning video world models with executable robot actions via inverse dynamics rewards, 2026
Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, and Kui Jia. EV A: Aligning video world models with executable robot actions via inverse dynamics rewards, 2026. URLhttps://arxiv.org/abs/ 2603.17808. arXiv:2603.17808
-
[57]
Interactive world simulator for robot policy training and evaluation,
Yixuan Wang, Rhythm Syed, Fangyu Wu, Mengchao Zhang, Aykut Onol, Jose Barreiros, Hooshang Nayyeri, Tony Dear, Huan Zhang, and Yunzhu Li. Interactive world simulator for robot policy training and evaluation,
- [58]
-
[59]
Benchmarking World-Model Learning with Environment-Level Queries
Archana Warrier, Dat Nguyen, Michelangelo Naim, Moksh Jain, Yichao Liang, Karen Schroeder, Cam- bridge Yang, Joshua B. Tenenbaum, Sebastian V ollmer, Kevin Ellis, and Zenna Tavares. Benchmarking world-model learning with environment-level queries, 2025. URLhttps://arxiv.org/abs/2510.19788. arXiv:2510.19788
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[60]
A unified view on solv- ing objective mismatch in model-based reinforcement learning.Transactions on Machine Learning Research,
Ran Wei, Nathan Lambert, Anthony McDonald, Alfredo Garcia, and Roberto Calandra. A unified view on solv- ing objective mismatch in model-based reinforcement learning.Transactions on Machine Learning Research,
- [61]
-
[62]
DayDreamer: World models for physical robot learning
Philipp Wu, Alejandro Escontrela, Danijar Hafner, Ken Goldberg, and Pieter Abbeel. DayDreamer: World models for physical robot learning. InConference on Robot Learning, 2023. URLhttps://arxiv.org/abs/ 2206.14176
- [63] Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-Env: Leveraging world model as a virtual environment for VLA post-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. URL https://arxiv.org/abs/2509.24948
- [64] Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, and Ziwei Liu. Kinema4D: Kinematic 4D world modeling for spatiotemporal embodied simulation, 2026. URL https://arxiv.org/abs/2603.16669. arXiv:2603.16669
- [65] Jiazhi Yang, Kunyang Lin, Jinwei Li, Wencong Zhang, Tianwei Lin, Longyan Wu, Zhizhong Su, Hao Zhao, Ya-Qin Zhang, Li Chen, Ping Luo, Xiangyu Yue, and Hongyang Li. RISE: Self-improving robot policy with compositional world model, 2026. URL https://arxiv.org/abs/2602.11075. arXiv:2602.11075
- [66] Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Soumajit Majumder, Ziyuan Liu, Gitta Kutyniok, and Abhinav Valada. RoboEnvision: A long-horizon video generation model for multi-task robot manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2025. URL https://arxiv.org/abs/2506.22007
- [67] Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.06114
- [68] Zhaoyang Yang, Yurun Jin, Lizhe Qi, Cong Huang, and Kai Chen. EA-WM: Event-aware generative world model with structured kinematic-to-visual action fields, 2026. URL https://arxiv.org/abs/2605.06192. arXiv:2605.06192
- [69] Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M. Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, and Anirudha Majumdar. PlayWorld: Learning robot world models from autonomous play, 2026. URL https://arxiv.org/abs/2603.09030. arXiv:2603.09030
- [70] Hu Yue, Siyuan Huang, Yue Liao, Shengcong Chen, Pengfei Zhou, Liliang Chen, Maoqing Yao, and Guanghui Ren. EWMBench: Evaluating scene, motion, and semantic quality in embodied world models, 2025. URL https://arxiv.org/abs/2505.09694. arXiv:2505.09694
- [71] Jiahui Zhang, Ze Huang, Chun Gu, Zipei Ma, and Li Zhang. ProphRL: Reinforcing action policies by prophesying, 2025. URL https://arxiv.org/abs/2511.20633. arXiv:2511.20633
- [72] Zhilong Zhang, Ruifeng Chen, Junyin Ye, Yihao Sun, Pengyuan Wang, Jingcheng Pang, Kaiyuan Li, Tianshuo Liu, Haoxin Lin, Yang Yu, and Zhi-Hua Zhou. WHALE: Towards generalizable and scalable world models for embodied decision-making, 2024. URL https://arxiv.org/abs/2411.05619. arXiv:2411.05619
- [73] Zhilong Zhang, Haoxiang Ren, Yihao Sun, Yifei Sheng, Haonan Wang, Haoxin Lin, Zhichao Wu, Pierre-Luc Bacon, and Yang Yu. Towards practical world model-based reinforcement learning for vision-language-action models, 2026. URL https://openreview.net/forum?id=gB1yFEd106. ICLR 2026 World Models Workshop
- [74] Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. RoboDreamer: Learning compositional world models for robot imagination. In International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2404.12377
- [75] Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. IRASim: A fine-grained world model for robot manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9834–9844, 2025. URL https://openaccess.thecvf.com/content/ICCV2025/html/Zhu_IRASim_A_Fine-Grained_World_Model_for_Robot_Manipulation_ICCV_202...
- [76] Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, and Song Guo. WMPO: World model-based policy optimization for vision-language-action models. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=qE2FyvRvuF