arxiv: 2606.15032 · v2 · pith:4CA65QLCnew · submitted 2026-06-13 · 💻 cs.LG

How Should World Models Be Evaluated for Embodied Decision-Making? A Decision-Making-Centric Position

Pith reviewed 2026-06-30 10:28 UTC · model grok-4.3

classification 💻 cs.LG
keywords world models, embodied decision-making, evaluation ladder, interventional reasoning, policy optimization, video prediction, closed-loop rollout, distribution shift

The pith

World models for embodied decision-making must be judged by their support for interventional reasoning and policy optimization, not video realism.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys recent work on world models and identifies a pattern where evaluation metrics often fail to match the decision-making claims made about the models. It argues that for embodied settings the decisive tests involve whether the model enables accurate policy evaluation, planning, and optimization under interventions, distribution shifts, and long rollouts. To organize this, the authors introduce an L0-L7 ladder of criteria that forms an evidential hierarchy progressing from visual checks to measurable policy gains. A sympathetic reader would care because this reframing could reduce wasted effort on models that look good but do not help actual control.

Core claim

For models presented as world models for embodied decision-making, the more decisive issue is not whether the model generates visually convincing videos, but whether it supports reliable interventional reasoning, policy evaluation, planning, and policy optimization under intervention, policy-induced distribution shift, and long-horizon rollout. The survey organizes evidence using an L0-L7 ladder spanning visual plausibility to policy optimization utility, foregrounding interventional action fidelity, closed-loop rollout validity, reward/value prediction, policy-ranking agreement, optimization lift, model exploitability, and uncertainty calibration.

What carries the argument

The L0-L7 ladder, an evidential hierarchy of evaluation criteria that cuts across visual, predictive, and interventional axes rather than producing a single scalar score.

If this is right

  • Interventional action fidelity and closed-loop rollout validity become required tests for any decision-making claim.
  • Reward and value prediction accuracy must be measured alongside perceptual similarity.
  • Policy-ranking agreement and measured optimization lift become direct evidence of utility.
  • Model exploitability and uncertainty calibration under shift indicate whether the model can be trusted for planning.
  • A minimal feasible reporting set applies even in real-robot experiments.
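Two of the quantities in the list above, policy-ranking agreement and optimization lift, can be made concrete with a small sketch. It is illustrative only: the function names, the choice of Spearman rank correlation as the agreement measure, and the example returns are assumptions of this note, not definitions taken from the paper.

```python
from typing import List, Sequence


def _ranks(values: Sequence[float]) -> List[float]:
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def policy_ranking_agreement(
    model_returns: Sequence[float], real_returns: Sequence[float]
) -> float:
    """Spearman rank correlation between returns estimated inside the world
    model and returns measured in the real environment, one pair per policy."""
    if len(model_returns) != len(real_returns) or len(model_returns) < 2:
        raise ValueError("need two equally long sequences with at least 2 policies")
    rm, rr = _ranks(model_returns), _ranks(real_returns)
    n = len(rm)
    mean_m, mean_r = sum(rm) / n, sum(rr) / n
    cov = sum((a - mean_m) * (b - mean_r) for a, b in zip(rm, rr))
    var_m = sum((a - mean_m) ** 2 for a in rm)
    var_r = sum((b - mean_r) ** 2 for b in rr)
    if var_m == 0 or var_r == 0:
        raise ValueError("rank correlation is undefined when all returns are tied")
    return cov / (var_m * var_r) ** 0.5


def optimization_lift(real_return_before: float, real_return_after: float) -> float:
    """Change in real-environment return after optimizing a policy with the
    world model (hypothetical definition: after minus before)."""
    return real_return_after - real_return_before


if __name__ == "__main__":
    # Hypothetical returns for four candidate policies.
    model_estimates = [0.2, 0.5, 0.9, 0.4]
    real_measurements = [0.1, 0.6, 0.8, 0.3]
    print(policy_ranking_agreement(model_estimates, real_measurements))
    print(optimization_lift(0.40, 0.55))
```

An agreement near 1 would suggest the model orders policies the way the real environment does; it says nothing by itself about calibration of the absolute return estimates, which is why the list treats the two as separate evidence.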

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The ladder could be used to re-score existing published world models and expose claim-evidence gaps across the literature.
  • Simulation-to-real transfer studies might test whether ladder level predicts transfer success more reliably than video quality.
  • Extending the hierarchy to multi-step reward shaping or safety constraints would address additional embodied requirements.
  • If the ordering holds, benchmark suites could drop low-level metrics once higher-level evidence is present.

Load-bearing premise

That the L0-L7 levels form a reliable hierarchy in which higher levels are strictly more decisive for embodied decision-making utility than lower levels, without needing separate validation of the ordering.

What would settle it

A controlled comparison in which models that pass high ladder levels but fail low ones still produce superior real-robot policy performance, or in which visual metrics alone correlate more strongly with policy success than the full ladder, would falsify the claimed hierarchy.
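The second half of that test, whether visual metrics alone track policy success better than ladder level, could be checked along the following lines. This is a minimal sketch under stated assumptions: the function names, the use of Pearson correlation, the encoding of "ladder level" as the highest level a model passes, and all numbers are hypothetical and do not come from the paper.

```python
from math import sqrt
from typing import Sequence, Tuple


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long sequences with at least 2 entries")
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sd_x * sd_y)


def compare_predictors(
    visual_scores: Sequence[float],
    ladder_levels: Sequence[float],
    real_success: Sequence[float],
) -> Tuple[float, float]:
    """Return (r_visual, r_ladder): how strongly each candidate predictor
    correlates with real-robot policy success across a set of world models."""
    return pearson(visual_scores, real_success), pearson(ladder_levels, real_success)


if __name__ == "__main__":
    # Hypothetical data for four world models.
    visual = [0.9, 0.8, 0.7, 0.6]   # e.g. a perceptual-similarity score
    ladder = [2, 3, 5, 6]           # highest ladder level passed (assumed encoding)
    success = [0.2, 0.3, 0.5, 0.6]  # real-robot success rate of the resulting policy
    r_visual, r_ladder = compare_predictors(visual, ladder, success)
    print(r_visual, r_ladder)
```

A result with r_visual clearly above r_ladder on a well-controlled sample would count against the hierarchy. With only a handful of models the comparison is weak evidence either way, so a real study would need more models and a significance test.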

Original abstract

World models have become a central abstraction in modern AI. The term now refers to several different objects: action-conditioned environment models, latent imagination models, future-video predictors, interactive neural simulators, latent predictive representations, and synthetic-data engines. Evaluation has broadened along with the term. Recent papers measure video realism, perceptual similarity, instruction following, physical plausibility, policy ranking, executability, planning success, and downstream policy improvement. This produces both metric diversity and a recurring problem of claim/evidence mismatch: papers sometimes make a stronger claim about what their model is useful for than their evaluation can establish. This paper surveys the recent literature and argues that, for models presented as world models for embodied decision-making, the more decisive issue is not whether the model generates visually convincing videos, but whether it supports reliable interventional reasoning, policy evaluation, planning, and policy optimization under intervention, policy-induced distribution shift, and long-horizon rollout. We organize the survey using an L0-L7 ladder spanning visual plausibility to policy optimization utility, noting that the levels cut across several orthogonal axes and so form an evidential hierarchy rather than a single scalar. The framework foregrounds interventional action fidelity, closed-loop rollout validity, reward/value prediction, policy-ranking agreement, optimization lift, model exploitability, and uncertainty calibration, with a minimal feasible reporting set for real-robot settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper surveys recent literature on world model evaluation in embodied AI and identifies recurring claim/evidence mismatches. It argues that for models positioned as world models for embodied decision-making, evaluation should prioritize support for interventional reasoning, policy evaluation, planning, and policy optimization under intervention, distribution shift, and long-horizon rollout over lower-level metrics such as visual plausibility. The central contribution is an L0-L7 ladder that organizes metrics across visual, planning, and optimization axes into an evidential hierarchy rather than a single scalar, with a proposed minimal reporting set for real-robot settings.

Significance. If the proposed hierarchy is valid, the paper would provide a valuable organizing framework that helps the field move beyond video realism metrics toward decision-making utility. The survey of orthogonal evaluation axes (visual fidelity, closed-loop validity, reward prediction, policy ranking, optimization lift) and explicit foregrounding of interventional action fidelity and uncertainty calibration are constructive contributions that could reduce mismatches between model claims and supporting evidence.

major comments (2)
  1. [Abstract] The assertion that the L0-L7 levels 'form an evidential hierarchy rather than a single scalar' because they 'cut across several orthogonal axes' is presented as an organizing principle without formal derivation, counter-example analysis, or empirical mapping. No argument is given showing why, e.g., L5 policy-ranking agreement is strictly more decisive than L4 closed-loop rollout validity for all embodied settings, nor why the ordering is robust to policy-induced distribution shift. This ordering is load-bearing for the central position.
  2. [Introduction / Survey sections] The paper identifies claim/evidence mismatch in prior work but supplies no quantitative analysis or new experiments demonstrating that the L0-L7 hierarchy resolves the mismatch; the argument rests entirely on logical re-organization of existing practices. This leaves the claim that the ladder 'can be applied without additional empirical validation of its ordering' unsupported.
minor comments (2)
  1. [Ladder description] The definitions of each level (L0 through L7) would benefit from explicit cross-references to the specific metrics or papers that instantiate them, to improve traceability.
  2. [Real-robot reporting section] Notation for the 'minimal feasible reporting set' is introduced without a compact table or checklist format, which reduces immediate usability for practitioners.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the justification needed for the proposed L0-L7 framework. We respond to each major comment below and outline planned revisions to address the concerns about formal grounding and empirical support.

Point-by-point responses
  1. Referee: [Abstract] The assertion that the L0-L7 levels 'form an evidential hierarchy rather than a single scalar' because they 'cut across several orthogonal axes' is presented as an organizing principle without formal derivation, counter-example analysis, or empirical mapping. No argument is given showing why, e.g., L5 policy-ranking agreement is strictly more decisive than L4 closed-loop rollout validity for all embodied settings, nor why the ordering is robust to policy-induced distribution shift. This ordering is load-bearing for the central position.

    Authors: The hierarchy is derived from the increasing strength of evidence required to support interventional and optimization claims in embodied decision-making, following principles from causal inference (interventional evidence supersedes observational) and RL evaluation (policy optimization utility is the ultimate test). The levels cut across visual, planning, and optimization axes, so they are not collapsed into a scalar. We do not claim the ordering is strict or invariant across all settings. We will add a dedicated subsection providing the rationale for each transition, discussing counterexamples (e.g., tasks where closed-loop validity alone is decisive), and addressing robustness to policy-induced shifts. This will make the load-bearing assumptions explicit.

    Revision: yes

  2. Referee: [Introduction / Survey sections] The paper identifies claim/evidence mismatch in prior work but supplies no quantitative analysis or new experiments demonstrating that the L0-L7 hierarchy resolves the mismatch; the argument rests entirely on logical re-organization of existing practices. This leaves the claim that the ladder 'can be applied without additional empirical validation of its ordering' unsupported.

    Authors: The manuscript is a survey and position paper whose contribution is the identification of mismatches via literature review and the proposal of a structured reporting framework. No new experiments are presented because the work reorganizes existing practices rather than validating a metric. We agree the statement about applying the ladder without empirical validation of its ordering is unsupported and will remove or qualify it. The revised text will present the ladder as a conceptual guideline that encourages higher-evidence evaluations, while noting that empirical testing of the ordering is a direction for future work.

    Revision: yes

Circularity Check

0 steps flagged

No circularity; L0-L7 framework constructed from external literature survey without self-referential reduction

Full rationale

The paper is a position/survey piece that organizes existing evaluation practices into an L0-L7 ladder asserted to form an evidential hierarchy. No equations, fitted parameters, or predictions appear in the provided text. The hierarchy is introduced by surveying external literature rather than by any internal derivation, self-citation chain, or definitional loop that would make a claimed result equivalent to its inputs by construction. The central claim therefore remains independent of any self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that interventional reasoning and policy optimization are the primary intended uses of world models in embodied settings; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: World models for embodied decision-making are primarily intended to support interventional reasoning, planning, and policy optimization.
    This premise defines the scope of the L0-L7 ladder and is stated in the abstract as the context for the evaluation critique.



Reference graph

Works this paper leans on

76 extracted references · 57 canonical work pages · 19 internal anchors

  1. [1]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muck- ley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, Artem Zholus, Sergio Arnaud, Abha Gejji, Ada Martin, Francois Robert Hogan, Daniel Dugas, Piotr Bojanowski, Vasil Khalidov, Patrick Labatut, Francisco Massa, Marc Szafraniec, Kapil Krishnakumar, Yong Li, Xi...

  2. [2]

    Dream to manipulate: Compositional world models empowering robot imitation learning with imagina- tion

    Leonardo Barcellona, Andrii Zadaianchuk, Davide Allegro, Samuele Papa, Stefano Ghidoni, and Efstratios Gavves. Dream to manipulate: Compositional world models empowering robot imitation learning with imagina- tion. InInternational Conference on Learning Representations, 2025. URLhttps://arxiv.org/abs/2412. 14957

  3. [3]

    Persistent robot world models: Stabiliz- ing multi-step rollouts via reinforcement learning, 2026

    Jai Bardhan, Patrik Drozdik, Josef Sivic, and Vladimir Petrik. Persistent robot world models: Stabiliz- ing multi-step rollouts via reinforcement learning, 2026. URLhttps://arxiv.org/abs/2603.25685. arXiv:2603.25685

  4. [4]

    Kao, Adina Williams, Michael Rabbat, and Em- manuel Dupoux

    Florian Bordes, Quentin Garrido, Justine T. Kao, Adina Williams, Michael Rabbat, and Emmanuel Dupoux. IntPhys 2: Benchmarking intuitive physics understanding in complex synthetic environments, 2025. URL https://arxiv.org/abs/2506.09849. arXiv:2506.09849

  5. [5]

    Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard, and Abhinav Valada

    Akshay L. Chandra, Iman Nematollahi, Chenguang Huang, Tim Welschehold, Wolfram Burgard, and Abhinav Valada. DiW A: Diffusion policy adaptation with world models, 2025. URLhttps://arxiv.org/abs/2508. 03645. arXiv:2508.03645

  6. [6]

    WorldPrediction: A benchmark for high- level world modeling and long-horizon procedural planning

    Delong Chen, Willy Chung, Yejin Bang, Ziwei Ji, and Pascale Fung. WorldPrediction: A benchmark for high- level world modeling and long-horizon procedural planning. InICML World Models Workshop, 2025. URL https://openreview.net/forum?id=3GuGN0bacr. 22 How Should World Models Be Evaluated for Embodied Decision-Making?

  7. [7]

    Policy-conditioned en- vironment models are more generalizable

    Ruifeng Chen, Xiong-Hui Chen, Yihao Sun, Siyuan Xiao, Minhui Li, and Yang Yu. Policy-conditioned en- vironment models are more generalizable. InInternational Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=g9mYBdooPA

  8. [8]

    Adversarial counterfactual environment model learning

    Xiong-Hui Chen, Yang Yu, Zheng-Mao Zhu, Zhihua Yu, Zhenjun Chen, Chenghe Wang, Yinan Wu, Hongqiu Wu, Rong-Jun Qin, Ruijin Ding, and Fangsheng Huang. Adversarial counterfactual environment model learning. InAdvances in Neural Information Processing Systems, 2023. URLhttps://openreview.net/forum?id= rHAX0LRwk8

  9. [9]

    ABot- PhysWorld: Interactive world foundation model for robotic manipulation with physics alignment, 2026

    Yuzhi Chen, Ronghan Chen, Dongjie Huo, Yandan Yang, Dekang Qi, Haoyun Liu, Tong Lin, Shuang Zeng, Junjin Xiao, Xinyuan Chang, Feng Xiong, Xing Wei, Zhiheng Ma, and Mu Xu. ABot-PhysWorld: Interactive world foundation model for robotic manipulation with physics alignment, 2026. URLhttps://arxiv.org/ abs/2603.23376. arXiv:2603.23376

  10. [10]

    arXiv preprint arXiv:2410.15461 (2024) 16 K

    Xiaowei Chi, Hengyuan Zhang, Chun-Kai Fan, Xingqun Qi, Rongyu Zhang, Anthony Chen, Chi min Chan, Wei Xue, Wenhan Luo, Shanghang Zhang, and Yike Guo. EV A: An embodied world model for future video anticipation, 2024. URLhttps://arxiv.org/abs/2410.15461. arXiv:2410.15461

  11. [11]

    Rethinking video generation model for the embodied world, 2026

    Yufan Deng, Zilin Pan, Hongyu Zhang, Xiaojie Li, Ruoqing Hu, Yufei Ding, Yiming Zou, Yan Zeng, and Daquan Zhou. Rethinking video generation model for the embodied world, 2026. URLhttps://arxiv.org/abs/ 2601.15282. arXiv:2601.15282

  12. [12]

    arXiv preprint arXiv:2310.10625 (2023)

    Yilun Du, Mengjiao Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning. InInternational Conference on Learning Representations, 2024. URLhttps://arxiv.org/abs/ 2310.10625

  13. [13]

    WorldScore: A unified evaluation benchmark for world generation

    Haoyi Duan, Hong-Xing Yu, Sirui Chen, Fei-Fei Li, and Jiajun Wu. WorldScore: A unified evaluation benchmark for world generation. InInternational Conference on Computer Vision, 2025. URLhttps://arxiv.org/abs/ 2504.00983

  14. [14]

    Wow, wo, val! a comprehensive embodied world model evaluation turing test.arXiv preprint arXiv:2601.04137, 2026

    Chun-Kai Fan, Xiaowei Chi, Xiaozhu Ju, Hao Li, Yong Bao, Yu-Kai Wang, Lizhang Chen, Zhiyuan Jiang, Kuangzhi Ge, Ying Li, Weishi Mi, Qingpo Wuwu, Peidong Jia, Yulin Luo, Kevin Zhang, Zhiyuan Qin, Yong Dai, Sirui Han, Yike Guo, Shanghang Zhang, and Jian Tang. Wow, wo, val! a comprehensive embodied world model evaluation turing test, 2026. URLhttps://arxiv.o...

  15. [15]

    Aaron Foss, Chloe Evans, Sasha Mitts, Koustuv Sinha, Ammar Rizvi, and Justine T. Kao. CausalVQA: A physically grounded causal reasoning benchmark for video models, 2025. URLhttps://arxiv.org/abs/ 2506.09943. arXiv:2506.09943

  16. [16]

    Learning video generation for robotic manipulation with collaborative trajectory control

    Xiao Fu, Xintao Wang, Xian Liu, Jianhong Bai, Runsen Xu, Pengfei Wan, Di Zhang, and Dahua Lin. Learning video generation for robotic manipulation with collaborative trajectory control. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=OeDwYtp8n1

  17. [17]

    Shenyuan Gao, William Liang, Kaiyuan Zheng, Ayaan Malik, Seonghyeon Ye, Sihyun Yu, Wei-Cheng Tseng, Yuzhu Dong, Kaichun Mo, Chen-Hsuan Lin, Qianli Ma, Seungjun Nah, Loic Magne, Jiannan Xiang, Yuqi Xie, Ruijie Zheng, Dantong Niu, You Liang Tan, K. R. Zentner, George Kurian, Suneel Indupuru, Pooya Jan- naty, Jinwei Gu, Jun Zhang, Jitendra Malik, Pieter Abbe...

  18. [18]

    Evaluating gemini robotics policies in a veo world simulator, 2025

    Gemini Robotics Team, Krzysztof Choromanski, Coline Devin, Yilun Du, Debidatta Dwibedi, Ruiqi Gao, Ab- hishek Jindal, Thomas Kipf, Sean Kirmani, Isabel Leal, Fangchen Liu, Anirudha Majumdar, Andrew Mar- mon, Carolina Parada, Yulia Rubanova, Dhruv Shah, Vikas Sindhwani, Jie Tan, Fei Xia, Ted Xiao, Sherry Yang, Wenhao Yu, and Allan Zhou. Evaluating gemini r...

  19. [19]

    GigaBrain-0.5M*: A VLA that learns from world model-based reinforcement learning, 2026

    GigaBrain Team, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Hao Li, Jie Li, Jindi Lv, Jingyu Liu, Lv Feng, Mingming Yu, Peng Li, Qiuping Deng, Tianze Liu, Xinyu Zhou, Xinze Chen, Xiaofeng Wang, Yang Wang, Yifan Li, Yifei Nie, Yilong Li, Yukun Zhou, Yun Ye, Zhichao Liu, and Zheng Zhu. GigaBrain-0.5M*: A VLA that learns from world model-based reinfo...

  20. [20]

    GigaWorld-0: World models as data engine to empower embodied AI, 2025

    GigaWorld Team, Angen Ye, Boyuan Wang, Chaojun Ni, Guan Huang, Guosheng Zhao, Haoyun Li, Jiagang Zhu, Kerui Li, Mengyuan Xu, Qiuping Deng, Siting Wang, Wenkang Qin, Xinze Chen, Xiaofeng Wang, Yankai Wang, Yu Cao, Yifan Chang, Yuan Xu, Yun Ye, Yang Wang, Yukun Zhou, Zhengyuan Zhang, Zhehao Dong, and Zheng Zhu. GigaWorld-0: World models as data engine to em...

  21. [21]

    The value equivalence principle for model-based reinforcement learning

    Christopher Grimm, André Barreto, Satinder Singh, and David Silver. The value equivalence principle for model-based reinforcement learning. InAdvances in Neural Information Process- ing Systems, volume 33, 2020. URLhttps://proceedings.neurips.cc/paper/2020/hash/ 3bb585ea00014b0e3ebe4c6dd165a358-Abstract.html. arXiv:2011.03506

  22. [22]

    Proper value equiva- lence

    Christopher Grimm, André Barreto, Gregory Farquhar, David Silver, and Satinder Singh. Proper value equiva- lence. InAdvances in Neural Information Processing Systems, volume 34, 2021. URLhttps://proceedings. neurips.cc/paper/2021/hash/00ac8ed3b4327bdd4ebbebcb2ba10a00-Abstract.html

  23. [23]

    VLAW: Iterative co-improvement of vision-language-action policy and world model, 2026

    Yanjiang Guo, Tony Lee, Lucy Xiaoyang Shi, Jianyu Chen, Percy Liang, and Chelsea Finn. VLAW: Iterative co-improvement of vision-language-action policy and world model, 2026. URLhttps://arxiv.org/abs/ 2602.12063. arXiv:2602.12063

  24. [24]

    Ctrl-World: A Controllable Generative World Model for Robot Manipulation

    Yanjiang Guo, Lucy Xiaoyang Shi, Jianyu Chen, and Chelsea Finn. Ctrl-World: A controllable generative world model for robot manipulation. InInternational Conference on Learning Representations, 2026. URL https://arxiv.org/abs/2510.10125

  25. [25]

    World Models

    David Ha and Jürgen Schmidhuber. World models, 2018. URLhttps://arxiv.org/abs/1803.10122. arXiv:1803.10122

  26. [26]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models, 2023. URLhttps://arxiv.org/abs/2301.04104. arXiv:2301.04104

  27. [27]

    Vid2World: Crafting video diffusion models to interactive world models

    Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2World: Crafting video diffusion models to interactive world models. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=pFyzqbUiF9

  28. [28]

    DreamGen: Unlocking Generalization in Robot Learning through Video World Models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loic Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zh...

  29. [29]

    When to trust your model: model-based policy optimization

    Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization. InAdvances in Neural Information Processing Systems, 2019. URLhttps://arxiv.org/abs/ 1906.08253

  30. [30]

    Model gradi- ent: unified model and policy learning in model-based reinforcement learning.Frontiers of Computer Science, 18:184339, 2024

    Chengxing Jia, Fuxiang Zhang, Tian Xu, Jing-Cheng Pang, Zongzhang Zhang, and Yang Yu. Model gradi- ent: unified model and policy learning in model-based reinforcement learning.Frontiers of Computer Science, 18:184339, 2024

  31. [31]

    RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation

    Feng Jiang, Yang Chen, Kyle Xu, Yuchen Liu, Haifeng Wang, Zhenhao Shen, Jasper Lu, Shengze Huang, Yuanfei Wang, Chen Xie, and Ruihai Wu. RoboWM-Bench: A benchmark for evaluating world models in robotic manipulation, 2026. URLhttps://arxiv.org/abs/2604.19092. arXiv:2604.19092

  32. [32]

    World4RL: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation, 2025

    Zhennan Jiang, Kai Liu, Yuxin Qin, Shuai Tian, Yupeng Zheng, Mingcai Zhou, Chao Yu, Haoran Li, and Dong- bin Zhao. World4RL: Diffusion world models for policy refinement with reinforcement learning for robotic manipulation, 2025. URLhttps://arxiv.org/abs/2509.19080. arXiv:2509.19080

  33. [33]

    WoVR: World Models as Reliable Simulators for Post-Training VLA Policies with RL

    Zhennan Jiang, Shangqing Zhou, Yutong Jiang, Zefang Huang, Mingjie Wei, Yuhui Chen, Tianxing Zhou, Zhen Guo, Hao Lin, Quanlu Zhang, Yu Wang, Haoran Li, Chao Yu, and Dongbin Zhao. WoVR: World models as reliable simulators for post-training VLA policies with RL, 2026. URLhttps://arxiv.org/abs/2602. 13977. arXiv:2602.13977

  34. [34]

    A shortcut-aware video-QA benchmark for physical understanding via minimal video pairs.Transactions on Machine Learning Research, 2025

    Benno Krojer, Mojtaba Komeili, Candace Ross, Quentin Garrido, Koustuv Sinha, Nicolas Ballas, and Mahmoud Assran. A shortcut-aware video-QA benchmark for physical understanding via minimal video pairs.Transactions on Machine Learning Research, 2025. URLhttps://arxiv.org/abs/2506.09987

  35. [35]

    Objective mismatch in model-based rein- forcement learning

    Nathan Lambert, Brandon Amos, Omry Yadan, and Roberto Calandra. Objective mismatch in model-based rein- forcement learning. InProceedings of the 2nd Conference on Learning for Dynamics and Control (L4DC), vol- ume 120 ofProceedings of Machine Learning Research, pages 761–770, 2020. URLhttps://proceedings. mlr.press/v120/lambert20a.html. arXiv:2002.04523

  36. [36]

    Gonzalez, Ion Stoica, Song Han, and Yao Lu

    Dacheng Li, Yunhao Fang, Yukang Chen, Shuo Yang, Shiyi Cao, Justin Wong, Michael Luo, Xiaolong Wang, Hongxu Yin, Joseph E. Gonzalez, Ion Stoica, Song Han, and Yao Lu. WorldModelBench: Judging video gener- ation models as world models, 2025. URLhttps://arxiv.org/abs/2502.20694. arXiv:2502.20694. 24 How Should World Models Be Evaluated for Embodied Decision-Making?

  37. [37]

    VLA-RFT: Vision-language-action reinforcement fine-tuning with verified rewards in world simulators, 2025

    Hengtao Li, Pengxiang Ding, Runze Suo, Yihao Wang, Zirui Ge, Dongyuan Zang, Kexian Yu, Mingyang Sun, Hongyin Zhang, Donglin Wang, and Weihua Su. VLA-RFT: Vision-language-action reinforcement fine-tuning with verified rewards in world simulators, 2025. URLhttps://arxiv.org/abs/2510.00406. arXiv:2510.00406

  38. [38]

    dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model

    Yaxuan Li, Zhongyi Zhou, Yefei Chen, Yaokai Xue, and Yichen Zhu. dWorldEval: Scalable robotic pol- icy evaluation via discrete diffusion world model, 2026. URLhttps://arxiv.org/abs/2604.22152. arXiv:2604.22152

  39. [39]

    Dreamitate: Real-world visuomotor policy learning via video generation

    Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl V ondrick. Dreamitate: Real-world visuomotor policy learning via video generation. InConference on Robot Learning, 2024. URLhttps://arxiv.org/abs/2406.16862

  40. [40]

    Genie envisioner: A unified world foundation platform for robotic manipulation

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Si Liu, Jianlan Luo, Liliang Chen, Shuicheng Yan, Maoqing Yao, and Guanghui Ren. Genie envisioner: A unified world foundation platform for robotic manipulation. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=fHLtSxDFKC

  41. [41]

    ADM-v2: Pursuing full-horizon roll-out in dynamics models for offline policy learning and evaluation

    Haoxin Lin, Siyuan Xiao, Yi-Chen Li, Zhilong Zhang, Yihao Sun, Chengxing Jia, and Yang Yu. ADM-v2: Pursuing full-horizon roll-out in dynamics models for offline policy learning and evaluation. InInternational Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=ICbXEwqpga

  42. [42]

    World-VLA-Loop: Closed-Loop Learning of Video World Model and VLA Policy

    Xiaokang Liu, Zechen Bai, Hai Ci, Kevin Yuchen Ma, and Mike Zheng Shou. World-VLA-Loop: Closed- loop learning of video world model and VLA policy, 2026. URLhttps://arxiv.org/abs/2602.06508. arXiv:2602.06508

  43. [43]

    LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

    Lucas Maes, Quentin Le Lidec, Damien Scieur, Yann LeCun, and Randall Balestriero. LeWorldModel: Stable end-to-end joint-embedding predictive architecture from pixels, 2026. URLhttps://arxiv.org/abs/2603. 19312. arXiv:2603.19312

  44. [44]

    Value-aware loss function for model-based reinforcement learning

    Amir massoud Farahmand, André Barreto, and Daniel Nikovski. Value-aware loss function for model-based reinforcement learning. InProceedings of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), volume 54 ofProceedings of Machine Learning Research, pages 1486–1494, 2017. URL https://proceedings.mlr.press/v54/farahmand17a.html

  45. [45]

    V-JEPA 2.1: Unlocking dense features in video self-supervised learning,

    Lorenzo Mur-Labadia, Matthew Muckley, Amir Bar, Mido Assran, Koustuv Sinha, Mike Rabbat, Yann LeCun, Nicolas Ballas, and Adrien Bardes. V-JEPA 2.1: Unlocking dense features in video self-supervised learning,

  46. [46]

    V-JEPA 2.1: Unlocking Dense Features in Video Self-Supervised Learning

    URLhttps://arxiv.org/abs/2603.14482. arXiv:2603.14482

  47. [47]

    PBench: A physical AI benchmark for world models, 2025

    NVIDIA. PBench: A physical AI benchmark for world models, 2025. URLhttps://research.nvidia. com/labs/cosmos-lab/pbench/

  48. [48]

    NVIDIA, Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, Prithvijit Chattopadhyay, Mike Chen, Yongxin Chen, Yu Chen, Shuai Cheng, Yin Cui, Jenna Diamond, Yifan Ding, Jiaojiao Fan, Linxi Fan, Liang Feng, Francesco Ferroni, Sanja Fidler, Xiao Fu, Ruiyuan Gao, Yunhao Ge, Jin...

  49. [49]

    Worldsimbench: Towards video generation models as world simulators.arXiv preprint arXiv:2410.18072, 2024

    Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, Lei Bai, Wanli Ouyang, and Ruimao Zhang. WorldSimBench: Towards video generation models as world simulators, 2024. URLhttps://arxiv.org/abs/2410.18072. arXiv:2410.18072

  50. [50]

    Worldgym: World model as an environment for policy evaluation.arXiv preprint arXiv:2506.00613, 2025

    Julian Quevedo, Ansh Kumar Sharma, Yixiang Sun, Varad Suryavanshi, Percy Liang, and Sherry Yang. WorldGym: World model as an environment for policy evaluation, 2025. URLhttps://arxiv.org/abs/ 2506.00613. arXiv:2506.00613

  51. [51]

    Worldarena: A unified benchmark for evaluating perception and functional utility of embodied world models.arXiv preprint arXiv:2602.08971, 2026

    Yu Shang, Zhuohang Li, Yiding Ma, Weikang Su, Xin Jin, Ziyou Wang, Lei Jin, Xin Zhang, Yinzhou Tang, Haisheng Su, Chen Gao, Wei Wu, Xihui Liu, Dhruv Shah, Zhaoxiang Zhang, Zhibo Chen, Jun Zhu, Yonghong 25 How Should World Models Be Evaluated for Embodied Decision-Making? Tian, Tat-Seng Chua, Wenwu Zhu, and Yong Li. WorldArena: A unified benchmark for eval...

  52. [52]

    World-Gymnast: Training robots with reinforcement learning in a world model. arXiv preprint arXiv:2602.02454, 2026

    Ansh Kumar Sharma, Yixiang Sun, Ninghao Lu, Yunzhe Zhang, Jiarao Liu, and Sherry Yang. World-Gymnast: Training robots with reinforcement learning in a world model, 2026. URL https://arxiv.org/abs/2602.02454. arXiv:2602.02454

  53. [53]

    Scalable policy evaluation with video world models, 2025

    Wei-Cheng Tseng, Jinwei Gu, Qinsheng Zhang, Hanzi Mao, Ming-Yu Liu, Florian Shkurti, and Lin Yen-Chen. Scalable policy evaluation with video world models, 2025. URL https://arxiv.org/abs/2511.11520. arXiv:2511.11520

  54. [54]

    Evaluating the world model implicit in a generative model, 2024

    Keyon Vafa, Justin Y. Chen, Ashesh Rambachan, Jon Kleinberg, and Sendhil Mullainathan. Evaluating the world model implicit in a generative model, 2024. URL https://arxiv.org/abs/2406.03689. arXiv:2406.03689

  55. [55]

    RoboVIP: Multi-view video generation with visual identity prompting augments robot manipulation, 2026

    Boyang Wang, Haoran Zhang, Shujie Zhang, Jinkun Hao, Mingda Jia, Qi Lv, Yucheng Mao, Zhaoyang Lyu, Jia Zeng, Xudong Xu, and Jiangmiao Pang. RoboVIP: Multi-view video generation with visual identity prompting augments robot manipulation, 2026. URL https://arxiv.org/abs/2601.05241. arXiv:2601.05241

  56. [56]

    EVA: Aligning video world models with executable robot actions via inverse dynamics rewards, 2026

    Ruixiang Wang, Qingming Liu, Yueci Deng, Guiliang Liu, Zhen Liu, and Kui Jia. EVA: Aligning video world models with executable robot actions via inverse dynamics rewards, 2026. URL https://arxiv.org/abs/2603.17808. arXiv:2603.17808

  57. [57]

    Interactive world simulator for robot policy training and evaluation

    Yixuan Wang, Rhythm Syed, Fangyu Wu, Mengchao Zhang, Aykut Onol, Jose Barreiros, Hooshang Nayyeri, Tony Dear, Huan Zhang, and Yunzhu Li. Interactive world simulator for robot policy training and evaluation. URL https://arxiv.org/abs/2603.08546. arXiv:2603.08546

  59. [59]

    Benchmarking World-Model Learning with Environment-Level Queries

    Archana Warrier, Dat Nguyen, Michelangelo Naim, Moksh Jain, Yichao Liang, Karen Schroeder, Cambridge Yang, Joshua B. Tenenbaum, Sebastian Vollmer, Kevin Ellis, and Zenna Tavares. Benchmarking world-model learning with environment-level queries, 2025. URL https://arxiv.org/abs/2510.19788. arXiv:2510.19788

  60. [60]

    A unified view on solving objective mismatch in model-based reinforcement learning. Transactions on Machine Learning Research

    Ran Wei, Nathan Lambert, Anthony McDonald, Alfredo Garcia, and Roberto Calandra. A unified view on solving objective mismatch in model-based reinforcement learning. Transactions on Machine Learning Research. URL https://openreview.net/forum?id=tQVZgvXhZb. arXiv:2310.06253

  62. [62]

    DayDreamer: World models for physical robot learning

    Philipp Wu, Alejandro Escontrela, Danijar Hafner, Ken Goldberg, and Pieter Abbeel. DayDreamer: World models for physical robot learning. In Conference on Robot Learning, 2023. URL https://arxiv.org/abs/2206.14176

  63. [63]

    World-Env: Leveraging World Model as a Virtual Environment for VLA Post-Training

    Junjin Xiao, Yandan Yang, Xinyuan Chang, Ronghan Chen, Feng Xiong, Mu Xu, Wei-Shi Zheng, and Qing Zhang. World-Env: Leveraging world model as a virtual environment for VLA post-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2026. URL https://arxiv.org/abs/2509.24948

  64. [64]

    Kinema4D: Kinematic 4D world modeling for spatiotemporal embodied simulation, 2026

    Mutian Xu, Tianbao Zhang, Tianqi Liu, Zhaoxi Chen, Xiaoguang Han, and Ziwei Liu. Kinema4D: Kinematic 4D world modeling for spatiotemporal embodied simulation, 2026. URL https://arxiv.org/abs/2603.16669. arXiv:2603.16669

  65. [65]

    RISE: Self-Improving Robot Policy with Compositional World Model

    Jiazhi Yang, Kunyang Lin, Jinwei Li, Wencong Zhang, Tianwei Lin, Longyan Wu, Zhizhong Su, Hao Zhao, Ya-Qin Zhang, Li Chen, Ping Luo, Xiangyu Yue, and Hongyang Li. RISE: Self-improving robot policy with compositional world model, 2026. URL https://arxiv.org/abs/2602.11075. arXiv:2602.11075

  66. [66]

    RoboEnvision: A long-horizon video generation model for multi-task robot manipulation

    Liudi Yang, Yang Bai, George Eskandar, Fengyi Shen, Mohammad Altillawi, Dong Chen, Soumajit Majumder, Ziyuan Liu, Gitta Kutyniok, and Abhinav Valada. RoboEnvision: A long-horizon video generation model for multi-task robot manipulation. In IEEE/RSJ International Conference on Intelligent Robots and Systems, 2025. URL https://arxiv.org/abs/2506.22007

  67. [67]

    Learning Interactive Real-World Simulators

    Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2310.06114

  68. [68]

    EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

    Zhaoyang Yang, Yurun Jin, Lizhe Qi, Cong Huang, and Kai Chen. EA-WM: Event-aware generative world model with structured kinematic-to-visual action fields, 2026. URL https://arxiv.org/abs/2605.06192. arXiv:2605.06192

  69. [69]

    PlayWorld: Learning Robot World Models from Autonomous Play

    Tenny Yin, Zhiting Mei, Zhonghe Zheng, Miyu Yamane, David Wang, Jade Sceats, Samuel M. Bateman, Lihan Zha, Apurva Badithela, Ola Shorinwa, and Anirudha Majumdar. PlayWorld: Learning robot world models from autonomous play, 2026. URL https://arxiv.org/abs/2603.09030. arXiv:2603.09030

  70. [70]

    EWMBench: Evaluating scene, motion, and semantic quality in embodied world models, 2025

    Hu Yue, Siyuan Huang, Yue Liao, Shengcong Chen, Pengfei Zhou, Liliang Chen, Maoqing Yao, and Guanghui Ren. EWMBench: Evaluating scene, motion, and semantic quality in embodied world models, 2025. URL https://arxiv.org/abs/2505.09694. arXiv:2505.09694

  71. [71]

    ProphRL: Reinforcing action policies by prophesying, 2025

    Jiahui Zhang, Ze Huang, Chun Gu, Zipei Ma, and Li Zhang. ProphRL: Reinforcing action policies by prophesying, 2025. URL https://arxiv.org/abs/2511.20633. arXiv:2511.20633

  72. [72]

    WHALE: Towards generalizable and scalable world models for embodied decision-making, 2024

    Zhilong Zhang, Ruifeng Chen, Junyin Ye, Yihao Sun, Pengyuan Wang, Jingcheng Pang, Kaiyuan Li, Tianshuo Liu, Haoxin Lin, Yang Yu, and Zhi-Hua Zhou. WHALE: Towards generalizable and scalable world models for embodied decision-making, 2024. URL https://arxiv.org/abs/2411.05619. arXiv:2411.05619

  73. [73]

    Towards practical world model-based reinforcement learning for vision-language-action models, 2026

    Zhilong Zhang, Haoxiang Ren, Yihao Sun, Yifei Sheng, Haonan Wang, Haoxin Lin, Zhichao Wu, Pierre-Luc Bacon, and Yang Yu. Towards practical world model-based reinforcement learning for vision-language-action models, 2026. URL https://openreview.net/forum?id=gB1yFEd106. ICLR 2026 World Models Workshop

  74. [74]

    RoboDreamer: Learning Compositional World Models for Robot Imagination

    Siyuan Zhou, Yilun Du, Jiaben Chen, Yandong Li, Dit-Yan Yeung, and Chuang Gan. RoboDreamer: Learning compositional world models for robot imagination. In International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2404.12377

  75. [75]

    IRASim: A fine-grained world model for robot manipulation

    Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. IRASim: A fine-grained world model for robot manipulation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9834–9844, 2025. URL https://openaccess.thecvf.com/content/ICCV2025/html/Zhu_IRASim_A_Fine-Grained_World_Model_for_Robot_Manipulation_ICCV_202...

  76. [76]

    WMPO: World model-based policy optimization for vision-language-action models

    Fangqi Zhu, Zhengyang Yan, Zicong Hong, Quanxin Shou, Xiao Ma, and Song Guo. WMPO: World model-based policy optimization for vision-language-action models. In International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=qE2FyvRvuF.