
arxiv: 2606.00133 · v1 · pith:CQSSR2V5new · submitted 2026-05-28 · 💻 cs.LG · cs.ET

World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications

Pith reviewed 2026-06-29 08:34 UTC · model grok-4.3

classification 💻 cs.LG cs.ET
keywords world models, survey, taxonomy, reinforcement learning, planning, simulation, multimodal agents

The pith

World models are organized by a four-axis taxonomy covering architecture, methodology, reasoning, and application domains.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys internal simulators that learn environment structure and dynamics to support prediction, planning, and reasoning in agents. It states that the field lacks a single framework integrating architectural choices, training methods, reasoning mechanisms, and application settings. The survey supplies a multi-axis taxonomy along four dimensions to organize existing work from early foundations through systems such as PlaNet, Dreamer, MuZero, Sora, and Genie. It also reviews evaluation methods, persistent issues like compounding errors and sim-to-real gaps, and points to future unified multimodal simulators.
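To make the idea of an internal simulator that supports planning concrete, here is a minimal sketch of imagination-based planning by random shooting. Everything in it is an assumption made for illustration: the one-dimensional `model_step`, the `reward` function, and the names `imagine_return` and `plan` are hypothetical and are not taken from the paper or from any surveyed system.

```python
import random


def model_step(state, action):
    """Hypothetical learned dynamics: a toy 1-D stand-in for a world model."""
    return state + 0.5 * action


def reward(state, goal=1.0):
    """Hypothetical reward: negative squared distance to a goal state."""
    return -(state - goal) ** 2


def imagine_return(state, actions):
    """Roll the model forward on an action sequence and sum imagined rewards."""
    total = 0.0
    for action in actions:
        state = model_step(state, action)
        total += reward(state)
    return total


def plan(state, horizon=5, candidates=200, seed=0):
    """Random shooting: sample action sequences, score each one purely in
    imagination, and return the first action of the best-scoring sequence."""
    rng = random.Random(seed)
    best_seq, best_ret = None, float("-inf")
    for _ in range(candidates):
        seq = [rng.uniform(-1.0, 1.0) for _ in range(horizon)]
        ret = imagine_return(state, seq)
        if ret > best_ret:
            best_seq, best_ret = seq, ret
    return best_seq[0], best_ret


first_action, imagined = plan(state=0.0)
```

No real environment is queried while candidates are scored; the planner only consults the model, which is why errors in the model matter so much for the resulting plan.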

Core claim

The field of world models lacks a unified framework integrating its diverse architectural choices, training methods, reasoning mechanisms, and application settings. This survey supplies a multi-axis taxonomy organized along four dimensions: architecture (representation, dynamics, modality, paradigm); methodological family (state-space, recurrent, transformer, diffusion, physics-informed, language-augmented); reasoning strategy (imagination-based planning, latent policy learning, counterfactual reasoning, planning under uncertainty); and application domain. The taxonomy is used to trace interactions among the dimensions, highlight the convergence of chain-of-thought reasoning with world-model imagination, and outline directions toward foundation-scale interactive simulators.

What carries the argument

The multi-axis taxonomy along architecture, methodological family, reasoning strategy, and application domain, used to classify milestone systems and their interactions.

If this is right

  • Milestone systems such as Dreamer and MuZero can be placed and compared directly on the four axes.
  • Recent convergence between chain-of-thought reasoning and world-model imagination becomes visible as an interaction across the reasoning and methodological axes.
  • Persistent challenges such as compounding prediction errors and fragmented evaluation can be examined uniformly across domains.
  • Future work on unified multimodal world models and safe deployment follows as extensions along the architecture and application axes.
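The compounding-error challenge named in the list above can be illustrated with a toy calculation. This is a sketch under stated assumptions, not an experiment from the survey: the true dynamics are taken to be x_{t+1} = a * x_t, the learned model uses a slightly wrong coefficient `a_hat`, and both values are hypothetical.

```python
def rollout(coeff, x0, horizon):
    """Iterate x_{t+1} = coeff * x_t and return every visited state."""
    states = [x0]
    for _ in range(horizon):
        states.append(coeff * states[-1])
    return states


def rollout_errors(a_true, a_hat, x0, horizon):
    """Absolute gap between the true rollout and the model rollout per step."""
    true_states = rollout(a_true, x0, horizon)
    model_states = rollout(a_hat, x0, horizon)
    return [abs(t - m) for t, m in zip(true_states, model_states)]


# Hypothetical numbers: a 2% error in the one-step coefficient.
errors = rollout_errors(a_true=1.00, a_hat=1.02, x0=1.0, horizon=20)
```

In this toy setting the one-step gap is about 0.02, while the gap after 20 imagined steps is about 0.49, so the error grows faster than linearly with the rollout horizon. Real world models are nonlinear and stochastic, so the growth pattern there may differ.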

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The taxonomy could be used to identify missing combinations, such as physics-informed diffusion models for scientific domains, that have not yet been built.
  • Extending the same four axes to large language models that incorporate internal simulation would test whether the structure generalizes beyond the surveyed reinforcement-learning and robotics literature.
  • Standardizing benchmarks according to the taxonomy's dimensions would allow direct measurement of how architectural choices affect sim-to-real transfer.
  • The survey's emphasis on evaluation protocols suggests that new metrics could be defined per cell of the taxonomy to reduce fragmentation.
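The first bullet above, finding combinations that have not been built, amounts to enumerating taxonomy cells and subtracting the occupied ones. Here is a minimal sketch, assuming a two-axis slice (methodological family by reasoning strategy) with labels paraphrased from the abstract. The `occupied` entries are hypothetical placeholders, not the survey's actual classification of any system, and `empty_cells` is a name invented for this example.

```python
from itertools import product

# Axis labels paraphrased from the abstract's dimensions (ii) and (iii).
METHOD_FAMILIES = [
    "state-space/recurrent",
    "transformer",
    "diffusion",
    "physics-informed",
    "language-augmented",
]
REASONING_STRATEGIES = [
    "imagination-based planning",
    "latent policy learning",
    "counterfactual reasoning",
    "planning under uncertainty",
]


def empty_cells(occupied):
    """Return the (family, strategy) cells that hold no classified system."""
    taken = set(occupied.values())
    all_cells = product(METHOD_FAMILIES, REASONING_STRATEGIES)
    return [cell for cell in all_cells if cell not in taken]


# Hypothetical placements, for illustration only.
occupied = {
    "system_a": ("state-space/recurrent", "latent policy learning"),
    "system_b": ("diffusion", "imagination-based planning"),
}
gaps = empty_cells(occupied)
```

The same enumeration could index per-cell metrics, as the last bullet suggests, but it only works cleanly if each system maps to exactly one value per axis, which is the disjointness question the referee report raises.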

Load-bearing premise

The four chosen dimensions and listed milestone systems suffice to organize the full literature without major omissions or overlaps requiring extra axes.

What would settle it

Discovery of multiple important world-model papers or systems that require a fifth organizing dimension or cannot be placed on the four axes without distortion.

Figures

Figures reproduced from arXiv: 2606.00133 by Arif Hassan Zidan, Bowen Chen, Dajiang Zhu, Hanqi Jiang, Huawen Hu, Jinglei Lv, Jing Zhang, Lichao Sun, Lifeng Chen, Lin Zhao, Peilong Wang, Quanzheng Li, Ruiyu Yan, Siyuan Li, Sizhuang Liu, Tianming Liu, Weihang You, Wei Liu, Wei Ruan, Wei Zhang, Xiang Li, Xinliang Li, Yi Pan, Yu Bao, Zhengliang Liu, Zihao Wu.

Figure 1: Overview of the world model landscape, organized into a conceptual taxonomy of implicit …
Figure 2: Imagination-based planning in latent world models. Starting from the current latent state …
Figure 3: Capability landscape of medical world models. Models are placed according to their …
Figure 4: Educational measurement viewed as a world model of learning dynamics. The figure com…

Original abstract

World models, internal simulators that learn the structure and dynamics of an environment, have emerged as a central paradigm in the pursuit of artificial general intelligence, enabling agents to predict, plan, and reason within learned representations. Despite rapid progress across reinforcement learning, robotics, autonomous driving, and video generation, the field lacks a unified framework integrating its diverse architectural choices, training methods, reasoning mechanisms, and application settings. This survey addresses that gap with a multi-axis taxonomy organized along four dimensions: (i) architecture, encompassing representation format, dynamics formulation, input modality, learning paradigm, and downstream application; (ii) methodological family, including state-space and recurrent approaches, transformer-based models, diffusion-based generators, physics-informed networks, and language-augmented multimodal systems; (iii) reasoning strategy, covering imagination-based planning, latent policy learning, counterfactual reasoning, and planning under uncertainty; and (iv) application domain, spanning robotics, autonomous driving, video prediction, multimodal agents, reinforcement learning, scientific modeling, medical imaging, educational measurement, and business and finance. Tracing the field from early cognitive-science foundations to milestone systems such as PlaNet, the Dreamer family, MuZero, Sora, Cosmos, and Genie, we examine how these dimensions interact and highlight the recent convergence of chain-of-thought reasoning with world-model imagination. We review evaluation protocols and benchmarks, identify persistent challenges such as compounding prediction errors, sim-to-real transfer, and fragmented evaluation, and outline future directions toward unified multimodal world models, foundation-scale interactive simulators, and safe deployment in safety-critical domains.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper claims that world models lack a unified framework integrating architectural choices, training methods, reasoning mechanisms, and applications, and addresses this gap via a four-axis taxonomy: (i) architecture (representation format, dynamics formulation, input modality, learning paradigm, downstream application), (ii) methodological family (state-space/recurrent, transformer-based, diffusion-based, physics-informed, language-augmented), (iii) reasoning strategy (imagination-based planning, latent policy learning, counterfactual reasoning, planning under uncertainty), and (iv) application domain (robotics, autonomous driving, video prediction, etc.). It traces the field from cognitive-science roots through milestones such as PlaNet, the Dreamer family, MuZero, Sora, Cosmos, and Genie; examines dimension interactions including convergence of chain-of-thought with imagination; reviews evaluation protocols and benchmarks; identifies challenges such as compounding errors and sim-to-real transfer; and outlines future directions toward unified multimodal models and safe deployment.

Significance. A well-constructed, non-overlapping taxonomy could provide a useful organizing lens for the rapidly growing world-model literature across RL, robotics, video generation, and scientific domains, especially given the paper's coverage of historical foundations and recent systems. The explicit discussion of persistent challenges and future directions toward foundation-scale simulators adds reference value if the taxonomy axes can be made disjoint.

major comments (2)
  1. [Abstract] Dimension (i) is defined to encompass 'representation format, dynamics formulation, input modality, learning paradigm, and downstream application.' This scope directly intersects with dimension (iv) 'application domain, spanning robotics, autonomous driving,...', violating the requirement that the four axes be disjoint for the taxonomy to supply a unified framework without important overlaps.
  2. [Abstract] Methodological family (ii) lists state-space/recurrent approaches, transformer-based models, etc., which are already subsumed under the architectural choices enumerated in dimension (i). No evidence is supplied that the authors apply a non-overlapping assignment rule when classifying concrete systems such as Dreamer or MuZero.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for highlighting the need for a disjoint taxonomy. We address each major comment below and will incorporate revisions to strengthen the framework.

Point-by-point responses
  1. Referee: [Abstract] Dimension (i) is defined to encompass 'representation format, dynamics formulation, input modality, learning paradigm, and downstream application.' This scope directly intersects with dimension (iv) 'application domain, spanning robotics, autonomous driving,...', violating the requirement that the four axes be disjoint for the taxonomy to supply a unified framework without important overlaps.

    Authors: We agree that listing 'downstream application' within dimension (i) creates an unintended overlap with dimension (iv). In the revised version we will remove 'downstream application' from the definition of dimension (i), restricting it to representation format, dynamics formulation, input modality, and learning paradigm. Dimension (iv) will remain the sole locus for application domains. The change will appear in the abstract, the taxonomy section, and the classification tables.

    revision: yes

  2. Referee: [Abstract] Methodological family (ii) lists state-space/recurrent approaches, transformer-based models, etc., which are already subsumed under the architectural choices enumerated in dimension (i). No evidence is supplied that the authors apply a non-overlapping assignment rule when classifying concrete systems such as Dreamer or MuZero.

    Authors: Dimension (i) enumerates granular design decisions (e.g., whether the dynamics are formulated as a state-space model or a transformer), while dimension (ii) groups models by their dominant methodological family at a higher level of abstraction. Nevertheless, the current text does not explicitly state the assignment rule or demonstrate its application to the cited systems. We will add a short subsection that defines a priority ordering (family first, then component choices) and will include explicit assignments for Dreamer, MuZero, Sora, and several other milestones to make the separation transparent.

    revision: partial

Circularity Check

0 steps flagged

No circularity: survey proposes taxonomy without derivations or self-referential reductions


This is a literature survey paper whose central contribution is a four-axis taxonomy for organizing existing world-model research. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The dimensions are stated explicitly as (i) architecture (with listed sub-elements), (ii) methodological family, (iii) reasoning strategy, and (iv) application domain; these are applied to external milestone systems such as PlaNet, Dreamer, MuZero, Sora, and Genie. No step reduces a claim to a self-citation, an ansatz smuggled via prior work, or a renaming of a known result. The taxonomy is an author-proposed organizational tool rather than a quantity derived from itself. Minor self-citations, if present, are not load-bearing for any derivation because none exists. The paper is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

As a survey the paper rests on the domain assumption that world models are a coherent and central research area; it introduces no free parameters, new entities, or ad-hoc axioms beyond standard machine-learning background.

axioms (1)
  • domain assumption: World models constitute a central paradigm for artificial general intelligence
    Stated directly in the opening sentence of the abstract.

