pith. machine review for the scientific record. sign in

arxiv: 2604.22748 · v1 · submitted 2026-04-24 · 💻 cs.AI

Recognition: unknown

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Authors on Pith no claims yet

Pith reviewed 2026-05-08 11:57 UTC · model grok-4.3

classification 💻 cs.AI
keywords world modelsagentic AItaxonomyenvironment modelingmodel-based RLsimulationAI agentsmulti-agent systems
0
0 comments X

The pith

A levels x laws taxonomy classifies world models for agents into three capability stages and four domain law regimes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a grid that sorts world models according to how much they can do and what kind of rules they must follow. One axis marks three rising stages of ability: simple one-step predictors, full multi-step simulators that follow the rules, and self-revising evolvers that fix their own mistakes. The other axis marks four kinds of rules that shape those models in physical objects, digital software, social groups, and scientific experiments. By placing more than one hundred existing systems on this grid the work shows where current methods stop short and what next steps would let agents maintain accurate pictures of the environments they act in.

Core claim

We introduce a levels x laws taxonomy with three capability levels—L1 Predictor for one-step local transitions, L2 Simulator for action-conditioned multi-step rollouts that obey domain laws, and L3 Evolver for autonomous model revision on prediction failures—and four governing-law regimes—physical, digital, social, and scientific—to organize research, expose failure modes, and outline paths from passive prediction toward models agents can use to simulate and reshape their surroundings.

What carries the argument

The levels x laws taxonomy, a two-axis grid that places any world model at the intersection of its capability stage and the type of law it must satisfy.

If this is right

  • Model-based reinforcement learning methods supply the transition operators needed for L2 simulators in physical regimes.
  • Video-generation pipelines can serve as L1 predictors but must gain explicit action conditioning to reach L2.
  • Multi-agent social simulations expose distinct failure modes at the L3 boundary where models must revise themselves.
  • Decision-centric evaluation protocols can replace task-specific benchmarks across all level-regime pairs.
  • Modular architectures that separate transition learning from law enforcement become a practical route to L3 systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The grid suggests experiments that deliberately feed contradictory evidence to an L2 model and measure whether revision occurs without external retraining.
  • Architectures developed for one regime could be stress-tested by transferring them to another to quantify how law type affects required capacity.
  • Self-revision at L3 raises the practical question of how to certify that an evolving model remains aligned with human goals over time.
  • The taxonomy could be extended with a fifth regime for hybrid human-AI environments once enough systems occupy that space.

Load-bearing premise

Research on world models can be partitioned into these three levels and four regimes without large overlaps or missing categories that would make the grid useless for guiding design choices.

What would settle it

A complete classification of the surveyed papers and systems reveals many that resist assignment to any single level-regime cell without stretching the definitions or adding new cells.

Figures

Figures reproduced from arXiv: 2604.22748 by Bin Xia, Fengyi Wu, Haokun Gui, Haoxuan Che, Jiaya Jia, Jiehui Huang, Jinhui Ye, Jize Zhang, Kevin Qinghong Lin, Leyang Shen, Lingdong Kong, Long Chen, Meng Chu, Mike Zheng Shou, Mingkang Zhu, Philip Torr, Qifeng Chen, Qisheng Hu, Quanyu Long, See-kiong Ng, Senqiao Yang, Shaozuo Yu, Shuai Yang, Teng Tu, Wei Chow, Wei Huang, Weijian Ma, Wenhu Zhang, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Xichen Zhang, Xinyu Lin, Xuan Billy Zhang, Yang Deng, Yanwei Li, Yeying Jin, Yifei Dong, Zhefan Rao, Zhi-Qi Cheng, Ziqi Huang, Ziwei Liu.

Figure 1
Figure 1. Figure 1: Organizational structure of this survey. The paper is organized around three capability levels (L1 Predictor, L2 Simulator, L3 Evolver) and four governing-law regimes (physical, digital, social, scientific worlds), with supporting sections on evaluation, implementation, and open problems. 2 view at source ↗
Figure 2
Figure 2. Figure 2: Positioning of this survey relative to existing world model and agent surveys. Four clusters, Embodied World Models, Generative World Models, Language Agents, and AI for Science, each cover subsets of the field. Our survey (center) integrates cross domain coverage with a capability based taxonomy (L1/L2/L3 × four regimes), bridging largely isolated communities. specifically for embodied AI; Feng et al. (20… view at source ↗
Figure 3
Figure 3. Figure 3: Schematic illustrations of the four governing-law regimes. Representative scenes for each regime: a humanoid agent manipulating blocks (Physical World), code and UI surfaces (Digital World), a network of interacting agents with speech acts (Social World), and instrumented experimentation with robotic microscope and pipette (Scientific World). Each regime’s formal constraints are discussed in Section 2.5. w… view at source ↗
Figure 4
Figure 4. Figure 4: Timeline of representative world-modeling systems (2018–2026) organized by capa￾bility level. The roadmap shows 70 survey anchors, capped at five systems per year–level cell for readabil￾ity. L1 Predictor denotes one-step dynamics, L2 Simulator denotes decision-usable multi-step rollout, and L3 Evolver denotes full evidence-driven model revision; partial L3 loops remain in view at source ↗
Figure 5
Figure 5. Figure 5: From local prediction to evidence-driven revision: a hierarchical view of world mod￾eling. Level 1 models empirical regularities for prediction, Level 2 supports possible-world semantics and counterfactual simulation, and Level 3 introduces evidence-driven revision through continual interaction with the environment. This hierarchy frames world modeling as an ascending process from pattern recognition, to t… view at source ↗
Figure 6
Figure 6. Figure 6: Historical development of world modeling across four eras: Mathematical Principles (– 1956), Symbolic Intelligence (1956–1986), Connectionist Resurgence (1986–2020), and Generative Revolution (2020–present). Two AI winters (1974–1980, 1987–1993) mark transitions between paradigms. See discus￾sions in Section 8.1. This argues that a good representation of world model should be instantiation-agnostic. decisi… view at source ↗
Figure 7
Figure 7. Figure 7: Unified POMDP graphical model of L1-L3. Dashed circles denote hidden environment states x; double circles denote learned latent states z; shaded circles denote observations o; squares denote actions a. Blue solid arrows denote the learned model (inference qϕ and dynamics pθ); dashed gray arrows denote the environment transition T and observation emission. The top block shows the agent’s POMDP under the cur… view at source ↗
Figure 8
Figure 8. Figure 8: Diagnostic map of the four governing-law regimes. The axes are schematic rather than metric: the horizontal axis reflects how formally specifiable and mechanically verifiable the transition rules are, while the vertical axis reflects how directly the relevant state and constraints are observable. The purpose of the figure is comparative rather than classificatory: it highlights why different regimes demand… view at source ↗
Figure 9
Figure 9. Figure 9: The L3 evolution loop. A full cycle proceeds through four stages: design, execute, observe, and reflect, producing a revised world-modeling stack Mt+1. Revision triggers and evolution policy. The reflect stage is responsible for deciding when and how the world model should be revised, in particular distinguishing between incremental improvement and structural change. In practice, this decision is driven by… view at source ↗
Figure 10
Figure 10. Figure 10: L3 evolution across four governing-law regimes. Each panel illustrates the design–execute– observe–reflect loop in a representative domain: (a) Physical intelligence—adaptive probing revises con￾tact dynamics; (b) Social intelligence—norm drift triggers social-model revision; (c) Digital intelligence— evaluator-driven program search with regression gates; (d) Scientific intelligence—closed-loop autonomous… view at source ↗
read the original abstract

As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a 'levels x laws' taxonomy for world models in agentic AI systems. It defines three capability levels—L1 Predictor (one-step local transition operators), L2 Simulator (composing operators into multi-step action-conditioned rollouts respecting domain laws), and L3 Evolver (autonomous model revision on prediction failures)—crossed with four governing-law regimes (physical, digital, social, scientific). Using this framework, the authors synthesize over 400 works, summarize more than 100 representative systems across model-based RL, video generation, web/GUI agents, multi-agent simulation, and AI-driven discovery, analyze methods/failure modes/evaluations, propose decision-centric evaluation principles with a minimal reproducible package, and outline architectural guidance and open problems.

Significance. If the taxonomy holds as a useful organizing lens, the work could connect previously siloed communities and provide a roadmap from passive prediction toward models that simulate and reshape environments. The large-scale synthesis of 400+ works, the explicit proposal of decision-centric evaluation principles, and the identification of cross-regime failure modes represent concrete strengths that could inform architectural choices and evaluation practices if the categories prove sharp and non-overlapping in practice.

major comments (2)
  1. [Abstract (taxonomy definition) and synthesis of representative systems] The central claim that the levels x laws framework forms a useful partition for synthesizing literature and guiding architecture/evaluation rests on the assumption that the L1/L2/L3 levels and four regimes are sufficiently disjoint. However, the definitions allow substantial overlap: L2 is explicitly described as composing L1 operators, and systems in model-based RL and video generation routinely perform both one-step prediction and multi-step rollouts. The synthesis of 100+ representative systems does not appear to include a quantitative breakdown of clean vs. multi-label assignments or forced categorizations, which is load-bearing for the claim that the taxonomy enables sharp failure-mode analysis.
  2. [Synthesis across level-regime pairs and failure-mode analysis] The paper claims the regimes determine 'what constraints a world model must satisfy and where it is most likely to fail,' yet social simulation papers frequently embed physical or digital constraints. Without explicit discussion of how multi-regime systems are handled in the analysis of methods and failure modes, the framework risks becoming loose labeling rather than a decision-centric tool.
minor comments (2)
  1. [Abstract and introduction] The abstract and introduction use 'over 400 works' and 'more than 100 representative systems' without a clear appendix or table listing the exact selection criteria or full bibliography mapping, which would aid reproducibility of the synthesis.
  2. [Taxonomy introduction] Notation for the levels (L1, L2, L3) and regimes is introduced clearly but could benefit from a single summary table early in the manuscript to facilitate cross-referencing in later sections on evaluation and open problems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, offering clarifications on the taxonomy's design and committing to revisions that enhance transparency without altering the core claims.

read point-by-point responses
  1. Referee: [Abstract (taxonomy definition) and synthesis of representative systems] The central claim that the levels x laws framework forms a useful partition for synthesizing literature and guiding architecture/evaluation rests on the assumption that the L1/L2/L3 levels and four regimes are sufficiently disjoint. However, the definitions allow substantial overlap: L2 is explicitly described as composing L1 operators, and systems in model-based RL and video generation routinely perform both one-step prediction and multi-step rollouts. The synthesis of 100+ representative systems does not appear to include a quantitative breakdown of clean vs. multi-label assignments or forced categorizations, which is load-bearing for the claim that the taxonomy enables sharp failure-mode analysis.

    Authors: We appreciate the referee's point on potential overlaps. The levels are defined as progressive capability stages rather than mutually exclusive implementations: L1 centers on learning local one-step operators, L2 on their composition into law-respecting multi-step rollouts, and L3 on autonomous model revision. Systems are classified by the highest level for which they provide clear evidence, even if lower-level components are present. While the manuscript does not currently include a quantitative breakdown of assignments, we will add a supplementary table in the revision that lists the 100+ representative systems with their primary level-regime classification, notes on any hybrid aspects, and justification for each assignment. This addition will make the failure-mode analysis more rigorous and demonstrate the taxonomy's partitioning utility. revision: yes

  2. Referee: [Synthesis across level-regime pairs and failure-mode analysis] The paper claims the regimes determine 'what constraints a world model must satisfy and where it is most likely to fail,' yet social simulation papers frequently embed physical or digital constraints. Without explicit discussion of how multi-regime systems are handled in the analysis of methods and failure modes, the framework risks becoming loose labeling rather than a decision-centric tool.

    Authors: We agree that hybrid systems are common, with social simulations often incorporating physical or digital constraints. The framework classifies each system according to its dominant regime—the one that primarily dictates the key constraints and associated failure modes analyzed in the synthesis. To strengthen this, we will insert a new subsection on cross-regime systems in the revised manuscript. This subsection will explicitly discuss handling of multi-regime cases, provide concrete examples from the reviewed literature, and explain how intersecting constraints are accounted for in the method and failure-mode analysis. These changes will clarify the framework's application as a decision-centric lens while preserving its utility for identifying primary risks. revision: yes

Circularity Check

0 steps flagged

No circularity: taxonomy is an explicit proposal synthesizing external literature

full rationale

The paper defines its L1/L2/L3 levels and physical/digital/social/scientific regimes as a new organizing framework, then applies it to categorize >400 external works. No equations, fitted parameters, or predictions are present; the taxonomy is not derived from prior results but proposed outright. No self-citation chains or self-definitional reductions appear in the load-bearing claims. The central contribution is the synthesis and roadmap itself, which remains independent of the framework's internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 4 invented entities

The paper introduces new conceptual categories (levels and regimes) based on synthesis of external works; relies on domain assumptions about agent bottlenecks without independent empirical validation in the provided abstract.

axioms (1)
  • domain assumption The ability to model environment dynamics is a central bottleneck for agents that accomplish goals through sustained interaction.
    Opening motivation stated in the abstract.
invented entities (4)
  • L1 Predictor no independent evidence
    purpose: Learns one-step local transition operators
    Newly defined capability level in the taxonomy.
  • L2 Simulator no independent evidence
    purpose: Composes transitions into multi-step, action-conditioned rollouts that respect domain laws
    Newly defined capability level in the taxonomy.
  • L3 Evolver no independent evidence
    purpose: Autonomously revises its own model when predictions fail against new evidence
    Newly defined capability level in the taxonomy.
  • Four governing-law regimes (physical, digital, social, scientific) no independent evidence
    purpose: Determine constraints a world model must satisfy and where it is most likely to fail
    Newly defined axis in the taxonomy.

pith-pipeline@v0.9.0 · 5721 in / 1455 out tokens · 127924 ms · 2026-05-08T11:57:28.689577+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Graph World Models: Concepts, Taxonomy, and Future Directions

    cs.AI 2026-04 unverdicted novelty 7.0

    The paper unifies emerging graph-based world models under a new paradigm and proposes a taxonomy organized by spatial, physical, and logical relational inductive biases.

Reference graph

Works this paper leans on

300 extracted references · 89 canonical work pages · cited by 1 Pith paper · 16 internal anchors

  1. [1]

    Abramson, J

    J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L. Willmore, A. J. Ballard, J. Bambrick, S. W. Bodenstein, D. A. Evans, C.-C. Hung, M. O'Neill, D. Reiman, K. Tunyasuvunakool, Z. Wu, A. Z emguly \. t \. e , E. Arvaniti, C. Beattie, O. Bertolli, A. Bridgland, A. Cherepanov, M. Congreve, A. I. Cowen-Rivers, A. Cowie, M. Figu...

  2. [2]

    Cosmos World Foundation Model Platform for Physical AI

    N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y. Chen, Y. Cui, Y. Ding, D. Dworakowski, J. Fan, M. Fenzi, F. Ferroni, S. Fidler, D. Fox, S. Ge, Y. Ge, J. Gu, S. Gururani, E. He, J. Huang, J. Huffman, P. Jannaty, J. Jin, S. W. Kim, G. Klár, G. Lam, S. Lan, L. Leal-Taixe, A. Li, Z. Li, C.-H. Lin, T.-Y. Lin, H. Ling, M.-Y. Liu,...

  3. [3]

    Agarwal, M

    R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. G. Bellemare. Deep reinforcement learning at the edge of the statistical precipice. In Advances in Neural Information Processing Systems, volume 34, 2021

  4. [4]

    Agrawal, A

    P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, volume 29, pages 5092--5100, 2016

  5. [5]

    A. AL, A. Ahn, N. Becker, S. Carroll, N. Christie, M. Cortes, A. Demirci, M. Du, F. Li, S. Luo, P. Y. Wang, M. Willows, F. Yang, and G. R. Yang. Project Sid : Many-agent simulations toward AI civilization. arXiv preprint arXiv:2411.00114, 2024

  6. [6]

    Alonso, A

    E. Alonso, A. Jelley, V. Micheli, A. Kanervisto, A. Storkey, T. Pearce, and F. Fleuret. Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems, 37: 0 58757--58791, 2024

  7. [7]

    Andrychowicz, M

    M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, volume 29, pages 3988--3996, 2016

  8. [8]

    Angermueller, D

    C. Angermueller, D. Belanger, A. Gane, Z. Mariet, D. Dohan, K. Murphy, L. Colwell, and D. Sculley. Population-based black-box optimization for biological sequence design. In International Conference on Machine Learning, pages 324--334. PMLR, 2020

  9. [9]

    Effective context engineering for AI agents

    Anthropic . Effective context engineering for AI agents. Anthropic Engineering Blog, 2025. URL https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

  10. [10]

    L. P. Argyle, E. C. Busby, N. Fulda, J. R. Gubler, C. Rytting, and D. Wingate. Out of one, many: Using language models to simulate human samples. Political Analysis, 31 0 (3): 0 337--351, 2023

  11. [11]

    Arunkumar, G

    V. Arunkumar, G. R. Gangadharan, and R. Buyya. Agentic artificial intelligence ( AI ): Architectures, taxonomies, and evaluation of large language model agents. arXiv preprint arXiv:2601.12560, 2026

  12. [12]

    A. F. Ashery, L. M. Aiello, and A. Baronchelli. Emergent social conventions and collective bias in LLM populations. Science Advances, 11, 2025

  13. [13]

    Ashkboos, A

    S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman. QuaRot : Outlier-free 4-bit inference in rotated LLMs . In Advances in Neural Information Processing Systems, volume 37, pages 100213--100240, 2024

  14. [14]

    Assran, Q

    M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619--15629, 2023

  15. [15]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, Mojtaba, Komeili, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, S. Arnaud, A. Gejji, A. Martin, F. R. Hogan, D. Dugas, P. Bojanowski, V. Khalidov, P. Labatut, F. Massa, M. Szafraniec, K. Krishnakumar, Y. Li, X. Ma, S. Chandar, F. Meier, Y. LeCun, M. Rabbat, and N. Ballas. V-JEPA 2 : Self-supervi...

  16. [16]

    Babaeizadeh, C

    M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. In International Conference on Learning Representations, 2018

  17. [17]

    M. Baek, F. DiMaio, I. Anishchenko, J. Dauparas, S. Ovchinnikov, G. R. Lee, J. Wang, Q. Cong, L. N. Kinch, R. D. Schaeffer, C. Mill \'a n, H. Park, C. Adams, C. R. Glassman, A. DeGiovanni, J. H. Pereira, A. V. Rodrigues, A. A. van Dijk, A. C. Ebrecht, D. J. Opperman, T. Sagmeister, C. Buhlheller, T. Pavkov-Keller, M. K. Rathinaswamy, U. Dalwadi, C. K. Yip...

  18. [18]

    A. P. Baker, M. J. Brookes, I. A. Rezek, S. M. Smith, T. Behrens, P. J. Probert Smith, and M. Woolrich. Fast transient networks in spontaneous human brain activity. eLife, 3: 0 e01867, 2014

  19. [19]

    Baker, I

    B. Baker, I. Akkaya, P. Zhokhov, J. Huizinga, J. Tang, A. Ecoffet, B. Houghton, R. Sampedro, and J. Clune. Video P re T raining ( VPT ): Learning to act by watching unlabeled online videos. In Advances in Neural Information Processing Systems, volume 35, pages 24639--24654, 2022

  20. [20]

    Baker, R

    C. Baker, R. Saxe, and J. Tenenbaum. Bayesian theory of mind: Modeling joint belief-desire attribution. In Annual Meeting of the Cognitive Science Society, volume 33, 2011

  21. [21]

    Bakhtin, N

    A. Bakhtin, N. Brown, E. Dinan, G. Farina, C. Flaherty, D. Fried, A. Goff, J. Gray, H. Hu, A. P. Jacob, M. Komeili, K. Konath, et al. Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, 378 0 (6624): 0 1067--1074, 2022

  22. [22]

    P. J. Ball, J. Bauer, F. Belletti, B. Brownfield, A. Ephrat, S. Fruchter, A. Gupta, K. Holsheimer, A. Holynski, J. Hron, C. Kaplanis, M. Limont, M. McGill, Y. Oliveira, J. Parker-Holder, F. Perbet, G. Scully, J. Shar, S. Spencer, O. Tov, R. Villegas, E. Wang, J. Yung, C. Baetu, J. Berbel, D. Bridson, J. Bruce, G. Buttimore, S. Chakera, B. Chandra, P. Coll...

  23. [23]

    Bar-Tal, H

    O. Bar-Tal, H. Chefer, O. Tov, C. Herrmann, R. Paiss, S. Zada, A. Ephrat, J. Hur, G. Liu, A. Raj, Y. Li, M. Rubinstein, T. Michaeli, O. Wang, D. Sun, T. Dekel, and I. Mosseri. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia, pages 1--11, 2024

  24. [24]

    Dream to manipulate: Compositional world models empowering robot imitation learning with imagination, 2025

    L. Barcellona, A. Zadaianchuk, D. Allegro, S. Papa, S. Ghidoni, and E. Gavves. Dream to manipulate: Compositional world models empowering robot imitation learning with imagination. arXiv preprint arXiv:2412.14957, 2024

  25. [25]

    Revisiting Feature Prediction for Learning Visual Representations from Video

    A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. G. Rabbat, Y. LeCun, M. Assran, and N. Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

  26. [26]

    Behler and M

    J. Behler and M. Parrinello. Generalized neural-network representation of high-dimensional potential-energy surfaces. Physical Review Letters, 98 0 (14): 0 146401, 2007

  27. [27]

    Beucler, P

    T. Beucler, P. Gentine, J. Yuval, A. Gupta, L. Peng, J. Lin, S. Yu, S. Rasp, F. Ahmed, P. A. O'Gorman, J. D. Neelin, N. J. Lutsko, and M. Pritchard. Climate-invariant machine learning. Science Advances, 10 0 (6): 0 eadj7250, 2024

  28. [28]

    K. Bi, L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619: 0 533--538, 2023

  29. [29]

    H. Bian, L. Kong, H. Xie, L. Pan, Y. Qiao, and Z. Liu. DynamicCity : Large-scale 4D occupancy generation from dynamic scenes. In International Conference on Learning Representations, 2025

  30. [30]

    Bianchi, P

    F. Bianchi, P. J. Chia, M. Yuksekgonul, J. Tagliabue, D. Jurafsky, and J. Zou. How well can LLMs negotiate? negotiationarena platform and analysis. arXiv preprint arXiv:2402.05863, 2024

  31. [31]

    Bodnar, W

    C. Bodnar, W. P. Bruinsma, A. Lucic, M. Stanley, A. Allen, J. Brandstetter, P. Garvan, M. Riechert, J. A. Weyn, H. Dong, J. K. Gupta, K. Thambiratnam, A. T. Archibald, C.-C. Wu, E. Heider, M. Welling, R. E. Turner, and P. Perdikaris. A foundation model for the earth system. Nature, 641 0 (8065): 0 1180--1187, 2025

  32. [32]

    Boella and L

    G. Boella and L. van der Torre. A game-theoretic approach to normative multi-agent systems. In Normative Multi-agent Systems. Schloss Dagstuhl, 2007

  33. [33]

    N. M. Boffi, M. S. Albergo, and E. Vanden-Eijnden. Flow map matching with stochastic interpolants: A mathematical framework for consistency models. Transactions on Machine Learning Research, 2025

  34. [34]

    D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes. Autonomous chemical research with large language models. Nature, 624: 0 570--578, 2023

  35. [35]

    Bolya and J

    D. Bolya and J. Hoffman. Token merging for fast stable diffusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4599--4603, 2023

  36. [36]

    Bolya, C.-Y

    D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman. Token merging: Your vit but faster. In International Conference on Learning Representations, 2023

  37. [37]

    A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller. Augmenting large language models with chemistry tools. Nature Machine Intelligence, 6 0 (5): 0 525--535, 2024

  38. [38]

    Brooks, B

    T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh. Video generation models as world simulators. Technical report, OpenAI, 2024. URL https://openai.com/research/video-generation-models-as-world-simulators

  39. [39]

    T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. A...

  40. [40]

    Bruce, M

    J. Bruce, M. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y. Aytar, S. Bechtle, F. Behbahani, S. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rockt \"a schel. Genie: Generative interactive environments. In International...

  41. [41]

    N. Butt, B. Manczak, A. Wiggers, C. Rainone, D. W. Zhang, M. Defferrard, and T. Cohen. CodeIt : Self-improving language models with prioritized hindsight replay. In International Conference on Machine Learning, pages 5013--5034. PMLR, 2024

  42. [42]

    Caesar, V

    H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuScenes : A multimodal dataset for autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621--11631, 2020

  43. [43]

    P. Cao, T. Men, W. Liu, J. Zhang, X. Li, X. Lin, D. Sui, Y. Cao, K. Liu, and J. Zhao. Large language models for planning: A comprehensive and systematic survey. arXiv preprint arXiv:2505.19683, 2025 a

  44. [44]

    Y. Cao, Y. Zhong, Z. Zeng, L. Zheng, J. Huang, H. Qiu, P. Shi, W. Mao, and G. Wan. MobileDreamer : Generative sketch world model for GUI agent. arXiv preprint arXiv:2601.04035, 2026

  45. [45]

    Z. Cao, F. Hong, Z. Chen, L. Pan, and Z. Liu. PhysX-Anything : Simulation-ready physical 3D assets from single image. arXiv preprint arXiv:2511.13648, 2025 b

  46. [46]

    H. Chae, N. Kim, K. T.-i. Ong, M. Gwak, G. Song, J. Kim, S. Kim, D. Lee, and J. Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. In International Conference on Learning Representations, 2025

  47. [47]

    Y. Chai, L. Deng, R. Shao, J. Zhang, K. Lv, L. Xing, X. Li, H. Zhang, and Y. Liu. Gaf: Gaussian action field as a 4d representation for dynamic world modeling in robotic manipulation. arXiv preprint arXiv:2506.14135, 2025

  48. [48]

    Transdreamer: Reinforcement learning with transformer world models, 2024

    C. Chen, Y.-F. Wu, J. Yoon, and S. Ahn. TransDreamer : Reinforcement learning with transformer world models. arXiv preprint arXiv:2202.09481, 2022

  49. [49]

    D. Chen, M. Shukor, T. Moutakanni, W. Chung, J. Yu, T. Kasarla, Y. Bang, A. Bolourchi, Y. LeCun, and P. Fung. VL-JEPA : Joint embedding predictive architecture for vision-language. arXiv preprint arXiv:2512.10942, 2025 a

  50. [50]

    L. Chen, Y. Meng, C. Tang, X. Ma, J. Jiang, X. Wang, Z. Wang, and W. Zhu. Q-DiT : Accurate post-training quantization for diffusion transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28306--28315, 2025 b

  51. [51]

    R. Chen, W. Jiang, C. Qin, and C. Tan. Theory of mind in large language models: Assessment and enhancement. In Annual Meeting of the Association for Computational Linguistics, pages 31539--31558, 2025 c

  52. [52]

    T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597--1607. PMLR, 2020

  53. [53]

    X. Chen, Y. Chen, Y. Fu, N. Gao, J. Jia, W. Jin, H. Li, Y. Mu, J. Pang, Y. Qiao, Y. Tian, B. Wang, B. Wang, F. Wang, H. Wang, T. Wang, Z. Wang, X. Wei, C. Wu, S. Yang, J. Ye, J. Yu, J. Zeng, J. Zhang, J. Zhang, S. Zhang, F. Zheng, B. Zhou, and Y. Zhu. InternVLA-M1 : A spatially guided vision-language-action framework for generalist robot policy. arXiv pre...

  54. [54]

    Y. Chen, K. Q. Lin, and M. Z. Shou. Code2Video : A code-centric paradigm for educational video generation. arXiv preprint arXiv:2510.01174, 2025 e

  55. [55]

    Y. Chen, P. Li, J. Yang, K. He, X. Wu, Y. Xu, K. Wang, J. Liu, N. Liu, Y. Huang, and L. Wang. BridgeV2W : Bridging video generation models to embodied world models via embodiment masks. arXiv preprint arXiv:2602.03793, 2026

  56. [56]

    Z. Chen, Z. Zhao, K. Zhang, B. Liu, Q. Qi, Y. Wu, T. Kalluri, S. Cao, Y. Xiong, H. Tong, H. Yao, H. Li, J. Zhu, X. Li, D. Song, B. Li, J. Weston, and D. Huynh. Scaling agent learning via experience synthesis. arXiv preprint arXiv:2511.03773, 2025 f

  57. [57]

    S. R. Chitturi, A. Ramdas, Y. Wu, B. Rohr, S. Ermon, J. Dionne, F. H. d. Jornada, M. Dunne, C. Tassone, W. Neiswanger, and D. Ratner. Targeted materials discovery using bayesian algorithm execution. NPJ Computational Materials, 10 0 (1): 0 156, 2024

  58. [58]

    K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, volume 31, pages 4759--4770, 2018

  59. [59]

    Chuang, A

    Y.-S. Chuang, A. Goyal, N. Harlalka, S. Suresh, R. Hawkins, S. Yang, D. Shah, J. Hu, and T. Rogers. Simulating opinion dynamics with networks of llm-based agents. In Findings of the association for computational linguistics: NAACL 2024, pages 3326--3346, 2024

  60. [60]

    A. Clark. Surfing Uncertainty: Prediction, Action, and the Embodied Mind. Oxford University Press, 2015

  61. [61]

    StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing

    S. Community. Starvla: A lego-like codebase for vision-language-action model developing. arXiv preprint arXiv:2604.05014, 2026

  62. [62]

    Cwm: An open-weights llm for research on code generation with world models

    J. Copet, Q. Carbonneaux, G. Cohen, J. Gehring, J. Kahn, J. Kossen, F. Kreuk, E. McMilin, M. Meyer, Y. Wei, D. Zhang, K. Zheng, J. Armengol-Estap \'e , P. Bashiri, M. Beck, et al. CWM : An open-weights LLM for research on code generation with world models. arXiv preprint arXiv:2510.02387, 2025

  63. [63]

    Coutant, K

    A. Coutant, K. Roper, D. Trejo-Banos, D. Bouthinon, M. Carpenter, J. Grzebyta, G. Santini, H. Soldano, M. Elati, J. Ramon, C. Rouveirol, L. N. Soldatova, and R. D. King. Closed-loop cycles of experiment design, execution, and learning accelerate systems biology model development in yeast. Proceedings of the National Academy of Sciences, 116 0 (36): 0 1814...

  64. [64]

    K. J. W. Craik. The Nature of Explanation. Cambridge University Press, 1943

  65. [65]

    P. M. Curvo. The traitors: Deception and trust in multi-agent language model simulations. arXiv preprint arXiv:2505.12923, 2025

  66. [66]

    G. Dai, W. Zhang, J. Li, S. Yang, C. O. lbe, S. Rao, A. Caetano, and M. Sra. Artificial leviathan: Exploring social evolution of llm agents through the lens of hobbesian social contract theory. arXiv preprint arXiv:2406.14373, 2024

  67. [67]

    Dainese, M

    N. Dainese, M. Merler, M. Alakuijala, and P. Marttinen. Generating code world models with large language models guided by Monte Carlo tree search. In Advances in Neural Information Processing Systems, volume 37, pages 60429--60474, 2024

  68. [68]

    A. C. Dama, K. S. Kim, D. M. Leyva, A. P. Lunkes, N. S. Schmid, K. Jijakli, and P. A. Jensen. BacterAI maps microbial metabolism without prior knowledge. Nature Microbiology, 8: 0 1018--1025, 2023

  69. [69]

    Quevedo, Q

    Decart, J. Quevedo, Q. McIntyre, S. Campbell, X. Chen, and R. Wachen. Oasis: A universe in a transformer. Blog post, 2024. URL https://oasis-model.github.io

  70. [70]

    J. Degen. The rational speech act framework. Annual Review of Linguistics, 9 0 (1): 0 519--540, 2023

  71. [71]

    M. P. Deisenroth and C. E. Rasmussen. PILCO : A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pages 465--472, 2011

  72. [72]

    F. Deng, I. Jang, and S. Ahn. DreamerPro : Reconstruction-free model-based reinforcement learning with prototypical representations. In International Conference on Machine Learning, pages 4956--4975. PMLR, 2022

  73. [73]

    X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su. Mind2Web : Towards a generalist agent for the web. In Advances in Neural Information Processing Systems, volume 36, pages 28091--28114, 2023

  74. [74]

    Dettmers, M

    T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. LLM.int8() : 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems, volume 35, pages 30318--30332, 2022

  75. [75]

    Dignum and F

    V. Dignum and F. Dignum. Agentifying agentic AI . arXiv preprint arXiv:2511.17332, 2025

  76. [76]

    J. Ding, Y. Zhang, Y. Shang, J. Feng, Y. Zhang, Z. Zong, Y. Yuan, H. Su, N. Li, J. Piao, Y. Deng, N. Sukiennik, C. Gao, F. Xu, and Y. Li. Understanding world or predicting future? A comprehensive survey of world models. ACM Computing Surveys, 2025 a

  77. [77]

    X. Ding, G. Ding, Y. Guo, and J. Han. Centripetal sgd for pruning very deep convolutional networks with complicated structure. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4943--4953, 2019

  78. [78]

    Z. Ding, C. Jin, D. Liu, H. Zheng, K. K. Singh, Q. Zhang, Y. Kang, Z. Lin, and Y. Liu. Dollar: Few-step video generation via distillation and latent reward optimization. In IEEE/CVF International Conference on Computer Vision, pages 17961--17971, 2025 b

  79. [79]

    Dockhorn, A

    T. Dockhorn, A. Vahdat, and K. Kreis. Genie: Higher-order denoising diffusion solvers. In Advances in Neural Information Processing Systems, volume 35, pages 30150--30166, 2022

  80. [80]

    X. Dong, S. Chen, and S. Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. Advances in Neural Information Processing Systems, 30: 0 4860--4874, 2017

Showing first 80 references.