arxiv: 2604.22748 · v1 · submitted 2026-04-24 · 💻 cs.AI

Recognition: unknown

Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

Meng Chu, Xuan Billy Zhang, Kevin Qinghong Lin, Lingdong Kong, Jize Zhang, Teng Tu, Weijian Ma, Ziqi Huang, Senqiao Yang, Wei Huang, Yeying Jin, Zhefan Rao, Jinhui Ye, Xinyu Lin, Xichen Zhang, Qisheng Hu, Shuai Yang, Leyang Shen, Wei Chow, Yifei Dong, Fengyi Wu, Quanyu Long, Bin Xia, Shaozuo Yu, Mingkang Zhu, Wenhu Zhang, Jiehui Huang, Haokun Gui, Haoxuan Che, Long Chen, Qifeng Chen, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Yang Deng, Yanwei Li, Mike Zheng Shou, Zhi-Qi Cheng, See-kiong Ng, Ziwei Liu, Philip Torr, Jiaya Jia

Pith reviewed 2026-05-08 11:57 UTC · model grok-4.3

classification 💻 cs.AI

keywords: world models, agentic AI, taxonomy, environment modeling, model-based RL, simulation, AI agents, multi-agent systems

The pith

A levels x laws taxonomy classifies world models for agents into three capability stages and four domain law regimes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out a grid that sorts world models according to how much they can do and what kind of rules they must follow. One axis marks three rising stages of ability: simple one-step predictors, full multi-step simulators that follow the rules, and self-revising evolvers that fix their own mistakes. The other axis marks four kinds of rules that shape those models: the rules of physical objects, digital software, social groups, and scientific experiments. By placing more than one hundred existing systems on this grid, the work shows where current methods stop short and what next steps would let agents maintain accurate pictures of the environments they act in.

Core claim

We introduce a levels x laws taxonomy with three capability levels—L1 Predictor for one-step local transitions, L2 Simulator for action-conditioned multi-step rollouts that obey domain laws, and L3 Evolver for autonomous model revision on prediction failures—and four governing-law regimes—physical, digital, social, and scientific—to organize research, expose failure modes, and outline paths from passive prediction toward models agents can use to simulate and reshape their surroundings.

What carries the argument

The levels x laws taxonomy, a two-axis grid that places any world model at the intersection of its capability stage and the type of law it must satisfy.
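
As a minimal sketch of that grid, assuming a plain enumeration is an adequate encoding, the twelve cells can be generated as the product of the two axes. The class names and the `grid_cells` helper are illustrative and do not come from the paper; only the level and regime names are taken from its abstract.

```python
from enum import Enum
from itertools import product


class Level(Enum):
    """Capability levels as named in the paper's abstract."""
    L1_PREDICTOR = 1   # one-step local transitions
    L2_SIMULATOR = 2   # multi-step, action-conditioned rollouts that obey domain laws
    L3_EVOLVER = 3     # autonomous model revision on prediction failures


class Regime(Enum):
    """Governing-law regimes as named in the paper's abstract."""
    PHYSICAL = "physical"
    DIGITAL = "digital"
    SOCIAL = "social"
    SCIENTIFIC = "scientific"


def grid_cells():
    """Return every (level, regime) cell of the grid (hypothetical helper)."""
    return list(product(Level, Regime))


cells = grid_cells()
print(len(cells))  # 12, i.e. 3 levels x 4 regimes
```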

If this is right

Model-based reinforcement learning methods supply the transition operators needed for L2 simulators in physical regimes.
Video-generation pipelines can serve as L1 predictors but must gain explicit action conditioning to reach L2.
Multi-agent social simulations expose distinct failure modes at the L3 boundary where models must revise themselves.
Decision-centric evaluation protocols can replace task-specific benchmarks across all level-regime pairs.
Modular architectures that separate transition learning from law enforcement become a practical route to L3 systems.
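
The relationship between the first two levels, and the modular split between transition learning and law enforcement, can be sketched with a toy example. Everything here is an assumption made for illustration: the one-dimensional state, the additive dynamics in `one_step`, and the `floor_law` constraint are placeholders, not the paper's formulation.

```python
from typing import Callable, List, Sequence

State = float   # toy state: height of an object above the floor
Action = float  # toy action: commanded change in height


def one_step(state: State, action: Action) -> State:
    """L1-style predictor: a single local transition (assumed toy dynamics)."""
    return state + action


def floor_law(state: State) -> bool:
    """Toy domain law: the object can never be below the floor."""
    return state >= 0.0


def rollout(step: Callable[[State, Action], State],
            law: Callable[[State], bool],
            state: State,
            actions: Sequence[Action]) -> List[State]:
    """L2-style simulator: compose one-step predictions over an action
    sequence, stopping at the first state that violates the domain law."""
    trajectory = [state]
    for action in actions:
        state = step(state, action)
        if not law(state):
            break  # the violating state is not appended
        trajectory.append(state)
    return trajectory


print(rollout(one_step, floor_law, 1.0, [0.5, -1.0, -1.0, 2.0]))  # [1.0, 1.5, 0.5]
```

Keeping `step` and `law` as separate arguments mirrors the modular idea in the last item: either component can be swapped without touching the other.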

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The grid suggests experiments that deliberately feed contradictory evidence to an L2 model and measure whether revision occurs without external retraining.
Architectures developed for one regime could be stress-tested by transferring them to another to quantify how law type affects required capacity.
Self-revision at L3 raises the practical question of how to certify that an evolving model remains aligned with human goals over time.
The taxonomy could be extended with a fifth regime for hybrid human-AI environments once enough systems occupy that space.
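
The first of these extensions, a contradictory-evidence probe, might look like the following toy sketch. It assumes a scalar model with a single `gain` parameter, an error threshold, and a least-squares re-fit as the revision step. All of these are hypothetical choices for illustration and say nothing about how any surveyed system revises itself.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class RevisableModel:
    """Toy world model: predicts next_state = state + gain * action."""
    gain: float = 1.0
    threshold: float = 0.25   # mean error that triggers a revision (assumed)
    window: int = 3           # number of recent transitions inspected (assumed)
    revisions: int = 0
    history: List[Tuple[float, float, float]] = field(default_factory=list)

    def predict(self, state: float, action: float) -> float:
        return state + self.gain * action

    def observe(self, state: float, action: float, next_state: float) -> None:
        """Record a transition and revise the gain if recent error is too high."""
        self.history.append((state, action, next_state))
        recent = self.history[-self.window:]
        if len(recent) < self.window:
            return
        mean_error = sum(abs(self.predict(s, a) - n) for s, a, n in recent) / len(recent)
        if mean_error > self.threshold:
            # Least-squares re-fit of the gain on the recent window.
            num = sum(a * (n - s) for s, a, n in recent)
            den = sum(a * a for s, a, n in recent)
            if den > 0:
                self.gain = num / den
                self.revisions += 1


def contradiction_probe(model: RevisableModel, true_gain: float, steps: int) -> int:
    """Feed transitions generated with a gain the model may not expect and
    return how many revisions the model performed on its own."""
    state = 0.0
    for _ in range(steps):
        action = 1.0
        next_state = state + true_gain * action
        model.observe(state, action, next_state)
        state = next_state
    return model.revisions


model = RevisableModel()
print(contradiction_probe(model, true_gain=2.0, steps=5), model.gain)  # 1 2.0
```

A model that never revises would return 0 from the same probe however large the mismatch, which is the distinction the experiment is meant to surface.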

Load-bearing premise

Research on world models can be partitioned into these three levels and four regimes without large overlaps or missing categories that would make the grid useless for guiding design choices.

What would settle it

A complete classification of the surveyed papers and systems reveals many that resist assignment to any single level-regime cell without stretching the definitions or adding new cells.
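
One way to make that test concrete is to count how many systems land in exactly one cell. The sketch below assumes each system has been annotated with the list of cells it plausibly occupies. The system names and annotations are placeholders, not data from the survey.

```python
from typing import Dict, List, Tuple

Cell = Tuple[str, str]  # (level, regime), e.g. ("L2", "physical")


def clean_fraction(assignments: Dict[str, List[Cell]]) -> float:
    """Share of systems that map to exactly one level-regime cell."""
    if not assignments:
        raise ValueError("no systems to audit")
    clean = sum(1 for cells in assignments.values() if len(set(cells)) == 1)
    return clean / len(assignments)


# Placeholder systems, not taken from the paper's survey.
example = {
    "system_a": [("L1", "physical")],
    "system_b": [("L1", "physical"), ("L2", "physical")],  # straddles two levels
    "system_c": [("L2", "social"), ("L2", "digital")],     # straddles two regimes
    "system_d": [],                                        # fits no cell
}
print(clean_fraction(example))  # 0.25
```

A low value on the real survey data would support the objection. What counts as "low" is a judgment the paper would need to state.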

Figures

Figures reproduced from arXiv: 2604.22748 by Bin Xia, Fengyi Wu, Haokun Gui, Haoxuan Che, Jiaya Jia, Jiehui Huang, Jinhui Ye, Jize Zhang, Kevin Qinghong Lin, Leyang Shen, Lingdong Kong, Long Chen, Meng Chu, Mike Zheng Shou, Mingkang Zhu, Philip Torr, Qifeng Chen, Qisheng Hu, Quanyu Long, See-kiong Ng, Senqiao Yang, Shaozuo Yu, Shuai Yang, Teng Tu, Wei Chow, Wei Huang, Weijian Ma, Wenhu Zhang, Wenxuan Zhang, Wenya Wang, Xiaojuan Qi, Xichen Zhang, Xinyu Lin, Xuan Billy Zhang, Yang Deng, Yanwei Li, Yeying Jin, Yifei Dong, Zhefan Rao, Zhi-Qi Cheng, Ziqi Huang, Ziwei Liu.

**Figure 1.** Organizational structure of this survey. The paper is organized around three capability levels (L1 Predictor, L2 Simulator, L3 Evolver) and four governing-law regimes (physical, digital, social, scientific worlds), with supporting sections on evaluation, implementation, and open problems.

**Figure 2.** Positioning of this survey relative to existing world model and agent surveys. Four clusters, Embodied World Models, Generative World Models, Language Agents, and AI for Science, each cover subsets of the field. Our survey (center) integrates cross-domain coverage with a capability-based taxonomy (L1/L2/L3 × four regimes), bridging largely isolated communities.

**Figure 3.** Schematic illustrations of the four governing-law regimes. Representative scenes for each regime: a humanoid agent manipulating blocks (Physical World), code and UI surfaces (Digital World), a network of interacting agents with speech acts (Social World), and instrumented experimentation with robotic microscope and pipette (Scientific World). Each regime's formal constraints are discussed in Section 2.5.

**Figure 4.** Timeline of representative world-modeling systems (2018–2026) organized by capability level. The roadmap shows 70 survey anchors, capped at five systems per year–level cell for readability. L1 Predictor denotes one-step dynamics, L2 Simulator denotes decision-usable multi-step rollout, and L3 Evolver denotes full evidence-driven model revision; partial L3 loops remain in…

**Figure 5.** From local prediction to evidence-driven revision: a hierarchical view of world modeling. Level 1 models empirical regularities for prediction, Level 2 supports possible-world semantics and counterfactual simulation, and Level 3 introduces evidence-driven revision through continual interaction with the environment. This hierarchy frames world modeling as an ascending process from pattern recognition, to t…

**Figure 6.** Historical development of world modeling across four eras: Mathematical Principles (–1956), Symbolic Intelligence (1956–1986), Connectionist Resurgence (1986–2020), and Generative Revolution (2020–present). Two AI winters (1974–1980, 1987–1993) mark transitions between paradigms. See discussions in Section 8.1. This argues that a good representation of world model should be instantiation-agnostic.

**Figure 7.** Unified POMDP graphical model of L1-L3. Dashed circles denote hidden environment states x; double circles denote learned latent states z; shaded circles denote observations o; squares denote actions a. Blue solid arrows denote the learned model (inference q_ϕ and dynamics p_θ); dashed gray arrows denote the environment transition T and observation emission. The top block shows the agent's POMDP under the cur…

**Figure 8.** Diagnostic map of the four governing-law regimes. The axes are schematic rather than metric: the horizontal axis reflects how formally specifiable and mechanically verifiable the transition rules are, while the vertical axis reflects how directly the relevant state and constraints are observable. The purpose of the figure is comparative rather than classificatory: it highlights why different regimes demand…

**Figure 9.** The L3 evolution loop. A full cycle proceeds through four stages: design, execute, observe, and reflect, producing a revised world-modeling stack M_{t+1}. Revision triggers and evolution policy. The reflect stage is responsible for deciding when and how the world model should be revised, in particular distinguishing between incremental improvement and structural change. In practice, this decision is driven by…

**Figure 10.** L3 evolution across four governing-law regimes. Each panel illustrates the design–execute–observe–reflect loop in a representative domain: (a) Physical intelligence: adaptive probing revises contact dynamics; (b) Social intelligence: norm drift triggers social-model revision; (c) Digital intelligence: evaluator-driven program search with regression gates; (d) Scientific intelligence: closed-loop autonomous…

Original abstract

As AI systems move from generating text to accomplishing goals through sustained interaction, the ability to model environment dynamics becomes a central bottleneck. Agents that manipulate objects, navigate software, coordinate with others, or design experiments require predictive environment models, yet the term world model carries different meanings across research communities. We introduce a "levels x laws" taxonomy organized along two axes. The first defines three capability levels: L1 Predictor, which learns one-step local transition operators; L2 Simulator, which composes them into multi-step, action-conditioned rollouts that respect domain laws; and L3 Evolver, which autonomously revises its own model when predictions fail against new evidence. The second identifies four governing-law regimes: physical, digital, social, and scientific. These regimes determine what constraints a world model must satisfy and where it is most likely to fail. Using this framework, we synthesize over 400 works and summarize more than 100 representative systems spanning model-based reinforcement learning, video generation, web and GUI agents, multi-agent social simulation, and AI-driven scientific discovery. We analyze methods, failure modes, and evaluation practices across level-regime pairs, propose decision-centric evaluation principles and a minimal reproducible evaluation package, and outline architectural guidance, open problems, and governance challenges. The resulting roadmap connects previously isolated communities and charts a path from passive next-step prediction toward world models that can simulate, and ultimately reshape, the environments in which agents operate.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a broad survey that proposes a new levels-and-regimes taxonomy for agentic world models and pulls together several fragmented literatures, but the categories overlap enough that the framework may not deliver sharp guidance.

The paper's main contribution is a two-axis taxonomy: three capability levels (L1 one-step local predictors, L2 multi-step action-conditioned simulators that respect domain laws, L3 self-revising evolvers) crossed with four law regimes (physical, digital, social, scientific). The authors use it to synthesize over 400 works and detail more than 100 representative systems from model-based RL, video generation, web/GUI agents, multi-agent social simulation, and AI-driven science. They also discuss failure modes, propose decision-centric evaluation principles, and sketch a minimal reproducible evaluation package plus some architectural suggestions. That synthesis, together with the explicit roadmap across communities, is the part that could actually be useful to someone trying to place their own work or see where the field is heading. The effort to connect these areas and move beyond passive next-step prediction is straightforward and honest about the current bottlenecks in sustained interaction.

The soft spots are exactly where the stress-test note points. Many existing systems straddle L1 and L2 (video models and model-based RL routinely do both one-step and rollout prediction), and social or scientific simulations often embed physical or digital constraints, so the bins are not cleanly disjoint. The abstract and framing treat the mapping as self-evident without showing how the 100+ systems were assigned or what fraction required multi-labeling or forcing. Without that verification or a clear account of overlaps, the taxonomy risks becoming loose labeling rather than a tool that reliably highlights failure modes or dictates architecture choices.

The paper is aimed at researchers already working in any of those subfields who want an organizing lens or a quick way to scan related work. It could also serve as an entry point for people moving into agentic systems.

It deserves serious peer review because the literature synthesis is substantial, the evaluation proposals are concrete, and the topic is timely, even if the taxonomy will need tightening and explicit handling of boundary cases to hold up.

Referee Report

2 major / 2 minor

Summary. The paper proposes a 'levels x laws' taxonomy for world models in agentic AI systems. It defines three capability levels—L1 Predictor (one-step local transition operators), L2 Simulator (composing operators into multi-step action-conditioned rollouts respecting domain laws), and L3 Evolver (autonomous model revision on prediction failures)—crossed with four governing-law regimes (physical, digital, social, scientific). Using this framework, the authors synthesize over 400 works, summarize more than 100 representative systems across model-based RL, video generation, web/GUI agents, multi-agent simulation, and AI-driven discovery, analyze methods/failure modes/evaluations, propose decision-centric evaluation principles with a minimal reproducible package, and outline architectural guidance and open problems.

Significance. If the taxonomy holds as a useful organizing lens, the work could connect previously siloed communities and provide a roadmap from passive prediction toward models that simulate and reshape environments. The large-scale synthesis of 400+ works, the explicit proposal of decision-centric evaluation principles, and the identification of cross-regime failure modes represent concrete strengths that could inform architectural choices and evaluation practices if the categories prove sharp and non-overlapping in practice.

major comments (2)

[Abstract (taxonomy definition) and synthesis of representative systems] The central claim that the levels x laws framework forms a useful partition for synthesizing literature and guiding architecture/evaluation rests on the assumption that the L1/L2/L3 levels and four regimes are sufficiently disjoint. However, the definitions allow substantial overlap: L2 is explicitly described as composing L1 operators, and systems in model-based RL and video generation routinely perform both one-step prediction and multi-step rollouts. The synthesis of 100+ representative systems does not appear to include a quantitative breakdown of clean vs. multi-label assignments or forced categorizations, which is load-bearing for the claim that the taxonomy enables sharp failure-mode analysis.
[Synthesis across level-regime pairs and failure-mode analysis] The paper claims the regimes determine 'what constraints a world model must satisfy and where it is most likely to fail,' yet social simulation papers frequently embed physical or digital constraints. Without explicit discussion of how multi-regime systems are handled in the analysis of methods and failure modes, the framework risks becoming loose labeling rather than a decision-centric tool.

minor comments (2)

[Abstract and introduction] The abstract and introduction use 'over 400 works' and 'more than 100 representative systems' without a clear appendix or table listing the exact selection criteria or full bibliography mapping, which would aid reproducibility of the synthesis.
[Taxonomy introduction] Notation for the levels (L1, L2, L3) and regimes is introduced clearly but could benefit from a single summary table early in the manuscript to facilitate cross-referencing in later sections on evaluation and open problems.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below, offering clarifications on the taxonomy's design and committing to revisions that enhance transparency without altering the core claims.

Referee: [Abstract (taxonomy definition) and synthesis of representative systems] The central claim that the levels x laws framework forms a useful partition for synthesizing literature and guiding architecture/evaluation rests on the assumption that the L1/L2/L3 levels and four regimes are sufficiently disjoint. However, the definitions allow substantial overlap: L2 is explicitly described as composing L1 operators, and systems in model-based RL and video generation routinely perform both one-step prediction and multi-step rollouts. The synthesis of 100+ representative systems does not appear to include a quantitative breakdown of clean vs. multi-label assignments or forced categorizations, which is load-bearing for the claim that the taxonomy enables sharp failure-mode analysis.

Authors: We appreciate the referee's point on potential overlaps. The levels are defined as progressive capability stages rather than mutually exclusive implementations: L1 centers on learning local one-step operators, L2 on their composition into law-respecting multi-step rollouts, and L3 on autonomous model revision. Systems are classified by the highest level for which they provide clear evidence, even if lower-level components are present. While the manuscript does not currently include a quantitative breakdown of assignments, we will add a supplementary table in the revision that lists the 100+ representative systems with their primary level-regime classification, notes on any hybrid aspects, and justification for each assignment. This addition will make the failure-mode analysis more rigorous and demonstrate the taxonomy's partitioning utility.
Revision: yes
Referee: [Synthesis across level-regime pairs and failure-mode analysis] The paper claims the regimes determine 'what constraints a world model must satisfy and where it is most likely to fail,' yet social simulation papers frequently embed physical or digital constraints. Without explicit discussion of how multi-regime systems are handled in the analysis of methods and failure modes, the framework risks becoming loose labeling rather than a decision-centric tool.

Authors: We agree that hybrid systems are common, with social simulations often incorporating physical or digital constraints. The framework classifies each system according to its dominant regime—the one that primarily dictates the key constraints and associated failure modes analyzed in the synthesis. To strengthen this, we will insert a new subsection on cross-regime systems in the revised manuscript. This subsection will explicitly discuss handling of multi-regime cases, provide concrete examples from the reviewed literature, and explain how intersecting constraints are accounted for in the method and failure-mode analysis. These changes will clarify the framework's application as a decision-centric lens while preserving its utility for identifying primary risks.
Revision: yes
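
The two assignment rules stated in these simulated responses (highest level with clear evidence, and dominant regime) can be written down as a small sketch. The `Evidence` schema and the numeric `regime_weights` are assumptions introduced here. The responses do not say how dominance would be measured.

```python
from dataclasses import dataclass
from typing import Dict, Set, Tuple

LEVEL_ORDER = ("L1", "L2", "L3")


@dataclass
class Evidence:
    """Annotator's record for one system (hypothetical schema)."""
    levels_shown: Set[str]            # levels with clear evidence
    regime_weights: Dict[str, float]  # assumed score of how strongly each regime applies


def classify(evidence: Evidence) -> Tuple[str, str, bool]:
    """Return (primary level, dominant regime, is_hybrid)."""
    if not evidence.levels_shown or not evidence.regime_weights:
        raise ValueError("system has no level or regime evidence")
    # Rule 1: highest level for which there is clear evidence.
    level = max(evidence.levels_shown, key=LEVEL_ORDER.index)
    # Rule 2: the regime with the largest weight is treated as dominant.
    regime = max(evidence.regime_weights, key=evidence.regime_weights.get)
    # A system is flagged hybrid if it spans several levels or active regimes.
    active_regimes = sum(1 for w in evidence.regime_weights.values() if w > 0)
    is_hybrid = len(evidence.levels_shown) > 1 or active_regimes > 1
    return level, regime, is_hybrid


# Placeholder annotation, not a system from the survey.
record = Evidence(levels_shown={"L1", "L2"},
                  regime_weights={"social": 0.7, "physical": 0.3})
print(classify(record))  # ('L2', 'social', True)
```

Recording the `is_hybrid` flag alongside the primary cell is what would let the promised supplementary table report how many assignments were clean.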

Circularity Check

0 steps flagged

No circularity: taxonomy is an explicit proposal synthesizing external literature

The paper defines its L1/L2/L3 levels and physical/digital/social/scientific regimes as a new organizing framework, then applies it to categorize >400 external works. No equations, fitted parameters, or predictions are present; the taxonomy is not derived from prior results but proposed outright. No self-citation chains or self-definitional reductions appear in the load-bearing claims. The central contribution is the synthesis and roadmap itself, which remains independent of the framework's internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 4 invented entities

The paper introduces new conceptual categories (levels and regimes) based on synthesis of external works; relies on domain assumptions about agent bottlenecks without independent empirical validation in the provided abstract.

axioms (1)

Domain assumption: The ability to model environment dynamics is a central bottleneck for agents that accomplish goals through sustained interaction.
Opening motivation stated in the abstract.

invented entities (4)

L1 Predictor (no independent evidence)
purpose: Learns one-step local transition operators
Newly defined capability level in the taxonomy.

L2 Simulator (no independent evidence)
purpose: Composes transitions into multi-step, action-conditioned rollouts that respect domain laws
Newly defined capability level in the taxonomy.

L3 Evolver (no independent evidence)
purpose: Autonomously revises its own model when predictions fail against new evidence
Newly defined capability level in the taxonomy.

Four governing-law regimes (physical, digital, social, scientific) (no independent evidence)
purpose: Determine constraints a world model must satisfy and where it is most likely to fail
Newly defined axis in the taxonomy.

pith-pipeline@v0.9.0 · 5721 in / 1455 out tokens · 127924 ms · 2026-05-08T11:57:28.689577+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Graph World Models: Concepts, Taxonomy, and Future Directions
cs.AI 2026-04 unverdicted novelty 7.0

The paper unifies emerging graph-based world models under a new paradigm and proposes a taxonomy organized by spatial, physical, and logical relational inductive biases.

Reference graph

Works this paper leans on

300 extracted references · 89 canonical work pages · cited by 1 Pith paper · 16 internal anchors

[1]

Abramson, J

J. Abramson, J. Adler, J. Dunger, R. Evans, T. Green, A. Pritzel, O. Ronneberger, L. Willmore, A. J. Ballard, J. Bambrick, S. W. Bodenstein, D. A. Evans, C.-C. Hung, M. O'Neill, D. Reiman, K. Tunyasuvunakool, Z. Wu, A. Z emguly \. t \. e , E. Arvaniti, C. Beattie, O. Bertolli, A. Bridgland, A. Cherepanov, M. Congreve, A. I. Cowen-Rivers, A. Cowie, M. Figu...

2024
[2]

Cosmos World Foundation Model Platform for Physical AI

N. Agarwal, A. Ali, M. Bala, Y. Balaji, E. Barker, T. Cai, P. Chattopadhyay, Y. Chen, Y. Cui, Y. Ding, D. Dworakowski, J. Fan, M. Fenzi, F. Ferroni, S. Fidler, D. Fox, S. Ge, Y. Ge, J. Gu, S. Gururani, E. He, J. Huang, J. Huffman, P. Jannaty, J. Jin, S. W. Kim, G. Klár, G. Lam, S. Lan, L. Leal-Taixe, A. Li, Z. Li, C.-H. Lin, T.-Y. Lin, H. Ling, M.-Y. Liu,...

work page internal anchor Pith review arXiv 2025
[3]

Agarwal, M

R. Agarwal, M. Schwarzer, P. S. Castro, A. C. Courville, and M. G. Bellemare. Deep reinforcement learning at the edge of the statistical precipice. In Advances in Neural Information Processing Systems, volume 34, 2021

2021
[4]

Agrawal, A

P. Agrawal, A. V. Nair, P. Abbeel, J. Malik, and S. Levine. Learning to poke by poking: Experiential learning of intuitive physics. In Advances in Neural Information Processing Systems, volume 29, pages 5092--5100, 2016

2016
[5]

A. AL, A. Ahn, N. Becker, S. Carroll, N. Christie, M. Cortes, A. Demirci, M. Du, F. Li, S. Luo, P. Y. Wang, M. Willows, F. Yang, and G. R. Yang. Project Sid : Many-agent simulations toward AI civilization. arXiv preprint arXiv:2411.00114, 2024

work page arXiv 2024
[6]

Alonso, A

E. Alonso, A. Jelley, V. Micheli, A. Kanervisto, A. Storkey, T. Pearce, and F. Fleuret. Diffusion for world modeling: Visual details matter in atari. Advances in Neural Information Processing Systems, 37: 0 58757--58791, 2024

2024
[7]

Andrychowicz, M

M. Andrychowicz, M. Denil, S. Gomez, M. W. Hoffman, D. Pfau, T. Schaul, B. Shillingford, and N. de Freitas. Learning to learn by gradient descent by gradient descent. In Advances in Neural Information Processing Systems, volume 29, pages 3988--3996, 2016

2016
[8]

Angermueller, D

C. Angermueller, D. Belanger, A. Gane, Z. Mariet, D. Dohan, K. Murphy, L. Colwell, and D. Sculley. Population-based black-box optimization for biological sequence design. In International Conference on Machine Learning, pages 324--334. PMLR, 2020

2020
[9]

Effective context engineering for AI agents

Anthropic . Effective context engineering for AI agents. Anthropic Engineering Blog, 2025. URL https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents

2025
[10]

L. P. Argyle, E. C. Busby, N. Fulda, J. R. Gubler, C. Rytting, and D. Wingate. Out of one, many: Using language models to simulate human samples. Political Analysis, 31 0 (3): 0 337--351, 2023

2023
[11]

Arunkumar, G

V. Arunkumar, G. R. Gangadharan, and R. Buyya. Agentic artificial intelligence ( AI ): Architectures, taxonomies, and evaluation of large language model agents. arXiv preprint arXiv:2601.12560, 2026

work page arXiv 2026
[12]

A. F. Ashery, L. M. Aiello, and A. Baronchelli. Emergent social conventions and collective bias in LLM populations. Science Advances, 11, 2025

2025
[13]

Ashkboos, A

S. Ashkboos, A. Mohtashami, M. L. Croci, B. Li, P. Cameron, M. Jaggi, D. Alistarh, T. Hoefler, and J. Hensman. QuaRot : Outlier-free 4-bit inference in rotated LLMs . In Advances in Neural Information Processing Systems, volume 37, pages 100213--100240, 2024

2024
[14]

Assran, Q

M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y. LeCun, and N. Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15619--15629, 2023

2023
[15]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, Mojtaba, Komeili, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zholus, S. Arnaud, A. Gejji, A. Martin, F. R. Hogan, D. Dugas, P. Bojanowski, V. Khalidov, P. Labatut, F. Massa, M. Szafraniec, K. Krishnakumar, Y. Li, X. Ma, S. Chandar, F. Meier, Y. LeCun, M. Rabbat, and N. Ballas. V-JEPA 2 : Self-supervi...

work page internal anchor Pith review arXiv 2025
[16]

Babaeizadeh, C

M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and S. Levine. Stochastic variational video prediction. In International Conference on Learning Representations, 2018

2018
[17]

M. Baek, F. DiMaio, I. Anishchenko, J. Dauparas, S. Ovchinnikov, G. R. Lee, J. Wang, Q. Cong, L. N. Kinch, R. D. Schaeffer, C. Mill \'a n, H. Park, C. Adams, C. R. Glassman, A. DeGiovanni, J. H. Pereira, A. V. Rodrigues, A. A. van Dijk, A. C. Ebrecht, D. J. Opperman, T. Sagmeister, C. Buhlheller, T. Pavkov-Keller, M. K. Rathinaswamy, U. Dalwadi, C. K. Yip...

2021
[18]

A. P. Baker, M. J. Brookes, I. A. Rezek, S. M. Smith, T. Behrens, P. J. Probert Smith, and M. Woolrich. Fast transient networks in spontaneous human brain activity. eLife, 3: 0 e01867, 2014

2014
[19]

Baker, I

B. Baker, I. Akkaya, P. Zhokhov, J. Huizinga, J. Tang, A. Ecoffet, B. Houghton, R. Sampedro, and J. Clune. Video P re T raining ( VPT ): Learning to act by watching unlabeled online videos. In Advances in Neural Information Processing Systems, volume 35, pages 24639--24654, 2022

2022
[20]

Baker, R

C. Baker, R. Saxe, and J. Tenenbaum. Bayesian theory of mind: Modeling joint belief-desire attribution. In Annual Meeting of the Cognitive Science Society, volume 33, 2011

2011
[21]

Bakhtin, N

A. Bakhtin, N. Brown, E. Dinan, G. Farina, C. Flaherty, D. Fried, A. Goff, J. Gray, H. Hu, A. P. Jacob, M. Komeili, K. Konath, et al. Human-level play in the game of diplomacy by combining language models with strategic reasoning. Science, 378 0 (6624): 0 1067--1074, 2022

2022
[22]

P. J. Ball, J. Bauer, F. Belletti, B. Brownfield, A. Ephrat, S. Fruchter, A. Gupta, K. Holsheimer, A. Holynski, J. Hron, C. Kaplanis, M. Limont, M. McGill, Y. Oliveira, J. Parker-Holder, F. Perbet, G. Scully, J. Shar, S. Spencer, O. Tov, R. Villegas, E. Wang, J. Yung, C. Baetu, J. Berbel, D. Bridson, J. Bruce, G. Buttimore, S. Chakera, B. Chandra, P. Coll...

2025
[23]

Bar-Tal, H

O. Bar-Tal, H. Chefer, O. Tov, C. Herrmann, R. Paiss, S. Zada, A. Ephrat, J. Hur, G. Liu, A. Raj, Y. Li, M. Rubinstein, T. Michaeli, O. Wang, D. Sun, T. Dekel, and I. Mosseri. Lumiere: A space-time diffusion model for video generation. In SIGGRAPH Asia, pages 1--11, 2024

2024
[24]

Dream to manipulate: Compositional world models empowering robot imitation learning with imagination, 2025

L. Barcellona, A. Zadaianchuk, D. Allegro, S. Papa, S. Ghidoni, and E. Gavves. Dream to manipulate: Compositional world models empowering robot imitation learning with imagination. arXiv preprint arXiv:2412.14957, 2024

work page arXiv 2024
[25]

Revisiting Feature Prediction for Learning Visual Representations from Video

A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. G. Rabbat, Y. LeCun, M. Assran, and N. Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471, 2024

work page internal anchor Pith review arXiv 2024
[26]

Behler and M

J. Behler and M. Parrinello. Generalized neural-network representation of high-dimensional potential-energy surfaces. Physical Review Letters, 98 0 (14): 0 146401, 2007

2007
[27]

Beucler, P

T. Beucler, P. Gentine, J. Yuval, A. Gupta, L. Peng, J. Lin, S. Yu, S. Rasp, F. Ahmed, P. A. O'Gorman, J. D. Neelin, N. J. Lutsko, and M. Pritchard. Climate-invariant machine learning. Science Advances, 10 0 (6): 0 eadj7250, 2024

2024
[28]

K. Bi, L. Xie, H. Zhang, X. Chen, X. Gu, and Q. Tian. Accurate medium-range global weather forecasting with 3D neural networks. Nature, 619: 0 533--538, 2023

2023
[29]

H. Bian, L. Kong, H. Xie, L. Pan, Y. Qiao, and Z. Liu. DynamicCity : Large-scale 4D occupancy generation from dynamic scenes. In International Conference on Learning Representations, 2025

2025
[30]

Bianchi, P

F. Bianchi, P. J. Chia, M. Yuksekgonul, J. Tagliabue, D. Jurafsky, and J. Zou. How well can LLMs negotiate? negotiationarena platform and analysis. arXiv preprint arXiv:2402.05863, 2024

work page arXiv 2024
[31]

Bodnar, W

C. Bodnar, W. P. Bruinsma, A. Lucic, M. Stanley, A. Allen, J. Brandstetter, P. Garvan, M. Riechert, J. A. Weyn, H. Dong, J. K. Gupta, K. Thambiratnam, A. T. Archibald, C.-C. Wu, E. Heider, M. Welling, R. E. Turner, and P. Perdikaris. A foundation model for the earth system. Nature, 641 0 (8065): 0 1180--1187, 2025

2025
[32]

Boella and L

G. Boella and L. van der Torre. A game-theoretic approach to normative multi-agent systems. In Normative Multi-agent Systems. Schloss Dagstuhl, 2007

2007
[33]

N. M. Boffi, M. S. Albergo, and E. Vanden-Eijnden. Flow map matching with stochastic interpolants: A mathematical framework for consistency models. Transactions on Machine Learning Research, 2025

2025
[34]

D. A. Boiko, R. MacKnight, B. Kline, and G. Gomes. Autonomous chemical research with large language models. Nature, 624: 0 570--578, 2023

2023
[35]

Bolya and J

D. Bolya and J. Hoffman. Token merging for fast stable diffusion. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pages 4599--4603, 2023

2023
[36]

D. Bolya, C.-Y. Fu, X. Dai, P. Zhang, C. Feichtenhofer, and J. Hoffman. Token merging: Your ViT but faster. In International Conference on Learning Representations, 2023

[37]

A. M. Bran, S. Cox, O. Schilter, C. Baldassari, A. D. White, and P. Schwaller. Augmenting large language models with chemistry tools. Nature Machine Intelligence, 6(5): 525–535, 2024

[38]

T. Brooks, B. Peebles, C. Holmes, W. DePue, Y. Guo, L. Jing, D. Schnurr, J. Taylor, T. Luhman, E. Luhman, C. Ng, R. Wang, and A. Ramesh. Video generation models as world simulators. Technical report, OpenAI, 2024. URL https://openai.com/research/video-generation-models-as-world-simulators

[39]

T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. A...

[40]

J. Bruce, M. Dennis, A. Edwards, J. Parker-Holder, Y. Shi, E. Hughes, M. Lai, A. Mavalankar, R. Steigerwald, C. Apps, Y. Aytar, S. Bechtle, F. Behbahani, S. Chan, N. Heess, L. Gonzalez, S. Osindero, S. Ozair, S. Reed, J. Zhang, K. Zolna, J. Clune, N. de Freitas, S. Singh, and T. Rocktäschel. Genie: Generative interactive environments. In International...

[41]

N. Butt, B. Manczak, A. Wiggers, C. Rainone, D. W. Zhang, M. Defferrard, and T. Cohen. CodeIt: Self-improving language models with prioritized hindsight replay. In International Conference on Machine Learning, pages 5013–5034. PMLR, 2024

[42]

H. Caesar, V. Bankiti, A. H. Lang, S. Vora, V. E. Liong, Q. Xu, A. Krishnan, Y. Pan, G. Baldan, and O. Beijbom. nuScenes: A multimodal dataset for autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11621–11631, 2020

[43]

P. Cao, T. Men, W. Liu, J. Zhang, X. Li, X. Lin, D. Sui, Y. Cao, K. Liu, and J. Zhao. Large language models for planning: A comprehensive and systematic survey. arXiv preprint arXiv:2505.19683, 2025a

[44]

Y. Cao, Y. Zhong, Z. Zeng, L. Zheng, J. Huang, H. Qiu, P. Shi, W. Mao, and G. Wan. MobileDreamer: Generative sketch world model for GUI agent. arXiv preprint arXiv:2601.04035, 2026

[45]

Z. Cao, F. Hong, Z. Chen, L. Pan, and Z. Liu. PhysX-Anything: Simulation-ready physical 3D assets from single image. arXiv preprint arXiv:2511.13648, 2025b

[46]

H. Chae, N. Kim, K. T.-i. Ong, M. Gwak, G. Song, J. Kim, S. Kim, D. Lee, and J. Yeo. Web agents with world models: Learning and leveraging environment dynamics in web navigation. In International Conference on Learning Representations, 2025

[47]

Y. Chai, L. Deng, R. Shao, J. Zhang, K. Lv, L. Xing, X. Li, H. Zhang, and Y. Liu. GAF: Gaussian action field as a 4D representation for dynamic world modeling in robotic manipulation. arXiv preprint arXiv:2506.14135, 2025

[48]

C. Chen, Y.-F. Wu, J. Yoon, and S. Ahn. TransDreamer: Reinforcement learning with transformer world models. arXiv preprint arXiv:2202.09481, 2022

[49]

D. Chen, M. Shukor, T. Moutakanni, W. Chung, J. Yu, T. Kasarla, Y. Bang, A. Bolourchi, Y. LeCun, and P. Fung. VL-JEPA: Joint embedding predictive architecture for vision-language. arXiv preprint arXiv:2512.10942, 2025a

[50]

L. Chen, Y. Meng, C. Tang, X. Ma, J. Jiang, X. Wang, Z. Wang, and W. Zhu. Q-DiT: Accurate post-training quantization for diffusion transformers. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28306–28315, 2025b

[51]

R. Chen, W. Jiang, C. Qin, and C. Tan. Theory of mind in large language models: Assessment and enhancement. In Annual Meeting of the Association for Computational Linguistics, pages 31539–31558, 2025c

[52]

T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning, pages 1597–1607. PMLR, 2020

[53]

X. Chen, Y. Chen, Y. Fu, N. Gao, J. Jia, W. Jin, H. Li, Y. Mu, J. Pang, Y. Qiao, Y. Tian, B. Wang, B. Wang, F. Wang, H. Wang, T. Wang, Z. Wang, X. Wei, C. Wu, S. Yang, J. Ye, J. Yu, J. Zeng, J. Zhang, J. Zhang, S. Zhang, F. Zheng, B. Zhou, and Y. Zhu. InternVLA-M1: A spatially guided vision-language-action framework for generalist robot policy. arXiv pre...

[54]

Y. Chen, K. Q. Lin, and M. Z. Shou. Code2Video: A code-centric paradigm for educational video generation. arXiv preprint arXiv:2510.01174, 2025e

[55]

Y. Chen, P. Li, J. Yang, K. He, X. Wu, Y. Xu, K. Wang, J. Liu, N. Liu, Y. Huang, and L. Wang. BridgeV2W: Bridging video generation models to embodied world models via embodiment masks. arXiv preprint arXiv:2602.03793, 2026

[56]

Z. Chen, Z. Zhao, K. Zhang, B. Liu, Q. Qi, Y. Wu, T. Kalluri, S. Cao, Y. Xiong, H. Tong, H. Yao, H. Li, J. Zhu, X. Li, D. Song, B. Li, J. Weston, and D. Huynh. Scaling agent learning via experience synthesis. arXiv preprint arXiv:2511.03773, 2025f

[57]

S. R. Chitturi, A. Ramdas, Y. Wu, B. Rohr, S. Ermon, J. Dionne, F. H. d. Jornada, M. Dunne, C. Tassone, W. Neiswanger, and D. Ratner. Targeted materials discovery using Bayesian algorithm execution. npj Computational Materials, 10(1): 156, 2024

[58]

K. Chua, R. Calandra, R. McAllister, and S. Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models. In Advances in Neural Information Processing Systems, volume 31, pages 4759–4770, 2018

[59]

Y.-S. Chuang, A. Goyal, N. Harlalka, S. Suresh, R. Hawkins, S. Yang, D. Shah, J. Hu, and T. Rogers. Simulating opinion dynamics with networks of LLM-based agents. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 3326–3346, 2024

[60]

A. Clark. Surfing Uncertainty: Prediction, Action, and the Embodied Mind. Oxford University Press, 2015

[61]

S. Community. StarVLA: A Lego-like codebase for vision-language-action model developing. arXiv preprint arXiv:2604.05014, 2026

[62]

J. Copet, Q. Carbonneaux, G. Cohen, J. Gehring, J. Kahn, J. Kossen, F. Kreuk, E. McMilin, M. Meyer, Y. Wei, D. Zhang, K. Zheng, J. Armengol-Estapé, P. Bashiri, M. Beck, et al. CWM: An open-weights LLM for research on code generation with world models. arXiv preprint arXiv:2510.02387, 2025

[63]

A. Coutant, K. Roper, D. Trejo-Banos, D. Bouthinon, M. Carpenter, J. Grzebyta, G. Santini, H. Soldano, M. Elati, J. Ramon, C. Rouveirol, L. N. Soldatova, and R. D. King. Closed-loop cycles of experiment design, execution, and learning accelerate systems biology model development in yeast. Proceedings of the National Academy of Sciences, 116(36): 1814...

[64]

K. J. W. Craik. The Nature of Explanation. Cambridge University Press, 1943

[65]

P. M. Curvo. The traitors: Deception and trust in multi-agent language model simulations. arXiv preprint arXiv:2505.12923, 2025

[66]

G. Dai, W. Zhang, J. Li, S. Yang, C. O. Ibe, S. Rao, A. Caetano, and M. Sra. Artificial leviathan: Exploring social evolution of LLM agents through the lens of Hobbesian social contract theory. arXiv preprint arXiv:2406.14373, 2024

[67]

N. Dainese, M. Merler, M. Alakuijala, and P. Marttinen. Generating code world models with large language models guided by Monte Carlo tree search. In Advances in Neural Information Processing Systems, volume 37, pages 60429–60474, 2024

[68]

A. C. Dama, K. S. Kim, D. M. Leyva, A. P. Lunkes, N. S. Schmid, K. Jijakli, and P. A. Jensen. BacterAI maps microbial metabolism without prior knowledge. Nature Microbiology, 8: 1018–1025, 2023

[69]

Decart, J. Quevedo, Q. McIntyre, S. Campbell, X. Chen, and R. Wachen. Oasis: A universe in a transformer. Blog post, 2024. URL https://oasis-model.github.io

[70]

J. Degen. The rational speech act framework. Annual Review of Linguistics, 9(1): 519–540, 2023

[71]

M. P. Deisenroth and C. E. Rasmussen. PILCO: A model-based and data-efficient approach to policy search. In International Conference on Machine Learning, pages 465–472, 2011

[72]

F. Deng, I. Jang, and S. Ahn. DreamerPro: Reconstruction-free model-based reinforcement learning with prototypical representations. In International Conference on Machine Learning, pages 4956–4975. PMLR, 2022

[73]

X. Deng, Y. Gu, B. Zheng, S. Chen, S. Stevens, B. Wang, H. Sun, and Y. Su. Mind2Web: Towards a generalist agent for the web. In Advances in Neural Information Processing Systems, volume 36, pages 28091–28114, 2023

[74]

T. Dettmers, M. Lewis, Y. Belkada, and L. Zettlemoyer. LLM.int8(): 8-bit matrix multiplication for transformers at scale. In Advances in Neural Information Processing Systems, volume 35, pages 30318–30332, 2022

[75]

V. Dignum and F. Dignum. Agentifying agentic AI. arXiv preprint arXiv:2511.17332, 2025

[76]

J. Ding, Y. Zhang, Y. Shang, J. Feng, Y. Zhang, Z. Zong, Y. Yuan, H. Su, N. Li, J. Piao, Y. Deng, N. Sukiennik, C. Gao, F. Xu, and Y. Li. Understanding world or predicting future? A comprehensive survey of world models. ACM Computing Surveys, 2025a

[77]

X. Ding, G. Ding, Y. Guo, and J. Han. Centripetal SGD for pruning very deep convolutional networks with complicated structure. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4943–4953, 2019

[78]

Z. Ding, C. Jin, D. Liu, H. Zheng, K. K. Singh, Q. Zhang, Y. Kang, Z. Lin, and Y. Liu. Dollar: Few-step video generation via distillation and latent reward optimization. In IEEE/CVF International Conference on Computer Vision, pages 17961–17971, 2025b

[79]

T. Dockhorn, A. Vahdat, and K. Kreis. GENIE: Higher-order denoising diffusion solvers. In Advances in Neural Information Processing Systems, volume 35, pages 30150–30166, 2022

[80]

X. Dong, S. Chen, and S. Pan. Learning to prune deep neural networks via layer-wise optimal brain surgeon. Advances in Neural Information Processing Systems, 30: 4860–4874, 2017


Showing first 80 references.