arxiv: 2605.01694 · v1 · submitted 2026-05-03 · 💻 cs.AI


Latent State Design for World Models under Sufficiency Constraints

Keon Woo Kim

Pith reviewed 2026-05-10 15:53 UTC · model grok-4.3

classification 💻 cs.AI

keywords world models · latent states · sufficiency constraints · functional taxonomy · predictive embedding · planning interface · counterfactual support · evaluation matrix


The pith

A world model is actionable when its latent state is constructed to match the agent's task rather than to retain the most information.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes world-model research as latent state design under sufficiency constraints, where each state must support a specific function such as prediction or planning while discarding irrelevant details. It introduces a functional taxonomy that organizes methods by the intended role of the latent state rather than by network architecture or application area. This taxonomy underpins a seven-axis evaluation matrix that diagnoses what information a given state preserves, discards, and enables. The resulting view is that effectiveness comes from alignment between state construction and task demands, not from maximizing retained information.

Core claim

World models matter to agents only through the states they construct, and these states must satisfy sufficiency constraints tied to concrete functions: prediction, control, planning, memory, grounding, or counterfactual reasoning. By grouping methods into roles such as predictive embedding, recurrent belief state, object or causal structure, latent action interface, grounded planning interface, and memory substrate, the paper shows that architecture-based classifications obscure important gaps, including the difference between predictive sufficiency and control sufficiency and between passive prediction and counterfactual modeling. Evaluation along the seven axes of representation, prediction

What carries the argument

A functional taxonomy that classifies latent states by their intended role (predictive embedding, recurrent belief state, object/causal structure, latent action interface, grounded planning interface, memory substrate), supported by a seven-axis evaluation matrix that diagnoses preservation, discarding, and enabling capabilities.
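One way to picture the taxonomy and the matrix together is as a small data structure. The sketch below is illustrative only: the six role names and seven axis names come from the paper's abstract, while the `Support` scale, the `MatrixRow` type, and the example row `toy-video-predictor` are assumptions introduced here and do not reproduce the paper's actual matrix entries.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Role(Enum):
    """The six functional roles named in the taxonomy."""
    PREDICTIVE_EMBEDDING = "predictive embedding"
    RECURRENT_BELIEF_STATE = "recurrent belief state"
    OBJECT_CAUSAL_STRUCTURE = "object/causal structure"
    LATENT_ACTION_INTERFACE = "latent action interface"
    GROUNDED_PLANNING_INTERFACE = "grounded planning interface"
    MEMORY_SUBSTRATE = "memory substrate"


class Axis(Enum):
    """The seven evaluation axes named in the abstract."""
    REPRESENTATION = "representation"
    PREDICTION = "prediction"
    PLANNING = "planning"
    CONTROLLABILITY = "controllability"
    CAUSAL_COUNTERFACTUAL = "causal/counterfactual support"
    MEMORY = "memory"
    UNCERTAINTY = "uncertainty"


class Support(Enum):
    """Hypothetical three-level scale (an assumption, not the paper's)."""
    NONE = 0
    PARTIAL = 1
    FULL = 2


@dataclass
class MatrixRow:
    """One method's row in the diagnostic matrix."""
    method: str  # placeholder method name
    role: Role   # what the latent state is for
    scores: Dict[Axis, Support] = field(default_factory=dict)

    def enables(self, axis: Axis) -> bool:
        # Axes without an entry are treated as unsupported.
        return self.scores.get(axis, Support.NONE) is not Support.NONE

    def gaps(self) -> List[Axis]:
        # Axes the latent state does not support, in declaration order.
        return [axis for axis in Axis if not self.enables(axis)]


# Made-up row: a state built only for predictive sufficiency.
row = MatrixRow(
    method="toy-video-predictor",
    role=Role.PREDICTIVE_EMBEDDING,
    scores={Axis.REPRESENTATION: Support.FULL, Axis.PREDICTION: Support.FULL},
)
print([axis.value for axis in row.gaps()])
```

Reading a row by its gaps rather than by its strengths mirrors the diagnostic use described above: the question is what the state enables, not how much it retains.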

If this is right

Predictive sufficiency does not guarantee control sufficiency, so models built for video prediction often fail when actions must be chosen.
Passive prediction models are distinguished from those that support counterfactual reasoning about interventions.
Evaluation should focus on what a latent state enables for the agent rather than on raw information content.
Methods can be compared directly by the sufficiency constraints they were designed to meet instead of by architectural similarity.
The most useful world model for any given application is the one whose state construction is matched to that application's requirements.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The taxonomy could serve as a checklist for designing hybrid models that satisfy multiple sufficiency constraints at once.
Benchmarks for world models might shift from generic reconstruction accuracy to targeted tests of each sufficiency axis.
In embodied settings the framework suggests prioritizing minimal states that are grounded for planning over richer but ungrounded representations.
Automated search over latent-state designs could optimize directly for the relevant subset of the seven axes rather than for a single reconstruction loss.
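The last extension can be made concrete with a toy selection rule. This is a minimal sketch under stated assumptions: the function names `subset_score` and `best_design`, the design names, and every number are invented for illustration, and averaging the relevant axes is just one possible objective, not something the paper specifies.

```python
from typing import Mapping, Sequence


def subset_score(axis_scores: Mapping[str, float],
                 relevant_axes: Sequence[str]) -> float:
    """Mean score over only the axes the application cares about."""
    if not relevant_axes:
        raise ValueError("relevant_axes must name at least one axis")
    return sum(axis_scores[axis] for axis in relevant_axes) / len(relevant_axes)


def best_design(designs: Mapping[str, Mapping[str, float]],
                relevant_axes: Sequence[str]) -> str:
    """Name of the design with the highest score on the relevant axes."""
    return max(designs, key=lambda name: subset_score(designs[name], relevant_axes))


# Made-up per-axis scores in [0, 1] for two hypothetical designs.
designs = {
    "rich_reconstructive": {
        "representation": 0.9, "prediction": 0.9, "planning": 0.3,
        "controllability": 0.2,
    },
    "minimal_grounded": {
        "representation": 0.4, "prediction": 0.5, "planning": 0.8,
        "controllability": 0.7,
    },
}

print(best_design(designs, ["planning", "controllability"]))
print(best_design(designs, ["representation", "prediction"]))
```

With these invented numbers the preferred design flips when the relevant subset changes, which is the point of optimizing for a task-specific subset instead of a single reconstruction loss.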

Load-bearing premise

That the proposed functional taxonomy of six roles and the seven-axis evaluation matrix capture the essential distinctions among world models, and that sufficiency constraints are the right primary lens for organizing the field.

What would settle it

A head-to-head comparison on a shared planning or control benchmark in which a high-capacity model that maximizes mutual information with observations is evaluated against several task-specific models built under the taxonomy; if the maximal-information model outperforms all others across the seven axes, the central claim is falsified.
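The falsification condition can be written as a simple dominance check. The sketch below assumes each model has one numeric score per axis on a shared benchmark and reads "outperforms across the seven axes" as strictly higher on every axis; both are assumptions made here, and the model names and scores are invented.

```python
from typing import Mapping

# Axis names follow the abstract; the identifiers are this sketch's own.
AXES = (
    "representation", "prediction", "planning", "controllability",
    "causal_counterfactual", "memory", "uncertainty",
)

Scores = Mapping[str, float]


def dominates(candidate: Scores, other: Scores) -> bool:
    """True if candidate scores strictly higher than other on every axis."""
    return all(candidate[axis] > other[axis] for axis in AXES)


def claim_falsified(max_info: Scores,
                    task_specific: Mapping[str, Scores]) -> bool:
    """True if the maximal-information model dominates every task-specific model."""
    if not task_specific:
        raise ValueError("need at least one task-specific model to compare")
    return all(dominates(max_info, scores) for scores in task_specific.values())


# Made-up scores: a uniform generalist versus two task-matched models.
max_info = dict.fromkeys(AXES, 0.6)
task_specific = {
    "planner_matched": {**dict.fromkeys(AXES, 0.4), "planning": 0.9},
    "memory_matched": {**dict.fromkeys(AXES, 0.4), "memory": 0.8},
}

print(claim_falsified(max_info, task_specific))
```

A looser reading, such as a higher average across axes, would make the claim easier to falsify; the strict version is used here because it is the most conservative interpretation of the test.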

Figures

Figures reproduced from arXiv: 2605.01694 by Keon Woo Kim.

**Figure 1.** The compression spectrum of latent world model designs. Methods are ordered from observation-faithful

**Figure 2.** The taxonomy of latent state design. Six roles partition the world-model literature by what the latent is


A world model matters to an agent only through the state it constructs. That state must preserve some information, discard other information, and support some future function: prediction, control, planning, memory, grounding, or counterfactual reasoning. This paper treats world-model research as latent state design under sufficiency constraints. We propose a functional taxonomy that groups methods by what their latent state is for, rather than by architecture or application domain: predictive embedding, recurrent belief state, object/causal structure, latent action interface, grounded planning interface, and memory substrate. These roles expose distinctions that architecture-based groupings hide, including the gap between predictive sufficiency and control sufficiency, and the gap between passive video prediction and counterfactual action modeling. The taxonomy supports an evaluation framework that judges a model by the sufficiency constraint its latent state was built to satisfy. We compare methods along seven axes: representation, prediction, planning, controllability, causal/counterfactual support, memory, and uncertainty. We use the resulting matrix as a diagnostic for what a latent state preserves, discards, and enables. The conclusion that follows is that an actionable world model is the one whose state construction matches the task, not the one that preserves the most information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A clear but purely conceptual taxonomy that organizes world-model latent states by sufficiency constraints rather than architecture.


This paper's main move is to treat world-model design as choosing latent states that meet task-specific sufficiency constraints instead of trying to preserve as much information as possible. It groups existing approaches into six functional roles—predictive embedding, recurrent belief state, object/causal structure, latent action interface, grounded planning interface, and memory substrate—and supplies a seven-axis matrix to diagnose what each state keeps, drops, or enables.

Referee Report

0 major / 3 minor

Summary. The paper claims that world models should be understood through the lens of latent state design under sufficiency constraints. It proposes a functional taxonomy categorizing methods into predictive embedding, recurrent belief state, object/causal structure, latent action interface, grounded planning interface, and memory substrate. These are evaluated using a seven-axis matrix (representation, prediction, planning, controllability, causal/counterfactual support, memory, uncertainty) to determine what the states preserve, discard, and enable. The resulting insight is that actionable world models prioritize task-matched state construction over maximal information preservation.

Significance. This taxonomy offers a novel organizational tool for world-model research that could reveal overlooked distinctions, such as between passive prediction and counterfactual action modeling. By emphasizing functional roles over architectures, it may guide the development of more efficient, task-specific models. The framework's value lies in its potential as a diagnostic for latent state design, though its impact requires community testing and application.

minor comments (3)

[Abstract] The phrase 'sufficiency constraints' is central but not defined in the provided abstract; adding a brief definition would aid readers unfamiliar with the concept.
[Conclusion] The final claim about actionable world models could be illustrated with a brief example contrasting two methods to make the distinction concrete.
[Evaluation framework] Ensure that the seven axes are clearly distinguished from one another to avoid overlap in the matrix.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive assessment of the manuscript and for recommending minor revision. The referee's summary accurately reflects the paper's focus on functional taxonomy for latent state design under sufficiency constraints, and we appreciate the recognition of its potential value as a diagnostic tool. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity in conceptual taxonomy


The paper is a conceptual taxonomy and evaluation framework for world models organized around sufficiency constraints on latent states. It proposes functional categories (predictive embedding, recurrent belief state, etc.) and a seven-axis diagnostic matrix without any formal derivations, equations, fitted parameters, or numerical predictions. The central interpretive claim—that actionable world models match task-specific sufficiency rather than maximize information preservation—follows directly from adopting the proposed lens and does not reduce to self-definition, fitted inputs, or load-bearing self-citations. No load-bearing steps exist that could be circular by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The paper rests on domain assumptions standard in world-model literature about information selection and task support; no free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: A world model matters to an agent only through the state it constructs.
Opening premise of the abstract that frames all subsequent discussion.
domain assumption: The latent state must preserve some information, discard other information, and support some future function such as prediction, control, planning, memory, grounding, or counterfactual reasoning.
Core premise used to define sufficiency constraints and the taxonomy.

pith-pipeline@v0.9.0 · 5502 in / 1394 out tokens · 47723 ms · 2026-05-10T15:53:17.901566+00:00 · methodology

