From World Models to World Action Models: A Concise Tutorial for Robotics

Wei Zhang; Xiaoxiong Zhang; Xiong Zeng

arxiv: 2607.00836 · v1 · pith:5SGHYGVOnew · submitted 2026-07-01 · 💻 cs.RO · cs.AI· cs.SY· eess.SY

From World Models to World Action Models: A Concise Tutorial for Robotics

Xiaoxiong Zhang , Xiong Zeng , Wei Zhang This is my paper

Pith reviewed 2026-07-02 11:24 UTC · model grok-4.3

classification 💻 cs.RO cs.AIcs.SYeess.SY

keywords world modelsroboticsaction-conditioned predictionworld action modelsobservation-space modelsstate-space modelsembodied intelligencegenerative simulation

0 comments

The pith

World models are action-conditioned predictors of future observations or states, and world action models connect those predictions to executable robot actions via four paradigms.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines world models as action-conditioned predictive models that estimate the future evolution of task-relevant observations or states. It organizes existing work into observation-space models, which operate on visual or sensory data, and state-space models, which use more abstract representations, then compares their trade-offs in visual fidelity, spatial structure, physical interpretability, and control usability. The tutorial introduces world action models that link predicted futures to robot actions and groups them into four paradigms: imagine-then-execute, video-feature-conditioned action prediction, joint video-action modeling, and auxiliary video prediction for policy learning. A reader would care because the taxonomy supplies a design-space view that organizes how predictive models support embodied control.

Core claim

World models are action-conditioned predictive models that estimate the future evolution of task-relevant observations or states. Methods are split into observation-space world models that work with raw sensory data and state-space world models that operate on structured representations. World action models then connect these predicted futures to executable robot actions through four paradigms: imagine-then-execute, video-feature-conditioned action prediction, joint video-action modeling, and auxiliary video prediction for policy learning.

What carries the argument

The taxonomy dividing world models into observation-space versus state-space categories and the four paradigms that link predicted futures to robot actions in world action models.

If this is right

Observation-space models trade higher visual fidelity for lower physical interpretability compared with state-space models.
The imagine-then-execute paradigm lets a robot simulate futures before choosing actions.
Joint video-action modeling predicts observations and actions together in one model.
Auxiliary video prediction supplies extra signals that improve policy learning without direct action modeling.
The taxonomy clarifies how predictive models can be chosen or combined for different robotics control tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The taxonomy could be used to spot missing hybrids that combine visual fidelity with physical structure.
Benchmarking the four paradigms on the same robot tasks would test whether the distinctions hold in practice.
Extending the same categories to multi-robot coordination might expose new links between prediction and joint actions.
The design-space view could guide curriculum design for teaching embodied prediction methods.

Load-bearing premise

That the division into observation-space and state-space world models together with the four listed paradigms forms a useful and reasonably complete design-space taxonomy for the field.

What would settle it

Discovery of a world model or action-connection method that cannot be placed into either the observation-space or state-space category and does not match any of the four paradigms would show the taxonomy is incomplete.

Figures

Figures reproduced from arXiv: 2607.00836 by Wei Zhang, Xiaoxiong Zhang, Xiong Zeng.

**Figure 1.** Figure 1: Illustration of the components of a world. 1.1. World We define a world as the set of task-relevant entities, including both the robot and its environment. The environment contains the objects of interest and the ambient environment, as shown in [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 4.** Figure 4: A world model predicts future observations or states from observation history and action [PITH_FULL_IMAGE:figures/full_fig_p002_4.png] view at source ↗

**Figure 3.** Figure 3: A language-conditioned closed-loop policy framework. ot from the world, and then outputs an action at to the robot. The policy might be a proportional-integral-derivative (PID) controller, a model predictive controller (MPC), a visionlanguage-action (VLA) model, or a world action model (WAM). 1.3. World Models and World Action Models For a specified world, a world model is a model to predict how its futur… view at source ↗

**Figure 6.** Figure 6: Design space of observation-space world models. The vertical axis denotes the spatial explicitness of the observation, ranging from RGB images to multi-view RGB, RGB-D, and point clouds. The horizontal axis denotes the abstraction level of the action conditioning, ranging from low-level robot actions to interface actions, latent actions, and language instructions. Different choices along these two axes lea… view at source ↗

**Figure 7.** Figure 7: Design space of state-space world models. Instead of predicting future observations directly in the raw observation space, state-space world models abstract observations into structured state representations and model their future evolution under actions. Representative state choices include latent states, point tracks, neural-symbolic predicates, and physical states. Different state representations provi… view at source ↗

**Figure 8.** Figure 8: Taxonomy of world action models. Given the observation ot and language instruction l, world action models couple future observation prediction with robot action generation in different ways. Representative paradigms include imagine-then-execute, videofeature-conditioned action prediction, joint video-action modeling, and auxiliary video prediction for policy learning. visual future they are supposed to in… view at source ↗

read the original abstract

World models are increasingly used in embodied intelligence and generative simulation, yet their scope remains ambiguous across communities. This tutorial presents a design-space view of world models as action-conditioned predictive models that estimate the future evolution of task-relevant observations or states. We categorize existing methods into observation-space and state-space world models, comparing their trade-offs in visual fidelity, spatial structure, physical interpretability, and control usability. We further introduce world action models, which connect predicted futures with executable robot actions, and summarize four representative paradigms: imagine-then-execute, video-feature-conditioned action prediction, joint video-action modeling, and auxiliary video prediction for policy learning. The goal of this tutorial is to clarify the conceptual scope of world (action) models and provide a structured taxonomy for embodied prediction and control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This is a clean tutorial that taxonomizes world models and links them to actions but adds no new technical results or evaluations.

read the letter

The core takeaway is that this paper offers a structured overview of world models in robotics, defining them as action-conditioned predictors and splitting them into observation-space versus state-space categories before connecting predictions to actions through four paradigms: imagine-then-execute, video-feature-conditioned action prediction, joint video-action modeling, and auxiliary video prediction for policy learning.

It does a decent job laying out the trade-offs across visual fidelity, spatial structure, physical interpretability, and control usability. The four paradigms give a reasonable grouping of how existing work bridges prediction and execution, and the writing stays concise without overclaiming.

The main limitation is that everything stays at the level of summary and organization. There are no new methods, no experiments testing whether the taxonomy is useful or complete, and no formal derivations. The division into those two model types and four paradigms is presented as clarifying rather than exhaustive, so it does not run into falsifiability issues, but it also does not demonstrate that these buckets capture the design space better than prior surveys.

This is aimed at newcomers or students who need a quick map of embodied prediction work. Researchers already active in the area will likely find the content familiar from the cited literature. It does not rise to the level of a research contribution that needs referee scrutiny in a standard journal track.

Recommendation: worth a quick read if you are onboarding someone to the topic or want a compact reference for the paradigms; otherwise, it is not something to engage with deeply or cite as advancing the field.

Referee Report

0 major / 0 minor

Summary. The paper is a tutorial that defines world models as action-conditioned predictive models estimating the future evolution of task-relevant observations or states. It categorizes methods into observation-space and state-space world models, comparing trade-offs in visual fidelity, spatial structure, physical interpretability, and control usability. It introduces world action models and summarizes four paradigms: imagine-then-execute, video-feature-conditioned action prediction, joint video-action modeling, and auxiliary video prediction for policy learning, with the aim of clarifying the conceptual scope and providing a structured taxonomy for embodied prediction and control.

Significance. If the taxonomy holds as a clarifying view, the paper offers a structured design-space perspective that could help organize literature on world models in robotics. Its contribution is conceptual framing and categorization rather than new derivations, theorems, or empirical results; the explicit disclaimer that the taxonomy is not claimed to be exhaustive or optimal reduces overclaim risk.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their accurate summary of the manuscript and for recommending acceptance. The review correctly identifies the paper's focus on conceptual framing and taxonomy rather than new empirical results. No major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity; purely descriptive tutorial with no derivations or fitted results

full rationale

The paper is a tutorial that offers definitional framing of world models as action-conditioned predictive models and a design-space categorization into observation- vs. state-space models plus four paradigms (imagine-then-execute, video-feature-conditioned action prediction, joint video-action modeling, auxiliary video prediction). No equations, formal derivations, empirical fits, or load-bearing self-citations appear; the taxonomy is explicitly presented as a clarifying view rather than an exhaustive claim or derived result. The content is therefore self-contained with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No formal parameters, mathematical axioms, or invented entities with independent evidence; the framing of 'world action models' is a conceptual label rather than a new postulated entity.

pith-pipeline@v0.9.1-grok · 5666 in / 1014 out tokens · 34158 ms · 2026-07-02T11:24:57.384201+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 29 canonical work pages · 13 internal anchors

[1]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Google DeepMind Blog. URL https://deepmind.google/blog/ genie-3-a-new-frontier-for-world-models/ . Accessed: 2026-06-05. Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Komeili, M., Muckley, M., Rizvi, A., Roberts, C., Sinha, 7 From World Models to World Action Models: A Concise Tutorial for Robotics K., Zholus, A., Arnaud, S., Gejji, A., Martin,...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

URLhttps://arxiv.org/abs/2310.10639. Bu, Q., Zeng, J., Chen, L., Yang, Y ., Zhou, G., Yan, J., Luo, P., Cui, H., Ma, Y ., and Li, H. Closed-loop visuomotor control with generative expectation for robotic manipula- tion,

work page internal anchor Pith review Pith/arXiv arXiv
[3]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Cheang, C.-L., Chen, G., Jing, Y ., Kong, T., Li, H., Li, Y ., Liu, Y ., Wu, H., Xu, J., Yang, Y ., Zhang, H., and Zhu, M. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Tenenbaum, Dale Schuurmans, and Pieter Abbeel

URL https://arxiv.org/abs/2302.00111. Feng, Y ., Tan, H., Mao, X., Xiang, C., Liu, G., Huang, S., Su, H., and Zhu, J. Vidar: Embodied video diffu- sion model for generalist manipulation.arXiv preprint arXiv:2507.12898,

work page arXiv
[5]

Gao, S., Liang, W., Zheng, K., Malik, A., Ye, S., Yu, S., Tseng, W.-C., Dong, Y ., Mo, K., Lin, C.-H., Ma, Q., Nah, S., Magne, L., Xiang, J., Xie, Y ., Zheng, R., Niu, D., Tan, Y

URL https://arxiv.org/abs/ 2503.18938. Gao, S., Liang, W., Zheng, K., Malik, A., Ye, S., Yu, S., Tseng, W.-C., Dong, Y ., Mo, K., Lin, C.-H., Ma, Q., Nah, S., Magne, L., Xiang, J., Xie, Y ., Zheng, R., Niu, D., Tan, Y . L., Zentner, K. R., Kurian, G., Indupuru, S., Jannaty, P., Gu, J., Zhang, J., Malik, J., Abbeel, P., Liu, M.-Y ., Zhu, Y ., Jang, J., and...

work page arXiv
[6]

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

URL https://arxiv.org/abs/2602.06949. Goswami, R. G., Krishnamurthy, P., LeCun, Y ., and Khor- rami, F. Osvi-wm: One-shot visual imitation for unseen tasks using world-model-guided trajectory generation. arXiv preprint arXiv:2505.20425,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

FlowDreamer: An RGB-D world model with flow-based motion representations for robot manipulation.arXiv preprint arXiv:2505.10075, 2025

URL https://arxiv.org/abs/2505.10075. Guo, Y ., Shi, L. X., Chen, J., and Finn, C. Ctrl-world: A controllable generative world model for robot manipula- tion,

work page arXiv
[8]

URL https://arxiv.org/abs/2510. 10125. Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y ., Liao, Y ., Gao, P., Li, H., Yao, M., and Ren, G

Spotlight. Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y ., Liao, Y ., Gao, P., Li, H., Yao, M., and Ren, G. Ener- verse: Envisioning embodied future space for robotics manipulation, 2025a. URL https://arxiv.org/ abs/2501.01895. Huang, S., Chen, Q., Zhang, X., Sun, J., and Schwager, M. Particleformer: A 3d point cloud world model for multi-obj...

work page arXiv
[10]

Huang, Y .-W

URL https://arxiv.org/abs/2601.03782. Huang, Y ., Zhang, J., Zou, S., Liu, X., Hu, R., and Xu, K. Ladi-wm: A latent diffusion-based world model for pre- dictive manipulation.arXiv preprint arXiv:2505.11528, 2025c. Jeong, Y ., Chun, J., Cha, S., and Kim, T. Object-centric world model for language-guided manipulation.arXiv preprint arXiv:2503.06170,

work page arXiv
[11]

Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos.arXiv preprint arXiv:2503.17973,

8 From World Models to World Action Models: A Concise Tutorial for Robotics Jiang, H., Hsu, H.-Y ., Zhang, K., Yu, H.-N., Wang, S., and Li, Y . Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos.arXiv preprint arXiv:2503.17973,

work page arXiv
[12]

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Kim, M. J., Gao, Y ., Lin, T.-Y ., Lin, Y .-C., Ge, Y ., Lam, G., Liang, P., Song, S., Liu, M.-Y ., Finn, C., and Gu, J. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Li, L., Zhang, Q., Luo, Y ., Yang, S., Wang, R., Han, F., Yu, M., Gao, Z., Xue, N., Zhu, X., Shen, Y ., and Xu, Y

URL https://arxiv.org/ abs/2310.08576. Li, L., Zhang, Q., Luo, Y ., Yang, S., Wang, R., Han, F., Yu, M., Gao, Z., Xue, N., Zhu, X., Shen, Y ., and Xu, Y . Lingbot-va: Causal world modeling for robot control. arXiv preprint arXiv:2601.21998,

work page arXiv
[14]

arXiv preprint arXiv:2504.16693 , year=

Li, S., Gao, Y ., Sadigh, D., and Song, S. Unified video action model. InRobotics: Science and Systems (RSS), 2025a. Li, W., Zhao, H., Yu, Z., Du, Y ., Zou, Q., Hu, R., and Xu, K. Pin-wm: Learning physics-informed world mod- els for non-prehensile manipulation.arXiv preprint arXiv:2504.16693, 2025b. Liang, J., Liu, R., Ozguroglu, E., Sudhakar, S., Dave, A...

work page arXiv
[15]

Ma, T., Zheng, J., Wang, Z., Jiang, C., Cui, A., Liang, J., and Yang, S

URL https://arxiv.org/abs/2411.07223. Ma, T., Zheng, J., Wang, Z., Jiang, C., Cui, A., Liang, J., and Yang, S. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448,

work page arXiv
[16]

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Maes, L., Le Lidec, Q., Scieur, D., LeCun, Y ., and Balestriero, R. Leworldmodel: Stable end-to-end joint- embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

World Simulation with Video Foundation Models for Physical AI

URL https://arxiv.org/abs/2511.00062. Pai, J., Achenbach, L., Montesinos, V ., Forrai, B., Mees, O., and Nava, E. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692,

work page internal anchor Pith review Pith/arXiv arXiv
[18]

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

URL https: //arxiv.org/abs/2507.00990. Qi, H., Yin, H., Zhu, A., Du, Y ., and Yang, H. Inference-time enhancement of generative robot policies via predictive world modeling,

work page internal anchor Pith review Pith/arXiv arXiv
[19]

Shang, Y ., Zhang, X., Tang, Y ., Jin, L., Gao, C., Wu, W., and Li, Y

URL https://arxiv.org/ abs/2502.00622. Shang, Y ., Zhang, X., Tang, Y ., Jin, L., Gao, C., Wu, W., and Li, Y . Roboscape: Physics-informed embod- ied world model,

work page arXiv
[20]

Shang, X

URL https://arxiv.org/ abs/2506.23135. Team, R., Gao, Z., Wang, Q., Zeng, Y ., Zhu, J., Cheng, K. L., Li, Y ., Wang, H., Xu, Y ., Ma, S., Chen, Y ., Liu, J., Cheng, Y ., Yao, Y ., Zhu, J., Meng, Y ., Zheng, K., Bai, Q., Chen, J., Shen, Z., Yu, Y ., Zhu, X., Shen, Y ., and Ouyang, H. Advancing open-source world models,

work page arXiv
[21]

Advancing Open-source World Models

URL https://arxiv.org/abs/2601.20540. Wan Team. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,

work page internal anchor Pith review Pith/arXiv arXiv
[22]

Embod- iedreamer: Advancing real2sim2real transfer for policy training via embodied world modeling.arXiv preprint arXiv:2507.05198, 2025a

9 From World Models to World Action Models: A Concise Tutorial for Robotics Wang, B., Meng, X., Wang, X., Zhu, Z., Ye, A., Wang, Y ., Yang, Z., Ni, C., Huang, G., and Wang, X. Embod- iedreamer: Advancing real2sim2real transfer for policy training via embodied world modeling.arXiv preprint arXiv:2507.05198, 2025a. Wang, M., Jin, W., Cao, K., Xie, L., and H...

work page arXiv
[23]

Skil: Se- mantic keypoint imitation learning for generalizable data- efficient manipulation.arXiv preprint arXiv:2501.14400, 2025b

Wang, S., You, J., Hu, Y ., Li, J., and Gao, Y . Skil: Se- mantic keypoint imitation learning for generalizable data- efficient manipulation.arXiv preprint arXiv:2501.14400, 2025b. Wen, C., Lin, X., So, J., Chen, K., Dou, Q., Gao, Y ., and Abbeel, P. Any-point trajectory modeling for policy learn- ing.arXiv preprint arXiv:2401.00025,

work page arXiv
[24]

worldlabs.ai/blog/rtfm

URL https://www. worldlabs.ai/blog/rtfm. Accessed: 2026-06-

2026
[25]

Ye, S., Ge, Y ., Zheng, K., Gao, S., Yu, S., Kurian, G., Indupuru, S., Tan, Y . L., Zhu, C., Xiang, J., Malik, A., Lee, K., Liang, W., Ranawaka, N., Gu, J., Xu, Y ., Wang, G., Hu, F., Narayan, A., Bjorck, J., Wang, J., Kim, G., Niu, D., Zheng, R., Xie, Y ., Wu, J., Wang, Q., Julian, R., Xu, D., Du, Y ., Chebotar, Y ., Reed, S., Kautz, J., Zhu, Y ., Fan, L...

work page internal anchor Pith review Pith/arXiv arXiv
[26]

Womap: World models for embodied open-vocabulary object localization

Yin, T., Mei, Z., Sun, T., Zha, L., Zhou, E., Bao, J., Yamane, M., Shorinwa, O., and Majumdar, A. Womap: World models for embodied open-vocabulary object localization. arXiv preprint arXiv:2506.01600,

work page arXiv
[27]

Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Yuan, T., Dong, Z., Liu, Y ., and Zhao, H. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

Zhen, H., Sun, Q., Zhang, H., Li, J., Zhou, S., Du, Y ., and Gan, C

URL https://arxiv.org/abs/2509.00361. Zhen, H., Sun, Q., Zhang, H., Li, J., Zhou, S., Du, Y ., and Gan, C. Tesseract: Learning 4d embodied world mod- els,

work page arXiv
[29]

URL https://arxiv.org/abs/2504. 20995. Zhi, H., Chen, P., Zhou, S., Dong, Y ., Wu, Q., Han, L., and Tan, M. 3dflowaction: Learning cross-embodiment manipulation from 3d flow world model.arXiv preprint arXiv:2506.06199,

work page arXiv
[30]

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Zhou, G., Pan, H., LeCun, Y ., and Pinto, L. Dino-wm: World models on pre-trained visual features enable zero- shot planning.arXiv preprint arXiv:2411.04983,

work page internal anchor Pith review Pith/arXiv arXiv

[1] [1]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Google DeepMind Blog. URL https://deepmind.google/blog/ genie-3-a-new-frontier-for-world-models/ . Accessed: 2026-06-05. Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Komeili, M., Muckley, M., Rizvi, A., Roberts, C., Sinha, 7 From World Models to World Action Models: A Concise Tutorial for Robotics K., Zholus, A., Arnaud, S., Gejji, A., Martin,...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] [2]

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

URLhttps://arxiv.org/abs/2310.10639. Bu, Q., Zeng, J., Chen, L., Yang, Y ., Zhou, G., Yan, J., Luo, P., Cui, H., Ma, Y ., and Li, H. Closed-loop visuomotor control with generative expectation for robotic manipula- tion,

work page internal anchor Pith review Pith/arXiv arXiv

[3] [3]

GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

Cheang, C.-L., Chen, G., Jing, Y ., Kong, T., Li, H., Li, Y ., Liu, Y ., Wu, H., Xu, J., Yang, Y ., Zhang, H., and Zhu, M. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Tenenbaum, Dale Schuurmans, and Pieter Abbeel

URL https://arxiv.org/abs/2302.00111. Feng, Y ., Tan, H., Mao, X., Xiang, C., Liu, G., Huang, S., Su, H., and Zhu, J. Vidar: Embodied video diffu- sion model for generalist manipulation.arXiv preprint arXiv:2507.12898,

work page arXiv

[5] [5]

Gao, S., Liang, W., Zheng, K., Malik, A., Ye, S., Yu, S., Tseng, W.-C., Dong, Y ., Mo, K., Lin, C.-H., Ma, Q., Nah, S., Magne, L., Xiang, J., Xie, Y ., Zheng, R., Niu, D., Tan, Y

URL https://arxiv.org/abs/ 2503.18938. Gao, S., Liang, W., Zheng, K., Malik, A., Ye, S., Yu, S., Tseng, W.-C., Dong, Y ., Mo, K., Lin, C.-H., Ma, Q., Nah, S., Magne, L., Xiang, J., Xie, Y ., Zheng, R., Niu, D., Tan, Y . L., Zentner, K. R., Kurian, G., Indupuru, S., Jannaty, P., Gu, J., Zhang, J., Malik, J., Abbeel, P., Liu, M.-Y ., Zhu, Y ., Jang, J., and...

work page arXiv

[6] [6]

DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos

URL https://arxiv.org/abs/2602.06949. Goswami, R. G., Krishnamurthy, P., LeCun, Y ., and Khor- rami, F. Osvi-wm: One-shot visual imitation for unseen tasks using world-model-guided trajectory generation. arXiv preprint arXiv:2505.20425,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

FlowDreamer: An RGB-D world model with flow-based motion representations for robot manipulation.arXiv preprint arXiv:2505.10075, 2025

URL https://arxiv.org/abs/2505.10075. Guo, Y ., Shi, L. X., Chen, J., and Finn, C. Ctrl-world: A controllable generative world model for robot manipula- tion,

work page arXiv

[8] [8]

URL https://arxiv.org/abs/2510. 10125. Hafner, D., Pasukonis, J., Ba, J., and Lillicrap, T. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y ., Liao, Y ., Gao, P., Li, H., Yao, M., and Ren, G

Spotlight. Huang, S., Chen, L., Zhou, P., Chen, S., Jiang, Z., Hu, Y ., Liao, Y ., Gao, P., Li, H., Yao, M., and Ren, G. Ener- verse: Envisioning embodied future space for robotics manipulation, 2025a. URL https://arxiv.org/ abs/2501.01895. Huang, S., Chen, Q., Zhang, X., Sun, J., and Schwager, M. Particleformer: A 3d point cloud world model for multi-obj...

work page arXiv

[10] [10]

Huang, Y .-W

URL https://arxiv.org/abs/2601.03782. Huang, Y ., Zhang, J., Zou, S., Liu, X., Hu, R., and Xu, K. Ladi-wm: A latent diffusion-based world model for pre- dictive manipulation.arXiv preprint arXiv:2505.11528, 2025c. Jeong, Y ., Chun, J., Cha, S., and Kim, T. Object-centric world model for language-guided manipulation.arXiv preprint arXiv:2503.06170,

work page arXiv

[11] [11]

Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos.arXiv preprint arXiv:2503.17973,

8 From World Models to World Action Models: A Concise Tutorial for Robotics Jiang, H., Hsu, H.-Y ., Zhang, K., Yu, H.-N., Wang, S., and Li, Y . Phystwin: Physics-informed reconstruction and simulation of deformable objects from videos.arXiv preprint arXiv:2503.17973,

work page arXiv

[12] [12]

Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

Kim, M. J., Gao, Y ., Lin, T.-Y ., Lin, Y .-C., Ge, Y ., Lam, G., Liang, P., Song, S., Liu, M.-Y ., Finn, C., and Gu, J. Cosmos policy: Fine-tuning video models for visuomotor control and planning.arXiv preprint arXiv:2601.16163,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Li, L., Zhang, Q., Luo, Y ., Yang, S., Wang, R., Han, F., Yu, M., Gao, Z., Xue, N., Zhu, X., Shen, Y ., and Xu, Y

URL https://arxiv.org/ abs/2310.08576. Li, L., Zhang, Q., Luo, Y ., Yang, S., Wang, R., Han, F., Yu, M., Gao, Z., Xue, N., Zhu, X., Shen, Y ., and Xu, Y . Lingbot-va: Causal world modeling for robot control. arXiv preprint arXiv:2601.21998,

work page arXiv

[14] [14]

arXiv preprint arXiv:2504.16693 , year=

Li, S., Gao, Y ., Sadigh, D., and Song, S. Unified video action model. InRobotics: Science and Systems (RSS), 2025a. Li, W., Zhao, H., Yu, Z., Du, Y ., Zou, Q., Hu, R., and Xu, K. Pin-wm: Learning physics-informed world mod- els for non-prehensile manipulation.arXiv preprint arXiv:2504.16693, 2025b. Liang, J., Liu, R., Ozguroglu, E., Sudhakar, S., Dave, A...

work page arXiv

[15] [15]

Ma, T., Zheng, J., Wang, Z., Jiang, C., Cui, A., Liang, J., and Yang, S

URL https://arxiv.org/abs/2411.07223. Ma, T., Zheng, J., Wang, Z., Jiang, C., Cui, A., Liang, J., and Yang, S. Dit4dit: Jointly modeling video dynamics and actions for generalizable robot control.arXiv preprint arXiv:2603.10448,

work page arXiv

[16] [16]

LeWorldModel: Stable End-to-End Joint-Embedding Predictive Architecture from Pixels

Maes, L., Le Lidec, Q., Scieur, D., LeCun, Y ., and Balestriero, R. Leworldmodel: Stable end-to-end joint- embedding predictive architecture from pixels.arXiv preprint arXiv:2603.19312,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

World Simulation with Video Foundation Models for Physical AI

URL https://arxiv.org/abs/2511.00062. Pai, J., Achenbach, L., Montesinos, V ., Forrai, B., Mees, O., and Nava, E. mimic-video: Video-action models for generalizable robot control beyond vlas.arXiv preprint arXiv:2512.15692,

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

Robotic Manipulation by Imitating Generated Videos Without Physical Demonstrations

URL https: //arxiv.org/abs/2507.00990. Qi, H., Yin, H., Zhu, A., Du, Y ., and Yang, H. Inference-time enhancement of generative robot policies via predictive world modeling,

work page internal anchor Pith review Pith/arXiv arXiv

[19] [19]

Shang, Y ., Zhang, X., Tang, Y ., Jin, L., Gao, C., Wu, W., and Li, Y

URL https://arxiv.org/ abs/2502.00622. Shang, Y ., Zhang, X., Tang, Y ., Jin, L., Gao, C., Wu, W., and Li, Y . Roboscape: Physics-informed embod- ied world model,

work page arXiv

[20] [20]

Shang, X

URL https://arxiv.org/ abs/2506.23135. Team, R., Gao, Z., Wang, Q., Zeng, Y ., Zhu, J., Cheng, K. L., Li, Y ., Wang, H., Xu, Y ., Ma, S., Chen, Y ., Liu, J., Cheng, Y ., Yao, Y ., Zhu, J., Meng, Y ., Zheng, K., Bai, Q., Chen, J., Shen, Z., Yu, Y ., Zhu, X., Shen, Y ., and Ouyang, H. Advancing open-source world models,

work page arXiv

[21] [21]

Advancing Open-source World Models

URL https://arxiv.org/abs/2601.20540. Wan Team. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314,

work page internal anchor Pith review Pith/arXiv arXiv

[22] [22]

Embod- iedreamer: Advancing real2sim2real transfer for policy training via embodied world modeling.arXiv preprint arXiv:2507.05198, 2025a

9 From World Models to World Action Models: A Concise Tutorial for Robotics Wang, B., Meng, X., Wang, X., Zhu, Z., Ye, A., Wang, Y ., Yang, Z., Ni, C., Huang, G., and Wang, X. Embod- iedreamer: Advancing real2sim2real transfer for policy training via embodied world modeling.arXiv preprint arXiv:2507.05198, 2025a. Wang, M., Jin, W., Cao, K., Xie, L., and H...

work page arXiv

[23] [23]

Skil: Se- mantic keypoint imitation learning for generalizable data- efficient manipulation.arXiv preprint arXiv:2501.14400, 2025b

Wang, S., You, J., Hu, Y ., Li, J., and Gao, Y . Skil: Se- mantic keypoint imitation learning for generalizable data- efficient manipulation.arXiv preprint arXiv:2501.14400, 2025b. Wen, C., Lin, X., So, J., Chen, K., Dou, Q., Gao, Y ., and Abbeel, P. Any-point trajectory modeling for policy learn- ing.arXiv preprint arXiv:2401.00025,

work page arXiv

[24] [24]

worldlabs.ai/blog/rtfm

URL https://www. worldlabs.ai/blog/rtfm. Accessed: 2026-06-

2026

[25] [25]

Ye, S., Ge, Y ., Zheng, K., Gao, S., Yu, S., Kurian, G., Indupuru, S., Tan, Y . L., Zhu, C., Xiang, J., Malik, A., Lee, K., Liang, W., Ranawaka, N., Gu, J., Xu, Y ., Wang, G., Hu, F., Narayan, A., Bjorck, J., Wang, J., Kim, G., Niu, D., Zheng, R., Xie, Y ., Wu, J., Wang, Q., Julian, R., Xu, D., Du, Y ., Chebotar, Y ., Reed, S., Kautz, J., Zhu, Y ., Fan, L...

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

Womap: World models for embodied open-vocabulary object localization

Yin, T., Mei, Z., Sun, T., Zha, L., Zhou, E., Bao, J., Yamane, M., Shorinwa, O., and Majumdar, A. Womap: World models for embodied open-vocabulary object localization. arXiv preprint arXiv:2506.01600,

work page arXiv

[27] [27]

Fast-WAM: Do World Action Models Need Test-time Future Imagination?

Yuan, T., Dong, Z., Liu, Y ., and Zhao, H. Fast-wam: Do world action models need test-time future imagination? arXiv preprint arXiv:2603.16666,

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

Zhen, H., Sun, Q., Zhang, H., Li, J., Zhou, S., Du, Y ., and Gan, C

URL https://arxiv.org/abs/2509.00361. Zhen, H., Sun, Q., Zhang, H., Li, J., Zhou, S., Du, Y ., and Gan, C. Tesseract: Learning 4d embodied world mod- els,

work page arXiv

[29] [29]

URL https://arxiv.org/abs/2504. 20995. Zhi, H., Chen, P., Zhou, S., Dong, Y ., Wu, Q., Han, L., and Tan, M. 3dflowaction: Learning cross-embodiment manipulation from 3d flow world model.arXiv preprint arXiv:2506.06199,

work page arXiv

[30] [30]

DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning

Zhou, G., Pan, H., LeCun, Y ., and Pinto, L. Dino-wm: World models on pre-trained visual features enable zero- shot planning.arXiv preprint arXiv:2411.04983,

work page internal anchor Pith review Pith/arXiv arXiv