Ctrl-World: A Controllable Generative World Model for Robot Manipulation

Chelsea Finn; Jianyu Chen; Lucy Xiaoyang Shi; Yanjiang Guo

arxiv: 2510.10125 · v3 · pith:7T3FOS2Rnew · submitted 2025-10-11 · 💻 cs.RO · cs.AI

Ctrl-World: A Controllable Generative World Model for Robot Manipulation

Yanjiang Guo , Lucy Xiaoyang Shi , Jianyu Chen , Chelsea Finn This is my paper

Pith reviewed 2026-05-16 01:08 UTC · model grok-4.3

classification 💻 cs.RO cs.AI

keywords world modelrobot manipulationgenerative modelpolicy fine-tuningimagination learningmulti-view predictionlong-horizon consistency

0 comments

The pith

A controllable world model ranks robot policies and improves them by 44.7 percent through imagined trajectories alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Generalist robot policies need extensive real-world testing on new objects and instructions, which is slow and expensive. This paper introduces Ctrl-World, a multi-view generative model that produces consistent long-horizon video rollouts conditioned on actions and camera poses. The model supports two direct uses: it ranks different policies by their imagined success rates without any physical trials, and it creates new training data by synthesizing successful trajectories for supervised fine-tuning. When policies are updated on these imagined successes, their real-world performance rises measurably. The approach therefore removes the need for large numbers of corrective robot experiments during policy development.

Core claim

The controllable multi-view world model maintains long-horizon consistency with a pose-conditioned memory retrieval mechanism and achieves precise action control through frame-level action conditioning. Trained on the DROID dataset of 95k trajectories across 564 scenes, the model generates spatially and temporally consistent trajectories under novel scenarios and new camera placements for over 20 seconds. These generated trajectories enable accurate ranking of policy performance without real-world robot rollouts. By synthesizing successful trajectories in imagination and using them for supervised fine-tuning, the approach improves policy success by 44.7 percent.

What carries the argument

The pose-conditioned memory retrieval mechanism paired with frame-level action conditioning inside a multi-view generative video model, which produces controllable long-horizon trajectories that serve as proxies for real dynamics.

If this is right

Policy selection and iteration can occur entirely inside imagination without physical robot time.
Corrective training data for new tasks can be generated at scale from a single trained world model.
Generalist policies can be adapted to novel objects and instructions more quickly than with real rollouts.
The same generated trajectories can serve both for evaluation and for direct supervised improvement.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the model generalizes further, it could support closed-loop planning and search inside imagination rather than only offline ranking.
Similar controllable world models could reduce data requirements across other embodied domains such as navigation or multi-robot coordination.
The 20-second horizon already demonstrated suggests the method may extend to longer tasks once memory retrieval is strengthened.

Load-bearing premise

The imagined trajectories remain accurate enough proxies for real-world dynamics on unseen objects, instructions, and camera views to support reliable policy ranking and effective fine-tuning.

What would settle it

Roll out the top-ranked policies from the world model in the real world and observe whether their success ordering matches the imagined ranking or whether fine-tuning on the synthesized trajectories produces the reported 44.7 percent gain.

read the original abstract

Generalist robot policies can now perform a wide range of manipulation skills, but evaluating and improving their ability with unfamiliar objects and instructions remains a significant challenge. Rigorous evaluation requires a large number of real-world rollouts, while systematic improvement demands additional corrective data with expert labels. Both of these processes are slow, costly, and difficult to scale. World models offer a promising, scalable alternative by enabling policies to rollout within imagination space. However, a key challenge is building a controllable world model that can handle multi-step interactions with generalist robot policies. This requires a world model compatible with modern generalist policies by supporting multi-view prediction, fine-grained action control, and consistent long-horizon interactions, which is not achieved by previous works. In this paper, we make a step forward by introducing a controllable multi-view world model that can be used to evaluate and improve the instruction-following ability of generalist robot policies. Our model maintains long-horizon consistency with a pose-conditioned memory retrieval mechanism and achieves precise action control through frame-level action conditioning. Trained on the DROID dataset (95k trajectories, 564 scenes), our model generates spatially and temporally consistent trajectories under novel scenarios and new camera placements for over 20 seconds. We show that our method can accurately rank policy performance without real-world robot rollouts. Moreover, by synthesizing successful trajectories in imagination and using them for supervised fine-tuning, our approach can improve policy success by 44.7\%.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Ctrl-World adds pose-conditioned memory and frame-level action control to make a usable long-horizon world model for generalist policies, but the 44.7% improvement claim rests on an unverified assumption that the generated trajectories are faithful enough for ranking and fine-tuning on novel cases.

read the letter

The main point is that the authors built a controllable multi-view world model that keeps spatial and temporal consistency for over 20 seconds on DROID data by retrieving from a pose-conditioned memory and conditioning actions at the frame level. This setup lets them generate trajectories under new instructions, objects, and camera placements, then use those rollouts to rank policies without real robots and to create synthetic successful data for supervised fine-tuning. The 44.7% reported gain is the headline result, and the scale of training (95k trajectories, 564 scenes) is respectable for this kind of work. The architecture choices directly address gaps in prior world models that lacked fine action control or drifted too quickly in long sequences. That combination is the concrete advance. The weak part is the leap from generated videos to real policy improvement. The abstract gives no pixel-level or 3D pose prediction error on held-out real rollouts, no ablation that shows the gain vanishes with a weaker world model, and no breakdown of how well the model handles truly novel objects or instructions. Generative video models routinely produce plausible but physically off trajectories after a few seconds, so without those checks it is unclear whether the ranking and fine-tuning steps are actually capturing real dynamics or just model artifacts. This paper is aimed at people working on scalable robot learning and world models for manipulation. It is worth sending to referees because the problem it attacks is real and the proposed mechanisms are a clear step beyond existing controllable simulators, even though the current evidence for the downstream gains is still preliminary and would need tighter validation in review.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

full rationale

The paper's core claims rest on training a controllable world model on the DROID dataset (95k trajectories) and then measuring downstream policy improvements via held-out real-world evaluations after supervised fine-tuning on synthesized trajectories. No load-bearing step reduces the reported 44.7% success gain, policy ranking accuracy, or long-horizon consistency to a fitted parameter or self-referential definition by construction; the evaluation metrics and improvement figures are obtained from separate experimental rollouts rather than being tautological with the model's training objective or internal equations.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that a generative model trained on DROID generalizes to novel scenarios sufficiently well to proxy real robot behavior; standard deep-learning training assumptions and dataset coverage are also required.

free parameters (1)

model architecture hyperparameters and training schedule
Typical for large generative models; exact values not stated in abstract.

axioms (1)

domain assumption The DROID dataset and the proposed memory and conditioning mechanisms suffice for generalization to novel objects and viewpoints
Invoked to support claims of usefulness on unfamiliar scenarios.

pith-pipeline@v0.9.0 · 5565 in / 1243 out tokens · 148773 ms · 2026-05-16T01:08:54.062305+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

by synthesizing successful trajectories in imagination and using them for supervised fine-tuning, our approach can improve policy success by 44.7%.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 49 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DVG-WM: Disentangled Video Generation Enables Efficient Embodied World Model for Robotic Manipulation
cs.RO 2026-06 unverdicted novelty 7.0

DVG-WM disentangles dynamics learning and visual synthesis in video world models using flow matching and latent degradation to achieve faster inference up to 3.97 times with improved quality on LIBERO and real-world r...
RoboGaze: Evaluating Robot World Models via Structured Vision-Language Analysis
cs.RO 2026-06 unverdicted novelty 7.0

RoboGaze presents a structured multi-agent VLM pipeline and robotics-specific error taxonomy that improves video evaluation metrics by up to 43 F1 points over zero-shot baselines on a 382-clip dataset.
PiL-World: A Chunk-Wise World Model for VLA Policy-in-the-Loop Evaluation
cs.RO 2026-06 unverdicted novelty 7.0

PiL-World introduces a chunk-wise world model for closed-loop VLA policy evaluation that reduces the gap between simulated and real success rates from 63.2% to 12.0% on three dual-arm manipulation tasks by conditionin...
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
cs.RO 2026-05 unverdicted novelty 7.0

MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
cs.CV 2026-05 unverdicted novelty 7.0

ACWM-Phys is a controllable simulator benchmark with in- and out-of-distribution protocols for evaluating action-conditioned world models across rigid, kinematic, deformable, and particle dynamics.
PhySPRING: Structure-Preserving Reduction of Physics-Informed Twins via GNN
cs.RO 2026-05 unverdicted novelty 7.0

PhySPRING uses differentiable GNNs to learn hierarchical coarsened spring-mass topologies and parameters from observations, delivering up to 2.3x speedup on PhysTwin benchmarks and comparable robot policy success rate...
Action Images: End-to-End Policy Learning via Multiview Video Generation
cs.CV 2026-04 unverdicted novelty 7.0

Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
StarVLA: A Lego-like Codebase for Vision-Language-Action Model Developing
cs.RO 2026-04 conditional novelty 7.0

StarVLA delivers a Lego-like open-source framework for VLA models with swappable backbones and action heads, reusable training methods, and unified evaluation across major benchmarks.
PlayWorld: Learning Robot World Models from Autonomous Play
cs.RO 2026-03 unverdicted novelty 7.0

PlayWorld learns high-fidelity robot world models from unsupervised self-play, producing physically consistent video predictions that outperform models trained on human data and enabling 65% better real-world policy p...
DreamDojo: A Generalist Robot World Model from Large-Scale Human Videos
cs.RO 2026-02 unverdicted novelty 7.0

DreamDojo is a foundation world model pretrained on the largest human video dataset to date that uses continuous latent actions to transfer interaction knowledge and achieves controllable physics simulation after robo...
Mem-World: Memory-Augmented Action-Conditioned World Models for Persistent Robot Manipulation
cs.CV 2026-06 unverdicted novelty 6.0

Mem-World augments world models with W-VMem, a wrist-view-centered surfel memory, to generate persistent action-conditioned video rollouts that improve policy evaluation correlation by 14.5% and raise task success fro...
SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation
cs.RO 2026-06 unverdicted novelty 6.0

SC3-Eval enforces three consistencies on a video model to produce policy rollouts that correlate 0.929 with real-world performance across seven vision-language-action policies and reproduce observed failure modes.
SC3-Eval: Evaluating Robot Foundation Models via Self-Consistent Video Generation
cs.RO 2026-06 unverdicted novelty 6.0

SC3-Eval enforces three consistency constraints on video world models to evaluate robot manipulation policies, achieving 0.929 Pearson correlation with real-world rollouts across seven policies.
iMaC: Translating Actions into Motion and Contact Images for Embodied World Models
cs.RO 2026-06 unverdicted novelty 6.0

iMaC introduces image-based action tokens in a dual-branch architecture to improve future state prediction and control in embodied world models over vector-based baselines.
$\omega$-EVA: Envision, Verify, and Act with Latent Interactive World Models
cs.RO 2026-06 unverdicted novelty 6.0

ω-EVA is a three-stage latent world model framework that trains action-conditioned dynamics, a language-conditioned flow policy, and a tri-branch refiner to improve embodied action generation in simulation.
Intercepting the Future: Latent-Space Predictive World Model for Dynamic VLA Manipulation
cs.RO 2026-06 unverdicted novelty 6.0

AHEAD augments frozen VLAs with a 4.9M-parameter latent world model that forecasts future visual features using optical-flow motion cues, achieving 79-97% success on dynamic simulation tasks and high real-robot succes...
Geometry-Aware Implicit Memory for Video World Models
cs.CV 2026-06 unverdicted novelty 6.0

GIM-World adds a camera-queryable geometry distillation head and pruning rule to implicit memory in video world models, claiming better long-horizon geometric consistency on the MIND benchmark than explicit and implic...
GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
cs.CV 2026-05 unverdicted novelty 6.0

GEM-4D improves video world models for robot manipulation by distilling 4D geometric correspondences into training and adding an inverse dynamics module, achieving SOTA geometric consistency and 81% real-world success.
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
cs.CV 2026-05 unverdicted novelty 6.0

ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...
MotuBrain: An Advanced World Action Model for Robot Control
cs.RO 2026-04 unverdicted novelty 6.0

MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...
Breaking Lock-In: Preserving Steerability under Low-Data VLA Post-Training
cs.RO 2026-04 unverdicted novelty 6.0

DeLock mitigates lock-in in low-data VLA post-training via visual grounding preservation and test-time contrastive prompt guidance, outperforming baselines across eight evaluations while matching data-heavy generalist...
dWorldEval: Scalable Robotic Policy Evaluation via Discrete Diffusion World Model
cs.RO 2026-04 unverdicted novelty 6.0

A discrete diffusion model tokenizes multimodal robotic data and uses a progress token to predict future states and task completion for scalable policy evaluation.
Hi-WM: Human-in-the-World-Model for Scalable Robot Post-Training
cs.RO 2026-04 unverdicted novelty 6.0

Hi-WM uses human interventions inside an action-conditioned world model with rollback and branching to generate dense corrective data, raising real-world success by 37.9 points on average across three manipulation tasks.
UniT: Toward a Unified Physical Language for Human-to-Humanoid Policy Learning and World Modeling
cs.RO 2026-04 unverdicted novelty 6.0

UniT creates a unified physical language via visual anchoring and tri-branch reconstruction to enable scalable human-to-humanoid transfer for policy learning and world modeling.
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
cs.RO 2026-04 unverdicted novelty 6.0

VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
cs.RO 2026-04 unverdicted novelty 6.0

Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
RISE: Self-Improving Robot Policy with Compositional World Model
cs.RO 2026-02 unverdicted novelty 6.0

RISE combines a controllable dynamics model and progress value model into a closed-loop self-improving pipeline that updates robot policies entirely in imagination, reporting over 35% absolute gains on three real-world tasks.
Genie Sim 3.0 : A High-Fidelity Comprehensive Simulation Platform for Humanoid Robot
cs.RO 2026-01 unverdicted novelty 6.0

Genie Sim 3.0 introduces an LLM-powered scene generator, the first LLM-based automated evaluation benchmark, and a large open synthetic dataset that demonstrates zero-shot sim-to-real transfer for robotic manipulation...
mimic-video: Video-Action Models for Generalizable Robot Control Beyond VLAs
cs.RO 2025-12 unverdicted novelty 6.0

mimic-video combines internet video pretraining with a flow-matching decoder to achieve state-of-the-art robotic manipulation performance with 10x better sample efficiency than vision-language-action models.
PhysisForcing: Physics Reinforced World Simulator for Robotic Manipulation
cs.CV 2026-06 unverdicted novelty 5.0

PhysisForcing applies trajectory and relational alignment losses to DiT features in video models, improving physical plausibility on R-Bench, PAI-Bench, and EZS-Bench while raising closed-loop robotic success rates fr...
Not All Actions Are Equal: Rethinking Conditioning for Dexterous World Model
cs.CV 2026-06 unverdicted novelty 5.0

DexAC-WM improves FID, FVD, and PCK in high-DoF action-conditioned video prediction via structured action modeling and semantic grounding on EgoDex and EgoVerse.
LaST-HD: Learning Latent Physical Reasoning from Scalable Human Data for Robot Manipulation
cs.RO 2026-06 unverdicted novelty 5.0

LaST-HD creates a shared latent dynamics space via a world model to transfer physical reasoning from scalable human-hand demonstrations to robots, achieving over 90% accuracy with 20 minutes of new data after mixed training.
How Should World Models Be Evaluated for Embodied Decision-Making? A Decision-Making-Centric Position
cs.LG 2026-06 unverdicted novelty 5.0

The paper proposes an L0-L7 evidential ladder for evaluating world models in embodied decision-making, prioritizing interventional action fidelity and policy optimization utility over visual plausibility.
Making Foresight Actionable: Repurposing Representation Alignment in World Action Models
cs.CV 2026-06 unverdicted novelty 5.0

AGRA is an Action-Grounded Representation Alignment objective that aligns intermediate video diffusion features with semantic representations to make world action model hidden states more useful for low-level robot co...
$\tau_0$-WM: A Unified Video-Action World Model for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 5.0

A shared video diffusion backbone jointly predicts future latents and continuous actions while also rolling out candidate actions to predict dense task-progress scores, trained on 27,300 hours of mixed robot and human data.
GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation
cs.CV 2026-05 unverdicted novelty 5.0

GEM-4D is a video world model that injects 4D correspondence supervision to improve geometric consistency and robot manipulation success from 61% to 81%.
WorldArena 2.0: Extending Embodied World Model Benchmarking on Modality, Functionality and Platform
cs.RO 2026-05 unverdicted novelty 5.0

WorldArena 2.0 extends embodied world model benchmarks to visuotactile perception, interactive policy training, and diverse real and simulated robotic platforms under a unified protocol.
Nano World Models: A Minimalist Implementation of Future Video Prediction
cs.CV 2026-05 unverdicted novelty 5.0

Nano World Models supplies a unified minimalist codebase and evaluation framework for studying diffusion forcing in video prediction across control, games, and robot domains.
ECG-WM: A Physiology-Informed ECG World Model for Clinical Intervention Simulation
cs.AI 2026-05 unverdicted novelty 5.0

ECG-WM combines ODE physiological priors with latent diffusion models to generate intervention-conditioned ECG trajectories and uses diffusion stochasticity for uncertainty-aware clinical risk assessment.
Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models
cs.CV 2026-05 unverdicted novelty 5.0

Semantic latent spaces from pretrained encoders outperform reconstruction-based spaces for robotic world models on planning and downstream policy performance.
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
cs.RO 2026-04 unverdicted novelty 5.0

The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.
WestWorld: A Knowledge-Encoded Scalable Trajectory World Model for Diverse Robotic Systems
cs.LG 2026-03 unverdicted novelty 5.0

WestWorld introduces a scalable trajectory world model with Sys-MoE routing via system embeddings and structural embeddings for physical knowledge, pretrained on 89 environments to improve zero-shot prediction and rea...
Robot Self-Improvement via Human-Video Dynamics Models
cs.RO 2026-06 unverdicted novelty 4.0

Human-video dynamics models enable cross-embodiment robot self-improvement via training-free Dynamics-Guided Action Correction, raising success rates from 40% to 81% on seven real-world tasks.
Position: Good Embodied Reward Models Need Bad Behavior Data
cs.RO 2026-05 unverdicted novelty 4.0

Embodied reward models systematically over-reward unsafe, suboptimal, and shortcut robot behaviors due to training on successful data only, and modest inclusion of bad behavior data improves alignment with human preferences.
GE-Sim 2.0: A Roadmap Towards Comprehensive Closed-loop Video World Simulators for Robotic Manipulation
cs.RO 2026-05 unverdicted novelty 4.0

GE-Sim 2.0 is a video-based closed-loop simulator for robotic manipulation that adds state expert, world judge, and acceleration modules on top of prior video generation to support policy learning and evaluation.
Pre-VLA: Preemptive Runtime Verification for Reliable Vision-Language-Action and World-Model Rollouts
cs.CV 2026-05 unverdicted novelty 4.0

Pre-VLA is a multimodal runtime verifier that predicts safety confidence and advantage scores for action chunks, raising closed-loop success rates on the LIBERO benchmark from 30.79% to 37.62%.
World Action Models: The Next Frontier in Embodied AI
cs.RO 2026-05 unverdicted novelty 4.0

The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
World Model for Robot Learning: A Comprehensive Survey
cs.RO 2026-04 unverdicted novelty 3.0

A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...
Towards Interactive Video World Modeling: Frontiers, Challenges, Benchmarks, and Future Trends
cs.CV 2026-05 unverdicted novelty 2.0

This survey reviews trends, challenges, benchmarks, and future directions in action-conditioned interactive world modeling for video and 3D generation.

Reference graph

Works this paper leans on

56 extracted references · 56 canonical work pages · cited by 46 Pith papers · 32 internal anchors

[1]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chat- topadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Roboarena: Distributed real-world evaluation of generalist robot policies

Pranav Atreya, Karl Pertsch, Tony Lee, Moo Jin Kim, Arhan Jain, Artur Kuramshin, Clemens Eppner, Cyrus Neary, Edward Hu, Fabio Ramos, et al. Roboarena: Distributed real-world evaluation of generalist robot policies.arXiv preprint arXiv:2506.18123,

work page arXiv
[3]

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639,

work page internal anchor Pith review arXiv
[5]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Real-Time Execution of Action Chunking Flow Policies

Kevin Black, Manuel Y Galliker, and Sergey Levine. Real-time execution of action chunking flow policies.arXiv preprint arXiv:2506.07339,

work page internal anchor Pith review Pith/arXiv arXiv
[7]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023a. Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanj...

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Wow: Towards a world omni- scient world model through embodied interaction,

Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, et al. Wow: Towards a world omniscient world model through embodied interaction.arXiv preprint arXiv:2509.22642,

work page arXiv
[9]

doi:10.48550/arXiv.2505.03912 , abstract =

11 Published as a conference paper at ICLR 2026 Can Cui, Pengxiang Ding, Wenxuan Song, Shuanghao Bai, Xinyang Tong, Zirui Ge, Runze Suo, Wanqi Zhou, Yang Liu, Bofang Jia, et al. Openhelix: A short survey, empirical analysis, and open-source dual-system vla model for robotic manipulation.arXiv preprint arXiv:2505.03912,

work page arXiv 2026
[10]

RoboNet: Large-Scale Multi-Robot Learning

Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215,

work page internal anchor Pith review arXiv 1910
[11]

Vision-language models as success detectors,

Yuqing Du, Ksenia Konyushkova, Misha Denil, Akhil Raju, Jessica Landon, Felix Hill, Nando De Freitas, and Serkan Cabi. Vision-language models as success detectors.arXiv preprint arXiv:2303.07280,

work page arXiv
[12]

Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control.arXiv preprint arXiv:1812.00568,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Vidar: Embodied Video Diffusion Model for Generalist Manipulation

Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist bimanual manipulation.arXiv preprint arXiv:2507.12898,

work page internal anchor Pith review arXiv
[14]

Flip: Flow- centric generative planning as general-purpose manipulation world model,

Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, and Lin Shao. Flip: Flow-centric generative planning as general-purpose manipulation world model.arXiv preprint arXiv:2412.08261,

work page arXiv
[15]

Adaworld: Learning adaptable world models with latent actions.arXiv preprint arXiv:2503.18938, 2025

Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions.arXiv preprint arXiv:2503.18938,

work page arXiv
[16]

Improving vision-language-action model with online reinforcement learning.arXiv preprint arXiv:2501.16664,

Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, and Jianyu Chen. Improving vision-language-action model with online reinforcement learning.arXiv preprint arXiv:2501.16664,

work page arXiv
[17]

Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603,

work page internal anchor Pith review Pith/arXiv arXiv 1912
[18]

Mastering Atari with Discrete World Models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models.arXiv preprint arXiv:2010.02193,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[19]

Training Agents Inside of Scalable World Models

Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

Temporal difference learning for model predictive control.arXiv preprint arXiv:2203.04955, 2022

Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control.arXiv preprint arXiv:2203.04955,

work page arXiv
[21]

Pre-trained video generative models as world simulators.CoRR, abs/2502.07825, February 2025

Haoran He, Yang Zhang, Liang Lin, Zhongwen Xu, and Ling Pan. Pre-trained video generative models as world simulators.arXiv preprint arXiv:2502.07825,

work page arXiv
[22]

Image quality metrics: Psnr vs

12 Published as a conference paper at ICLR 2026 Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In2010 20th international conference on pattern recognition, pp. 2366–2369. IEEE,

work page 2026
[23]

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803,

work page internal anchor Pith review Pith/arXiv arXiv
[24]

Particleformer: A 3d point cloud world model for multi-object, multi-material robotic manipulation.arXiv preprint arXiv:2506.23126,

Suning Huang, Qianzhong Chen, Xiaohan Zhang, Jiankai Sun, and Mac Schwager. Particleformer: A 3d point cloud world model for multi-object, multi-material robotic manipulation.arXiv preprint arXiv:2506.23126,

work page arXiv
[25]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi {0.5}: a vision-language- action model with open-world generalization.arXiv preprint arXiv:2504.16054,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

arXiv preprint arXiv:2505.09723 (2025)

Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, et al. Enerverse-ac: Envisioning embodied environments with action condition.arXiv preprint arXiv:2505.09723,

work page arXiv
[27]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945,

work page internal anchor Pith review Pith/arXiv arXiv
[28]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246,

work page internal anchor Pith review Pith/arXiv arXiv
[29]

Unified Video Action Model

Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025a. Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941,

work page internal anchor Pith review Pith/arXiv arXiv
[30]

Worldeval: World model as real-world robot policies evaluator.arXiv preprint arXiv:2505.19017, 2025

Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. Worldeval: World model as real-world robot policies evaluator.arXiv preprint arXiv:2505.19017, 2025b. Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl V ondrick. Dreamitate: Real-world visuomotor policy learning via video generation. arX...

work page arXiv
[31]

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635,

work page internal anchor Pith review Pith/arXiv arXiv
[32]

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model.arXiv preprint arXiv:2503.10631,

work page internal anchor Pith review Pith/arXiv arXiv
[33]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864,

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Foundation reward models for general robot skill acquisition

Yecheng Jason Ma. Foundation reward models for general robot skill acquisition. InRobotics: Science and Systems-Pioneers Workshop 2025,

work page 2025
[35]

Deep dynamics models for learning dexterous manipulation

13 Published as a conference paper at ICLR 2026 Anusha Nagabandi, Kurt Konolige, Sergey Levine, and Vikash Kumar. Deep dynamics models for learning dexterous manipulation. InConference on robot learning, pp. 1101–1112. PMLR,

work page 2026
[36]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747,

work page internal anchor Pith review Pith/arXiv arXiv
[37]

Quevedo, A

Julian Quevedo, Percy Liang, and Sherry Yang. Evaluating robot policies in a world model.arXiv preprint arXiv:2506.00613,

work page arXiv
[38]

Cosmos-drive- dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042,

Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042,

work page arXiv
[39]

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models.arXiv preprint arXiv:2502.19417,

work page internal anchor Pith review Pith/arXiv arXiv
[40]

AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, and Jun Zhu. Anypos: Automated task-agnostic actions for bimanual manipulation.arXiv preprint arXiv:2507.12768,

work page internal anchor Pith review Pith/arXiv arXiv
[41]

Gemini: A Family of Highly Capable Multimodal Models

URL https://www. 1x.tech/1x-world-model.pdf. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,

work page internal anchor Pith review Pith/arXiv arXiv
[42]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges.arXiv preprint arXiv:1812.01717,

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314,

work page internal anchor Pith review Pith/arXiv arXiv
[44]

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855,

work page internal anchor Pith review Pith/arXiv arXiv
[45]

Improvisation through Physical Understanding: Using Novel Objects as Tools with Visual Foresight

14 Published as a conference paper at ICLR 2026 Annie Xie, Frederik Ebert, Sergey Levine, and Chelsea Finn. Improvisation through physical understanding: Using novel objects as tools with visual foresight.arXiv preprint arXiv:1904.05538,

work page internal anchor Pith review Pith/arXiv arXiv 2026
[46]

Learning Interactive Real-World Simulators

Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 1(2):6,

work page internal anchor Pith review arXiv
[47]

Hirt: Enhancing robotic control with hierarchical robot transformers.arXiv preprint arXiv:2410.05273,

Jianke Zhang, Yanjiang Guo, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, and Jianyu Chen. Hirt: Enhancing robotic control with hierarchical robot transformers.arXiv preprint arXiv:2410.05273,

work page arXiv
[48]

Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025a

Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867,

work page arXiv
[49]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705,

work page internal anchor Pith review Pith/arXiv arXiv
[50]

FLARE: Robot Learning with Implicit World Modeling

Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, et al. Flare: Robot learning with implicit world modeling. arXiv preprint arXiv:2505.15659,

work page internal anchor Pith review arXiv
[51]

Flowvla: Visual chain of thought-based motion reasoning for vision-language-action models.arXiv preprint arXiv:2508.18269, 2025

Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Wenxuan Song, Jiayi Chen, and Haoang Li. Flowvla: Thinking in motion with a visual chain of thought.arXiv preprint arXiv:2508.18269,

work page arXiv
[52]

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792,

work page internal anchor Pith review Pith/arXiv arXiv
[53]

Irasim: Learning interactive real-robot action simulators

Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. Irasim: Learning interactive real-robot action simulators.arXiv preprint arXiv:2406.14540,

work page arXiv
[54]

More videos can be found at the anonymous website: https://sites.google.com/view/ ctrl-world

15 Published as a conference paper at ICLR 2026 Code can be found in the Supplementary Materials. More videos can be found at the anonymous website: https://sites.google.com/view/ ctrl-world. A MOREDETAILS FORWORLDMODELLEARNING Model Architecture.Our world model closely follows the architecture of Stable Video Diffusion (SVD) (Blattmann et al., 2023a), an...

work page 2026
[55]

The learning rate is set to be 1e-5, and we train for 100k steps, which takes approximately 2–3 days to complete. B MOREDETAILS FORPOLICYEVALUATION Details on interaction between policy and world model.We directly use the offi- cial π0-DROID, π0-FAST-DROID, and π0.5-DROIDpolicies from https://github.com/ Physical-Intelligence/openpi to interact with Ctrl-...

work page 2026
[56]

Pick up A and place in B

Task details and criterion.In our experiments, we use human annotators to evaluate whether each trajectory is a success or a failure. Although this evaluation process can be automated in the future using large vision-language reward models, our focus in this paper is on the world model itself, so we rely on human preference as the reward signal. We provid...

work page 2026

[1] [1]

Cosmos World Foundation Model Platform for Physical AI

Niket Agarwal, Arslan Ali, Maciej Bala, Yogesh Balaji, Erik Barker, Tiffany Cai, Prithvijit Chat- topadhyay, Yongxin Chen, Yin Cui, Yifan Ding, et al. Cosmos world foundation model platform for physical ai.arXiv preprint arXiv:2501.03575,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Roboarena: Distributed real-world evaluation of generalist robot policies

Pranav Atreya, Karl Pertsch, Tony Lee, Moo Jin Kim, Arhan Jain, Artur Kuramshin, Clemens Eppner, Cyrus Neary, Edward Hu, Fabio Ramos, et al. Roboarena: Distributed real-world evaluation of generalist robot policies.arXiv preprint arXiv:2506.18123,

work page arXiv

[3] [3]

Gen2Act: Human Video Generation in Novel Scenarios enables Generalizable Robot Manipulation

Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation.arXiv preprint arXiv:2409.16283,

work page internal anchor Pith review Pith/arXiv arXiv

[4] [4]

Zero-Shot Robotic Manipulation with Pretrained Image-Editing Diffusion Models

Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pretrained image-editing diffusion models. arXiv preprint arXiv:2310.10639,

work page internal anchor Pith review arXiv

[5] [5]

$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

work page internal anchor Pith review Pith/arXiv arXiv

[6] [6]

Real-Time Execution of Action Chunking Flow Policies

Kevin Black, Manuel Y Galliker, and Sergey Levine. Real-time execution of action chunking flow policies.arXiv preprint arXiv:2506.07339,

work page internal anchor Pith review Pith/arXiv arXiv

[7] [7]

Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram V oleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023a. Andreas Blattmann, Robin Rombach, Huan Ling, Tim Dockhorn, Seung Wook Kim, Sanj...

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Wow: Towards a world omni- scient world model through embodied interaction,

Xiaowei Chi, Peidong Jia, Chun-Kai Fan, Xiaozhu Ju, Weishi Mi, Kevin Zhang, Zhiyuan Qin, Wanxin Tian, Kuangzhi Ge, Hao Li, et al. Wow: Towards a world omniscient world model through embodied interaction.arXiv preprint arXiv:2509.22642,

work page arXiv

[9] [9]

doi:10.48550/arXiv.2505.03912 , abstract =

11 Published as a conference paper at ICLR 2026 Can Cui, Pengxiang Ding, Wenxuan Song, Shuanghao Bai, Xinyang Tong, Zirui Ge, Runze Suo, Wanqi Zhou, Yang Liu, Bofang Jia, et al. Openhelix: A short survey, empirical analysis, and open-source dual-system vla model for robotic manipulation.arXiv preprint arXiv:2505.03912,

work page arXiv 2026

[10] [10]

RoboNet: Large-Scale Multi-Robot Learning

Sudeep Dasari, Frederik Ebert, Stephen Tian, Suraj Nair, Bernadette Bucher, Karl Schmeckpeper, Siddharth Singh, Sergey Levine, and Chelsea Finn. Robonet: Large-scale multi-robot learning. arXiv preprint arXiv:1910.11215,

work page internal anchor Pith review arXiv 1910

[11] [11]

Vision-language models as success detectors,

Yuqing Du, Ksenia Konyushkova, Misha Denil, Akhil Raju, Jessica Landon, Felix Hill, Nando De Freitas, and Serkan Cabi. Vision-language models as success detectors.arXiv preprint arXiv:2303.07280,

work page arXiv

[12] [12]

Visual Foresight: Model-Based Deep Reinforcement Learning for Vision-Based Robotic Control

Frederik Ebert, Chelsea Finn, Sudeep Dasari, Annie Xie, Alex Lee, and Sergey Levine. Visual foresight: Model-based deep reinforcement learning for vision-based robotic control.arXiv preprint arXiv:1812.00568,

work page internal anchor Pith review Pith/arXiv arXiv

[13] [13]

Vidar: Embodied Video Diffusion Model for Generalist Manipulation

Yao Feng, Hengkai Tan, Xinyi Mao, Guodong Liu, Shuhe Huang, Chendong Xiang, Hang Su, and Jun Zhu. Vidar: Embodied video diffusion model for generalist bimanual manipulation.arXiv preprint arXiv:2507.12898,

work page internal anchor Pith review arXiv

[14] [14]

Flip: Flow- centric generative planning as general-purpose manipulation world model,

Chongkai Gao, Haozhuo Zhang, Zhixuan Xu, Zhehao Cai, and Lin Shao. Flip: Flow-centric generative planning as general-purpose manipulation world model.arXiv preprint arXiv:2412.08261,

work page arXiv

[15] [15]

Adaworld: Learning adaptable world models with latent actions.arXiv preprint arXiv:2503.18938, 2025

Shenyuan Gao, Siyuan Zhou, Yilun Du, Jun Zhang, and Chuang Gan. Adaworld: Learning adaptable world models with latent actions.arXiv preprint arXiv:2503.18938,

work page arXiv

[16] [16]

Improving vision-language-action model with online reinforcement learning.arXiv preprint arXiv:2501.16664,

Yanjiang Guo, Jianke Zhang, Xiaoyu Chen, Xiang Ji, Yen-Jen Wang, Yucheng Hu, and Jianyu Chen. Improving vision-language-action model with online reinforcement learning.arXiv preprint arXiv:2501.16664,

work page arXiv

[17] [17]

Dream to Control: Learning Behaviors by Latent Imagination

Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination.arXiv preprint arXiv:1912.01603,

work page internal anchor Pith review Pith/arXiv arXiv 1912

[18] [18]

Mastering Atari with Discrete World Models

Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering atari with discrete world models.arXiv preprint arXiv:2010.02193,

work page internal anchor Pith review Pith/arXiv arXiv 2010

[19] [19]

Training Agents Inside of Scalable World Models

Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models. arXiv preprint arXiv:2509.24527,

work page internal anchor Pith review Pith/arXiv arXiv

[20] [20]

Temporal difference learning for model predictive control.arXiv preprint arXiv:2203.04955, 2022

Nicklas Hansen, Xiaolong Wang, and Hao Su. Temporal difference learning for model predictive control.arXiv preprint arXiv:2203.04955,

work page arXiv

[21] [21]

Pre-trained video generative models as world simulators.CoRR, abs/2502.07825, February 2025

Haoran He, Yang Zhang, Liang Lin, Zhongwen Xu, and Ling Pan. Pre-trained video generative models as world simulators.arXiv preprint arXiv:2502.07825,

work page arXiv

[22] [22]

Image quality metrics: Psnr vs

12 Published as a conference paper at ICLR 2026 Alain Hore and Djemel Ziou. Image quality metrics: Psnr vs. ssim. In2010 20th international conference on pattern recognition, pp. 2366–2369. IEEE,

work page 2026

[23] [23]

Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations.arXiv preprint arXiv:2412.14803,

work page internal anchor Pith review Pith/arXiv arXiv

[24] [24]

Particleformer: A 3d point cloud world model for multi-object, multi-material robotic manipulation.arXiv preprint arXiv:2506.23126,

Suning Huang, Qianzhong Chen, Xiaohan Zhang, Jiankai Sun, and Mac Schwager. Particleformer: A 3d point cloud world model for multi-object, multi-material robotic manipulation.arXiv preprint arXiv:2506.23126,

work page arXiv

[25] [25]

$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, et al. pi {0.5}: a vision-language- action model with open-world generalization.arXiv preprint arXiv:2504.16054,

work page internal anchor Pith review Pith/arXiv arXiv

[26] [26]

arXiv preprint arXiv:2505.09723 (2025)

Yuxin Jiang, Shengcong Chen, Siyuan Huang, Liliang Chen, Pengfei Zhou, Yue Liao, Xindong He, Chiming Liu, Hongsheng Li, Maoqing Yao, et al. Enerverse-ac: Envisioning embodied environments with action condition.arXiv preprint arXiv:2505.09723,

work page arXiv

[27] [27]

DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945,

work page internal anchor Pith review Pith/arXiv arXiv

[28] [28]

OpenVLA: An Open-Source Vision-Language-Action Model

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246,

work page internal anchor Pith review Pith/arXiv arXiv

[29] [29]

Unified Video Action Model

Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025a. Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl Pertsch, Oier Mees, Homer Rich Walke, Chuyuan Fu, Ishikaa Lunawat, Isabel Sieh, Sean Kirmani, et al. Evaluating real-world robot manipulation policies in simulation.arXiv preprint arXiv:2405.05941,

work page internal anchor Pith review Pith/arXiv arXiv

[30] [30]

Worldeval: World model as real-world robot policies evaluator.arXiv preprint arXiv:2505.19017, 2025

Yaxuan Li, Yichen Zhu, Junjie Wen, Chaomin Shen, and Yi Xu. Worldeval: World model as real-world robot policies evaluator.arXiv preprint arXiv:2505.19017, 2025b. Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl V ondrick. Dreamitate: Real-world visuomotor policy learning via video generation. arX...

work page arXiv

[31] [31]

Genie Envisioner: A Unified World Foundation Platform for Robotic Manipulation

Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation.arXiv preprint arXiv:2508.05635,

work page internal anchor Pith review Pith/arXiv arXiv

[32] [32]

HybridVLA: Collaborative Diffusion and Autoregression in a Unified Vision-Language-Action Model

Jiaming Liu, Hao Chen, Pengju An, Zhuoyang Liu, Renrui Zhang, Chenyang Gu, Xiaoqi Li, Ziyu Guo, Sixiang Chen, Mengzhen Liu, et al. Hybridvla: Collaborative diffusion and autoregression in a unified vision-language-action model.arXiv preprint arXiv:2503.10631,

work page internal anchor Pith review Pith/arXiv arXiv

[33] [33]

RDT-1B: a Diffusion Foundation Model for Bimanual Manipulation

Songming Liu, Lingxuan Wu, Bangguo Li, Hengkai Tan, Huayu Chen, Zhengyi Wang, Ke Xu, Hang Su, and Jun Zhu. Rdt-1b: a diffusion foundation model for bimanual manipulation.arXiv preprint arXiv:2410.07864,

work page internal anchor Pith review Pith/arXiv arXiv

[34] [34]

Foundation reward models for general robot skill acquisition

Yecheng Jason Ma. Foundation reward models for general robot skill acquisition. InRobotics: Science and Systems-Pioneers Workshop 2025,

work page 2025

[35] [35]

Deep dynamics models for learning dexterous manipulation

13 Published as a conference paper at ICLR 2026 Anusha Nagabandi, Kurt Konolige, Sergey Levine, and Vikash Kumar. Deep dynamics models for learning dexterous manipulation. InConference on robot learning, pp. 1101–1112. PMLR,

work page 2026

[36] [36]

FAST: Efficient Action Tokenization for Vision-Language-Action Models

Karl Pertsch, Kyle Stachowicz, Brian Ichter, Danny Driess, Suraj Nair, Quan Vuong, Oier Mees, Chelsea Finn, and Sergey Levine. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747,

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

Quevedo, A

Julian Quevedo, Percy Liang, and Sherry Yang. Evaluating robot policies in a world model.arXiv preprint arXiv:2506.00613,

work page arXiv

[38] [38]

Cosmos-drive- dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042,

Xuanchi Ren, Yifan Lu, Tianshi Cao, Ruiyuan Gao, Shengyu Huang, Amirmojtaba Sabour, Tianchang Shen, Tobias Pfaff, Jay Zhangjie Wu, Runjian Chen, et al. Cosmos-drive-dreams: Scalable synthetic driving data generation with world foundation models.arXiv preprint arXiv:2506.09042,

work page arXiv

[39] [39]

Hi Robot: Open-Ended Instruction Following with Hierarchical Vision-Language-Action Models

Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi robot: Open-ended instruction following with hierarchical vision-language-action models.arXiv preprint arXiv:2502.19417,

work page internal anchor Pith review Pith/arXiv arXiv

[40] [40]

AnyPos: Automated Task-Agnostic Actions for Bimanual Manipulation

Hengkai Tan, Yao Feng, Xinyi Mao, Shuhe Huang, Guodong Liu, Zhongkai Hao, Hang Su, and Jun Zhu. Anypos: Automated task-agnostic actions for bimanual manipulation.arXiv preprint arXiv:2507.12768,

work page internal anchor Pith review Pith/arXiv arXiv

[41] [41]

Gemini: A Family of Highly Capable Multimodal Models

URL https://www. 1x.tech/1x-world-model.pdf. Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,

work page internal anchor Pith review Pith/arXiv arXiv

[42] [42]

Towards Accurate Generative Models of Video: A New Metric & Challenges

Thomas Unterthiner, Sjoerd Van Steenkiste, Karol Kurach, Raphael Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges.arXiv preprint arXiv:1812.01717,

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

Wan: Open and Advanced Large-Scale Video Generative Models

Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314,

work page internal anchor Pith review Pith/arXiv arXiv

[44] [44]

DexVLA: Vision-Language Model with Plug-In Diffusion Expert for General Robot Control

Junjie Wen, Yichen Zhu, Jinming Li, Zhibin Tang, Chaomin Shen, and Feifei Feng. Dexvla: Vision-language model with plug-in diffusion expert for general robot control.arXiv preprint arXiv:2502.05855,

work page internal anchor Pith review Pith/arXiv arXiv

[45] [45]

Improvisation through Physical Understanding: Using Novel Objects as Tools with Visual Foresight

14 Published as a conference paper at ICLR 2026 Annie Xie, Frederik Ebert, Sergey Levine, and Chelsea Finn. Improvisation through physical understanding: Using novel objects as tools with visual foresight.arXiv preprint arXiv:1904.05538,

work page internal anchor Pith review Pith/arXiv arXiv 2026

[46] [46]

Learning Interactive Real-World Simulators

Mengjiao Yang, Yilun Du, Kamyar Ghasemipour, Jonathan Tompson, Dale Schuurmans, and Pieter Abbeel. Learning interactive real-world simulators.arXiv preprint arXiv:2310.06114, 1(2):6,

work page internal anchor Pith review arXiv

[47] [47]

Hirt: Enhancing robotic control with hierarchical robot transformers.arXiv preprint arXiv:2410.05273,

Jianke Zhang, Yanjiang Guo, Xiaoyu Chen, Yen-Jen Wang, Yucheng Hu, Chengming Shi, and Jianyu Chen. Hirt: Enhancing robotic control with hierarchical robot transformers.arXiv preprint arXiv:2410.05273,

work page arXiv

[48] [48]

Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867, 2025a

Jianke Zhang, Yanjiang Guo, Yucheng Hu, Xiaoyu Chen, Xiang Zhu, and Jianyu Chen. Up-vla: A unified understanding and prediction model for embodied agent.arXiv preprint arXiv:2501.18867,

work page arXiv

[49] [49]

Learning Fine-Grained Bimanual Manipulation with Low-Cost Hardware

Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware.arXiv preprint arXiv:2304.13705,

work page internal anchor Pith review Pith/arXiv arXiv

[50] [50]

FLARE: Robot Learning with Implicit World Modeling

Ruijie Zheng, Jing Wang, Scott Reed, Johan Bjorck, Yu Fang, Fengyuan Hu, Joel Jang, Kaushil Kundalia, Zongyu Lin, Loic Magne, et al. Flare: Robot learning with implicit world modeling. arXiv preprint arXiv:2505.15659,

work page internal anchor Pith review arXiv

[51] [51]

Flowvla: Visual chain of thought-based motion reasoning for vision-language-action models.arXiv preprint arXiv:2508.18269, 2025

Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Wenxuan Song, Jiayi Chen, and Haoang Li. Flowvla: Thinking in motion with a visual chain of thought.arXiv preprint arXiv:2508.18269,

work page arXiv

[52] [52]

Unified World Models: Coupling Video and Action Diffusion for Pretraining on Large Robotic Datasets

Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. arXiv preprint arXiv:2504.02792,

work page internal anchor Pith review Pith/arXiv arXiv

[53] [53]

Irasim: Learning interactive real-robot action simulators

Fangqi Zhu, Hongtao Wu, Song Guo, Yuxiao Liu, Chilam Cheang, and Tao Kong. Irasim: Learning interactive real-robot action simulators.arXiv preprint arXiv:2406.14540,

work page arXiv

[54] [54]

More videos can be found at the anonymous website: https://sites.google.com/view/ ctrl-world

15 Published as a conference paper at ICLR 2026 Code can be found in the Supplementary Materials. More videos can be found at the anonymous website: https://sites.google.com/view/ ctrl-world. A MOREDETAILS FORWORLDMODELLEARNING Model Architecture.Our world model closely follows the architecture of Stable Video Diffusion (SVD) (Blattmann et al., 2023a), an...

work page 2026

[55] [55]

The learning rate is set to be 1e-5, and we train for 100k steps, which takes approximately 2–3 days to complete. B MOREDETAILS FORPOLICYEVALUATION Details on interaction between policy and world model.We directly use the offi- cial π0-DROID, π0-FAST-DROID, and π0.5-DROIDpolicies from https://github.com/ Physical-Intelligence/openpi to interact with Ctrl-...

work page 2026

[56] [56]

Pick up A and place in B

Task details and criterion.In our experiments, we use human annotators to evaluate whether each trajectory is a success or a failure. Although this evaluation process can be automated in the future using large vision-language reward models, our focus in this paper is on the world model itself, so we rely on human preference as the reward signal. We provid...

work page 2026