Large Video Planner Enables Generalizable Robot Control
General-purpose robots require decision-making models that generalize across diverse tasks and environments. Recent works build robot foundation models by extending multimodal large language models (MLLMs) with action outputs, creating vision-language-action (VLA) systems. These efforts are motivated by the intuition that MLLMs' large-scale language and image pretraining can be effectively transferred to the action output modality. In this work, we explore an alternative paradigm of using large-scale video pretraining as a primary modality for building robot foundation models. Unlike static images and language, videos capture spatio-temporal sequences of states and actions in the physical world that are naturally aligned with robotic behavior. We curate an internet-scale video dataset of human activities and task demonstrations, and train, for the first time at a foundation-model scale, an open video model for generative robotics planning. The model produces zero-shot video plans for novel scenes and tasks, which we post-process to extract executable robot actions. We evaluate task-level generalization through third-party selected tasks in the wild and real-robot experiments, demonstrating successful physical execution. Together, these results show robust instruction following, strong generalization, and real-world feasibility. We release both the model and dataset to support open, reproducible video-based robot learning. Our website is available at https://www.boyuan.space/large-video-planner/.
Forward citations
Cited by 13 Pith papers
- From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
  MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
- DeVI: Physics-based Dexterous Human-Object Interaction via Synthetic Video Imitation
  DeVI enables zero-shot physically plausible dexterous control by imitating synthetic videos via a hybrid 3D-human plus 2D-object tracking reward.
- RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation
  RoboWM-Bench evaluates video world models by converting their outputs into executable robot actions and running them on manipulation tasks, showing that physical inconsistencies remain common.
- Human Cognition in Machines: A Unified Perspective of World Models
  The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
- From Seeing to Simulating: Generative High-Fidelity Simulation with Digital Cousins for Generalizable Robot Learning and Evaluation
  Digital Cousins is a generative real-to-sim method that creates diverse high-fidelity simulation scenes from real panoramas to improve generalization in robot learning and evaluation.
- EgoVerse: An Egocentric Human Dataset for Robot Learning from Around the World
  EgoVerse releases 1,362 hours of standardized egocentric human data across 1,965 tasks and shows via multi-lab experiments that robot policy performance scales with human data volume when the data aligns with robot ob...
- Grasp as You Dream: Imitating Functional Grasping from Generated Human Demonstrations
  GraspDreamer synthesizes human functional grasping demonstrations with visual generative models to enable zero-shot robot grasping with improved data efficiency and generalization.
- Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
  Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
- World Action Models are Zero-shot Policies
  DreamZero uses a 14B video diffusion model as a World Action Model to achieve over 2x better zero-shot generalization on real robots than state-of-the-art VLAs, real-time 7Hz closed-loop control, and cross-embodiment ...
- Embody4D: A Generalist 4D World Model for Embodied AI
  Embody4D generates high-fidelity, view-consistent novel views from monocular videos for embodied scenarios via 3D-aware data synthesis, adaptive noise injection, and interaction-aware attention.
- Visual Generation in the New Era: An Evolution from Atomic Mapping to Agentic World Modeling
  Visual generation models are evolving from passive renderers to interactive agentic world modelers, but current systems lack spatial reasoning, temporal consistency, and causal understanding, with evaluations overemph...
- World Action Models: The Next Frontier in Embodied AI
  The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
- World Model for Robot Learning: A Comprehensive Survey
  A comprehensive survey that organizes the literature on world models in robot learning, their roles in policy learning, planning, simulation, and video-based generation, with connections to navigation, driving, datase...