World Action Models are Zero-shot Policies
Pith reviewed 2026-05-11 16:10 UTC · model grok-4.3
The pith
World Action Models serve as zero-shot policies by jointly predicting future video states and actions from heterogeneous data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
World Action Models are zero-shot policies because they learn physical dynamics by predicting future world states, represented densely as video, together with the corresponding actions. Trained jointly on heterogeneous robot datasets, they acquire diverse skills without requiring repetitive demonstrations of individual skills.
What carries the argument
DreamZero, a World Action Model built on a pretrained video diffusion backbone that autoregressively models future video frames and actions together to produce closed-loop robot control.
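To make that control flow concrete, here is a minimal sketch of the closed loop such a model implies, assuming a hypothetical `wam.sample(history, instruction)` interface that jointly denoises a short clip of future frames plus an aligned action chunk. The names and the chunked-execution pattern are illustrative assumptions, not DreamZero's actual API.

```python
# Minimal sketch of a world-action-model control loop (hypothetical interface,
# not DreamZero's actual implementation).
from collections import deque

def run_wam_policy(wam, robot, instruction, max_steps=200, context_frames=8):
    """Closed loop: jointly predict (future video, action chunk), execute, re-observe."""
    history = deque(maxlen=context_frames)        # recent camera frames as conditioning
    history.append(robot.get_observation())

    steps = 0
    while steps < max_steps:
        # Joint denoising: the imagined future frames and the action chunk are
        # sampled together, so actions stay consistent with the predicted rollout.
        future_video, action_chunk = wam.sample(list(history), instruction)

        for action in action_chunk:               # execute the chunk...
            robot.apply_action(action)
            steps += 1

        history.append(robot.get_observation())   # ...then close the loop on a real frame
    return history
```

The design choice reflected in the sketch is that the predicted video is not discarded scaffolding: the action chunk is drawn from the same joint sample, which is what distinguishes a World Action Model from a policy that merely conditions on pretrained video features.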
If this is right
- Yields more than twice the generalization success rate on new tasks and environments versus state-of-the-art vision-language-action models in real-robot trials.
- Enables real-time closed-loop control at 7 Hz from a 14-billion-parameter autoregressive video diffusion model after targeted optimizations (a latency-budget sketch follows this list).
- Video-only demonstrations from other robots or humans deliver over 42 percent relative improvement on unseen tasks using only 10-20 minutes of additional data.
- Supports few-shot embodiment adaptation: 30 minutes of play data on a new robot body suffices for transfer while zero-shot generalization to new tasks is retained.
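For a sense of what the 7 Hz real-time claim requires of a 14B model, the sketch below works through the latency budget, assuming each model call emits a chunk of several actions executed back to back; the chunk size and latency numbers are illustrative assumptions, since the abstract does not state them.

```python
# Back-of-the-envelope budget for 7 Hz closed-loop control.
# Chunk size and latency values are illustrative, not taken from the paper.
CONTROL_RATE_HZ = 7.0          # stated closed-loop control rate
ACTIONS_PER_CHUNK = 8          # hypothetical actions emitted per model call

step_budget_s = 1.0 / CONTROL_RATE_HZ                 # ~0.143 s per executed action
chunk_budget_s = ACTIONS_PER_CHUNK * step_budget_s    # ~1.14 s per model call

def keeps_up(model_latency_s: float) -> bool:
    """Real-time holds if one joint video+action prediction fits in its chunk budget."""
    return model_latency_s <= chunk_budget_s

print(f"per-step budget {step_budget_s:.3f} s, per-chunk budget {chunk_budget_s:.3f} s")
print(keeps_up(0.9))   # True under these illustrative numbers
```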
Where Pith is reading between the lines
- Treating future video as the primary training signal may reduce dependence on curated action labels across many robotics domains.
- The approach could extend to settings like autonomous navigation where scene evolution is more important than discrete commands.
- Rapid embodiment transfer with minimal data suggests that large video models can serve as reusable physical priors for many hardware platforms.
- Performance gains arise from dense visual supervision rather than from language or sparse reward signals.
Load-bearing premise
The assumption that video and action pairs drawn from mixed robot sources already contain enough information about physical dynamics to generalize to motions and environments never seen in training.
What would settle it
A controlled test showing that the model cannot produce correct actions or accurate future video predictions for a physically novel interaction, such as manipulating an object type or surface never present in the training videos.
Original abstract
State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This results in over 2x improvement in generalization to new tasks and environments compared to state-of-the-art VLAs in real robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7Hz. Finally, we demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance with just 10-20 minutes of data. More surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces DreamZero, a World Action Model (WAM) built upon a pretrained 14B autoregressive video diffusion backbone. Unlike Vision-Language-Action (VLA) models, it jointly predicts future video states and actions to learn physical dynamics from heterogeneous robot data without repetitive demonstrations. The paper claims over 2x improvement in generalization to new tasks and environments versus SOTA VLAs in real-robot experiments, real-time closed-loop control at 7 Hz, and two forms of cross-embodiment transfer (video-only demos yielding >42% relative improvement with 10-20 minutes of data; few-shot adaptation to new embodiments with 30 minutes of play data while retaining zero-shot performance).
Significance. If the empirical results hold under rigorous controls, this would be a notable contribution to robot learning by showing that video-based world modeling can deliver substantially better zero-shot generalization and embodiment transfer than current VLAs, while achieving practical real-time inference speeds. The concrete real-robot numbers and cross-embodiment results provide tangible evidence of impact for reducing data requirements in robotics.
major comments (3)
- [Abstract] The headline claim of 'over 2x improvement in generalization to new tasks and environments' is load-bearing for the central thesis but provides no details on evaluation metrics, exact baselines, number of trials, statistical significance, or how test environments isolate novel physical dynamics (e.g., changed mass/friction) from visual or semantic changes. This prevents verification that gains arise from the WAM dynamics modeling rather than other factors.
- [Abstract] The statement that the model 'learns physical dynamics' and generalizes to 'unseen motions and environments' from heterogeneous data 'without relying on repetitive demonstrations' requires supporting evidence such as dataset statistics on motion diversity, ablations removing repetitive patterns, or explicit tests with altered dynamics parameters; without these, the weakest assumption identified in the stress-test remains unaddressed.
- [Abstract] The practicality claim that 'model and system optimizations' enable a 14B model to run real-time closed-loop control at 7 Hz is central to the contribution but does not specify the optimizations (e.g., distillation, caching, hardware acceleration), making it impossible to assess reproducibility or the extent to which the result depends on engineering rather than the WAM formulation.
minor comments (1)
- [Abstract] The introduction of the term 'World Action Model (WAM)' would benefit from a clearer positioning against prior video-prediction or world-model literature to highlight novelty.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and address each major comment below. We agree that the abstract can be strengthened for clarity; the supporting details are already in the full manuscript, and we will revise the abstract accordingly.
Point-by-point responses
-
Referee: The headline claim of 'over 2x improvement in generalization to new tasks and environments' is load-bearing for the central thesis but provides no details on evaluation metrics, exact baselines, number of trials, statistical significance, or how test environments isolate novel physical dynamics (e.g., changed mass/friction) from visual or semantic changes. This prevents verification that gains arise from the WAM dynamics modeling rather than other factors.
Authors: We agree the abstract would benefit from added context on these points. The full manuscript (Experiments section) details the metrics (task success rates), baselines (state-of-the-art VLAs), trial counts, statistical tests, and environment designs that vary physical parameters such as mass and friction while controlling for visual and semantic factors. We will revise the abstract to briefly reference these evaluation controls. revision: yes
-
Referee: The statement that the model 'learns physical dynamics' and generalizes to 'unseen motions and environments' from heterogeneous data 'without relying on repetitive demonstrations' requires supporting evidence such as dataset statistics on motion diversity, ablations removing repetitive patterns, or explicit tests with altered dynamics parameters; without these, the weakest assumption identified in the stress-test remains unaddressed.
Authors: The manuscript provides dataset statistics on motion diversity across heterogeneous sources, ablations demonstrating the value of non-repetitive data, and explicit tests with altered dynamics parameters. We will add a concise reference to these supporting analyses in the abstract. revision: yes
-
Referee: The practicality claim that 'model and system optimizations' enable a 14B model to run real-time closed-loop control at 7 Hz is central to the contribution but does not specify the optimizations (e.g., distillation, caching, hardware acceleration), making it impossible to assess reproducibility or the extent to which the result depends on engineering rather than the WAM formulation.
Authors: We agree that specifying the optimizations would improve the abstract. The full paper describes the model and system optimizations (including distillation, caching, and hardware acceleration) that enable 7 Hz inference. We will update the abstract to name the primary optimizations for reproducibility. revision: yes
Circularity Check
No circularity; empirical claims rest on robot experiments
Full rationale
The paper introduces DreamZero as a World Action Model trained on heterogeneous robot data to predict video and actions, with all headline results (2x generalization, 7Hz control, cross-embodiment transfer) presented as outcomes of real-robot experiments rather than any mathematical derivation. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The central premise that joint video-action modeling yields transferable physical dynamics is tested externally via held-out tasks and embodiments, making the evaluation chain self-contained and falsifiable outside the paper's own definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Video is a dense representation of how the world evolves under actions.
invented entities (1)
- World Action Model (WAM): no independent evidence
Forward citations
Cited by 43 Pith papers
-
Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics
Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.
-
From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation
MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models
NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.
-
OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation
OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.
-
EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields
EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.
-
Being-H0.7: A Latent World-Action Model from Egocentric Videos
Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.
-
${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities
π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.
-
ViVa: A Video-Generative Value Model for Robot Reinforcement Learning
ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.
-
MoRight: Motion Control Done Right
MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...
-
Action Images: End-to-End Policy Learning via Multiview Video Generation
Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.
-
RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data
A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.
-
Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
-
The DAWN of World-Action Interactive Models
DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other recursively, yielding strong planning and safety results on autonomous driving benchmarks.
-
ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models
ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.
-
ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models
ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.
-
When to Trust Imagination: Adaptive Action Execution for World Action Models
Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...
-
MotuBrain: An Advanced World Action Model for Robot Control
MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.
-
Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising
X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...
-
Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation
MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.
-
FASTER: Value-Guided Sampling for Fast RL
FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.
-
Human Cognition in Machines: A Unified Perspective of World Models
The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...
-
Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models
Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.
-
AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps
AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.
-
DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks
CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.
-
VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis
VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.
-
SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds
SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...
-
Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?
Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.
-
Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model
MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.
-
Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms
Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.
-
Fast-WAM: Do World Action Models Need Test-time Future Imagination?
Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.
-
Nautilus: From One Prompt to Plug-and-Play Robot Learning
NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.
-
CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models
CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.
-
VLA Foundry: A Unified Framework for Training Vision-Language-Action Models
VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.
-
World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems
The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.
-
Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection
A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.
-
World Action Models: The Next Frontier in Embodied AI
The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.
-
RLDX-1 Technical Report
RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.
-
RLDX-1 Technical Report
RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.
-
ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents
ABot-Claw is an embodied software layer that adds unified robot scheduling, cross-embodiment visual memory, and critic-driven replanning on top of OpenClaw to support persistent multi-robot execution from natural-lang...
-
Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory
Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...
discussion (0)