pith. machine review for the scientific record.

arxiv: 2602.15922 · v1 · submitted 2026-02-17 · 💻 cs.RO · cs.CV · cs.LG

Recognition: 2 theorem links

World Action Models are Zero-shot Policies

Avnish Narayan, Ayaan Malik, Chuning Zhu, Danfei Xu, Dantong Niu, Fengyuan Hu, George Kurian, Guanzhi Wang, Gwanghyun Kim, Jan Kautz, Jiannan Xiang, Jiasheng Gu, Jimmy Wu, Jing Wang, Joel Jang, Johan Bjorck, Kaiyuan Zheng, Kyungmin Lee, Linxi "Jim" Fan, Nadun Ranawaka, Qi Wang, Ruijie Zheng, Ryan Julian, Scott Reed, Seonghyeon Ye, Shenyuan Gao, Sihyun Yu, Suneel Indupuru, William Liang, Yevgen Chebotar, Yilun Du, Yinzhen Xu, You Liang Tan, Yuke Zhu, Yunhao Ge, Yuqi Xie

Pith reviewed 2026-05-11 16:10 UTC · model grok-4.3

classification 💻 cs.RO · cs.CV · cs.LG
keywords World Action Models · video diffusion · zero-shot robot policies · vision-language-action models · cross-embodiment transfer · physical dynamics · real-time robot control · heterogeneous robot data

The pith

World Action Models serve as zero-shot policies by jointly predicting future video states and actions from heterogeneous data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents DreamZero as a World Action Model that learns physical dynamics directly from video and action pairs instead of relying on language instructions or repetitive demonstrations. By building on a pretrained video diffusion model, it treats video as a dense signal of how the world changes under actions, allowing the system to generalize across new tasks, environments, and even robot bodies. This produces more than double the success rate on unseen scenarios compared with prior vision-language-action approaches while still running closed-loop control in real time. The same model further supports rapid transfer: video demonstrations from other robots or humans boost performance on new tasks, and only thirty minutes of play data suffice to adapt to an entirely new embodiment without losing zero-shot capability. The central insight is that modeling the visual evolution of the world supplies the physical knowledge needed for policies that work without task-specific retraining.

Core claim

World Action Models are zero-shot policies because they learn physical dynamics by predicting future world states, represented densely as video, together with the corresponding actions, trained jointly on heterogeneous robot datasets without requiring repetitive demonstrations of individual skills.

What carries the argument

DreamZero, a World Action Model built on a pretrained video diffusion backbone, which autoregressively models future video frames and actions as a joint distribution to produce closed-loop robot control.
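A minimal sketch of the joint autoregressive rollout described above: given a context encoding of observed frames, the model denoises a next-frame latent and an action chunk together. All dimensions, names, and the linear stand-in for the diffusion denoiser are illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

LATENT_DIM = 16   # toy video-frame latent size (hypothetical)
ACTION_DIM = 7    # e.g. a 7-DoF arm command (hypothetical)
CHUNK = 4         # actions predicted per model call (hypothetical)

# Toy joint "denoiser": one linear map over the concatenated
# [context latent, noisy next latent, noisy action chunk] vector.
# A real WAM would be a large autoregressive video diffusion transformer.
IN_DIM = LATENT_DIM + LATENT_DIM + ACTION_DIM * CHUNK
W = rng.normal(scale=0.05, size=(IN_DIM, LATENT_DIM + ACTION_DIM * CHUNK))

def denoise_step(context, noisy_latent, noisy_actions):
    """One joint denoising step over the next-frame latent AND the action chunk."""
    x = np.concatenate([context, noisy_latent, noisy_actions.ravel()])
    out = x @ W
    return out[:LATENT_DIM], out[LATENT_DIM:].reshape(CHUNK, ACTION_DIM)

def wam_rollout(context, n_denoise=8):
    """Jointly sample a next-frame latent and an action chunk from noise."""
    latent = rng.normal(size=LATENT_DIM)
    actions = rng.normal(size=(CHUNK, ACTION_DIM))
    for _ in range(n_denoise):
        d_latent, d_actions = denoise_step(context, latent, actions)
        latent = latent + 0.1 * (d_latent - latent)     # toy update rule
        actions = actions + 0.1 * (d_actions - actions)
    return latent, actions

context = rng.normal(size=LATENT_DIM)  # stands in for encoded past frames
next_latent, action_chunk = wam_rollout(context)
print(next_latent.shape, action_chunk.shape)  # (16,) (4, 7)
```

The point of the sketch is the coupling: video latent and actions share one denoising pass, so the action head inherits whatever dynamics the video prediction has learned.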

If this is right

  • Yields more than twice the generalization success rate on new tasks and environments versus state-of-the-art vision-language-action models in real-robot trials.
  • Enables real-time closed-loop control at 7 Hz from a 14-billion-parameter autoregressive video diffusion model after targeted optimizations.
  • Video-only demonstrations from other robots or humans deliver over 42 percent relative improvement on unseen tasks using only 10-20 minutes of additional data.
  • Supports few-shot embodiment adaptation: 30 minutes of play data on a new robot body suffices for transfer while zero-shot generalization to new tasks is retained.
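A back-of-envelope on the 7 Hz claim: a 14B diffusion model cannot be invoked per control tick, but predicting a chunk of actions per call amortizes one slow inference across several ticks. The chunk size, model latency, and overhead below are assumed numbers for illustration, not figures from the paper.

```python
CONTROL_HZ = 7.0          # closed-loop rate reported in the abstract
CHUNK = 4                 # hypothetical actions returned per model call
MODEL_LATENCY_S = 0.45    # hypothetical post-optimization inference time
STEP_OVERHEAD_S = 0.01    # hypothetical per-action robot I/O overhead

def sustainable_rate(chunk, model_latency_s, step_overhead_s):
    """Control rate when one model call is amortized over a chunk of actions."""
    return chunk / (model_latency_s + chunk * step_overhead_s)

rate = sustainable_rate(CHUNK, MODEL_LATENCY_S, STEP_OVERHEAD_S)
print(f"sustainable control rate: {rate:.1f} Hz")  # ~8.2 Hz under these assumptions
assert rate >= CONTROL_HZ
```

Under these assumptions a sub-half-second inference clears 7 Hz; whether the paper's actual optimizations achieve comparable latency is exactly what the referee report below asks to be specified.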

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Treating future video as the primary training signal may reduce dependence on curated action labels across many robotics domains.
  • The approach could extend to settings like autonomous navigation where scene evolution is more important than discrete commands.
  • Rapid embodiment transfer with minimal data suggests that large video models can serve as reusable physical priors for many hardware platforms.
  • Performance gains arise from dense visual supervision rather than from language or sparse reward signals.

Load-bearing premise

The assumption that video and action pairs drawn from mixed robot sources already contain enough information about physical dynamics to generalize to motions and environments never seen in training.

What would settle it

A controlled test showing that the model cannot produce correct actions or accurate future video predictions for a physically novel interaction, such as manipulating an object type or surface never present in the training videos.
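The proposed settling test can be framed as a simple evaluation harness: measure success on familiar interactions versus physically novel ones and check whether performance collapses. Everything here (function names, the margin, the toy policy) is a hypothetical sketch of the protocol, not an evaluation from the paper.

```python
def falsification_probe(policy, seen_episodes, novel_episodes, margin=0.5):
    """Compare success on familiar vs physically novel interactions.

    `policy` maps an episode spec to True/False (task success). The
    load-bearing premise is challenged if success collapses on episodes
    whose physics (object type, surface, mass) never appeared in training.
    """
    seen = sum(map(policy, seen_episodes)) / len(seen_episodes)
    novel = sum(map(policy, novel_episodes)) / len(novel_episodes)
    return {"seen": seen, "novel": novel,
            "premise_holds": novel >= margin * seen}

# Toy stand-in policy: succeeds unless the episode is tagged as novel physics.
toy_policy = lambda ep: not ep.get("novel_physics", False)
report = falsification_probe(
    toy_policy,
    seen_episodes=[{"task": "pick"}] * 10,
    novel_episodes=[{"task": "pick", "novel_physics": True}] * 10,
)
print(report)  # this toy policy fails the probe: novel success drops to 0.0
```

A real version of this probe would also score the accuracy of the predicted future video on the novel episodes, since the premise concerns the dynamics model, not only the actions.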

read the original abstract

State-of-the-art Vision-Language-Action (VLA) models excel at semantic generalization but struggle to generalize to unseen physical motions in novel environments. We introduce DreamZero, a World Action Model (WAM) built upon a pretrained video diffusion backbone. Unlike VLAs, WAMs learn physical dynamics by predicting future world states and actions, using video as a dense representation of how the world evolves. By jointly modeling video and action, DreamZero learns diverse skills effectively from heterogeneous robot data without relying on repetitive demonstrations. This results in over 2x improvement in generalization to new tasks and environments compared to state-of-the-art VLAs in real robot experiments. Crucially, through model and system optimizations, we enable a 14B autoregressive video diffusion model to perform real-time closed-loop control at 7Hz. Finally, we demonstrate two forms of cross-embodiment transfer: video-only demonstrations from other robots or humans yield a relative improvement of over 42% on unseen task performance with just 10-20 minutes of data. More surprisingly, DreamZero enables few-shot embodiment adaptation, transferring to a new embodiment with only 30 minutes of play data while retaining zero-shot generalization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces DreamZero, a World Action Model (WAM) built upon a pretrained 14B autoregressive video diffusion backbone. Unlike Vision-Language-Action (VLA) models, it jointly predicts future video states and actions to learn physical dynamics from heterogeneous robot data without repetitive demonstrations. The paper claims over 2x improvement in generalization to new tasks and environments versus SOTA VLAs in real-robot experiments, real-time closed-loop control at 7 Hz, and two forms of cross-embodiment transfer (video-only demos yielding >42% relative improvement with 10-20 minutes of data; few-shot adaptation to new embodiments with 30 minutes of play data while retaining zero-shot performance).

Significance. If the empirical results hold under rigorous controls, this would be a notable contribution to robot learning by showing that video-based world modeling can deliver substantially better zero-shot generalization and embodiment transfer than current VLAs, while achieving practical real-time inference speeds. The concrete real-robot numbers and cross-embodiment results provide tangible evidence of impact for reducing data requirements in robotics.

major comments (3)
  1. [Abstract] Abstract: The headline claim of 'over 2x improvement in generalization to new tasks and environments' is load-bearing for the central thesis but provides no details on evaluation metrics, exact baselines, number of trials, statistical significance, or how test environments isolate novel physical dynamics (e.g., changed mass/friction) from visual or semantic changes. This prevents verification that gains arise from the WAM dynamics modeling rather than other factors.
  2. [Abstract] Abstract: The statement that the model 'learns physical dynamics' and generalizes to 'unseen motions and environments' from heterogeneous data 'without relying on repetitive demonstrations' requires supporting evidence such as dataset statistics on motion diversity, ablations removing repetitive patterns, or explicit tests with altered dynamics parameters; without these, the weakest assumption identified in the stress-test remains unaddressed.
  3. [Abstract] Abstract: The practicality claim that 'model and system optimizations' enable a 14B model to run real-time closed-loop control at 7 Hz is central to the contribution but does not specify the optimizations (e.g., distillation, caching, hardware acceleration), making it impossible to assess reproducibility or the extent to which the result depends on engineering rather than the WAM formulation.
minor comments (1)
  1. [Abstract] The introduction of the term 'World Action Model (WAM)' would benefit from a clearer positioning against prior video-prediction or world-model literature to highlight novelty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and agree that the abstract can be strengthened for clarity while the full manuscript already contains the supporting details. We will revise the abstract accordingly.

read point-by-point responses
  1. Referee: The headline claim of 'over 2x improvement in generalization to new tasks and environments' is load-bearing for the central thesis but provides no details on evaluation metrics, exact baselines, number of trials, statistical significance, or how test environments isolate novel physical dynamics (e.g., changed mass/friction) from visual or semantic changes. This prevents verification that gains arise from the WAM dynamics modeling rather than other factors.

    Authors: We agree the abstract would benefit from added context on these points. The full manuscript (Experiments section) details the metrics (task success rates), baselines (state-of-the-art VLAs), trial counts, statistical tests, and environment designs that vary physical parameters such as mass and friction while controlling for visual and semantic factors. We will revise the abstract to briefly reference these evaluation controls. revision: yes

  2. Referee: The statement that the model 'learns physical dynamics' and generalizes to 'unseen motions and environments' from heterogeneous data 'without relying on repetitive demonstrations' requires supporting evidence such as dataset statistics on motion diversity, ablations removing repetitive patterns, or explicit tests with altered dynamics parameters; without these, the weakest assumption identified in the stress-test remains unaddressed.

    Authors: The manuscript provides dataset statistics on motion diversity across heterogeneous sources, ablations demonstrating the value of non-repetitive data, and explicit tests with altered dynamics parameters. We will add a concise reference to these supporting analyses in the abstract. revision: yes

  3. Referee: The practicality claim that 'model and system optimizations' enable a 14B model to run real-time closed-loop control at 7 Hz is central to the contribution but does not specify the optimizations (e.g., distillation, caching, hardware acceleration), making it impossible to assess reproducibility or the extent to which the result depends on engineering rather than the WAM formulation.

    Authors: We agree that specifying the optimizations would improve the abstract. The full paper describes the model and system optimizations (including distillation, caching, and hardware acceleration) that enable 7 Hz inference. We will update the abstract to name the primary optimizations for reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical claims rest on robot experiments

full rationale

The paper introduces DreamZero as a World Action Model trained on heterogeneous robot data to predict video and actions, with all headline results (2x generalization, 7Hz control, cross-embodiment transfer) presented as outcomes of real-robot experiments rather than any mathematical derivation. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations appear in the provided text. The central premise that joint video-action modeling yields transferable physical dynamics is tested externally via held-out tasks and embodiments, making the evaluation chain self-contained and falsifiable outside the paper's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the premise that video serves as a sufficient dense representation of physical dynamics and that heterogeneous data can be used without task-specific repetition. No explicit free parameters are named in the abstract, but the 14B model size and diffusion backbone choice function as implicit design decisions.

axioms (1)
  • domain assumption Video is a dense representation of how the world evolves under actions.
    Stated directly in the abstract as the basis for learning physical dynamics.
invented entities (1)
  • World Action Model (WAM) no independent evidence
    purpose: A model that jointly predicts future video states and actions to serve as a zero-shot policy.
    New framing introduced in the paper; no independent evidence provided beyond the reported experiments.

pith-pipeline@v0.9.0 · 5661 in / 1466 out tokens · 49135 ms · 2026-05-11T16:10:07.831346+00:00 · methodology

discussion (0)


Forward citations

Cited by 43 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Open-H-Embodiment: A Large-Scale Dataset for Enabling Foundation Models in Medical Robotics

    cs.RO 2026-04 conditional novelty 8.0

    Open-H-Embodiment is the largest open multi-embodiment medical robotics dataset, used to train GR00T-H, the first open vision-language-action model that achieves end-to-end suturing completion where prior models fail.

  2. From Imagined Futures to Executable Actions: Mixture of Latent Actions for Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    MoLA infers a mixture of latent actions from generated future videos via modality-aware inverse dynamics models to improve robot manipulation policies.

  3. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    ALAM creates algebraically consistent latent action transitions from videos to act as auxiliary generative targets, raising robot policy success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  4. NoiseGate: Learning Per-Latent Timestep Schedules as Information Gating in World Action Models

    cs.RO 2026-05 unverdicted novelty 7.0

    NoiseGate learns per-latent timestep schedules as an information-gating policy in diffusion-based world action models, yielding consistent gains on RoboTwin manipulation tasks.

  5. OA-WAM: Object-Addressable World Action Model for Robust Robot Manipulation

    cs.RO 2026-05 unverdicted novelty 7.0

    OA-WAM uses persistent address vectors and dynamic content vectors in object slots to enable addressable world-action prediction, improving robustness on manipulation benchmarks under scene changes.

  6. EA-WM: Event-Aware Generative World Model with Structured Kinematic-to-Visual Action Fields

    cs.CV 2026-05 unverdicted novelty 7.0

    EA-WM generates more accurate robot world rollouts by projecting actions as structured visual fields in camera space and using event-aware bidirectional fusion to better capture interaction dynamics.

  7. Being-H0.7: A Latent World-Action Model from Egocentric Videos

    cs.RO 2026-04 unverdicted novelty 7.0

    Being-H0.7 adds future-aware latent reasoning to direct VLA policies via dual-branch alignment on latent queries, matching world-model benefits at VLA efficiency.

  8. ${\pi}_{0.7}$: a Steerable Generalist Robotic Foundation Model with Emergent Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    π₀.₇ is a steerable generalist robotic model that uses rich multimodal prompts including language, subgoal images, and performance metadata to achieve out-of-the-box generalization across tasks and robot bodies.

  9. ViVa: A Video-Generative Value Model for Robot Reinforcement Learning

    cs.RO 2026-04 unverdicted novelty 7.0

    ViVa turns a video generator into a value model for robot RL that jointly forecasts future states and task value, yielding better performance on real-world box assembly when integrated with RECAP.

  10. MoRight: Motion Control Done Right

    cs.CV 2026-04 unverdicted novelty 7.0

    MoRight disentangles object and camera motion via canonical-view specification and temporal cross-view attention, while decomposing motion into active user-driven and passive consequence components to learn and apply ...

  11. Action Images: End-to-End Policy Learning via Multiview Video Generation

    cs.CV 2026-04 unverdicted novelty 7.0

    Action Images turn robot arm motions into interpretable multiview pixel videos, letting video backbones serve as zero-shot policies for end-to-end robot learning.

  12. RoboEvolve: Co-Evolving Planner-Simulator for Robotic Manipulation with Limited Data

    cs.RO 2026-05 unverdicted novelty 6.0

    A co-evolutionary VLM-VGM loop on 500 unlabeled images raises planner success by 30 points and simulator success by 48 percent while beating fully supervised baselines.

  13. Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.

  14. The DAWN of World-Action Interactive Models

    cs.CV 2026-05 unverdicted novelty 6.0

    DAWN couples a world predictor with a world-conditioned action denoiser in latent space so that each refines the other recursively, yielding strong planning and safety results on autonomous driving benchmarks.

  15. ALAM: Algebraically Consistent Latent Action Model for Vision-Language-Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    ALAM introduces algebraic consistency regularization on latent action transitions from videos, raising VLA success rates from 47.9% to 85.0% on MetaWorld MT50 and 94.1% to 98.1% on LIBERO.

  16. ACWM-Phys: Investigating Generalized Physical Interaction in Action-Conditioned Video World Models

    cs.CV 2026-05 unverdicted novelty 6.0

    ACWM-Phys benchmark shows action-conditioned world models generalize on simple geometric interactions but drop sharply on deformable contacts, high-dimensional control, and complex articulated motion, indicating relia...

  17. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    A verifier called Future Forward Dynamics Causal Attention enables adaptive action execution in World Action Models, reducing model inferences by 69% and improving success rates in robotic tasks.

  18. When to Trust Imagination: Adaptive Action Execution for World Action Models

    cs.RO 2026-05 unverdicted novelty 6.0

    Future Forward Dynamics Causal Attention (FFDC) enables World Action Models to adaptively choose action chunk lengths based on prediction-observation consistency, cutting model inferences by 69% and improving real-wor...

  19. MotuBrain: An Advanced World Action Model for Robot Control

    cs.RO 2026-04 unverdicted novelty 6.0

    MotuBrain jointly models video and action via a three-stream Mixture-of-Transformers UniDiffuser to reach 95.8-96.1% success on RoboTwin 2.0 benchmarks, top EWMScore, and fast 11 Hz inference while adapting to new rob...

  20. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies robotic action execution and 4D world synthesis by adapting video diffusion priors with a lightweight depth branch and asynchronous noise sampling, achieving 79-91% success on robot benchmarks.

  21. Unified 4D World Action Modeling from Video Priors with Asynchronous Denoising

    cs.RO 2026-04 unverdicted novelty 6.0

    X-WAM unifies real-time robotic action execution with high-fidelity 4D world synthesis by adapting video diffusion priors through lightweight depth branches and asynchronous noise sampling, achieving 79-91% success on...

  22. Learning Human-Intention Priors from Large-Scale Human Demonstrations for Robotic Manipulation

    cs.RO 2026-04 unverdicted novelty 6.0

    MoT-HRA learns embodiment-agnostic human-intention priors from the HA-2.2M dataset of 2.2M human video episodes through a three-expert hierarchy to improve robotic motion plausibility and robustness under distribution shift.

  23. FASTER: Value-Guided Sampling for Fast RL

    cs.LG 2026-04 unverdicted novelty 6.0

    FASTER models multi-candidate denoising as an MDP and trains a value function to filter actions early, delivering the performance of full sampling at lower cost in diffusion RL policies.

  24. Human Cognition in Machines: A Unified Perspective of World Models

    cs.RO 2026-04 unverdicted novelty 6.0

    The paper introduces a unified framework for world models that fully incorporates all cognitive functions from Cognitive Architecture Theory, highlights under-researched areas in motivation and meta-cognition, and pro...

  25. Robotic Manipulation is Vision-to-Geometry Mapping ($f(v) \rightarrow G$): Vision-Geometry Backbones over Language and Video Models

    cs.RO 2026-04 unverdicted novelty 6.0

    Vision-geometry backbones using pretrained 3D world models outperform vision-language and video models for robotic manipulation by enabling direct mapping from visual input to geometric actions.

  26. AIM: Intent-Aware Unified world action Modeling with Spatial Value Maps

    cs.RO 2026-04 unverdicted novelty 6.0

    AIM predicts aligned spatial value maps inside a shared video-generation transformer to produce reliable robot actions, reaching 94% success on RoboTwin 2.0 with larger gains on long-horizon and contact-rich tasks.

  27. DexWorldModel: Causal Latent World Modeling towards Automated Learning of Embodied Tasks

    cs.CV 2026-04 unverdicted novelty 6.0

    CLWM with DINOv3 targets, O(1) TTT memory, SAI latency masking, and EmbodiChain training achieves SOTA dual-arm simulation performance and zero-shot sim-to-real transfer that beats real-data finetuned baselines.

  28. VAG: Dual-Stream Video-Action Generation for Embodied Data Synthesis

    cs.RO 2026-04 unverdicted novelty 6.0

    VAG is a synchronized dual-stream flow-matching framework that generates aligned video-action pairs for synthetic embodied data synthesis and policy pretraining.

  29. SIM1: Physics-Aligned Simulator as Zero-Shot Data Scaler in Deformable Worlds

    cs.RO 2026-04 unverdicted novelty 6.0

    SIM1 converts sparse real demonstrations into high-fidelity synthetic data through physics-aligned simulation, yielding policies that match real-data performance at a 1:15 ratio with 90% zero-shot success on deformabl...

  30. Veo-Act: How Far Can Frontier Video Models Advance Generalizable Robot Manipulation?

    cs.RO 2026-04 unverdicted novelty 6.0

    Veo-3 video predictions enable approximate task-level robot trajectories in zero-shot settings but require hierarchical integration with low-level VLA policies for reliable manipulation performance.

  31. Multi-View Video Diffusion Policy: A 3D Spatio-Temporal-Aware Video Action Model

    cs.RO 2026-04 conditional novelty 6.0

    MV-VDP jointly predicts multi-view RGB and heatmap videos via diffusion to achieve data-efficient, robust robotic manipulation policies.

  32. Video Generation Models as World Models: Efficient Paradigms, Architectures and Algorithms

    eess.IV 2026-03 unverdicted novelty 6.0

    Video generation models can function as world simulators if efficiency gaps in spatiotemporal modeling are bridged via organized paradigms, architectures, and algorithms.

  33. Fast-WAM: Do World Action Models Need Test-time Future Imagination?

    cs.CV 2026-03 unverdicted novelty 6.0

    Fast-WAM shows that explicit future imagination at test time is not required for strong WAM performance; video modeling during training provides the main benefit.

  34. Nautilus: From One Prompt to Plug-and-Play Robot Learning

    cs.RO 2026-05 unverdicted novelty 5.0

    NAUTILUS is a prompt-driven harness that automates plug-and-play adapters, typed contracts, and validation for policies, benchmarks, and robots in learning research.

  35. CKT-WAM: Parameter-Efficient Context Knowledge Transfer Between World Action Models

    cs.RO 2026-05 unverdicted novelty 5.0

    CKT-WAM transfers teacher WAM knowledge to students via compressed text-embedding contexts using LQCA and adapters, reaching 86.1% success on LIBERO-Plus with 1.17% trainable parameters and 83.3% in real-world tasks.

  36. VLA Foundry: A Unified Framework for Training Vision-Language-Action Models

    cs.RO 2026-04 unverdicted novelty 5.0

    VLA Foundry provides a single training stack for VLA models and releases open models that match prior closed-source performance or outperform baselines on multi-task manipulation in simulation.

  37. World-Value-Action Model: Implicit Planning for Vision-Language-Action Systems

    cs.RO 2026-04 unverdicted novelty 5.0

    The World-Value-Action model enables implicit planning for VLA systems by performing inference over a learned latent representation of high-value future trajectories instead of direct action prediction.

  38. Goal2Skill: Long-Horizon Manipulation with Adaptive Planning and Reflection

    cs.RO 2026-04 unverdicted novelty 5.0

    A dual VLM-VLA framework for long-horizon robot manipulation achieves 32.4% success on RMBench tasks versus 9.8% for the strongest baseline via structured memory and closed-loop adaptive replanning.

  39. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  40. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 achieves 86.8% success on complex ALLEX humanoid manipulation tasks where prior VLAs reach only around 40%.

  41. RLDX-1 Technical Report

    cs.RO 2026-05 unverdicted novelty 4.0

    RLDX-1 outperforms frontier VLAs such as π0.5 and GR00T N1.6 on dexterous manipulation benchmarks, reaching 86.8% success on ALLEX humanoid tasks versus around 40% for the baselines.

  42. ABot-Claw: A Foundation for Persistent, Cooperative, and Self-Evolving Robotic Agents

    cs.CV 2026-04 unverdicted novelty 4.0

    ABot-Claw is an embodied software layer that adds unified robot scheduling, cross-embodiment visual memory, and critic-driven replanning on top of OpenClaw to support persistent multi-robot execution from natural-lang...

  43. Matrix-Game 3.0: Real-Time and Streaming Interactive World Model with Long-Horizon Memory

    cs.CV 2026-04 unverdicted novelty 4.0

    Matrix-Game 3.0 delivers 720p real-time video generation at 40 FPS with minute-scale memory consistency by combining residual self-correction training, camera-aware memory injection, and DMD-based autoregressive disti...

Reference graph

Works this paper leans on

91 extracted references · 91 canonical work pages · cited by 39 Pith papers · 32 internal anchors

  1. [1]

    Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

Michael S Albergo, Nicholas M Boffi, and Eric Vanden-Eijnden. Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797, 2023. 7

  2. [2]

    World Simulation with Video Foundation Models for Physical AI

Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical AI. arXiv preprint arXiv:2511.00062, 2025. 7

  3. [3]

    Self-supervised learning from images with a joint-embedding predictive architecture

Mahmoud Assran, Quentin Duval, Ishan Misra, Piotr Bojanowski, Pascal Vincent, Michael Rabbat, Yann LeCun, and Nicolas Ballas. Self-supervised learning from images with a joint-embedding predictive architecture. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023. 20

  4. [4]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Mido Assran, Adrien Bardes, David Fan, Quentin Garrido, Russell Howes, Mojtaba Komeili, Matthew Muckley, Ammar Rizvi, Claire Roberts, Koustuv Sinha, et al. V-JEPA 2: Self-supervised video models enable understanding, prediction and planning. arXiv preprint arXiv:2506.09985, 2025. 5, 20

  5. [5]

    Philip J. Ball, Jakob Bauer, Frank Belletti, Bethanie Brownfield, Ariel Ephrat, Shlomi Fruchter, Agrim Gupta, Kristian Holsheimer, Aleksander Holynski, Jiri Hron, Christos Kaplanis, Marjorie Limont, Matt McGill, Yanko Oliveira, Jack Parker-Holder, Frank Perbet, Guy Scully, Jeremy Shar, Stephen Spencer, Omer Tov, Ruben Villegas, Emma Wang, Jessica Yung, Ci...

  6. [6]

V-JEPA: Video Joint Embedding Predictive Architecture

Adrien Bardes, Quentin Garrido, Jean Ponce, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. V-JEPA: Video joint embedding predictive architecture. arXiv preprint arXiv:2402.05065, 2024. 20

  7. [7]

A Careful Examination of Large Behavior Models for Multitask Dexterous Manipulation

Jose Barreiros, Andrew Beaulieu, Aditya Bhat, Rick Cory, Eric Cousineau, Hongkai Dai, Ching-Hsin Fang, Kunimatsu Hashimoto, Muhammad Zubair Irshad, Masha Itkina, et al. A careful examination of large behavior models for multitask dexterous manipulation. arXiv preprint arXiv:2507.05331, 2025. 13

  8. [8]

Gen2Act: Human Video Generation in Novel Scenarios Enables Generalizable Robot Manipulation

Homanga Bharadhwaj, Debidatta Dwibedi, Abhinav Gupta, Shubham Tulsiani, Carl Doersch, Ted Xiao, Dhruv Shah, Fei Xia, Dorsa Sadigh, and Sean Kirmani. Gen2act: Human video generation in novel scenarios enables generalizable robot manipulation. arXiv preprint arXiv:2409.16283, 2024. 5

  9. [9]

    GR00T N1: An Open Foundation Model for Generalist Humanoid Robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. GR00T N1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025. 2, 4, 10, 18

  10. [11]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. URL https://arxiv.org/abs/2410.24164, 2024. 2

  11. [12]

    On the Opportunities and Risks of Foundation Models

Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021. 4

  12. [13]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, Alex Irpan, Tomas Jackson, Sally Jesmonth, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, Sergey Levine, Yao Lu, Utsav Malla, De...

  13. [14]

    RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, Pete Florence, Chuyuan Fu, Montse Gonzalez Arenas, Keerthana Gopalakrishnan, Kehang Han, Karol Hausman, Alex Herzog, Jasmine Hsu, Brian Ichter, Alex Irpan, Nikhil Joshi, Ryan Julian, Dmitry Kalashnikov, Yuh...

  14. [15]

    Do as I can, not as I say: Grounding language in robotic affordances

    Anthony Brohan, Yevgen Chebotar, Chelsea Finn, Karol Hausman, Alexander Herzog, Daniel Ho, Julian Ibarz, Alex Irpan, Eric Jang, Ryan Julian, et al. Do as I can, not as I say: Grounding language in robotic affordances. In Conference on Robot Learning, pages 287–318. PMLR, 2023. 4

  15. [16]

    UniVLA: Learning to Act Anywhere with Task-centric Latent Actions

    Qingwen Bu, Yanting Yang, Jisong Cai, Shenyuan Gao, Guanghui Ren, Maoqing Yao, Ping Luo, and Hongyang Li. UniVLA: Learning to act anywhere with task-centric latent actions. arXiv preprint arXiv:2505.06111, 2025. 4

  16. [17]

    GR-2: A Generative Video-Language-Action Model with Web-Scale Knowledge for Robot Manipulation

    Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, et al. GR-2: A generative video-language-action model with web-scale knowledge for robot manipulation. arXiv preprint arXiv:2410.06158, 2024. 5

  17. [18]

    SpatialVLM: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. SpatialVLM: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024. 2

  18. [19]

    Large Video Planner Enables Generalizable Robot Control

    Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, et al. Large video planner enables generalizable robot control. arXiv preprint arXiv:2512.15840, 2025. 5

  19. [20]

    Action100M: A large-scale video action dataset

    Delong Chen, Tejaswi Kasarla, Yejin Bang, Mustafa Shukor, Willy Chung, Jade Yu, Allen Bolourchi, Theo Moutakanni, and Pascale Fung. Action100M: A large-scale video action dataset. arXiv preprint arXiv:2601.10592, 2026. 18

  20. [21]

    Emerging Properties in Unified Multimodal Pretraining

    Chaorui Deng, Deyao Zhu, Kunchang Li, Chenhui Gou, Feng Li, Zeyu Wang, Shu Zhong, Weihao Yu, Xiaonan Nie, Ziang Song, et al. Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683, 2025. 19

  21. [22]

    Learning a thousand tasks in a day

    Kamil Dreczkowski, Pietro Vitiello, Vitalis Vosylius, and Edward Johns. Learning a thousand tasks in a day. Science Robotics, 10(108):eadv7594, 2025. 4

  22. [23]

    PaLM-E: An Embodied Multimodal Language Model

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. PaLM-E: An embodied multimodal language model. arXiv preprint arXiv:2303.03378, 2023. 4

  23. [24]

    Learning universal policies via text-guided video generation

    Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation. Advances in Neural Information Processing Systems, 36:9156–9172, 2023. 5

  24. [25]

    Video language planning

    Yilun Du, Sherry Yang, Pete Florence, Fei Xia, Ayzaan Wahid, Brian Ichter, Pierre Sermanet, Tianhe Yu, Pieter Abbeel, Joshua B. Tenenbaum, Leslie Pack Kaelbling, Andy Zeng, and Jonathan Tompson. Video language planning. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=9pKtcJcMP3. 5

  25. [26]

    Seeing across views: Benchmarking spatial reasoning of vision-language models in robotic scenes

    Zhiyuan Feng, Zhaolu Kang, Qijie Wang, Zhiying Du, Jiongrui Yan, Shubin Shi, Chengbo Yuan, Huizhi Liang, Yu Deng, Qixiu Li, et al. Seeing across views: Benchmarking spatial reasoning of vision-language models in robotic scenes. arXiv preprint arXiv:2510.19400, 2025. 2

  26. [27]

    A taxonomy for evaluating generalist robot policies

    Jensen Gao, Suneel Belkhale, Sudeep Dasari, Ashwin Balakrishna, Dhruv Shah, and Dorsa Sadigh. A taxonomy for evaluating generalist robot policies. arXiv preprint arXiv:2503.01238, 2025. 4

  27. [28]

    Ca2-VDM: Efficient autoregressive video diffusion model with causal generation and cache sharing

    Kaifeng Gao, Jiaxin Shi, Hanwang Zhang, Chunping Wang, Jun Xiao, and Long Chen. Ca2-VDM: Efficient autoregressive video diffusion model with causal generation and cache sharing. arXiv preprint arXiv:2411.16375, 2024. 7

  28. [29]

    Gemini Robotics: Bringing AI into the Physical World

    Gemini Robotics Team. Gemini Robotics: Bringing AI into the physical world. arXiv preprint arXiv:2503.20020, 2025. 2, 4, 5

  29. [30]

    Ego4d: Around the world in 3,000 hours of egocentric video

    Kristen Grauman, Andrew Westbury, Eugene Byrne, Zachary Chavis, Antonino Furnari, Rohit Girdhar, Jackson Hamburger, Hao Jiang, Miao Liu, Xingyu Liu, et al. Ego4D: Around the world in 3,000 hours of egocentric video. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

  30. [31]

    Benchmarking vision, language, & action models in procedurally generated, open ended action environments

    Pranav Guruprasad, Yangyue Wang, Sudipta Chowdhury, Harshvardhan Sikka, and Paul Pu Liang. Benchmarking vision, language, & action models in procedurally generated, open ended action environments. arXiv preprint arXiv:2505.05540, 2025. 2

  31. [32]

    Dream to Control: Learning Behaviors by Latent Imagination

    Danijar Hafner, Timothy Lillicrap, Jimmy Ba, and Mohammad Norouzi. Dream to control: Learning behaviors by latent imagination. arXiv preprint arXiv:1912.01603, 2019. 5, 20

  32. [33]

    Mastering Atari with discrete world models

    Danijar Hafner, Timothy Lillicrap, Mohammad Norouzi, and Jimmy Ba. Mastering Atari with discrete world models. arXiv preprint arXiv:2010.02193, 2020. 5, 20

  33. [34]

    Mastering Diverse Domains through World Models

    Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models.arXiv preprint arXiv:2301.04104, 2023. 5, 20

  34. [35]

    Training agents inside of scalable world models

    Danijar Hafner, Wilson Yan, and Timothy Lillicrap. Training agents inside of scalable world models, 2025. URL https://arxiv.org/abs/2509.24527. 20

  35. [36]

    Classifier-Free Diffusion Guidance

    Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. arXiv preprint arXiv:2207.12598, 2022.

  36. [37]

    EgoDex: Learning dexterous manipulation from large-scale egocentric video

    Ryan Hoque, Peide Huang, David J Yoon, Mouli Sivapurapu, and Jian Zhang. EgoDex: Learning dexterous manipulation from large-scale egocentric video. arXiv preprint arXiv:2505.11709, 2025. 18

  37. [38]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022. 11

  38. [39]

    Video Prediction Policy: A Generalist Robot Policy with Predictive Visual Representations

    Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. arXiv preprint arXiv:2412.14803, 2024. 5

  39. [40]

    Inner monologue: Embodied reasoning through planning with language models

    Wenlong Huang, Fei Xia, Ted Xiao, Harris Chan, Jacky Liang, Pete Florence, Andy Zeng, Jonathan Tompson, Igor Mordatch, Yevgen Chebotar, et al. Inner monologue: Embodied reasoning through planning with language models. In 6th Annual Conference on Robot Learning. 4

  40. [41]

    VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models

    Wenlong Huang, Chen Wang, Ruohan Zhang, Yunzhu Li, Jiajun Wu, and Li Fei-Fei. VoxPoser: Composable 3D value maps for robotic manipulation with language models. arXiv preprint arXiv:2307.05973, 2023. 4

  41. [42]

    PointWorld: Scaling 3D world models for in-the-wild robotic manipulation

    Wenlong Huang, Yu-Wei Chao, Arsalan Mousavian, Ming-Yu Liu, Dieter Fox, Kaichun Mo, and Li Fei-Fei. PointWorld: Scaling 3D world models for in-the-wild robotic manipulation. arXiv preprint arXiv:2601.03782, 2025. 20

  42. [43]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. arXiv preprint arXiv:2506.08009, 2025. 8

  43. [44]

    Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency

    Team HunyuanWorld. Hy-world 1.5: A systematic framework for interactive world modeling with real-time latency and geometric consistency. arXiv preprint, 2025. 19

  44. [45]

    PolaRiS: Scalable real-to-sim evaluations for generalist robot policies

    Arhan Jain, Mingtong Zhang, Kanav Arora, William Chen, Marcel Torne, Muhammad Zubair Irshad, Sergey Zakharov, Yue Wang, Sergey Levine, Chelsea Finn, et al. PolaRiS: Scalable real-to-sim evaluations for generalist robot policies. arXiv preprint arXiv:2512.16881, 2025. 11

  45. [46]

    Dreamgen: Unlocking generalization in robot learning through video world models

    Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, Loïc Magne, Ajay Mandlekar, Avnish Narayan, You Liang Tan, Guanzhi Wang, Jing Wang, Qi Wang, Yinzhen Xu, Xiaohui Zeng, Kaiyuan Zheng, Ruijie Zheng, Ming-Yu Liu, Luke Zettlemoyer, Dieter Fox, Jan Kautz, Scott Reed, Yuke Zh...

  46. [47]

    URL https://openreview.net/forum?id=3CnxNqmklv. 5

  47. [48]

    Pyramidal flow matching for efficient video generative modeling

    Yang Jin, Zhicheng Sun, Ningyuan Li, Kun Xu, Hao Jiang, Nan Zhuang, Quzhe Huang, Yang Song, Yadong Mu, and Zhouchen Lin. Pyramidal flow matching for efficient video generative modeling. arXiv preprint arXiv:2410.05954, 2024. 7

  48. [49]

    Rationally engineering rational robots

    Leslie Pack Kaelbling and Tomás Lozano-Pérez. Rationally engineering rational robots. 4

  49. [50]

    Scaling Laws for Neural Language Models

    Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020. 18

  50. [51]

    Emergence of human to robot transfer in vision-language-action models

    Simar Kareer, Karl Pertsch, James Darpinian, Judy Hoffman, Danfei Xu, Sergey Levine, Chelsea Finn, and Suraj Nair. Emergence of human to robot transfer in vision-language-action models. arXiv preprint arXiv:2512.22414, 2025. 16

  51. [52]

    DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset

    Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. DROID: A large-scale in-the-wild robot manipulation dataset. arXiv preprint arXiv:2403.12945, 2024. 10, 11

  52. [54]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246, 2024. 2

  53. [55]

    Cosmos Policy: Fine-Tuning Video Models for Visuomotor Control and Planning

    Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026. 2, 5, 7, 19

  54. [56]

    Learning to act from actionless videos through dense correspondences

    Po-Chen Ko, Jiayuan Mao, Yilun Du, Shao-Hua Sun, and Joshua B. Tenenbaum. Learning to act from actionless videos through dense correspondences. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=Mhb5fpA1T0. 5

  55. [57]

    Open-world task and motion planning via vision-language model generated constraints

    Nishanth Kumar, William Shen, Fabio Ramos, Dieter Fox, Tomás Lozano-Pérez, Leslie Pack Kaelbling, and Caelan Reed Garrett. Open-world task and motion planning via vision-language model generated constraints. IEEE Robotics and Automation Letters, pages 1–8, 2026. doi: 10.1109/LRA.2026.3656799. 4

  56. [58]

    A path towards autonomous machine intelligence

    Yann LeCun. A path towards autonomous machine intelligence. Open Review, 2022. 20

  57. [59]

    MolmoAct: Action reasoning models that can reason in space

    Jason Lee, Jiafei Duan, Haoquan Fang, Yuquan Deng, Shuo Liu, Boyang Li, Bohan Fang, Jieyu Zhang, Yi Ru Wang, Sangho Lee, et al. MolmoAct: Action reasoning models that can reason in space. arXiv preprint arXiv:2508.07917, 2025. 4

  58. [60]

    Causal World Modeling for Robot Control

    Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control. arXiv preprint arXiv:2601.21998, 2026. 7

  59. [61]

    Unified Video Action Model

    Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model. arXiv preprint arXiv:2503.00200, 2025. 5, 7

  60. [62]

    Hamster: Hierarchical action models for open-world robot manipulation

    Yi Li, Yuquan Deng, Jesse Zhang, Joel Jang, Marius Memmel, Raymond Yu, Caelan Reed Garrett, Fabio Ramos, Dieter Fox, Anqi Li, et al. Hamster: Hierarchical action models for open-world robot manipulation. arXiv preprint arXiv:2502.05485, 2025. 4

  61. [63]

    Dreamitate: Real-world visuomotor policy learning via video generation

    Junbang Liang, Ruoshi Liu, Ege Ozguroglu, Sruthi Sudhakar, Achal Dave, Pavel Tokmakov, Shuran Song, and Carl Vondrick. Dreamitate: Real-world visuomotor policy learning via video generation. In 8th Annual Conference on Robot Learning, 2024. URL https://openreview.net/forum?id=InT87E5sr4. 5

  62. [64]

    Video generators are robot policies

    Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video generators are robot policies. arXiv preprint arXiv:2508.00795, 2025. 2, 5

  63. [65]

    Genie Envisioner: A unified world foundation platform for robotic manipulation

    Yue Liao, Pengfei Zhou, Siyuan Huang, Donglin Yang, Shengcong Chen, Yuxin Jiang, Yue Hu, Jingbin Cai, Si Liu, Jianlan Luo, et al. Genie envisioner: A unified world foundation platform for robotic manipulation. arXiv preprint arXiv:2508.05635, 2025. 3, 5, 7

  64. [66]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747, 2022. 7

  65. [67]

    Timestep embedding tells: It’s time to cache for video diffusion model

    Feng Liu, Shiwei Zhang, Xiaofeng Wang, Yujie Wei, Haonan Qiu, Yuzhong Zhao, Yingya Zhang, Qixiang Ye, and Fang Wan. Timestep embedding tells: It’s time to cache for video diffusion model. arXiv preprint arXiv:2411.19108, 2024. 23

  66. [68]

    From reusing to forecasting: Accelerating diffusion models with TaylorSeers

    Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, and Linfeng Zhang. From reusing to forecasting: Accelerating diffusion models with TaylorSeers. arXiv preprint arXiv:2503.06923, 2025. 23

  67. [69]

    Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

    Xingchao Liu, Chengyue Gong, and Qiang Liu. Flow straight and fast: Learning to generate and transfer data with rectified flow. arXiv preprint arXiv:2209.03003, 2022. 7

  68. [70]

    Solving new tasks by adapting internet video knowledge

    Calvin Luo, Zilai Zeng, Yilun Du, and Chen Sun. Solving new tasks by adapting internet video knowledge. InThe Thirteenth International Conference on Learning Representations, 2025. 5

  69. [71]

    NVIDIA Model-Optimizer

    NVIDIA Corporation. NVIDIA Model-Optimizer, 2024. URL https://github.com/NVIDIA/Model-Optimizer. 23

  70. [72]

    mimic-video: Video-action models for generalizable robot control beyond VLAs

    Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond VLAs. arXiv preprint arXiv:2512.15692, 2025.

  71. [73]

    $\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization

    Physical Intelligence. π0.5: A vision-language-action model with open-world generalization. arXiv preprint arXiv:2504.16054, 2025. 4, 5, 10

  72. [74]

    Hi Robot: Open-ended instruction following with hierarchical vision-language-action models

    Lucy Xiaoyang Shi, Brian Ichter, Michael Equi, Liyiming Ke, Karl Pertsch, Quan Vuong, James Tanner, Anna Walling, Haohuan Wang, Niccolo Fusai, et al. Hi Robot: Open-ended instruction following with hierarchical vision-language-action models. arXiv preprint arXiv:2502.19417, 2025. 19

  73. [75]

    Progprompt: Generating situated robot task plans using large language models

    Ishika Singh, Valts Blukis, Arsalan Mousavian, Ankit Goyal, Danfei Xu, Jonathan Tremblay, Dieter Fox, Jesse Thomason, and Animesh Garg. ProgPrompt: Generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023. 4

  74. [76]

    Gen-0: Embodied foundation models that scale with physical interaction

    Generalist AI Team. Gen-0: Embodied foundation models that scale with physical interaction. Generalist AI Blog, 2025. https://generalistai.com/blog/preview-uqlxvb-bb.html. 16

  75. [77]

    Wan: Open and advanced large-scale video generative models

    Team Wan. Wan: Open and advanced large-scale video generative models. 2025. 2, 7, 11

  76. [78]

    MAGI-1: Autoregressive Video Generation at Scale

    Hansi Teng, Hongyu Jia, Lei Sun, Lingzhi Li, Maolin Li, Mingqiu Tang, Shuai Han, Tianning Zhang, WQ Zhang, Weifeng Luo, et al. MAGI-1: Autoregressive video generation at scale. arXiv preprint arXiv:2505.13211, 2025. 7, 8

  77. [79]

    Bridgedata v2: A dataset for robot learning at scale

    Homer Walke, Kevin Black, Abraham Lee, Moo Jin Kim, Max Du, Chongyi Zheng, Tony Zhao, Philippe Hansen-Estruch, Quan Vuong, Andre He, Vivek Myers, Kuan Fang, Chelsea Finn, and Sergey Levine. BridgeData V2: A dataset for robot learning at scale. In Conference on Robot Learning (CoRL), 2023. 10

  78. [80]

    Dual-stream diffusion for world-model augmented vision-language-action model

    John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, and Jinwoo Shin. Dual-stream diffusion for world-model augmented vision-language-action model. arXiv preprint arXiv:2510.27607, 2025. 5

  79. [81]

    Unleashing large-scale video generative pre-training for visual robot manipulation

    Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=NxoFmGgWC9. 5

  80. [82]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 18

Showing first 80 references.