PointAction: 3D Points as Universal Action Representations for Robot Control

Han Jiang; Jiatao Gu; Lingjie Liu; Mutian Tong; Qiao Feng

arxiv: 2606.03943 · v1 · pith:OE7EKOTBnew · submitted 2026-06-02 · 💻 cs.RO · cs.CV· cs.LG

PointAction: 3D Points as Universal Action Representations for Robot Control

Mutian Tong , Han Jiang , Qiao Feng , Lingjie Liu , Jiatao Gu This is my paper

Pith reviewed 2026-06-28 10:08 UTC · model grok-4.3

classification 💻 cs.RO cs.CVcs.LG

keywords 3D pointmapsvideo diffusion modelsrobot manipulationaction representationembodiment transfer4D scene generationdiffusion decoder

0 comments

The pith

PointAction predicts dynamic 3D pointmaps from video models to serve as a metric interface that a diffusion decoder converts into robot actions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that video-action models can avoid the ambiguity of RGB-only predictions by jointly generating future frames and consistent 3D point dynamics of scene geometry. These pointmaps act as an embodiment-agnostic bridge that a separate diffusion decoder then maps to executable controls on robot arms. A sympathetic reader would care because this approach aims to scale manipulation skills across tasks and hardware using mostly video data rather than exhaustive action labels. The work shows improved 4D scene modeling on robot environments and successful transfer to two real arms not seen in pretraining.

Core claim

PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps that capture temporally consistent metric motion of task-relevant geometry; these point dynamics then serve as the input to a diffusion-based action decoder that produces executable robot actions, thereby reducing the under-specification of pure RGB rollouts and enabling transfer across tasks and embodiments with limited action supervision.

What carries the argument

Jointly predicted dynamic 3D pointmaps that encode metric 3D motion of scene points and act as the structured interface between the video model and the action decoder.

If this is right

The method achieves state-of-the-art 4D generation quality on robot scenes.
It outperforms existing baselines in simulation experiments.
It generalizes successfully to two real robot arms that were unseen during pretraining.
It supports cross-task and cross-embodiment transfer while requiring only limited action supervision.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If point dynamics prove reliable, future work could pretrain the video component on large unlabeled video corpora and add only small action datasets for new hardware.
The same point-based interface might extend to non-arm embodiments such as mobile bases if the predicted points capture the necessary contact and navigation geometry.
Errors in pointmap prediction would directly limit downstream action success, suggesting that improvements in 3D consistency could yield measurable gains in real-world success rates.

Load-bearing premise

The predicted 3D pointmaps stay accurate enough and consistent over time that the diffusion decoder can turn them into valid actions on robot arms never seen in pretraining.

What would settle it

On a held-out real robot arm, generate pointmaps from the model on a new task and check whether the decoded actions produce large 3D positioning errors or fail to complete the task more often than baselines.

read the original abstract

Video-Action Models (VAMs) leverage the broad visual dynamics captured by pre-trained video diffusion models, offering a promising path toward generalizable robot manipulation. However, RGB-only video rollouts are not directly actionable: they leave metric 3D motion, contact geometry, and fine-grained spatial constraints under-specified, making action grounding ambiguous. Meanwhile, scaling action supervision across diverse tasks and embodiments remains costly. We present PointAction, a framework that bridges video predictions to robot actions through explicit point-based 4D modeling. PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, producing temporally consistent 3D motion of task-relevant scene geometry. These point dynamics serve as a structured, embodiment-agnostic action interface, which a diffusion-based action decoder maps to executable robot actions. By using metric 3D point dynamics as the interface between video prediction and control, PointAction reduces the ambiguity of RGB-only action grounding and supports transfer across tasks and embodiments with limited action supervision. Experiments show that PointAction achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

PointAction's 3D point dynamics as an action interface is a reasonable extension of video models, but the abstract gives no numbers showing the predictions are accurate enough to back the generalization claims.

read the letter

The core idea is to fine-tune a video diffusion model so it outputs both RGB frames and dynamic 3D pointmaps, then feed those pointmaps into a diffusion decoder that produces robot actions. This is meant to give a metric, embodiment-agnostic layer between video prediction and control.

What is actually new is treating the predicted 3D motion of task-relevant geometry as the direct action representation. That moves past RGB-only video-action models and tries to fix the under-specification of contacts and scale that usually makes grounding ambiguous.

The paper reports SOTA 4D generation on robot scenes, better simulation performance, and successful transfer to two real arms not seen in pretraining. If the point predictions are solid, the limited-supervision transfer story follows.

The soft spot is the missing evidence on pointmap quality. The abstract states the point dynamics reduce ambiguity and enable transfer, yet supplies no 3D endpoint error, temporal consistency numbers, or ablations on how residual noise affects the decoder. Video-based 3D lifting often keeps scale and occlusion problems in cluttered scenes; without bounds on those errors, it is not clear the interface actually supports the claimed results on unseen arms. The stress-test concern lands.

This is for robotics groups working on video-based imitation and cross-embodiment learning. A reader who wants concrete interfaces between large video models and low-level control would get value from the framing.

It deserves a serious referee because the problem is real and the proposed interface is straightforward to test. I would send it out and ask for the point-prediction metrics and noise ablations in revision.

Referee Report

2 major / 1 minor

Summary. The paper presents PointAction, a framework that fine-tunes a pre-trained video diffusion model to jointly generate future RGB frames and dynamic 3D pointmaps of robot scenes. These temporally consistent point dynamics are positioned as an embodiment-agnostic interface that a separate diffusion-based decoder maps to executable robot actions, with the goal of reducing RGB grounding ambiguity and enabling cross-task and cross-embodiment transfer under limited action supervision. The abstract reports state-of-the-art 4D generation quality on robot scenes, outperformance versus baselines in simulation, and successful generalization to two real robot arms absent from pretraining.

Significance. If the central results hold, the work supplies a concrete mechanism for converting video-prediction outputs into metric 3D motion that can be decoded into control signals, addressing a recognized bottleneck in video-action models. The explicit use of 3D point dynamics as a structured, transferable representation is a clear conceptual contribution that could support more scalable robot learning pipelines.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: the central claim that jointly predicted dynamic 3D pointmaps are sufficiently accurate and temporally consistent to support reliable action decoding on unseen arms is not accompanied by quantitative bounds (e.g., 3D endpoint error, temporal consistency scores, or occlusion-handling metrics) or ablations that isolate the effect of residual pointmap noise on downstream action success. Without these, the data-to-claim link for generalization remains unevaluated.
[Method / Experiments] The description of the fine-tuning and decoding pipeline states that point dynamics reduce RGB ambiguity, yet no comparison is provided against a direct RGB-to-action baseline that would quantify the incremental benefit of the 3D interface under the same limited-supervision regime.

minor comments (1)

[Method] Notation for the pointmap representation (e.g., how metric scale is recovered and how points are selected as task-relevant) should be defined explicitly in the main text rather than deferred to supplementary material.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will implement to strengthen the evaluation and claims.

read point-by-point responses

Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim that jointly predicted dynamic 3D pointmaps are sufficiently accurate and temporally consistent to support reliable action decoding on unseen arms is not accompanied by quantitative bounds (e.g., 3D endpoint error, temporal consistency scores, or occlusion-handling metrics) or ablations that isolate the effect of residual pointmap noise on downstream action success. Without these, the data-to-claim link for generalization remains unevaluated.

Authors: We agree that the manuscript would benefit from explicit quantitative metrics on pointmap accuracy and consistency to more rigorously support the generalization claims. In the revised version, we will add 3D endpoint error, temporal consistency scores, and occlusion-handling metrics to the Experiments section. We will also include an ablation isolating the effect of residual pointmap noise on downstream action success rates. These additions will directly address the data-to-claim link for action decoding on unseen arms. revision: yes
Referee: [Method / Experiments] The description of the fine-tuning and decoding pipeline states that point dynamics reduce RGB ambiguity, yet no comparison is provided against a direct RGB-to-action baseline that would quantify the incremental benefit of the 3D interface under the same limited-supervision regime.

Authors: We acknowledge that a direct RGB-to-action baseline comparison is necessary to quantify the incremental benefit of the 3D point dynamics under limited supervision. In the revised manuscript, we will add this baseline by training and evaluating a diffusion-based decoder that maps directly from RGB video predictions to actions, using the identical limited-supervision regime and evaluation protocol as the PointAction pipeline. This will enable a clear measurement of the advantage provided by the pointmap interface. revision: yes

Circularity Check

0 steps flagged

No significant circularity; pipeline is self-contained

full rationale

The paper describes an empirical pipeline: fine-tune a foundation video model to jointly predict RGB frames and dynamic 3D pointmaps, then apply a diffusion-based action decoder to map point dynamics to robot actions. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described method. The central claim rests on the empirical sufficiency of the 4D interface rather than any reduction of outputs to inputs by construction. This is the common case of a non-circular engineering framework.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review supplies insufficient technical detail to enumerate concrete free parameters, axioms, or invented entities; the framework implicitly assumes that video diffusion models can be adapted to output consistent metric 3D point dynamics and that these dynamics are sufficient for action decoding.

axioms (1)

domain assumption Pre-trained video diffusion models capture dynamics that can be fine-tuned to also output temporally consistent 3D pointmaps
This is the core adaptation step stated in the abstract.

pith-pipeline@v0.9.1-grok · 5762 in / 1247 out tokens · 34457 ms · 2026-06-28T10:08:05.790189+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

74 extracted references · 4 canonical work pages

[1]

World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

Pith/arXiv arXiv 2025
[2]

URL https://proceedings.mlr

Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas J. Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B. Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, W A, USA, June 16-22, 20...

work page doi:10.1109/cvpr52733.2024.00764 2024
[3]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025
[4]

Geovideo: Introducing geometric regularization into video generation model, 2025

Yunpeng Bai, Shaoheng Fang, Chaohui Yu, Fan Wang, and Qixing Huang. Geovideo: Introducing geometric regularization into video generation model, 2025. URLhttps://arxiv.org/abs/2512.03453

arXiv 2025
[5]

Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Pith/arXiv arXiv 2024
[6]

Gr00t n1: An open foundation model for generalist humanoid robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025
[7]

arXiv preprint arXiv:2410.24164, 2024

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024
[8]

Zero-shot robotic manipulation with pre-trained image-editing diffusion models

Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Rich Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pre-trained image-editing diffusion models. InThe Twelfth Inter- national Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=c0chJTSbci

2024
[9]

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, brian ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, ...

2025
[10]

Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Pith/arXiv arXiv 2023
[11]

Sam 3: Segment anything with concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025

Pith/arXiv arXiv 2025
[12]

Rynnvla-002: A unified vision-language-action and world model

Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, Fan Wang, Fan Wang, and Deli Zhao. Rynnvla-002: A unified vision-language-action and world model. arXiv preprint arXiv:2511.17502, 2025

Pith/arXiv arXiv 2025
[13]

Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:, 2025

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:, 2025

2025
[14]

Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

Pith/arXiv arXiv 2024
[15]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2025

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2025

2025
[16]

Large video planner enables generalizable robot control

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, et al. Large video planner enables generalizable robot control. arXiv preprint arXiv:2512.15840, 2025

Pith/arXiv arXiv 2025
[17]

4dnex: Feed-forward 4d generative modeling made easy.arXiv preprint arXiv:2508.13154, 2025

Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, and Ziwei Liu. 4dnex: Feed-forward 4d generative modeling made easy.arXiv preprint arXiv:2508.13154, 2025

arXiv 2025
[18]

Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

2023
[19]

Mitra, and Qixing Huang

Shaoheng Fang, Hanwen Jiang, Yunpeng Bai, Niloy J. Mitra, and Qixing Huang. Worldreel: 4d video generation with consistent geometry and motion modeling, 2025. URLhttps://arxiv.org/abs/2512.07821

arXiv 2025
[20]

data gap

Ken Goldberg. Good old-fashioned engineering can close the 100,000-year “data gap” in robotics.Science Robotics, 10(105):eaea7390, 2025. doi: 10.1126/scirobotics.aea7390. URLhttps://www.science.org/doi/abs/10.1126/ scirobotics.aea7390

work page doi:10.1126/scirobotics.aea7390 2025
[21]

Predic- tion with action: Visual policy learning via joint denoising process.Advances in Neural Information Processing Systems, 37:112386–112410, 2024

Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Predic- tion with action: Visual policy learning via joint denoising process.Advances in Neural Information Processing Systems, 37:112386–112410, 2024

2024
[22]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URLhttps://openreview.net/ forum?id=nZeVKeeFYf9

2022
[23]

Video prediction policy: A generalist robot policy with predictive visual representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,F...

2025
[24]

Self forcing: Bridging the train-test gap in autoregressive video diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=mSiN7i0BYH

2025
[25]

Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025. 10

Pith/arXiv arXiv 2025
[26]

Geo4d: Leveraging video generators for geometric 4d scene reconstruction, 2025

Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, and Andrea Vedaldi. Geo4d: Leveraging video generators for geometric 4d scene reconstruction, 2025. URLhttps://arxiv.org/abs/2504.07961

arXiv 2025
[27]

Vision-language-action models for robotics: A review towards real-world applications.IEEE Access, 13:162467–162504, 2025

Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications.IEEE Access, 13:162467–162504, 2025. doi: 10.1109/ ACCESS.2025.3609980

arXiv 2025
[28]

Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Pith/arXiv arXiv 2024
[29]

Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024
[30]

Cosmos policy: Fine-tuning video models for visuomotor control and planning

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

Pith/arXiv arXiv 2026
[31]

Causal world modeling for robot control, 2026

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control, 2026. URLhttps://arxiv.org/abs/2601. 21998

2026
[32]

Unified video action model.arXiv preprint arXiv:2503.00200, 2025

Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025

Pith/arXiv arXiv 2025
[33]

Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025

Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025

Pith/arXiv arXiv 2025
[34]

Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Pith/arXiv arXiv 2025
[35]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps: //openreview.net/forum?id=PqvMRDCJT9t

2023
[36]

Free4d: Tuning-free 4d scene generation with spatial-temporal consistency

Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, and Ziwei Liu. Free4d: Tuning-free 4d scene generation with spatial-temporal consistency. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 25571–25582, October 2025

2025
[37]

Geometry-aware 4d video generation for robot manipulation

Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, and Shuran Song. Geometry-aware 4d video generation for robot manipulation. InThe Fourteenth International Conference on Learning Representa- tions, 2026. URLhttps://openreview.net/forum?id=18gC6pZVVc

2026
[38]

Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots

Soroush Nasiriany, Sepehr Nasiriany, Abhiram Maddukuri, and Yuke Zhu. Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=tQJYKwc3n4

2026
[39]

Octo: An open-source generalist robot policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherl...

2024
[40]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

2024
[41]

mimic-video: Video-action models for generalizable robot control beyond vlas, 2025

Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas, 2025. URLhttps://arxiv.org/abs/2512.15692

Pith/arXiv arXiv 2025
[42]

Efficient4d: Fast dynamic 3d object generation from a single- view video.Int

Zijie Pan, Zeyu Yang, Xiatian Zhu, and Li Zhang. Efficient4d: Fast dynamic 3d object generation from a single- view video.Int. J. Comput. Vis., 134(1):14, 2026. doi: 10.1007/S11263-025-02615-Z. URLhttps://doi.org/10.1007/ s11263-025-02615-z. 11

work page doi:10.1007/s11263-025-02615-z 2026
[43]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023
[44]

Peebles and Saining Xie

William S. Peebles and Saining Xie. Scalable diffusion models with transformers.2023 IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), pages 4172–4182, 2022. URLhttps://api.semanticscholar.org/CorpusID: 254854389

2023
[45]

Barron, and Ben Mildenhall

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv, 2022

2022
[46]

D-NeRF: Neural Radiance Fields for Dynamic Scenes

Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural Radiance Fields for Dynamic Scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2020
[47]

Qi, Hao Su, Kaichun Mo, and Leonidas J

C. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation.2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 77–85,

2017
[48]

URLhttps://api.semanticscholar.org/CorpusID:5115938
[49]

Dreamgaussian4d: Generative 4d gaussian splatting.arXiv preprint arXiv:2312.17142, 2023

Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, and Ziwei Liu. Dreamgaussian4d: Generative 4d gaussian splatting.arXiv preprint arXiv:2312.17142, 2023

arXiv 2023
[50]

Videovla: Video generators can be generalizable robot manipulators

Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. Videovla: Video generators can be generalizable robot manipulators. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025. URLhttps://openreview.net/forum?id=UPHlqbZFZB

2025
[51]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. InInternational Confer- ence on Learning Representations, 2021. URLhttps://openreview.net/forum?id=St1giarCHLP

2021
[52]

Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

Pith/arXiv arXiv 2024
[53]

Towards accurate generative models of video: A new metric & challenges.ArXiv, abs/1812.01717, 2018

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges.ArXiv, abs/1812.01717, 2018. URLhttps://api.semanticscholar.org/CorpusID:54458806

Pith/arXiv arXiv 2018
[54]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning, pages 1723–1736. PMLR, 2023

2023
[55]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

TeamWan, AngWang, BaoleAi, BinWen, ChaojieMao, Chen-WeiXie, DiChen, FeiwuYu, HaimingZhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025
[56]

Unified vision-language-action model.arXiv preprint arXiv:2506.19850, 2025

Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxi- ang Zhang. Unified vision-language-action model.arXiv preprint arXiv:2506.19850, 2025

arXiv 2025
[57]

Birchfield

Bowen Wen, Wei Yang, Jan Kautz, and Stanley T. Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17868–17879, 2023. URLhttps://api.semanticscholar.org/CorpusID:266191252

2024
[58]

Foundationstereo: Zero-shot stereo matching.CVPR, 2025

Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero-shot stereo matching.CVPR, 2025

2025
[59]

Video models are zero-shot learners and reasoners.arXiv preprint arXiv:2509.20328, 2025

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners.arXiv preprint arXiv:2509.20328, 2025

Pith/arXiv arXiv 2025
[60]

4d gaussian splatting for real-time dynamic scene rendering

Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20310–20320, June 2024

2024
[61]

Unleashing large-scale video generative pre-training for visual robot manipulation

Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. InInternational Conference on Learning Representations, 2024. 12

2024
[62]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

Pith/arXiv arXiv 2024
[63]

World action models are zero-shot policies, 2026

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

Pith/arXiv arXiv 2026
[64]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. InCVPR, 2025

2025
[65]

3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations

Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. InProceedings of Robotics: Science and Systems (RSS), 2024

2024
[66]

World-consistent video diffusion with explicit 3d modeling

Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista Martin, Kevin Miao, Alexander Toshev, Joshua Susskind, and Jiatao Gu. World-consistent video diffusion with explicit 3d modeling. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21685–21695, 2025

2025
[67]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

2025
[68]

3d-vla: 3d vision-language-action generative world model.arXiv preprint arXiv:2403.09631, 2024

Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: 3d vision-language-action generative world model.arXiv preprint arXiv:2403.09631, 2024

Pith/arXiv arXiv 2024
[69]

TesserAct: Learning 4d embodied world models,

Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, and Chuang Gan. Tesseract: Learning 4d embodied world models.CoRR, abs/2504.20995, 2025. doi: 10.48550/ARXIV.2504.20995. URL https://doi.org/10.48550/arXiv.2504.20995

work page doi:10.48550/arxiv.2504.20995 2025
[70]

A survey on vision-language-action models: An action tokenization perspective.arXiv preprint arXiv:2507.01925, 2025

Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective.arXiv preprint arXiv:2507.01925, 2025

Pith/arXiv arXiv 2025
[71]

Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets

Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. InProceedings of Robotics: Science and Systems (RSS), 2025

2025
[72]

Causal forcing: Autoregres- sive diffusion distillation done right for high-quality real-time interactive video generation.arXiv preprint arXiv:2602.02214, 2026

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregres- sive diffusion distillation done right for high-quality real-time interactive video generation.arXiv preprint arXiv:2602.02214, 2026

Pith/arXiv arXiv 2026
[73]

Streaming 4d visual geometry transformer.arXiv preprint arXiv:2507.11539, 2025

Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer.arXiv preprint arXiv:2507.11539, 2025

Pith/arXiv arXiv 2025
[74]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023. 13 Appendix Overview This appendix complements the main paper as follows. Sec. A prov...

2023

[1] [1]

World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

Arslan Ali, Junjie Bai, Maciej Bala, Yogesh Balaji, Aaron Blakeman, Tiffany Cai, Jiaxin Cao, Tianshi Cao, Elizabeth Cha, Yu-Wei Chao, et al. World simulation with video foundation models for physical ai.arXiv preprint arXiv:2511.00062, 2025

Pith/arXiv arXiv 2025

[2] [2]

URL https://proceedings.mlr

Sherwin Bahmani, Ivan Skorokhodov, Victor Rong, Gordon Wetzstein, Leonidas J. Guibas, Peter Wonka, Sergey Tulyakov, Jeong Joon Park, Andrea Tagliasacchi, and David B. Lindell. 4d-fy: Text-to-4d generation using hybrid score distillation sampling. InIEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, W A, USA, June 16-22, 20...

work page doi:10.1109/cvpr52733.2024.00764 2024

[3] [3]

Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

Pith/arXiv arXiv 2025

[4] [4]

Geovideo: Introducing geometric regularization into video generation model, 2025

Yunpeng Bai, Shaoheng Fang, Chaohui Yu, Fan Wang, and Qixing Huang. Geovideo: Introducing geometric regularization into video generation model, 2025. URLhttps://arxiv.org/abs/2512.03453

arXiv 2025

[5] [5]

Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Lucas Beyer, Andreas Steiner, André Susano Pinto, Alexander Kolesnikov, Xiao Wang, Daniel Salz, Maxim Neumann, Ibrahim Alabdulmohsin, Michael Tschannen, Emanuele Bugliarello, et al. Paligemma: A versatile 3b vlm for transfer.arXiv preprint arXiv:2407.07726, 2024

Pith/arXiv arXiv 2024

[6] [6]

Gr00t n1: An open foundation model for generalist humanoid robots

Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots. arXiv preprint arXiv:2503.14734, 2025

Pith/arXiv arXiv 2025

[7] [7]

arXiv preprint arXiv:2410.24164, 2024

Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al.π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024

Pith/arXiv arXiv 2024

[8] [8]

Zero-shot robotic manipulation with pre-trained image-editing diffusion models

Kevin Black, Mitsuhiko Nakamoto, Pranav Atreya, Homer Rich Walke, Chelsea Finn, Aviral Kumar, and Sergey Levine. Zero-shot robotic manipulation with pre-trained image-editing diffusion models. InThe Twelfth Inter- national Conference on Learning Representations, 2024. URLhttps://openreview.net/forum?id=c0chJTSbci

2024

[9] [9]

Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Robert Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, brian ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Allen Z. Ren, ...

2025

[10] [10]

Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Andreas Blattmann, Tim Dockhorn, Sumith Kulal, Daniel Mendelevitch, Maciej Kilian, Dominik Lorenz, Yam Levi, Zion English, Vikram Voleti, Adam Letts, et al. Stable video diffusion: Scaling latent video diffusion models to large datasets.arXiv preprint arXiv:2311.15127, 2023

Pith/arXiv arXiv 2023

[11] [11]

Sam 3: Segment anything with concepts

Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025

Pith/arXiv arXiv 2025

[12] [12]

Rynnvla-002: A unified vision-language-action and world model

Jun Cen, Siteng Huang, Yuqian Yuan, Kehan Li, Hangjie Yuan, Chaohui Yu, Yuming Jiang, Jiayan Guo, Xin Li, Hao Luo, Fan Wang, Fan Wang, and Deli Zhao. Rynnvla-002: A unified vision-language-action and world model. arXiv preprint arXiv:2511.17502, 2025

Pith/arXiv arXiv 2025

[13] [13]

Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:, 2025

Jun Cen, Chaohui Yu, Hangjie Yuan, Yuming Jiang, Siteng Huang, Jiayan Guo, Xin Li, Yibing Song, Hao Luo, Fan Wang, Deli Zhao, and Hao Chen. Worldvla: Towards autoregressive action world model.arXiv preprint arXiv:, 2025

2025

[14] [14]

Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

Chi-Lam Cheang, Guangzeng Chen, Ya Jing, Tao Kong, Hang Li, Yifeng Li, Yuxiao Liu, Hongtao Wu, Jiafeng Xu, Yichu Yang, Hanbo Zhang, and Minzhao Zhu. Gr-2: A generative video-language-action model with web-scale knowledge for robot manipulation.arXiv preprint arXiv:2410.06158, 2024

Pith/arXiv arXiv 2024

[15] [15]

Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2025

Boyuan Chen, Diego Martí Monsó, Yilun Du, Max Simchowitz, Russ Tedrake, and Vincent Sitzmann. Diffusion forcing: Next-token prediction meets full-sequence diffusion.Advances in Neural Information Processing Systems, 37:24081–24125, 2025

2025

[16] [16]

Large video planner enables generalizable robot control

Boyuan Chen, Tianyuan Zhang, Haoran Geng, Kiwhan Song, Caiyi Zhang, Peihao Li, William T Freeman, Jitendra Malik, Pieter Abbeel, Russ Tedrake, et al. Large video planner enables generalizable robot control. arXiv preprint arXiv:2512.15840, 2025

Pith/arXiv arXiv 2025

[17] [17]

4dnex: Feed-forward 4d generative modeling made easy.arXiv preprint arXiv:2508.13154, 2025

Zhaoxi Chen, Tianqi Liu, Long Zhuo, Jiawei Ren, Zeng Tao, He Zhu, Fangzhou Hong, Liang Pan, and Ziwei Liu. 4dnex: Feed-forward 4d generative modeling made easy.arXiv preprint arXiv:2508.13154, 2025

arXiv 2025

[18] [18]

Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

Yilun Du, Sherry Yang, Bo Dai, Hanjun Dai, Ofir Nachum, Josh Tenenbaum, Dale Schuurmans, and Pieter Abbeel. Learning universal policies via text-guided video generation.Advances in neural information processing systems, 36:9156–9172, 2023

2023

[19] [19]

Mitra, and Qixing Huang

Shaoheng Fang, Hanwen Jiang, Yunpeng Bai, Niloy J. Mitra, and Qixing Huang. Worldreel: 4d video generation with consistent geometry and motion modeling, 2025. URLhttps://arxiv.org/abs/2512.07821

arXiv 2025

[20] [20]

data gap

Ken Goldberg. Good old-fashioned engineering can close the 100,000-year “data gap” in robotics.Science Robotics, 10(105):eaea7390, 2025. doi: 10.1126/scirobotics.aea7390. URLhttps://www.science.org/doi/abs/10.1126/ scirobotics.aea7390

work page doi:10.1126/scirobotics.aea7390 2025

[21] [21]

Predic- tion with action: Visual policy learning via joint denoising process.Advances in Neural Information Processing Systems, 37:112386–112410, 2024

Yanjiang Guo, Yucheng Hu, Jianke Zhang, Yen-Jen Wang, Xiaoyu Chen, Chaochao Lu, and Jianyu Chen. Predic- tion with action: Visual policy learning via joint denoising process.Advances in Neural Information Processing Systems, 37:112386–112410, 2024

2024

[22] [22]

Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. InThe Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URLhttps://openreview.net/ forum?id=nZeVKeeFYf9

2022

[23] [23]

Video prediction policy: A generalist robot policy with predictive visual representations

Yucheng Hu, Yanjiang Guo, Pengchao Wang, Xiaoyu Chen, Yen-Jen Wang, Jianke Zhang, Koushil Sreenath, Chaochao Lu, and Jianyu Chen. Video prediction policy: A generalist robot policy with predictive visual representations. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,F...

2025

[24] [24]

Self forcing: Bridging the train-test gap in autoregressive video diffusion

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, and Eli Shechtman. Self forcing: Bridging the train-test gap in autoregressive video diffusion. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. URLhttps://openreview.net/forum?id=mSiN7i0BYH

2025

[25] [25]

Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025

Joel Jang, Seonghyeon Ye, Zongyu Lin, Jiannan Xiang, Johan Bjorck, Yu Fang, Fengyuan Hu, Spencer Huang, Kaushil Kundalia, Yen-Chen Lin, et al. Dreamgen: Unlocking generalization in robot learning through video world models.arXiv preprint arXiv:2505.12705, 2025. 10

Pith/arXiv arXiv 2025

[26] [26]

Geo4d: Leveraging video generators for geometric 4d scene reconstruction, 2025

Zeren Jiang, Chuanxia Zheng, Iro Laina, Diane Larlus, and Andrea Vedaldi. Geo4d: Leveraging video generators for geometric 4d scene reconstruction, 2025. URLhttps://arxiv.org/abs/2504.07961

arXiv 2025

[27] [27]

Vision-language-action models for robotics: A review towards real-world applications.IEEE Access, 13:162467–162504, 2025

Kento Kawaharazuka, Jihoon Oh, Jun Yamada, Ingmar Posner, and Yuke Zhu. Vision-language-action models for robotics: A review towards real-world applications.IEEE Access, 13:162467–162504, 2025. doi: 10.1109/ ACCESS.2025.3609980

arXiv 2025

[28] [28]

Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Alexander Khazatsky, Karl Pertsch, Suraj Nair, Ashwin Balakrishna, Sudeep Dasari, Siddharth Karamcheti, Soroush Nasiriany, Mohan Kumar Srirama, Lawrence Yunliang Chen, Kirsty Ellis, et al. Droid: A large-scale in-the-wild robot manipulation dataset.arXiv preprint arXiv:2403.12945, 2024

Pith/arXiv arXiv 2024

[29] [29]

Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246, 2024

Pith/arXiv arXiv 2024

[30] [30]

Cosmos policy: Fine-tuning video models for visuomotor control and planning

Moo Jin Kim, Yihuai Gao, Tsung-Yi Lin, Yen-Chen Lin, Yunhao Ge, Grace Lam, Percy Liang, Shuran Song, Ming-Yu Liu, Chelsea Finn, et al. Cosmos policy: Fine-tuning video models for visuomotor control and planning. arXiv preprint arXiv:2601.16163, 2026

Pith/arXiv arXiv 2026

[31] [31]

Causal world modeling for robot control, 2026

Lin Li, Qihang Zhang, Yiming Luo, Shuai Yang, Ruilin Wang, Fei Han, Mingrui Yu, Zelin Gao, Nan Xue, Xing Zhu, Yujun Shen, and Yinghao Xu. Causal world modeling for robot control, 2026. URLhttps://arxiv.org/abs/2601. 21998

2026

[32] [32]

Unified video action model.arXiv preprint arXiv:2503.00200, 2025

Shuang Li, Yihuai Gao, Dorsa Sadigh, and Shuran Song. Unified video action model.arXiv preprint arXiv:2503.00200, 2025

Pith/arXiv arXiv 2025

[33] [33]

Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025

Junbang Liang, Pavel Tokmakov, Ruoshi Liu, Sruthi Sudhakar, Paarth Shah, Rares Ambrus, and Carl Vondrick. Video generators are robot policies.arXiv preprint arXiv:2508.00795, 2025

Pith/arXiv arXiv 2025

[34] [34]

Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Haotong Lin, Sili Chen, Junhao Liew, Donny Y Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth anything 3: Recovering the visual space from any views.arXiv preprint arXiv:2511.10647, 2025

Pith/arXiv arXiv 2025

[35] [35]

Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InThe Eleventh International Conference on Learning Representations, 2023. URLhttps: //openreview.net/forum?id=PqvMRDCJT9t

2023

[36] [36]

Free4d: Tuning-free 4d scene generation with spatial-temporal consistency

Tianqi Liu, Zihao Huang, Zhaoxi Chen, Guangcong Wang, Shoukang Hu, Liao Shen, Huiqiang Sun, Zhiguo Cao, Wei Li, and Ziwei Liu. Free4d: Tuning-free 4d scene generation with spatial-temporal consistency. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 25571–25582, October 2025

2025

[37] [37]

Geometry-aware 4d video generation for robot manipulation

Zeyi Liu, Shuang Li, Eric Cousineau, Siyuan Feng, Benjamin Burchfiel, and Shuran Song. Geometry-aware 4d video generation for robot manipulation. InThe Fourteenth International Conference on Learning Representa- tions, 2026. URLhttps://openreview.net/forum?id=18gC6pZVVc

2026

[38] [38]

Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots

Soroush Nasiriany, Sepehr Nasiriany, Abhiram Maddukuri, and Yuke Zhu. Robocasa365: A large-scale simulation framework for training and benchmarking generalist robots. InThe Fourteenth International Conference on Learning Representations, 2026. URLhttps://openreview.net/forum?id=tQJYKwc3n4

2026

[39] [39]

Octo: An open-source generalist robot policy

Octo Model Team, Dibya Ghosh, Homer Walke, Karl Pertsch, Kevin Black, Oier Mees, Sudeep Dasari, Joey Hejna, Charles Xu, Jianlan Luo, Tobias Kreiman, You Liang Tan, Pannag Sanketi, Quan Vuong, Ted Xiao, Dorsa Sadigh, Chelsea Finn, and Sergey Levine. Octo: An open-source generalist robot policy. InProceedings of Robotics: Science and Systems, Delft, Netherl...

2024

[40] [40]

Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0

Abby O’Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open x-embodiment: Robotic learning datasets and rt-x models: Open x-embodiment collaboration 0. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024

2024

[41] [41]

mimic-video: Video-action models for generalizable robot control beyond vlas, 2025

Jonas Pai, Liam Achenbach, Victoriano Montesinos, Benedek Forrai, Oier Mees, and Elvis Nava. mimic-video: Video-action models for generalizable robot control beyond vlas, 2025. URLhttps://arxiv.org/abs/2512.15692

Pith/arXiv arXiv 2025

[42] [42]

Efficient4d: Fast dynamic 3d object generation from a single- view video.Int

Zijie Pan, Zeyu Yang, Xiatian Zhu, and Li Zhang. Efficient4d: Fast dynamic 3d object generation from a single- view video.Int. J. Comput. Vis., 134(1):14, 2026. doi: 10.1007/S11263-025-02615-Z. URLhttps://doi.org/10.1007/ s11263-025-02615-z. 11

work page doi:10.1007/s11263-025-02615-z 2026

[43] [43]

Scalable diffusion models with transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vision, pages 4195–4205, 2023

2023

[44] [44]

Peebles and Saining Xie

William S. Peebles and Saining Xie. Scalable diffusion models with transformers.2023 IEEE/CVF Interna- tional Conference on Computer Vision (ICCV), pages 4172–4182, 2022. URLhttps://api.semanticscholar.org/CorpusID: 254854389

2023

[45] [45]

Barron, and Ben Mildenhall

Ben Poole, Ajay Jain, Jonathan T. Barron, and Ben Mildenhall. Dreamfusion: Text-to-3d using 2d diffusion. arXiv, 2022

2022

[46] [46]

D-NeRF: Neural Radiance Fields for Dynamic Scenes

Albert Pumarola, Enric Corona, Gerard Pons-Moll, and Francesc Moreno-Noguer. D-NeRF: Neural Radiance Fields for Dynamic Scenes. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020

2020

[47] [47]

Qi, Hao Su, Kaichun Mo, and Leonidas J

C. Qi, Hao Su, Kaichun Mo, and Leonidas J. Guibas. Pointnet: Deep learning on point sets for 3d classification and segmentation.2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 77–85,

2017

[48] [48]

URLhttps://api.semanticscholar.org/CorpusID:5115938

[49] [49]

Dreamgaussian4d: Generative 4d gaussian splatting.arXiv preprint arXiv:2312.17142, 2023

Jiawei Ren, Liang Pan, Jiaxiang Tang, Chi Zhang, Ang Cao, Gang Zeng, and Ziwei Liu. Dreamgaussian4d: Generative 4d gaussian splatting.arXiv preprint arXiv:2312.17142, 2023

arXiv 2023

[50] [50]

Videovla: Video generators can be generalizable robot manipulators

Yichao Shen, Fangyun Wei, Zhiying Du, Yaobo Liang, Yan Lu, Jiaolong Yang, Nanning Zheng, and Baining Guo. Videovla: Video generators can be generalizable robot manipulators. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems (NeurIPS), 2025. URLhttps://openreview.net/forum?id=UPHlqbZFZB

2025

[51] [51]

Denoising diffusion implicit models

Jiaming Song, Chenlin Meng, and Stefano Ermon. Denoising diffusion implicit models. InInternational Confer- ence on Learning Representations, 2021. URLhttps://openreview.net/forum?id=St1giarCHLP

2021

[52] [52]

Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, Surya Bhupatiraju, Léonard Hussenot, Thomas Mesnard, Bobak Shahriari, Alexandre Ramé, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118, 2024

Pith/arXiv arXiv 2024

[53] [53]

Towards accurate generative models of video: A new metric & challenges.ArXiv, abs/1812.01717, 2018

Thomas Unterthiner, Sjoerd van Steenkiste, Karol Kurach, Raphaël Marinier, Marcin Michalski, and Sylvain Gelly. Towards accurate generative models of video: A new metric & challenges.ArXiv, abs/1812.01717, 2018. URLhttps://api.semanticscholar.org/CorpusID:54458806

Pith/arXiv arXiv 2018

[54] [54]

Bridgedata v2: A dataset for robot learning at scale

Homer Rich Walke, Kevin Black, Tony Z Zhao, Quan Vuong, Chongyi Zheng, Philippe Hansen-Estruch, An- dre Wang He, Vivek Myers, Moo Jin Kim, Max Du, et al. Bridgedata v2: A dataset for robot learning at scale. InConference on Robot Learning, pages 1723–1736. PMLR, 2023

2023

[55] [55]

Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

TeamWan, AngWang, BaoleAi, BinWen, ChaojieMao, Chen-WeiXie, DiChen, FeiwuYu, HaimingZhao, Jianx- iao Yang, et al. Wan: Open and advanced large-scale video generative models.arXiv preprint arXiv:2503.20314, 2025

Pith/arXiv arXiv 2025

[56] [56]

Unified vision-language-action model.arXiv preprint arXiv:2506.19850, 2025

Yuqi Wang, Xinghang Li, Wenxuan Wang, Junbo Zhang, Yingyan Li, Yuntao Chen, Xinlong Wang, and Zhaoxi- ang Zhang. Unified vision-language-action model.arXiv preprint arXiv:2506.19850, 2025

arXiv 2025

[57] [57]

Birchfield

Bowen Wen, Wei Yang, Jan Kautz, and Stanley T. Birchfield. Foundationpose: Unified 6d pose estimation and tracking of novel objects.2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 17868–17879, 2023. URLhttps://api.semanticscholar.org/CorpusID:266191252

2024

[58] [58]

Foundationstereo: Zero-shot stereo matching.CVPR, 2025

Bowen Wen, Matthew Trepte, Joseph Aribido, Jan Kautz, Orazio Gallo, and Stan Birchfield. Foundationstereo: Zero-shot stereo matching.CVPR, 2025

2025

[59] [59]

Video models are zero-shot learners and reasoners.arXiv preprint arXiv:2509.20328, 2025

Thaddäus Wiedemer, Yuxuan Li, Paul Vicol, Shixiang Shane Gu, Nick Matarese, Kevin Swersky, Been Kim, Priyank Jaini, and Robert Geirhos. Video models are zero-shot learners and reasoners.arXiv preprint arXiv:2509.20328, 2025

Pith/arXiv arXiv 2025

[60] [60]

4d gaussian splatting for real-time dynamic scene rendering

Guanjun Wu, Taoran Yi, Jiemin Fang, Lingxi Xie, Xiaopeng Zhang, Wei Wei, Wenyu Liu, Qi Tian, and Xinggang Wang. 4d gaussian splatting for real-time dynamic scene rendering. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20310–20320, June 2024

2024

[61] [61]

Unleashing large-scale video generative pre-training for visual robot manipulation

Hongtao Wu, Ya Jing, Chilam Cheang, Guangzeng Chen, Jiafeng Xu, Xinghang Li, Minghuan Liu, Hang Li, and Tao Kong. Unleashing large-scale video generative pre-training for visual robot manipulation. InInternational Conference on Learning Representations, 2024. 12

2024

[62] [62]

Cogvideox: Text-to-video diffusion models with an expert transformer

Zhuoyi Yang, Jiayan Teng, Wendi Zheng, Ming Ding, Shiyu Huang, Jiazheng Xu, Yuanming Yang, Wenyi Hong, Xiaohan Zhang, Guanyu Feng, et al. Cogvideox: Text-to-video diffusion models with an expert transformer. arXiv preprint arXiv:2408.06072, 2024

Pith/arXiv arXiv 2024

[63] [63]

World action models are zero-shot policies, 2026

Seonghyeon Ye, Yunhao Ge, Kaiyuan Zheng, Shenyuan Gao, Sihyun Yu, George Kurian, Suneel Indupuru, You Liang Tan, Chuning Zhu, Jiannan Xiang, Ayaan Malik, Kyungmin Lee, William Liang, Nadun Ranawaka, Jiasheng Gu, Yinzhen Xu, Guanzhi Wang, Fengyuan Hu, Avnish Narayan, Johan Bjorck, Jing Wang, Gwanghyun Kim, Dantong Niu, Ruijie Zheng, Yuqi Xie, Jimmy Wu, Qi ...

Pith/arXiv arXiv 2026

[64] [64]

From slow bidirectional to fast autoregressive video diffusion models

Tianwei Yin, Qiang Zhang, Richard Zhang, William T Freeman, Fredo Durand, Eli Shechtman, and Xun Huang. From slow bidirectional to fast autoregressive video diffusion models. InCVPR, 2025

2025

[65] [65]

3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations

Yanjie Ze, Gu Zhang, Kangning Zhang, Chenyuan Hu, Muhan Wang, and Huazhe Xu. 3d diffusion policy: Generalizable visuomotor policy learning via simple 3d representations. InProceedings of Robotics: Science and Systems (RSS), 2024

2024

[66] [66]

World-consistent video diffusion with explicit 3d modeling

Qihang Zhang, Shuangfei Zhai, Miguel Angel Bautista Martin, Kevin Miao, Alexander Toshev, Joshua Susskind, and Jiatao Gu. World-consistent video diffusion with explicit 3d modeling. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 21685–21695, 2025

2025

[67] [67]

Cot-vla: Visual chain-of-thought reasoning for vision-language-action models

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, et al. Cot-vla: Visual chain-of-thought reasoning for vision-language-action models. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1702–1713, 2025

2025

[68] [68]

3d-vla: 3d vision-language-action generative world model.arXiv preprint arXiv:2403.09631, 2024

Haoyu Zhen, Xiaowen Qiu, Peihao Chen, Jincheng Yang, Xin Yan, Yilun Du, Yining Hong, and Chuang Gan. 3d-vla: 3d vision-language-action generative world model.arXiv preprint arXiv:2403.09631, 2024

Pith/arXiv arXiv 2024

[69] [69]

TesserAct: Learning 4d embodied world models,

Haoyu Zhen, Qiao Sun, Hongxin Zhang, Junyan Li, Siyuan Zhou, Yilun Du, and Chuang Gan. Tesseract: Learning 4d embodied world models.CoRR, abs/2504.20995, 2025. doi: 10.48550/ARXIV.2504.20995. URL https://doi.org/10.48550/arXiv.2504.20995

work page doi:10.48550/arxiv.2504.20995 2025

[70] [70]

A survey on vision-language-action models: An action tokenization perspective.arXiv preprint arXiv:2507.01925, 2025

Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, et al. A survey on vision-language-action models: An action tokenization perspective.arXiv preprint arXiv:2507.01925, 2025

Pith/arXiv arXiv 2025

[71] [71]

Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets

Chuning Zhu, Raymond Yu, Siyuan Feng, Benjamin Burchfiel, Paarth Shah, and Abhishek Gupta. Unified world models: Coupling video and action diffusion for pretraining on large robotic datasets. InProceedings of Robotics: Science and Systems (RSS), 2025

2025

[72] [72]

Causal forcing: Autoregres- sive diffusion distillation done right for high-quality real-time interactive video generation.arXiv preprint arXiv:2602.02214, 2026

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, and Jun Zhu. Causal forcing: Autoregres- sive diffusion distillation done right for high-quality real-time interactive video generation.arXiv preprint arXiv:2602.02214, 2026

Pith/arXiv arXiv 2026

[73] [73]

Streaming 4d visual geometry transformer.arXiv preprint arXiv:2507.11539, 2025

Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer.arXiv preprint arXiv:2507.11539, 2025

Pith/arXiv arXiv 2025

[74] [74]

Rt-2: Vision-language-action models transfer web knowledge to robotic control

Brianna Zitkovich, Tianhe Yu, Sichun Xu, Peng Xu, Ted Xiao, Fei Xia, Jialin Wu, Paul Wohlhart, Stefan Welker, Ayzaan Wahid, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control. In Conference on Robot Learning, pages 2165–2183. PMLR, 2023. 13 Appendix Overview This appendix complements the main paper as follows. Sec. A prov...

2023