PointAction: 3D Points as Universal Action Representations for Robot Control
Pith reviewed 2026-06-28 10:08 UTC · model grok-4.3
The pith
PointAction predicts dynamic 3D pointmaps from video models to serve as a metric interface that a diffusion decoder converts into robot actions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps that capture the temporally consistent metric motion of task-relevant geometry. These point dynamics are the input to a diffusion-based action decoder that produces executable robot actions. The stated benefit is twofold: the interface reduces the under-specification of pure RGB rollouts, and it enables transfer across tasks and embodiments with limited action supervision.
What carries the argument
Jointly predicted dynamic 3D pointmaps that encode metric 3D motion of scene points and act as the structured interface between the video model and the action decoder.
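To make the interface concrete, here is a minimal sketch of how point dynamics could be held as data. It assumes, beyond what the abstract states, that a dynamic pointmap can be viewed as a set of scene points, each followed over T frames in metric coordinates. The names `PointTrack`, `frame_displacements`, and `mean_motion` are hypothetical and are not taken from the paper.

```python
from typing import List, Tuple

Point3D = Tuple[float, float, float]  # metric coordinates (assumed to be metres)
PointTrack = List[Point3D]            # one scene point followed over T frames


def frame_displacements(track: PointTrack) -> List[Point3D]:
    """Per-frame 3D motion of a single point: p[t+1] - p[t]."""
    return [
        (b[0] - a[0], b[1] - a[1], b[2] - a[2])
        for a, b in zip(track, track[1:])
    ]


def mean_motion(tracks: List[PointTrack]) -> List[Point3D]:
    """Average displacement over all points for each frame transition.

    Assumes every track has the same number of frames.
    """
    per_point = [frame_displacements(t) for t in tracks]
    n = len(per_point)
    steps = len(per_point[0])
    return [
        tuple(sum(p[s][k] for p in per_point) / n for k in range(3))
        for s in range(steps)
    ]
```

The point of the sketch is only that displacements are expressed in metric units and do not mention any robot joint or gripper, which is what would let a separate decoder map them to embodiment-specific actions.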
If this is right
- The method achieves state-of-the-art 4D generation quality on robot scenes.
- It outperforms existing baselines in simulation experiments.
- It generalizes successfully to two real robot arms that were unseen during pretraining.
- It supports cross-task and cross-embodiment transfer while requiring only limited action supervision.
Where Pith is reading between the lines
- If point dynamics prove reliable, future work could pretrain the video component on large unlabeled video corpora and add only small action datasets for new hardware.
- The same point-based interface might extend to non-arm embodiments such as mobile bases if the predicted points capture the necessary contact and navigation geometry.
- Errors in pointmap prediction would directly limit downstream action success, suggesting that improvements in 3D consistency could yield measurable gains in real-world success rates.
Load-bearing premise
The predicted 3D pointmaps stay accurate enough and consistent over time that the diffusion decoder can turn them into valid actions on robot arms never seen in pretraining.
What would settle it
On a held-out real robot arm, generate pointmaps from the model on a new task and check whether the decoded actions produce large 3D positioning errors or fail to complete the task more often than baselines.
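The check described above can be sketched as a small comparison over episode logs. This is an illustration under stated assumptions: the `Episode` record, its fields, and the helper names are hypothetical, and the source does not specify how positioning error or task completion would be logged.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Episode:
    success: bool                   # did the rollout complete the task?
    final_position_error_m: float   # distance between reached and target position


def failure_rate(episodes: List[Episode]) -> float:
    """Fraction of episodes that did not complete the task."""
    return sum(not e.success for e in episodes) / len(episodes)


def mean_position_error(episodes: List[Episode]) -> float:
    """Average final 3D positioning error in metres."""
    return sum(e.final_position_error_m for e in episodes) / len(episodes)


def fails_more_often(method: List[Episode], baseline: List[Episode]) -> bool:
    """True if the method's failure rate exceeds the baseline's on the same task."""
    return failure_rate(method) > failure_rate(baseline)
```

A real test would also need enough trials per condition for the difference in failure rates to be statistically meaningful; the sketch only fixes what is being compared.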
Original abstract
Video-Action Models (VAMs) leverage the broad visual dynamics captured by pre-trained video diffusion models, offering a promising path toward generalizable robot manipulation. However, RGB-only video rollouts are not directly actionable: they leave metric 3D motion, contact geometry, and fine-grained spatial constraints under-specified, making action grounding ambiguous. Meanwhile, scaling action supervision across diverse tasks and embodiments remains costly. We present PointAction, a framework that bridges video predictions to robot actions through explicit point-based 4D modeling. PointAction fine-tunes a foundation video generation model to jointly predict future RGB frames and dynamic 3D pointmaps, producing temporally consistent 3D motion of task-relevant scene geometry. These point dynamics serve as a structured, embodiment-agnostic action interface, which a diffusion-based action decoder maps to executable robot actions. By using metric 3D point dynamics as the interface between video prediction and control, PointAction reduces the ambiguity of RGB-only action grounding and supports transfer across tasks and embodiments with limited action supervision. Experiments show that PointAction achieves state-of-the-art 4D generation quality on robot scenes, outperforms existing baselines in simulation, and generalizes to two real robot arms unseen during pretraining.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents PointAction, a framework that fine-tunes a pre-trained video diffusion model to jointly generate future RGB frames and dynamic 3D pointmaps of robot scenes. These temporally consistent point dynamics are positioned as an embodiment-agnostic interface that a separate diffusion-based decoder maps to executable robot actions, with the goal of reducing RGB grounding ambiguity and enabling cross-task and cross-embodiment transfer under limited action supervision. The abstract reports state-of-the-art 4D generation quality on robot scenes, better performance than existing baselines in simulation, and successful generalization to two real robot arms absent from pretraining.
Significance. If the central results hold, the work supplies a concrete mechanism for converting video-prediction outputs into metric 3D motion that can be decoded into control signals, addressing a recognized bottleneck in video-action models. The explicit use of 3D point dynamics as a structured, transferable representation is a clear conceptual contribution that could support more scalable robot learning pipelines.
major comments (2)
- [Abstract / Experiments] Abstract and Experiments section: the central claim that jointly predicted dynamic 3D pointmaps are sufficiently accurate and temporally consistent to support reliable action decoding on unseen arms is not accompanied by quantitative bounds (e.g., 3D endpoint error, temporal consistency scores, or occlusion-handling metrics) or ablations that isolate the effect of residual pointmap noise on downstream action success. Without these, the data-to-claim link for generalization remains unevaluated.
- [Method / Experiments] The description of the fine-tuning and decoding pipeline states that point dynamics reduce RGB ambiguity, yet no comparison is provided against a direct RGB-to-action baseline that would quantify the incremental benefit of the 3D interface under the same limited-supervision regime.
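For readers unfamiliar with the metrics the first major comment asks for, the following sketch shows one plausible way to compute a 3D endpoint error and a simple temporal consistency score. Both definitions are assumptions made for illustration: the paper's own metrics are not given in the abstract, and `endpoint_error` and `displacement_error` are hypothetical names.

```python
import math
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Track = Sequence[Point3D]  # one point over T frames


def endpoint_error(pred: List[Track], ref: List[Track]) -> float:
    """Mean Euclidean distance between predicted and reference points."""
    dists = [
        math.dist(p, r)
        for pred_track, ref_track in zip(pred, ref)
        for p, r in zip(pred_track, ref_track)
    ]
    return sum(dists) / len(dists)


def displacement_error(pred: List[Track], ref: List[Track]) -> float:
    """Mean error in frame-to-frame motion.

    Unlike endpoint_error, this is insensitive to a constant offset, so it
    isolates whether the predicted motion is consistent over time.
    """
    errs = []
    for pred_track, ref_track in zip(pred, ref):
        for t in range(len(pred_track) - 1):
            dp = [pred_track[t + 1][k] - pred_track[t][k] for k in range(3)]
            dr = [ref_track[t + 1][k] - ref_track[t][k] for k in range(3)]
            errs.append(math.dist(dp, dr))
    return sum(errs) / len(errs)
```

Reporting both numbers separates two failure modes: a pointmap that is globally shifted but moves correctly, and one that is well placed on average but jitters between frames.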
minor comments (1)
- [Method] Notation for the pointmap representation (e.g., how metric scale is recovered and how points are selected as task-relevant) should be defined explicitly in the main text rather than deferred to supplementary material.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and describe the revisions we will implement to strengthen the evaluation and claims.
- Referee: [Abstract / Experiments] Abstract and Experiments section: the central claim that jointly predicted dynamic 3D pointmaps are sufficiently accurate and temporally consistent to support reliable action decoding on unseen arms is not accompanied by quantitative bounds (e.g., 3D endpoint error, temporal consistency scores, or occlusion-handling metrics) or ablations that isolate the effect of residual pointmap noise on downstream action success. Without these, the data-to-claim link for generalization remains unevaluated.
  Authors: We agree that the manuscript would benefit from explicit quantitative metrics on pointmap accuracy and consistency to more rigorously support the generalization claims. In the revised version, we will add 3D endpoint error, temporal consistency scores, and occlusion-handling metrics to the Experiments section. We will also include an ablation isolating the effect of residual pointmap noise on downstream action success rates. These additions will directly address the data-to-claim link for action decoding on unseen arms.
  revision: yes
- Referee: [Method / Experiments] The description of the fine-tuning and decoding pipeline states that point dynamics reduce RGB ambiguity, yet no comparison is provided against a direct RGB-to-action baseline that would quantify the incremental benefit of the 3D interface under the same limited-supervision regime.
  Authors: We acknowledge that a direct RGB-to-action baseline comparison is necessary to quantify the incremental benefit of the 3D point dynamics under limited supervision. In the revised manuscript, we will add this baseline by training and evaluating a diffusion-based decoder that maps directly from RGB video predictions to actions, using the identical limited-supervision regime and evaluation protocol as the PointAction pipeline. This will enable a clear measurement of the advantage provided by the pointmap interface.
  revision: yes
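The noise ablation promised in the first response could follow the pattern below. This is a toy sketch, not the authors' protocol: `toy_decoder` is a hypothetical stand-in that simply averages point displacements, whereas the paper uses a learned diffusion-based decoder, and the Gaussian noise model, tolerance, and function names are all assumptions.

```python
import math
import random
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Action = Tuple[float, float, float]  # toy action: a 3D translation


def add_noise(points: Sequence[Point3D], sigma: float, rng: random.Random) -> List[Point3D]:
    """Perturb each coordinate with zero-mean Gaussian noise of std sigma (metres)."""
    return [tuple(c + rng.gauss(0.0, sigma) for c in p) for p in points]


def toy_decoder(start: Sequence[Point3D], end: Sequence[Point3D]) -> Action:
    """Hypothetical stand-in for the action decoder: mean point displacement."""
    n = len(start)
    return tuple(sum(e[k] - s[k] for s, e in zip(start, end)) / n for k in range(3))


def success_rate(start: Sequence[Point3D], end: Sequence[Point3D],
                 sigma: float, tol: float, trials: int, seed: int = 0) -> float:
    """Fraction of trials where the action decoded from noisy points stays
    within tol of the action decoded from clean points."""
    rng = random.Random(seed)
    target = toy_decoder(start, end)
    hits = 0
    for _ in range(trials):
        action = toy_decoder(start, add_noise(end, sigma, rng))
        hits += math.dist(action, target) <= tol
    return hits / trials
```

Sweeping `sigma` over a range of values and plotting `success_rate` against it would give the kind of curve the referee asks for; with the real decoder and real tasks, the shape of that curve is the quantity of interest.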
Circularity Check
No significant circularity; pipeline is self-contained
The paper describes an empirical pipeline: fine-tune a foundation video model to jointly predict RGB frames and dynamic 3D pointmaps, then apply a diffusion-based action decoder to map point dynamics to robot actions. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described method. The central claim rests on the empirical sufficiency of the 4D interface rather than any reduction of outputs to inputs by construction. This is the common case of a non-circular engineering framework.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Pre-trained video diffusion models capture dynamics that can be fine-tuned to also output temporally consistent 3D pointmaps.