Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models
Pith reviewed 2026-06-27 12:53 UTC · model grok-4.3
The pith
An 8B-parameter unified model internalizes embodied cognition, planning, and self-correction to reach state-of-the-art on most embodied benchmarks and transfer to real robots.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Embodied-R1.5 integrates embodied reasoning capabilities spanning cognition, task planning, correction, and pointing within a single 8B architecture. Three automated data construction pipelines build a data system of over 15B tokens; a multi-task balanced RL recipe reduces conflicts among heterogeneous skills; and a Planner-Grounder-Corrector closed-loop framework lets the model execute and self-correct on long-horizon tasks. With these elements the model reaches SOTA on 16 of 24 embodied VLM benchmarks, outperforms leading VLAs after minimal fine-tuning on four manipulation suites, and demonstrates generalization in zero-shot real-robot experiments on instruction following, affordance grounding, articulated object manipulation, and long-horizon tasks.
What carries the argument
The Planner-Grounder-Corrector (PGC) closed-loop framework that lets a single model plan, ground actions in perception, and autonomously correct errors during extended physical tasks.
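A minimal sketch of what such a closed loop could look like, assuming the three roles are exposed as plain callables. The names (`run_pgc`, `plan`, `ground`, `execute`, `correct`), the retry budget, and the stop-on-failure policy are illustrative assumptions, not the paper's interface; in Embodied-R1.5 a single model is said to play all three roles.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class StepResult:
    subtask: str      # subtask as originally planned
    attempts: int     # number of ground/execute attempts made
    succeeded: bool   # whether any attempt succeeded


def run_pgc(
    instruction: str,
    plan: Callable[[str], List[str]],      # Planner: instruction -> subtasks (hypothetical)
    ground: Callable[[str], str],          # Grounder: subtask -> executable action (hypothetical)
    execute: Callable[[str], bool],        # Robot or simulator: action -> success flag
    correct: Callable[[str, str], str],    # Corrector: (subtask, failed action) -> revised subtask
    max_retries: int = 2,                  # assumed retry budget per subtask
) -> List[StepResult]:
    """Plan once, then ground and execute each subtask, revising it on failure."""
    results: List[StepResult] = []
    for subtask in plan(instruction):
        current = subtask
        attempts = 0
        succeeded = False
        while attempts <= max_retries:
            attempts += 1
            action = ground(current)
            if execute(action):
                succeeded = True
                break
            # Ask the corrector for a revised subtask before retrying.
            current = correct(current, action)
        results.append(StepResult(subtask, attempts, succeeded))
        if not succeeded:
            break  # assumed policy: abort the episode once a subtask cannot be recovered
    return results
```

A non-PGC baseline in this framing would correspond to `max_retries=0`, which is one way to make the comparison in the falsification test concrete.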
If this is right
- The model converts to a competitive VLA policy using only a small additional dataset while surpassing current leading VLAs on four manipulation benchmark suites.
- Zero-shot real-robot performance emerges on instruction following, affordance grounding, articulated-object manipulation, and multi-step tasks without task-specific retraining.
- A single set of weights can replace separate perception, planning, and control modules for many embodied problems.
- Balanced multi-task RL training keeps heterogeneous embodied skills from degrading one another during joint optimization.
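The balancing procedure itself is not described on this page. As a stand-in, the sketch below uses temperature-scaled task sampling, a common generic technique and not the paper's recipe: each task is sampled with probability proportional to its dataset size raised to an exponent `alpha`. The function name, the task names, and the default `alpha` are assumptions for illustration.

```python
from typing import Dict


def balanced_task_probs(task_sizes: Dict[str, int], alpha: float = 0.5) -> Dict[str, float]:
    """Return per-task sampling probabilities proportional to size ** alpha.

    alpha = 1.0 reproduces size-proportional sampling, alpha = 0.0 is uniform
    over tasks, and values in between up-weight small tasks.
    """
    if not task_sizes:
        raise ValueError("task_sizes must not be empty")
    if any(n <= 0 for n in task_sizes.values()):
        raise ValueError("every task needs a positive size")
    weights = {task: float(n) ** alpha for task, n in task_sizes.items()}
    total = sum(weights.values())
    return {task: w / total for task, w in weights.items()}
```

With hypothetical sizes `{"pointing": 100, "correction": 1}` and `alpha=0.5`, the small task's share rises from 1/101 under proportional sampling to 1/11, which illustrates how a balancing rule can keep a rare skill from being drowned out.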
Where Pith is reading between the lines
- If the data-construction pipelines scale, similar automated collection could quickly enlarge training sets for other model sizes or new robot morphologies.
- Explicit correction loops inside one model may prove more reliable for long physical sequences than pure end-to-end prediction.
- Open release of weights, datasets, and the evaluation kit could let other groups test whether the same recipe works on different robot hardware.
Load-bearing premise
The automated data pipelines and multi-task balanced RL produce real embodied skills that transfer to physical robots rather than benchmark-specific patterns.
What would settle it
Zero-shot real-robot trials on long-horizon tasks where performance collapses to the level of non-PGC baselines, or fine-tuned manipulation results that no longer beat leading VLAs on held-out suites.
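Whether performance "collapses to the level of non-PGC baselines" is a comparison of two success rates. One hedged way to frame it is a pooled two-proportion z-test, sketched below. The function name is an assumption, the paper's trial counts are not given here, and the normal approximation is rough for the small trial counts typical of real-robot studies.

```python
import math
from typing import Tuple


def two_proportion_z(success_a: int, trials_a: int,
                     success_b: int, trials_b: int) -> Tuple[float, float]:
    """Pooled two-proportion z-test; returns (z, two-sided p-value)."""
    p_a = success_a / trials_a
    p_b = success_b / trials_b
    pooled = (success_a + success_b) / (trials_a + trials_b)
    se = math.sqrt(pooled * (1.0 - pooled) * (1.0 / trials_a + 1.0 / trials_b))
    if se == 0.0:
        # All trials succeeded or all failed in both arms: no detectable difference.
        return 0.0, 1.0
    z = (p_a - p_b) / se
    # Two-sided p-value under the standard normal: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```

For example, with made-up counts of 18/20 successes with the closed loop against 6/20 without it, the test reports a clear difference; identical rates in both arms give `z = 0` and a p-value of 1.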
Original abstract
We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous task conflicts. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute and self-correct over long-horizon tasks. With only 8B parameters, Embodied-R1.5 achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing leading models like Gemini-Robotics-ER-1.5 and GPT-5.4. Benefiting from the internalized embodied capabilities, Embodied-R1.5 can be fine-tuned into a VLA with only a small amount of data, outperforming leading VLA models like $\pi_{0.5}$ across 4 popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments, validating performance in instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, demonstrating strong generalization to the physical world. We open-source model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks, to facilitate future research in EFMs.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Embodied-R1.5, an 8B-parameter unified Embodied Foundation Model that integrates embodied cognition, task planning, correction, and pointing. It employs three automated data construction pipelines to generate over 15B tokens, a multi-task balanced RL training recipe, and a Planner-Grounder-Corrector (PGC) closed-loop framework. The model is claimed to achieve SOTA on 16 of 24 embodied VLM benchmarks (surpassing Gemini-Robotics-ER-1.5 and GPT-5.4), fine-tune efficiently into a VLA outperforming π0.5 on 4 manipulation suites, and demonstrate zero-shot real-robot performance on instruction following, affordance grounding, articulated manipulation, and long-horizon tasks. All resources including model weights, datasets, code, and EmbodiedEvalKit are open-sourced.
Significance. If the empirical claims hold after detailed validation, this would constitute a meaningful contribution to embodied AI by showing that compact models can acquire broad physical reasoning via automated data scaling and balanced RL, enabling low-data VLA adaptation and real-world transfer. The explicit open-sourcing of model weights, datasets, training code, and the EmbodiedEvalKit evaluation framework is a clear strength that supports reproducibility and community progress.
Major comments (3)
- [Abstract] The SOTA claims on 16/24 embodied VLM benchmarks and outperformance of Gemini-Robotics-ER-1.5, GPT-5.4, and π0.5 are presented without any tables, quantitative metrics, ablation studies, error analysis, or baseline comparisons, rendering it impossible to evaluate whether the results reflect genuine embodied capabilities or benchmark-specific effects.
- [Abstract] The three automated data construction pipelines (>15B tokens) and multi-task balanced RL recipe are described only at a high level with no details on data sources, statistics, balancing procedure, or safeguards against leakage/task overlap; these elements are load-bearing for the central claim that the model internalizes transferable embodied cognition rather than distribution-matched artifacts.
- [Abstract] The PGC closed-loop framework and zero-shot real-robot experiments (instruction following, affordance grounding, articulated object manipulation, long-horizon tasks) are asserted to validate generalization and self-correction, but no quantitative results, task definitions, success rates, or controls for distribution shift are supplied, leaving the physical-world transfer claims unsupported.
Minor comments (2)
- [Abstract] The model size is stated as 'only 8B parameters' without comparison to the parameter counts of the cited baselines (Gemini-Robotics-ER-1.5, GPT-5.4), which would aid interpretation of the efficiency claim.
- [Abstract] Notation for the VLA baseline ($\pi_{0.5}$) should include a reference or brief description to avoid ambiguity.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments focus on the level of detail in the abstract. We agree that the abstract can be strengthened by adding brief quantitative anchors and explicit section references while preserving its concise nature. We will revise the abstract accordingly in the next version. Below we respond point by point.
- Referee: [Abstract] The SOTA claims on 16/24 embodied VLM benchmarks and outperformance of Gemini-Robotics-ER-1.5, GPT-5.4, and π0.5 are presented without any tables, quantitative metrics, ablation studies, error analysis, or baseline comparisons, rendering it impossible to evaluate whether the results reflect genuine embodied capabilities or benchmark-specific effects.
  Authors: The abstract is intentionally concise. The full manuscript provides the requested evidence in Table 1 (benchmark scores and comparisons), Section 4.1 (ablations and error analysis), and Section 4.2 (baseline details). To improve readability, we will revise the abstract to include one or two key quantitative deltas (e.g., average improvement over Gemini-Robotics-ER-1.5) and add parenthetical references to Table 1 and Section 4.
  Revision: yes
- Referee: [Abstract] The three automated data construction pipelines (>15B tokens) and multi-task balanced RL recipe are described only at a high level with no details on data sources, statistics, balancing procedure, or safeguards against leakage/task overlap; these elements are load-bearing for the central claim that the model internalizes transferable embodied cognition rather than distribution-matched artifacts.
  Authors: Section 3.2 and Appendix A contain the full specifications: data sources, per-pipeline token statistics, the multi-task balancing algorithm, and explicit leakage-prevention steps (e.g., temporal and semantic deduplication). We will add a short clause to the abstract noting “with leakage safeguards detailed in Section 3.2” and reference the appendix for statistics.
  Revision: yes
- Referee: [Abstract] The PGC closed-loop framework and zero-shot real-robot experiments (instruction following, affordance grounding, articulated object manipulation, long-horizon tasks) are asserted to validate generalization and self-correction, but no quantitative results, task definitions, success rates, or controls for distribution shift are supplied, leaving the physical-world transfer claims unsupported.
  Authors: Quantitative results appear in Section 5.3 (real-robot success rates, task definitions, and distribution-shift controls) and Figure 7. We will revise the abstract to state the key real-world success rates (e.g., “achieving X% success on long-horizon tasks”) and add a reference to Section 5.3.
  Revision: yes
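The leakage safeguards raised in the second exchange can be made concrete with a small check. The sketch below flags training samples whose word trigrams overlap heavily with any benchmark item. It is a simple lexical proxy, not the semantic deduplication the authors mention, and the function names, the trigram size, and the 0.5 threshold are all assumptions for illustration.

```python
from typing import List, Set, Tuple


def ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Lower-cased, whitespace-tokenized word n-grams of a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[Tuple[str, ...]], b: Set[Tuple[str, ...]]) -> float:
    """Jaccard similarity of two n-gram sets; 0.0 when both are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def flag_leaks(train: List[str], bench: List[str],
               n: int = 3, threshold: float = 0.5) -> List[int]:
    """Indices of training samples that overlap any benchmark item at or above threshold."""
    bench_grams = [ngrams(item, n) for item in bench]
    flagged = []
    for idx, sample in enumerate(train):
        sample_grams = ngrams(sample, n)
        if any(jaccard(sample_grams, bg) >= threshold for bg in bench_grams):
            flagged.append(idx)
    return flagged
```

A check of this kind only catches near-verbatim text overlap; paraphrased instructions or shared scenes and images would need embedding-based or visual deduplication, which is presumably what the referee's "task overlap" concern targets.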
Circularity Check
No circularity detected in empirical training and evaluation
The paper describes an empirical pipeline of automated data construction (>15B tokens), multi-task balanced RL training, PGC closed-loop execution, benchmark evaluation on 24 embodied VLM tasks, and zero-shot real-robot validation. No equations, derivations, or self-referential definitions are present that would reduce any claimed result (SOTA performance or VLA fine-tuning) to its own inputs by construction. All load-bearing claims rest on reported external metrics and experiments rather than fitted parameters renamed as predictions or self-citation chains.
Axiom & Free-Parameter Ledger
invented entities (1)
- Planner-Grounder-Corrector (PGC) closed-loop framework: no independent evidence
Reference graph
Works this paper leans on
-
[1]
Cosmos-reason1: From physical common sense to embodied reasoning
Alisson Azzolini, Junjie Bai, Hannah Brandon, Jiaxin Cao, Prithvijit Chattopadhyay, Huayu Chen, Jinju Chu, Yin Cui, Jenna Diamond, Yifan Ding, et al. Cosmos-reason1: From physical common sense to embodied reasoning. arXiv preprint arXiv:2503.15558,
-
[2]
Qwen3-vl technical report.CoRR, abs/2511.21631,
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...
-
[3]
doi: 10.48550/ARXIV.2511.21631. URL https://doi.org/10.48550/arXiv.2511.21631. Johan Bjorck, Fernando Castañeda, Nikita Cherniadev, Xingye Da, Runyu Ding, Linxi Fan, Yu Fang, Dieter Fox, 24 Fengyuan Hu, Spencer Huang, et al. Gr00t n1: An open foundation model for generalist humanoid robots.arXiv preprint arXiv:2503.14734,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.21631
-
[4]
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch, Lucy Xiaoyang Shi, James Tanner, Quan Vuong, Anna Walling, Haohuan Wang, and Ury Zhilinsky.𝜋0: A visio...
-
[5]
Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, A
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, K. Gopalakrishnan, Karol Hausman, Alexander Herzog, Jasmine Hsu, Julian Ibarz, Brian Ichter, A. Irpan, Tomas Jackson, Sally Jesmonth, Nikhil J. Joshi, Ryan C. Julian, Dmitry Kalashnikov, Yuheng Kuang, Isabel Leal, Kuang-Huei Lee, S. Levine, Yao Lu, U. Malla, D. Manj...
-
[6]
Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Xi Chen, Krzysztof Choromanski, Tianli Ding, Danny Driess, Avinava Dubey, Chelsea Finn, et al. Rt-2: Vision-language-action models transfer web knowledge to robotic control.arXiv preprint arXiv:2307.15818,
-
[7]
Qingwen Bu, Jisong Cai, Li Chen, Xiuqi Cui, Yan Ding, Siyuan Feng, Shenyuan Gao, Xindong He, Xuan Hu, Xu Huang, et al. Agibot world colosseo: A large-scale manipulation platform for scalable and intelligent embodied systems.arXiv preprint arXiv:2503.06669,
-
[8]
Depthlm: Metric depth from vision language models.arXiv preprint arXiv:2509.25413,
Zhipeng Cai, Ching-Feng Yeh, Hu Xu, Zhuang Liu, Gregory Meyer, Xinjie Lei, Changsheng Zhao, Shang-Wen Li, Vikas Chandra, and Yangyang Shi. Depthlm: Metric depth from vision language models.arXiv preprint arXiv:2509.25413,
-
[9]
Chang, Angela Dai, T
Angel X. Chang, Angela Dai, T. Funkhouser, Maciej Halber, M. Nießner, M. Savva, Shuran Song, Andy Zeng, and Yinda Zhang. Matterport3d: Learning from rgb-d data in indoor environments.2017 International Conference on 3D Vision (3DV), pages 667–676,
2017
-
[10]
Revisiting referring expression comprehension evaluation in the era of large multimodal models
Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S-H Gary Chan, and Hongyang Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 513–524, 2025a. Tianxing Chen, Zanxin Chen, Baijun Chen,...
-
[11]
Smith, Fei Xia, Dieter Fox, and Ranjay Krishna
Long Cheng, Jiafei Duan, Yi Ru Wang, Haoquan Fang, Boyang Li, Yushan Huang, Elvis Wang, Ainaz Eftekhar, Jason Lee, Wentao Yuan, Rose Hendrix, Noah A. Smith, Fei Xia, Dieter Fox, and Ranjay Krishna. Pointarena: Probing multimodal grounding through language-guided pointing.arXiv preprint arXiv:2505.09990,
-
[12]
Open x-embodiment: Robotic learning datasets and RT-X models.CoRR, abs/2310.08864,
25 Open X-Embodiment Collaboration. Open x-embodiment: Robotic learning datasets and RT-X models.CoRR, abs/2310.08864,
-
[13]
Open X-Embodiment: Robotic Learning Datasets and RT-X Models
doi: 10.48550/ARXIV.2310.08864. URL https://doi.org/10.48550/arXiv.2310.08864. StarVLA Community. Starvla: A lego-like codebase for vision-language-action model developing.arXiv preprint arXiv:2604.05014,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2310.08864
-
[14]
Chang, Manolis Savva, Maciej Halber, Thomas A
Angela Dai, Angel X. Chang, Manolis Savva, Maciej Halber, Thomas A. Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages 2432–2443. IEEE Computer Society,
2017
-
[15]
URL https://doi.org/10.1109/CVPR.2017.261
doi: 10.1109/CVPR.2017.261. URL https://doi.org/10.1109/CVPR.2017.261. Ronghao Dang, Jiayan Guo, Bohan Hou, Sicong Leng, Kehan Li, Xin Li, Jiangpin Liu, Yunxuan Mao, Zhikai Wang, Yuqian Yuan, Minghao Zhu, Xiao Lin, Yang Bai, Qian Jiang, Yaxi Zhao, Minghua Zeng, Junlong Gao, Yuming Jiang, Jun Cen, Siteng Huang, Liuyi Wang, Wenqiao Zhang, Chengju Liu, Jianf...
-
[16]
Spacetime Autoencoders Using Local Causal States
doi: 10.48550/ARXIV. 2602.14979. URL https://doi.org/10.48550/arXiv.2602.14979. Afshin Dehghan, Gilad Baruch, Zhuoyuan Chen, Yuri Feigin, Peter Fu, Thomas Gebauer, Daniel Kurz, Tal Dimry, Brandon Joffe, Arik Schwartz, and Elad Shulman. Arkitscenes: A diverse real-world dataset for 3d indoor scene understanding using mobile rgb-d data
work page internal anchor Pith review doi:10.48550/arxiv
-
[17]
Abhay Deshpande, Yuquan Deng, Arijit Ray, Jordi Salvador, Winson Han, Jiafei Duan, Kuo-Hao Zeng, Yuke Zhu, Ranjay Krishna, and Rose Hendrix. Graspmolmo: Generalizable task-oriented grasping via large-scale synthetic data generation.arXiv preprint arXiv:2505.13441,
-
[18]
Mm-ifengine: Towards multimodal instruction following.arXiv preprint arXiv:2504.07957,
Shengyuan Ding, Shenxi Wu, Xiangyu Zhao, Yuhang Zang, Haodong Duan, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Dahua Lin, and Jiaqi Wang. Mm-ifengine: Towards multimodal instruction following.arXiv preprint arXiv:2504.07957,
-
[19]
Mengfei Du, Binhao Wu, Zejun Li, Xuanjing Huang, and Zhongyu Wei. Embspatial-bench: Benchmarking spatial understanding for embodied tasks with large vision-language models.arXiv preprint arXiv:2410.16147,
-
[20]
Zhiwen Fan, Jian Zhang, Renjie Li, Junge Zhang, Runjin Chen, Hezhen Hu, Kevin Wang, Huaizhi Qu, Dilin Wang, Zhicheng Yan, Hongyu Xu, Justin Theiss, Tianlong Chen, Jiachen Li, Zhengzhong Tu, Zhangyang Wang, and Rakesh Ranjan. VLM-3R: vision-language models augmented with instruction-aligned 3d reconstruction.CoRR, abs/2505.20279,
-
[21]
VLM-3R: Vision-Language Models Augmented with Instruction-Aligned 3D Reconstruction
doi: 10.48550/ARXIV.2505.20279. URL https://doi.org/10.48550/arXiv.2505.20279. Haoquan Fang, Jiafei Duan, Donovan Clay, Sam Wang, Shuo Liu, Weikai Huang, Xiang Fan, Wei-Chuan Tsai, Shirui Chen, Yi Ru Wang, Shanli Xing, Jaemin Cho, Jae Sung Park, Ainaz Eftekhar, Peter Sushko, Karen Farley, Angad Wadhwa, Cole Harrison, Winson Han, Ying-Chun Lee, Eli VanderB...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2505.20279
-
[22]
Libero-plus: In-depthrobustnessanalysisofvision-language-actionmodels.arXivpreprintarXiv:2510.13626,
Senyu Fei, SiyinWang, Junhao Shi, Zihao Dai, Jikun Cai, Pengfang Qian, LiJi, Xinzhe He, ShiduoZhang, ZhaoyeFei, etal. Libero-plus: In-depthrobustnessanalysisofvision-language-actionmodels.arXivpreprintarXiv:2510.13626,
-
[23]
Onethinker: All-in-one reasoning model for image and video.CoRR, abs/2512.03043,
Kaituo Feng, Manyuan Zhang, Hongyu Li, Kaixuan Fan, Shuang Chen, Yilei Jiang, Dian Zheng, Peiwen Sun, Yiyuan Zhang, Haoze Sun, Yan Feng, Peng Pei, Xunliang Cai, and Xiangyu Yue. Onethinker: All-in-one reasoning model for image and video.CoRR, abs/2512.03043,
-
[24]
OneThinker: All-in-one Reasoning Model for Image and Video
doi: 10.48550/ARXIV.2512.03043. URL https://doi.org/10.48550/arXiv.2512.03043. 26 Xingyu Fu, Yushi Hu, Bangzheng Li, Yu Feng, Haoyu Wang, Xudong Lin, Dan Roth, Noah A. Smith, Wei-Chiu Ma, and Ranjay Krishna. Blink: Multimodal large language models can see but not perceive.arXiv preprint arXiv:2404.12390,
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2512.03043
-
[25]
Ziyang Gong, Zehang Luo, Anke Tang, Zhe Liu, Shi Fu, Zhi Hou, Ganlin Yang, Weiyun Wang, Xiaofeng Wang, Jianbo Liu, et al. Ace-brain-0: Spatial intelligence as a shared scaffold for universal embodiments.arXiv preprint arXiv:2603.03198,
-
[26]
Agrim Gupta, Piotr Dollar, and Ross Girshick
URL https://arxiv.org/abs/2308.01477. Agrim Gupta, Piotr Dollar, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5356–5364,
-
[27]
Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Yanbiao Ma, Yunfeng Diao, Ziyu Jia, Wenbo Ding, Hangjun Ye, and Long Chen. Roboafford++: A generative ai-enhanced dataset for multimodal affordance learning in robotic manipulation and navigation.arXiv preprint arXiv:2511.12436, 2025a. Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zha...
-
[28]
Physical Intelligence, Kevin Black, Noah Brown, James Darpinian, Karan Dhabalia, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Manuel Y. Galliker, Dibya Ghosh, Lachy Groom, Karol Hausman, Brian Ichter, Szymon Jakubczak, Tim Jones, Liyiming Ke, Devin LeBlanc, Sergey Levine, Adrian Li-Bell, Mohith Mothukuri, Suraj Nair, Karl Pertsch...
-
[29]
$\pi_{0.5}$: a Vision-Language-Action Model with Open-World Generalization
doi: 10.48550/ARXIV.2504.16054. URL https://doi.org/10.48550/arXiv.2504.16054. Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, Xinda Xue, Qinghang Su, Huaihai Lyu, Xiaolong Zheng, Jiaming Liu, Zhongyuan Wang, and Shanghang Zhang. Robobrain: A unified brain model for robotic manipul...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2504.16054 2025
-
[30]
doi: 10.1109/CVPR52734.2025. 00168. URL https://openaccess.thecvf.com/content/CVPR2025/html/Ji_RoboBrain_A_Unified_Brain_Model_ for_Robotic_Manipulation_from_Abstract_CVPR_2025_paper.html. Nikita Karaev, Iurii Makarov, Jianyuan Wang, Natalia Neverova, Andrea Vedaldi, and Christian Rupprecht. Co- tracker3: Simpler and better point tracking by pseudo-labell...
-
[31]
Referitgame: Referring to objects in photographs of natural scenes
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. Referitgame: Referring to objects in photographs of natural scenes. InProceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798,
2014
-
[32]
Alexander Khazatsky, Karl Pertsch, S. Nair, Ashwin Balakrishna, S. Dasari, Siddharth Karamcheti, Soroush Nasiriany, M. K. Srirama, L. Chen, Kirsty Ellis, P. Fagan, Joey Hejna, Masha Itkina, Marion Lepert, Ye Ma, Patrick Miller, Jimmy Wu, Suneel Belkhale, S. Dass, Huy Ha, Arhan Jain, Abraham Lee, Youngwoon Lee, Marius Memmel, Sungjae Park, Ilija Radosavovi...
-
[33]
Openvla: An open-source vision-language-action model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. Openvla: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246,
-
[34]
Fine-tuning vision-language-action models: Optimizing speed and success
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Quan Vuong, et al. Fine-tuning vision-language-action models: Optimizing speed and success. arXiv preprint arXiv:2502.19645,
-
[35]
Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024a
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024a. Hao Li, Ziqin Wang, Zi han Ding, Shuai Yang, Yilun Chen, Yang Tian, Xiaolin Hu, Tai Wang, Dahua Lin, Feng Zhao, Si Liu, and Jiangmiao Pang. Robointer...
-
[36]
Manipllm: Embodied multimodal large language model for object-centric robotic manipulation
Xiaoqi Li, Mingxu Zhang, Yiran Geng, Haoran Geng, Yuxing Long, Yan Shen, Renrui Zhang, Jiaming Liu, and Hao Dong. Manipllm: Embodied multimodal large language model for object-centric robotic manipulation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18061–18070, 2024b. Xuanlin Li, Kyle Hsu, Jiayuan Gu, Karl ...
-
[37]
Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. InInternational Conference on Learning Representations, ICLR 2023,
2023
-
[38]
Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone
arXiv:2210.02747. Bo Liu, Yifeng Zhu, Chongkai Gao, Yihao Feng, Qiang Liu, Yuke Zhu, and Peter Stone. Libero: Benchmarking knowledge transfer for lifelong robot learning.Advances in Neural Information Processing Systems, 36:44776– 44791, 2023a. Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, ...
-
[39]
Zeyi Liu, Arpit Bahety, and Shuran Song. Reflect: Summarizing robot experiences for failure explanation and correction.arXiv preprint arXiv:2306.15724, 2023b. Yuhao Lu, Yixuan Fan, Beixing Deng, Fangfu Liu, Yali Li, and Shengjin Wang. Vl-grasp: a 6-dof interactive grasp policy for language-oriented objects in cluttered indoor scenes.arXiv preprint arXiv:2...
-
[40]
Gen Luo, Ganlin Yang, Ziyang Gong, Guanzhou Chen, Haonan Duan, Erfei Cui, Ronglei Tong, Zhi Hou, Tianyi Zhang, Zhe Chen, et al. Visual embodied brain: Let multimodal large language models see, think, and control in spaces.arXiv preprint arXiv:2506.00123,
-
[41]
A survey on vision–language–action models for embodied ai.arXiv preprint arXiv:2505.01244,
28 Yueen Ma, Zixing Song, Yuzheng Zhuang, Jianye Hao, and Irwin King. A survey on vision–language–action models for embodied ai.arXiv preprint arXiv:2505.01244,
-
[42]
Soroush Nasiriany, Abhiram Maddukuri, Lance Zhang, Adeet Parikh, Aaron Lo, Abhishek Joshi, Ajay Mandlekar, and Yuke Zhu. Robocasa: Large-scale simulation of everyday tasks for generalist robots.arXiv preprint arXiv:2406.02523,
-
[43]
Fei Ni, Min Zhang, Pengyi Li, Yifu Yuan, Lingfeng Zhang, Yuecheng Liu, Peilong Han, Longxin Kou, Shaojin Ma, Jinbin Qiao, David Gamaliel Arcos Bravo, Yuening Wang, Xiao Hu, Zhanguang Zhang, X. Yao, Yutong Li, Zhao Zhang, Ying Wen, Ying-Cong Chen, Xiaodan Liang, Liang Lin, Bin He, Haitham Bou-Ammar, He Wang, Huazhe Xu, Jiankang Deng, Shan Luo, Shu Jiang, W...
-
[44]
Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805,
Kun Ouyang, Yuanxin Liu, Haoning Wu, Yi Liu, Hao Zhou, Jie Zhou, Fandong Meng, and Xu Sun. Spacer: Reinforcing mllms in video spatial reasoning.arXiv preprint arXiv:2504.01805,
-
[45]
Paul Pacaud, Ricardo Garcia, Shizhe Chen, and Cordelia Schmid. Guardian: Detecting robotic planning and execution errors with vision-language models.CoRR, abs/2512.01946,
-
[46]
URL https://doi.org/10.48550/arXiv.2512.01946
doi: 10.48550/ARXIV.2512.01946. URL https://doi.org/10.48550/arXiv.2512.01946. Baiyu Pan, Daqin Luo, Junpeng Yang, Jiyuan Wang, Yixuan Zhang, Hailin Shi, and Jichao Jiao. Thinker: A vision-language foundation model for embodied intelligence.CoRR, abs/2601.21199,
-
[47]
URL https://doi.org/10.48550/arXiv.2601.21199
doi: 10.48550/ ARXIV.2601.21199. URL https://doi.org/10.48550/arXiv.2601.21199. William Peebles and Saining Xie. Scalable diffusion models with transformers. InProceedings of the IEEE/CVF International Conference on Computer Vision, ICCV 2023, pages 4195–4205,
-
[48]
Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Jiangmiao Pang, and Yu Qiao
arXiv:2212.09748. Baoqi Pei, Yifei Huang, Jilan Xu, Yuping He, Guo Chen, Fei Wu, Jiangmiao Pang, and Yu Qiao. Egothinker: Unveiling egocentric reasoning with spatio-temporal cot.Advances in Neural Information Processing Systems, 38: 44140–44168,
-
[49]
Karl Pertsch, Kyle Luo, Gaurav Patel, Zhenjia Cui, Robin Strudel, Jie Lim, Brian Ichter, Karol Hausman, Chelsea Finn, Sergey Levine, et al. Fast: Efficient action tokenization for vision-language-action models.arXiv preprint arXiv:2501.09747,
-
[50]
Lu Qiu, Yi Chen, Yuying Ge, Yixiao Ge, Ying Shan, and Xihui Liu. Egoplan-bench2: A benchmark for multimodal large language model planning in real-world scenarios.arXiv preprint arXiv:2412.04447,
-
[51]
Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, et al. Eo-1: Interleaved vision-text-action pretraining for general robot control.arXiv preprint arXiv:2508.21112, 2025a. 29 Delin Qu, Haoming Song, Qizhi Chen, Yuanqi Yao, Xinyi Ye, Yan Ding, Zhigang Wang, JiaYuan Gu, Bin Zhao, Dong Wang...
-
[52]
PACO: Parts and attributes of common objects
VigneshRamanathan, AnmolKalia, VladanPetrovic, YiWen, BaixueZheng, BaishanGuo, RuiWang, AaronMarquez, Rama Kovvuri, Abhishek Kadian, Amir Mousavi, Yiwen Song, Abhimanyu Dubey, and Dhruv Mahajan. PACO: Parts and attributes of common objects. InarXiv preprint arXiv:2301.01795,
-
[53]
Sam 2: Segment anything in images and videos
Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolland, Laura Gustafson, et al. Sam 2: Segment anything in images and videos. InInternational Conference on Learning Representations, volume 2025, pages 28085–28128,
2025
-
[54]
Sat: Spatial aptitude training for multimodal language models
Arijit Ray, Jiafei Duan, Reuben Tan, Dina Bashkirova, Rose Hendrix, Kiana Ehsani, Aniruddha Kembhavi, Bryan A Plummer, Ranjay Krishna, Kuo-Hao Zeng, et al. Sat: Spatial aptitude training for multimodal language models. arXiv preprint arXiv:2412.07755, 3,
-
[55]
Grounded SAM: assembling open-world models for diverse visual tasks.CoRR, abs/2401.14159,
Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded SAM: assembling open-world models for diverse visual tasks.CoRR, abs/2401.14159,
-
[56]
Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks
doi: 10.48550/ ARXIV.2401.14159. URL https://doi.org/10.48550/arXiv.2401.14159. Pierre Sermanet, Tianli Ding, Jeffrey Zhao, Fei Xia, Debidatta Dwibedi, Keerthana Gopalakrishnan, Christine Chan, Gabriel Dulac-Arnold, Sharath Maddineni, Nikhil J. Joshi, Pete Florence, Wei Han, Robert Baruch, Yao Lu, Suvir Mirchandani, Peng Xu, Pannag Sanketi, Karol Hausman,...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2401.14159 2024
-
[57]
doi: 10.1109/ICRA57147.2024. 10610216. URL https://doi.org/10.1109/ICRA57147.2024.10610216. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y.K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300,
-
[58]
Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation
Hao Shi, Bin Xie, Yingfei Liu, Lin Sun, Fengrong Liu, Tiancai Wang, Erjin Zhou, Haoqiang Fan, Xiangyu Zhang, and Gao Huang. Memoryvla: Perceptual-cognitive memory in vision-language-action models for robotic manipulation. arXiv preprint arXiv:2508.19236,
-
[59]
Chan Hee Song, Valts Blukis, Jonathan Tremblay, Stephen Tyree, Yu Su, and Stan Birchfield. Robospatial: Teaching spatial understanding to 2d and 3d vision-language models for robotics.arXiv preprint arXiv:2411.16537,
-
[60]
Robobrain 2.5: Depth in sight, time in mind.arXiv preprint arXiv:2601.14352,
Huajie Tan, Enshen Zhou, Zhiyu Li, Yijie Xu, Yuheng Ji, Xiansheng Chen, Cheng Chi, Pengwei Wang, Huizhu Jia, Yulong Ao, et al. Robobrain 2.5: Depth in sight, time in mind.arXiv preprint arXiv:2601.14352,
-
[61]
Stone Tao, Fanbo Xiang, Arth Shukla, Yuzhe Qin, Xander Hinrichsen, Xiaodi Yuan, Chen Bao, Xinsong Lin, Yulin Liu, Tse kai Chan, Yuan Gao, Xuanlin Li, Tongzhou Mu, Nan Xiao, Arnav Gurha, Viswesh Nagaswamy Rajesh, Yong Woo Choi, Yen-Ru Chen, Zhiao Huang, Roberto Calandra, Rui Chen, Shan Luo, and Hao Su. Maniskill3: Gpu parallelized robotics simulation and r...
-
[62]
Robobrain 2.0 technical report.arXiv preprint arXiv:2507.02029, 2025a
BAAI RoboBrain Team, Mingyu Cao, Huajie Tan, Yuheng Ji, Xiansheng Chen, Minglan Lin, Zhiyu Li, Zhou Cao, Pengwei Wang, Enshen Zhou, Yi Han, Yingbo Tang, Xiangqi Xu, Wei Guo, Yaoxu Lyu, Yijie Xu, Jiayu Shi, Mengfei Du, Cheng Chi, Mengdi Zhao, Xiaoshuai Hao, Junkai Zhao, Xiaojie Zhang, Shanyu Rong, Huaihai Lyu, Zhengliang Cai, Yankai Fu, Ning Chen, Bolun Zh...
-
[63]
Shengbang Tong, Ellis Brown, Penghao Wu, Sanghyun Woo, Manoj Middepogu, Sai Charitha Akula, Jihan Yang, Shusheng Yang, Adithya Iyer, Xichen Pan, Ziteng Wang, Rob Fergus, Yann LeCun, and Saining Xie. Cambrian-1: A fully open, vision-centric exploration of multimodal llms. arXiv preprint arXiv:2406.16860,
-
[64]
Mem: Multi-scale embodied memory for vision language action models
Marcel Torne, Karl Pertsch, Homer Walke, Kyle Vedder, Suraj Nair, Brian Ichter, Allen Z Ren, Haohuan Wang, Jiaming Tang, Kyle Stachowicz, et al. Mem: Multi-scale embodied memory for vision language action models. arXiv preprint arXiv:2603.03596,
-
[65]
Ruicheng Wang, Sicheng Xu, Yue Dong, Yu Deng, Jianfeng Xiang, Zelong Lv, Guangzhong Sun, Xin Tong, and Jiaolong Yang. Moge-2: Accurate monocular geometry with metric scale and sharp details. arXiv preprint arXiv:2507.02546, 2025a. Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, e...
-
[66]
Haotian Xue, Yunhao Ge, Yu Zeng, Zhaoshuo Li, Ming-Yu Liu, Yongxin Chen, and Jiaojiao Fan. Point-it-out: Benchmarking embodied reasoning for vision language models in multi-stage visual grounding. arXiv preprint arXiv:2509.25794,
-
[67]
Magma: A foundation model for multimodal ai agents
Jianwei Yang, Reuben Tan, Qianhui Wu, Ruijie Zheng, Baolin Peng, Yongyuan Liang, Yu Gu, Mu Cai, Seonghyeon Ye, Joel Jang, et al. Magma: A foundation model for multimodal ai agents. In Proceedings of the computer vision and pattern recognition conference, pages 14203–14214, 2025a. Jihan Yang, Shusheng Yang, Anjali W. Gupta, Rilyn Han, Li Fei-Fei, and Sainin...
-
[68]
Cambrian-S: Towards Spatial Supersensing in Video
Shusheng Yang, Jihan Yang, Pinzhi Huang, Ellis Brown, Zihao Yang, Yue Yu, Shengbang Tong, Zihan Zheng, Yifan Xu, Muhan Wang, Daohan Lu, Rob Fergus, Yann LeCun, Li Fei-Fei, and Saining Xie. Cambrian-s: Towards spatial supersensing in video. CoRR, abs/2511.04670, 2025b. doi: 10.48550/ARXIV.2511.04670. URL https://doi.org/10.48550/arXiv.2511.04670. Zewei Ye, ...
-
[69]
Scannet++: A high-fidelity dataset of 3d indoor scenes
Chandan Yeshwanth, Yueh-Cheng Liu, Matthias Nießner, and Angela Dai. Scannet++: A high-fidelity dataset of 3d indoor scenes. In IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, pages 12–22. IEEE,
2023
-
[70]
Barron, Ben Mildenhall, Dor Verbin, Pratul P
doi: 10.1109/ICCV51070.2023.00008. URL https://doi.org/10.1109/ICCV51070.2023.00008. Wentao Yuan, Jiafei Duan, Valts Blukis, Wilbert Pumacay, Ranjay Krishna, Adithyavairavan Murali, Arsalan Mousavian, and Dieter Fox. Robopoint: A vision-language model for spatial affordance prediction for robotics. CoRR, abs/2406.10721,
-
[71]
doi: 10.48550/ARXIV.2406.10721. URL https://doi.org/10.48550/arXiv.2406.10721. Yifu Yuan, Haiqin Cui, Yibin Chen, Zibin Dong, Fei Ni, Longxin Kou, Jinyi Liu, Pengyi Li, Yan Zheng, and Jianye Hao. From seeing to doing: Bridging reasoning and decision for robotic manipulation. arXiv preprint arXiv:2505.08548, 2025a. Yifu Yuan, Haiqin Cui, Yaoting Huang, Yib...
-
[72]
Vlm4vla: Revisiting vision-language-models in vision-language-action models
Jianke Zhang, Xiaoyu Chen, Qiuyue Wang, Mingsheng Li, Yanjiang Guo, Yucheng Hu, Jiajun Zhang, Shuai Bai, Junyang Lin, and Jianyu Chen. Vlm4vla: Revisiting vision-language-models in vision-language-action models. arXiv preprint arXiv:2601.03309, 2026a. Kaichen Zhang, Bo Li, Peiyuan Gao, Fanyi Zhang, Kairui Li, Jingkang Yan, and Ziwei Liu. Lmms-eval: Realit...
-
[73]
Shuoheng Zhang, Yifu Yuan, Hongyao Tang, Yan Zheng, Qiaojun Yu, Pengyi Li, Guowei Huang, Helong Huang, Xingyue Quan, and Jianye Hao. Forceflow: Learning to feel and act via contact-driven flow matching. arXiv preprint arXiv:2605.11048, 2026b. Yi Zhang, Che Liu, Xiancong Ren, Hanchu Ni, Shuai Zhang, Zeyuan Ding, Jiayu Hu, Hanzhe Shan, Zhenwei Niu, Zhaoyang ...
-
[74]
arXiv:2304.13705. Chengliang Zhong, Yuhang Zheng, Yupeng Zheng, Hao Zhao, Li Yi, Xiaodong Mu, Ling Wang, Pengfei Li, Guyue Zhou, and Chao Yang. 3d implicit transporter for temporally consistent keypoint discovery. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3869–3880,
-
[75]
Enshen Zhou, Jingkun An, Cheng Chi, Yi Han, Shanyu Rong, Chi Zhang, Pengwei Wang, Zhongyuan Wang, Tiejun Huang, Lu Sheng, and Shanghang Zhang. Roborefer: Towards spatial referring with reasoning in vision-language models for robotics. arXiv preprint arXiv:2506.04308,
-
[76]
A.1 Embodied Cognition & Spatial Reasoning Data
Multi-view spatiotemporal reasoning. Spatial reasoning and scene cognition are foundational for embodied VLMs to perceive the physical environment. Since inputs in embodied scenarios are often temporal and multi-view, we integrate VLM-3R (Fan et al., 2025), Cambrian-S (Yang et al., 2025b), and SAT (Ray et al.,
2025
-
[77]
datasets to strengthen the model’s spatiotemporal reasoning under diverse viewpoints. For video inputs, we additionally incorporate multiple video spatial reasoning datasets (Ouyang et al., 2025; Feng et al., 2025). These datasets collectively cover object counting, relative distance, relative direction, spatial topological relations (above/inside/below/b...
2025
-
[78]
datasets via a fully automated 3D scene annotation pipeline. The pipeline takes a single RGB image as input and produces a structured 3D semantic scene graph, from which spatial reasoning QA pairs are programmatically generated covering spatial relations, distance metrics, scene cognition, and appearance order. Full pipeline implementation details are pro...
2024
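As a rough illustration of the programmatic QA step described above, the sketch below turns a toy 3D scene graph into distance and left/right relation questions. The `SceneObject` class, the camera-frame convention (+x to the right, units in metres), and the question templates are all assumptions made for this example. It is not the paper's pipeline, whose implementation details are not given in this excerpt.

```python
from dataclasses import dataclass
from itertools import combinations
import math


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical scene-graph node: a label plus a 3D centroid in metres."""
    name: str
    x: float  # assumed convention: +x points right in the camera frame
    y: float
    z: float


def distance(a: SceneObject, b: SceneObject) -> float:
    """Euclidean distance between two object centroids."""
    return math.dist((a.x, a.y, a.z), (b.x, b.y, b.z))


def generate_qa(objects: list[SceneObject]) -> list[dict[str, str]]:
    """Emit one distance QA pair and one left/right QA pair per object pair."""
    qa = []
    for a, b in combinations(objects, 2):
        qa.append({
            "question": f"How far apart are the {a.name} and the {b.name}?",
            "answer": f"{distance(a, b):.2f} m",
        })
        side = "left" if a.x < b.x else "right"
        qa.append({
            "question": f"Is the {a.name} to the left or right of the {b.name}?",
            "answer": side,
        })
    return qa


# Toy scene with two objects; the coordinates are made up for illustration.
scene = [
    SceneObject("mug", 0.0, 0.0, 1.0),
    SceneObject("laptop", 0.3, 0.0, 1.4),
]
qa_pairs = generate_qa(scene)
```

Because the answers are computed from the graph rather than written by hand, every generated pair stays consistent with the underlying geometry, which is the main appeal of template-based generation.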
-
[79]
and EgoRe (Pei et al., 2026), which are extracted from first-person videos and require the model to predict subsequent action sequences based on observed manipulation progress.
A.3 Embodied Correction Data
Error correction is critical for closed-loop autonomous execution. Existing robotic datasets predominantly contain successful demonstrations, while fai...
2026
-
[80]
dataset, which covers fault understanding and correction across different robots. To address comprehensive capability requirements, we draw upon the failure taxonomy established in prior work (Ye et al., 2025; Pacaud et al., 2025; Liu et al., 2023b) and construct the ER1.5-Correction dataset, a large-scale failure correction QA dataset covering the comple...
2025