arxiv: 2605.25813 · v1 · pith:ZJZXUVARnew · submitted 2026-05-25 · 💻 cs.RO

Extending Embodied Question Answering from Perception to Decision

Pith reviewed 2026-06-29 21:28 UTC · model grok-4.3

classification 💻 cs.RO
keywords embodied question answering · vision-language models · spatial reasoning · decision making · embodied intelligence · benchmark dataset · robotics

The pith

EQA-Decision supplies a unified benchmark of more than four million question-answer pairs spanning scene construction, spatial understanding, task dynamics, and instant decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Prior embodied question answering work has used separate datasets that each test only a narrow slice of skills, such as spatial layout or step-by-step procedures. The paper introduces EQA-Decision, a single large-scale resource containing more than four million annotated pairs that deliberately combine four complementary dimensions of embodied reasoning. It also supplies RoboDecision, a baseline model that performs perception, reasoning, and action-level decisions inside the same framework. The authors report that the combined benchmark and model let vision-language models be tested and trained on the full pipeline from sensing an environment to choosing actions. The approach is intended to replace fragmented evaluations with one coherent test for embodied intelligence.

Core claim

The paper claims that EQA-Decision systematically covers four complementary dimensions of embodied reasoning—static scene construction, spatial understanding, task dynamics reasoning, and instant decision—and that the accompanying RoboDecision model supplies a unified framework which jointly evaluates perception, reasoning, and action-level decision-making in embodied environments.

What carries the argument

EQA-Decision dataset of over four million hierarchical question-answer pairs organized across the four listed dimensions of embodied reasoning.

If this is right

  • The dataset supplies a single large-scale test that replaces several narrower existing benchmarks.
  • RoboDecision demonstrates joint training and evaluation of perception, reasoning, and action inside one model.
  • Results on the benchmark directly measure progress in spatial and interaction reasoning for vision-language models.
  • The hierarchical annotations allow diagnosis of failures at different levels of embodied reasoning.
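
The diagnostic use described in the last bullet can be sketched as a per-dimension accuracy breakdown. This is a minimal illustration, not the paper's evaluation code: the `QARecord` fields, the exact-match scoring rule, and the function name are all hypothetical, and only the four dimension names come from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass

# The four dimensions named in the paper's abstract.
DIMENSIONS = (
    "static scene construction",
    "spatial understanding",
    "task dynamics reasoning",
    "instant decision",
)


@dataclass(frozen=True)
class QARecord:
    """Hypothetical record layout; the real dataset schema is not given here."""
    dimension: str
    question: str
    gold_answer: str
    predicted_answer: str


def accuracy_by_dimension(records):
    """Return {dimension: accuracy} using case-insensitive exact match.

    Exact match is an assumed metric chosen for simplicity.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        if rec.dimension not in DIMENSIONS:
            raise ValueError(f"unknown dimension: {rec.dimension!r}")
        total[rec.dimension] += 1
        if rec.predicted_answer.strip().lower() == rec.gold_answer.strip().lower():
            correct[rec.dimension] += 1
    return {dim: correct[dim] / total[dim] for dim in total}


if __name__ == "__main__":
    demo = [
        QARecord("spatial understanding", "Where is the mug?", "left", "Left "),
        QARecord("spatial understanding", "Where is the bowl?", "right", "left"),
        QARecord("instant decision", "What next?", "grasp the cup", "grasp the cup"),
    ]
    for dim, acc in accuracy_by_dimension(demo).items():
        print(f"{dim}: {acc:.2f}")
```

A breakdown like this would show whether a model that scores well overall is failing on one dimension in particular, which a single aggregate score would hide.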

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models improved on this benchmark may transfer more readily to real robot platforms that must choose actions from visual input.
  • The four-million-pair scale could support pre-training of larger embodied agents before fine-tuning on specific hardware.
  • If the dimensions prove non-redundant in practice, future work could add further axes such as multi-agent coordination without restarting the benchmark design.

Load-bearing premise

The four dimensions together form a comprehensive and non-redundant coverage of what embodied reasoning requires.

What would settle it

A vision-language model that scores highly on all four dimensions of EQA-Decision yet shows no measurable gain in success rate on physical robot tasks that require the same skills would falsify the claim that the benchmark captures the needed capabilities.

Figures

Figures reproduced from arXiv: 2605.25813 by Peiran Xu, Qiwei Li, Xicheng Gong, Yadong Mu.

Figure 1: Overview of EQA-Decision. Prior embodied QA datasets and benchmarks mainly target perception-oriented tasks, where models focus on describing what is visible. EQA-Decision extends this scope with decision-centric tasks that require understanding spatial and temporal context, tracking state changes, and performing reasoning within dynamic task processes. to assess embodied reasoning across six complementary…
Figure 2: Overall framework of EQA-Decision and RoboDecision. We combine Gemini-assisted annotation with human verification to process data from multiple sources, producing the EQA-Decision dataset with structured embodied reasoning tasks. Building on this dataset, the model is trained in three successive stages: SFT, CoT-SFT and GRPO, with a hybrid reward applied throughout training to improve reasoning, answer acc…
Figure 3: QA categories distribution and source data distribution.
Figure 4: Qualitative comparison of instant decisions.
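
The Figure 2 caption names GRPO with a hybrid reward as part of training. As a rough illustration only, the sketch below shows the group-relative advantage normalization that GRPO is generally known for: each sampled response's reward is standardized against the other responses to the same prompt. The `hybrid_reward` function, its two components, and its weights are hypothetical placeholders, since the caption does not say how the paper's reward is composed.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def hybrid_reward(answer_correct: bool, format_ok: bool,
                  w_answer: float = 0.8, w_format: float = 0.2) -> float:
    """Hypothetical hybrid reward: a weighted sum of two binary signals.

    The components and weights are assumptions for illustration only.
    """
    return w_answer * float(answer_correct) + w_format * float(format_ok)


def group_relative_advantages(rewards: Sequence[float],
                              eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one group of responses to the same prompt.

    Uses the population standard deviation; eps avoids division by zero
    when every response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled responses to one prompt, scored by the assumed reward.
    group = [
        hybrid_reward(True, True),
        hybrid_reward(True, False),
        hybrid_reward(False, True),
        hybrid_reward(False, False),
    ]
    for r, a in zip(group, group_relative_advantages(group)):
        print(f"reward={r:.2f} advantage={a:+.3f}")
```

Responses scoring above the group mean receive positive advantages and those below receive negative ones, so no separate learned value function is needed to provide a baseline.
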
Original abstract

Embodied Question Answering (EQA) connects perception, reasoning, and interaction within embodied environments. However, existing datasets and benchmarks remain fragmented, each focusing on a limited subset of reasoning skills such as spatial understanding or procedural reasoning, without offering a unified large-scale framework for comprehensive evaluation. We present EQA-Decision, a large-scale embodied QA dataset that systematically covers four complementary dimensions of embodied reasoning: static scene construction, spatial understanding, task dynamics reasoning, and instant decision. The dataset contains over four million question-answer pairs with hierarchical annotations across diverse embodied scenarios. In addition, we develop RoboDecision, a strong baseline model aligned with the EQA-Decision Benchmark, providing a unified framework that jointly evaluates perception, reasoning, and action-level decision-making in embodied environments. Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities in spatial and interaction reasoning, providing a solid foundation for advancing embodied intelligence research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces EQA-Decision, a large-scale embodied QA dataset that covers four complementary dimensions of embodied reasoning (static scene construction, spatial understanding, task dynamics reasoning, and instant decision) and contains over four million question-answer pairs with hierarchical annotations across diverse scenarios. It also presents RoboDecision as a baseline model that jointly evaluates perception, reasoning, and action-level decision-making, claiming that the benchmark effectively enhances VLM capabilities in spatial and interaction reasoning.

Significance. A unified large-scale dataset and baseline for embodied reasoning could provide a valuable foundation for embodied intelligence research if the dataset construction, annotations, and baseline performance are rigorously validated with quantitative results.

major comments (2)
  1. [Abstract] The central claim that 'Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities' is unsupported because the provided manuscript contains no experimental results, metrics, ablation studies, error analysis, or comparisons to prior EQA datasets.
  2. [Abstract] No details are given on how the four dimensions were operationalized, how the >4M QA pairs were generated or validated, or what the hierarchical annotations consist of, all of which are load-bearing for the claim of a 'unified large-scale framework'.
minor comments (1)
  1. The abstract refers to 'diverse embodied scenarios' without naming the simulators, environments, or object categories used.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their comments. We address each major point below and will revise the manuscript to ensure claims are supported by content and that key construction details are clearly presented.

Point-by-point responses
  1. Referee: [Abstract] The central claim that 'Results demonstrate that EQA-Decision effectively benchmarks and enhances VLM capabilities' is unsupported because the provided manuscript contains no experimental results, metrics, ablation studies, error analysis, or comparisons to prior EQA datasets.

    Authors: We agree that the abstract claim is unsupported in the current manuscript. The paper introduces the EQA-Decision dataset and RoboDecision baseline but does not contain experimental results, metrics, or comparisons. We will revise the abstract to remove or qualify the sentence beginning 'Results demonstrate...' to accurately reflect the manuscript's scope as a dataset and baseline introduction. revision: yes

  2. Referee: [Abstract] No details are given on how the four dimensions were operationalized, how the >4M QA pairs were generated or validated, or what the hierarchical annotations consist of, all of which are load-bearing for the claim of a 'unified large-scale framework'.

    Authors: The manuscript body (Section 3) describes the four dimensions, generation via simulation environments, validation steps, and hierarchical annotations. However, to address the concern that these are not evident from the abstract or sufficiently highlighted, we will add a brief overview paragraph in the introduction summarizing the operationalization, generation, validation, and annotation structure. revision: yes

Circularity Check

0 steps flagged

No significant circularity in dataset and baseline introduction

The paper presents EQA-Decision as a new large-scale dataset spanning four dimensions of embodied reasoning and introduces RoboDecision as a baseline model. No mathematical derivations, equations, fitted parameters, or predictions appear. The work is framed as benchmark construction rather than a derived result from prior inputs. No self-citations are invoked as load-bearing premises, and the four dimensions are stated as complementary by design without reducing to self-definition or renaming of known results. The contribution is self-contained as an empirical resource.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract supplies no equations, fitted constants, or background assumptions that can be extracted, so the ledger is empty.

pith-pipeline@v0.9.1-grok · 5690 in / 958 out tokens · 24289 ms · 2026-06-29T21:28:11.750924+00:00 · methodology
