pith. machine review for the scientific record.

arxiv: 2604.19839 · v1 · submitted 2026-04-21 · 💻 cs.CV · cs.AI

Recognition: unknown

Environmental Understanding Vision-Language Model for Embodied Agent

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 03:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords vision-language models · embodied agents · environmental understanding · ALFRED tasks · fine-tuning · task planning · recovery mechanism · policy optimization

The pith

Fine-tuning vision-language models on four environmental skills plus recovery steps raises embodied agent success rates by 8.86 percent on ALFRED tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to fix how vision-language models fall short when embodied agents must interact with real environments to follow instructions. These models often stumble on precise actions or lean on extra environment data that would not be available in practice. The authors fine-tune four targeted skills—object perception, task planning, action understanding, and goal recognition—then add a recovery step to sample fixes for failures and a group relative policy optimization stage to clean up inconsistent outputs. The result is higher success in completing household tasks without ongoing external help. A sympathetic reader would care because this points to agents that could operate more independently in varied settings.
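A rough sketch helps keep these pieces straight. The loop below is an editorial illustration of how the four skills and the recovery step could be wired together at inference time; the vlm and env interfaces and every method name on them are hypothetical stand-ins for the skills the paper describes, not its actual code.

    # Sketch of an EUEA-style decision loop (Python); every interface here is illustrative.
    def run_episode(vlm, env, instruction, max_steps=30, recovery_samples=4):
        obs = env.reset()  # raw egocentric frame only; no simulator metadata
        for _ in range(max_steps):
            objects = vlm.perceive_objects(obs, instruction)       # skill 1: object perception
            subgoal = vlm.plan_subgoal(obs, instruction, objects)  # skill 2: task planning
            action = subgoal.to_action()
            obs, ok = env.step(action)
            # skill 3 (action understanding): judge whether the action likely succeeded
            if not ok or not vlm.judge_success(obs, action):
                # recovery step: sample alternative actions and retry
                for alt in vlm.sample_alternatives(obs, instruction, action, k=recovery_samples):
                    obs, ok = env.step(alt)
                    if ok and vlm.judge_success(obs, alt):
                        break
            # skill 4 (goal recognition): stop once the instruction appears satisfied
            if vlm.goal_reached(obs, instruction):
                return True
        return False

The control flow is the point: perception and planning propose an action, action understanding gates it, recovery re-samples when the gate fails, and goal recognition decides when to stop.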

Core claim

We propose the Environmental Understanding Embodied Agent (EUEA) framework that fine-tunes VLMs on four core skills: object perception for identifying relevant objects, task planning for generating interaction subgoals, action understanding for judging success likelihood, and goal recognition for determining goal completion. By incorporating a recovery step that samples alternative actions to correct failures and a group relative policy optimization (GRPO) stage to refine inconsistent predictions, the model achieves an 8.86% improvement in average success rate over a behavior-cloning baseline on ALFRED tasks, with an additional 3.03% gain from the recovery and GRPO stages.

What carries the argument

The EUEA framework: four fine-tuned skills (object perception, task planning, action understanding, goal recognition), a recovery step that samples alternative actions when an interaction fails, and a GRPO stage that refines inconsistent skill predictions.
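For readers who have not met group relative policy optimization (introduced for LLM training in DeepSeekMath), the load-bearing piece is the group-relative advantage: several outputs are sampled for the same prompt and each is scored against the mean and standard deviation of its own group, so no learned value critic is needed. A minimal sketch of that computation, not the authors' implementation, with the reward definition left as an illustrative placeholder since the material above does not state it:

    import numpy as np

    def group_relative_advantages(rewards, eps=1e-8):
        # GRPO-style advantage: score each sampled output relative to its own group.
        r = np.asarray(rewards, dtype=float)
        return (r - r.mean()) / (r.std() + eps)

    # e.g. four sampled skill predictions for one prompt, rewarded 1 if consistent, 0 if not
    print(group_relative_advantages([1.0, 0.0, 1.0, 0.0]))  # approx [ 1., -1.,  1., -1.]

These advantages then weight a clipped policy-gradient update on the sampled outputs; how the paper scores a prediction as consistent or inconsistent is not stated in the material above.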

If this is right

  • The VLM executes instruction-following tasks more reliably across ALFRED benchmarks.
  • The recovery step corrects failure cases by sampling alternative actions.
  • The GRPO stage reduces inconsistent skill predictions and adds further performance gains.
  • Skill-level analysis reveals specific limitations in closed- and open-source VLMs for agent-environment interaction.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same four-skill fine-tuning pattern could be tested in other embodied simulation suites to check whether the gains transfer.
  • Emphasizing targeted environmental skills might lower the volume of behavior-cloning data needed to train capable agents.
  • Combining these skills with additional sensing modalities could address remaining failure modes in more complex settings.

Load-bearing premise

That fine-tuning the four core skills together with the recovery step and GRPO stage will produce reliable task execution without continued reliance on environment metadata or external supervision.

What would settle it

Running the trained model in new test environments where it still fails on interactions or requires environment metadata would falsify the claim of reliable execution.

Figures

Figures reproduced from arXiv: 2604.19839 by Donggyu Lee, Jaeyeon Bae, Jinsik Bang, Siyeol Jung, Taehwan Kim.

Figure 1
Figure 1. Overview of the four core skills of EUEA to enhance the VLM's environmental understanding and interaction. Each core skill consists of two sub-skills, and we fine-tune the VLM in a single stage using the data from all skills. The colored boxes in each skill example indicate the representations. Additional example templates for each skill can be found in the supplementary material, Sec. B. view at source ↗
Figure 2
Figure 2. Approach for generating data for environmental understanding. We construct a future situation captioning dataset that enables the prediction of captions describing changes between two images f_{t-1} and f_t from a single image f_{t-1}. view at source ↗
Figure 4
Figure 4. Comparison of VLM backbones on task evaluation. We evaluate task performance using the InternVL2.5-series and Qwen2.5-VL-series as backbones in the SFT stage. view at source ↗
Figure 5
Figure 5. Analysis of cases where the recovery step resolves a failed interaction. (a) shows when an incorrect detection is corrected, allowing the action to succeed. (b) shows when an interaction fails and an alternative action completes the given task successfully. view at source ↗
read the original abstract

Vision-language models (VLMs) have shown strong perception and reasoning abilities for instruction-following embodied agents. However, despite these abilities and their generalization performance, they still face limitations in environmental understanding, often failing on interactions or relying on environment metadata during execution. To address this challenge, we propose a novel framework named Environmental Understanding Embodied Agent (EUEA), which fine-tunes four core skills: 1) object perception for identifying relevant objects, 2) task planning for generating interaction subgoals, 3) action understanding for judging success likelihood, and 4) goal recognition for determining goal completion. By fine-tuning VLMs with EUEA skills, our framework enables more reliable task execution for instruction-following. We further introduce a recovery step that leverages these core skills and a group relative policy optimization (GRPO) stage that refines inconsistent skill predictions. The recovery step samples alternative actions to correct failure cases, and the GRPO stage refines inconsistent skill predictions. Across ALFRED tasks, our VLM significantly outperforms a behavior-cloning baseline, achieving an 8.86% improvement in average success rate. The recovery and GRPO stages provide an additional 3.03% gain, further enhancing overall performance. Finally, our skill-level analyses reveal key limitations in the environmental understanding of closed- and open-source VLMs and identify the capabilities necessary for effective agent-environment interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes the Environmental Understanding Embodied Agent (EUEA) framework, which fine-tunes a VLM on four core skills (object perception, task planning, action understanding, goal recognition) plus a recovery step and GRPO stage to improve environmental understanding and reduce reliance on metadata for instruction-following embodied agents. On ALFRED tasks the approach reports an 8.86% average success-rate gain over a behavior-cloning baseline, with an additional 3.03% from recovery/GRPO; skill-level analyses are also presented.

Significance. If the empirical claims hold after clarification, the work would demonstrate that targeted skill fine-tuning plus recovery/GRPO can measurably improve VLM-based embodied performance on ALFRED. The skill analyses could usefully expose VLM limitations for agent-environment interaction. No machine-checked proofs or parameter-free derivations are present, but the framework is directly falsifiable via the reported benchmark numbers.

major comments (2)
  1. [Abstract] Abstract: the headline claim of an 8.86% + 3.03% success-rate improvement is presented without any information on training-set size, exact fine-tuning procedure, number of runs, or statistical significance. This information is load-bearing for interpreting the central empirical result.
  2. [Evaluation setup (ALFRED experiments)] Evaluation setup (ALFRED experiments): the manuscript does not explicitly state whether VLM inference receives only raw visual observations and language instructions or still receives ALFRED-provided object positions, states, or scene graphs. If metadata remains available at test time, the measured gains cannot be attributed to learned environmental understanding, directly undermining the paper's motivating claim.
minor comments (2)
  1. [Abstract] Abstract: the recovery step and GRPO stage are described twice in consecutive sentences; a single concise statement would improve readability.
  2. [Abstract] The abstract mentions 'skill-level analyses' but does not indicate where these results appear or how they are quantified (e.g., per-skill accuracy tables).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and have revised the manuscript to improve clarity and transparency.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim of an 8.86% + 3.03% success-rate improvement is presented without any information on training-set size, exact fine-tuning procedure, number of runs, or statistical significance. This information is load-bearing for interpreting the central empirical result.

    Authors: We agree these details are essential. The revised abstract now specifies the ALFRED training split size used for fine-tuning, the exact fine-tuning procedure (including LoRA rank and learning rate), the number of independent runs performed with reported means and standard deviations, and a note on statistical significance via paired t-tests. These elements have also been expanded in the methods and experiments sections. revision: yes

  2. Referee: [Evaluation setup (ALFRED experiments)] Evaluation setup (ALFRED experiments): the manuscript does not explicitly state whether VLM inference receives only raw visual observations and language instructions or still receives ALFRED-provided object positions, states, or scene graphs. If metadata remains available at test time, the measured gains cannot be attributed to learned environmental understanding, directly undermining the paper's motivating claim.

    Authors: We thank the referee for identifying this ambiguity. Our framework performs VLM inference using only raw visual observations and language instructions, with no ALFRED-provided metadata (object positions, states, or scene graphs) at test time. This design directly supports the claim of learned environmental understanding. We have added an explicit statement to this effect in the Evaluation Setup section of the revised manuscript. revision: yes
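Both rebuttal promises are checkable once the revision appears. The reporting asked for in the first exchange amounts to the computation below; the per-run numbers are invented for illustration (they are not the paper's results), and the pairing assumes runs matched by seed and evaluation split. The second exchange is a question of the input contract rather than statistics: the claim stands or falls on whether the evaluation-setup text confirms that only raw frames and language instructions reach the model at test time.

    import numpy as np
    from scipy import stats

    # Hypothetical per-run average success rates (%); illustrative values only.
    baseline = np.array([38.2, 39.1, 37.8, 38.6, 38.9])
    euea     = np.array([47.0, 47.9, 46.5, 47.6, 47.3])

    print(f"baseline: {baseline.mean():.2f} +/- {baseline.std(ddof=1):.2f}")
    print(f"EUEA:     {euea.mean():.2f} +/- {euea.std(ddof=1):.2f}")

    # Paired t-test across runs matched by seed/split, as the rebuttal proposes.
    t, p = stats.ttest_rel(euea, baseline)
    print(f"paired t = {t:.2f}, p = {p:.4f}")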

Circularity Check

0 steps flagged

No circularity: empirical benchmark results with no derivation chain

full rationale

The paper presents an empirical framework for fine-tuning VLMs on four skills plus recovery and GRPO stages, then reports direct success-rate improvements on ALFRED tasks (8.86% + 3.03%). No equations, functional forms, predictions, or first-principles derivations appear in the abstract or described content. All performance claims are measured outcomes from experiments rather than quantities constructed from fitted inputs or self-citations. The central claims therefore remain independent of the input data by construction and receive no circularity penalty.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the domain assumption that the four enumerated skills are both necessary and sufficient for environmental understanding in embodied agents; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: Fine-tuning VLMs on object perception, task planning, action understanding, and goal recognition produces reliable instruction-following behavior.
    Central premise stated in the abstract as the solution to current VLM limitations.

pith-pipeline@v0.9.0 · 5561 in / 1311 out tokens · 34538 ms · 2026-05-10T03:34:04.728818+00:00 · methodology

discussion (0)

