{"paper":{"title":"EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents","license":"http://arxiv.org/licenses/nonexclusive-distrib/1.0/","headline":"MLLMs excel at high-level embodied tasks but score only 28.9 percent on low-level manipulation.","cross_cats":["cs.CL","cs.CV"],"primary_cat":"cs.AI","authors_text":"Cheng Qian, Hanyang Chen, Heng Ji, Huan Zhang, Junyu Zhang, Kangrui Wang, Manling Li, Mark Zhao, Marziyeh Movahedi, Qineng Wang, Rui Yang, Teja Venkat Koripella, Tong Zhang","submitted_at":"2025-02-13T18:11:34Z","abstract_excerpt":"Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level "},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That performance in the four chosen simulated environments and the six curated capability subsets accurately reflects real-world embodied agent challenges.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"EmbodiedBench is a new evaluation framework for MLLM-based embodied agents that shows strong high-level reasoning but weak low-level manipulation performance across 24 tested models.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"MLLMs excel at high-level embodied tasks but score only 28.9 percent on low-level manipulation.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"24d1e1932f2590145c6de98df8201652f8faa79ced178991771771b68ff539dd"},"source":{"id":"2502.09560","kind":"arxiv","version":3},"verdict":{"id":"d69d70eb-fed1-4717-b23e-fcdbc2241a6c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T00:20:32.985543Z","strongest_claim":"MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average.","one_line_summary":"EmbodiedBench is a new evaluation framework for MLLM-based embodied agents that shows strong high-level reasoning but weak low-level manipulation performance across 24 tested models.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That performance in the four chosen simulated environments and the six curated capability subsets accurately reflects real-world embodied agent challenges.","pith_extraction_headline":"MLLMs excel at high-level embodied tasks but score only 28.9 percent on low-level manipulation."},"references":{"count":22,"sample":[{"doi":"10.24963/ijcai.2024/15","year":2015,"title":"Put washed lettuce in the refrigerator","work_id":"8a8d949d-2daa-4325-9b00-2d5fadb84f24","ref_index":1,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"**Visibility**: Always locate a visible object by the ’find’ action before interacting with it","work_id":"27f6b323-77d3-4db3-8a1a-0133d997f45e","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Avoid performing actions that do not meet the defined validity criteria","work_id":"e891d1fa-8be9-43c7-b502-497cd9c17bb7","ref_index":4,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"You can explore these instances if you do not find the desired object in the current receptacle","work_id":"fd9c53ba-b5d9-4cb8-b73f-c6443a7c9b8a","ref_index":6,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"If the last action is invalid, reflect on the reason, such as not adhering to action rules or missing preliminary actions, and adjust your plan accordingly","work_id":"96cd5e15-7387-45f4-8068-6e674c6a1dcb","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":22,"snapshot_sha256":"7fea875e5acea7790ca94317fbd59aa4120c82214e29dfcc91a73c550e611881","internal_anchors":0},"formal_canon":{"evidence_count":2,"snapshot_sha256":"0665e50e19214462b0da29e46b6e709c5809d5acf3f499f8d81c2f614a80fa8c"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}