Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use
Pith reviewed 2026-06-27 13:31 UTC · model grok-4.3
The pith
Even the strongest of 13 MLLMs identifies only 58.7 percent of the physical tools in a scene and completes 21.0 percent of tool-use queries end-to-end.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Across 13 leading MLLMs evaluated on PhysTool-Bench, even Gemini-3.1-Pro identifies only 58.7 percent of the tools present in realistic scenes and completes merely 21.0 percent of the 2,510 queries end-to-end. The benchmark shows a two-level deficit: models first fail to perceive all tools accurately, then suffer a steeper drop when required to map those tools onto task goals through functional commonsense reasoning about selection and use sequences.
What carries the argument
PhysTool-Bench: a dataset of 2,510 queries spanning 2,678 real-world tools across manufacturing, electrical, agricultural, and healthcare domains. It scores tool recognition from images separately from the planning of tool-use sequences.
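The two scores can be pictured with a minimal sketch. This page does not give the paper's exact metric definitions, so the version below assumes that recognition is micro-averaged recall over annotated tools and that end-to-end completion is the share of queries whose plan was judged correct. `QueryResult` and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class QueryResult:
    """One benchmark query; all field names are hypothetical."""
    gold_tools: set[str]        # tools annotated as present in the scene
    predicted_tools: set[str]   # tools the model listed for the scene
    plan_correct: bool          # whether the planned sequence was judged correct


def recognition_rate(results: list[QueryResult]) -> float:
    """Micro-averaged recall: share of annotated tools the model named (assumed definition)."""
    found = sum(len(r.gold_tools & r.predicted_tools) for r in results)
    total = sum(len(r.gold_tools) for r in results)
    return found / total if total else 0.0


def end_to_end_rate(results: list[QueryResult]) -> float:
    """Share of queries whose plan was judged correct (assumed definition)."""
    return sum(r.plan_correct for r in results) / len(results) if results else 0.0


results = [
    QueryResult({"wrench", "pliers"}, {"wrench"}, False),
    QueryResult({"multimeter"}, {"multimeter", "hammer"}, True),
]
print(recognition_rate(results))  # 2 of 3 annotated tools were named
print(end_to_end_rate(results))   # 1 of 2 queries was completed
```

Under these assumptions the recognition figure is counted per tool and the completion figure per query, so the two percentages are not measured in the same unit.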
If this is right
- MLLMs need stronger visual perception modules tuned to cluttered real-world scenes.
- Models require additional training on functional properties of tools to bridge perception to task semantics.
- Embodied AI systems cannot yet rely on current MLLMs as the sole planner for physical tool interactions.
- The performance drop from recognition to planning identifies the planning stage as the dominant bottleneck.
Where Pith is reading between the lines
- Robot control pipelines may need separate perception and commonsense modules rather than depending on a single MLLM.
- The benchmark could be used to measure progress after targeted fine-tuning on tool-interaction videos or simulations.
- Similar perception-plus-planning gaps are likely to appear in other manipulation tasks that involve choosing objects from a scene.
- Expanding the benchmark to include execution feedback from actual robots would test whether the planning errors translate to physical failures.
Load-bearing premise
The 2,510 queries and 2,678 tools, together with the two evaluation dimensions of recognition and planning, capture the core capabilities needed for physical tool use.
What would settle it
A model that reaches above 85 percent tool recognition and above 60 percent end-to-end completion on the full set of PhysTool-Bench queries and scenes would falsify the reported deficit levels.
Original abstract
Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite the importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PhysTool-Bench, a benchmark of 2,510 queries over 2,678 real-world physical tools spanning manufacturing, electrical work, agriculture, and healthcare. It evaluates 13 MLLMs on two dimensions—recognizing all tools present in a scene and planning a tool selection/use sequence to fulfill an instruction—reporting that the strongest model (Gemini-3.1-Pro) reaches only 58.7% tool identification and 21.0% end-to-end success. The authors conclude that MLLMs exhibit a two-level deficit (perception plus functional commonsense for mapping tools to task semantics), identifying this as a critical bottleneck for embodied AI.
Significance. If the benchmark and evaluation protocol hold, the work supplies a large-scale empirical measurement of MLLMs on physical tool use, a capability central to embodied applications. The reported gap between recognition and end-to-end performance, together with the domain breadth, could usefully direct research toward improving functional reasoning in multimodal models. The scale (2,510 queries) is a concrete strength that enables broad coverage.
major comments (3)
- [Abstract and §3] Abstract and benchmark-construction section: the abstract states the 58.7% recognition and 21.0% end-to-end figures but supplies no information on query generation, validation, balancing, inter-annotator agreement, or selection biases in the 2,678-tool set. These details are load-bearing for treating the numbers as evidence of the claimed two-level deficit.
- [§4 (Analysis)] Analysis section: the claim that the drop from recognition (58.7%) to end-to-end (21%) demonstrates 'lack of functional commonsense for mapping perceived tools onto task semantics' is not supported by ablations such as oracle tool lists or relaxed sequence matching. Without them, the interpretation cannot isolate commonsense deficits from confounds such as output-format sensitivity or instruction-following fidelity.
- [Evaluation protocol] Evaluation protocol: no description is given of how planning sequences are scored (exact string match to ground-truth annotations versus semantic equivalence), which directly affects the 21% figure and the two-level-deficit diagnosis.
minor comments (2)
- [Related work] Related-work section should explicitly compare PhysTool-Bench to prior embodied-AI benchmarks (e.g., those focused on API or simulation-based tool use) to clarify novelty.
- [Figures] Figure captions could more clearly label example scenes, tool annotations, and ground-truth sequences to aid reader interpretation of the two evaluation dimensions.
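The oracle-tool-list ablation requested in the second major comment can be sketched as a per-query bookkeeping step. This is an illustration under assumptions, not the paper's protocol: each query is assumed to be run twice, once with the model's own perception and once with the ground-truth tool list supplied, and the function name and labels are hypothetical.

```python
from collections import Counter


def attribute_failures(outcomes: list[tuple[bool, bool]]) -> Counter:
    """Label each query from (success_with_own_perception, success_with_oracle_tools)."""
    labels = Counter()
    for own, oracle in outcomes:
        if own:
            labels["solved"] += 1
        elif oracle:
            labels["perception_limited"] += 1   # fixed by supplying gold tools
        else:
            labels["planning_limited"] += 1     # fails even with gold tools
    return labels


outcomes = [(True, True), (False, True), (False, False), (False, False)]
print(attribute_failures(outcomes))
```

Even in this sketch, a "planning_limited" label can still hide output-format or instruction-following failures, which is why the comment also asks for relaxed sequence matching.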
Simulated Author's Rebuttal
We thank the referee for the constructive feedback highlighting areas where additional methodological transparency and analysis would strengthen the manuscript. We address each major comment below and commit to revisions that provide the requested details without altering the core findings.
- Referee: [Abstract and §3] Abstract and benchmark-construction section: the abstract states the 58.7% recognition and 21.0% end-to-end figures but supplies no information on query generation, validation, balancing, inter-annotator agreement, or selection biases in the 2,678-tool set. These details are load-bearing for treating the numbers as evidence of the claimed two-level deficit.
  Authors: The abstract is intentionally concise per venue guidelines. Section 3 describes query generation from real-world scenarios across the four domains and the curation of the 2,678-tool inventory. To address the concern, we will expand §3 with a dedicated subsection detailing the query validation procedure (including pilot testing), domain balancing statistics, inter-annotator agreement scores on tool annotations and query correctness, and an explicit discussion of potential selection biases in tool sourcing.
  Revision: yes
- Referee: [§4 (Analysis)] Analysis section: the claim that the drop from recognition (58.7%) to end-to-end (21%) demonstrates 'lack of functional commonsense for mapping perceived tools onto task semantics' is not supported by ablations such as oracle tool lists or relaxed sequence matching. Without them, the interpretation cannot isolate commonsense deficits from confounds such as output-format sensitivity or instruction-following fidelity.
  Authors: The large performance gap is consistent with a functional-reasoning bottleneck, but we agree that isolating it from format or instruction-following confounds requires additional controls. We will add oracle-tool-list experiments (supplying ground-truth tool sets to the planner) and report results under both exact and relaxed (semantic) sequence matching in a new analysis subsection. These ablations will be run on the top-performing models and included in the revised §4.
  Revision: yes
- Referee: [Evaluation protocol] Evaluation protocol: no description is given of how planning sequences are scored (exact string match to ground-truth annotations versus semantic equivalence), which directly affects the 21% figure and the two-level-deficit diagnosis.
  Authors: We apologize for the missing protocol description. Planning sequences are scored via exact string match on normalized tool names and action verbs against the annotated ground truth, with limited synonym normalization applied only to tool nomenclature. We will insert a precise evaluation-protocol subsection (new §4.2) that fully specifies the matching rules, normalization steps, and any edge-case handling so that the 21% figure can be reproduced and interpreted unambiguously.
  Revision: yes
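The scoring rule described in the last response (exact match on normalized tool names and action verbs, with synonym normalization for tool names only) might look roughly like the sketch below. The synonym table, the `(verb, tool)` step representation, and both function names are assumptions made for illustration; the actual matching rules are what the promised §4.2 would specify.

```python
# Hypothetical synonym table; the real normalization list is not given on this page.
TOOL_SYNONYMS = {"spanner": "wrench", "multi-meter": "multimeter"}


def normalize_step(verb: str, tool: str) -> tuple[str, str]:
    """Lowercase and collapse whitespace in both fields; map synonyms for the tool only."""
    verb_n = " ".join(verb.lower().split())
    tool_n = " ".join(tool.lower().split())
    return verb_n, TOOL_SYNONYMS.get(tool_n, tool_n)


def plan_matches(predicted: list[tuple[str, str]], gold: list[tuple[str, str]]) -> bool:
    """Exact match: same length, same order, identical normalized steps."""
    return [normalize_step(*s) for s in predicted] == [normalize_step(*s) for s in gold]


gold = [("loosen", "wrench"), ("measure", "multimeter")]
print(plan_matches([("Loosen", "Spanner "), ("measure", "Multi-Meter")], gold))  # tool synonyms accepted
print(plan_matches([("unscrew", "wrench"), ("measure", "multimeter")], gold))    # verb synonym rejected
```

The second call shows how strict such a rule is: a plan that a human might accept fails because verbs receive no synonym handling, which is the sensitivity the referee's comment points at.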
Circularity Check
No circularity: empirical benchmark evaluation with direct measurements only
The paper introduces PhysTool-Bench and reports empirical performance numbers (e.g., 58.7% tool recognition, 21.0% end-to-end) across 13 MLLMs. No derivations, equations, fitted parameters, predictions, or self-citation chains appear in the abstract or described content. All claims rest on new data collection and model evaluation rather than any reduction to prior inputs by construction. This is the standard non-circular case for benchmark papers.
Appendix excerpts (truncated in the source)
A Per-Category Performance Analysis
To understand whether MLLMs exhibit uniform competence in physical tool use or whether their performance varies by tool category, we disaggregate the Task-Completable Rate (TCR) across the 28 UNS...
THE UNIFIED VIABILITY TEST: A tool is strictly REQUIRED only if its removal causes the task to physically fail, violate safety, or violate professional industry standards.
- Implicit Constraints: You must consider implicit constraints. (e.g., studying animals "without disturbing habitat" standardly requires an unattended tool like a 'Wildlife Camera Trap'...
'target_tools': List of selected tools.
3. 'target_steps': Integers representing the execution order (starting at 1, continuous, same number for parallel tools).
4. 'negative_tools': List of rejected tools. ]
C.6 Evaluation Prompt: Task I (Tool Recognition)
We test MLLM's ability in recog...
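The `target_steps` rule quoted in the prompt excerpt above (integers starting at 1, continuous, same number for parallel tools) can be checked mechanically. The sketch below is an assumption-laden illustration: it supposes that `target_steps` aligns one-to-one with `target_tools`, which the truncated excerpt does not confirm, and `valid_target_steps` is a hypothetical helper name.

```python
def valid_target_steps(target_tools: list[str], target_steps: list[int]) -> bool:
    """Check the stated constraints on step numbers for a list of selected tools."""
    # Assumption: one step number per selected tool, in the same order.
    if len(target_tools) != len(target_steps):
        return False
    # Step numbers must be plain integers (bool is excluded explicitly).
    if any(isinstance(s, bool) or not isinstance(s, int) for s in target_steps):
        return False
    # Starting at 1 and continuous: the distinct values must be exactly 1..max.
    # Repeated values are allowed, since parallel tools share a step number.
    highest = max(target_steps, default=0)
    return set(target_steps) == set(range(1, highest + 1))


print(valid_target_steps(["hose", "sprinkler", "timer"], [1, 2, 2]))  # parallel tools share step 2
print(valid_target_steps(["hose", "timer"], [1, 3]))                  # gap at step 2
```

A check of this kind only validates the output format; it says nothing about whether the chosen order is physically sensible.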