arxiv: 2606.10803 · v1 · pith:SQQK2R7Jnew · submitted 2026-06-09 · 💻 cs.CL · cs.AI · cs.CV

Beyond APIs: Probing the Limits of MLLMs in Physical Tool Use

Pith reviewed 2026-06-27 13:31 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.CV
keywords MLLMs · physical tool use · embodied AI · benchmark · tool recognition · planning · functional commonsense

The pith

Even the strongest MLLM tested identifies only 58.7 percent of the physical tools in a scene and completes 21.0 percent of tool-use queries end-to-end.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates PhysTool-Bench to measure how well multimodal large language models handle physical tools for embodied tasks. It tests 13 models on two steps: spotting every tool visible in a real scene and then planning the correct sequence of tools to carry out a given instruction. The strongest model reaches 58.7 percent on tool recognition but drops to 21 percent when both steps must succeed together. The larger failure occurs at the planning stage, where models must connect what they see to how each tool actually works. These gaps indicate that current models lack the visual accuracy and practical knowledge needed to direct robots in everyday physical work.

Core claim

Across 13 leading MLLMs evaluated on PhysTool-Bench, even Gemini-3.1-Pro identifies only 58.7 percent of the tools present in realistic scenes and completes merely 21.0 percent of the 2,510 queries end-to-end. The benchmark shows a two-level deficit: models first fail to perceive all tools accurately, then suffer a steeper drop when required to map those tools onto task goals through functional commonsense reasoning about selection and use sequences.
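As a rough sense of scale, the end-to-end percentage can be converted into an approximate number of completed queries. The sketch below assumes the 21.0 percent figure is computed over all 2,510 queries and rounded to one decimal place; the helper name `count_range` is illustrative and not from the paper.

```python
import math

TOTAL_QUERIES = 2510    # benchmark size stated in the paper
END_TO_END_PCT = 21.0   # reported for Gemini-3.1-Pro, assumed rounded to one decimal


def count_range(total: int, pct: float, decimals: int = 1) -> tuple[int, int]:
    """Smallest and largest integer counts compatible with a rounded percentage."""
    half_step = 0.5 * 10 ** (-decimals)
    low = math.ceil(total * (pct - half_step) / 100)
    high = math.floor(total * (pct + half_step) / 100)
    return low, high


nominal = round(TOTAL_QUERIES * END_TO_END_PCT / 100)
low, high = count_range(TOTAL_QUERIES, END_TO_END_PCT)
print(nominal, low, high)  # 527 526 528
```

Under these assumptions the best model completes roughly 527 of the 2,510 queries, leaving close to 2,000 unresolved.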

What carries the argument

PhysTool-Bench, a dataset of 2,510 queries spanning 2,678 real-world tools across manufacturing, electrical, agricultural, and healthcare domains, that separately scores tool recognition from images and planning of tool-use sequences.
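To make the recognition dimension concrete, here is a minimal sketch of a per-scene recall score, assuming recognition is measured as the fraction of annotated tools that the model names. The function names, the lowercase and whitespace normalization, and the example tools are illustrative assumptions, not the paper's scoring code.

```python
def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivially different spellings match."""
    return " ".join(name.lower().split())


def recognition_recall(predicted: list[str], annotated: list[str]) -> float:
    """Fraction of annotated tools that appear among the predicted names."""
    gold = {normalize(t) for t in annotated}
    if not gold:
        return 1.0  # assumption: a scene with no annotated tools counts as complete
    found = gold & {normalize(t) for t in predicted}
    return len(found) / len(gold)


# Hypothetical scene: four annotated tools; the model names two of them
# plus one tool that is not in the scene.
annotated = ["Claw Hammer", "Wire Stripper", "Multimeter", "Pliers"]
predicted = ["claw hammer", "multimeter", "screwdriver"]
score = recognition_recall(predicted, annotated)
print(score)  # 0.5
```

Recall alone does not penalize the spurious "screwdriver" prediction; whether the benchmark also penalizes hallucinated tools is not stated in the material above.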

If this is right

  • MLLMs need stronger visual perception modules tuned to cluttered real-world scenes.
  • Models require additional training on functional properties of tools to bridge perception to task semantics.
  • Embodied AI systems cannot yet rely on current MLLMs as the sole planner for physical tool interactions.
  • The performance drop from recognition to planning identifies the planning stage as the dominant bottleneck.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Robot control pipelines may need separate perception and commonsense modules rather than depending on a single MLLM.
  • The benchmark could be used to measure progress after targeted fine-tuning on tool-interaction videos or simulations.
  • Similar perception-plus-planning gaps are likely to appear in other manipulation tasks that involve choosing objects from a scene.
  • Expanding the benchmark to include execution feedback from actual robots would test whether the planning errors translate to physical failures.

Load-bearing premise

The 2,510 queries and 2,678 tools, together with the two evaluation dimensions of recognition and planning, capture the core capabilities needed for physical tool use.

What would settle it

A model that scores above 85 percent on tool recognition and above 60 percent on end-to-end completion across the full set of PhysTool-Bench queries and scenes would show that the reported deficit levels no longer describe the state of the art.

Figures

Figures reproduced from arXiv: 2606.10803 by Chong-Wah Ngo, Wenjie Li, Yongqi Li, Yutong Zhou, Zhixin Ma.

Figure 1: LLMs excel at digital, symbolic tasks accessible via tools and APIs (left). In the physical world (right), the same …
Original abstract

Multimodal Large Language Models (MLLMs) excel at utilizing digital APIs and increasingly serve as the "brain" of embodied AI, instructing robots to interact with the physical world. In such embodied settings, a central capability is the use of physical tools, which underpins MLLMs' ability to assist humans in real-world tasks. Despite the importance, MLLMs' proficiency in physical tool use remains largely unexplored. To address this gap, we introduce PhysTool-Bench, the first physical tool-use benchmark designed to evaluate MLLMs' ability to comprehend real-world scenarios, identify physical tools, and plan their use. PhysTool-Bench comprises 2,510 queries over 2,678 real-world physical tools spanning diverse domains, including manufacturing, electrical work, agriculture, and healthcare. Concretely, models are evaluated along two primary dimensions: 1) recognizing all physical tools present in the scene, and 2) planning the tool selection and use sequence based on the instruction and visual context. Across 13 leading MLLMs, even the strongest model (Gemini-3.1-Pro) identifies only 58.7% of tools in a scene and completes merely 21.0% of queries end-to-end. Our analysis reveals a two-level deficit: MLLMs struggle to perceive tools in realistic scenes, and the much larger drop at the planning stage further indicates a lack of functional commonsense for mapping perceived tools onto task semantics, pinpointing a critical bottleneck for the development of practical embodied AI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces PhysTool-Bench, a benchmark of 2,510 queries over 2,678 real-world physical tools spanning manufacturing, electrical work, agriculture, and healthcare. It evaluates 13 MLLMs on two dimensions—recognizing all tools present in a scene and planning a tool selection/use sequence to fulfill an instruction—reporting that the strongest model (Gemini-3.1-Pro) reaches only 58.7% tool identification and 21.0% end-to-end success. The authors conclude that MLLMs exhibit a two-level deficit (perception plus functional commonsense for mapping tools to task semantics), identifying this as a critical bottleneck for embodied AI.

Significance. If the benchmark and evaluation protocol hold, the work supplies a large-scale empirical measurement of MLLMs on physical tool use, a capability central to embodied applications. The reported gap between recognition and end-to-end performance, together with the domain breadth, could usefully direct research toward improving functional reasoning in multimodal models. The scale (2,510 queries) is a concrete strength that enables broad coverage.

major comments (3)
  1. [Abstract and §3] Abstract and benchmark-construction section: the abstract states the 58.7% recognition and 21.0% end-to-end figures but supplies no information on query generation, validation, balancing, inter-annotator agreement, or selection biases in the 2,678-tool set. These details are load-bearing for treating the numbers as evidence of the claimed two-level deficit.
  2. [§4 (Analysis)] Analysis section: the claim that the drop from recognition (58.7%) to end-to-end (21%) demonstrates 'lack of functional commonsense for mapping perceived tools onto task semantics' is not supported by ablations such as oracle tool lists or relaxed sequence matching. Without them, the interpretation cannot isolate commonsense deficits from confounds such as output-format sensitivity or instruction-following fidelity.
  3. [Evaluation protocol] Evaluation protocol: no description is given of how planning sequences are scored (exact string match to ground-truth annotations versus semantic equivalence), which directly affects the 21% figure and the two-level-deficit diagnosis.
minor comments (2)
  1. [Related work] Related-work section should explicitly compare PhysTool-Bench to prior embodied-AI benchmarks (e.g., those focused on API or simulation-based tool use) to clarify novelty.
  2. [Figures] Figure captions could more clearly label example scenes, tool annotations, and ground-truth sequences to aid reader interpretation of the two evaluation dimensions.
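The oracle-tool-list ablation requested in major comment 2 can be sketched as follows. The idea is to score each query twice, once with the model's own recognized tools and once with the ground-truth tool list supplied, and then compare the two success rates. The record type `QueryOutcome`, the function `summarize`, and the toy outcomes are hypothetical; this illustrates the analysis, not results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueryOutcome:
    end_to_end_ok: bool   # plan correct when the model relies on its own recognition
    oracle_plan_ok: bool  # plan correct when the ground-truth tool list is supplied


def summarize(outcomes: list[QueryOutcome]) -> dict[str, float]:
    """Success rates with and without oracle tools, plus the share oracle tools recover."""
    if not outcomes:
        raise ValueError("need at least one outcome")
    n = len(outcomes)
    end_to_end = sum(o.end_to_end_ok for o in outcomes) / n
    oracle = sum(o.oracle_plan_ok for o in outcomes) / n
    return {
        "end_to_end_rate": end_to_end,
        "oracle_planning_rate": oracle,
        "recovered_by_oracle": oracle - end_to_end,
        "fails_even_with_oracle": 1.0 - oracle,
    }


# Toy data: four queries, invented for illustration only.
outcomes = [
    QueryOutcome(True, True),
    QueryOutcome(False, True),   # perception was the blocker
    QueryOutcome(False, False),  # fails even with perfect tool lists
    QueryOutcome(False, False),
]
summary = summarize(outcomes)
print(summary)
```

Failures that persist under oracle tool lists cannot be attributed to perception, though they may still reflect output-format or instruction-following problems rather than missing functional commonsense, which is why the referee also asks for relaxed sequence matching.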

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where additional methodological transparency and analysis would strengthen the manuscript. We address each major comment below and commit to revisions that provide the requested details without altering the core findings.

  1. Referee: [Abstract and §3] Abstract and benchmark-construction section: the abstract states the 58.7% recognition and 21.0% end-to-end figures but supplies no information on query generation, validation, balancing, inter-annotator agreement, or selection biases in the 2,678-tool set. These details are load-bearing for treating the numbers as evidence of the claimed two-level deficit.

    Authors: The abstract is intentionally concise per venue guidelines. Section 3 describes query generation from real-world scenarios across the four domains and the curation of the 2,678-tool inventory. To address the concern, we will expand §3 with a dedicated subsection detailing the query validation procedure (including pilot testing), domain balancing statistics, inter-annotator agreement scores on tool annotations and query correctness, and an explicit discussion of potential selection biases in tool sourcing. revision: yes

  2. Referee: [§4 (Analysis)] Analysis section: the claim that the drop from recognition (58.7%) to end-to-end (21%) demonstrates 'lack of functional commonsense for mapping perceived tools onto task semantics' is not supported by ablations such as oracle tool lists or relaxed sequence matching. Without them, the interpretation cannot isolate commonsense deficits from confounds such as output-format sensitivity or instruction-following fidelity.

    Authors: The large performance gap is consistent with a functional-reasoning bottleneck, but we agree that isolating it from format or instruction-following confounds requires additional controls. We will add oracle-tool-list experiments (supplying ground-truth tool sets to the planner) and report results under both exact and relaxed (semantic) sequence matching in a new analysis subsection. These ablations will be run on the top-performing models and included in the revised §4. revision: yes

  3. Referee: [Evaluation protocol] Evaluation protocol: no description is given of how planning sequences are scored (exact string match to ground-truth annotations versus semantic equivalence), which directly affects the 21% figure and the two-level-deficit diagnosis.

    Authors: We apologize for the missing protocol description. Planning sequences are scored via exact string match on normalized tool names and action verbs against the annotated ground truth, with limited synonym normalization applied only to tool nomenclature. We will insert a precise evaluation-protocol subsection (new §4.2) that fully specifies the matching rules, normalization steps, and any edge-case handling so that the 21% figure can be reproduced and interpreted unambiguously. revision: yes
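The scoring rule described in the third response can be illustrated with a minimal sketch. It assumes a plan is an ordered list of (action verb, tool name) pairs, that normalization means lowercasing and collapsing whitespace, and that synonyms are resolved for tool names only. The synonym table, the function names, and the order-insensitive `relaxed_match` variant are assumptions made for illustration; the authors' actual matching rules are not given above.

```python
# Hypothetical synonym table, applied to tool names only.
TOOL_SYNONYMS = {"spanner": "wrench", "box cutter": "utility knife"}

Step = tuple[str, str]  # (action verb, tool name)


def normalize_text(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def normalize_step(step: Step) -> Step:
    """Normalize both fields; map tool synonyms to a canonical name."""
    verb, tool = (normalize_text(part) for part in step)
    return verb, TOOL_SYNONYMS.get(tool, tool)


def exact_match(predicted: list[Step], gold: list[Step]) -> bool:
    """True only if verbs, tools, and step order all agree after normalization."""
    return [normalize_step(s) for s in predicted] == [normalize_step(s) for s in gold]


def relaxed_match(predicted: list[Step], gold: list[Step]) -> bool:
    """Assumed relaxed variant: same tools, ignoring verbs and step order."""
    pred_tools = sorted(normalize_step(s)[1] for s in predicted)
    gold_tools = sorted(normalize_step(s)[1] for s in gold)
    return pred_tools == gold_tools


gold = [("cut", "Utility Knife"), ("tighten", "Wrench")]
synonym_plan = [("Cut", "box cutter"), ("tighten", "spanner")]
reworded_plan = [("slice", "utility knife"), ("tighten", "wrench")]

print(exact_match(synonym_plan, gold))     # True
print(exact_match(reworded_plan, gold))    # False
print(relaxed_match(reworded_plan, gold))  # True
```

The second plan fails exact matching only because "slice" is not "cut", which is the kind of output-format sensitivity the referee asks the authors to separate from genuine planning errors.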

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with direct measurements only

The paper introduces PhysTool-Bench and reports empirical performance numbers (e.g., 58.7% tool recognition, 21.0% end-to-end) across 13 MLLMs. No derivations, equations, fitted parameters, predictions, or self-citation chains appear in the abstract or described content. All claims rest on new data collection and model evaluation rather than any reduction to prior inputs by construction. This is the standard non-circular case for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical benchmark paper; central claim rests on the representativeness of the test cases rather than on any mathematical derivation, fitted constants, or new postulated entities.
