arxiv: 2605.22366 · v1 · pith:N5XLWGZCnew · submitted 2026-05-21 · 💻 cs.CV

AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture

Pith reviewed 2026-05-22 07:42 UTC · model grok-4.3

classification 💻 cs.CV
keywords: agriculture, multimodal agents, tool-augmented models, benchmark, tool use evaluation, precision agriculture, multimodal large language models, agent reliability

The pith

A new benchmark shows multimodal AI models are unreliable at using tools for agricultural tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AgroTools to test whether multimodal AI systems can turn visual observations into reliable actions by calling external tools in farming contexts. Unlike prior agricultural benchmarks that only score final answers, this one tracks the full workflow of planning tool calls, generating arguments, handling errors, and synthesizing results. The benchmark supplies 539 queries with 1,097 images across five task families, each paired with correct tool-use traces inside an environment of 14 executable agricultural tools. Evaluations of thirteen open- and closed-source models expose consistent failures in tool planning, argument generation, execution recovery, and answer synthesis. If these findings hold, progress toward practical AI for precision agriculture will depend on targeted advances in tool-augmented reasoning rather than image recognition alone.

Core claim

AgroTools contains 539 question-answer instances paired with 1,097 heterogeneous agricultural images, spanning five task families and an executable environment of 14 agricultural tools. Each query carries structured tool-use traces that support dual evaluation of process-level execution quality and outcome-level task success. Benchmarking nine open-source and four closed-source multimodal large language models demonstrates that current models remain far from reliable in agricultural tool-use settings, with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis.

What carries the argument

The AgroTools benchmark with its 14-tool executable environment and structured tool-use trace annotations that enable separate scoring of execution process and final outcome.
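
To make the dual-view idea concrete, here is a minimal Python sketch that scores one predicted tool-use trace against a reference trace. The `ToolCall` structure, the tool names, and both scoring functions are hypothetical illustrations: this summary does not specify the trace schema or the metric definitions that AgroTools actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One step of a trace (hypothetical schema, not the benchmark's own)."""
    tool: str
    args: tuple = ()  # sorted (key, value) pairs so calls compare cleanly


def call(tool, **kwargs):
    """Build a ToolCall with arguments in a canonical order."""
    return ToolCall(tool, tuple(sorted(kwargs.items())))


def process_score(predicted, reference):
    """Process view: share of reference steps matched position by position.

    'tool' counts steps where the right tool was chosen; 'args' counts steps
    where the tool and every argument match exactly.
    """
    if not reference:
        return {"tool": 1.0, "args": 1.0}
    tool_hits = 0
    arg_hits = 0
    for pred, ref in zip(predicted, reference):
        if pred.tool == ref.tool:
            tool_hits += 1
            if pred.args == ref.args:
                arg_hits += 1
    n = len(reference)
    return {"tool": tool_hits / n, "args": arg_hits / n}


def outcome_score(pred_answer, ref_answer):
    """Outcome view: exact match on the normalized final answer."""
    return 1.0 if pred_answer.strip().lower() == ref_answer.strip().lower() else 0.0


# Hypothetical tool names and arguments, for illustration only.
reference = [call("detect_fruit", image="img_1.png"), call("estimate_size", unit="mm")]
predicted = [call("detect_fruit", image="img_1.png"), call("estimate_size", unit="cm")]

process = process_score(predicted, reference)
outcome = outcome_score("42 MM ", "42 mm")
```

In this toy case the model picks the right tools at both steps but passes a wrong argument at the second, so tool selection and argument generation receive different scores. That separation is the kind of diagnosis a process-level view allows and an outcome-only score hides.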

If this is right

  • Agricultural decision support will require specific gains in tool planning and argument generation before models can be deployed reliably.
  • Execution recovery mechanisms must be strengthened to handle the error-prone nature of tool calls in farming workflows.
  • Final-answer synthesis from tool results forms an additional bottleneck that limits overall task success.
  • The dual process-and-outcome evaluation can guide development of agents for high-precision agricultural applications.
  • Future models trained or fine-tuned on such traces may close the observed performance gaps.
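
The execution-recovery point above can be sketched as a retry loop around a tool call. This is an illustrative pattern only, not the paper's harness: `run_with_recovery`, the `estimate_canopy_cover` tool, and the `rescale_threshold` repair step are all hypothetical names, and in a real agent the repair step would be the model revising its own arguments after reading the error.

```python
def run_with_recovery(tool, args, repair, max_retries=2):
    """Call tool(**args); on failure, ask repair(args, error) for new args and retry.

    Returns (result, attempts), where attempts logs each argument set together
    with its error message (None on success). result is None if every try fails.
    """
    attempts = []
    for attempt in range(max_retries + 1):
        try:
            result = tool(**args)
            attempts.append((dict(args), None))
            return result, attempts
        except (TypeError, ValueError) as err:
            attempts.append((dict(args), str(err)))
            if attempt == max_retries:
                break
            args = repair(args, err)
    return None, attempts


def estimate_canopy_cover(image_path, threshold):
    """Hypothetical tool that rejects thresholds outside [0, 1]."""
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("threshold must be in [0, 1]")
    return {"image": image_path, "cover": 0.5}  # placeholder value, not a real output


def rescale_threshold(args, err):
    """Hypothetical repair: treat an out-of-range threshold as a percentage."""
    fixed = dict(args)
    if fixed["threshold"] > 1.0:
        fixed["threshold"] = fixed["threshold"] / 100.0
    return fixed


result, attempts = run_with_recovery(
    estimate_canopy_cover,
    {"image_path": "field.png", "threshold": 40},
    rescale_threshold,
)
```

Logging every attempt, rather than only the final result, keeps failed calls and retries available for analysis, so recovery behavior can be counted separately from first-try success.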

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trace-annotated evaluation style could expose comparable tool-use weaknesses when applied to other visual decision domains such as medical imaging or environmental monitoring.
  • Widespread adoption of the benchmark might shift training practices toward explicit supervision of intermediate tool steps rather than end-to-end answer prediction.
  • If models improve substantially on AgroTools, the resulting capabilities could support real-time advisory systems that help farmers execute precision interventions from field imagery.
  • Extending the tool set or adding live sensor feeds would test whether the current bottlenecks persist under more dynamic conditions.

Load-bearing premise

The 14 agricultural tools, five task families, and structured tool-use trace annotations accurately represent the requirements and workflows of real-world precision-sensitive agricultural decision-making.

What would settle it

A multimodal model that plans correct tool sequences, generates valid arguments, recovers from execution errors, and reaches high success rates on most of the 539 queries would indicate that the reported unreliability does not apply to sufficiently advanced systems.

Figures

Figures reproduced from arXiv: 2605.22366 by Bohan Shi, Haohuan Fu, Jianxi Huang, Jiarui Zhang, Jing Wu, Juepeng Zheng, Kun Zeng, Xiaoya Fan, Xinyu Zhang, Yibin Wen, Yutong Lu, Zi Ye, Zurong Mai.

Figure 1: Comparison between existing agricultural benchmarks and AgroTools. Compared with …
Figure 2: (a) The number of samples across different tasks in AgroTools, some samples can be …
Figure 3: AgroTools curation pipeline contains four stages: Data Sources and Preprocessing, Query …
Figure 4: Detailed statistics of AgroTools.
Figure 5: Task-level FAS across representative models. Scores are computed over all samples in the …
Figure 6: Some representative examples from the five task categories.
Figure 7: Prompt templates used for LLM-based extraction and evaluation.
Figure 8: Answer verification prompt for human participants.
Figure 9: Human and LLM Scoring Comparison Across Different Models.
Figure 10: Tool-call outcomes and retry behavior in end-to-end evaluation.
Figures 11-17: Case Study of GPT-5.4 and InternVL3.5-14B on AgroTools Tasks.

Original abstract

Agricultural decision-making increasingly requires multimodal systems that can transform visual observations into reliable, executable actions. However, existing agricultural multimodal benchmarks mainly evaluate final-answer correctness and provide limited support for assessing whether models can use external tools to complete precision-sensitive workflows. In this paper, we introduce AgroTools, a benchmark for evaluating tool-augmented multimodal agents in agriculture. AgroTools contains 539 question-answer instances paired with 1,097 heterogeneous agricultural images, spanning five task families and an executable environment of 14 agricultural tools. Each query is annotated with structured tool-use traces, enabling a dual-view evaluation of both process-level execution quality and outcome-level task success. We benchmark 9 open-source and 4 closed-source multimodal large language models on AgroTools. Results show that current models remain far from reliable in agricultural tool-use settings, with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis. We hope AgroTools will support future research on multimodal agents for high-precision agricultural applications. The benchmark and evaluation are available at https://huggingface.co/datasets/AgroTools/AgroTools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces AgroTools, a benchmark for tool-augmented multimodal agents in agriculture consisting of 539 question-answer instances paired with 1,097 heterogeneous images. It spans five task families supported by an executable environment of 14 agricultural tools, with each instance annotated by structured tool-use traces to enable dual-view evaluation of process-level execution quality and outcome-level task success. The authors benchmark 9 open-source and 4 closed-source multimodal LLMs, reporting that current models remain far from reliable with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis.

Significance. If the benchmark design is shown to be representative of real-world precision agriculture, the work would be significant for shifting evaluation focus from final-answer correctness to full tool-use workflows in a high-stakes domain. The executable environment, structured traces, and public dataset release on Hugging Face provide concrete resources that could support reproducible research on multimodal agents for agricultural applications.

major comments (2)
  1. [Benchmark construction] Benchmark construction (as described in the abstract and implied methodology): the selection of the 14 agricultural tools, five task families, and 539 instances is presented without any reported expert review, comparison to actual farm records, or sensitivity analysis. This directly affects the central claim of clear bottlenecks, because altering tool definitions or task families could change which failure modes (tool planning, argument generation) appear dominant.
  2. [Evaluation and results] Evaluation and results sections: the abstract reports high-level findings on model performance but supplies no information on task validation, annotation quality checks, or statistical details of the comparisons across the 13 models. Without these, it is unclear whether the observed bottlenecks in execution recovery and final-answer synthesis are robust or sensitive to annotation choices.
minor comments (1)
  1. [Abstract] The abstract would benefit from a brief statement on how the dual-view evaluation metrics are computed to help readers assess the process-level results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The feedback on benchmark construction and evaluation details is valuable, and we have prepared point-by-point responses below. We plan to revise the paper to incorporate additional documentation and analyses that address the raised concerns while preserving the core contributions of AgroTools.

Point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction (as described in the abstract and implied methodology): the selection of the 14 agricultural tools, five task families, and 539 instances is presented without any reported expert review, comparison to actual farm records, or sensitivity analysis. This directly affects the central claim of clear bottlenecks, because altering tool definitions or task families could change which failure modes (tool planning, argument generation) appear dominant.

    Authors: We thank the referee for this important point. The 14 tools were selected to represent standard operations in precision agriculture (e.g., image-based crop health assessment, soil parameter estimation, and yield forecasting), drawing from widely cited agricultural literature and public datasets. The five task families were designed to span representative decision workflows. We acknowledge that the original submission did not include explicit expert review documentation or sensitivity analysis. In the revised manuscript, we will add a dedicated 'Benchmark Construction' subsection that details the selection rationale with supporting references, describes consultations with agricultural domain experts during development, and reports a sensitivity analysis on task and tool variations to verify that the observed bottlenecks in tool planning and execution recovery remain consistent. Regarding direct comparison to proprietary farm records, such data are typically private and not publicly available for benchmarking; our design instead relies on heterogeneous public imagery and established agricultural task definitions to ensure reproducibility.

    Revision: yes

  2. Referee: [Evaluation and results] Evaluation and results sections: the abstract reports high-level findings on model performance but supplies no information on task validation, annotation quality checks, or statistical details of the comparisons across the 13 models. Without these, it is unclear whether the observed bottlenecks in execution recovery and final-answer synthesis are robust or sensitive to annotation choices.

    Authors: We agree that greater transparency on validation and statistics is needed. The 539 instances and associated tool-use traces underwent iterative internal review by the authors (who combine expertise in multimodal AI and agronomy) to ensure correctness and executability. In the revised manuscript, we will expand the Evaluation and Results sections to include: (i) the annotation protocol and quality assurance procedures, (ii) inter-annotator agreement metrics on a sampled subset, and (iii) statistical details such as confidence intervals and significance tests (e.g., McNemar or Wilcoxon signed-rank tests) for model comparisons. These additions will substantiate that the reported bottlenecks in execution recovery and final-answer synthesis are robust rather than sensitive to specific annotation decisions.

    Revision: yes
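
As an illustration of the kind of paired comparison the rebuttal mentions, the sketch below implements a two-sided exact McNemar test using only the Python standard library. It assumes each model yields one pass/fail flag per benchmark instance; the function name and the toy flags are hypothetical and are not results from the paper.

```python
from math import comb


def mcnemar_exact(a_correct, b_correct):
    """Two-sided exact McNemar test on paired per-instance correctness flags.

    Only discordant pairs carry information: b counts instances model A solved
    and model B missed, c counts the reverse. Under the null hypothesis each
    discordant pair is equally likely to favor either model.
    """
    if len(a_correct) != len(b_correct):
        raise ValueError("both models must be scored on the same instances")
    b = sum(1 for x, y in zip(a_correct, b_correct) if x and not y)
    c = sum(1 for x, y in zip(a_correct, b_correct) if y and not x)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Binomial(n, 0.5) probability of a split at least as lopsided as k.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy flags for ten instances (made up for illustration).
model_a = [True] * 8 + [False] * 2
model_b = [True] * 3 + [False] * 5 + [True, False]

b, c, p_value = mcnemar_exact(model_a, model_b)
```

Because the test conditions on discordant pairs, two models with very different accuracies can still be statistically indistinguishable when few instances separate them, which is why per-instance outcomes rather than aggregate scores are needed for this comparison.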

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction with no derivations or self-referential reductions

The paper presents AgroTools as an empirical benchmark consisting of 539 question-answer instances, 1,097 images, five task families, and an executable environment with 14 agricultural tools, along with model evaluations on 13 multimodal LLMs. No mathematical derivations, first-principles predictions, parameter fittings, or equations are claimed. The reported bottlenecks in tool planning and related areas are direct empirical observations from running the benchmark, not quantities that reduce to the benchmark design by construction. Any self-citations (if present) are not invoked to justify uniqueness theorems or load-bearing premises that would create circularity. The work is self-contained as a dataset and evaluation contribution without the patterns of self-definitional logic or fitted inputs renamed as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied benchmark paper that contributes a new dataset and evaluation protocol rather than relying on mathematical axioms, fitted parameters, or postulated entities.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 5 internal anchors

  1. [1]

    Introducing claude sonnet 4.6, 2026

    Anthropic. Introducing claude sonnet 4.6, 2026. URL https://www.anthropic.com/news/ claude-sonnet-4-6. Official release page. Accessed: 2026-05-07

  2. [2]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 1(2):3, 2023

  3. [3]

    Adapting vision-language models for precision agriculture: A study on crop segmentation based on uav remote sensing data

    Yuhui Bie, Guowei Xu, and Yaojun Wang. Adapting vision-language models for precision agriculture: A study on crop segmentation based on uav remote sensing data. In2025 13th International Conference on Agro-Geoinformatics (Agro-Geoinformatics), pages 1–6. IEEE, 2025

  4. [4]

    Agrichat: A multimodal large language model for agriculture image understanding, 2026

    Abderrahmene Boudiaf, Irfan Hussain, and Sajid Javed. Agrichat: A multimodal large language model for agriculture image understanding, 2026

  5. [5]

    Abductivemllm: Boosting visual abductive reasoning within mllms.arXiv preprint arXiv:2601.02771, 2026

    Boyu Chang, Qi Wang, Xi Guo, Zhixiong Nan, Yazhou Yao, and Tianfei Zhou. Abductivemllm: Boosting visual abductive reasoning within mllms.arXiv preprint arXiv:2601.02771, 2026

  6. [6]

    Advancing tool-augmented large language models: Integrating insights from errors in inference trees.Advances in Neural Information Processing Systems, 37:106555–106581, 2024

    Sijia Chen, Yibo Wang, Yi-Feng Wu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Lijun Zhang. Advancing tool-augmented large language models: Integrating insights from errors in inference trees.Advances in Neural Information Processing Systems, 37:106555–106581, 2024

  7. [7]

    Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

    Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

  8. [8]

    Expanding performance boundaries of open- source multimodal models with model.Data, and Test-Time Scaling, 2024

    Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open- source multimodal models with model.Data, and Test-Time Scaling, 2024

  9. [9]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

  10. [10]

    Utilizing vision-language models for detection of leaf-based diseases in tomatoes

    James Blossom Eleojo. Utilizing vision-language models for detection of leaf-based diseases in tomatoes. InProceedings of the AAAI Conference on Artificial Intelligence, pages 29567–29569, 2025

  11. [11]

    guided mllm reasoning: Enhancing mllm with knowledge and visual notes for visual question answering

    Wenlong Fang, Qiaofeng Wu, Jing Chen, and Yun Xue. guided mllm reasoning: Enhancing mllm with knowledge and visual notes for visual question answering. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19597–19607, 2025

  12. [12]

    Earth-agent: Unlocking the full landscape of earth observation with agents, 2026

    Peilin Feng, Zhutao Lv, Junyan Ye, Xiaolei Wang, Xinjie Huo, Jinhua Yu, Wanghan Xu, Wenlong Zhang, Lei Bai, Conghui He, and Weijia Li. Earth-agent: Unlocking the full landscape of earth observation with agents, 2026. 10

  13. [13]

    Simultaneous fruit detection and size estimation using multitask deep neural networks.Biosystems Engineering, 233:63–75, 2023

    Mar Ferrer-Ferrer, Javier Ruiz-Hidalgo, Eduard Gregorio, Verónica Vilaplana, Josep-Ramon Morros, and Jordi Gené-Mola. Simultaneous fruit detection and size estimation using multitask deep neural networks.Biosystems Engineering, 233:63–75, 2023

  14. [14]

    Open-canopy: Towards very high resolution forest monitoring

    Fajwel Fogel, Yohann Perron, Nikola Besic, Laurent Saint-André, Agnès Pellissier-Tanon, Martin Schwartz, Thomas Boudras, Ibrahim Fayad, Alexandre d’Aspremont, Loic Landrieu, et al. Open-canopy: Towards very high resolution forest monitoring. InProceedings of the computer vision and pattern recognition conference, pages 1395–1406, 2025

  15. [15]

    Agmmu: A comprehensive agricultural multimodal understanding and reasoning benchmark.arXiv e-prints, pages arXiv–2504, 2025

    Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S Adve, and Yu-Xiong Wang. Agmmu: A comprehensive agricultural multimodal understanding and reasoning benchmark.arXiv e-prints, pages arXiv–2504, 2025

  16. [16]

    Mcp-agentbench: Evaluating real-world language agent performance with mcp-mediated tools

    Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, and Zhendong Mao. Mcp-agentbench: Evaluating real-world language agent performance with mcp-mediated tools. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37):30888–30896, Mar. 2026

  17. [17]

    Multi-modal instruction tuned llms with fine-grained visual perception

    Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, and Xuansong Xie. Multi-modal instruction tuned llms with fine-grained visual perception. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 13980–13990, 2024

  18. [18]

    Mm- boundary: Advancing mllm knowledge boundary awareness through reasoning step confidence calibration

    Zhitao He, Sandeep Polisetty, Zhiyuan Fan, Yuchen Huang, Shujin Wu, and Yi R Fung. Mm- boundary: Advancing mllm knowledge boundary awareness through reasoning step confidence calibration. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16427–16444, 2025

  19. [19]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

  20. [20]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

  21. [21]

    Mllm-compbench: A comparative reasoning benchmark for multimodal llms.Advances in Neural Information Processing Systems, 37:28798–28827, 2024

    Jihyung Kil, Zheda Mai, Justin Lee, Arpita Chowdhury, Zihe Wang, Kerrie Cheng, Lemeng Wang, Ye Liu, and Wei-Lun Chao. Mllm-compbench: A comparative reasoning benchmark for multimodal llms.Advances in Neural Information Processing Systems, 37:28798–28827, 2024

  22. [22]

    Gpt-5 and open-weight large language models: Advances in reasoning, trans- parency, and control.Information Systems, page 102620, 2025

    Maikel Leon. Gpt-5 and open-weight large language models: Advances in reasoning, trans- parency, and control.Information Systems, page 102620, 2025

  23. [23]

    API-bank: A comprehensive benchmark for tool-augmented LLMs

    Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-bank: A comprehensive benchmark for tool-augmented LLMs. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116, Singapore, December

  24. [24]

    Association for Computational Linguistics

  25. [25]

    Can large multimodal models understand agricultural scenes? benchmarking with agromind.arXiv preprint arXiv:2505.12207, 2025

    Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, et al. Can large multimodal models understand agricultural scenes? benchmarking with agromind.arXiv preprint arXiv:2505.12207, 2025

  26. [26]

    Improving context understanding in multimodal large language models via multimodal composition learning

    Wei Li, Hehe Fan, Yongkang Wong, Yi Yang, and Mohan S Kankanhalli. Improving context understanding in multimodal large language models via multimodal composition learning. In ICML, page 7, 2024

  27. [27]

    Deepagent: A general reasoning agent with scalable toolsets

    Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, and Zhicheng Dou. Deepagent: A general reasoning agent with scalable toolsets. InProceedings of the ACM Web Conference 2026, WWW ’26, page 2219–2230, New York, NY , USA, 2026. Association for Computing Machinery. ISBN 9798400723070. 11

  28. [28]

    A survey of state of the art large vision language models: Benchmark evaluations and challenges

    Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Benchmark evaluations and challenges. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1587–1606, 2025

  29. [29]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

  30. [30]

    Mengxi Liu, Zhuoqun Chai, Haojun Deng, and Rong Liu. A cnn-transformer network with multiscale context aggregation for fine-grained cropland change detection.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15:4297–4306, 2022

  31. [31]

    ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities

    Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, and Ruoming Pang. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational L...

  32. [32]

    Visual perception by large language model’s weights.Advances in Neural Information Processing Systems, pages 28615–28635, 2024

    Feipeng Ma, Hongwei Xue, Yizhou Zhou, Guangting Wang, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, and Xiaoyan Sun. Visual perception by large language model’s weights.Advances in Neural Information Processing Systems, pages 28615–28635, 2024

  33. [33]

    m &m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks

    Zixian Ma, Weikai Huang, Jieyu Zhang, Tanmay Gupta, and Ranjay Krishna. m &m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision – ECCV 2024, pages 18–34, Cham, 2025. Springer Nature Switzerland

  34. [34]

    Deepweeds: A multiclass weed species image dataset for deep learning.Scientific reports, 9(1):2058, 2019

    Alex Olsen, Dmitry A Konovalov, Bronson Philippa, Peter Ridd, Jake C Wood, Jamie Johns, Wesley Banks, Benjamin Girgenti, Owen Kenny, James Whinney, et al. Deepweeds: A multiclass weed species image dataset for deep learning.Scientific reports, 9(1):2058, 2019

  35. [35]

    Weedmap: A large-scale semantic weed mapping framework using aerial multispectral imaging and deep neural network for precision farming.Remote Sensing, 10(9):1423, 2018

    Inkyu Sa, Marija Popovi´c, Raghav Khanna, Zetao Chen, Philipp Lottes, Frank Liebisch, Juan Nieto, Cyrill Stachniss, Achim Walter, and Roland Siegwart. Weedmap: A large-scale semantic weed mapping framework using aerial multispectral imaging and deep neural network for precision farming.Remote Sensing, 10(9):1423, 2018

  36. [36]

    deepnir: Datasets for generat- ing synthetic nir images and improved fruit detection system using deep learning techniques

    Inkyu Sa, Jong Yoon Lim, Ho Seok Ahn, and Bruce MacDonald. deepnir: Datasets for generat- ing synthetic nir images and improved fruit detection system using deep learning techniques. Sensors, 22(13):4721, 2022

  37. [37]

    Multi- modal llms in agriculture: A comprehensive review.IEEE Transactions on Automation Science and Engineering, 2025

    Ranjan Sapkota, Rizwan Qureshi, Muhammad Usman Hadi, Syed Zohaib Hassan, Ferhat Sadak, Maged Shoman, Muhammad Sajjad, Fayaz Ali Dharejo, Achyut Paudel, Jiajia Li, et al. Multi- modal llms in agriculture: A comprehensive review.IEEE Transactions on Automation Science and Engineering, 2025

  38. [38]

    Weedsense: Multi-task learning for weed segmentation, height estimation, and growth stage classification

    Toqi Tahamid Sarker, Khaled R Ahmed, Taminul Islam, Cristiana Bernardi Rankrape, and Karla Gage. Weedsense: Multi-task learning for weed segmentation, height estimation, and growth stage classification. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7180–7190, 2025

  39. [39]

    Thinkgeo: Evaluating tool-augmented agents for remote sensing tasks, 2026

    Akashah Shabbir, Muhammad Akhtar Munir, Akshay Dudhane, Muhammad Umer Sheikh, Muhammad Haris Khan, Paolo Fraccaro, Juan Bernabe Moreno, Fahad Shahbaz Khan, and Salman Khan. Thinkgeo: Evaluating tool-augmented agents for remote sensing tasks, 2026

  40. [40]

    Agrobench: Vision-language model benchmark in agriculture

    Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka, Masaki Onishi, and Yoshitaka Ushiku. Agrobench: Vision-language model benchmark in agriculture. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7634–7644, 2025. 12

  41. [41]

    Zero hunger: future challenges and the way forward towards the achievement of sustainable development goal 2.Sustainable earth reviews, 7(1):10, 2024

    Fabio Sporchia, Marta Antonelli, Alicia Aguilar-Martínez, Anna Bach-Faig, Dario Caro, Kyle F Davis, Roberta Sonnino, and Alessandro Galli. Zero hunger: future challenges and the way forward towards the achievement of sustainable development goal 2.Sustainable earth reviews, 7(1):10, 2024

  42. [42]

    Zhendong Sun, Yanfei Zhong, Xinyu Wang, and Liangpei Zhang. Identifying cropland non-agriculturalization with high representational consistency from bi-temporal high-resolution remote sensing images: From benchmark datasets to real-world application. ISPRS Journal of Photogrammetry and Remote Sensing, 212:454–474, 2024

  43. [43]

    Chain of visual perception: Harnessing multimodal large language models for zero-shot camouflaged object detection

    Lv Tang, Peng-Tao Jiang, Zhi-Hao Shen, Hao Zhang, Jin-Wei Chen, and Bo Li. Chain of visual perception: Harnessing multimodal large language models for zero-shot camouflaged object detection. In Proceedings of the 32nd ACM international conference on multimedia, pages 8805–8814, 2024

  44. [44]

    Seed 2.0: Towards intelligence frontier for real-world complexity

    ByteDance Seed Team. Seed 2.0: Towards intelligence frontier for real-world complexity. Technical report, ByteDance, 2026. URL https://seed.bytedance.com/zh/blog/seed2-0-%E6%AD%A3%E5%BC%8F%E5%8F%91%E5%B8%83. Accessed: 2026-05-04

  45. [45]

    Fpcd: An open aerial vhr dataset for farm pond change detection. arXiv preprint arXiv:2302.14554, 2023

    Chintan Tundia, Rajiv Kumar, Om Damani, and G Sivakumar. Fpcd: An open aerial vhr dataset for farm pond change detection. arXiv preprint arXiv:2302.14554, 2023

  46. [46]

    Uas-based remote sensing for agricultural monitoring: Current status and perspectives. Computers and Electronics in Agriculture, 227:109501, 2024

    Jingzhe Wang, Silu Zhang, Ivan Lizaga, Yinghui Zhang, Xiangyu Ge, Zipeng Zhang, Wei Zhang, Qiujun Huang, and Zhongwen Hu. Uas-based remote sensing for agricultural monitoring: Current status and perspectives. Computers and Electronics in Agriculture, 227:109501, 2024

  47. [47]

    Gta: a benchmark for general tool agents. Advances in Neural Information Processing Systems, 37:75749–75790, 2024

    Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, and Xinyi Le. Gta: a benchmark for general tool agents. Advances in Neural Information Processing Systems, 37:75749–75790, 2024

  48. [48]

    Advances in deep learning applications for plant disease and pest detection: A review. Remote Sensing, 17(4):698, 2025

    Shaohua Wang, Dachuan Xu, Haojian Liang, Yongqing Bai, Xiao Li, Junyuan Zhou, Cheng Su, and Wenyu Wei. Advances in deep learning applications for plant disease and pest detection: A review. Remote Sensing, 17(4):698, 2025

  49. [49]

    Mcptox: A benchmark for tool poisoning on real-world mcp servers. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42):35811–35819, Mar. 2026

    Zhiqiang Wang, Yichao Gao, Yanting Wang, Suyuan Liu, Haifeng Sun, Haoran Cheng, Guanquan Shi, Haohua Du, and Xiangyang Li. Mcptox: A benchmark for tool poisoning on real-world mcp servers. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42):35811–35819, Mar. 2026

  50. [50]

    A large-scale in-the-wild dataset for plant disease segmentation. Scientific Data, 13(1):205, 2026

    Tianqi Wei, Zhi Chen, Xin Yu, Scott Chapman, Paul Melloy, and Zi Huang. A large-scale in-the-wild dataset for plant disease segmentation. Scientific Data, 13(1):205, 2026

  51. [51]

    Remote sensing for agricultural applications: A meta-review. Remote sensing of environment, 236:111402, 2020

    Marie Weiss, Frédéric Jacob, and Grégory Duveiller. Remote sensing for agricultural applications: A meta-review. Remote sensing of environment, 236:111402, 2020

  52. [52]

    Agrocot: A chain-of-thought benchmark for evaluating reasoning in vision-language models for agriculture, 2026

    Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Zurong Mai, Jing Wu, Shuohong Lou, Yuhang Chen, Henglian Huang, Xiaoya Fan, Yang Zhang, Defeng Gu, Lingyuan Zhao, Yutong Lu, Haohuan Fu, Jianxi Huang, and Juepeng Zheng. Agrocot: A chain-of-thought benchmark for evaluating reasoning in vision-language models for agriculture, 2026

  53. [53]

    Challenges and opportunities in remote sensing-based crop monitoring: A review. National Science Review, 10(4):nwac290, 2023

    Bingfang Wu, Miao Zhang, Hongwei Zeng, Fuyou Tian, Andries B Potgieter, Xingli Qin, Nana Yan, Sheng Chang, Yan Zhao, Qinghan Dong, et al. Challenges and opportunities in remote sensing-based crop monitoring: A review. National Science Review, 10(4):nwac290, 2023

  54. [54]

    Farmseg_vlm: A farmland remote sensing image segmentation method considering vision-language alignment. ISPRS Journal of Photogrammetry and Remote Sensing, 225:423–439, 2025

    Haiyang Wu, Weiliang Mu, Dandan Zhong, Zhuofei Du, Haifeng Li, and Chao Tao. Farmseg_vlm: A farmland remote sensing image segmentation method considering vision-language alignment. ISPRS Journal of Photogrammetry and Remote Sensing, 225:423–439, 2025

  55. [55]

    Ip102: A large-scale benchmark dataset for insect pest recognition

    Xiaoping Wu, Chi Zhan, Yu-Kun Lai, Ming-Ming Cheng, and Jufeng Yang. Ip102: A large-scale benchmark dataset for insect pest recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8787–8796, 2019

  56. [56]

    Multimodal agricultural agent architecture (ma3): A new paradigm for intelligent agricultural decision-making, 2025

    Zhuoning Xu, Jian Xu, Mingqing Zhang, Peijie Wang, Chao Deng, and Cheng-Lin Liu. Multimodal agricultural agent architecture (ma3): A new paradigm for intelligent agricultural decision-making, 2025

  57. [57]

    Agrigpt-omni: A unified speech–vision–text framework for multilingual agricultural intelligence

    Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Jianyu Zhang, Xiao Xu, Nueraili Aierken, and Shijian Li. Agrigpt-omni: A unified speech–vision–text framework for multilingual agricultural intelligence. In Proceedings of the ACM Web Conference 2026, WWW ’26, page 9800–9810, New York, NY, USA, 2026. Association for Computing Machinery. ISBN 9798400723070

  58. [58]

    Look-back: Implicit visual re-focusing in mllm reasoning

    Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, and Li Yuan. Look-back: Implicit visual re-focusing in mllm reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11694–11702, 2026

  59. [59]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

  60. [60]

    Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model

    Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, et al. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2841–2858, 2023

  61. [61]

    A survey on multimodal large language models. National Science Review, page nwae403, 2024

    Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, page nwae403, 2024

  62. [62]

    Agridoctor: A multimodal intelligent assistant for agriculture

    Mingqing Zhang, Zhuoning Xu, Peijie Wang, Rongji Li, Liang Wang, Qiang Liu, Jian Xu, Xuyao Zhang, Shu Wu, and Liang Wang. Agridoctor: A multimodal intelligent assistant for agriculture. In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2741–2745, 2026. doi: 10.1109/ICASSP55912.2026.11464537

  63. [63]

    A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist

    Wentao Zhang, Lingxuan Zhao, Haochong Xia, Shuo Sun, Jiaze Sun, Molei Qin, Xinyi Li, Yuqing Zhao, Yilei Zhao, Xinyu Cai, et al. A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist. In Proceedings of the 30th acm sigkdd conference on knowledge discovery and data mining, pages 4314–4325, 2024

  64. [64]

    Universal multimodal representation for language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 9169–9185, 2023

    Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, and Hai Zhao. Universal multimodal representation for language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 9169–9185, 2023

  65. [65]

    agentar: Creating augmented reality applications with tool-augmented llm-based autonomous agents

    Chenfei Zhu, Shao-Kang Hsia, Xiyun Hu, Ziyi Liu, Jingyu Shi, and Karthik Ramani. agentar: Creating augmented reality applications with tool-augmented llm-based autonomous agents. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, pages 1–23, 2025

  66. [66]

    A benchmark question

  67. [67]

    The question_template id and gt_type

  68. [68]

    An ordered list of slot names and slot types

  69. [69]

    status":

    A model's raw answer to the question. Your task is to extract exactly what the model answer says into the requested slots. You must NOT decide whether the answer is correct. You must NOT infer, repair, or complete missing information from your own knowledge. You will NOT receive the ground-truth answer. Return strict JSON only. Do not use markdown. Do not...

  70. [70]

    Reference key points for this slot

  71. [71]

    score": 0.0,

    The model's extracted answer for this slot. Your task is to evaluate ONLY this single slot. Core principle: - Evaluate semantic correctness, factual coverage, specificity, and internal consistency against the provided reference key points. - Do not require exact wording. - Do not reward answers that are merely topically related but vague. - Do not use ext...

  72. [72]

    Internally decompose the reference into a small number of essential atomic points, usually 1 to 4

  73. [73]

    Group near-equivalent alternatives into one point when they express the same core fact

  74. [74]

    Check which essential points are correctly covered by the model answer

  75. [75]

    Check whether the answer is specific enough for this slot

  76. [76]

    Check whether the answer contains any contradiction, reversal, or invented fact

  77. [77]

    cannot determine

    Assign a final score using the rubric below. General scoring rules: - Semantic equivalence counts. Paraphrases, synonyms, scientific/common names, and equivalent geographic descriptions are acceptable. - Fluency, grammar, or formatting should not affect the score. - Extra correct but non-conflicting details are allowed. - Unsupported extra details do not ...

  78. [78]

    Do not repair a failed answer using your own common-sense knowledge

    Only retain samples whose tool chains are actually executable under the current benchmark setting. Do not repair a failed answer using your own common-sense knowledge

  79. [79]

    Inspect the full ReAct-style trajectory rather than only the final answer

  80. [80]

    For each sample, verify: - whether the `sample_id`, `classification`, `question_template`, attached files, and user question match the corresponding entry in `question-4.json`; - whether the invoked tools are appropriate for the task, with no missing key tools and no irrelevant tools; - whether all critical tool arguments are correct, including image path...
