AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture
Pith reviewed 2026-05-22 07:42 UTC · model grok-4.3
The pith
A new benchmark shows multimodal AI models are unreliable at using tools for agricultural tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgroTools contains 539 question-answer instances paired with 1,097 heterogeneous agricultural images, spanning five task families and an executable environment of 14 agricultural tools. Each query carries structured tool-use traces that support dual evaluation of process-level execution quality and outcome-level task success. Benchmarking nine open-source and four closed-source multimodal large language models demonstrates that current models remain far from reliable in agricultural tool-use settings, with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis.
What carries the argument
The AgroTools benchmark with its 14-tool executable environment and structured tool-use trace annotations that enable separate scoring of execution process and final outcome.
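As an illustration of what scoring process and outcome separately can look like, here is a minimal Python sketch. The trace schema (`ToolCall`), the tool names, and both metrics are assumptions made for this example; the summary above does not say how AgroTools actually computes its process-level or outcome-level scores.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One step of a tool-use trace (hypothetical schema, not the paper's)."""
    tool: str
    args: dict = field(default_factory=dict)


def process_score(predicted: list[ToolCall], reference: list[ToolCall]) -> dict:
    """Compare a predicted trace to a reference trace, position by position."""
    n = len(reference)
    tool_hits = 0
    arg_hits = 0
    for pred, ref in zip(predicted, reference):
        if pred.tool == ref.tool:
            tool_hits += 1
            # Arguments only count when the right tool was chosen.
            if pred.args == ref.args:
                arg_hits += 1
    return {
        "tool_accuracy": tool_hits / n if n else 0.0,
        "arg_accuracy": arg_hits / n if n else 0.0,
        "extra_steps": max(0, len(predicted) - n),
    }


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def outcome_score(predicted_answer: str, reference_answer: str) -> float:
    """Exact match after normalization (a stand-in for the real outcome metric)."""
    return 1.0 if normalize(predicted_answer) == normalize(reference_answer) else 0.0


# Tool names below are invented for illustration.
reference = [
    ToolCall("detect_disease", {"image": "leaf_01.jpg"}),
    ToolCall("estimate_severity", {"image": "leaf_01.jpg", "region": "lesion"}),
]
predicted = [
    ToolCall("detect_disease", {"image": "leaf_01.jpg"}),
    ToolCall("estimate_severity", {"image": "leaf_01.jpg", "region": "whole"}),
]

scores = process_score(predicted, reference)
outcome = outcome_score("Early  blight", "early blight")
print(scores, outcome)
```

In this toy case the final answer matches while one tool argument is wrong, which is the kind of gap an outcome-only metric would hide.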
If this is right
- Agricultural decision support will require specific gains in tool planning and argument generation before models can be deployed reliably.
- Execution recovery mechanisms must be strengthened to handle the error-prone nature of tool calls in farming workflows.
- Final-answer synthesis from tool results forms an additional bottleneck that limits overall task success.
- The dual process-and-outcome evaluation can guide development of agents for high-precision agricultural applications.
- Future models trained or fine-tuned on such traces may close the observed performance gaps.
Where Pith is reading between the lines
- The same trace-annotated evaluation style could expose comparable tool-use weaknesses when applied to other visual decision domains such as medical imaging or environmental monitoring.
- Widespread adoption of the benchmark might shift training practices toward explicit supervision of intermediate tool steps rather than end-to-end answer prediction.
- If models improve substantially on AgroTools, the resulting capabilities could support real-time advisory systems that help farmers execute precision interventions from field imagery.
- Extending the tool set or adding live sensor feeds would test whether the current bottlenecks persist under more dynamic conditions.
Load-bearing premise
The 14 agricultural tools, five task families, and structured tool-use trace annotations accurately represent the requirements and workflows of real-world precision-sensitive agricultural decision-making.
What would settle it
A multimodal model that plans correct tool sequences, generates valid arguments, recovers from execution errors, and reaches high success rates on most of the 539 queries would indicate that the reported unreliability does not apply to sufficiently advanced systems.
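One way to check such a claim per query is to attribute each run to the first stage at which it fails. The sketch below is an assumption-laden illustration: the `RunRecord` fields, the stage order, and the tool names are invented here and are not taken from the paper.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class RunRecord:
    """Summary of one model run on one query (hypothetical fields)."""
    planned_tools: list[str]
    reference_tools: list[str]
    args_valid: bool
    execution_errors: int
    recovered: bool
    answer_correct: bool


def first_bottleneck(run: RunRecord) -> str:
    """Return the earliest stage at which the run went wrong, or 'success'."""
    if run.planned_tools != run.reference_tools:
        return "tool_planning"
    if not run.args_valid:
        return "argument_generation"
    if run.execution_errors > 0 and not run.recovered:
        return "execution_recovery"
    if not run.answer_correct:
        return "final_answer_synthesis"
    return "success"


# Invented tool names and outcomes, one run per stage.
chain = ["count_fruit", "estimate_size"]
runs = [
    RunRecord(["count_fruit"], chain, True, 0, False, False),
    RunRecord(chain, chain, False, 1, False, False),
    RunRecord(chain, chain, True, 2, False, False),
    RunRecord(chain, chain, True, 1, True, False),
    RunRecord(chain, chain, True, 0, False, True),
]

tally = Counter(first_bottleneck(run) for run in runs)
print(dict(tally))
```

Under this scheme, a model that settles the question would place most of the 539 queries in the `success` bucket.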
Original abstract
Agricultural decision-making increasingly requires multimodal systems that can transform visual observations into reliable, executable actions. However, existing agricultural multimodal benchmarks mainly evaluate final-answer correctness and provide limited support for assessing whether models can use external tools to complete precision-sensitive workflows. In this paper, we introduce AgroTools, a benchmark for evaluating tool-augmented multimodal agents in agriculture. AgroTools contains 539 question-answer instances paired with 1,097 heterogeneous agricultural images, spanning five task families and an executable environment of 14 agricultural tools. Each query is annotated with structured tool-use traces, enabling a dual-view evaluation of both process-level execution quality and outcome-level task success. We benchmark 9 open-source and 4 closed-source multimodal large language models on AgroTools. Results show that current models remain far from reliable in agricultural tool-use settings, with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis. We hope AgroTools will support future research on multimodal agents for high-precision agricultural applications. The benchmark and evaluation are available at https://huggingface.co/datasets/AgroTools/AgroTools.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AgroTools, a benchmark for tool-augmented multimodal agents in agriculture consisting of 539 question-answer instances paired with 1,097 heterogeneous images. It spans five task families supported by an executable environment of 14 agricultural tools, with each instance annotated by structured tool-use traces to enable dual-view evaluation of process-level execution quality and outcome-level task success. The authors benchmark 9 open-source and 4 closed-source multimodal LLMs, reporting that current models remain far from reliable with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis.
Significance. If the benchmark design is shown to be representative of real-world precision agriculture, the work would be significant for shifting evaluation focus from final-answer correctness to full tool-use workflows in a high-stakes domain. The executable environment, structured traces, and public dataset release on Hugging Face provide concrete resources that could support reproducible research on multimodal agents for agricultural applications.
major comments (2)
- [Benchmark construction] Benchmark construction (as described in the abstract and implied methodology): the selection of the 14 agricultural tools, five task families, and 539 instances is presented without any reported expert review, comparison to actual farm records, or sensitivity analysis. This directly affects the central claim of clear bottlenecks, because altering tool definitions or task families could change which failure modes (tool planning, argument generation) appear dominant.
- [Evaluation and results] Evaluation and results sections: the abstract reports high-level findings on model performance but supplies no information on task validation, annotation quality checks, or statistical details of the comparisons across the 13 models. Without these, it is unclear whether the observed bottlenecks in execution recovery and final-answer synthesis are robust or sensitive to annotation choices.
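To make the request for statistical details concrete, the following Python sketch computes a Wilson score interval for a single model's success rate on a 539-instance benchmark. The success count used here is made up for illustration and is not a result reported in the paper; the function name is likewise an assumption.

```python
from math import sqrt


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= successes <= total:
        raise ValueError("successes must lie between 0 and total")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


N_INSTANCES = 539             # benchmark size stated in the abstract
hypothetical_successes = 216  # invented count, NOT a reported result

low, high = wilson_interval(hypothetical_successes, N_INSTANCES)
print(f"success rate {hypothetical_successes / N_INSTANCES:.3f}, "
      f"95% interval [{low:.3f}, {high:.3f}]")
```

At this sample size and a success rate near 40%, the interval spans roughly four percentage points on either side, so smaller gaps between two models are hard to interpret without a paired test.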
minor comments (1)
- [Abstract] The abstract would benefit from a brief statement on how the dual-view evaluation metrics are computed to help readers assess the process-level results.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. The feedback on benchmark construction and evaluation details is valuable, and we have prepared point-by-point responses below. We plan to revise the paper to incorporate additional documentation and analyses that address the raised concerns while preserving the core contributions of AgroTools.
- Referee: [Benchmark construction] Benchmark construction (as described in the abstract and implied methodology): the selection of the 14 agricultural tools, five task families, and 539 instances is presented without any reported expert review, comparison to actual farm records, or sensitivity analysis. This directly affects the central claim of clear bottlenecks, because altering tool definitions or task families could change which failure modes (tool planning, argument generation) appear dominant.
  Authors: We thank the referee for this important point. The 14 tools were selected to represent standard operations in precision agriculture (e.g., image-based crop health assessment, soil parameter estimation, and yield forecasting), drawing from widely cited agricultural literature and public datasets. The five task families were designed to span representative decision workflows. We acknowledge that the original submission did not include explicit expert review documentation or sensitivity analysis. In the revised manuscript, we will add a dedicated 'Benchmark Construction' subsection that details the selection rationale with supporting references, describes consultations with agricultural domain experts during development, and reports a sensitivity analysis on task and tool variations to verify that the observed bottlenecks in tool planning and execution recovery remain consistent. Regarding direct comparison to proprietary farm records, such data are typically private and not publicly available for benchmarking; our design instead relies on heterogeneous public imagery and established agricultural task definitions to ensure reproducibility.
  Revision: yes
- Referee: [Evaluation and results] Evaluation and results sections: the abstract reports high-level findings on model performance but supplies no information on task validation, annotation quality checks, or statistical details of the comparisons across the 13 models. Without these, it is unclear whether the observed bottlenecks in execution recovery and final-answer synthesis are robust or sensitive to annotation choices.
  Authors: We agree that greater transparency on validation and statistics is needed. The 539 instances and associated tool-use traces underwent iterative internal review by the authors (who combine expertise in multimodal AI and agronomy) to ensure correctness and executability. In the revised manuscript, we will expand the Evaluation and Results sections to include: (i) the annotation protocol and quality assurance procedures, (ii) inter-annotator agreement metrics on a sampled subset, and (iii) statistical details such as confidence intervals and significance tests (e.g., McNemar or Wilcoxon signed-rank tests) for model comparisons. These additions will substantiate that the reported bottlenecks in execution recovery and final-answer synthesis are robust rather than sensitive to specific annotation decisions.
  Revision: yes
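For readers unfamiliar with the paired tests named in the rebuttal, here is a minimal Python sketch of an exact (binomial) McNemar test on per-instance success flags for two models evaluated on the same queries. The function name and the toy data are assumptions for illustration; nothing here reproduces the authors' analysis or results.

```python
from math import comb


def mcnemar_exact(success_a: list[bool], success_b: list[bool]) -> dict:
    """Two-sided exact McNemar test on paired binary outcomes."""
    if len(success_a) != len(success_b):
        raise ValueError("both models must be scored on the same instances")
    # Only discordant pairs carry information about a difference.
    b = sum(1 for x, y in zip(success_a, success_b) if x and not y)
    c = sum(1 for x, y in zip(success_a, success_b) if y and not x)
    n = b + c
    if n == 0:
        return {"b": 0, "c": 0, "p_value": 1.0}
    k = min(b, c)
    # Under the null, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return {"b": b, "c": c, "p_value": min(1.0, 2 * tail)}


# Toy data: 20 shared queries, invented outcomes.
model_a = [True] * 8 + [False] * 1 + [True] * 5 + [False] * 6
model_b = [False] * 8 + [True] * 1 + [True] * 5 + [False] * 6

result = mcnemar_exact(model_a, model_b)
print(result)
```

The test ignores queries where both models succeed or both fail, which is why it suits comparisons across the 13 models on a shared set of instances.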
Circularity Check
No circularity: empirical benchmark construction with no derivations or self-referential reductions
The paper presents AgroTools as an empirical benchmark consisting of 539 question-answer instances, 1,097 images, five task families, and an executable environment with 14 agricultural tools, along with model evaluations on 13 multimodal LLMs. No mathematical derivations, first-principles predictions, parameter fittings, or equations are claimed. The reported bottlenecks in tool planning and related areas are direct empirical observations from running the benchmark, not quantities that reduce to the benchmark design by construction. Any self-citations (if present) are not invoked to justify uniqueness theorems or load-bearing premises that would create circularity. The work is self-contained as a dataset and evaluation contribution without the patterns of self-definitional logic or fitted inputs renamed as predictions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean (and Cost, Constants, DimensionForcing modules), reality_from_one_distinction
  Tag: unclear. Relation between the paper passage and the cited Recognition theorem.
  Passage: AgroTools contains 539 question-answer instances ... executable environment of 14 agricultural tools ... dual-view evaluation of both process-level execution quality and outcome-level task success.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.