AgroTools: A Benchmark for Tool-Augmented Multimodal Agents in Agriculture

Bohan Shi; Haohuan Fu; Jianxi Huang; Jiarui Zhang; Jing Wu; Juepeng Zheng; Kun Zeng; Xiaoya Fan; Xinyu Zhang; Yibin Wen

arXiv: 2605.22366 · v1 · pith:N5XLWGZC · submitted 2026-05-21 · 💻 cs.CV

Zi Ye, Yibin Wen, Xiaoya Fan, Xinyu Zhang, Jing Wu, Kun Zeng, Zurong Mai, Jiarui Zhang, Bohan Shi, Juepeng Zheng, Jianxi Huang, Yutong Lu, Haohuan Fu

Pith reviewed 2026-05-22 07:42 UTC · model grok-4.3

classification 💻 cs.CV

keywords: agriculture · multimodal agents · tool-augmented models · benchmark · tool use evaluation · precision agriculture · multimodal large language models · agent reliability

The pith

A new benchmark shows multimodal AI models are unreliable at using tools for agricultural tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces AgroTools to test whether multimodal AI systems can turn visual observations into reliable actions by calling external tools in farming contexts. Unlike prior agricultural benchmarks that only score final answers, this one tracks the full workflow of planning tool calls, generating arguments, handling errors, and synthesizing results. The benchmark supplies 539 queries with 1,097 images across five task families, each paired with correct tool-use traces inside an environment of 14 executable agricultural tools. Evaluations of thirteen open- and closed-source models expose consistent failures in tool planning, argument generation, execution recovery, and answer synthesis. If these findings hold, progress toward practical AI for precision agriculture will depend on targeted advances in tool-augmented reasoning rather than image recognition alone.

Core claim

AgroTools contains 539 question-answer instances paired with 1,097 heterogeneous agricultural images, spanning five task families and an executable environment of 14 agricultural tools. Each query carries structured tool-use traces that support dual evaluation of process-level execution quality and outcome-level task success. Benchmarking nine open-source and four closed-source multimodal large language models demonstrates that current models remain far from reliable in agricultural tool-use settings, with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis.
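To make the dual evaluation concrete, here is a minimal sketch of how one trace-annotated query might be scored on both views. Everything in it is an illustrative assumption: the `ToolCall` schema, the tool names, the position-by-position matching rule, and the exact-match outcome check are hypothetical, since this summary does not describe how the paper's actual metrics are computed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """One step of a tool-use trace (hypothetical schema)."""
    tool: str
    args: tuple  # sorted (name, value) pairs, so two calls compare by content


def call(tool: str, **args) -> ToolCall:
    """Build a ToolCall with its arguments in a canonical order."""
    return ToolCall(tool, tuple(sorted(args.items())))


def process_score(predicted: list, reference: list) -> dict:
    """Process-level view: compare traces step by step (assumed rule).

    Missing predicted steps count as misses because scores are divided by
    the reference length; extra predicted steps are ignored in this sketch.
    """
    n = len(reference)
    paired = list(zip(predicted, reference))
    tool_hits = sum(p.tool == r.tool for p, r in paired)
    arg_hits = sum(p == r for p, r in paired)
    return {
        "tool_accuracy": tool_hits / n if n else 1.0,
        "argument_accuracy": arg_hits / n if n else 1.0,
    }


def outcome_score(predicted_answer: str, reference_answer: str) -> float:
    """Outcome-level view: case-insensitive exact match (assumed rule)."""
    same = predicted_answer.strip().lower() == reference_answer.strip().lower()
    return float(same)


# Hypothetical tool names and arguments, for illustration only.
reference = [
    call("detect_objects", image="field_01.png", target="weed"),
    call("estimate_area", mask_id=0),
]
predicted = [
    call("detect_objects", image="field_01.png", target="weed"),
    call("estimate_area", mask_id=1),  # right tool, wrong argument
]

scores = process_score(predicted, reference)
# scores == {"tool_accuracy": 1.0, "argument_accuracy": 0.5}
```

Separating the two views this way lets a run that picks the right tools but passes a wrong argument be distinguished from one that fails at planning, which is the kind of breakdown the paper reports.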

What carries the argument

The AgroTools benchmark with its 14-tool executable environment and structured tool-use trace annotations that enable separate scoring of execution process and final outcome.

If this is right

Agricultural decision support will require specific gains in tool planning and argument generation before models can be deployed reliably.
Execution recovery mechanisms must be strengthened to handle the error-prone nature of tool calls in farming workflows.
Final-answer synthesis from tool results forms an additional bottleneck that limits overall task success.
The dual process-and-outcome evaluation can guide development of agents for high-precision agricultural applications.
Future models trained or fine-tuned on such traces may close the observed performance gaps.
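The execution-recovery point in the list above can be pictured as a retry loop around a tool call. The sketch below is an illustration under stated assumptions, not the benchmark's harness: `ToolError`, `run_with_recovery`, the toy `estimate_canopy_cover` tool, and the `repair` callback (standing in for a model revising its arguments after reading an error message) are all hypothetical names.

```python
from typing import Any, Callable


class ToolError(Exception):
    """Raised by a tool when its arguments are invalid (hypothetical)."""


def estimate_canopy_cover(image: str, threshold: float) -> float:
    """Toy stand-in for an agricultural tool: validates args, returns a dummy."""
    if not 0.0 <= threshold <= 1.0:
        raise ToolError("threshold must be in [0, 1]")
    return 0.42  # placeholder value, not a real measurement


def run_with_recovery(
    tool: Callable[..., Any],
    args: dict,
    repair: Callable[[dict, str], dict],
    max_retries: int = 2,
) -> tuple:
    """Call `tool`; on ToolError, let `repair` propose new args and retry.

    Returns (result, errors) so a harness could score both the final
    outcome and how many failed calls preceded it.
    """
    errors = []
    for _ in range(max_retries + 1):
        try:
            return tool(**args), errors
        except ToolError as exc:
            errors.append(str(exc))
            args = repair(args, str(exc))
    raise ToolError(f"gave up after {len(errors)} failed calls: {errors[-1]}")


def clamp_threshold(args: dict, message: str) -> dict:
    """A trivial repair policy: rescale a percentage into a fraction."""
    fixed = dict(args)
    if "threshold" in message and fixed["threshold"] > 1.0:
        fixed["threshold"] = fixed["threshold"] / 100.0
    return fixed


result, errors = run_with_recovery(
    estimate_canopy_cover,
    {"image": "plot_07.png", "threshold": 35},  # 35 was meant as 35%
    clamp_threshold,
)
# The first call fails, the repaired call succeeds:
# result == 0.42 and errors == ["threshold must be in [0, 1]"]
```

Returning the error history alongside the result is a design choice that keeps recovery behaviour observable, so it can be scored separately from the final answer.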

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trace-annotated evaluation style could expose comparable tool-use weaknesses when applied to other visual decision domains such as medical imaging or environmental monitoring.
Widespread adoption of the benchmark might shift training practices toward explicit supervision of intermediate tool steps rather than end-to-end answer prediction.
If models improve substantially on AgroTools, the resulting capabilities could support real-time advisory systems that help farmers execute precision interventions from field imagery.
Extending the tool set or adding live sensor feeds would test whether the current bottlenecks persist under more dynamic conditions.

Load-bearing premise

The 14 agricultural tools, five task families, and structured tool-use trace annotations accurately represent the requirements and workflows of real-world precision-sensitive agricultural decision-making.

What would settle it

A multimodal model that plans correct tool sequences, generates valid arguments, recovers from execution errors, and reaches high success rates on most of the 539 queries would indicate that the reported unreliability does not apply to sufficiently advanced systems.

Figures

Figures reproduced from arXiv: 2605.22366 by Bohan Shi, Haohuan Fu, Jianxi Huang, Jiarui Zhang, Jing Wu, Juepeng Zheng, Kun Zeng, Xiaoya Fan, Xinyu Zhang, Yibin Wen, Yutong Lu, Zi Ye, Zurong Mai.

**Figure 2.** (a) The number of samples across different tasks in AgroTools, some samples can be …

**Figure 3.** AgroTools curation pipeline contains four stages: Data Sources and Preprocessing, Query …

**Figure 4.** Detailed statistics of AgroTools.

**Figure 5.** Task-level FAS across representative models. Scores are computed over all samples in the …

**Figure 6.** Some representative examples from the five task categories.

**Figure 7.** Prompt templates used for LLM-based extraction and evaluation

**Figure 8.** Answer verification prompt for human participants

**Figure 9.** Human and LLM Scoring Comparison Across Different Models

**Figure 10.** Tool-call outcomes and retry behavior in end-to-end evaluation.

**Figure 11.** Case Study of GPT-5.4 and InternVL3.5-14B on AgroTools Tasks.

**Figure 12.** Case Study of GPT-5.4 and InternVL3.5-14B on AgroTools Tasks.

**Figure 13.** Case Study of GPT-5.4 and InternVL3.5-14B on AgroTools Tasks.

**Figure 14.** Case Study of GPT-5.4 and InternVL3.5-14B on AgroTools Tasks.

**Figure 15.** Case Study of GPT-5.4 and InternVL3.5-14B on AgroTools Tasks.

**Figure 16.** Case Study of GPT-5.4 and InternVL3.5-14B on AgroTools Tasks.

**Figure 17.** Case Study of GPT-5.4 and InternVL3.5-14B on AgroTools Tasks.

Original abstract

Agricultural decision-making increasingly requires multimodal systems that can transform visual observations into reliable, executable actions. However, existing agricultural multimodal benchmarks mainly evaluate final-answer correctness and provide limited support for assessing whether models can use external tools to complete precision-sensitive workflows. In this paper, we introduce AgroTools, a benchmark for evaluating tool-augmented multimodal agents in agriculture. AgroTools contains 539 question-answer instances paired with 1,097 heterogeneous agricultural images, spanning five task families and an executable environment of 14 agricultural tools. Each query is annotated with structured tool-use traces, enabling a dual-view evaluation of both process-level execution quality and outcome-level task success. We benchmark 9 open-source and 4 closed-source multimodal large language models on AgroTools. Results show that current models remain far from reliable in agricultural tool-use settings, with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis. We hope AgroTools will support future research on multimodal agents for high-precision agricultural applications. The benchmark and evaluation are available at https://huggingface.co/datasets/AgroTools/AgroTools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AgroTools gives a focused benchmark for tool-augmented multimodal agents in agriculture and shows clear model failures on planning and execution, though the tasks' connection to actual farm work remains lightly checked.

The paper's main contribution is the AgroTools benchmark itself: 539 instances, 1,097 images, five task families, and an executable set of 14 agricultural tools, each paired with structured tool-use traces. This setup supports a dual evaluation that scores both the agent's process steps and the final outcome, which goes beyond the usual end-result checks in most multimodal benchmarks. They run 13 models, open and closed, and document consistent shortfalls in tool selection, argument formatting, error recovery, and answer synthesis. That breakdown is useful for anyone trying to build agents that must act reliably on visual farm data. The domain choice also matters because agriculture involves precision decisions that affect real outputs like yield and resource use. The executable environment and trace annotations are concrete steps that make the benchmark more actionable than abstract task lists.

The soft spot is validation of the benchmark's scope. The 14 tools and five families are presented without reported input from agronomists, cross-checks against actual farm logs, or tests showing that different tool definitions would produce the same failure patterns. If the tasks were shaped mainly by what current models can handle rather than by documented workflows, the reported bottlenecks could be narrower than claimed. The abstract and high-level results do not include those checks, so the central claim about general unreliability rests partly on untested design choices.

This work is aimed at researchers developing multimodal agents or applying them to domain-specific problems like agriculture. Readers who need a ready testbed for tool-use evaluation will get immediate value from the released dataset and protocol. It is not a foundational theoretical paper, but the construction is clear enough and the domain gap is real enough that it should go to peer review. Referees can usefully press on the representativeness question and on any statistical details of the model comparisons that are not yet visible.

Referee Report

2 major / 1 minor

Summary. The paper introduces AgroTools, a benchmark for tool-augmented multimodal agents in agriculture consisting of 539 question-answer instances paired with 1,097 heterogeneous images. It spans five task families supported by an executable environment of 14 agricultural tools, with each instance annotated by structured tool-use traces to enable dual-view evaluation of process-level execution quality and outcome-level task success. The authors benchmark 9 open-source and 4 closed-source multimodal LLMs, reporting that current models remain far from reliable with clear bottlenecks in tool planning, argument generation, execution recovery, and final-answer synthesis.

Significance. If the benchmark design is shown to be representative of real-world precision agriculture, the work would be significant for shifting evaluation focus from final-answer correctness to full tool-use workflows in a high-stakes domain. The executable environment, structured traces, and public dataset release on Hugging Face provide concrete resources that could support reproducible research on multimodal agents for agricultural applications.

major comments (2)

[Benchmark construction] Benchmark construction (as described in the abstract and implied methodology): the selection of the 14 agricultural tools, five task families, and 539 instances is presented without any reported expert review, comparison to actual farm records, or sensitivity analysis. This directly affects the central claim of clear bottlenecks, because altering tool definitions or task families could change which failure modes (tool planning, argument generation) appear dominant.
[Evaluation and results] Evaluation and results sections: the abstract reports high-level findings on model performance but supplies no information on task validation, annotation quality checks, or statistical details of the comparisons across the 13 models. Without these, it is unclear whether the observed bottlenecks in execution recovery and final-answer synthesis are robust or sensitive to annotation choices.

minor comments (1)

[Abstract] The abstract would benefit from a brief statement on how the dual-view evaluation metrics are computed to help readers assess the process-level results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. The feedback on benchmark construction and evaluation details is valuable, and we have prepared point-by-point responses below. We plan to revise the paper to incorporate additional documentation and analyses that address the raised concerns while preserving the core contributions of AgroTools.

Referee: [Benchmark construction] Benchmark construction (as described in the abstract and implied methodology): the selection of the 14 agricultural tools, five task families, and 539 instances is presented without any reported expert review, comparison to actual farm records, or sensitivity analysis. This directly affects the central claim of clear bottlenecks, because altering tool definitions or task families could change which failure modes (tool planning, argument generation) appear dominant.

Authors: We thank the referee for this important point. The 14 tools were selected to represent standard operations in precision agriculture (e.g., image-based crop health assessment, soil parameter estimation, and yield forecasting), drawing from widely cited agricultural literature and public datasets. The five task families were designed to span representative decision workflows. We acknowledge that the original submission did not include explicit expert review documentation or sensitivity analysis. In the revised manuscript, we will add a dedicated 'Benchmark Construction' subsection that details the selection rationale with supporting references, describes consultations with agricultural domain experts during development, and reports a sensitivity analysis on task and tool variations to verify that the observed bottlenecks in tool planning and execution recovery remain consistent. Regarding direct comparison to proprietary farm records, such data are typically private and not publicly available for benchmarking; our design instead relies on heterogeneous public imagery and established agricultural task definitions to ensure reproducibility.

Revision: yes

Referee: [Evaluation and results] Evaluation and results sections: the abstract reports high-level findings on model performance but supplies no information on task validation, annotation quality checks, or statistical details of the comparisons across the 13 models. Without these, it is unclear whether the observed bottlenecks in execution recovery and final-answer synthesis are robust or sensitive to annotation choices.

Authors: We agree that greater transparency on validation and statistics is needed. The 539 instances and associated tool-use traces underwent iterative internal review by the authors (who combine expertise in multimodal AI and agronomy) to ensure correctness and executability. In the revised manuscript, we will expand the Evaluation and Results sections to include: (i) the annotation protocol and quality assurance procedures, (ii) inter-annotator agreement metrics on a sampled subset, and (iii) statistical details such as confidence intervals and significance tests (e.g., McNemar or Wilcoxon signed-rank tests) for model comparisons. These additions will substantiate that the reported bottlenecks in execution recovery and final-answer synthesis are robust rather than sensitive to specific annotation decisions.

Revision: yes
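As an illustration of the paired significance testing mentioned in the response above, the following is a minimal sketch of an exact (binomial) McNemar test on per-query success flags for two models. The function name `mcnemar_exact` and the outcome lists are assumptions made up for this example; they are not results or code from the paper.

```python
from math import comb


def mcnemar_exact(model_a: list, model_b: list) -> tuple:
    """Exact two-sided McNemar test on paired boolean outcomes.

    Only discordant pairs matter: b counts queries A solved and B missed,
    c counts the reverse. Under the null hypothesis, b ~ Binomial(b + c, 0.5).
    Returns (b, c, p_value).
    """
    if len(model_a) != len(model_b):
        raise ValueError("outcome lists must be the same length")
    b = sum(a and not o for a, o in zip(model_a, model_b))
    c = sum(o and not a for a, o in zip(model_a, model_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Made-up outcomes on ten queries (True = task solved), for illustration only.
outcomes_a = [True] * 6 + [False] + [True] * 3
outcomes_b = [False] * 6 + [True] + [True] * 3

b, c, p_value = mcnemar_exact(outcomes_a, outcomes_b)
# b == 6, c == 1, p_value == 0.125
```

The test uses only the queries on which the two models disagree, which suits a benchmark where every model is run on the same 539 queries.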

Circularity Check

0 steps flagged

No circularity: empirical benchmark construction with no derivations or self-referential reductions

full rationale

The paper presents AgroTools as an empirical benchmark consisting of 539 question-answer instances, 1,097 images, five task families, and an executable environment with 14 agricultural tools, along with model evaluations on 13 multimodal LLMs. No mathematical derivations, first-principles predictions, parameter fittings, or equations are claimed. The reported bottlenecks in tool planning and related areas are direct empirical observations from running the benchmark, not quantities that reduce to the benchmark design by construction. Any self-citations (if present) are not invoked to justify uniqueness theorems or load-bearing premises that would create circularity. The work is self-contained as a dataset and evaluation contribution without the patterns of self-definitional logic or fitted inputs renamed as predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an applied benchmark paper that contributes a new dataset and evaluation protocol rather than relying on mathematical axioms, fitted parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5765 in / 1092 out tokens · 59892 ms · 2026-05-22T07:42:36.286389+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean (and Cost, Constants, DimensionForcing modules) · reality_from_one_distinction · tag: unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

AgroTools contains 539 question-answer instances ... executable environment of 14 agricultural tools ... dual-view evaluation of both process-level execution quality and outcome-level task success.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

82 extracted references · 82 canonical work pages · 5 internal anchors

[1]

Introducing claude sonnet 4.6, 2026

Anthropic. Introducing claude sonnet 4.6, 2026. URL https://www.anthropic.com/news/ claude-sonnet-4-6. Official release page. Accessed: 2026-05-07

work page 2026
[2]

Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 1(2):3, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[3]

Adapting vision-language models for precision agriculture: A study on crop segmentation based on uav remote sensing data

Yuhui Bie, Guowei Xu, and Yaojun Wang. Adapting vision-language models for precision agriculture: A study on crop segmentation based on uav remote sensing data. In2025 13th International Conference on Agro-Geoinformatics (Agro-Geoinformatics), pages 1–6. IEEE, 2025

work page 2025
[4]

Agrichat: A multimodal large language model for agriculture image understanding, 2026

Abderrahmene Boudiaf, Irfan Hussain, and Sajid Javed. Agrichat: A multimodal large language model for agriculture image understanding, 2026

work page 2026
[5]

Abductivemllm: Boosting visual abductive reasoning within mllms.arXiv preprint arXiv:2601.02771, 2026

Boyu Chang, Qi Wang, Xi Guo, Zhixiong Nan, Yazhou Yao, and Tianfei Zhou. Abductivemllm: Boosting visual abductive reasoning within mllms.arXiv preprint arXiv:2601.02771, 2026

work page arXiv 2026
[6]

Advancing tool-augmented large language models: Integrating insights from errors in inference trees.Advances in Neural Information Processing Systems, 37:106555–106581, 2024

Sijia Chen, Yibo Wang, Yi-Feng Wu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Lijun Zhang. Advancing tool-augmented large language models: Integrating insights from errors in inference trees.Advances in Neural Information Processing Systems, 37:106555–106581, 2024

work page 2024
[7]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling.arXiv preprint arXiv:2501.17811, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Expanding performance boundaries of open- source multimodal models with model.Data, and Test-Time Scaling, 2024

Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open- source multimodal models with model.Data, and Test-Time Scaling, 2024

work page 2024
[9]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[10]

Utilizing vision-language models for detection of leaf-based diseases in tomatoes

James Blossom Eleojo. Utilizing vision-language models for detection of leaf-based diseases in tomatoes. InProceedings of the AAAI Conference on Artificial Intelligence, pages 29567–29569, 2025

work page 2025
[11]

guided mllm reasoning: Enhancing mllm with knowledge and visual notes for visual question answering

Wenlong Fang, Qiaofeng Wu, Jing Chen, and Yun Xue. guided mllm reasoning: Enhancing mllm with knowledge and visual notes for visual question answering. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19597–19607, 2025

work page 2025
[12]

Earth-agent: Unlocking the full landscape of earth observation with agents, 2026

Peilin Feng, Zhutao Lv, Junyan Ye, Xiaolei Wang, Xinjie Huo, Jinhua Yu, Wanghan Xu, Wenlong Zhang, Lei Bai, Conghui He, and Weijia Li. Earth-agent: Unlocking the full landscape of earth observation with agents, 2026. 10

work page 2026
[13]

Simultaneous fruit detection and size estimation using multitask deep neural networks.Biosystems Engineering, 233:63–75, 2023

Mar Ferrer-Ferrer, Javier Ruiz-Hidalgo, Eduard Gregorio, Verónica Vilaplana, Josep-Ramon Morros, and Jordi Gené-Mola. Simultaneous fruit detection and size estimation using multitask deep neural networks.Biosystems Engineering, 233:63–75, 2023

work page 2023
[14]

Open-canopy: Towards very high resolution forest monitoring

Fajwel Fogel, Yohann Perron, Nikola Besic, Laurent Saint-André, Agnès Pellissier-Tanon, Martin Schwartz, Thomas Boudras, Ibrahim Fayad, Alexandre d’Aspremont, Loic Landrieu, et al. Open-canopy: Towards very high resolution forest monitoring. InProceedings of the computer vision and pattern recognition conference, pages 1395–1406, 2025

work page 2025
[15]

Agmmu: A comprehensive agricultural multimodal understanding and reasoning benchmark.arXiv e-prints, pages arXiv–2504, 2025

Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S Adve, and Yu-Xiong Wang. Agmmu: A comprehensive agricultural multimodal understanding and reasoning benchmark.arXiv e-prints, pages arXiv–2504, 2025

work page 2025
[16]

Mcp-agentbench: Evaluating real-world language agent performance with mcp-mediated tools

Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, and Zhendong Mao. Mcp-agentbench: Evaluating real-world language agent performance with mcp-mediated tools. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37):30888–30896, Mar. 2026

work page 2026
[17]

Multi-modal instruction tuned llms with fine-grained visual perception

Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, and Xuansong Xie. Multi-modal instruction tuned llms with fine-grained visual perception. InProceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 13980–13990, 2024

work page 2024
[18]

Mm- boundary: Advancing mllm knowledge boundary awareness through reasoning step confidence calibration

Zhitao He, Sandeep Polisetty, Zhiyuan Fan, Yuchen Huang, Shujin Wu, and Yi R Fung. Mm- boundary: Advancing mllm knowledge boundary awareness through reasoning step confidence calibration. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16427–16444, 2025

work page 2025
[19]

GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5 v and glm-4.1 v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.arXiv preprint arXiv:2507.01006, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Mllm-compbench: A comparative reasoning benchmark for multimodal llms.Advances in Neural Information Processing Systems, 37:28798–28827, 2024

Jihyung Kil, Zheda Mai, Justin Lee, Arpita Chowdhury, Zihe Wang, Kerrie Cheng, Lemeng Wang, Ye Liu, and Wei-Lun Chao. Mllm-compbench: A comparative reasoning benchmark for multimodal llms.Advances in Neural Information Processing Systems, 37:28798–28827, 2024

work page 2024
[22]

Gpt-5 and open-weight large language models: Advances in reasoning, trans- parency, and control.Information Systems, page 102620, 2025

Maikel Leon. Gpt-5 and open-weight large language models: Advances in reasoning, trans- parency, and control.Information Systems, page 102620, 2025

work page 2025
[23]

API-bank: A comprehensive benchmark for tool-augmented LLMs

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-bank: A comprehensive benchmark for tool-augmented LLMs. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116, Singapore, December

work page 2023
[24]

Association for Computational Linguistics

work page
[25]

Can large multimodal models understand agricultural scenes? benchmarking with agromind.arXiv preprint arXiv:2505.12207, 2025

Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, et al. Can large multimodal models understand agricultural scenes? benchmarking with agromind.arXiv preprint arXiv:2505.12207, 2025

work page arXiv 2025
[26]

Improving context understanding in multimodal large language models via multimodal composition learning

Wei Li, Hehe Fan, Yongkang Wong, Yi Yang, and Mohan S Kankanhalli. Improving context understanding in multimodal large language models via multimodal composition learning. In ICML, page 7, 2024

work page 2024
[27]

Deepagent: A general reasoning agent with scalable toolsets

Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, and Zhicheng Dou. Deepagent: A general reasoning agent with scalable toolsets. InProceedings of the ACM Web Conference 2026, WWW ’26, page 2219–2230, New York, NY , USA, 2026. Association for Computing Machinery. ISBN 9798400723070. 11

work page 2026
[28]

A survey of state of the art large vision language models: Benchmark evaluations and challenges

Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Benchmark evaluations and challenges. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1587–1606, 2025

work page 2025
[29]

Improved baselines with visual instruction tuning

Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

work page 2024
[30]

Mengxi Liu, Zhuoqun Chai, Haojun Deng, and Rong Liu. A cnn-transformer network with multiscale context aggregation for fine-grained cropland change detection.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15:4297–4306, 2022

work page 2022
[31]

ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, and Ruoming Pang. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational L...

work page 2025
[32]

Visual perception by large language model’s weights.Advances in Neural Information Processing Systems, pages 28615–28635, 2024

Feipeng Ma, Hongwei Xue, Yizhou Zhou, Guangting Wang, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, and Xiaoyan Sun. Visual perception by large language model’s weights.Advances in Neural Information Processing Systems, pages 28615–28635, 2024

work page 2024
[33]

m &m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks

Zixian Ma, Weikai Huang, Jieyu Zhang, Tanmay Gupta, and Ranjay Krishna. m &m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision – ECCV 2024, pages 18–34, Cham, 2025. Springer Nature Switzerland

[34] Alex Olsen, Dmitry A Konovalov, Bronson Philippa, Peter Ridd, Jake C Wood, Jamie Johns, Wesley Banks, Benjamin Girgenti, Owen Kenny, James Whinney, et al. Deepweeds: A multiclass weed species image dataset for deep learning. Scientific reports, 9(1):2058, 2019.

[35] Inkyu Sa, Marija Popović, Raghav Khanna, Zetao Chen, Philipp Lottes, Frank Liebisch, Juan Nieto, Cyrill Stachniss, Achim Walter, and Roland Siegwart. Weedmap: A large-scale semantic weed mapping framework using aerial multispectral imaging and deep neural network for precision farming. Remote Sensing, 10(9):1423, 2018.

[36] Inkyu Sa, Jong Yoon Lim, Ho Seok Ahn, and Bruce MacDonald. deepnir: Datasets for generating synthetic nir images and improved fruit detection system using deep learning techniques. Sensors, 22(13):4721, 2022.

[37] Ranjan Sapkota, Rizwan Qureshi, Muhammad Usman Hadi, Syed Zohaib Hassan, Ferhat Sadak, Maged Shoman, Muhammad Sajjad, Fayaz Ali Dharejo, Achyut Paudel, Jiajia Li, et al. Multi-modal llms in agriculture: A comprehensive review. IEEE Transactions on Automation Science and Engineering, 2025.

[38] Toqi Tahamid Sarker, Khaled R Ahmed, Taminul Islam, Cristiana Bernardi Rankrape, and Karla Gage. Weedsense: Multi-task learning for weed segmentation, height estimation, and growth stage classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7180–7190, 2025.

[39] Akashah Shabbir, Muhammad Akhtar Munir, Akshay Dudhane, Muhammad Umer Sheikh, Muhammad Haris Khan, Paolo Fraccaro, Juan Bernabe Moreno, Fahad Shahbaz Khan, and Salman Khan. Thinkgeo: Evaluating tool-augmented agents for remote sensing tasks, 2026.

[40] Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka, Masaki Onishi, and Yoshitaka Ushiku. Agrobench: Vision-language model benchmark in agriculture. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 7634–7644, 2025.

[41] Fabio Sporchia, Marta Antonelli, Alicia Aguilar-Martínez, Anna Bach-Faig, Dario Caro, Kyle F Davis, Roberta Sonnino, and Alessandro Galli. Zero hunger: future challenges and the way forward towards the achievement of sustainable development goal 2. Sustainable earth reviews, 7(1):10, 2024.

[42] Zhendong Sun, Yanfei Zhong, Xinyu Wang, and Liangpei Zhang. Identifying cropland non-agriculturalization with high representational consistency from bi-temporal high-resolution remote sensing images: From benchmark datasets to real-world application. ISPRS Journal of Photogrammetry and Remote Sensing, 212:454–474, 2024.

[43] Lv Tang, Peng-Tao Jiang, Zhi-Hao Shen, Hao Zhang, Jin-Wei Chen, and Bo Li. Chain of visual perception: Harnessing multimodal large language models for zero-shot camouflaged object detection. In Proceedings of the 32nd ACM international conference on multimedia, pages 8805–8814, 2024.

[44] ByteDance Seed Team. Seed 2.0: Towards intelligence frontier for real-world complexity. Technical report, ByteDance, 2026. URL https://seed.bytedance.com/zh/blog/seed2-0-%E6%AD%A3%E5%BC%8F%E5%8F%91%E5%B8%83. Accessed: 2026-05-04.

[45] Chintan Tundia, Rajiv Kumar, Om Damani, and G Sivakumar. Fpcd: An open aerial vhr dataset for farm pond change detection. arXiv preprint arXiv:2302.14554, 2023.

[46] Jingzhe Wang, Silu Zhang, Ivan Lizaga, Yinghui Zhang, Xiangyu Ge, Zipeng Zhang, Wei Zhang, Qiujun Huang, and Zhongwen Hu. Uas-based remote sensing for agricultural monitoring: Current status and perspectives. Computers and Electronics in Agriculture, 227:109501, 2024.

[47] Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, and Xinyi Le. Gta: a benchmark for general tool agents. Advances in Neural Information Processing Systems, 37:75749–75790, 2024.

[48] Shaohua Wang, Dachuan Xu, Haojian Liang, Yongqing Bai, Xiao Li, Junyuan Zhou, Cheng Su, and Wenyu Wei. Advances in deep learning applications for plant disease and pest detection: A review. Remote Sensing, 17(4):698, 2025.

[49] Zhiqiang Wang, Yichao Gao, Yanting Wang, Suyuan Liu, Haifeng Sun, Haoran Cheng, Guanquan Shi, Haohua Du, and Xiangyang Li. Mcptox: A benchmark for tool poisoning on real-world mcp servers. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42):35811–35819, Mar. 2026.

[50] Tianqi Wei, Zhi Chen, Xin Yu, Scott Chapman, Paul Melloy, and Zi Huang. A large-scale in-the-wild dataset for plant disease segmentation. Scientific Data, 13(1):205, 2026.

[51] Marie Weiss, Frédéric Jacob, and Grégory Duveiller. Remote sensing for agricultural applications: A meta-review. Remote sensing of environment, 236:111402, 2020.

[52] Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Zurong Mai, Jing Wu, Shuohong Lou, Yuhang Chen, Henglian Huang, Xiaoya Fan, Yang Zhang, Defeng Gu, Lingyuan Zhao, Yutong Lu, Haohuan Fu, Jianxi Huang, and Juepeng Zheng. Agrocot: A chain-of-thought benchmark for evaluating reasoning in vision-language models for agriculture, 2026.

[53] Bingfang Wu, Miao Zhang, Hongwei Zeng, Fuyou Tian, Andries B Potgieter, Xingli Qin, Nana Yan, Sheng Chang, Yan Zhao, Qinghan Dong, et al. Challenges and opportunities in remote sensing-based crop monitoring: A review. National Science Review, 10(4):nwac290, 2023.

[54] Haiyang Wu, Weiliang Mu, Dandan Zhong, Zhuofei Du, Haifeng Li, and Chao Tao. Farmseg_vlm: A farmland remote sensing image segmentation method considering vision-language alignment. ISPRS Journal of Photogrammetry and Remote Sensing, 225:423–439, 2025.

[55] Xiaoping Wu, Chi Zhan, Yu-Kun Lai, Ming-Ming Cheng, and Jufeng Yang. Ip102: A large-scale benchmark dataset for insect pest recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8787–8796, 2019.

[56] Zhuoning Xu, Jian Xu, Mingqing Zhang, Peijie Wang, Chao Deng, and Cheng-Lin Liu. Multimodal agricultural agent architecture (ma3): A new paradigm for intelligent agricultural decision-making, 2025.

[57] Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Jianyu Zhang, Xiao Xu, Nueraili Aierken, and Shijian Li. Agrigpt-omni: A unified speech–vision–text framework for multilingual agricultural intelligence. In Proceedings of the ACM Web Conference 2026, WWW ’26, page 9800–9810, New York, NY, USA, 2026. Association for Computing Machinery. ISBN 9798400723070.

[58] Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, and Li Yuan. Look-back: Implicit visual re-focusing in mllm reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11694–11702, 2026.

[59] Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022.

[60] Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, et al. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2841–2858, 2023.

[61] Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, page nwae403, 2024.

[62] Mingqing Zhang, Zhuoning Xu, Peijie Wang, Rongji Li, Liang Wang, Qiang Liu, Jian Xu, Xuyao Zhang, Shu Wu, and Liang Wang. Agridoctor: A multimodal intelligent assistant for agriculture. In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2741–2745, 2026. doi: 10.1109/ICASSP55912.2026.11464537.

[63] Wentao Zhang, Lingxuan Zhao, Haochong Xia, Shuo Sun, Jiaze Sun, Molei Qin, Xinyi Li, Yuqing Zhao, Yilei Zhao, Xinyu Cai, et al. A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist. In Proceedings of the 30th acm sigkdd conference on knowledge discovery and data mining, pages 4314–4325, 2024.

[64] Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, and Hai Zhao. Universal multimodal representation for language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 9169–9185, 2023.

[65] Chenfei Zhu, Shao-Kang Hsia, Xiyun Hu, Ziyi Liu, Jingyu Shi, and Karthik Ramani. agentar: Creating augmented reality applications with tool-augmented llm-based autonomous agents. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, pages 1–23, 2025.

[66] A benchmark question

[67] The question_template id and gt_type

[68] An ordered list of slot names and slot types

[69] A model's raw answer to the question. Your task is to extract exactly what the model answer says into the requested slots. You must NOT decide whether the answer is correct. You must NOT infer, repair, or complete missing information from your own knowledge. You will NOT receive the ground-truth answer. Return strict JSON only. Do not use markdown. Do not...

[70] Reference key points for this slot

[71] The model's extracted answer for this slot. Your task is to evaluate ONLY this single slot. Core principle: - Evaluate semantic correctness, factual coverage, specificity, and internal consistency against the provided reference key points. - Do not require exact wording. - Do not reward answers that are merely topically related but vague. - Do not use ext...

[72] Internally decompose the reference into a small number of essential atomic points, usually 1 to 4

[73] Group near-equivalent alternatives into one point when they express the same core fact

[74] Check which essential points are correctly covered by the model answer

[75] Check whether the answer is specific enough for this slot

[76] Check whether the answer contains any contradiction, reversal, or invented fact

[77] Assign a final score using the rubric below. General scoring rules: - Semantic equivalence counts. Paraphrases, synonyms, scientific/common names, and equivalent geographic descriptions are acceptable. - Fluency, grammar, or formatting should not affect the score. - Extra correct but non-conflicting details are allowed. - Unsupported extra details do not ...

[78] Only retain samples whose tool chains are actually executable under the current benchmark setting. Do not repair a failed answer using your own common-sense knowledge

[79] Inspect the full ReAct-style trajectory rather than only the final answer

[80] For each sample, verify: - whether the `sample_id`, `classification`, `question_template`, attached files, and user question match the corresponding entry in `question-4.json`; - whether the invoked tools are appropriate for the task, with no missing key tools and no irrelevant tools; - whether all critical tool arguments are correct, including image path...

[1] Anthropic. Introducing claude sonnet 4.6, 2026. URL https://www.anthropic.com/news/claude-sonnet-4-6. Official release page. Accessed: 2026-05-07.

[2] Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966, 1(2):3, 2023.

[3] Yuhui Bie, Guowei Xu, and Yaojun Wang. Adapting vision-language models for precision agriculture: A study on crop segmentation based on uav remote sensing data. In 2025 13th International Conference on Agro-Geoinformatics (Agro-Geoinformatics), pages 1–6. IEEE, 2025.

[4] Abderrahmene Boudiaf, Irfan Hussain, and Sajid Javed. Agrichat: A multimodal large language model for agriculture image understanding, 2026.

[5] Boyu Chang, Qi Wang, Xi Guo, Zhixiong Nan, Yazhou Yao, and Tianfei Zhou. Abductivemllm: Boosting visual abductive reasoning within mllms. arXiv preprint arXiv:2601.02771, 2026.

[6] Sijia Chen, Yibo Wang, Yi-Feng Wu, Qing-Guo Chen, Zhao Xu, Weihua Luo, Kaifu Zhang, and Lijun Zhang. Advancing tool-augmented large language models: Integrating insights from errors in inference trees. Advances in Neural Information Processing Systems, 37:106555–106581, 2024.

[7] Xiaokang Chen, Zhiyu Wu, Xingchao Liu, Zizheng Pan, Wen Liu, Zhenda Xie, Xingkai Yu, and Chong Ruan. Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811, 2025.

[8] Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling, 2024.

[9] Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.

[10] James Blossom Eleojo. Utilizing vision-language models for detection of leaf-based diseases in tomatoes. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 29567–29569, 2025.

[11] Wenlong Fang, Qiaofeng Wu, Jing Chen, and Yun Xue. guided mllm reasoning: Enhancing mllm with knowledge and visual notes for visual question answering. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19597–19607, 2025.

[12] Peilin Feng, Zhutao Lv, Junyan Ye, Xiaolei Wang, Xinjie Huo, Jinhua Yu, Wanghan Xu, Wenlong Zhang, Lei Bai, Conghui He, and Weijia Li. Earth-agent: Unlocking the full landscape of earth observation with agents, 2026.

[13] Mar Ferrer-Ferrer, Javier Ruiz-Hidalgo, Eduard Gregorio, Verónica Vilaplana, Josep-Ramon Morros, and Jordi Gené-Mola. Simultaneous fruit detection and size estimation using multitask deep neural networks. Biosystems Engineering, 233:63–75, 2023.

[14] Fajwel Fogel, Yohann Perron, Nikola Besic, Laurent Saint-André, Agnès Pellissier-Tanon, Martin Schwartz, Thomas Boudras, Ibrahim Fayad, Alexandre d’Aspremont, Loic Landrieu, et al. Open-canopy: Towards very high resolution forest monitoring. In Proceedings of the computer vision and pattern recognition conference, pages 1395–1406, 2025.

[15] Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S Adve, and Yu-Xiong Wang. Agmmu: A comprehensive agricultural multimodal understanding and reasoning benchmark. arXiv e-prints, pages arXiv–2504, 2025.

[16] Zikang Guo, Benfeng Xu, Chiwei Zhu, Wentao Hong, Xiaorui Wang, and Zhendong Mao. Mcp-agentbench: Evaluating real-world language agent performance with mcp-mediated tools. Proceedings of the AAAI Conference on Artificial Intelligence, 40(37):30888–30896, Mar. 2026.

[17] Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, and Xuansong Xie. Multi-modal instruction tuned llms with fine-grained visual perception. In Proceedings of the ieee/cvf conference on computer vision and pattern recognition, pages 13980–13990, 2024.

[18] Zhitao He, Sandeep Polisetty, Zhiyuan Fan, Yuchen Huang, Shujin Wu, and Yi R Fung. Mmboundary: Advancing mllm knowledge boundary awareness through reasoning step confidence calibration. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 16427–16444, 2025.

[19] Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, et al. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning. arXiv preprint arXiv:2507.01006, 2025.

[20] Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.

[21] Jihyung Kil, Zheda Mai, Justin Lee, Arpita Chowdhury, Zihe Wang, Kerrie Cheng, Lemeng Wang, Ye Liu, and Wei-Lun Chao. Mllm-compbench: A comparative reasoning benchmark for multimodal llms. Advances in Neural Information Processing Systems, 37:28798–28827, 2024.

[22] Maikel Leon. Gpt-5 and open-weight large language models: Advances in reasoning, transparency, and control. Information Systems, page 102620, 2025.

[23] Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. API-bank: A comprehensive benchmark for tool-augmented LLMs. In Houda Bouamor, Juan Pino, and Kalika Bali, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3102–3116, Singapore, December 2023. Association for Computational Linguistics.

[25] Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, et al. Can large multimodal models understand agricultural scenes? benchmarking with agromind. arXiv preprint arXiv:2505.12207, 2025.

[26] Wei Li, Hehe Fan, Yongkang Wong, Yi Yang, and Mohan S Kankanhalli. Improving context understanding in multimodal large language models via multimodal composition learning. In ICML, page 7, 2024.

[27] Xiaoxi Li, Wenxiang Jiao, Jiarui Jin, Guanting Dong, Jiajie Jin, Yinuo Wang, Hao Wang, Yutao Zhu, Ji-Rong Wen, Yuan Lu, and Zhicheng Dou. Deepagent: A general reasoning agent with scalable toolsets. In Proceedings of the ACM Web Conference 2026, WWW ’26, page 2219–2230, New York, NY, USA, 2026. Association for Computing Machinery. ISBN 9798400723070.

[28] Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Benchmark evaluations and challenges. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 1587–1606, 2025.

[29] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024.

[30] [30]

Mengxi Liu, Zhuoqun Chai, Haojun Deng, and Rong Liu. A cnn-transformer network with multiscale context aggregation for fine-grained cropland change detection.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15:4297–4306, 2022

work page 2022

[31] [31]

ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities

Jiarui Lu, Thomas Holleis, Yizhe Zhang, Bernhard Aumayer, Feng Nan, Haoping Bai, Shuang Ma, Shen Ma, Mengyu Li, Guoli Yin, Zirui Wang, and Ruoming Pang. ToolSandbox: A stateful, conversational, interactive evaluation benchmark for LLM tool use capabilities. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors,Findings of the Association for Computational L...

work page 2025

[32] [32]

Visual perception by large language model’s weights.Advances in Neural Information Processing Systems, pages 28615–28635, 2024

Feipeng Ma, Hongwei Xue, Yizhou Zhou, Guangting Wang, Fengyun Rao, Shilin Yan, Yueyi Zhang, Siying Wu, Mike Zheng Shou, and Xiaoyan Sun. Visual perception by large language model’s weights.Advances in Neural Information Processing Systems, pages 28615–28635, 2024

work page 2024

[33] [33]

m &m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks

Zixian Ma, Weikai Huang, Jieyu Zhang, Tanmay Gupta, and Ranjay Krishna. m &m’s: A benchmark to evaluate tool-use for multi-step multi-modal tasks. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors,Computer Vision – ECCV 2024, pages 18–34, Cham, 2025. Springer Nature Switzerland

work page 2024

[34] [34]

Deepweeds: A multiclass weed species image dataset for deep learning.Scientific reports, 9(1):2058, 2019

Alex Olsen, Dmitry A Konovalov, Bronson Philippa, Peter Ridd, Jake C Wood, Jamie Johns, Wesley Banks, Benjamin Girgenti, Owen Kenny, James Whinney, et al. Deepweeds: A multiclass weed species image dataset for deep learning.Scientific reports, 9(1):2058, 2019

work page 2058

[35] [35]

Weedmap: A large-scale semantic weed mapping framework using aerial multispectral imaging and deep neural network for precision farming.Remote Sensing, 10(9):1423, 2018

Inkyu Sa, Marija Popovi´c, Raghav Khanna, Zetao Chen, Philipp Lottes, Frank Liebisch, Juan Nieto, Cyrill Stachniss, Achim Walter, and Roland Siegwart. Weedmap: A large-scale semantic weed mapping framework using aerial multispectral imaging and deep neural network for precision farming.Remote Sensing, 10(9):1423, 2018

work page 2018

[36] [36]

deepnir: Datasets for generat- ing synthetic nir images and improved fruit detection system using deep learning techniques

Inkyu Sa, Jong Yoon Lim, Ho Seok Ahn, and Bruce MacDonald. deepnir: Datasets for generat- ing synthetic nir images and improved fruit detection system using deep learning techniques. Sensors, 22(13):4721, 2022

work page 2022

[37] [37]

Multi- modal llms in agriculture: A comprehensive review.IEEE Transactions on Automation Science and Engineering, 2025

Ranjan Sapkota, Rizwan Qureshi, Muhammad Usman Hadi, Syed Zohaib Hassan, Ferhat Sadak, Maged Shoman, Muhammad Sajjad, Fayaz Ali Dharejo, Achyut Paudel, Jiajia Li, et al. Multi- modal llms in agriculture: A comprehensive review.IEEE Transactions on Automation Science and Engineering, 2025

work page 2025

[38] [38]

Weedsense: Multi-task learning for weed segmentation, height estimation, and growth stage classification

Toqi Tahamid Sarker, Khaled R Ahmed, Taminul Islam, Cristiana Bernardi Rankrape, and Karla Gage. Weedsense: Multi-task learning for weed segmentation, height estimation, and growth stage classification. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7180–7190, 2025

work page 2025

[39] [39]

Thinkgeo: Evaluating tool-augmented agents for remote sensing tasks, 2026

Akashah Shabbir, Muhammad Akhtar Munir, Akshay Dudhane, Muhammad Umer Sheikh, Muhammad Haris Khan, Paolo Fraccaro, Juan Bernabe Moreno, Fahad Shahbaz Khan, and Salman Khan. Thinkgeo: Evaluating tool-augmented agents for remote sensing tasks, 2026

work page 2026

[40] [40]

Agrobench: Vision-language model benchmark in agriculture

Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka, Masaki Onishi, and Yoshitaka Ushiku. Agrobench: Vision-language model benchmark in agriculture. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7634–7644, 2025. 12

work page 2025

[41] [41]

Zero hunger: future challenges and the way forward towards the achievement of sustainable development goal 2.Sustainable earth reviews, 7(1):10, 2024

Fabio Sporchia, Marta Antonelli, Alicia Aguilar-Martínez, Anna Bach-Faig, Dario Caro, Kyle F Davis, Roberta Sonnino, and Alessandro Galli. Zero hunger: future challenges and the way forward towards the achievement of sustainable development goal 2.Sustainable earth reviews, 7(1):10, 2024

work page 2024

[42] [42]

Zhendong Sun, Yanfei Zhong, Xinyu Wang, and Liangpei Zhang. Identifying cropland non- agriculturalization with high representational consistency from bi-temporal high-resolution remote sensing images: From benchmark datasets to real-world application.ISPRS Journal of Photogrammetry and Remote Sensing, 212:454–474, 2024

work page 2024

[43] [43]

Chain of visual perception: Harnessing multimodal large language models for zero-shot camouflaged object detection

Lv Tang, Peng-Tao Jiang, Zhi-Hao Shen, Hao Zhang, Jin-Wei Chen, and Bo Li. Chain of visual perception: Harnessing multimodal large language models for zero-shot camouflaged object detection. InProceedings of the 32nd ACM international conference on multimedia, pages 8805–8814, 2024

work page 2024

[44] [44]

Seed 2.0: Towards intelligence frontier for real-world complexity

ByteDance Seed Team. Seed 2.0: Towards intelligence frontier for real-world complexity. Tech- nical report, ByteDance, 2026. URL https://seed.bytedance.com/zh/blog/seed2-0-% E6%AD%A3%E5%BC%8F%E5%8F%91%E5%B8%83. Accessed: 2026-05-04

work page 2026

[45] [45]

Fpcd: An open aerial vhr dataset for farm pond change detection.arXiv preprint arXiv:2302.14554, 2023

Chintan Tundia, Rajiv Kumar, Om Damani, and G Sivakumar. Fpcd: An open aerial vhr dataset for farm pond change detection.arXiv preprint arXiv:2302.14554, 2023

work page arXiv 2023

[46] [46]

Uas-based remote sensing for agricultural monitoring: Current status and perspectives.Computers and Electronics in Agriculture, 227:109501, 2024

Jingzhe Wang, Silu Zhang, Ivan Lizaga, Yinghui Zhang, Xiangyu Ge, Zipeng Zhang, Wei Zhang, Qiujun Huang, and Zhongwen Hu. Uas-based remote sensing for agricultural monitoring: Current status and perspectives.Computers and Electronics in Agriculture, 227:109501, 2024

[47]

Jize Wang, Zerun Ma, Yining Li, Songyang Zhang, Cailian Chen, Kai Chen, and Xinyi Le. Gta: a benchmark for general tool agents. Advances in Neural Information Processing Systems, 37:75749–75790, 2024

[48]

Shaohua Wang, Dachuan Xu, Haojian Liang, Yongqing Bai, Xiao Li, Junyuan Zhou, Cheng Su, and Wenyu Wei. Advances in deep learning applications for plant disease and pest detection: A review. Remote Sensing, 17(4):698, 2025

[49]

Zhiqiang Wang, Yichao Gao, Yanting Wang, Suyuan Liu, Haifeng Sun, Haoran Cheng, Guanquan Shi, Haohua Du, and Xiangyang Li. Mcptox: A benchmark for tool poisoning on real-world mcp servers. Proceedings of the AAAI Conference on Artificial Intelligence, 40(42):35811–35819, Mar. 2026

[50]

Tianqi Wei, Zhi Chen, Xin Yu, Scott Chapman, Paul Melloy, and Zi Huang. A large-scale in-the-wild dataset for plant disease segmentation. Scientific Data, 13(1):205, 2026

[51]

Marie Weiss, Frédéric Jacob, and Grégory Duveiller. Remote sensing for agricultural applications: A meta-review. Remote sensing of environment, 236:111402, 2020

[52]

Yibin Wen, Qingmei Li, Zi Ye, Jiarui Zhang, Zurong Mai, Jing Wu, Shuohong Lou, Yuhang Chen, Henglian Huang, Xiaoya Fan, Yang Zhang, Defeng Gu, Lingyuan Zhao, Yutong Lu, Haohuan Fu, Jianxi Huang, and Juepeng Zheng. Agrocot: A chain-of-thought benchmark for evaluating reasoning in vision-language models for agriculture, 2026

[53]

Bingfang Wu, Miao Zhang, Hongwei Zeng, Fuyou Tian, Andries B Potgieter, Xingli Qin, Nana Yan, Sheng Chang, Yan Zhao, Qinghan Dong, et al. Challenges and opportunities in remote sensing-based crop monitoring: A review. National Science Review, 10(4):nwac290, 2023

[54]

Haiyang Wu, Weiliang Mu, Dandan Zhong, Zhuofei Du, Haifeng Li, and Chao Tao. Farmseg_vlm: A farmland remote sensing image segmentation method considering vision-language alignment. ISPRS Journal of Photogrammetry and Remote Sensing, 225:423–439, 2025

[55]

Xiaoping Wu, Chi Zhan, Yu-Kun Lai, Ming-Ming Cheng, and Jufeng Yang. Ip102: A large-scale benchmark dataset for insect pest recognition. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8787–8796, 2019

[56]

Zhuoning Xu, Jian Xu, Mingqing Zhang, Peijie Wang, Chao Deng, and Cheng-Lin Liu. Multimodal agricultural agent architecture (ma3): A new paradigm for intelligent agricultural decision-making, 2025

[57]

Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Jianyu Zhang, Xiao Xu, Nueraili Aierken, and Shijian Li. Agrigpt-omni: A unified speech–vision–text framework for multilingual agricultural intelligence. In Proceedings of the ACM Web Conference 2026, WWW ’26, pages 9800–9810, New York, NY, USA, 2026. Association for Computing Machinery. ISBN 9798400723070

[58]

Shuo Yang, Yuwei Niu, Yuyang Liu, Yang Ye, Bin Lin, and Li Yuan. Look-back: Implicit visual re-focusing in mllm reasoning. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 11694–11702, 2026

[59]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The eleventh international conference on learning representations, 2022

[60]

Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Zhang, et al. Ureader: Universal ocr-free visually-situated language understanding with multimodal large language model. In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 2841–2858, 2023

[61]

Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. A survey on multimodal large language models. National Science Review, page nwae403, 2024

[62]

Mingqing Zhang, Zhuoning Xu, Peijie Wang, Rongji Li, Liang Wang, Qiang Liu, Jian Xu, Xuyao Zhang, Shu Wu, and Liang Wang. Agridoctor: A multimodal intelligent assistant for agriculture. In ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 2741–2745, 2026. doi: 10.1109/ICASSP55912.2026.11464537

[63]

Wentao Zhang, Lingxuan Zhao, Haochong Xia, Shuo Sun, Jiaze Sun, Molei Qin, Xinyi Li, Yuqing Zhao, Yilei Zhao, Xinyu Cai, et al. A multimodal foundation agent for financial trading: Tool-augmented, diversified, and generalist. In Proceedings of the 30th acm sigkdd conference on knowledge discovery and data mining, pages 4314–4325, 2024

[64]

Zhuosheng Zhang, Kehai Chen, Rui Wang, Masao Utiyama, Eiichiro Sumita, Zuchao Li, and Hai Zhao. Universal multimodal representation for language understanding. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 9169–9185, 2023

[65]

Chenfei Zhu, Shao-Kang Hsia, Xiyun Hu, Ziyi Liu, Jingyu Shi, and Karthik Ramani. agentar: Creating augmented reality applications with tool-augmented llm-based autonomous agents. In Proceedings of the 38th Annual ACM Symposium on User Interface Software and Technology, pages 1–23, 2025

[66]

A benchmark question

[67]

The question_template id and gt_type

[68]

An ordered list of slot names and slot types

[69]

A model's raw answer to the question. Your task is to extract exactly what the model answer says into the requested slots. You must NOT decide whether the answer is correct. You must NOT infer, repair, or complete missing information from your own knowledge. You will NOT receive the ground-truth answer. Return strict JSON only. Do not use markdown. Do not...
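The extraction instruction above asks for strict JSON with no markdown, filled only from what the model answer says. A minimal sketch of how a harness might validate such a reply is given below. The function name `parse_strict_json` and the flat `{slot_name: value}` layout are assumptions made for illustration; the actual output schema is truncated in the source and may contain further fields.

```python
import json


def parse_strict_json(raw: str, slot_names: list[str]) -> dict:
    """Validate an extractor reply (hypothetical helper, assumed flat schema).

    The reply must be bare JSON: one object whose keys are exactly the
    requested slot names. Values are passed through unchanged, so nothing
    is repaired or inferred on the extractor's behalf.
    """
    text = raw.strip()
    if text.startswith("```"):
        # The prompt forbids markdown, so a fenced reply is rejected outright.
        raise ValueError("markdown fence found; strict JSON expected")
    data = json.loads(text)  # raises json.JSONDecodeError (a ValueError) if invalid
    if not isinstance(data, dict):
        raise ValueError("top-level JSON value must be an object")
    missing = [name for name in slot_names if name not in data]
    extra = [key for key in data if key not in slot_names]
    if missing or extra:
        raise ValueError(f"slot mismatch: missing={missing}, extra={extra}")
    # Return the slots in the requested order.
    return {name: data[name] for name in slot_names}
```

A reply such as `{"crop": "wheat", "disease": null}` passes for the illustrative slots `["crop", "disease"]`, with `null` preserved rather than filled in. A fenced or incomplete reply raises `ValueError` and can be logged as an extraction failure.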

[70]

Reference key points for this slot

[71]

The model's extracted answer for this slot. Your task is to evaluate ONLY this single slot. Core principle:
- Evaluate semantic correctness, factual coverage, specificity, and internal consistency against the provided reference key points.
- Do not require exact wording.
- Do not reward answers that are merely topically related but vague.
- Do not use ext...

[72]

Internally decompose the reference into a small number of essential atomic points, usually 1 to 4

[73]

Group near-equivalent alternatives into one point when they express the same core fact

[74]

Check which essential points are correctly covered by the model answer

[75]

Check whether the answer is specific enough for this slot

[76]

Check whether the answer contains any contradiction, reversal, or invented fact

[77]

Assign a final score using the rubric below. General scoring rules:
- Semantic equivalence counts. Paraphrases, synonyms, scientific/common names, and equivalent geographic descriptions are acceptable.
- Fluency, grammar, or formatting should not affect the score.
- Extra correct but non-conflicting details are allowed.
- Unsupported extra details do not ...

[78]

Only retain samples whose tool chains are actually executable under the current benchmark setting. Do not repair a failed answer using your own common-sense knowledge

[79]

Inspect the full ReAct-style trajectory rather than only the final answer

[80]

For each sample, verify:
- whether the `sample_id`, `classification`, `question_template`, attached files, and user question match the corresponding entry in `question-4.json`;
- whether the invoked tools are appropriate for the task, with no missing key tools and no irrelevant tools;
- whether all critical tool arguments are correct, including image path...
