
arxiv: 2605.22034 · v1 · pith:D6MVSUMLnew · submitted 2026-05-21 · 💻 cs.CV · cs.AI

AgroVG: A Large-Scale Multi-Source Benchmark for Agricultural Visual Grounding

Pith reviewed 2026-05-22 06:51 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords agricultural visual grounding · benchmark dataset · visual grounding · set prediction · multi-source dataset · zero-shot evaluation · instance mask grounding · precision agriculture

The pith

A new multi-source benchmark for agricultural visual grounding shows that current models achieve low accuracy on small, repetitive, and absent targets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates AgroVG to test visual grounding specifically in farm settings where language instructions must locate one or many objects or correctly abstain when none match. Agricultural images present repeated small items like weeds or heads of wheat that are often occluded or irregular, so the benchmark combines data from ten sources to check both box sets and mask outputs across single-target, multi-target, and target-absent cases. A sympathetic reader would care because reliable language-to-object mapping is needed for practical agricultural robots that perform selective weeding or disease monitoring. The evaluation of many model types reveals that even the strongest configurations still fall short on set completeness and precise localization, implying that general-purpose systems are not yet sufficient for these tasks.

Core claim

The paper claims that agricultural visual grounding can be reliably evaluated by recasting it as generalized set prediction over 10,071 image-query pairs drawn from ten source datasets spanning six target families, and that zero-shot tests of closed-source, open-source, and specialized models under this protocol expose clear shortfalls in multi-target Set-F1 scores and high-IoU mask success rates.

What carries the argument

AgroVG benchmark, a unified collection of bounding-box and instance-mask grounding tasks that supports set matching and query-level coverage checks for single, multiple, and absent agricultural targets.
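
To make the set-matching idea concrete, here is a minimal sketch of a per-query Set-F1 over bounding boxes. The details are assumptions for illustration, not the paper's stated protocol: greedy one-to-one matching, an IoU threshold of 0.5, and a score of 1.0 when both the prediction and the ground truth are empty. The names `box_iou` and `set_f1` are hypothetical.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed pixel coordinates


def box_iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def set_f1(pred: List[Box], gt: List[Box], iou_thr: float = 0.5) -> float:
    """Per-query F1 between a predicted box set and a ground-truth box set."""
    if not pred and not gt:
        return 1.0  # assumed convention: correct abstention counts as a perfect score
    if not pred or not gt:
        return 0.0  # missed every target, or predicted boxes for an absent target
    # Greedy one-to-one matching (an assumption): take the highest-IoU pairs first.
    pairs = sorted(
        ((box_iou(p, g), i, j) for i, p in enumerate(pred) for j, g in enumerate(gt)),
        reverse=True,
    )
    used_pred, used_gt = set(), set()
    true_pos = 0
    for iou, i, j in pairs:
        if iou < iou_thr:
            break  # remaining pairs are all below the threshold
        if i in used_pred or j in used_gt:
            continue
        used_pred.add(i)
        used_gt.add(j)
        true_pos += 1
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(gt)
    return 2 * precision * recall / (precision + recall)
```

Under these assumptions, a model that finds two of three wheat heads with no false positives has precision 1 and recall 2/3, so it scores 0.8 on that query. This is why completeness, and not only localization, drives the multi-target score.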

If this is right

  • Models must improve set-prediction mechanisms to handle completeness when multiple matching instances exist in one image.
  • Systems need reliable abstention logic for queries that refer to no object at all.
  • Task-specific protocols for box-set matching and mask coverage can serve as standard tests for future agricultural grounding work.
  • Gaps across six target families indicate that robustness to irregular shapes and occlusion must be addressed beyond general vision-language training.
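
The abstention point can be illustrated with a small sketch that scores a model's behavior on target-absent queries separately from positive ones. The record layout (`num_gt`, `num_pred`) and the metric names are hypothetical choices for this example, not definitions taken from the paper.

```python
from typing import Dict, List


def regime(num_gt: int) -> str:
    """Map a ground-truth instance count to one of the three query regimes."""
    if num_gt == 0:
        return "target-absent"
    return "single-target" if num_gt == 1 else "multi-target"


def abstention_report(records: List[Dict[str, int]]) -> Dict[str, float]:
    """Summarize abstention behavior.

    Each record is assumed to hold `num_gt` (ground-truth instances) and
    `num_pred` (instances the model returned) for one image-query pair.
    """
    absent = [r for r in records if r["num_gt"] == 0]
    positive = [r for r in records if r["num_gt"] > 0]
    correct_abstain = sum(1 for r in absent if r["num_pred"] == 0)
    false_abstain = sum(1 for r in positive if r["num_pred"] == 0)
    return {
        # Share of target-absent queries where the model returned nothing.
        "abstention_accuracy": correct_abstain / len(absent) if absent else float("nan"),
        # Share of positive queries where the model wrongly returned nothing.
        "false_abstention_rate": false_abstain / len(positive) if positive else float("nan"),
    }
```

Reporting both numbers matters because a model that always returns an empty set would score perfectly on the first and fail completely on the second.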

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark could be used to fine-tune or select models before deploying them on real field robots for harvesting or spraying.
  • Similar set-prediction framing might help evaluate grounding in other domains with repetitive small objects, such as cell microscopy.
  • Extending the queries to include temporal or multi-view farm imagery would test whether current gaps persist under more realistic conditions.

Load-bearing premise

The combined annotations from the ten source datasets are accurate and representative enough to capture the real difficulties of small, repetitive, occluded, and irregular agricultural targets.

What would settle it

A controlled re-labeling of several hundred AgroVG queries by independent agricultural experts that produces substantially higher or lower model rankings than the original evaluation.

Figures

Figures reproduced from arXiv: 2605.22034 by Gengmeng Pu, Guilong Xiao, Haocheng Li, Haohuan Fu, Jianxi Huang, Juepeng Zheng, Kaiqi Du, Zenghao Yang.

Figure 1: AgroVG at a glance. AgroVG casts agricultural grounding as generalized set prediction: a query over an agricultural image may require one target, multiple targets, or no output. The benchmark spans six target families and two output protocols, bounding boxes (T1) and instance masks (T2), to diagnose multi-target completeness, target-absent abstention, and mask grounding.
Figure 2: Target families and grounding regimes in AgroVG. AgroVG covers six target families (columns): tree canopy, crop/weed, fruit, wheat head, pest, and plant disease. For each family, we show queries under three regimes (rows): single-target, multi-target, and target-absent, where the output is one localized instance, a set of instances, or an empty prediction.
Figure 3: AgroVG construction pipeline. AgroVG converts heterogeneous agricultural datasets into verified image–query pairs through annotation normalization, audit-driven sampling, expert visual review, split assignment, and split-aware query generation. Positive and target-absent queries are generated after image verification, ensuring each query is tied to stable ground-truth annotations.
Figure 4: Dataset statistics of AgroVG. Panels summarize source/family composition, query-regime counts, instance-density buckets, and target-size distributions for T1 and T2.
Figure 5: Zero-shot diagnostics on AgroVG. (a) T1 models across localization, set-completeness, and abstention; (b) T2 trade-off between positive-query mask quality and empty-query accuracy.
Original abstract

Visual grounding, the task of localizing objects described by natural-language expressions, is a foundational capability for agricultural AI systems, enabling applications such as selective weeding, disease monitoring, and targeted harvesting. Reliable evaluation of agricultural visual grounding remains challenging because agricultural targets are often small, repetitive, occluded, or irregularly shaped, and instructions may refer to one, many, or no objects in an image. Evaluating this capability therefore requires jointly testing localization accuracy, target-set completeness, and existence-aware abstention. To address these challenges, we introduce AgroVG, a multi-source benchmark that formulates agricultural grounding as generalized set prediction: given an image and a referring expression, a model must return all matching target instances or abstain when no target is present. AgroVG contains 10,071 annotation-grounded image-query pairs from ten source datasets across six target families: crop/weed, fruit, wheat head, pest, plant disease, and tree canopy. It supports bounding-box grounding (T1) across all six families and instance-mask grounding (T2) on sources with reliable instance-level pixel annotations, with queries covering single-target, multi-target, and target-absent regimes. AgroVG further provides task-specific protocols for box-set matching and query-level mask coverage. Zero-shot evaluation of 26 model configurations spanning closed-source MLLMs, open-source VLMs, and specialized grounding systems reveals persistent gaps: the best multi-target Set-F1 reaches only 0.35, and the best positive-query mask success rate at IoU@0.75 remains below 0.17. Data and code are available at https://anonymous.4open.science/r/AgroVG-5172/ .
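
As a rough illustration of the mask-side metric quoted in the abstract, the sketch below computes a positive-query success rate at IoU 0.75. It assumes, without confirmation from the paper, that query-level mask quality is measured on the union of all instance masks for a query and that target-absent queries are excluded from this rate. Masks are represented as sets of `(row, col)` pixels to keep the example dependency-free; `rect`, `union_mask`, `mask_iou`, and `success_rate` are hypothetical names.

```python
from typing import Iterable, List, Set, Tuple

Pixel = Tuple[int, int]  # (row, col)
Query = Tuple[List[Set[Pixel]], List[Set[Pixel]]]  # (predicted masks, ground-truth masks)


def rect(r0: int, c0: int, r1: int, c1: int) -> Set[Pixel]:
    """Helper for the example: a filled rectangle with exclusive upper bounds."""
    return {(r, c) for r in range(r0, r1) for c in range(c0, c1)}


def union_mask(instances: Iterable[Set[Pixel]]) -> Set[Pixel]:
    """Merge instance masks into one query-level mask (assumed protocol)."""
    merged: Set[Pixel] = set()
    for mask in instances:
        merged |= mask
    return merged


def mask_iou(pred: Set[Pixel], gt: Set[Pixel]) -> float:
    """Pixel-level intersection-over-union of two masks."""
    union = pred | gt
    return len(pred & gt) / len(union) if union else 0.0


def success_rate(queries: List[Query], thr: float = 0.75) -> float:
    """Fraction of positive queries whose query-level mask IoU reaches `thr`."""
    positives = [(p, g) for p, g in queries if g]  # drop target-absent queries
    if not positives:
        return float("nan")
    hits = sum(
        1 for p, g in positives if mask_iou(union_mask(p), union_mask(g)) >= thr
    )
    return hits / len(positives)
```

A threshold of 0.75 is strict for thin or irregular targets: in this representation, a prediction that covers only three quarters of a lesion, with no spill outside it, sits exactly on the boundary.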

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces AgroVG, a multi-source benchmark for agricultural visual grounding formulated as generalized set prediction. It aggregates 10,071 image-query pairs from ten source datasets spanning six target families (crop/weed, fruit, wheat head, pest, plant disease, tree canopy), supporting bounding-box grounding (T1) on all families and instance-mask grounding (T2) where pixel annotations are available. Queries cover single-target, multi-target, and target-absent regimes. Zero-shot evaluation of 26 model configurations (closed-source MLLMs, open-source VLMs, and specialized grounding systems) reports persistent gaps, with the best multi-target Set-F1 reaching only 0.35 and the best positive-query mask success rate at IoU@0.75 remaining below 0.17. Data and code are released.

Significance. If the annotations are shown to be reliable, AgroVG would be a valuable contribution as the first large-scale benchmark specifically targeting agricultural visual grounding challenges such as small, repetitive, occluded, and irregularly shaped objects. The multi-source construction, support for set-based and mask-based protocols, and broad zero-shot evaluation across 26 models provide a reproducible testbed that highlights concrete limitations in current systems for applications like selective weeding and disease monitoring. The public release of data and code is a clear strength that supports future work.

major comments (1)
  1. [§3 (Benchmark Construction)] The manuscript provides no quantitative validation of annotation quality across the ten heterogeneous source datasets, such as inter-annotator agreement statistics, error rates for referring expressions, or audits for missed small/occluded instances and absent-target labels. This is load-bearing for the central claim because the reported gaps (best Set-F1 = 0.35; IoU@0.75 mask success < 0.17) can only be interpreted as model limitations if the 10,071 pairs faithfully reflect the stated agricultural challenges without systematic label noise.
minor comments (2)
  1. [Table 1 or equivalent] Consider adding a column or footnote that explicitly lists the ten source datasets and their original annotation types (box vs. mask) to improve traceability of the multi-source aggregation.
  2. [§5 (Experiments)] The description of the 26 model configurations would benefit from a clearer breakdown (e.g., how many are closed-source MLLMs vs. specialized grounding models) to allow readers to map results to model families without cross-referencing the appendix.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the thoughtful review and the recommendation for major revision. We address the single major comment below and will update the manuscript to incorporate additional validation details.

Point-by-point responses
  1. Referee: The manuscript provides no quantitative validation of annotation quality across the ten heterogeneous source datasets, such as inter-annotator agreement statistics, error rates for referring expressions, or audits for missed small/occluded instances and absent-target labels. This is load-bearing for the central claim because the reported gaps (best Set-F1 = 0.35; IoU@0.75 mask success < 0.17) can only be interpreted as model limitations if the 10,071 pairs faithfully reflect the stated agricultural challenges without systematic label noise.

    Authors: We thank the referee for highlighting this important point. The annotations in AgroVG are inherited directly from ten established, peer-reviewed source datasets that have been widely adopted in the agricultural vision community. While the current manuscript does not consolidate quantitative quality metrics across sources, we agree that providing such information would strengthen the benchmark's credibility. In the revised version we will add a dedicated paragraph in §3 summarizing the annotation protocols and any quality metrics (e.g., inter-annotator agreement or error rates) reported in the original dataset papers. We will also perform and report a targeted audit on a representative subset of approximately 500 image-query pairs, checking for missed small/occluded instances and the accuracy of target-absent labels. These additions will allow readers to better evaluate the reliability of the reported performance gaps.

    Revision: yes
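
One way such an audit subset could be drawn is by stratified sampling, so that every family and query regime is represented roughly in proportion to its share of the benchmark. The sketch below is only an illustration of that idea: the field names `family` and `regime`, the function `stratified_audit_sample`, and the proportional-quota rule are assumptions, not the authors' stated procedure.

```python
import random
from collections import defaultdict
from typing import Dict, List


def stratified_audit_sample(
    pairs: List[Dict[str, object]], total: int = 500, seed: int = 0
) -> List[Dict[str, object]]:
    """Draw roughly `total` pairs, proportionally from each (family, regime) stratum."""
    rng = random.Random(seed)  # fixed seed so the audit subset is reproducible
    strata = defaultdict(list)
    for pair in pairs:
        strata[(pair["family"], pair["regime"])].append(pair)

    sample: List[Dict[str, object]] = []
    for key in sorted(strata):
        items = strata[key]
        # Proportional quota, with at least one pair per stratum and never
        # more than the stratum holds. Rounding means the final size is
        # approximately, not exactly, `total`.
        quota = max(1, round(total * len(items) / len(pairs)))
        quota = min(quota, len(items))
        sample.extend(rng.sample(items, quota))
    return sample
```

Guaranteeing at least one pair per stratum keeps rare combinations, such as target-absent queries for a small family, from dropping out of the audit entirely.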

Circularity Check

0 steps flagged

No circularity: empirical benchmark with direct measurements

Full rationale

The paper constructs AgroVG by aggregating annotations from ten existing source datasets and performs zero-shot evaluation of 26 external model configurations. No derivations, equations, fitted parameters, or predictions are claimed; results consist of direct empirical measurements (Set-F1, IoU success rates) against model outputs. No self-citations are used to justify uniqueness theorems, ansatzes, or load-bearing premises. The work is therefore self-contained as a benchmark release and evaluation study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper aggregates existing datasets and defines new evaluation protocols without introducing new mathematical parameters or entities; it relies on the domain assumption that source annotations transfer reliably to the new grounding task.

axioms (1)
  • Domain assumption: Annotations from the ten source datasets are accurate and suitable for the formulated visual grounding tasks including multi-target and absent cases.
    The benchmark construction and reported model gaps depend on the quality and completeness of the reused annotations from prior datasets.

pith-pipeline@v0.9.0 · 5870 in / 1472 out tokens · 47030 ms · 2026-05-22T06:51:29.029472+00:00 · methodology



Reference graph

Works this paper leans on

73 extracted references · 73 canonical work pages · 5 internal anchors

  1. [1]

    ReferItGame: Referring to Objects in Photographs of Natural Scenes

    Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. ReferItGame: Referring to Objects in Photographs of Natural Scenes. InProceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798. Association for Computational Linguistics, 2014. doi: 10.3115/v1/D14-1086. URL http://aclweb.org/a nthology/D14-1086

  2. [2]

    Yuille, and Kevin Murphy

    Junhua Mao, Jonathan Huang, Alexander Toshev, Oana Camburu, Alan Yuille, and Kevin Murphy. Generation and Comprehension of Unambiguous Object Descriptions. In2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 11–20. IEEE, 2016. doi: 10.1109/CVPR.2016.9. URLhttp://ieeexplore.ieee.org/document/7780378/

  3. [3]

    V ougioukas

    Stavros G. V ougioukas. Agricultural Robotics.Annual Review of Control, Robotics, and Autonomous Systems, 2(1):365–392, 2019. doi: 10.1146/annurev-control-053018-023617. URL https://www.annualreviews.org/doi/10.1146/annurev-control-053018-0 23617

  4. [4]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An Open-Source Vision-Language-Action Model. 2024

  5. [5]

    Berg, and Tamara L

    Licheng Yu, Patrick Poirson, Shan Yang, Alexander C. Berg, and Tamara L. Berg. Modeling Context in Referring Expressions. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors,Computer Vision – ECCV 2016, volume 9906, pages 69–85. Springer International Publishing, 2016. doi: 10.1007/978-3-319-46475-6_5. URL http://link.springer.com/ 10.1007/9...

  6. [6]

    GREC: Generalized Referring Expression Comprehension, 2023

    Shuting He, Henghui Ding, Chang Liu, and Xudong Jiang. GREC: Generalized Referring Expression Comprehension, 2023. URLhttp://arxiv.org/abs/2308.16182

  7. [7]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp

    Chang Liu, Henghui Ding, and Xudong Jiang. GRES: Generalized Referring Expression Segmentation. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 23592–23601. IEEE, 2023. doi: 10.1109/CVPR52729.2023.02259. URL https://ieeexplore.ieee.org/document/10204107/

  8. [8]

    RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data.IEEE Transactions on Geoscience and Remote Sensing, 61:1–13, 2023

    Yang Zhan, Zhitong Xiong, and Yuan Yuan. RSVG: Exploring Data and Models for Visual Grounding on Remote Sensing Data.IEEE Transactions on Geoscience and Remote Sensing, 61:1–13, 2023. doi: 10.1109/TGRS.2023.3250471. URL https://ieeexplore.ieee.org/ document/10056343/

  9. [9]

    RRSIS: Referring Remote Sensing Image Segmentation.IEEE Transactions on Geoscience and Remote Sensing, 62:1–12,

    Zhenghang Yuan, Lichao Mou, Yuansheng Hua, and Xiao Xiang Zhu. RRSIS: Referring Remote Sensing Image Segmentation.IEEE Transactions on Geoscience and Remote Sensing, 62:1–12,

  10. [10]

    URL https://ieeexplore.ieee.org/docume nt/10458079/

    doi: 10.1109/TGRS.2024.3369720. URL https://ieeexplore.ieee.org/docume nt/10458079/

  11. [11]

    VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding

    Jian Ding, Mohamed Elhoseiny, and Xiang Li. VRSBench: A Versatile Vision-Language Benchmark Dataset for Remote Sensing Image Understanding. InAdvances in Neural In- formation Processing Systems 37, pages 3229–3242. Neural Information Processing Sys- tems Foundation, Inc. (NeurIPS), 2024. doi: 10.52202/079017- 0106. URL http: //www.proceedings.com/079017-0106.html

  12. [12]

    AgriBench: A Hierarchical Agriculture Benchmark for Multimodal Large Language Models, 2024

    Yutong Zhou and Masahiro Ryo. AgriBench: A Hierarchical Agriculture Benchmark for Multimodal Large Language Models, 2024. URLhttp://arxiv.org/abs/2412.00465

  13. [13]

    AgroBench: Vision-Language Model Benchmark in Agriculture, 2025

    Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka, Masaki Onishi, and Yoshitaka Ushiku. AgroBench: Vision-Language Model Benchmark in Agriculture, 2025. URL http://arxiv. org/abs/2507.20519

  14. [14]

    AGMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark

    Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S Adve, and Yu-Xiong Wang. AGMMU: A Comprehensive Agricultural Multimodal Understanding Benchmark. 2025. 10

  15. [15]

    Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind

    Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuohong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, Haohuan Fu, Jianxi Huang, and Juepeng Zheng. Can Large Multimodal Models Understand Agricultural Scenes? Benchmarking with AgroMind. 2025

  16. [16]

    Multi-label Instance- level Generalised Visual Grounding in Agriculture, 2026

    Mohammadreza Haghighat, Alzayat Saleh, and Mostafa Rahimi Azghadi. Multi-label Instance- level Generalised Visual Grounding in Agriculture, 2026. URL http://arxiv.org/abs/26 03.06699

  17. [17]

    360mvsnet: Deep multi-view stereo network with 360° images for indoor scene reconstruction,

    Daniel Steininger, Andreas Trondl, Gerardus Croonen, Julia Simon, and Verena Widhalm. The CropAndWeed Dataset: A Multi-Modal Learning Approach for Efficient Crop and Weed Manipulation. In2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3718–3727. IEEE, 2023. doi: 10.1109/WACV56688.2023.00372. URL https://ieeexplore.ieee.org...

  18. [18]

    Etienne David, Mario Serouart, Daniel Smith, Simon Madec, Kaaviya Velumani, Shouyang Liu, Xu Wang, Francisco Pinto, Shahameh Shafiee, Izzat S.A. Tahir, Hisashi Tsujimoto, Shuhei Nasuda, Bangyou Zheng, Norbert Kirchgessner, Helge Aasen, Andreas Hund, Pouria Sadhegi-Tehran, Koichi Nagasawa, Goro Ishikawa, Sébastien Dandrifosse, Alexis Carlier, Benjamin Dumo...

  19. [19]

    MinneApple: A Benchmark Dataset for Apple Detection and Segmentation.IEEE Robotics and Automation Letters, 5(2):852–858, 2020

    Nicolai Hani, Pravakar Roy, and V olkan Isler. MinneApple: A Benchmark Dataset for Apple Detection and Segmentation.IEEE Robotics and Automation Letters, 5(2):852–858, 2020. doi: 10.1109/LRA.2020.2965061. URL https://ieeexplore.ieee.org/document/8954630 /

  20. [20]

    IP102: A Large- Scale Benchmark Dataset for Insect Pest Recognition

    Xiaoping Wu, Chi Zhan, Yu-Kun Lai, Ming-Ming Cheng, and Jufeng Yang. IP102: A Large- Scale Benchmark Dataset for Insect Pest Recognition. In2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8779–8788. IEEE, 2019. doi: 10.110 9/CVPR.2019.00899. URLhttps://ieeexplore.ieee.org/document/8954351/

  21. [21]

    A Large-Scale In-the-wild Dataset for Plant Disease Segmentation.Scientific Data, 13(1):205, 2026

    Tianqi Wei, Zhi Chen, Xin Yu, Scott Chapman, Paul Melloy, and Zi Huang. A Large-Scale In-the-wild Dataset for Plant Disease Segmentation.Scientific Data, 13(1):205, 2026. doi: 10.1038/s41597-025-06513-4. URL https://www.nature.com/articles/s41597-025 -06513-4

  22. [22]

    Emogen: Emotional image content generation with text-to-image diffusion models,

    Sihan Liu, Yiwei Ma, Xiaoqing Zhang, Haowei Wang, Jiayi Ji, Xiaoshuai Sun, and Rongrong Ji. Rotated Multi-Scale Interaction Network for Referring Remote Sensing Image Segmentation. In2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26648–26658. IEEE, 2024. doi: 10.1109/CVPR52733.2024.02517. URL https://ieeexplo re.ieee.org...

  23. [23]

    Segmentation from Natural Language Expressions

    Ronghang Hu, Marcus Rohrbach, and Trevor Darrell. Segmentation from Natural Language Expressions. In Bastian Leibe, Jiri Matas, Nicu Sebe, and Max Welling, editors,Computer Vision – ECCV 2016, volume 9905, pages 108–124. Springer International Publishing, 2016. doi: 10.1007/978-3-319-46448-0_7. URL http://link.springer.com/10.1007/978-3 -319-46448-0_7

  24. [24]

    Licheng Yu, Zhe Lin, Xiaohui Shen, Jimei Yang, Xin Lu, Mohit Bansal, and Tamara L. Berg. MAttNet: Modular Attention Network for Referring Expression Comprehension. In2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1307–1315. IEEE,

  25. [25]

    URL https://ieeexplore.ieee.org/document /8578240/

    doi: 10.1109/CVPR.2018.00142. URL https://ieeexplore.ieee.org/document /8578240/

  26. [26]

    Walk in the cloud: Learning curves for point clouds shape analysis, pp

    Jiajun Deng, Zhengyuan Yang, Tianlang Chen, Wengang Zhou, and Houqiang Li. TransVG: End- to-End Visual Grounding with Transformers. In2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 1749–1759. IEEE, 2021. doi: 10.1109/ICCV48922.2021.00179. URLhttps://ieeexplore.ieee.org/document/9710016/. 11

  27. [27]

    Visual Grounding in Remote Sensing Images

    Yuxi Sun, Shanshan Feng, Xutao Li, Yunming Ye, Jian Kang, and Xu Huang. Visual Grounding in Remote Sensing Images. InProceedings of the 30th ACM International Conference on Multimedia, pages 404–412. ACM, 2022. doi: 10.1145/3503161.3548316. URL https: //dl.acm.org/doi/10.1145/3503161.3548316

  28. [28]

    Meng Lan, Fu Rong, Hongzan Jiao, Zhi Gao, and Lefei Zhang. Language Query-Based Transformer With Multiscale Cross-Modal Alignment for Visual Grounding on Remote Sensing Images.IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024. doi: 10.1109/ TGRS.2024.3407598. URLhttps://ieeexplore.ieee.org/document/10542207/

  29. [29]

    Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images.IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024

    Ke Li, Di Wang, Haojie Xu, Haodi Zhong, and Cong Wang. Language-Guided Progressive Attention for Visual Grounding in Remote Sensing Images.IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024. doi: 10.1109/TGRS.2024.3423663. URL https: //ieeexplore.ieee.org/document/10584552/

  30. [30]

    AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations

    Junli Liu, Qizhi Chen, Zhigang Wang, Yiwen Tang, Yiting Zhang, Chi Yan, Dong Wang, Xuelong Li, and Bin Zhao. AerialVG: A Challenging Benchmark for Aerial Visual Grounding by Exploring Positional Relations. 2025

  31. [31]

    RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes, 2025

    Zhichao Sun, Yepeng Liu, Zhiling Su, Huachao Zhu, Yuliang Gu, Yuda Zou, Zelong Liu, Gui-Song Xia, Bo Du, and Yongchao Xu. RefDrone: A Challenging Benchmark for Referring Expression Comprehension in Drone Scenes, 2025. URL http://arxiv.org/abs/2502.0 0392

  32. [32]

    Cross-Modal Bidi- rectional Interaction Model for Referring Remote Sensing Image Segmentation, 2025

    Zhe Dong, Yuzhe Sun, Tianzhu Liu, Wangmeng Zuo, and Yanfeng Gu. Cross-Modal Bidi- rectional Interaction Model for Referring Remote Sensing Image Segmentation, 2025. URL http://arxiv.org/abs/2410.08613

  33. [33]

    A Large- Scale Referring Remote Sensing Image Segmentation Dataset and Benchmark, 2025

    Zhigang Yang, Huiguang Yao, Linmao Tian, Xuezhi Zhao, Qiang Li, and Qi Wang. A Large- Scale Referring Remote Sensing Image Segmentation Dataset and Benchmark, 2025. URL http://arxiv.org/abs/2506.03583

  34. [34]

    RIS- LAD: A Benchmark and Model for Referring Image Segmentation in Low-Altitude Drone Imagery

    Kai Ye, YingShi Luan, Zhudi Chen, Guangyue Meng, Pingyang Dai, and Liujuan Cao. RIS- LAD: A Benchmark and Model for Referring Image Segmentation in Low-Altitude Drone Imagery. 2026

  35. [35]

    LISAT: Language-Instructed Segmentation Assistant for Satellite Imagery

    Jerome Quenum, Wen-Han Hsieh, Tsung-Han Wu, Ritwik Gupta, Trevor Darrell, and David M Chan. LISAT: Language-Instructed Segmentation Assistant for Satellite Imagery. 2025

  36. [36]

    SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model, 2025

    Kaiyu Li, Zepeng Xin, Li Pang, Chao Pang, Yupeng Deng, Jing Yao, Guisong Xia, Deyu Meng, Zhi Wang, and Xiangyong Cao. SegEarth-R1: Geospatial Pixel Reasoning via Large Language Model, 2025. URLhttp://arxiv.org/abs/2504.09644

  37. [37]

    SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images, 2025

    Zepeng Xin, Kaiyu Li, Luodi Chen, Wanchen Li, Yuchen Xiao, Hui Qiao, Weizhan Zhang, Deyu Meng, and Xiangyong Cao. SegEarth-R2: Towards Comprehensive Language-guided Segmentation for Remote Sensing Images, 2025. URL http://arxiv.org/abs/2512.200 13

  38. [38]

    GRASP: Geospatial pixel Reasoning viA Structured Policy learning, 2025

    Chengjie Jiang, Yunqi Zhou, Jiafeng Yan, Jing Li, Jiayang Li, Yue Zhou, Hongjie He, and Jonathan Li. GRASP: Geospatial pixel Reasoning viA Structured Policy learning, 2025. URL http://arxiv.org/abs/2508.17102

  39. [39]

    UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes

    Shuo Ni, Di Wang, He Chen, Haonan Guo, Ning Zhang, and Jing Zhang. UniGeoSeg: Towards Unified Open-World Segmentation for Geospatial Scenes, 2026. URL http://arxiv.org/ abs/2511.23332

  40. [40]

    Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching

    Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, and Tat-Seng Chua. Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision – ECCV 2024, volume 15069, pages 213–231. Springer Nature Switzerland,

  41. [41]

    URL https://link.springer.com/10.100 7/978-3-031-73247-8_13

    doi: 10.1007/978-3-031-73247-8_13. URL https://link.springer.com/10.100 7/978-3-031-73247-8_13. 12

  42. [42]

    GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding, 2025

    Yue Zhou, Mengcheng Lan, Xiang Li, Litong Feng, Yiping Ke, Xue Jiang, Qingyun Li, Xue Yang, and Wayne Zhang. GeoGround: A Unified Large Vision-Language Model for Remote Sensing Visual Grounding, 2025. URLhttp://arxiv.org/abs/2411.11904

  43. [43]

    AgriGPT: A Large Language Model Ecosystem for Agriculture, 2025

    Bo Yang, Yu Zhang, Lanfei Feng, Yunkui Chen, Jianyu Zhang, Xiao Xu, Nueraili Aierken, Yurui Li, Yuxuan Chen, Guijun Yang, Yong He, Runhe Huang, and Shijian Li. AgriGPT: A Large Language Model Ecosystem for Agriculture, 2025. URL http://arxiv.org/abs/25 08.08632

  44. [44]

    AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence, 2025

    Bo Yang, Lanfei Feng, Yunkui Chen, Yu Zhang, Jianyu Zhang, Xiao Xu, Nueraili Aierken, and Shijian Li. AgriGPT-Omni: A Unified Speech-Vision-Text Framework for Multilingual Agricultural Intelligence, 2025. URLhttp://arxiv.org/abs/2512.10624

  45. [45]

    AgriGPT-VL: Agricultural Vision- Language Understanding Suite, 2025

    Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, and Shijian Li. AgriGPT-VL: Agricultural Vision- Language Understanding Suite, 2025. URLhttp://arxiv.org/abs/2510.04002

  46. [46]

    AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models.Proceedings of the AAAI Conference on Artificial Intelligence, 40(40): 34205–34213, 2026

    Lian Yan, Haotian Wang, Chen Tang, Haifeng Liu, Tianyang Sun, Liangliang Liu, Yi Guan, and Jingchi Jiang. AgriEval: A Comprehensive Chinese Agricultural Benchmark for Large Language Models.Proceedings of the AAAI Conference on Artificial Intelligence, 40(40): 34205–34213, 2026. doi: 10.1609/aaai.v40i40.40716. URL https://ojs.aaai.org/index .php/AAAI/artic...

  47. [47]

    Jan Weyler, Federico Magistri, Elias Marks, Yue Linn Chong, Matteo Sodano, Gianmarco Roggiolani, Nived Chebrolu, Cyrill Stachniss, and Jens Behley. PhenoBench – A Large Dataset and Benchmarks for Semantic Image Interpretation in the Agricultural Domain.IEEE Transactions on Pattern Analysis and Machine Intelligence, 46(12):9583–9594, 2024. doi: 10.1109/TPA...

  48. [48]

    Deep fruit detection in orchards

    Suchet Bargoti and James Underwood. Deep fruit detection in orchards. In2017 IEEE International Conference on Robotics and Automation (ICRA), pages 3626–3633. IEEE, 2017. doi: 10.1109/ICRA.2017.7989417. URL http://ieeexplore.ieee.org/document/798 9417/

  49. [49]

    Learn from foundation model: Fruit detection model without manual annotation.Pattern Recognition, 174:112799, 2026

    Yanan Wang, Zhenghao Fei, Ruichen Li, and Yibin Ying. Learn from foundation model: Fruit detection model without manual annotation.Pattern Recognition, 174:112799, 2026. doi: 10.1016/j.patcog.2025.112799. URL https://linkinghub.elsevier.com/retrieve/p ii/S0031320325014621

  50. [50]

    OAM-TCD: A globally diverse dataset of high-resolution tree cover maps

    Josh Veitch-Michaelis, Andrew Cottam, Daniella Schweizer, Eben N Broadbent, David Dao, Ce Zhang, Angelica Almeyda Zambrano, and Simeon Max. OAM-TCD: A globally diverse dataset of high-resolution tree cover maps. 2024

  51. [51]

    Annotated tree crown bounding boxes in urban/rural environment, 2025

    Jelle Dumortier. Annotated tree crown bounding boxes in urban/rural environment, 2025. URL https://zenodo.org/doi/10.5281/zenodo.15155081

  52. [52]

    Hello GPT-4o, 2024

    OpenAI. Hello GPT-4o, 2024. URL https://openai.com/index/hello-gpt-4o/

  53. [53]

    GPT-5.4 Thinking System Card, 2026

    OpenAI. GPT-5.4 Thinking System Card, 2026. URL https://openai.com/index/gpt-5-4-thinking-system-card/

  54. [54]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, et al. Gemini: A Family of Highly Capable Multimodal Models. arXiv preprint arXiv:2312.11805, 2023. doi: 10.48550/arXiv.2312.11805. URL https://arxiv.org/abs/2312.11805

  55. [55]

    Introducing Claude Sonnet 4.6, 2026

    Anthropic. Introducing Claude Sonnet 4.6, 2026. URL https://www.anthropic.com/news/claude-sonnet-4-6

  56. [56]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. DeepSeek-VL2: Mixture-of-Experts Visio...

  57. [57]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, Zhaokai Wang, Zhe Chen, Hongjie Zhang, Ganlin Yang, Haomin Wang, Qi Wei, Jinhui Yin, Wenhao Li, Erfei Cui, Guanzhou Chen, Zichen Ding, Changyao Tian, Zhenyu Wu, Jingjing Xie, Zehao Li, Bowen Yang, Yuchen Duan, Xuehui Wang, Zhi Hou,...

  58. [58]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  59. [59]

    Scaling Open-Vocabulary Object Detection

    Matthias Minderer, Alexey Gritsenko, and Neil Houlsby. Scaling Open-Vocabulary Object Detection. 2023

  60. [60]

    Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks

    Bin Xiao, Haiping Wu, Weijian Xu, Xiyang Dai, Houdong Hu, Yumao Lu, Michael Zeng, Ce Liu, and Lu Yuan. Florence-2: Advancing a Unified Representation for a Variety of Vision Tasks. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4818–4829. IEEE, 2024. doi: 10.1109/CVPR52733.2024.00461. URL https://ieeexplore.ieee.org/do...

  61. [61]

    Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. Grounding DINO: Marrying DINO with Grounded Pre-training for Open-Set Object Detection. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision – ECCV 2...

  62. [62]

    LISA: Reasoning Segmentation via Large Language Model

    Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. LISA: Reasoning Segmentation via Large Language Model. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9579–9589. IEEE, 2024. doi: 10.1109/CVPR52733.2024.00915. URL https://ieeexplore.ieee.org/document/10658574/

  63. [63]

    GLaMM: Pixel Grounding Large Multimodal Model

    Hanoona Rasheed, Muhammad Maaz, Sahal Shaji, Abdelrahman Shaker, Salman Khan, Hisham Cholakkal, Rao M. Anwer, Eric Xing, Ming-Hsuan Yang, and Fahad S. Khan. GLaMM: Pixel Grounding Large Multimodal Model. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 13009–13018. IEEE, 2024. doi: 10.1109/CVPR52733.2024.01236. URL https...

  64. [64]

    PixelLM: Pixel Reasoning with Large Multimodal Model

    Zhongwei Ren, Zhicheng Huang, Yunchao Wei, Yao Zhao, Dongmei Fu, Jiashi Feng, and Xiaojie Jin. PixelLM: Pixel Reasoning with Large Multimodal Model. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26364–26373. IEEE, 2024. doi: 10.1109/CVPR52733.2024.02491. URL https://ieeexplore.ieee.org/document/10656606/

  66. [66]

    GSVA: Generalized Segmentation via Multimodal Large Language Models

    Zhuofan Xia, Dongchen Han, Yizeng Han, Xuran Pan, Shiji Song, and Gao Huang. GSVA: Generalized Segmentation via Multimodal Large Language Models. In 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3858–3869. IEEE, 2024. doi: 10.1109/CVPR52733.2024.00370. URL https://ieeexplore.ieee.org/document/10658546/

  68. [68]

    PSALM: Pixelwise SegmentAtion with Large Multi-modal Model

    Zheng Zhang, Yeyao Ma, Enming Zhang, and Xiang Bai. PSALM: Pixelwise SegmentAtion with Large Multi-modal Model. In Aleš Leonardis, Elisa Ricci, Stefan Roth, Olga Russakovsky, Torsten Sattler, and Gül Varol, editors, Computer Vision – ECCV 2024, volume 15092, pages 74–91. Springer Nature Switzerland, 2025. doi: 10.1007/978-3-031-72754-2_5. URL https://link....

  69. [69]

    Generalized Decoding for Pixel, Image, and Language

    Xueyan Zou, Zi-Yi Dou, Jianwei Yang, Zhe Gan, Linjie Li, Chunyuan Li, Xiyang Dai, Harkirat Behl, Jianfeng Wang, Lu Yuan, Nanyun Peng, Lijuan Wang, Yong Jae Lee, and Jianfeng Gao. Generalized Decoding for Pixel, Image, and Language. 2023

  70. [70]

    Segment Everything Everywhere All at Once

    Xueyan Zou, Jianwei Yang, Hao Zhang, Feng Li, Linjie Li, Jianfeng Wang, Lijuan Wang, Jianfeng Gao, and Yong Jae Lee. Segment Everything Everywhere All at Once. 2023

  71. [71]

    SAM 3: Segment Anything with Concepts

    Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, Jie Lei, Tengyu Ma, Arpit Kalla, Markus Marks, Joseph Greer, Meng Wang, Peize Sun, Roman Rädle, Effrosyni Mavroudi, Katherine Xu, Tsung-Han Wu, Yu Zhou, Liliane Momeni, Rishi Hazra, Shuangrui Ding,...

  72. [72]

    Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks

    Tianhe Ren, Shilong Liu, Ailing Zeng, Jing Lin, Kunchang Li, He Cao, Jiayu Chen, Xinyu Huang, Yukang Chen, Feng Yan, Zhaoyang Zeng, Hao Zhang, Feng Li, Jie Yang, Hongyang Li, Qing Jiang, and Lei Zhang. Grounded SAM: Assembling Open-World Models for Diverse Visual Tasks, 2024. URL https://arxiv.org/abs/2401.14159

  73. [73]

    AgroVG is constructed from public agricultural imagery and source annotations

    Institutional review board (IRB) approvals or equivalent for research with human subjects. Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or ...