pith. machine review for the scientific record.

arxiv: 2605.03259 · v1 · submitted 2026-05-05 · 💻 cs.CV

Recognition: unknown

CropVLM: A Domain-Adapted Vision-Language Model for Open-Set Crop Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 01:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords: vision-language model · crop analysis · open-set detection · zero-shot learning · plant phenotyping · domain adaptation · agricultural imaging

The pith

A vision-language model adapted to agriculture detects novel crop species from natural language descriptions without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CropVLM to address the phenotyping bottleneck in plant breeding, where manual trait measurements limit scale and introduce bias. By training a vision-language model on image-caption pairs from field crops, it aligns everyday agronomic terms with specific visual details in plant images. This alignment, combined with a hybrid localization network, supports open-set tasks where the system identifies and locates entirely new species using only text prompts. A sympathetic reader would care because it removes the need for species-by-species labeling, potentially allowing faster analysis of diverse breeding lines and broader biodiversity surveys.

Core claim

CropVLM is a vision-language model adapted via Domain-Specific Semantic Alignment on 52,987 manually selected image-caption pairs covering 37 species in natural field conditions. It maps agronomic terminology to fine-grained visual features and integrates into the Hybrid Open-Set Localization Network (HOS-Net) to detect novel crops solely from language descriptions without retraining. Evaluations show 72.51% zero-shot classification accuracy, outperforming seven CLIP-style baselines, along with 49.17 AP50 on the CVTCropDet benchmark and 50.73 AP50 on tropical fruit species.
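
As a reference point for how such a zero-shot classifier operates, here is a minimal sketch of CLIP-style classification by image-text similarity: each candidate crop name becomes a natural-language prompt, and the image is assigned to the class whose prompt embedding it most resembles. The encoders and prompt template below are hypothetical placeholders, not CropVLM's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image):
    # Placeholder: a real pipeline would return CropVLM's image embedding.
    return rng.normal(size=512)

def encode_text(prompt):
    # Placeholder: a real pipeline would return CropVLM's text embedding.
    return rng.normal(size=512)

def zero_shot_classify(image, class_names):
    """Assign the image to the class whose prompt embedding is most similar."""
    img = encode_image(image)
    img = img / np.linalg.norm(img)
    txt = np.stack([encode_text(f"a field photo of a {c} plant") for c in class_names])
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sims = txt @ img                     # cosine similarity per class
    best = int(np.argmax(sims))
    return class_names[best], float(sims[best])

label, score = zero_shot_classify(None, ["wheat", "maize", "soybean"])
```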

What carries the argument

Domain-Specific Semantic Alignment (DSSA), the process that fine-tunes the vision-language model to connect agricultural terminology with detailed visual patterns in crop images.
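
The text shown here does not spell out the DSSA objective itself. For orientation, the standard starting point for aligning image-caption pairs in CLIP-style models is a symmetric contrastive (InfoNCE) loss, sketched below; this is a generic reference implementation, not a claim about CropVLM's actual training loss.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-caption pairs.

    img_emb, txt_emb: (B, D) tensors where row i of each is a matched pair.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)    # caption -> its image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch of 8 pairs with 512-dim embeddings.
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```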

If this is right

  • Novel crop species become detectable and localizable using only textual descriptions, removing the requirement for new species-specific training data.
  • High-throughput phenotyping scales to larger and more diverse plant populations without proportional increases in manual annotation effort.
  • Breeding programs and biodiversity studies gain flexibility to analyze emerging or under-studied species on demand.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same alignment approach could be tested on related tasks such as identifying plant diseases or growth stages from descriptive text.
  • Integration with automated field imaging systems might allow continuous monitoring without repeated model updates for each new variety.
  • Performance on very distant species or under extreme environmental conditions remains an open test of how far the current alignment generalizes.

Load-bearing premise

The 52,987 image-caption pairs from 37 species supply enough variety for the alignment process to produce reliable mappings that work on arbitrary new crop species.

What would settle it

Running the full detection pipeline on images of a crop species never seen in the 37-species training set, using only a natural language description, and comparing the resulting AP50 score directly against the reported baselines.
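
For concreteness, a minimal single-class AP50 computation is sketched below, assuming VOC-style greedy matching at IoU ≥ 0.5 and COCO-style 101-point interpolation; the paper's exact evaluation protocol is not specified in the text shown here and may differ.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ap50(detections, ground_truth):
    """AP at IoU 0.5 for one class.

    detections: list of (image_id, score, box); ground_truth: image_id -> [box].
    """
    n_gt = sum(len(v) for v in ground_truth.values())
    matched = {img: [False] * len(v) for img, v in ground_truth.items()}
    tps = []
    # Greedy matching in descending confidence order.
    for img, score, box in sorted(detections, key=lambda d: -d[1]):
        gts = ground_truth.get(img, [])
        ious = [iou(box, g) for g in gts]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= 0.5 and not matched[img][j]:
            matched[img][j] = True
            tps.append(1)        # true positive: first match to this GT box
        else:
            tps.append(0)        # false positive: low IoU or duplicate match
    tp = np.cumsum(tps)
    recall = tp / max(n_gt, 1)
    precision = tp / (np.arange(len(tps)) + 1)
    # 101-point interpolation: average the precision envelope over recall.
    ap = 0.0
    for r in np.linspace(0, 1, 101):
        p = precision[recall >= r].max() if np.any(recall >= r) else 0.0
        ap += p / 101
    return ap

gts = {"img1": [(0, 0, 10, 10)]}
dets = [("img1", 0.9, (1, 1, 10, 10))]
print(f"AP50 = {ap50(dets, gts):.2f}")   # -> 1.00 on this toy case
```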

Figures

Figures reproduced from arXiv: 2605.03259 by Abderrahmene Boudiaf, Sajd Javed.

Figure 1. Overview of the Agri-Semantic Framework and CropVLM training methodology.
Figure 2. Comprehensive class overview of the Agri-Semantics-52k dataset, which encompasses 37 species.
Figure 3. Representative samples of dense semantic annotations from the Agri-Semantics-52k dataset.
Figure 4. Architecture of the proposed Hybrid Open-Set Localization Network (HOS-Net). For each proposal $i$, the predicted class label $c_i^{\mathrm{CL}}$ and initial classification confidence score $s_i^{\mathrm{CL}}$ are determined by maximizing similarity across all $K$ target classes: $c_i^{\mathrm{CL}} = \arg\max_{k \in \{1,\dots,K\}} S_{i,k}$ (7) and $s_i^{\mathrm{CL}} = \max_{k \in \{1,\dots,K\}} S_{i,k}$ (8). This yields semantically scored detections $\{(b_i^{\mathrm{CL}}, s_i^{\mathrm{CL}}, c_i^{\mathrm{CL}})\}_{i=1}^{M}$, where each proposal is assigned to its most similar crop class with a corresponding confidence score.
Figure 5. Qualitative comparison of detection outputs across benchmark datasets (one representative image per dataset).
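
The scoring rule in Eqs. (7) and (8) reduces to a row-wise argmax and max over the proposal-class similarity matrix; a minimal sketch, assuming the $(M, K)$ matrix $S$ between proposals and text-described classes has already been computed:

```python
import numpy as np

def score_proposals(S):
    """Eqs. (7)-(8): S is an (M, K) similarity matrix between M region
    proposals and K language-described target classes."""
    c = S.argmax(axis=1)   # Eq. (7): most similar class per proposal
    s = S.max(axis=1)      # Eq. (8): that similarity as the confidence score
    return c, s

# Three proposals scored against four crop classes.
S = np.array([[0.2, 0.7, 0.1, 0.4],
              [0.9, 0.3, 0.2, 0.1],
              [0.1, 0.2, 0.6, 0.5]])
labels, scores = score_proposals(S)  # labels -> [1, 0, 2], scores -> [0.7, 0.9, 0.6]
```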
read the original abstract

High-throughput plant phenotyping, the quantitative measurement of observable plant traits, is critical for modern breeding but remains constrained by a "phenotyping bottleneck," where manual data collection is labor-intensive and prone to observer bias. Conventional closed-set computer vision systems fail to address this challenge, as they require extensive species-specific annotation and lack the flexibility to handle diverse breeding populations. To bridge this gap, we present CropVLM, a Vision-Language Model (VLM) adapted for the agricultural domain via Domain-Specific Semantic Alignment (DSSA). Trained on 52,987 manually selected image-caption pairs covering 37 species in natural field conditions, CropVLM effectively maps agronomic terminology to fine-grained visual features. We further introduce the Hybrid Open-Set Localization Network (HOS-Net), an architecture that integrates CropVLM to enable the detection of novel crops solely from natural language descriptions without retraining. By eliminating the reliance on species-specific training data, CropVLM provides a scalable solution for high-throughput phenotyping, accelerating genetic gain and facilitating large-scale biodiversity research essential for sustainable agriculture. The trained model weights and complete pipeline implementation are publicly available at: https://github.com/boudiafA/CropVLM. In comprehensive evaluations, CropVLM achieves 72.51% zero-shot classification accuracy, outperforming seven CLIP-style baselines. Our detection pipeline demonstrates superior zero-shot generalization to novel species, achieving 49.17 AP50 on our CVTCropDet benchmark and 50.73 AP50 on tropical fruit species, compared to 34.89 and 48.58 for the next-best method, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CropVLM, a vision-language model adapted to the agricultural domain via Domain-Specific Semantic Alignment (DSSA) trained on 52,987 manually selected image-caption pairs covering 37 species under natural field conditions. It further proposes the Hybrid Open-Set Localization Network (HOS-Net) that integrates CropVLM to enable zero-shot detection of novel crops from natural language descriptions without retraining. The paper reports 72.51% zero-shot classification accuracy outperforming seven CLIP-style baselines, along with AP50 scores of 49.17 on the CVTCropDet benchmark and 50.73 on tropical fruit species, exceeding the next-best methods (34.89 and 48.58, respectively). Model weights and code are released publicly.

Significance. If the generalization claims hold, the work could meaningfully advance high-throughput phenotyping by reducing reliance on species-specific annotations, with direct relevance to breeding programs and biodiversity monitoring. The public release of trained weights and the full pipeline is a clear strength that supports reproducibility. The significance is limited, however, by the absence of quantitative characterization of training data diversity and coverage, which is central to validating open-set performance on arbitrary novel crops.

major comments (3)
  1. [Abstract and Methods] The claim that DSSA on the 37-species set produces reliable agronomic-to-visual mappings for zero-shot generalization to arbitrary novel crops is load-bearing, yet no quantitative diversity metrics (botanical families represented, growth-stage coverage, geographic or environmental variation) or ablation on training species count are provided.
  2. [Experiments] The reported 72.51% zero-shot accuracy and AP50 improvements lack error bars, statistical significance tests, or details on training procedure, data selection criteria, and potential selection bias from manual curation of the 52,987 pairs, preventing verification that the gains are robust rather than dataset-specific.
  3. [Experiments] No evaluation on crop species outside the 37-species training distribution or failure-case analysis for HOS-Net on truly novel inputs is presented, leaving the open-set detection claim (49.17/50.73 AP50) without direct support for generalization beyond the tested set.
minor comments (2)
  1. [Abstract] The phrase 'comprehensive evaluations' is used without enumerating all benchmarks or providing a high-level overview of the evaluation protocol.
  2. [Overall] Verify that the GitHub repository includes complete training scripts, dataset curation details, and any preprocessing code to enable full reproduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We appreciate the emphasis on strengthening the evidence for our generalization claims and have addressed each major comment below with specific revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Methods] The claim that DSSA on the 37-species set produces reliable agronomic-to-visual mappings for zero-shot generalization to arbitrary novel crops is load-bearing, yet no quantitative diversity metrics (botanical families represented, growth-stage coverage, geographic or environmental variation) or ablation on training species count are provided.

    Authors: We agree that quantitative characterization of the training data is essential to support the open-set claims. In the revised manuscript, we will add a dedicated subsection in Methods with a table reporting the botanical families represented across the 37 species, the distribution of growth stages in the 52,987 image-caption pairs, and available geographic/environmental metadata from the source datasets. We will also include an ablation study showing zero-shot accuracy as a function of the number of training species. revision: yes

  2. Referee: [Experiments] The reported 72.51% zero-shot accuracy and AP50 improvements lack error bars, statistical significance tests, or details on training procedure, data selection criteria, and potential selection bias from manual curation of the 52,987 pairs, preventing verification that the gains are robust rather than dataset-specific.

    Authors: We acknowledge that the current reporting limits independent verification. We will revise the Experiments section to report standard deviations over five random seeds for all accuracy and AP50 figures, include paired statistical significance tests against the seven baselines, expand the training procedure description with all hyperparameters and optimization details, and add explicit criteria for the manual curation process along with a discussion of potential selection bias and mitigation steps (a sketch of this style of reporting appears after this list). revision: yes

  3. Referee: [Experiments] No evaluation on crop species outside the 37-species training distribution or failure-case analysis for HOS-Net on truly novel inputs is presented, leaving the open-set detection claim (49.17/50.73 AP50) without direct support for generalization beyond the tested set.

    Authors: The CVTCropDet benchmark and tropical fruit species evaluations were constructed with species disjoint from the 37-species training set; we will add an explicit table in the revised Experiments section listing training versus test species to make this clear. We will also include a new failure-case analysis subsection with both qualitative examples of challenging novel inputs and quantitative performance breakdowns on the most dissimilar novel species (a sketch of the disjointness check appears after this list). revision: yes
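
As a sketch of the seed-and-significance reporting promised in response 2: mean ± standard deviation over seeds plus a paired test, since each seed yields one matched (CropVLM, baseline) measurement. The accuracy values below are illustrative placeholders, not the paper's numbers.

```python
import numpy as np
from scipy import stats

cropvlm_acc = np.array([72.4, 72.8, 72.3, 72.6, 72.5])   # five seeds (hypothetical)
baseline_acc = np.array([68.1, 68.5, 67.9, 68.3, 68.2])  # same seeds, same splits

print(f"CropVLM:  {cropvlm_acc.mean():.2f} ± {cropvlm_acc.std(ddof=1):.2f}")
print(f"Baseline: {baseline_acc.mean():.2f} ± {baseline_acc.std(ddof=1):.2f}")

# Paired t-test: each seed gives one matched (CropVLM, baseline) pair.
t, p = stats.ttest_rel(cropvlm_acc, baseline_acc)
print(f"paired t = {t:.2f}, p = {p:.4f}")
```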
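And a minimal version of the species-disjointness check promised in response 3; the species names are hypothetical stand-ins for the actual training and benchmark lists.

```python
# Verify that no benchmark species appears in the training set, which the
# zero-shot claim requires. Names here are illustrative placeholders.
train_species = {"wheat", "maize", "soybean", "tomato"}    # ... 37 in total
test_species = {"rambutan", "mangosteen", "dragon fruit"}  # novel benchmark

overlap = train_species & test_species
assert not overlap, f"zero-shot claim violated by overlap: {sorted(overlap)}"
print(f"{len(train_species)} train vs {len(test_species)} test species, disjoint")
```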

Circularity Check

0 steps flagged

No circularity; empirical training and benchmark comparisons are self-contained

full rationale

The paper describes training CropVLM on a fixed dataset of 52,987 image-caption pairs via Domain-Specific Semantic Alignment, then reports zero-shot classification accuracy (72.51%) and open-set detection AP50 scores on CVTCropDet and tropical fruit benchmarks, with direct numerical comparisons to seven external CLIP-style baselines. No equations, derivations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. All load-bearing claims rest on external benchmark results rather than internal reductions to the training inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

An abstract-only review surfaces no explicit free parameters, axioms, or invented entities beyond standard machine-learning assumptions about data representativeness and generalization; the central claims rest on the unverified premise that the curated training set enables open-set performance.

pith-pipeline@v0.9.0 · 5617 in / 1320 out tokens · 32768 ms · 2026-05-08T01:30:57.307396+00:00 · methodology

