pith. machine review for the scientific record.

arxiv: 2605.03259 · v1 · submitted 2026-05-05 · 💻 cs.CV

Recognition: unknown

CropVLM: A Domain-Adapted Vision-Language Model for Open-Set Crop Analysis

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 01:30 UTC · model grok-4.3

classification 💻 cs.CV
keywords: vision-language model · crop analysis · open-set detection · zero-shot learning · plant phenotyping · domain adaptation · agricultural imaging

The pith

A vision-language model adapted to agriculture detects novel crop species from natural language descriptions without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CropVLM to address the phenotyping bottleneck in plant breeding, where manual trait measurements limit scale and introduce bias. By training a vision-language model on image-caption pairs from field crops, it aligns everyday agronomic terms with specific visual details in plant images. This alignment, combined with a hybrid localization network, supports open-set tasks where the system identifies and locates entirely new species using only text prompts. A sympathetic reader would care because it removes the need for species-by-species labeling, potentially allowing faster analysis of diverse breeding lines and broader biodiversity surveys.

Core claim

CropVLM is a vision-language model adapted via Domain-Specific Semantic Alignment on 52,987 manually selected image-caption pairs covering 37 species in natural field conditions. It maps agronomic terminology to fine-grained visual features and integrates into the Hybrid Open-Set Localization Network (HOS-Net) to detect novel crops solely from language descriptions without retraining. Evaluations show 72.51% zero-shot classification accuracy, outperforming seven CLIP-style baselines, along with 49.17 AP50 on the CVTCropDet benchmark and 50.73 AP50 on tropical fruit species.
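
As a reference point for how such a zero-shot classifier operates, here is a minimal sketch of CLIP-style classification by image-text similarity: each candidate crop name becomes a natural-language prompt, and the image is assigned to the class whose prompt embedding it most resembles. The encoders and prompt template below are hypothetical placeholders, not CropVLM's actual components.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_image(image):
    # Placeholder: a real pipeline would return CropVLM's image embedding.
    return rng.normal(size=512)

def encode_text(prompt):
    # Placeholder: a real pipeline would return CropVLM's text embedding.
    return rng.normal(size=512)

def zero_shot_classify(image, class_names):
    """Assign the image to the class whose prompt embedding is most similar."""
    img = encode_image(image)
    img = img / np.linalg.norm(img)
    txt = np.stack([encode_text(f"a field photo of a {c} plant") for c in class_names])
    txt = txt / np.linalg.norm(txt, axis=1, keepdims=True)
    sims = txt @ img                     # cosine similarity per class
    best = int(np.argmax(sims))
    return class_names[best], float(sims[best])

label, score = zero_shot_classify(None, ["wheat", "maize", "soybean"])
```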

What carries the argument

Domain-Specific Semantic Alignment (DSSA), the process that fine-tunes the vision-language model to connect agricultural terminology with detailed visual patterns in crop images.
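
The text shown here does not spell out the DSSA objective itself. For orientation, the standard starting point for aligning image-caption pairs in CLIP-style models is a symmetric contrastive (InfoNCE) loss, sketched below; this is a generic reference implementation, not a claim about CropVLM's actual training loss.

```python
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of matched image-caption pairs.

    img_emb, txt_emb: (B, D) tensors where row i of each is a matched pair.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature           # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)        # image -> its caption
    loss_t2i = F.cross_entropy(logits.t(), targets)    # caption -> its image
    return 0.5 * (loss_i2t + loss_t2i)

# Toy batch of 8 pairs with 512-dim embeddings.
loss = contrastive_alignment_loss(torch.randn(8, 512), torch.randn(8, 512))
```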

If this is right

  • Novel crop species become detectable and localizable using only textual descriptions, removing the requirement for new species-specific training data.
  • High-throughput phenotyping scales to larger and more diverse plant populations without proportional increases in manual annotation effort.
  • Breeding programs and biodiversity studies gain flexibility to analyze emerging or under-studied species on demand.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same alignment approach could be tested on related tasks such as identifying plant diseases or growth stages from descriptive text.
  • Integration with automated field imaging systems might allow continuous monitoring without repeated model updates for each new variety.
  • Performance on very distant species or under extreme environmental conditions remains an open test of how far the current alignment generalizes.

Load-bearing premise

The 52,987 image-caption pairs from 37 species supply enough variety for the alignment process to produce reliable mappings that work on arbitrary new crop species.

What would settle it

Running the full detection pipeline on images of a crop species never seen in the 37-species training set, using only a natural language description, and comparing the resulting AP50 score directly against the reported baselines.
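
For concreteness, a minimal single-class AP50 computation is sketched below, assuming VOC-style greedy matching at IoU ≥ 0.5 and COCO-style 101-point interpolation; the paper's exact evaluation protocol is not specified in the text shown here and may differ.

```python
import numpy as np

def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def ap50(detections, ground_truth):
    """AP at IoU 0.5 for one class.

    detections: list of (image_id, score, box); ground_truth: image_id -> [box].
    """
    n_gt = sum(len(v) for v in ground_truth.values())
    matched = {img: [False] * len(v) for img, v in ground_truth.items()}
    tps = []
    # Greedy matching in descending confidence order.
    for img, score, box in sorted(detections, key=lambda d: -d[1]):
        gts = ground_truth.get(img, [])
        ious = [iou(box, g) for g in gts]
        j = int(np.argmax(ious)) if ious else -1
        if j >= 0 and ious[j] >= 0.5 and not matched[img][j]:
            matched[img][j] = True
            tps.append(1)        # true positive: first match to this GT box
        else:
            tps.append(0)        # false positive: low IoU or duplicate match
    tp = np.cumsum(tps)
    recall = tp / max(n_gt, 1)
    precision = tp / (np.arange(len(tps)) + 1)
    # 101-point interpolation: average the precision envelope over recall.
    ap = 0.0
    for r in np.linspace(0, 1, 101):
        p = precision[recall >= r].max() if np.any(recall >= r) else 0.0
        ap += p / 101
    return ap

gts = {"img1": [(0, 0, 10, 10)]}
dets = [("img1", 0.9, (1, 1, 10, 10))]
print(f"AP50 = {ap50(dets, gts):.2f}")   # -> 1.00 on this toy case
```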

Figures

Figures reproduced from arXiv: 2605.03259 by Abderrahmene Boudiaf, Sajd Javed.

Figure 1. Overview of the Agri-Semantic Framework and CropVLM training methodology.
Figure 2. Comprehensive class overview of the Agri-Semantics-52k dataset, which encompasses 37 species.
Figure 3. Representative samples of dense semantic annotations from the Agri-Semantics-52k dataset.
Figure 4. Architecture of the proposed Hybrid Open-Set Localization Network (HOS-Net). For each proposal $i$, the predicted class label $c_i^{\mathrm{CL}}$ and initial classification confidence score $s_i^{\mathrm{CL}}$ are determined by maximizing similarity across all $K$ target classes: $c_i^{\mathrm{CL}} = \arg\max_{k \in \{1,\dots,K\}} S_{i,k}$ (7) and $s_i^{\mathrm{CL}} = \max_{k \in \{1,\dots,K\}} S_{i,k}$ (8). This yields semantically scored detections $\{(b_i^{\mathrm{CL}}, s_i^{\mathrm{CL}}, c_i^{\mathrm{CL}})\}_{i=1}^{M}$, where each proposal is assigned to its most similar crop class with a corresponding confidence score.
Figure 5. Qualitative comparison of detection outputs across benchmark datasets (one representative image per dataset).
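
The scoring rule in Eqs. (7) and (8) reduces to a row-wise argmax and max over the proposal-class similarity matrix; a minimal sketch, assuming the $(M, K)$ matrix $S$ between proposals and text-described classes has already been computed:

```python
import numpy as np

def score_proposals(S):
    """Eqs. (7)-(8): S is an (M, K) similarity matrix between M region
    proposals and K language-described target classes."""
    c = S.argmax(axis=1)   # Eq. (7): most similar class per proposal
    s = S.max(axis=1)      # Eq. (8): that similarity as the confidence score
    return c, s

# Three proposals scored against four crop classes.
S = np.array([[0.2, 0.7, 0.1, 0.4],
              [0.9, 0.3, 0.2, 0.1],
              [0.1, 0.2, 0.6, 0.5]])
labels, scores = score_proposals(S)  # labels -> [1, 0, 2], scores -> [0.7, 0.9, 0.6]
```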
read the original abstract

High-throughput plant phenotyping, the quantitative measurement of observable plant traits, is critical for modern breeding but remains constrained by a "phenotyping bottleneck," where manual data collection is labor-intensive and prone to observer bias. Conventional closed-set computer vision systems fail to address this challenge, as they require extensive species-specific annotation and lack the flexibility to handle diverse breeding populations. To bridge this gap, we present CropVLM, a Vision-Language Model (VLM) adapted for the agricultural domain via Domain-Specific Semantic Alignment (DSSA). Trained on 52,987 manually selected image-caption pairs covering 37 species in natural field conditions, CropVLM effectively maps agronomic terminology to fine-grained visual features. We further introduce the Hybrid Open-Set Localization Network (HOS-Net), an architecture that integrates CropVLM to enable the detection of novel crops solely from natural language descriptions without retraining. By eliminating the reliance on species-specific training data, CropVLM provides a scalable solution for high-throughput phenotyping, accelerating genetic gain and facilitating large-scale biodiversity research essential for sustainable agriculture. The trained model weights and complete pipeline implementation are publicly available at: https://github.com/boudiafA/CropVLM. In comprehensive evaluations, CropVLM achieves 72.51% zero-shot classification accuracy, outperforming seven CLIP-style baselines. Our detection pipeline demonstrates superior zero-shot generalization to novel species, achieving 49.17 AP50 on our CVTCropDet benchmark and 50.73 AP50 on tropical fruit species, compared to 34.89 and 48.58 for the next-best method, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CropVLM, a vision-language model adapted to the agricultural domain via Domain-Specific Semantic Alignment (DSSA) trained on 52,987 manually selected image-caption pairs covering 37 species under natural field conditions. It further proposes the Hybrid Open-Set Localization Network (HOS-Net) that integrates CropVLM to enable zero-shot detection of novel crops from natural language descriptions without retraining. The paper reports 72.51% zero-shot classification accuracy outperforming seven CLIP-style baselines, along with AP50 scores of 49.17 on the CVTCropDet benchmark and 50.73 on tropical fruit species, exceeding the next-best methods (34.89 and 48.58, respectively). Model weights and code are released publicly.

Significance. If the generalization claims hold, the work could meaningfully advance high-throughput phenotyping by reducing reliance on species-specific annotations, with direct relevance to breeding programs and biodiversity monitoring. The public release of trained weights and the full pipeline is a clear strength that supports reproducibility. The significance is limited, however, by the absence of quantitative characterization of training data diversity and coverage, which is central to validating open-set performance on arbitrary novel crops.

major comments (3)
  1. [Abstract and Methods] The claim that DSSA on the 37-species set produces reliable agronomic-to-visual mappings for zero-shot generalization to arbitrary novel crops is load-bearing, yet no quantitative diversity metrics (botanical families represented, growth-stage coverage, geographic or environmental variation) or ablation on training species count are provided.
  2. [Experiments] The reported 72.51% zero-shot accuracy and AP50 improvements lack error bars, statistical significance tests, or details on training procedure, data selection criteria, and potential selection bias from manual curation of the 52,987 pairs, preventing verification that the gains are robust rather than dataset-specific.
  3. [Experiments] No evaluation on crop species outside the 37-species training distribution or failure-case analysis for HOS-Net on truly novel inputs is presented, leaving the open-set detection claim (49.17/50.73 AP50) without direct support for generalization beyond the tested set.
minor comments (2)
  1. [Abstract] The phrase 'comprehensive evaluations' is used without enumerating all benchmarks or providing a high-level overview of the evaluation protocol.
  2. [Overall] Verify that the GitHub repository includes complete training scripts, dataset curation details, and any preprocessing code to enable full reproduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We appreciate the emphasis on strengthening the evidence for our generalization claims and have addressed each major comment below with specific revisions to the manuscript.

read point-by-point responses
  1. Referee: [Abstract and Methods] The claim that DSSA on the 37-species set produces reliable agronomic-to-visual mappings for zero-shot generalization to arbitrary novel crops is load-bearing, yet no quantitative diversity metrics (botanical families represented, growth-stage coverage, geographic or environmental variation) or ablation on training species count are provided.

    Authors: We agree that quantitative characterization of the training data is essential to support the open-set claims. In the revised manuscript, we will add a dedicated subsection in Methods with a table reporting the botanical families represented across the 37 species, the distribution of growth stages in the 52,987 image-caption pairs, and available geographic/environmental metadata from the source datasets. We will also include an ablation study showing zero-shot accuracy as a function of the number of training species. revision: yes

  2. Referee: [Experiments] The reported 72.51% zero-shot accuracy and AP50 improvements lack error bars, statistical significance tests, or details on training procedure, data selection criteria, and potential selection bias from manual curation of the 52,987 pairs, preventing verification that the gains are robust rather than dataset-specific.

    Authors: We acknowledge that the current reporting limits independent verification. We will revise the Experiments section to report standard deviations over five random seeds for all accuracy and AP50 figures, include paired statistical significance tests against the seven baselines, expand the training procedure description with all hyperparameters and optimization details, and add explicit criteria for the manual curation process along with a discussion of potential selection bias and mitigation steps (a sketch of this style of reporting appears after this list). revision: yes

  3. Referee: [Experiments] No evaluation on crop species outside the 37-species training distribution or failure-case analysis for HOS-Net on truly novel inputs is presented, leaving the open-set detection claim (49.17/50.73 AP50) without direct support for generalization beyond the tested set.

    Authors: The CVTCropDet benchmark and tropical fruit species evaluations were constructed with species disjoint from the 37-species training set; we will add an explicit table in the revised Experiments section listing training versus test species to make this clear. We will also include a new failure-case analysis subsection with both qualitative examples of challenging novel inputs and quantitative performance breakdowns on the most dissimilar novel species (a sketch of the disjointness check appears after this list). revision: yes
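
As a sketch of the seed-and-significance reporting promised in response 2: mean ± standard deviation over seeds plus a paired test, since each seed yields one matched (CropVLM, baseline) measurement. The accuracy values below are illustrative placeholders, not the paper's numbers.

```python
import numpy as np
from scipy import stats

cropvlm_acc = np.array([72.4, 72.8, 72.3, 72.6, 72.5])   # five seeds (hypothetical)
baseline_acc = np.array([68.1, 68.5, 67.9, 68.3, 68.2])  # same seeds, same splits

print(f"CropVLM:  {cropvlm_acc.mean():.2f} ± {cropvlm_acc.std(ddof=1):.2f}")
print(f"Baseline: {baseline_acc.mean():.2f} ± {baseline_acc.std(ddof=1):.2f}")

# Paired t-test: each seed gives one matched (CropVLM, baseline) pair.
t, p = stats.ttest_rel(cropvlm_acc, baseline_acc)
print(f"paired t = {t:.2f}, p = {p:.4f}")
```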
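And a minimal version of the species-disjointness check promised in response 3; the species names are hypothetical stand-ins for the actual training and benchmark lists.

```python
# Verify that no benchmark species appears in the training set, which the
# zero-shot claim requires. Names here are illustrative placeholders.
train_species = {"wheat", "maize", "soybean", "tomato"}    # ... 37 in total
test_species = {"rambutan", "mangosteen", "dragon fruit"}  # novel benchmark

overlap = train_species & test_species
assert not overlap, f"zero-shot claim violated by overlap: {sorted(overlap)}"
print(f"{len(train_species)} train vs {len(test_species)} test species, disjoint")
```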

Circularity Check

0 steps flagged

No circularity; empirical training and benchmark comparisons are self-contained

full rationale

The paper describes training CropVLM on a fixed dataset of 52,987 image-caption pairs via Domain-Specific Semantic Alignment, then reports zero-shot classification accuracy (72.51%) and open-set detection AP50 scores on CVTCropDet and tropical fruit benchmarks, with direct numerical comparisons to seven external CLIP-style baselines. No equations, derivations, fitted parameters renamed as predictions, self-citations, or uniqueness theorems appear in the provided text. All load-bearing claims rest on external benchmark results rather than internal reductions to the training inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

An abstract-only review surfaces no explicit free parameters, axioms, or invented entities beyond standard machine-learning assumptions about data representativeness and generalization; the central claims rest on the unverified premise that the curated training set enables open-set performance.

pith-pipeline@v0.9.0 · 5617 in / 1320 out tokens · 32768 ms · 2026-05-08T01:30:57.307396+00:00 · methodology

