pith. machine review for the scientific record.

arxiv: 2604.09907 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.AI · cs.CL

Recognition: unknown

From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping

Yu Wu, Guangzeng Han, Ibra Niang Niang, Francia Ravelombola, Maiara Oliveira, Jason Davis, Dong Chen, Feng Lin, Xiaolei Huang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL
keywords plant phenotyping · multimodal LLMs · UAV imagery · soybean · cotton · agronomic reasoning · vision-language models · benchmark

The pith

A benchmark built from 385 UAV images shows that task-specific fine-tuning lifts vision-language model accuracy on soybean and cotton phenotyping tasks to as high as 78 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PlantXpert to measure how well current vision-language models can turn drone photos of soybean and cotton into reliable agronomic judgments about disease, pests, weeds, and yield. The authors built 385 images into more than 3,000 structured samples that test visual identification, number-based reasoning, and multi-step biological logic. When eleven leading models were run on the benchmark, task-specific fine-tuning produced clear accuracy gains, reaching 78 percent for the strongest fine-tuned versions, yet further increases in model size added little and performance differed noticeably between the two crops. The work matters because crop breeding programs need fast, repeatable ways to score thousands of plots; a benchmark that exposes where models still fail can guide the next round of adaptation.
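
To make the task format concrete, one benchmark item might be stored as a record like the sketch below. The schema and field names are illustrative assumptions; the paper's released data format may differ.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one PlantXpert-style sample; field names are
# assumptions for illustration, not taken from the paper's data release.
@dataclass
class PhenotypingSample:
    image_path: str          # UAV image the question is grounded in
    crop: str                # "soybean" or "cotton"
    domain: str              # "disease", "pest", "weed", or "yield"
    reasoning_type: str      # "visual", "quantitative", or "multi_step"
    question: str
    choices: list[str] = field(default_factory=list)  # empty for open-ended
    answer: str = ""

sample = PhenotypingSample(
    image_path="uav/plot_0142.png",
    crop="soybean",
    domain="disease",
    reasoning_type="visual",
    question="Which foliar disease is most consistent with the lesions visible in this plot?",
    choices=["Frogeye leaf spot", "Septoria brown spot", "Downy mildew", "No disease"],
    answer="Frogeye leaf spot",
)
```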

Core claim

We constructed PlantXpert as an evidence-grounded multimodal reasoning benchmark containing 385 digital UAV images and over 3,000 question-answer pairs that span disease identification, pest and weed management, and yield-related traits in soybean and cotton. Evaluation of eleven state-of-the-art vision-language models demonstrates that task-specific fine-tuning produces substantial accuracy improvements, with models such as Qwen3-VL-4B and Qwen3-VL-30B reaching up to 78 percent, while gains from further model scaling diminish beyond a certain capacity, generalization between soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continue to pose major challenges.

What carries the argument

PlantXpert benchmark, a structured dataset and evaluation framework that pairs UAV imagery with questions requiring visual expertise, quantitative reasoning, and multi-step agronomic judgment across disease, pest, weed, and yield domains.
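
A minimal sketch of how such a framework could score a model on the multiple-choice portion, overall and per domain, follows. Here `model_answer` is a stand-in for whatever inference call wraps the VLM under test, and `PhenotypingSample` is the hypothetical record sketched earlier; this is an illustration, not the authors' harness.

```python
from collections import defaultdict

def evaluate(model_answer, samples):
    """Score a VLM on multiple-choice samples, overall and per domain.

    model_answer(image_path, question, choices) -> str is a placeholder
    for the inference call of the model under test.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        pred = model_answer(s.image_path, s.question, s.choices)
        total[s.domain] += 1
        correct[s.domain] += int(pred.strip() == s.answer)
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    per_domain = {d: correct[d] / total[d] for d in total}
    return overall, per_domain
```

Per-domain accuracies from a loop like this are what would expose the uneven soybean-to-cotton transfer the paper reports.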

If this is right

  • Task-specific fine-tuning raises accuracy on phenotyping questions substantially compared with base models.
  • Accuracy gains from increasing model size plateau after a moderate capacity threshold.
  • Performance on soybean tasks does not reliably transfer to cotton and vice versa.
  • Quantitative calculations and biologically reasoned explanations remain the hardest categories for all tested models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Breeding programs could insert fine-tuned models into routine UAV scouting pipelines to score large numbers of plots faster than manual teams.
  • Separate fine-tuning runs for each crop appear necessary until cross-crop generalization improves.
  • Adding numeric sensor readings alongside images might reduce the errors still seen in quantitative questions.
  • The benchmark format can be reused to track progress as new vision-language models are released.

Load-bearing premise

The 385 images and 3,000 benchmark samples capture the full range of real-world visual and agronomic reasoning demands in soybean and cotton phenotyping without selection bias or oversimplification.

What would settle it

Run the fine-tuned Qwen3-VL-30B model on a fresh collection of UAV images taken from commercial soybean and cotton fields not used in the original dataset, then compare its answers on disease severity, pest counts, and yield estimates against independent ratings by trained agronomists.
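
One way to quantify that comparison is chance-corrected agreement between model outputs and agronomist labels, for instance Cohen's kappa over paired categorical ratings. The sketch below is illustrative, not a procedure the paper prescribes.

```python
from collections import Counter

def cohens_kappa(model_labels, expert_labels):
    """Chance-corrected agreement between model and expert categorical ratings."""
    assert len(model_labels) == len(expert_labels) and model_labels
    n = len(model_labels)
    observed = sum(m == e for m, e in zip(model_labels, expert_labels)) / n
    model_freq = Counter(model_labels)
    expert_freq = Counter(expert_labels)
    # Agreement expected if the two raters labeled independently
    # at their own marginal frequencies.
    expected = sum(model_freq[c] * expert_freq[c] for c in model_freq) / (n * n)
    if expected == 1.0:  # degenerate case: both raters used a single label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# e.g., hypothetical disease-severity classes from fresh validation plots
print(cohens_kappa(["low", "high", "high", "mid"], ["low", "high", "mid", "mid"]))
```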

Figures

Figures reproduced from arXiv: 2604.09907 by Dong Chen, Feng Lin, Francia Ravelombola, Guangzeng Han, Ibra Niang Niang, Jason Davis, Maiara Oliveira, Xiaolei Huang, Yu Wu.

Figure 1
Figure 1: An overview of the benchmark construction pipeline, covering soybean and cotton research studies published between 2016 and 2025 using a shared keyword set. view at source ↗
Figure 2
Figure 2: Representative evidence-grounded multiple-choice data samples from PlantXpert. Each sample is annotated with a specific agricultural domain and primary reasoning challenge. view at source ↗
Figure 3
Figure 3: Representative error cases of multiple-choice questions that the models answered incorrectly. view at source ↗
read the original abstract

To improve crop genetics, high-throughput, effective and comprehensive phenotyping is a critical prerequisite. While such tasks were traditionally performed manually, recent advances in multimodal foundation models, especially in vision-language models (VLMs), have enabled more automated and robust phenotypic analysis. However, plant science remains a particularly challenging domain for foundation models because it requires domain-specific knowledge, fine-grained visual interpretation, and complex biological and agronomic reasoning. To address this gap, we develop PlantXpert, an evidence-grounded multimodal reasoning benchmark for soybean and cotton phenotyping. Our benchmark provides a structured and reproducible framework for agronomic adaptation of VLMs, and enables controlled comparison between base models and their domain-adapted counterparts. We constructed a dataset comprising 385 digital images and more than 3,000 benchmark samples spanning key plant science domains including disease, pest control, weed management, and yield. The benchmark can assess diverse capabilities including visual expertise, quantitative reasoning, and multi-step agronomic reasoning. A total of 11 state-of-the-art VLMs were evaluated. The results indicate that task-specific fine-tuning leads to substantial improvement in accuracy, with models such as Qwen3-VL-4B and Qwen3-VL-30B achieving up to 78%. At the same time, gains from model scaling diminish beyond a certain capacity, generalization across soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continue to pose substantial challenges. These findings suggest that PlantXpert can serve as a foundation for assessing evidence-grounded agronomic reasoning and for advancing multimodal model development in plant science.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PlantXpert, a multimodal benchmark for agronomic reasoning from UAV imagery in soybean and cotton phenotyping. It comprises 385 images and over 3,000 question-answer samples spanning disease, pest, weed, and yield domains. The work evaluates 11 VLMs, reports that task-specific fine-tuning yields substantial gains (up to 78% accuracy for Qwen3-VL variants), observes that scaling benefits plateau beyond a capacity threshold, notes uneven cross-crop generalization, and identifies persistent weaknesses in quantitative and biologically grounded multi-step reasoning.

Significance. If the benchmark construction and evaluation protocol prove robust, the work would provide a valuable, domain-specific resource for measuring progress in vision-language models applied to plant science. The empirical observations on fine-tuning efficacy versus scaling limits and reasoning bottlenecks could usefully inform targeted adaptation strategies in agricultural AI.

major comments (3)
  1. [Dataset Construction] Dataset section: The manuscript supplies no details on UAV image acquisition parameters (flight altitude, lighting conditions, sensor type), the exact process used to generate the 3,000 benchmark questions and ground-truth answers, or any inter-annotator agreement statistics. These omissions are load-bearing because the headline claims of fine-tuning improvement and remaining reasoning deficits rest on the assumption that the 385-image set faithfully samples real-world phenotyping variability.
  2. [Experiments] Experiments and Evaluation sections: Performance figures (e.g., 78% accuracy) are presented without specifying the answer format (open-ended vs. multiple-choice), the exact scoring procedure for free-form VLM outputs, or any statistical significance testing for the reported gains from fine-tuning versus base models. This prevents verification of the central empirical claims.
  3. [Results] Results and Analysis: The claim of uneven generalization between soybean and cotton, and the specific difficulties in quantitative reasoning, are stated at a high level but lack per-category breakdowns, error analysis, or ablation studies isolating which fine-tuning data elements drive the observed improvements. Without these, the interpretation of where models still fail remains under-supported.
minor comments (2)
  1. [Abstract] The abstract states 'more than 3,000 benchmark samples'; the main text should report the precise count and the distribution across the four phenotyping domains.
  2. [Figures] Figure captions and example question presentations should include explicit indications of image resolution, crop stage, and question type to help readers assess visual and reasoning difficulty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional information will improve reproducibility and strengthen the empirical claims. We will revise the manuscript accordingly and provide point-by-point responses below.

read point-by-point responses
  1. Referee: [Dataset Construction] Dataset section: The manuscript supplies no details on UAV image acquisition parameters (flight altitude, lighting conditions, sensor type), the exact process used to generate the 3,000 benchmark questions and ground-truth answers, or any inter-annotator agreement statistics. These omissions are load-bearing because the headline claims of fine-tuning improvement and remaining reasoning deficits rest on the assumption that the 385-image set faithfully samples real-world phenotyping variability.

    Authors: We acknowledge the omission of these details in the original manuscript. In the revised version, we will expand the Dataset Construction section to specify UAV flight parameters (altitude, sensor type, lighting conditions), the multi-stage annotation pipeline (including how questions were generated by domain experts and ground-truth answers verified), and inter-annotator agreement statistics. These additions will directly support the validity of the benchmark and the reported performance differences. revision: yes

  2. Referee: [Experiments] Experiments and Evaluation sections: Performance figures (e.g., 78% accuracy) are presented without specifying the answer format (open-ended vs. multiple-choice), the exact scoring procedure for free-form VLM outputs, or any statistical significance testing for the reported gains from fine-tuning versus base models. This prevents verification of the central empirical claims.

    Authors: We agree that these protocol details are essential. The benchmark mixes multiple-choice and open-ended questions; we will explicitly state the formats, describe the scoring method (exact match for multiple-choice and expert-judged semantic equivalence for open-ended responses), and add statistical significance testing (e.g., McNemar's test; a minimal sketch of such a test appears after this list) for fine-tuning gains versus base models in the revised Experiments and Evaluation sections. revision: yes

  3. Referee: [Results] Results and Analysis: The claim of uneven generalization between soybean and cotton, and the specific difficulties in quantitative reasoning, are stated at a high level but lack per-category breakdowns, error analysis, or ablation studies isolating which fine-tuning data elements drive the observed improvements. Without these, the interpretation of where models still fail remains under-supported.

    Authors: The referee correctly identifies the need for greater granularity. We will add per-category accuracy tables (disease, pest, weed, yield), representative error analyses with examples, and ablation studies on fine-tuning data components in the revised Results and Analysis section. These will provide stronger evidence for the reported patterns in cross-crop generalization and reasoning limitations. revision: yes
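
For the significance testing proposed in response 2, a continuity-corrected McNemar test on paired per-item correctness could be computed as in the sketch below. This is an illustration under that assumption, not the authors' evaluation code.

```python
from math import erf, sqrt

def mcnemar(base_correct, ft_correct):
    """Continuity-corrected McNemar test on paired per-sample correctness.

    base_correct and ft_correct are parallel lists of booleans: whether the
    base and fine-tuned model answered each benchmark item correctly.
    """
    b = sum(bc and not fc for bc, fc in zip(base_correct, ft_correct))  # base right, FT wrong
    c = sum(fc and not bc for bc, fc in zip(base_correct, ft_correct))  # FT right, base wrong
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a 1-degree-of-freedom chi-square via the normal CDF.
    p = 2 * (1 - 0.5 * (1 + erf(sqrt(stat) / sqrt(2))))
    return stat, p
```

Only the discordant pairs (items where exactly one of the two models is right) carry information here, which is why the test suits paired base-versus-fine-tuned comparisons on a fixed benchmark.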

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with direct model evaluations

full rationale

This is an empirical benchmark paper that constructs a dataset of 385 UAV images and >3,000 samples, then reports accuracy scores for 11 VLMs (base and fine-tuned) on visual, quantitative, and agronomic reasoning tasks. No equations, derivations, fitted parameters, or first-principles predictions exist. Reported accuracies (e.g., up to 78% after fine-tuning) are direct outputs of model inference on the author-defined test set and do not reduce to any self-defined quantity or self-citation chain. The paper contains no load-bearing self-citations, uniqueness theorems, or ansatzes; results are self-contained measurements against the provided benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that the authors' curated images and questions faithfully represent key phenotyping tasks; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The 385 UAV images and 3,000 benchmark samples span the essential domains of disease, pest control, weed management, and yield for soybean and cotton.
    This assumption underpins the claim that the benchmark enables controlled comparison of VLMs on agronomic reasoning.

pith-pipeline@v0.9.0 · 5623 in / 1359 out tokens · 50396 ms · 2026-05-10T17:08:16.017719+00:00 · methodology

discussion (0)

