pith. machine review for the scientific record.

arxiv: 2604.09907 · v1 · submitted 2026-04-10 · 💻 cs.CV · cs.AI · cs.CL

Recognition: unknown

From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping

Yu Wu, Guangzeng Han, Ibra Niang Niang, Francia Ravelombola, Maiara Oliveira, Jason Davis, Dong Chen, Feng Lin, Xiaolei Huang

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.CL
keywords plant phenotyping · multimodal LLMs · UAV imagery · soybean · cotton · agronomic reasoning · vision-language models · benchmark

The pith

A benchmark built from 385 UAV images shows that task-specific fine-tuning lifts vision-language model accuracy on soybean and cotton phenotyping tasks to as high as 78 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PlantXpert to measure how well current vision-language models can turn drone photos of soybean and cotton into reliable agronomic judgments about disease, pests, weeds, and yield. The authors built 385 images into more than 3,000 structured samples that test visual identification, number-based reasoning, and multi-step biological logic. When eleven leading models were run on the benchmark, task-specific fine-tuning produced clear accuracy gains, reaching 78 percent for the strongest fine-tuned versions, yet further increases in model size added little and performance differed noticeably between the two crops. The work matters because crop breeding programs need fast, repeatable ways to score thousands of plots; a benchmark that exposes where models still fail can guide the next round of adaptation.
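
To make the task format concrete, one benchmark item might be stored as a record like the sketch below. The schema and field names are illustrative assumptions; the paper's released data format may differ.

```python
from dataclasses import dataclass, field

# Hypothetical schema for one PlantXpert-style sample; field names are
# assumptions for illustration, not taken from the paper's data release.
@dataclass
class PhenotypingSample:
    image_path: str          # UAV image the question is grounded in
    crop: str                # "soybean" or "cotton"
    domain: str              # "disease", "pest", "weed", or "yield"
    reasoning_type: str      # "visual", "quantitative", or "multi_step"
    question: str
    choices: list[str] = field(default_factory=list)  # empty for open-ended
    answer: str = ""

sample = PhenotypingSample(
    image_path="uav/plot_0142.png",
    crop="soybean",
    domain="disease",
    reasoning_type="visual",
    question="Which foliar disease is most consistent with the lesions visible in this plot?",
    choices=["Frogeye leaf spot", "Septoria brown spot", "Downy mildew", "No disease"],
    answer="Frogeye leaf spot",
)
```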

Core claim

We constructed PlantXpert as an evidence-grounded multimodal reasoning benchmark containing 385 digital UAV images and over 3,000 question-answer pairs that span disease identification, pest and weed management, and yield-related traits in soybean and cotton. Evaluation of eleven state-of-the-art vision-language models demonstrates that task-specific fine-tuning produces substantial accuracy improvements, with models such as Qwen3-VL-4B and Qwen3-VL-30B reaching up to 78 percent, while gains from further model scaling diminish beyond a certain capacity, generalization between soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continue to pose major challenges.

What carries the argument

PlantXpert benchmark, a structured dataset and evaluation framework that pairs UAV imagery with questions requiring visual expertise, quantitative reasoning, and multi-step agronomic judgment across disease, pest, weed, and yield domains.
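
A minimal sketch of how such a framework could score a model on the multiple-choice portion, overall and per domain, follows. Here `model_answer` is a stand-in for whatever inference call wraps the VLM under test, and `PhenotypingSample` is the hypothetical record sketched earlier; this is an illustration, not the authors' harness.

```python
from collections import defaultdict

def evaluate(model_answer, samples):
    """Score a VLM on multiple-choice samples, overall and per domain.

    model_answer(image_path, question, choices) -> str is a placeholder
    for the inference call of the model under test.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for s in samples:
        pred = model_answer(s.image_path, s.question, s.choices)
        total[s.domain] += 1
        correct[s.domain] += int(pred.strip() == s.answer)
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    per_domain = {d: correct[d] / total[d] for d in total}
    return overall, per_domain
```

Per-domain accuracies from a loop like this are what would expose the uneven soybean-to-cotton transfer the paper reports.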

If this is right

  • Task-specific fine-tuning raises accuracy on phenotyping questions substantially compared with base models.
  • Accuracy gains from increasing model size plateau after a moderate capacity threshold.
  • Performance on soybean tasks does not reliably transfer to cotton and vice versa.
  • Quantitative calculations and biologically reasoned explanations remain the hardest categories for all tested models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Breeding programs could insert fine-tuned models into routine UAV scouting pipelines to score large numbers of plots faster than manual teams.
  • Separate fine-tuning runs for each crop appear necessary until cross-crop generalization improves.
  • Adding numeric sensor readings alongside images might reduce the errors still seen in quantitative questions.
  • The benchmark format can be reused to track progress as new vision-language models are released.

Load-bearing premise

The 385 images and 3,000 benchmark samples capture the full range of real-world visual and agronomic reasoning demands in soybean and cotton phenotyping without selection bias or oversimplification.

What would settle it

Run the fine-tuned Qwen3-VL-30B model on a fresh collection of UAV images taken from commercial soybean and cotton fields not used in the original dataset, then compare its answers on disease severity, pest counts, and yield estimates against independent ratings by trained agronomists.
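
One way to quantify that comparison is chance-corrected agreement between model outputs and agronomist labels, for instance Cohen's kappa over paired categorical ratings. The sketch below is illustrative, not a procedure the paper prescribes.

```python
from collections import Counter

def cohens_kappa(model_labels, expert_labels):
    """Chance-corrected agreement between model and expert categorical ratings."""
    assert len(model_labels) == len(expert_labels) and model_labels
    n = len(model_labels)
    observed = sum(m == e for m, e in zip(model_labels, expert_labels)) / n
    model_freq = Counter(model_labels)
    expert_freq = Counter(expert_labels)
    # Agreement expected if the two raters labeled independently
    # at their own marginal frequencies.
    expected = sum(model_freq[c] * expert_freq[c] for c in model_freq) / (n * n)
    if expected == 1.0:  # degenerate case: both raters used a single label
        return 1.0
    return (observed - expected) / (1.0 - expected)

# e.g., hypothetical disease-severity classes from fresh validation plots
print(cohens_kappa(["low", "high", "high", "mid"], ["low", "high", "mid", "mid"]))
```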

Figures

Figures reproduced from arXiv: 2604.09907 by Dong Chen, Feng Lin, Francia Ravelombola, Guangzeng Han, Ibra Niang Niang, Jason Davis, Maiara Oliveira, Xiaolei Huang, Yu Wu.

Figure 1
Figure 1: An overview of the benchmark construction pipeline, covering soybean and cotton research studies published between 2016 and 2025 using a shared keyword set. view at source ↗
Figure 2
Figure 2: Representative evidence-grounded multiple-choice data samples from PlantXpert. Each sample is annotated with a specific agricultural domain and primary reasoning challenge. view at source ↗
Figure 3
Figure 3: Representative error cases of multiple-choice questions that the models answered incorrectly. view at source ↗
read the original abstract

To improve crop genetics, high-throughput, effective and comprehensive phenotyping is a critical prerequisite. While such tasks were traditionally performed manually, recent advances in multimodal foundation models, especially in vision-language models (VLMs), have enabled more automated and robust phenotypic analysis. However, plant science remains a particularly challenging domain for foundation models because it requires domain-specific knowledge, fine-grained visual interpretation, and complex biological and agronomic reasoning. To address this gap, we develop PlantXpert, an evidence-grounded multimodal reasoning benchmark for soybean and cotton phenotyping. Our benchmark provides a structured and reproducible framework for agronomic adaptation of VLMs, and enables controlled comparison between base models and their domain-adapted counterparts. We constructed a dataset comprising 385 digital images and more than 3,000 benchmark samples spanning key plant science domains including disease, pest control, weed management, and yield. The benchmark can assess diverse capabilities including visual expertise, quantitative reasoning, and multi-step agronomic reasoning. A total of 11 state-of-the-art VLMs were evaluated. The results indicate that task-specific fine-tuning leads to substantial improvement in accuracy, with models such as Qwen3-VL-4B and Qwen3-VL-30B achieving up to 78%. At the same time, gains from model scaling diminish beyond a certain capacity, generalization across soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continue to pose substantial challenges. These findings suggest that PlantXpert can serve as a foundation for assessing evidence-grounded agronomic reasoning and for advancing multimodal model development in plant science.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces PlantXpert, a multimodal benchmark for agronomic reasoning from UAV imagery in soybean and cotton phenotyping. It comprises 385 images and over 3,000 question-answer samples spanning disease, pest, weed, and yield domains. The work evaluates 11 VLMs, reports that task-specific fine-tuning yields substantial gains (up to 78% accuracy for Qwen3-VL variants), observes that scaling benefits plateau beyond a capacity threshold, notes uneven cross-crop generalization, and identifies persistent weaknesses in quantitative and biologically grounded multi-step reasoning.

Significance. If the benchmark construction and evaluation protocol prove robust, the work would provide a valuable, domain-specific resource for measuring progress in vision-language models applied to plant science. The empirical observations on fine-tuning efficacy versus scaling limits and reasoning bottlenecks could usefully inform targeted adaptation strategies in agricultural AI.

major comments (3)
  1. [Dataset Construction] Dataset section: The manuscript supplies no details on UAV image acquisition parameters (flight altitude, lighting conditions, sensor type), the exact process used to generate the 3,000 benchmark questions and ground-truth answers, or any inter-annotator agreement statistics. These omissions are load-bearing because the headline claims of fine-tuning improvement and remaining reasoning deficits rest on the assumption that the 385-image set faithfully samples real-world phenotyping variability.
  2. [Experiments] Experiments and Evaluation sections: Performance figures (e.g., 78% accuracy) are presented without specifying the answer format (open-ended vs. multiple-choice), the exact scoring procedure for free-form VLM outputs, or any statistical significance testing for the reported gains from fine-tuning versus base models. This prevents verification of the central empirical claims.
  3. [Results] Results and Analysis: The claim of uneven generalization between soybean and cotton, and the specific difficulties in quantitative reasoning, are stated at a high level but lack per-category breakdowns, error analysis, or ablation studies isolating which fine-tuning data elements drive the observed improvements. Without these, the interpretation of where models still fail remains under-supported.
minor comments (2)
  1. [Abstract] The abstract states 'more than 3,000 benchmark samples'; the main text should report the precise count and the distribution across the four phenotyping domains.
  2. [Figures] Figure captions and example question presentations should include explicit indications of image resolution, crop stage, and question type to help readers assess visual and reasoning difficulty.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional information will improve reproducibility and strengthen the empirical claims. We will revise the manuscript accordingly and provide point-by-point responses below.

read point-by-point responses
  1. Referee: [Dataset Construction] Dataset section: The manuscript supplies no details on UAV image acquisition parameters (flight altitude, lighting conditions, sensor type), the exact process used to generate the 3,000 benchmark questions and ground-truth answers, or any inter-annotator agreement statistics. These omissions are load-bearing because the headline claims of fine-tuning improvement and remaining reasoning deficits rest on the assumption that the 385-image set faithfully samples real-world phenotyping variability.

    Authors: We acknowledge the omission of these details in the original manuscript. In the revised version, we will expand the Dataset Construction section to specify UAV flight parameters (altitude, sensor type, lighting conditions), the multi-stage annotation pipeline (including how questions were generated by domain experts and ground-truth answers verified), and inter-annotator agreement statistics. These additions will directly support the validity of the benchmark and the reported performance differences. revision: yes

  2. Referee: [Experiments] Experiments and Evaluation sections: Performance figures (e.g., 78% accuracy) are presented without specifying the answer format (open-ended vs. multiple-choice), the exact scoring procedure for free-form VLM outputs, or any statistical significance testing for the reported gains from fine-tuning versus base models. This prevents verification of the central empirical claims.

    Authors: We agree that these protocol details are essential. The benchmark mixes multiple-choice and open-ended questions; we will explicitly state the formats, describe the scoring method (exact match for multiple-choice and expert-judged semantic equivalence for open-ended responses), and add statistical significance testing (e.g., McNemar's test; a minimal sketch of such a test appears after this list) for fine-tuning gains versus base models in the revised Experiments and Evaluation sections. revision: yes

  3. Referee: [Results] Results and Analysis: The claim of uneven generalization between soybean and cotton, and the specific difficulties in quantitative reasoning, are stated at a high level but lack per-category breakdowns, error analysis, or ablation studies isolating which fine-tuning data elements drive the observed improvements. Without these, the interpretation of where models still fail remains under-supported.

    Authors: The referee correctly identifies the need for greater granularity. We will add per-category accuracy tables (disease, pest, weed, yield), representative error analyses with examples, and ablation studies on fine-tuning data components in the revised Results and Analysis section. These will provide stronger evidence for the reported patterns in cross-crop generalization and reasoning limitations. revision: yes
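
For the significance testing proposed in response 2, a continuity-corrected McNemar test on paired per-item correctness could be computed as in the sketch below. This is an illustration under that assumption, not the authors' evaluation code.

```python
from math import erf, sqrt

def mcnemar(base_correct, ft_correct):
    """Continuity-corrected McNemar test on paired per-sample correctness.

    base_correct and ft_correct are parallel lists of booleans: whether the
    base and fine-tuned model answered each benchmark item correctly.
    """
    b = sum(bc and not fc for bc, fc in zip(base_correct, ft_correct))  # base right, FT wrong
    c = sum(fc and not bc for bc, fc in zip(base_correct, ft_correct))  # FT right, base wrong
    if b + c == 0:
        return 0.0, 1.0
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # Survival function of a 1-degree-of-freedom chi-square via the normal CDF.
    p = 2 * (1 - 0.5 * (1 + erf(sqrt(stat) / sqrt(2))))
    return stat, p
```

Only the discordant pairs (items where exactly one of the two models is right) carry information here, which is why the test suits paired base-versus-fine-tuned comparisons on a fixed benchmark.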

Circularity Check

0 steps flagged

No circularity: pure empirical benchmark with direct model evaluations

full rationale

This is an empirical benchmark paper that constructs a dataset of 385 UAV images and >3,000 samples, then reports accuracy scores for 11 VLMs (base and fine-tuned) on visual, quantitative, and agronomic reasoning tasks. No equations, derivations, fitted parameters, or first-principles predictions exist. Reported accuracies (e.g., up to 78% after fine-tuning) are direct outputs of model inference on the author-defined test set and do not reduce to any self-defined quantity or self-citation chain. The paper contains no load-bearing self-citations, uniqueness theorems, or ansatzes; results are self-contained measurements against the provided benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central contribution rests on the domain assumption that the authors' curated images and questions faithfully represent key phenotyping tasks; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption The 385 UAV images and 3,000 benchmark samples span the essential domains of disease, pest control, weed management, and yield for soybean and cotton.
    This assumption underpins the claim that the benchmark enables controlled comparison of VLMs on agronomic reasoning.

pith-pipeline@v0.9.0 · 5623 in / 1359 out tokens · 50396 ms · 2026-05-10T17:08:16.017719+00:00 · methodology

discussion (0)

