From UAV Imagery to Agronomic Reasoning: A Multimodal LLM Benchmark for Plant Phenotyping
Pith reviewed 2026-05-10 17:08 UTC · model grok-4.3
The pith
A benchmark of 385 UAV images and over 3,000 question-answer pairs shows that fine-tuning vision-language models lifts accuracy on soybean and cotton phenotyping tasks to as much as 78 percent.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We constructed PlantXpert as an evidence-grounded multimodal reasoning benchmark containing 385 digital UAV images and over 3,000 question-answer pairs that span disease identification, pest and weed management, and yield-related traits in soybean and cotton. Evaluation of eleven state-of-the-art vision-language models demonstrates that task-specific fine-tuning produces substantial accuracy improvements, with models such as Qwen3-VL-4B and Qwen3-VL-30B reaching up to 78 percent. At the same time, gains from further model scaling diminish beyond a certain capacity, generalization between soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continue to pose major challenges.
What carries the argument
PlantXpert benchmark, a structured dataset and evaluation framework that pairs UAV imagery with questions requiring visual expertise, quantitative reasoning, and multi-step agronomic judgment across disease, pest, weed, and yield domains.
If this is right
- Task-specific fine-tuning raises accuracy on phenotyping questions substantially compared with base models.
- Accuracy gains from increasing model size plateau after a moderate capacity threshold.
- Performance on soybean tasks does not reliably transfer to cotton and vice versa.
- Quantitative calculations and biologically reasoned explanations remain the hardest categories for all tested models.
Where Pith is reading between the lines
- Breeding programs could insert fine-tuned models into routine UAV scouting pipelines to score large numbers of plots faster than manual teams.
- Separate fine-tuning runs for each crop appear necessary until cross-crop generalization improves.
- Adding numeric sensor readings alongside images might reduce the errors still seen in quantitative questions.
- The benchmark format can be reused to track progress as new vision-language models are released.
Load-bearing premise
The 385 images and 3,000 benchmark samples capture the full range of real-world visual and agronomic reasoning demands in soybean and cotton phenotyping without selection bias or oversimplification.
What would settle it
Run the fine-tuned Qwen3-VL-30B model on a fresh collection of UAV images taken from commercial soybean and cotton fields not used in the original dataset, then compare its answers on disease severity, pest counts, and yield estimates against independent ratings by trained agronomists.
Original abstract
To improve crop genetics, high-throughput, effective and comprehensive phenotyping is a critical prerequisite. While such tasks were traditionally performed manually, recent advances in multimodal foundation models, especially in vision-language models (VLMs), have enabled more automated and robust phenotypic analysis. However, plant science remains a particularly challenging domain for foundation models because it requires domain-specific knowledge, fine-grained visual interpretation, and complex biological and agronomic reasoning. To address this gap, we develop PlantXpert, an evidence-grounded multimodal reasoning benchmark for soybean and cotton phenotyping. Our benchmark provides a structured and reproducible framework for agronomic adaptation of VLMs, and enables controlled comparison between base models and their domain-adapted counterparts. We constructed a dataset comprising 385 digital images and more than 3,000 benchmark samples spanning key plant science domains including disease, pest control, weed management, and yield. The benchmark can assess diverse capabilities including visual expertise, quantitative reasoning, and multi-step agronomic reasoning. A total of 11 state-of-the-art VLMs were evaluated. The results indicate that task-specific fine-tuning leads to substantial improvement in accuracy, with models such as Qwen3-VL-4B and Qwen3-VL-30B achieving up to 78%. At the same time, gains from model scaling diminish beyond a certain capacity, generalization across soybean and cotton remains uneven, and quantitative as well as biologically grounded reasoning continue to pose substantial challenges. These findings suggest that PlantXpert can serve as a foundation for assessing evidence-grounded agronomic reasoning and for advancing multimodal model development in plant science.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PlantXpert, a multimodal benchmark for agronomic reasoning from UAV imagery in soybean and cotton phenotyping. It comprises 385 images and over 3,000 question-answer samples spanning disease, pest, weed, and yield domains. The work evaluates 11 VLMs, reports that task-specific fine-tuning yields substantial gains (up to 78% accuracy for Qwen3-VL variants), observes that scaling benefits plateau beyond a capacity threshold, notes uneven cross-crop generalization, and identifies persistent weaknesses in quantitative and biologically grounded multi-step reasoning.
Significance. If the benchmark construction and evaluation protocol prove robust, the work would provide a valuable, domain-specific resource for measuring progress in vision-language models applied to plant science. The empirical observations on fine-tuning efficacy versus scaling limits and reasoning bottlenecks could usefully inform targeted adaptation strategies in agricultural AI.
Major comments (3)
- [Dataset Construction] Dataset section: The manuscript supplies no details on UAV image acquisition parameters (flight altitude, lighting conditions, sensor type), the exact process used to generate the 3,000 benchmark questions and ground-truth answers, or any inter-annotator agreement statistics. These omissions are load-bearing because the headline claims of fine-tuning improvement and remaining reasoning deficits rest on the assumption that the 385-image set faithfully samples real-world phenotyping variability.
- [Experiments] Experiments and Evaluation sections: Performance figures (e.g., 78% accuracy) are presented without specifying the answer format (open-ended vs. multiple-choice), the exact scoring procedure for free-form VLM outputs, or any statistical significance testing for the reported gains from fine-tuning versus base models. This prevents verification of the central empirical claims.
- [Results] Results and Analysis: The claim of uneven generalization between soybean and cotton, and the specific difficulties in quantitative reasoning, are stated at a high level but lack per-category breakdowns, error analysis, or ablation studies isolating which fine-tuning data elements drive the observed improvements. Without these, the interpretation of where models still fail remains under-supported.
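The scoring concern in the second comment can be made concrete. Since the manuscript does not specify its procedure, the sketch below shows one common choice for multiple-choice or short-answer items: normalized exact match. The `normalize` rules here are an illustrative assumption, not the paper's actual method.

```python
import re
import string

def normalize(answer: str) -> str:
    """Lowercase, drop punctuation, and collapse whitespace before comparison."""
    answer = answer.lower().strip()
    answer = answer.translate(str.maketrans("", "", string.punctuation))
    return re.sub(r"\s+", " ", answer)

def exact_match_accuracy(predictions, references):
    """Fraction of predictions matching their reference after normalization."""
    hits = sum(normalize(p) == normalize(r) for p, r in zip(predictions, references))
    return hits / len(references)

# e.g. "Soybean rust." and "soybean rust" count as a match
```

Free-form outputs would need a stricter protocol (expert judgment or a calibrated semantic-similarity scorer), which is exactly why the referee asks for the details.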
Minor comments (2)
- [Abstract] The abstract states 'more than 3,000 benchmark samples'; the main text should report the precise count and the distribution across the four phenotyping domains.
- [Figures] Figure captions and example question presentations should include explicit indications of image resolution, crop stage, and question type to help readers assess visual and reasoning difficulty.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments identify key areas where additional information will improve reproducibility and strengthen the empirical claims. We will revise the manuscript accordingly and provide point-by-point responses below.
Point-by-point responses
-
Referee: [Dataset Construction] Dataset section: The manuscript supplies no details on UAV image acquisition parameters (flight altitude, lighting conditions, sensor type), the exact process used to generate the 3,000 benchmark questions and ground-truth answers, or any inter-annotator agreement statistics. These omissions are load-bearing because the headline claims of fine-tuning improvement and remaining reasoning deficits rest on the assumption that the 385-image set faithfully samples real-world phenotyping variability.
Authors: We acknowledge the omission of these details in the original manuscript. In the revised version, we will expand the Dataset Construction section to specify UAV flight parameters (altitude, sensor type, lighting conditions), the multi-stage annotation pipeline (including how questions were generated by domain experts and ground-truth answers verified), and inter-annotator agreement statistics. These additions will directly support the validity of the benchmark and the reported performance differences. revision: yes
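Inter-annotator agreement of the kind promised here is usually summarized with a chance-corrected statistic; the response does not say which one the authors will report, so the sketch below shows Cohen's kappa for two annotators as one plausible choice.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two annotators, corrected for chance.

    1.0 means perfect agreement; 0.0 means agreement no better than chance
    given each annotator's label frequencies.
    """
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both annotators labeled independently
    # according to their own marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)
```

With more than two annotators, Fleiss' kappa or Krippendorff's alpha would be the analogous choices.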
-
Referee: [Experiments] Experiments and Evaluation sections: Performance figures (e.g., 78% accuracy) are presented without specifying the answer format (open-ended vs. multiple-choice), the exact scoring procedure for free-form VLM outputs, or any statistical significance testing for the reported gains from fine-tuning versus base models. This prevents verification of the central empirical claims.
Authors: We agree that these protocol details are essential. The benchmark mixes multiple-choice and open-ended questions; we will explicitly state the formats, describe the scoring method (exact match for multiple-choice and expert-judged semantic equivalence for open-ended responses), and add statistical significance testing (e.g., McNemar's test) for fine-tuning gains versus base models in the revised Experiments and Evaluation sections. revision: yes
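The McNemar's test the authors propose compares base and fine-tuned models on the same questions, so only the discordant pairs carry information. A minimal exact version on paired per-question correctness might look like this (a sketch of the standard test, not the authors' implementation):

```python
from math import comb

def mcnemar_exact(base_correct, tuned_correct):
    """Exact two-sided McNemar test on paired per-question correctness.

    b = questions the base model answered correctly but the tuned model missed;
    c = the reverse. Under the null hypothesis of no difference, each
    discordant pair is a fair coin flip, so the p-value comes from
    Binomial(b + c, 0.5).
    """
    b = sum(1 for x, y in zip(base_correct, tuned_correct) if x and not y)
    c = sum(1 for x, y in zip(base_correct, tuned_correct) if not x and y)
    n, k = b + c, min(b, c)
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    # Double the smaller tail probability, capped at 1.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value indicates the fine-tuned model's gains are unlikely under chance alone; with thousands of benchmark items this is cheap to compute per category as well as overall.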
-
Referee: [Results] Results and Analysis: The claim of uneven generalization between soybean and cotton, and the specific difficulties in quantitative reasoning, are stated at a high level but lack per-category breakdowns, error analysis, or ablation studies isolating which fine-tuning data elements drive the observed improvements. Without these, the interpretation of where models still fail remains under-supported.
Authors: The referee correctly identifies the need for greater granularity. We will add per-category accuracy tables (disease, pest, weed, yield), representative error analyses with examples, and ablation studies on fine-tuning data components in the revised Results and Analysis section. These will provide stronger evidence for the reported patterns in cross-crop generalization and reasoning limitations. revision: yes
Circularity Check
No circularity: pure empirical benchmark with direct model evaluations
Full rationale
This is an empirical benchmark paper that constructs a dataset of 385 UAV images and >3,000 samples, then reports accuracy scores for 11 VLMs (base and fine-tuned) on visual, quantitative, and agronomic reasoning tasks. No equations, derivations, fitted parameters, or first-principles predictions exist. Reported accuracies (e.g., up to 78% after fine-tuning) are direct outputs of model inference on the author-defined test set and do not reduce to any self-defined quantity or self-citation chain. The paper contains no load-bearing self-citations, uniqueness theorems, or ansatzes; results are self-contained measurements against the provided benchmark.
Axiom & Free-Parameter Ledger
Axioms (1)
- domain assumption The 385 UAV images and 3,000 benchmark samples span the essential domains of disease, pest control, weed management, and yield for soybean and cotton.