AgroCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture
Pith reviewed 2026-05-21 17:54 UTC · model grok-4.3
The pith
AgroCoT dataset tests vision-language models on chain-of-thought reasoning for farm tasks and finds clear performance gaps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
AgroCoT supplies a curated collection of 4759 samples that embed chain-of-thought reasoning into visual question answering for agriculture; when 30 representative vision-language models are evaluated on it, they exhibit measurable gaps in reasoning ability, demonstrating that chain-of-thought prompting and assessment are necessary to reveal true capability in complex agricultural contexts.
What carries the argument
The AgroCoT dataset, which pairs agricultural images with questions that demand explicit chain-of-thought steps for logical problem solving.
If this is right
- Models can be ranked more reliably for agricultural use once chain-of-thought reasoning is measured.
- Development efforts can focus on closing the identified reasoning gaps in vision-language models.
- Precision farming and environmental applications become more feasible when models handle step-by-step agricultural reasoning.
- Future benchmarks in other domains may adopt similar chain-of-thought integration to test reasoning.
Where Pith is reading between the lines
- The dataset could serve as a template for creating reasoning tests in other specialized domains like medicine or manufacturing.
- Training procedures that explicitly reward chain-of-thought outputs might narrow the gaps observed here.
- Real-world agricultural systems could integrate these models more safely once reasoning performance is verified on this benchmark.
Load-bearing premise
The 4759 curated samples accurately capture the reasoning and problem-solving skills required in real agricultural contexts.
What would settle it
If models that score high on AgroCoT show no advantage over low-scoring models when deployed on actual field tasks such as pest detection or crop monitoring that require similar reasoning.
Figures
read the original abstract
Recent advancements in Vision-Language Models (VLMs) have significantly impacted various industries. In agriculture, these multimodal capabilities hold great promise for applications such as precision farming, crop monitoring, pest detection, and environmental sustainability. However, while several Visual Question Answering (VQA) datasets and benchmarks have been developed to assess VLM performance, they often fail to effectively evaluate the critical reasoning and problem-solving skills needed in complex agricultural contexts. To address this gap, we introduce AgroCoT, a VQA dataset that integrates Chain-of-Thought (CoT) reasoning, specifically designed to evaluate the reasoning capabilities of VLMs. With 4,759 carefully curated samples, AgroCoT provides a comprehensive and robust evaluation of reasoning abilities, particularly in zero-shot scenarios, focusing on the models' ability to engage in logical reasoning and effective problem-solving. Our evaluation of 30 representative VLMs, including both proprietary and open-source models, reveals a gap in their reasoning capabilities, which underscores the importance of incorporating CoT for assessments. Our dataset is available at https://huggingface.co/datasets/AgroCoT/AgroCoT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AgroCoT, a VQA dataset of 4,759 samples that incorporates Chain-of-Thought reasoning to evaluate VLMs on agricultural tasks such as precision farming and pest detection. It reports results from evaluating 30 proprietary and open-source VLMs in zero-shot settings, identifying gaps in reasoning capabilities and arguing for the importance of CoT-based assessment. The dataset is released publicly via Hugging Face.
Significance. A rigorously validated benchmark of this type could help identify specific limitations in current VLMs for multi-step agricultural reasoning and encourage development of models better suited to real-world deployment. The public release of the dataset itself is a constructive contribution that enables further research.
major comments (3)
- [Dataset Creation] Dataset curation section: the claim that the 4,759 samples 'carefully curated' to test critical reasoning and problem-solving lacks any description of the generation process for questions and CoT rationales, the qualifications of annotators, filtering criteria for multi-step necessity, or agricultural fidelity checks.
- [Experiments and Evaluation] Evaluation protocol: no details are given on how CoT reasoning is scored in model outputs (e.g., step-wise correctness, partial credit, or automated vs. human judgment), nor on inter-annotator agreement for the ground-truth rationales.
- [Introduction and Results] Central claim in abstract and conclusion: attribution of performance gaps to 'reasoning capabilities' rather than dataset artifacts requires evidence that the samples isolate logical reasoning without confounding biases in visual patterns or question phrasing; this is not addressed.
minor comments (2)
- [Abstract] Abstract: consider adding one sentence summarizing the main categories of agricultural reasoning tasks represented in the dataset.
- [Data Availability] Dataset release: confirm that the Hugging Face repository includes documentation on sample provenance, licensing, and any usage restrictions.
Simulated Author's Rebuttal
We thank the referee for their thorough and constructive review of our manuscript. We have carefully considered each comment and provide point-by-point responses below. Where appropriate, we will revise the manuscript to incorporate the suggested improvements.
read point-by-point responses
-
Referee: [Dataset Creation] Dataset curation section: the claim that the 4,759 samples 'carefully curated' to test critical reasoning and problem-solving lacks any description of the generation process for questions and CoT rationales, the qualifications of annotators, filtering criteria for multi-step necessity, or agricultural fidelity checks.
Authors: We acknowledge the need for greater transparency in the dataset curation process. In the revised version of the manuscript, we will provide a detailed account of how the questions and corresponding CoT rationales were generated, including the involvement of qualified annotators with expertise in agriculture. We will also describe the filtering criteria used to ensure that each sample requires multi-step reasoning and the checks performed to verify agricultural fidelity and relevance to real-world scenarios such as precision farming and pest detection. revision: yes
-
Referee: [Experiments and Evaluation] Evaluation protocol: no details are given on how CoT reasoning is scored in model outputs (e.g., step-wise correctness, partial credit, or automated vs. human judgment), nor on inter-annotator agreement for the ground-truth rationales.
Authors: We agree that the evaluation protocol requires more explicit description. We will update the manuscript to detail the scoring process for CoT reasoning, which involves assessing step-wise correctness with partial credit awarded for incomplete but directionally correct reasoning steps. The scoring is primarily conducted through human judgment by domain experts, with automated verification for basic output format compliance. Additionally, we will report the inter-annotator agreement metrics for the ground-truth rationales to demonstrate the reliability of the annotations. revision: yes
-
Referee: [Introduction and Results] Central claim in abstract and conclusion: attribution of performance gaps to 'reasoning capabilities' rather than dataset artifacts requires evidence that the samples isolate logical reasoning without confounding biases in visual patterns or question phrasing; this is not addressed.
Authors: This is a valid point. To strengthen the central claim, we will add a new analysis subsection that examines potential confounding factors. Specifically, we will present results showing that performance gaps remain consistent when controlling for visual pattern variations and different question phrasing styles. This evidence will support that the gaps are attributable to reasoning limitations rather than dataset artifacts. We will also discuss the design choices made during curation to promote diversity in visuals and questions. revision: yes
Circularity Check
No circularity: empirical dataset benchmark with no derivation chain
full rationale
The paper introduces the AgroCoT VQA dataset of 4,759 samples focused on CoT reasoning in agriculture and reports zero-shot performance of 30 public VLMs. No equations, fitted parameters, predictions, or first-principles derivations appear in the abstract or described methodology. Claims rest on empirical results against an externally released dataset and standard models, which are independently verifiable. No self-citations support uniqueness theorems, ansatzes, or load-bearing premises. This is a standard benchmark release paper whose central evaluation is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The 4759 samples were curated to require logical reasoning and effective problem-solving in agricultural contexts.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We propose AgriCoT, a comprehensive agricultural multi-step, multimodal CoT dataset comprising 4,535 VQA pairs... five key dimensions (object detection, quantitative analysis, disease monitoring, spatial understanding, and environmental management)
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our evaluation of 26 representative VLMs... reveals a gap in their reasoning capabilities
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
-
HM-Bench: A Comprehensive Benchmark for Multimodal Large Language Models in Hyperspectral Remote Sensing
HM-Bench is the first benchmark for MLLMs on hyperspectral images, showing models struggle with complex spatial-spectral reasoning and perform better with visual PCA images than textual reports.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ah- mad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[3]
Agrogpt: Efficient agricultural vision-language model with expert tuning
Muhammad Awais, Ali Husain Salem Abdulla Alharthi, Amandeep Kumar, Hisham Cholakkal, and Rao Muham- mad Anwer. Agrogpt: Efficient agricultural vision-language model with expert tuning. In2025 IEEE/CVF Winter Con- ference on Applications of Computer Vision (WACV), pages 5687–5696. IEEE, 2025. 3
work page 2025
-
[4]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-vl: A frontier large vision-language model with versatile abilities.arXiv preprint arXiv:2308.12966, 1(2):3,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
2, 6, 7, 3, 5, 8, 9, 10
-
[6]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2. 5-vl technical report.arXiv preprint arXiv:2502.13923, 2025. 2
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[7]
Yuhui Bie, Guowei Xu, and Yaojun Wang. Adapting vision- language models for precision agriculture: A study on crop segmentation based on uav remote sensing data. In2025 13th International Conference on Agro-Geoinformatics (Agro- Geoinformatics), pages 1–6, 2025. 2
work page 2025
-
[8]
Yiyi Cao, Lei Chen, Yuan Yuan, and Guangling Sun. Cucum- ber disease recognition with small samples using image-text- label-based multi-modal language model.Computers and electronics in agriculture, 211:107993, 2023. 1
work page 2023
-
[9]
M3cot: A novel benchmark for multi- domain multi-step multi-modal chain-of-thought
Qiguang Chen, Libo Qin, Jin Zhang, Zhi Chen, Xiao Xu, and Wanxiang Che. M3cot: A novel benchmark for multi- domain multi-step multi-modal chain-of-thought. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8199–8221, 2024. 3
work page 2024
-
[10]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhang- wei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. Expanding performance boundaries of open-source multimodal models with model, data, and test- time scaling.arXiv preprint arXiv:2412.05271, 2024. 7, 3, 5, 6, 8, 9, 10
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[11]
Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blis- tein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities.arXiv preprint arXiv:2507.06261, 2025. 4
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
Vardhan Dongre, Chi Gui, Shubham Garg, Hooshang Nayy- eri, Gokhan Tur, Dilek Hakkani-T ¨ur, and Vikram S Adve. Mirage: A benchmark for multimodal information-seeking and reasoning in agricultural expert-guided conversations. arXiv preprint arXiv:2506.20100, 2025. 2, 3
-
[13]
Interleaved-modal chain-of-thought
Jun Gao, Yongqi Li, Ziqiang Cao, and Wenjie Li. Interleaved-modal chain-of-thought. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 19520–19529, 2025. 3
work page 2025
-
[14]
Minghe Gao, Xuqi Liu, Zhongqi Yue, Yang Wu, Shuang Chen, Juncheng Li, Siliang Tang, Fei Wu, Tat-Seng Chua, and Yueting Zhuang. Benchmarking multimodal cot re- ward model stepwise by visual program.arXiv preprint arXiv:2504.06606, 2025. 2, 3
-
[15]
Agmmu: A comprehensive agricultural multimodal understanding benchmark
Aruna Gauba, Irene Pi, Yunze Man, Ziqi Pang, Vikram S Adve, and Yu-Xiong Wang. Agmmu: A comprehensive agricultural multimodal understanding benchmark. InThe Thirty-ninth Annual Conference on Neural Information Pro- cessing Systems Datasets and Benchmarks Track, 2025. 2, 3, 5, 1
work page 2025
-
[16]
Rbench-v: A primary assess- ment for visual reasoning models with multi-modal outputs
Meng-Hao Guo, Xuanyu Chu, Qianrui Yang, Zhe-Han Mo, Yiqing Shen, Pei-lin Li, Xinjie Lin, Jinnian Zhang, Xin- Sheng Chen, Yi Zhang, et al. Rbench-v: A primary assess- ment for visual reasoning models with multi-modal outputs. arXiv preprint arXiv:2505.16770, 2025. 3
-
[17]
W., Li, L., Yang, Z., Wang, L., and Cheng, Y
Yunzhuo Hao, Jiawei Gu, Huichen Will Wang, Linjie Li, Zhengyuan Yang, Lijuan Wang, and Yu Cheng. Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark.arXiv preprint arXiv:2501.05444,
-
[18]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 6
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[19]
Dongzhi Jiang, Renrui Zhang, Ziyu Guo, Yanwei Li, Yu Qi, Xinyan Chen, Liuhui Wang, Jianhan Jin, Claire Guo, Shen Yan, et al. Mme-cot: Benchmarking chain-of-thought in large multimodal models for reasoning quality, robustness, and efficiency.arXiv preprint arXiv:2502.09621, 2025. 3
-
[20]
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners.Advances in neural information pro- cessing systems, 35:22199–22213, 2022. 3
work page 2022
-
[21]
Maikel Leon. Gpt-5 and open-weight large language models: Advances in reasoning, transparency, and control.Informa- tion Systems, page 102620, 2025. 2, 7, 3, 4, 5, 6, 8, 9, 10 9
work page 2025
-
[22]
Qingmei Li, Yang Zhang, Zurong Mai, Yuhang Chen, Shuo- hong Lou, Henglian Huang, Jiarui Zhang, Zhiwei Zhang, Yibin Wen, Weijia Li, et al. Can large multimodal models un- derstand agricultural scenes? benchmarking with agromind. arXiv preprint arXiv:2505.12207, 2025. 2, 3, 5, 7, 1
-
[23]
Zhiyuan Li, Yi Chang, and Yuan Wu. Think-bench: Evaluat- ing thinking efficiency and chain-of-thought quality of large reasoning models.arXiv preprint arXiv:2505.22113, 2025. 3
-
[24]
Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, and Guangyao Shi. Benchmark evaluations, applications, and challenges of large vision language models: A survey.arXiv preprint arXiv:2501.02189, 1, 2025. 2
-
[25]
Rouge: A package for automatic evaluation of summaries
Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004. 6, 7
work page 2004
-
[26]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. Deepseek-v3 technical report. arXiv preprint arXiv:2412.19437, 2024. 2
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[27]
Improved baselines with visual instruction tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024. 7, 3, 5, 6, 8, 9, 10
work page 2024
-
[28]
A multimodal benchmark dataset and model for crop disease diagnosis
Xiang Liu, Zhaoxiang Liu, Huan Hu, Zezhou Chen, Ko- hou Wang, Kai Wang, and Shiguo Lian. A multimodal benchmark dataset and model for crop disease diagnosis. In European Conference on Computer Vision, pages 157–170. Springer, 2024. 2, 3, 5, 1
work page 2024
-
[29]
DeepSeek-VL: Towards Real-World Vision-Language Understanding
Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision- language understanding.arXiv preprint arXiv:2403.05525,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
6, 7, 3, 4, 5, 8, 9, 10
-
[31]
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. Learn to explain: Multimodal reasoning via thought chains for science question answering.Advances in Neural Information Processing Systems, 35:2507–2521,
-
[32]
MathVista: Evaluating Mathematical Reasoning of Foundation Models in Visual Contexts
Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. Mathvista: Evaluating mathemat- ical reasoning of foundation models in visual contexts.arXiv preprint arXiv:2310.02255, 2023. 3
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[33]
Compositional chain-of-thought prompting for large multimodal models
Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 14420–14431, 2024. 3
work page 2024
-
[34]
Yusu Qian, Cheng Wan, Chao Jia, Yinfei Yang, Qingyu Zhao, and Zhe Gan. Prism-bench: A benchmark of puzzle- based visual tasks with cot error detection.arXiv preprint arXiv:2510.23594, 2025. 3
-
[35]
Learning transferable visual models from natural language supervision
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 2
work page 2021
-
[36]
A-okvqa: A benchmark for visual question answering using world knowl- edge
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowl- edge. InEuropean conference on computer vision, pages 146–162. Springer, 2022. 3
work page 2022
-
[37]
Veningston, and Syed Mufassir Yaseen
Tawseef Ayoub Shaikh, Tabasum Rasool, K. Veningston, and Syed Mufassir Yaseen. The role of large language models in agriculture: harvesting the future with LLM intelligence. Progress in Artificial Intelligence, 14(2):117–164, 2025. 1
work page 2025
-
[38]
Agrobench: Vision-language model benchmark in agriculture
Risa Shinoda, Nakamasa Inoue, Hirokatsu Kataoka, Masaki Onishi, and Yoshitaka Ushiku. Agrobench: Vision-language model benchmark in agriculture. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 7634–7644, 2025. 2, 3, 5, 1
work page 2025
-
[39]
Assessing gpt4-v on structured reasoning tasks.arXiv preprint arXiv:2312.11524, 2023
Mukul Singh, Jos ´e Cambronero, Sumit Gulwani, Vu Le, and Gust Verbruggen. Assessing gpt4-v on structured reasoning tasks.arXiv preprint arXiv:2312.11524, 2023. 3
-
[40]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean- Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023. 2, 6, 7, 3, 5, 8, 9, 10
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 7, 3, 5, 6, 8, 9, 10
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[42]
Agri-llava: Knowledge-infused large multimodal assistant on agricultural pests and diseases
Liqiong Wang, Teng Jin, Jinyu Yang, Ales Leonardis, Fangyi Wang, and Feng Zheng. Agri-llava: Knowledge-infused large multimodal assistant on agricultural pests and diseases. arXiv preprint arXiv:2412.02158, 2024. 3
-
[43]
Self-Consistency Improves Chain of Thought Reasoning in Language Models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reason- ing in language models.arXiv preprint arXiv:2203.11171,
work page internal anchor Pith review Pith/arXiv arXiv
-
[44]
Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey
Yaoting Wang, Shengqiong Wu, Yuecheng Zhang, Shuicheng Yan, Ziwei Liu, Jiebo Luo, and Hao Fei. Multimodal chain-of-thought reasoning: A comprehensive survey.arXiv preprint arXiv:2503.12605, 2025. 3
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[45]
Chain-of-probe: Examining the ne- cessity and accuracy of cot step-by-step
Zezhong Wang, Xingshan Zeng, Weiwen Liu, Yufei Wang, Liangyou Li, Yasheng Wang, Lifeng Shang, Xin Jiang, Qun Liu, and Kam-Fai Wong. Chain-of-probe: Examining the ne- cessity and accuracy of cot step-by-step. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 2586–2606, 2025. 3
work page 2025
-
[46]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large lan- guage models.Advances in neural information processing systems, 35:24824–24837, 2022. 2, 3
work page 2022
-
[47]
Haiyang Wu, Weiliang Mu, Dandan Zhong, Zhuofei Du, Haifeng Li, and Chao Tao. Farmsegvlm: A farmland remote 10 sensing image segmentation method considering vision- language alignment.ISPRS Journal of Photogrammetry and Remote Sensing, 225:423–439, 2025. 2, 3
work page 2025
-
[48]
AgriGPT-VL: Agricultural Vision- Language Understanding Suite, 2025
Bo Yang, Yunkui Chen, Lanfei Feng, Yu Zhang, Xiao Xu, Jianyu Zhang, Nueraili Aierken, Runhe Huang, Hongjian Lin, Yibin Ying, et al. Agrigpt-vl: Agricul- tural vision-language understanding suite.arXiv preprint arXiv:2510.04002, 2025. 3
-
[49]
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models.Advances in neural information processing systems, 36:11809–11822, 2023. 3
work page 2023
-
[50]
Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi
Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for ex- pert agi. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556– 9567, 2024. 3
work page 2024
-
[51]
Hang Zhang, Jiawei Sun, Renqi Chen, Wei Liu, Zhonghang Yuan, Xinzhe Zheng, Zhefan Wang, Zhiyuan Yang, Hang Yan, Hansen Zhong, et al. Empowering and assessing the utility of large language models in crop science.Advances in Neural Information Processing Systems, 37:52670–52722,
-
[52]
Kunpeng Zhang, Li Ma, Beibei Cui, Xin Li, Boqiang Zhang, and Na Xie. Visual large language model for wheat disease diagnosis in the wild.Computers and Electronics in Agricul- ture, 227:109587, 2024. 1, 3
work page 2024
-
[53]
Shuyi Zhang, Xiaoshuai Hao, Yingbo Tang, Lingfeng Zhang, Pengwei Wang, Zhongyuan Wang, Hongxuan Ma, and Shanghang Zhang. Video-cot: A comprehensive dataset for spatiotemporal understanding of videos based on chain-of- thought. InProceedings of the 33rd ACM International Con- ference on Multimedia, pages 12745–12752, 2025. 3
work page 2025
-
[54]
BERTScore: Evaluating Text Generation with BERT
Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Wein- berger, and Yoav Artzi. Bertscore: Evaluating text genera- tion with bert.arXiv preprint arXiv:1904.09675, 2019. 6, 7
work page internal anchor Pith review Pith/arXiv arXiv 1904
-
[55]
Multimodal fine- grained transformer model for pest recognition.Electronics, 12(12):2620, 2023
Yinshuo Zhang, Lei Chen, and Yuan Yuan. Multimodal fine- grained transformer model for pest recognition.Electronics, 12(12):2620, 2023. 1
work page 2023
-
[56]
Yinshuo Zhang, Lei Chen, and Yuan Yuan. Few-shot agricul- tural pest recognition based on multimodal masked autoen- coder.Crop Protection, 187:106993, 2025. 1
work page 2025
-
[57]
Enhancing chain of thought prompting in large lan- guage models via reasoning patterns
Yufeng Zhang, Xuepeng Wang, Lingxiang Wu, and Jinqiao Wang. Enhancing chain of thought prompting in large lan- guage models via reasoning patterns. InProceedings of the AAAI Conference on Artificial Intelligence, pages 25985– 25993, 2025. 2
work page 2025
-
[58]
Xinyan Zhao, Baiyan Chen, Mengxue Ji, Xinyue Wang, Yuhan Yan, Jinming Zhang, Shiyingjie Liu, Muyang Ye, and Chunli Lv. Implementation of large language models and agricultural knowledge graphs for efficient plant disease de- tection.Agriculture, 14(8):1359, 2024. 1
work page 2024
-
[59]
Tinyllava: A framework of small-scale large multimodal models
Baichuan Zhou, Ying Hu, Xi Weng, Junlong Jia, Jie Luo, Xien Liu, Ji Wu, and Lei Huang. Tinyllava: A frame- work of small-scale large multimodal models.arXiv preprint arXiv:2402.14289, 2024. 7, 3, 5, 6, 8, 9, 10
-
[60]
Few-shot image classification of crop diseases based on vision–language models.Sensors, 24(18):6109,
Yueyue Zhou, Hongping Yan, Kun Ding, Tingting Cai, and Yan Zhang. Few-shot image classification of crop diseases based on vision–language models.Sensors, 24(18):6109,
-
[61]
Percentage of Different Datasets
1 11 AgriCoT: A Chain-of-Thought Benchmark for Evaluating Reasoning in Vision-Language Models for Agriculture Supplementary Material 87.8% 4.4% 4.0% 3.9% 12.2% AgroMind AgroBench AGMMU CDDM Total 4,535 QA pairs Figure 8. Percentage of Different Datasets
-
[62]
Data Collection 7.1. Data Sources Selection AgriCoT is constructed based on four state-of-the-art agri- cultural vision-language benchmarks (see Tab. 3), each of- fering unique characteristics and specialized annotations for comprehensive evaluation of VLMs in agriculture. CDDM [26]enhances our dataset withlarge-scale crop disease coverage, comprising 137...
-
[63]
CoT Construction Details To ensure that the CoTs generated by GPT-4o exhibit both deep reasoning and a consistent structure, we design a prompt template (see Fig. 10). In addition to the basic in- puts (reference image, question, and options), the template explicitly guides the model to clarify the core problem, pro- duce detailed and coherent visual desc...
-
[64]
Experiment Details 9.1. Evaluation Protocols We supplement the evaluation with ROUGE-1, ROUGE-2, and ROUGE-L, computing precision, recall, and F1 score for each. ROUGE measures n-gram overlap between gen- erated and reference CoTs, reflecting content coverage and sequence structure. Precision indicates how well the model captures relevant information, whi...
-
[65]
Keyword extraction capability The ROUGE-1 scores presented in Tab
More Experimental Results 10.1. Keyword extraction capability The ROUGE-1 scores presented in Tab. 5 provide insights into the reasoning capabilities of the evaluated models across various tasks. ROUGE-1, which measures the over- lap of unigrams between the generated CoT and the refer- ence, reflects the model’s ability to generate keywords in reasoning t...
-
[66]
Further Discussion Does the reasoning ability of VLMs improve as the pa- rameters increase?As shown in Fig. 11 (a) and (b), ex- cept for InternVL2&3 and Qwen2.5-VL, whose reasoning capabilities improve with an increase in parameters when the number of parameters is small, the reasoning capabili- ties of other models remain relatively stable once the num- ...
-
[67]
Limitations and Future Work While AgriCoT has been carefully curated and manually re- fined to address agricultural VQA tasks, the current dataset scale remains insufficient given the rapid evolution of mod- ern VLMs. As model capacities continue to grow, larger and more diverse training data become essential to fully ex- ploit their reasoning potential. ...
-
[68]
AgriCoT will also be released as open-source to facilitate academic research
Ethical Considerations All data in this dataset are sourced from publicly available open-source datasets, and their use adheres to the respec- tive copyright and licensing terms. AgriCoT will also be released as open-source to facilitate academic research. Seven annotators complete training covering agricultural knowledge, annotation protocols, and ethica...
-
[69]
The question asks about the type of damage observed on the plant in the image, and provides four options related to different types of plant stress or damage. 2. Observing the image, the tree in the center appears to have no leaves, while the surrounding vegetation is green and healthy, indicating that the issue is specific to this tree and not a general ...
-
[70]
The plant in the image appears to have no leaves, which suggests it has experienced some form of damage or stress. 2. The branches are bare, indicating that the damage occurred after the leaves had already grown, which is more consistent with a late-season event. 3. The absence of visible insect damage or signs of bacterial infection (such as leaf spots o...
-
[71]
The question asks to identify the type of damage observed on the plant in the image, providing four options: late season freeze-hail-freeze damage, insect infestation damage, bacterial leaf scorch, and early summer drought stress
-
[72]
Observing the image, the tree appears to have sparse foliage, with branches that are mostly bare and no visible signs of leaf scorch, insect activity, or drought-related wilting
-
[73]
Late season freeze-hail-freeze damage is consistent with the observed symptoms, as freezing temperatures can cause buds and leaves to fail to develop properly, leaving the tree looking bare and stressed
-
[74]
The other options (insect infestation, bacterial leaf scorch, and drought stress) typically show distinct signs such as holes in leaves, browning or scorching of leaf edges, or drooping foliage, none of which are visible in the image
-
[75]
Based on the analysis, the most appropriate option is late season freeze-hail-freeze damage. Plant Management Figure 13. A case of the Plant Management task with responses from GPT-4.1 and InternVL3-38B. Question: What are the distinctive features of the plant shown in the image? Options: A: pine tree with needle-like leaves and brown cones B: cactus with...
-
[76]
The question asks for the distinctive features of the plant shown in the image, and provides four options describing different types of plants
-
[77]
The fronds are arching and some are turning brown, while others are still green and vibrant
Observing the image, the plant has fronds that are divided into leaflets, which is characteristic of ferns. The fronds are arching and some are turning brown, while others are still green and vibrant. 3. Option A describes a pine tree, which has needle-like leaves and cones, not matching the image. Option B describes a cactus, which has thick, spiny stems...
-
[78]
ostrich fern with arching habit and tough, wiry black rhizomes
The image shows a cluster of ferns with large, arching fronds. 2. The fronds have a distinct, feathery appearance with a central stem and leaflets arranged on either side. 3. Some fronds are green, while others have a brownish hue, indicating different stages of growth or decay. 4. The description of "ostrich fern with arching habit and tough, wiry black ...
-
[79]
The question asks to identify the distinctive features of the plant shown in the image and match it with the most appropriate option
-
[80]
The image depicts a plant with arching fronds, characteristic of ferns, and some parts of the plant show wiry, dark structures resembling rhizomes
-
[81]
Option A describes a pine tree with needle-like leaves and brown cones, which does not match the plant in the image as the plant lacks needle-like leaves or cones
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.