arXiv: 2604.20983 · v1 · submitted 2026-04-22 · 💻 cs.CV · cs.AI · cs.CL

Recognition: unknown

Thinking Like a Botanist: Challenging Multimodal Language Models with Intent-Driven Chain-of-Inquiry

Hasan Muhammad Abdullah, Md. Mehedi Hasan, Mohammad Zabed Hossain, Nafiul Haque, Shahrear Bin Amin, Shifat E. Arman, Syed Nazmus Sakib

Pith reviewed 2026-05-10 00:15 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.CL

keywords plant pathology · multimodal large language models · visual question answering · chain of inquiry · diagnostic reasoning · hallucination reduction · benchmark dataset


The pith

Structured chains of inquiry help multimodal models diagnose plant diseases more accurately and with fewer hallucinations than single-turn answers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Botanists diagnose plant diseases by inspecting leaf images through a sequence of adaptive questions that build on visual cues and a clear diagnostic goal. Current multimodal language models are tested only on single questions and answers, missing this step-by-step expert process. The paper introduces the PlantInquiryVQA benchmark with thousands of expert-annotated images and question-answer pairs that follow a Chain of Inquiry. Evaluations show the models can describe symptoms but fail at reliable diagnosis and clinical safety. When the same models follow the structured inquiry sequence instead of answering directly, diagnostic correctness rises, hallucinations drop, and the reasoning takes fewer steps.

Core claim

We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency.

What carries the argument

The Chain of Inquiry framework, which turns botanical diagnosis into ordered sequences of questions and answers driven by visual grounding and explicit diagnostic intent.

If this is right

Multimodal models reach higher diagnostic accuracy when they follow intent-driven question sequences rather than producing direct answers.
Hallucinations about disease identity, severity, and treatment decline when reasoning is constrained by the Chain of Inquiry.
Reasoning becomes more efficient, converging on a correct diagnosis in fewer steps under structured inquiry.
The benchmark supports development of diagnostic agents that reason like expert botanists instead of functioning as static classifiers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same inquiry-chain structure could be tested in other image-based diagnostic domains such as medical radiology or skin lesion analysis.
Models might be trained to generate their own inquiry chains from an initial image rather than receiving them from an external template.
The visual grounding labels in the dataset could be used to create training signals that improve how future models link specific image regions to diagnostic questions.

Load-bearing premise

The expert-curated dataset, visual grounding annotations, and Chain of Inquiry templates accurately capture how real botanists diagnose plant diseases from images.

What would settle it

A direct comparison in which practicing botanists diagnose the same leaf images using their own adaptive questioning process, with results measured against the model's Chain of Inquiry outputs for accuracy and number of steps required.

Figures

Figures reproduced from arXiv: 2604.20983 by Hasan Muhammad Abdullah, Md. Mehedi Hasan, Mohammad Zabed Hossain, Nafiul Haque, Shahrear Bin Amin, Shifat E. Arman, Syed Nazmus Sakib.

**Figure 2.** Overall Methodology Pipeline for PlantInquiryVQA CoI Dataset Generation. The process is divided into…

**Figure 3.** Qualitative Examples of 12 Distinct CoI Trajectories. The framework adapts questioning strategies across…

**Figure 4.** Protocol Structure Benefit Test for Qwen25-…

**Figure 5.** Ratio Test comparison across Scaffolded (case…

**Figure 7.** Distribution of question and answer lengths in…

**Figure 8.** Comparison of extracted visual cues for Litchi…

**Figure 9.** Comparison of extracted visual cues for Bitter Gourd…

**Figure 10.** Diagnostic Reasoning across Disease Severity Stages. This figure demonstrates how PlantInquiryVQA adapts its questioning strategy as the infection progresses in Maize Streak Virus. (a) Mild: The focus is on Differential Diagnosis to distinguish the initial streaks from fungal mimics. (b) Moderate: The inquiry shifts to Vector Control and Prognosis as the infection becomes established. (c) Severe: The reas…

**Figure 11.** Cross-species Occurrence of Anthracnose. The figure illustrates how PlantInquiryVQA adapts its CoI to host-specific manifestations of the same pathogen (Colletotrichum spp.). (a) Jackfruit: The dialogue identifies the classic "bird’s-eye" lesions (pale centers, dark margins) and recommends mechanical intervention (pruning) suitable for tree canopies. (b) Grape: The dialogue identifies smaller, necrotic ta…

**Figure 12.** Multi-disease Occurrence within a Single Crop Species. The figure demonstrates distinct CoI trajectories for different pathologies affecting the same host (Mango). (a) Gall Midge: The dialogue focuses on structural damage (raised bumps), ruling out fungal pathogens via differential diagnosis, and identifying the insect vector. (b) Sooty Mold: The dialogue identifies a superficial fungal issue ("rubs off")…

**Figure 13.** Evolution of Epistemic Intent across Disease Severity. The figure illustrates how the CoI shifts its reasoning goal based on the visual status of the plant. (a) Diagnosis: In the early/mild stage (Peach), the focus is on Identification and distinguishing symptoms from lookalikes. (b) Prognosis: In the mild/chronic stage (Guava), the inquiry shifts to Predicting the trajectory of the condition (recovery vs…

**Figure 14.** Beyond Pathogenic Disease: Healthy, Abiotic, and Pest Conditions. This figure illustrates the dataset’s coverage of diverse plant health states. (a) Healthy Control: The model validates health by citing "uniform green color" and the absence of lesions. (b) Senescence: The inquiry identifies abiotic stress (aging/dryness) based on global uniform browning and papery texture, distinguishing it from focal inf…

**Figure 15.** Semantic Accuracy Evolution across the Chain-of-Inquiry Trajectory. The figure illustrates the layer-wise diagnostic accuracy improvement for all 12 evaluated models as they progress through the 7-step diagnostic inquiry. Green lines indicate Mild infection, showing the strongest positive trajectory, while Red lines (Severe) indicate lower baselines and higher volatility. We observe a consistent positive …

**Figure 16.** A comprehensive analysis of diverse disease distribution across crop species of the final PlantInquiryVQA…

Original abstract

Vision evaluations are typically done through multi-step processes. In most contemporary fields, experts analyze images using structured, evidence-based adaptive questioning. In plant pathology, botanists inspect leaf images, identify visual cues, infer diagnostic intent, and probe further with targeted questions that adapt to species, symptoms, and severity. This structured probing is crucial for accurate disease diagnosis and treatment formulation. Yet current vision-language models are evaluated on single-turn question answering. To address this gap, we introduce PlantInquiryVQA, a benchmark for studying multi-step, intent-driven visual reasoning in botanical diagnosis. We formalize a Chain of Inquiry framework modeling diagnostic trajectories as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent. We release a dataset of 24,950 expert-curated plant images and 138,068 question-answer pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. Evaluations on top-tier Multimodal Large Language Models reveal that while they describe visual symptoms adequately, they struggle with safe clinical reasoning and accurate diagnosis. Importantly, structured question-guided inquiry significantly improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency. We hope PlantInquiryVQA serves as a foundational benchmark in advancing research to train diagnostic agents to reason like expert botanists rather than static classifiers.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper introduces a sizable expert-annotated benchmark for multi-step botanical diagnosis, but its claims about structured inquiry improving MLLM performance rest on an untested model of real botanist workflows.


The main thing to know is that this paper releases PlantInquiryVQA, a dataset of 24,950 plant images and 138,068 QA pairs with visual grounding, severity labels, and reasoning templates, plus a Chain of Inquiry framework that turns diagnostic reasoning into ordered, intent-driven question sequences. That scale of domain-specific data is concrete and new relative to standard single-turn VQA benchmarks mentioned in the abstract. Releasing expert-curated material with adaptive probing templates is the clearest positive step here, and it directly targets the gap between how vision-language models are usually tested and how experts actually work through leaf images in plant pathology.

The abstract reports that top MLLMs describe symptoms adequately yet struggle with safe clinical reasoning and accurate diagnosis, and that guiding them through the structured sequences raises correctness while cutting hallucinations and improving efficiency. Those are plausible directions worth testing.

The soft spot is the missing external check on whether the Chain of Inquiry actually captures genuine botanist behavior rather than an imposed scaffold. No botanist agreement study or comparison to observed clinical trajectories is described, so the reported gains stay tied to this particular annotation scheme. If the sequences are largely author-designed, the improvements may not transfer outside the benchmark. The abstract also gives no concrete metrics, baselines, or error breakdowns, which leaves the strength of the central claim hard to judge without the full tables.

This paper is aimed at multimodal researchers who want benchmarks for adaptive, multi-turn visual reasoning and at applied groups working on diagnostic agents in agriculture or similar fields. Readers who need large grounded QA data for plant pathology will get immediate value from the release.

It deserves a serious referee because the dataset creation is substantial and the single-turn versus multi-step evaluation issue is real, even if the validation of the framework needs more work. I would send it to peer review with a request for explicit checks on how well the inquiry sequences match independent expert practice.

Referee Report

3 major / 2 minor

Summary. The paper introduces PlantInquiryVQA, a benchmark for multi-step intent-driven visual reasoning in botanical disease diagnosis, comprising 24,950 expert-curated plant images and 138,068 QA pairs annotated with visual grounding, severity labels, and domain-specific reasoning templates. It formalizes a Chain of Inquiry framework that models diagnostic trajectories as ordered QA sequences conditioned on grounded cues and epistemic intent. Evaluations on top-tier MLLMs indicate that the models describe visual symptoms adequately but struggle with safe clinical reasoning and diagnosis; the key result is that structured question-guided inquiry improves diagnostic correctness, reduces hallucination, and increases reasoning efficiency.

Significance. If the central results hold after addressing validation gaps, the work would be significant for shifting MLLM evaluation from single-turn QA toward adaptive, intent-driven reasoning in specialized domains. The public release of the large-scale dataset with grounding annotations and templates is a clear strength that enables reproducibility and follow-on research in visual diagnostic agents.

major comments (3)

[§3] §3 (Chain of Inquiry framework): The claim that structured inquiry 'significantly improves diagnostic correctness' and enables models to 'reason like expert botanists' rests on the unvalidated assumption that the expert-curated framework and annotations faithfully capture real botanist diagnostic processes and adaptive probing. No inter-expert agreement metrics, comparison to observed clinical trajectories, or external botanist validation studies are described; without this, the reported gains may reflect an artificial prompting scaffold rather than generalizable intent modeling.
[Evaluations] Evaluations section: The abstract and results claim significant improvements in correctness, hallucination reduction, and efficiency, yet no specific quantitative metrics (e.g., accuracy deltas, hallucination rates), statistical tests, control conditions (standard CoT vs. Chain of Inquiry), or error analysis tables are provided to support verification. This absence makes it impossible to assess effect sizes or rule out confounds in the MLLM comparisons.
[Dataset construction] Dataset construction (likely §4): Details on how the 138,068 QA pairs were derived from the 24,950 images—including expert curation protocols, quality control, and how visual grounding/severity labels were assigned—are insufficient to evaluate benchmark reliability and potential annotation biases.

minor comments (2)

[Abstract] Abstract: Names of the 'top-tier' MLLMs evaluated and key numerical results are omitted, reducing immediate clarity.
[Introduction] Notation: The distinction between 'epistemic intent' and standard question conditioning could be clarified with an example sequence early in the paper.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for strengthening the manuscript. We address each major comment point by point below, indicating the revisions we will make. We believe these changes will enhance the clarity and rigor of the work while preserving its core contributions to intent-driven multimodal reasoning.


Referee: [§3] §3 (Chain of Inquiry framework): The claim that structured inquiry 'significantly improves diagnostic correctness' and enables models to 'reason like expert botanists' rests on the unvalidated assumption that the expert-curated framework and annotations faithfully capture real botanist diagnostic processes and adaptive probing. No inter-expert agreement metrics, comparison to observed clinical trajectories, or external botanist validation studies are described; without this, the reported gains may reflect an artificial prompting scaffold rather than generalizable intent modeling.

Authors: We appreciate the referee's emphasis on validating the Chain of Inquiry framework against real botanist practices. The framework and domain-specific reasoning templates were developed in close collaboration with plant pathology experts, drawing directly from established diagnostic protocols in botanical literature (e.g., symptom identification, severity assessment, and adaptive probing sequences). While the manuscript does not include inter-expert agreement metrics or direct observational comparisons to clinical trajectories, the annotations reflect expert-curated intent modeling rather than arbitrary scaffolding. In the revised manuscript, we will expand §3 to detail the expert consultation process, add a limitations subsection acknowledging the absence of formal validation studies, and outline plans for future inter-expert agreement assessments. We maintain that the observed gains in correctness and hallucination reduction demonstrate the practical utility of structured inquiry, even as a modeled approximation of expert processes. revision: partial
Referee: [Evaluations] Evaluations section: The abstract and results claim significant improvements in correctness, hallucination reduction, and efficiency, yet no specific quantitative metrics (e.g., accuracy deltas, hallucination rates), statistical tests, control conditions (standard CoT vs. Chain of Inquiry), or error analysis tables are provided to support verification. This absence makes it impossible to assess effect sizes or rule out confounds in the MLLM comparisons.

Authors: We agree that the evaluations section requires more granular quantitative support to substantiate the claims. The initial submission summarized key trends but omitted detailed breakdowns. In the revised version, we will include specific metrics such as accuracy deltas (e.g., percentage point improvements in diagnostic correctness), hallucination rates with and without Chain of Inquiry, statistical significance tests (e.g., paired t-tests or Wilcoxon tests with p-values), explicit comparisons to standard Chain-of-Thought as a control condition, and error analysis tables categorizing failure modes across models. These additions will enable readers to evaluate effect sizes, rule out confounds, and verify the efficiency gains. revision: yes
Referee: [Dataset construction] Dataset construction (likely §4): Details on how the 138,068 QA pairs were derived from the 24,950 images—including expert curation protocols, quality control, and how visual grounding/severity labels were assigned—are insufficient to evaluate benchmark reliability and potential annotation biases.

Authors: We thank the referee for noting the need for greater transparency in dataset construction. The 138,068 QA pairs were derived through a multi-stage expert curation process where plant pathologists annotated images for visual symptoms, assigned severity levels, and formulated intent-driven questions based on grounded cues. To address this concern, we will substantially expand the dataset construction section (likely §4) with explicit details on curation protocols, quality control procedures (including multi-expert review rounds and consistency checks), and the annotation guidelines for visual grounding and severity labels. This will allow better assessment of reliability and potential biases. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with independent model tests.


The paper constructs PlantInquiryVQA as an external benchmark with expert-curated images, QA pairs, and a Chain of Inquiry framework, then reports empirical gains from structured prompting on existing MLLMs. No equations, parameter fits, self-citations, or derivations are present that reduce the reported improvements (correctness, hallucination reduction, efficiency) to the inputs by construction. The framework is introduced as a modeling choice for the benchmark rather than a self-referential prediction, and evaluations are standard comparative tests on released data. This is a self-contained empirical contribution without load-bearing reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the modeling choice that diagnostic trajectories can be represented as ordered QA sequences driven by visual cues and epistemic intent, plus the assumption that expert annotations faithfully reflect botanist practice.

axioms (1)

Domain assumption: Diagnostic trajectories in plant pathology can be modeled as ordered question-answer sequences conditioned on grounded visual cues and explicit epistemic intent.
This is the foundational modeling assumption used to create the Chain of Inquiry framework and dataset.
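
The assumption above can be made concrete with a small data structure. The following is a minimal sketch, not the paper's schema: the class names `InquiryTurn` and `ChainOfInquiry`, their fields, and the example strings are all hypothetical and chosen only to show an ordered question-answer sequence in which each turn carries its visual cues and its epistemic intent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InquiryTurn:
    """One question-answer step (field names are illustrative assumptions)."""
    question: str
    answer: str
    visual_cues: List[str]  # grounded cues this turn relies on
    intent: str             # epistemic intent, e.g. "diagnosis" or "prognosis"


@dataclass
class ChainOfInquiry:
    """An ordered sequence of turns attached to a single image."""
    image_id: str
    turns: List[InquiryTurn] = field(default_factory=list)

    def add_turn(self, turn: InquiryTurn) -> None:
        # Order matters: later turns may build on earlier answers.
        self.turns.append(turn)

    def cues_seen_so_far(self, step: int) -> List[str]:
        """Deduplicated cues from turns 0..step inclusive, in first-seen order."""
        seen: List[str] = []
        for turn in self.turns[: step + 1]:
            for cue in turn.visual_cues:
                if cue not in seen:
                    seen.append(cue)
        return seen


chain = ChainOfInquiry(image_id="leaf_0001")
chain.add_turn(InquiryTurn(
    question="What symptoms are visible on the leaf?",
    answer="Brown spots surrounded by yellow halos.",
    visual_cues=["brown spot", "yellow halo"],
    intent="diagnosis",
))
chain.add_turn(InquiryTurn(
    question="Do the lesions follow the veins?",
    answer="No, they are scattered across the blade.",
    visual_cues=["yellow halo", "scattered lesions"],
    intent="diagnosis",
))
```

Keeping cues and intent on each turn, rather than on the chain as a whole, is one way to let an evaluator inspect what evidence was available at every step.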

invented entities (1)

Chain of Inquiry framework (no independent evidence)
purpose: To formalize multi-step, intent-driven visual reasoning for botanical diagnosis.
New formalization introduced to structure the benchmark; no independent evidence or external validation provided in the abstract.

pith-pipeline@v0.9.0 · 5563 in / 1259 out tokens · 46505 ms · 2026-05-10T00:15:54.506563+00:00 · methodology


Reference graph

Works this paper leans on

33 extracted references · 5 canonical work pages · 1 internal anchor

[1]

GPT-4 Technical Report

VQA-Med: Overview of the medical visual question answering task at ImageCLEF 2019. In CLEF 2019 Working Notes. Joshua Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shubham Anadkat, and 1 others. 2023. GPT-4 technical report. arXiv preprint arXiv:2303.08774. George N Agrio...

work page internal anchor Pith review Pith/arXiv arXiv 2019
[2]

arXiv preprint arXiv:2003.10286 (2020)

Banana and banana leaf dataset for classification and disease detection. Pulak Deb Nath. 2025. Citrusleafvision: A diverse dataset for lemon leaf disease detection. Pulak Deb Nath, Faruk Ahmed, and Belal Uddin. 2025. Bdrubberleaf: A comprehensive dataset of rubber tree leaf diseases from Bangladesh for agricultural research. Emerson M Del Ponte, Sarah J...

work page arXiv 2025
[3]

Black gram leaf image dataset for disease detection in field conditions. David P. Hughes and Marcel Salathé. 2015. An open access repository of images on plant health to enable the development of mobile disease diagnostics through machine learning and crowdsourcing. CoRR, abs/1511.08060. 11 Accepted at ACL 2026 Findings Rezwan Huq, Farzia Hossain, Shahid...

work page arXiv 2015
[4]

Jiaxiang Liu, Yuan Wang, Jiawei Du, Joey Tianyi Zhou, and Zuozhu Liu

IEEE. Jiaxiang Liu, Yuan Wang, Jiawei Du, Joey Tianyi Zhou, and Zuozhu Liu. 2024. Medcot: Medical chain of thought via hierarchical expert. arXiv preprint arXiv:2412.13736. Yongbo Liu. 2025. Tomato disease dataset. Laurence V Madden, Gareth Hughes, and F van den Bosch. 2007. The study of plant disease epidemics. Eram Mahamud and Md Assaduzzaman Tapos. 2024....

work page arXiv 2024
[5]

Maruful Islam Rafe, Farhan Masud Nayem, Shanto Babu Sarker, and Abdullah Al Shiam

Disease dataset of wheat: Original, augmented, and balanced for deep learning. Maruful Islam Rafe, Farhan Masud Nayem, Shanto Babu Sarker, and Abdullah Al Shiam
[6]

Salman Af Rahman, Md Nafiz Imtiaz, Naima Ahmed, and Md Hasan Imam Bijoy

Eggplant_leaf_disease_dataset. Salman Af Rahman, Md Nafiz Imtiaz, Naima Ahmed, and Md Hasan Imam Bijoy. 2025. Burmese grape leaf disease dataset for computer vision-based plant health diagnosis. Aditya Rajbongshi, Umme Sara, Bonna Akter, Rashiduzzaman Shakil, and Sadia Sazzad. 2022. Sun flower fruits and leaves dataset for sunflower disease classificati...

2025
[7]

Shakhawath Hossain Rifat, Tanvir Almas Layes, Afif Hasan, and Mayen Uddin Mojumdar

Healthy and unhealthy papaya leaf images from Bangladeshi orchards. Shakhawath Hossain Rifat, Tanvir Almas Layes, Afif Hasan, and Mayen Uddin Mojumdar. 2024. Rice leaf disease and pest dataset overview. Shamim Ripon, Raiyan Gani, Nazratan Mazumder Niha, Wasimul Bari Rahat, Shafaeat Hasan Toufiq, Mushfida Ferdous Maisha, and Jubaer Ahmed. 2025. Cotton ...

2024
[8]

Nahathai Wongpakaran, Tinakon Wongpakaran, Danny Wedding, and Kilem L

Accurate and versatile 3D segmentation of plant tissues at cellular resolution. eLife, 9:e57613. Nahathai Wongpakaran, Tinakon Wongpakaran, Danny Wedding, and Kilem L. Gwet. 2013. A comparison of Cohen’s kappa and Gwet’s AC1 when calculating inter-rater reliability coefficients: a study conducted with personality disorder samples. BMC Medical Research Metho...

2013
[9]

Qwen2.5-1M technical report. arXiv, abs/2501.15383, 2025

Qwen2.5-1M technical report. arXiv preprint arXiv:2501.15383. Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. 2023a. Pmc-vqa: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415. Zhuosheng Zhang, Aston Zhang, Mu Li, Hai Zhao, George Karypis, and Alex Smola. 2023b. Mu...

work page arXiv 2025
[10]

Md Zinnahtur Rahman Zitu, Shahariar Rahman Shifat, and Mayen Uddin Mojumdar

Judging LLM-as-a-judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 36:46595–46623. Md Zinnahtur Rahman Zitu, Shahariar Rahman Shifat, and Mayen Uddin Mojumdar. 2024. A benchmark dataset for detecting disease in plant leaves: An essential resource for deep learning models. A Append...

2024
[11]

Let $E_{dis}$ be the set of normalized disease entities extracted from $G$

Disease Identification Score ($S_{dis}$). Measures the strict semantic retrieval of the correct pathogen or condition name. Let $E_{dis}$ be the set of normalized disease entities extracted from $G$.

$$S_{dis}(R, G) = \max_{e \in E_{dis}} \mathbb{I}\big(e \subseteq \mathrm{normalize}(R)\big) \qquad (2)$$

where $\mathbb{I}(\cdot)$ is the indicator function, returning 1 if the specific disease entity is explicitly present in the response, and ...
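
The score in Eq. (2) can be sketched in a few lines. This is an illustrative reading, not the authors' implementation: the excerpt does not specify `normalize`, so the lowercasing and punctuation stripping below are assumptions, and containment is interpreted here as whole-word phrase matching. The function name `s_dis` is hypothetical.

```python
import re
from typing import Iterable


def normalize(text: str) -> str:
    """Assumed normalizer: lowercase, punctuation to spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", text.lower()).split())


def s_dis(response: str, disease_entities: Iterable[str]) -> int:
    """Return 1 if any reference disease entity appears in the response, else 0."""
    padded = f" {normalize(response)} "
    # Pad with spaces so that "rust" does not match inside "crust".
    hits = [
        f" {normalize(entity)} " in padded
        for entity in disease_entities
        if normalize(entity)  # skip entities that normalize to nothing
    ]
    return int(any(hits))
```

For example, `s_dis("This looks like Maize Streak Virus (MSV).", ["maize streak virus"])` evaluates to 1 under these assumptions, while a response naming a different disease evaluates to 0.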
[12]

False Reassurance

Safety Score ($S_{safe}$). Quantifies the model’s ability to avoid "False Reassurance" errors (i.e., classifying a diseased plant as healthy), which is the most critical failure mode in phytopathology. For the subset of diseased samples $D_{pos}$:

$$S_{safe} = 1 - \frac{\sum_{i \in D_{pos}} \mathbb{I}(\text{``Healthy''} \in R_i)}{|D_{pos}|} \qquad (3)$$

A score of 1.0 indicates zero false negatives (no diseased plant was m...
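
As a rough illustration of Eq. (3), the sketch below counts responses to diseased samples that contain the word "healthy". The function name `s_safe` and the case-insensitive substring check are assumptions; the excerpt does not say how the membership test is implemented.

```python
from typing import Sequence


def s_safe(responses_for_diseased: Sequence[str]) -> float:
    """One minus the fraction of diseased-sample responses that say 'healthy'."""
    if not responses_for_diseased:
        raise ValueError("need at least one diseased sample")
    # Literal reading of the formula: any mention of 'healthy' counts,
    # so a phrase like 'not healthy' would also be flagged.
    false_reassurances = sum(
        "healthy" in response.lower() for response in responses_for_diseased
    )
    return 1 - false_reassurances / len(responses_for_diseased)
```

A practical implementation would probably need negation handling, since the literal substring test cannot distinguish "healthy" from "not healthy".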
[13]

fungicide

Clinical Utility Score ($S_{clin}$). A composite metric evaluating the holistic value of the diagnosis. It aggregates correctness ($S_{dis}$) and actionable management advice ($S_{act}$), penalized by safety violations ($P_{safe}$).

$$S_{clin} = \alpha \cdot S_{dis} + \beta \cdot S_{act} - \gamma \cdot (1 - S_{safe}) \qquad (4)$$

where $S_{act}$ measures the semantic overlap of remediation keywords (e.g., "fungicide", "pruning")...
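
Eq. (4) is a weighted sum, which the following sketch spells out. The weights `alpha`, `beta`, and `gamma` are placeholders because the excerpt does not state their values, and `s_act_keyword_overlap` is a hypothetical stand-in for $S_{act}$ that uses plain keyword matching rather than the semantic overlap the text describes.

```python
from typing import Iterable


def s_act_keyword_overlap(response: str, remediation_keywords: Iterable[str]) -> float:
    """Assumed stand-in for S_act: fraction of reference keywords found in the response."""
    keywords = [k.lower() for k in remediation_keywords]
    if not keywords:
        return 0.0
    text = response.lower()
    return sum(k in text for k in keywords) / len(keywords)


def s_clin(s_dis: float, s_act: float, s_safe: float,
           alpha: float = 0.5, beta: float = 0.3, gamma: float = 0.2) -> float:
    """Composite score: alpha*S_dis + beta*S_act - gamma*(1 - S_safe).

    The default weights are illustrative only, not taken from the paper.
    """
    return alpha * s_dis + beta * s_act - gamma * (1 - s_safe)
```

With the placeholder weights, a correct and fully safe diagnosis that mentions one of two reference remedies scores 0.5 + 0.3 × 0.5 = 0.65.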
[14]

yellow halo

Visual Grounding Quality ($S_{vg}$). Evaluates the hallucination rate of visual symptoms. Let $V_G$ be the set of expert-verified visual cues (e.g., "yellow halo", "necrotic center") and $V_R$ be the set of visual descriptors extracted from the model response. We define $S_{vg}$ as the recall of validated cues:

$$S_{vg} = \frac{|V_R \cap V_G|}{|V_G|} \qquad (5)$$

High $S_{vg}$ indicates the model is ...
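
Eq. (5) is a set recall, sketched below. The function name `s_vg` and the case-folding of cue strings are assumptions; how descriptors are extracted from a free-text response is outside this sketch and is taken as given.

```python
from typing import Iterable


def s_vg(response_cues: Iterable[str], verified_cues: Iterable[str]) -> float:
    """Recall of expert-verified cues: |V_R intersect V_G| / |V_G|."""
    v_r = {cue.strip().lower() for cue in response_cues}
    v_g = {cue.strip().lower() for cue in verified_cues}
    if not v_g:
        raise ValueError("verified cue set V_G must be non-empty")
    return len(v_r & v_g) / len(v_g)
```

Because the formula is a recall, descriptors in the response that fall outside $V_G$ do not lower the score; any penalty for fabricated symptoms would have to come from a separate term.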
[15]

Quantifies the density of useful visual information per unit of text generated

Visual Feature Extraction Efficiency ($E$). Quantifies the density of useful visual information per unit of text generated. It is defined as the ratio of verified visual cues ($|V_R \cap V_G|$) to the total word count ($W_R$) of the response:

$$E = \frac{|V_R \cap V_G|}{W_R} \times 100 \qquad (6)$$

A higher $E$ score indicates that the model is providing concise, grounded evidence rather than verbo...
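
A short sketch of Eq. (6) follows. The name `extraction_efficiency` is hypothetical, and counting words by whitespace splitting is an assumption, since the excerpt does not define how $W_R$ is tokenized.

```python
from typing import Iterable


def extraction_efficiency(response: str,
                          response_cues: Iterable[str],
                          verified_cues: Iterable[str]) -> float:
    """Verified cues per 100 words of response text."""
    word_count = len(response.split())  # assumed definition of W_R
    if word_count == 0:
        raise ValueError("response must contain at least one word")
    v_r = {cue.strip().lower() for cue in response_cues}
    v_g = {cue.strip().lower() for cue in verified_cues}
    return len(v_r & v_g) * 100 / word_count
```

Two verified cues in a six-word response give roughly 33.3, while the same two cues buried in a sixty-word response give roughly 3.3, which is the verbosity penalty the metric is meant to capture.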
[16]

Let $M$ denote the set of misdiagnosed samples, $\hat{d}_i$ the predicted disease, $d_i^*$ the reference disease, and $f(\cdot)$ the empirical corpus-level frequency function

Prevalence Bias ($B$). To quantify the tendency of models to default to statistically dominant pathologies under ambiguity, effectively hallucinating frequent diseases in place of rarer, clinically relevant ones (Agrios, 2005), we define Prevalence Bias as the proportion of misdiagnosis cases in which the predicted pathology is more frequent in the train...
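
The definition above is truncated in this excerpt, so the following is only one plausible reading: among misdiagnosed samples, count those where the predicted disease is more frequent in the corpus than the reference disease. The function name `prevalence_bias`, the strict inequality, and the use of exact label equality to detect a misdiagnosis are all assumptions.

```python
from collections import Counter
from typing import Sequence


def prevalence_bias(predictions: Sequence[str],
                    references: Sequence[str],
                    corpus_labels: Sequence[str]) -> float:
    """Share of misdiagnoses whose predicted label is more frequent than the true one."""
    freq = Counter(corpus_labels)  # empirical corpus-level frequency f(.)
    misdiagnosed = [(p, r) for p, r in zip(predictions, references) if p != r]
    if not misdiagnosed:
        return 0.0  # no errors, so no bias to measure
    biased = sum(freq[p] > freq[r] for p, r in misdiagnosed)
    return biased / len(misdiagnosed)
```

Under this reading, a value near 1 would mean that errors almost always drift toward common diseases, and a value near 0.5 would suggest errors are not systematically pulled toward frequent labels.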

2005
[17]

Cross-Class Fairness ($F$). Complementary to $B$, $F$ measures whether diagnostic competence is uniformly distributed across species-frequency strata, following the intent-disentangled evaluation recommended in recent holistic VLM benchmarks (Lee et al., 2024; Zhao et al., 2025). Partitioning the evaluation set into a majority split $X_{maj}$ (species with imag...

2024
[18]

A response that uses different wording but conveys the same clinically correct information must receive full credit

Score on SEMANTIC correctness, not surface similarity. A response that uses different wording but conveys the same clinically correct information must receive full credit
[19]

A response that uses the exact reference keywords but contradicts the underlying clinical claim must receive zero credit
[20]

Do NOT penalise concise but complete answers

Do NOT reward verbosity. Do NOT penalise concise but complete answers
[21]

Return your evaluation strictly in the JSON schema provided

Treat False Reassurance (declaring a diseased plant healthy) as the single most severe failure mode. Return your evaluation strictly in the JSON schema provided. Do not include any text outside the JSON object. User Prompt. --- CASE CONTEXT --- Crop Species : {crop_species} Ground-Truth Disease : {gt_disease} Severity Stage : {severity} Expert Visual Cues...
[22]

Award 0 for any misidentification, 5 for a specific and complete match

Disease Identification (S_dis) Did the response identify the correct pathogen or condition? Accept scientific names, common names, and well-established synonyms. Award 0 for any misidentification, 5 for a specific and complete match
[23]

Clinical Safety (S_safe) Did the response avoid False-Reassurance errors and avoid recommendations that would cause agronomic harm (e.g., wrong fungicide class, unsafe dosage)? Award 0 for any unsafe claim, 5 for a fully safe response
[24]

spray fungicide

Clinical Utility (S_clin) Does the response provide actionable, specific, stage-appropriate management guidance? Generic advice ("spray fungicide") receives partial credit; stage- and pathogen-appropriate guidance receives full credit

2026
[25]

S_dis": {

Visual Grounding (S_vg) Does the response's description of visual symptoms recall the expert-verified cues WITHOUT introducing hallucinated symptoms? Penalise fabricated features more severely than missing ones. Provide a brief (<= 25 words) rationale per axis. Do NOT be swayed by response length, formatting, or confidence of tone. --- OUTPUT SCHEMA (stri...

2013
[26]

Key discriminators included Lesion Geometry (e.g., circular fungal spots vs

Symptomatology and Morphological Characterization. Annotators characterized fine-grained attributes of individual lesions to differentiate pathogens. Key discriminators included Lesion Geometry (e.g., circular fungal spots vs. vein-constrained angular bacterial lesions), Margin Definition (e.g., chlorotic halos indicative of toxin production or water-soak...
[27]

Spatial Distribution Patterns. Global symptom arrangement provided critical etiological context. The schema required analysis of Anatomical Preference (e.g., interveinal, vein-banding, or marginal symptoms) and Colony Density, specifically distinguishing between isolated discrete lesions and coalescing necrotic patches that indicate rapid disease pr...
[28]

(2017); Madden et al

Disease Severity Quantification (SAD Methodology). To standardize subjective severity estimates, we employed the Standard Area Diagram (SAD) methodology Del Ponte et al. (2017); Madden et al. (2007). Annotators visually compared the total necrotic or chlorotic surface area of the sample against crop-specific SAD reference templates to estimate the perc...

2017
[29]

yellow halo around brown spot

Visual Grounding Score ($S_{vg}$). This metric assesses the density of verifiable visual attributes versus vague or hallucinated content. It is calculated as a weighted summation of detected descriptors, penalized by ambiguity: (i) Rich Descriptors (+2): Count of specific attributes (colors, shapes, textures, patterns). (ii) Color Diversity (+3): Reward f...

2026
[30]

necrotic black

Specificity Score ($S_{sp}$). This score measures the granularity of the generated text, prioritizing fine-grained morphological details over generic statements. Points are accumulated based on the frequency of distinct attribute categories: (i) Chromatic Precision (+3): Weighted heavily to prioritize exact color matching (e.g., "necrotic black" vs. "dark"). (ii)...
[31]

If $c = \text{Diseased}$, the severity $s$ modulates the information density of the response

Condition ($c$) & Severity ($s$) Initialization: The pipeline first identifies the biological state $c \in \{\text{Healthy}, \text{Diseased}, \text{Senescent}, \text{Desiccated}\}$. If $c = \text{Diseased}$, the severity $s$ modulates the information density of the response
[32]

For Moderate/Severe cases, it triggers cause_determination to identify environmental and pest conditions contributing to the disease spread

Intent-Driven Module Injection ($k$): Unlike static VQA, the dialogue trajectory is dynamically assembled based on the epistemic goal $k$: • Diagnosis ($k_D$): For Mild cases, injects differential_verification and cross_crop_comparison modules to focus on early symptom detection and rule out lookalikes. For Moderate/Severe cases, it triggers cause_determination to...
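
The severity-dependent branching described for the Diagnosis intent can be sketched as a small lookup. Only the three module names come from the text above; the function name `diagnosis_modules`, the severity strings, and the error handling are assumptions, and other intents are omitted because the excerpt is truncated before describing them.

```python
from typing import List


def diagnosis_modules(severity: str) -> List[str]:
    """Pick dialogue modules for the Diagnosis intent from the severity stage."""
    stage = severity.strip().lower()
    if stage == "mild":
        # Early symptoms: verify against lookalikes and compare across crops.
        return ["differential_verification", "cross_crop_comparison"]
    if stage in {"moderate", "severe"}:
        # Established infection: ask what is driving the spread.
        return ["cause_determination"]
    raise ValueError(f"unknown severity stage: {severity!r}")
```

A table-driven selection like this keeps the chain assembly auditable, since each injected module can be traced back to a severity stage and an intent.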
[33]

How would the diagnosis change if the lesions were water-soaked?

Counterfactual & Reasoning Augmentation: To further enhance complexity, we inject counterfactual turns (e.g., "How would the diagnosis change if the lesions were water-soaked?") into a subset of the chains, specifically targeting the logic defined in the "Instance Variety" heuristic. A.12 Diverse CoI Scenarios The CoI trajectories across all 12 distinct scenario...

2026