MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence
Pith reviewed 2026-05-25 06:16 UTC · model grok-4.3
The pith
Medical VLMs must refuse fluent answers when image evidence is corrupted or falsified, yet current models fall 14 points short of human radiologists on this test.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Medical vision-language models must recognise when the evidential basis for an answer has failed and refuse rather than return a fluent answer. The MedVIGIL suite, built from four public medical VQA sources and supervised end to end by four radiologists, measures this property through silent failures under false-premise, wording, knowledge-only, and ROI-corrupted perturbations. It places the human reference at MCS 83.3, 14.1 points above the best audited model.
What carries the argument
The MedVIGIL Composite Score aggregates seven audit metrics on silent-failure rate, refusal accuracy, and correctness under four clinician-defined perturbation categories.
If this is right
- Models must incorporate explicit detection of broken visual evidence rather than relying on fluency alone.
- Standard VQA accuracy metrics are insufficient for clinical trustworthiness.
- The released probes and risk-tier flags allow repeated auditing of new models on the same fixed cases.
- Human-level refusal under perturbation requires training signals that penalise answers when evidence is absent or contradictory.
- The open-ended variant and counterfactual triplets test generalisation beyond multiple-choice format.
Where Pith is reading between the lines
- Deployment of current VLMs in radiology workflows risks silent propagation of answers based on invalid images.
- The benchmark could be extended by adding temporal or multi-image perturbations to match real clinical workflows.
- Training objectives that reward explicit uncertainty statements when evidence is missing may close the observed gap.
- The 14-point headroom indicates that refusal calibration, not raw capability, is the current bottleneck.
Load-bearing premise
The four perturbation types and the clinician-authored cases capture the failure modes that matter for trustworthy clinical use.
What would settle it
An audited model that reaches an MCS above 80 or a silent-failure rate below 6 percent on the released 300-case set would close or reverse the reported gap to the independent radiologist baseline.
Original abstract
Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.
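The silent-failure notion in the abstract can be made concrete with a small sketch. It assumes a simplified reading in which a probe counts as a silent failure when its clinician-authored gold answer is the refusal option and the model picks anything else. The `ProbeResult` record, the perturbation labels, and the use of `"E"` as the refusal letter are illustrative assumptions, not the released harness's schema.

```python
from dataclasses import dataclass

# Assumption for illustration: the refusal choice is labelled "E".
REFUSAL_OPTION = "E"


@dataclass(frozen=True)
class ProbeResult:
    """One model answer to one MCQ probe (hypothetical record layout)."""
    case_id: str
    perturbation: str  # e.g. "intact", "false_premise", "roi_corrupted"
    gold: str          # gold letter; REFUSAL_OPTION when evidence is broken
    predicted: str     # letter the model returned


def silent_failure_rate(results):
    """Fraction of refusal-warranted probes answered with a non-refusal option."""
    should_refuse = [r for r in results if r.gold == REFUSAL_OPTION]
    if not should_refuse:
        raise ValueError("no probes with a refusal gold answer")
    silent = sum(1 for r in should_refuse if r.predicted != REFUSAL_OPTION)
    return silent / len(should_refuse)


# Toy data, not taken from the benchmark.
results = [
    ProbeResult("c001", "false_premise", "E", "B"),   # silent failure
    ProbeResult("c001", "roi_corrupted", "E", "E"),   # correct refusal
    ProbeResult("c002", "knowledge_only", "E", "E"),  # correct refusal
    ProbeResult("c002", "intact", "A", "A"),          # not refusal-warranted
    ProbeResult("c003", "false_premise", "E", "C"),   # silent failure
]
rate = silent_failure_rate(results)
print(f"silent-failure rate: {rate:.1%}")
```

On this toy input two of the four refusal-warranted probes receive a fluent non-refusal answer, so the rate is 50%. The paper's own metrics are correctness-conditioned and may differ in detail from this simplification.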
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MedVIGIL, a 300-case benchmark drawn from public medical VQA sources and fully supervised by four board-certified radiologists, to evaluate medical VLMs on silent failures under four types of broken evidence (false premise, wording perturbation, knowledge-only rewrite, ROI-corrupted image). It defines seven correctness-conditioned audit metrics that aggregate into the MedVIGIL Composite Score (MCS), reports an independent radiologist baseline of MCS 83.3 (silent-failure rate 5.8%) versus the strongest model (Claude Opus 4.7) at 69.2, and publicly releases the 2556 MCQ probes, counterfactual triplets, risk tiers, and evaluation harness.
Significance. If the perturbation categories accurately represent clinical silent-failure modes, the work supplies a needed evaluation framework for trustworthy medical VLMs and quantifies a concrete 14.1-point composite headroom. The end-to-end multi-radiologist construction pipeline, independent fourth-radiologist baseline, and public release of the benchmark and harness are concrete strengths that support reproducibility and further research.
major comments (1)
- [Case construction pipeline] Case construction pipeline (abstract and methods): The four perturbation categories and 300 clinician-authored cases are asserted to capture relevant evidential failures for clinical trustworthiness, yet the manuscript provides no external anchoring (e.g., comparison against hospital incident logs, additional blinded clinician surveys on ecological validity, or inter-rater metrics beyond internal consolidation). This assumption is load-bearing for interpreting the reported 14.1-point MCS headroom as clinically actionable rather than benchmark-specific.
minor comments (3)
- [Abstract] Abstract: The seven correctness-conditioned audit metrics that compose the MCS are referenced but not enumerated; a one-sentence listing would improve immediate readability.
- [Evaluation metrics] Evaluation metrics: The precise aggregation formula or weighting scheme that produces the composite MCS from the seven component metrics is not shown in the main text; an explicit equation or pseudocode would aid verification.
- [Results tables] Table/figure captions: Several result tables report model scores to one decimal place while the human baseline is given to one decimal; consistent precision and explicit mention of the number of probes per model would reduce ambiguity.
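To illustrate the kind of explicit aggregation the second minor comment asks for, the sketch below computes a composite as a weighted mean of seven component metrics scaled to 0-100. This is not the paper's formula, which this report does not reproduce. The equal default weights, the placeholder metric names `metric_1` to `metric_7`, and the convention that every component lies in [0, 1] with higher meaning better are all assumptions.

```python
def composite_score(metrics, weights=None):
    """Weighted mean of component metrics in [0, 1], scaled to 0-100.

    Hypothetical aggregation for illustration; equal weights by default.
    """
    if not metrics:
        raise ValueError("metrics must not be empty")
    if weights is None:
        weights = {name: 1.0 for name in metrics}
    if set(weights) != set(metrics):
        raise ValueError("weights and metrics must name the same components")
    for name, value in metrics.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(weights[name] * metrics[name] for name in metrics)
    return 100.0 * weighted / total_weight


# Placeholder component values, not results from the paper.
toy_metrics = {f"metric_{i}": 0.5 for i in range(1, 8)}
toy_score = composite_score(toy_metrics)

# The reported headroom follows directly from the two published scores.
headroom = round(83.3 - 69.2, 1)
print(toy_score, headroom)
```

With seven components all at 0.5 the toy composite is 50.0, and the headroom line reproduces the 14.1-point gap between the radiologist baseline (83.3) and the strongest audited model (69.2).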
Simulated Author's Rebuttal
We thank the referee for the detailed review and the emphasis on strengthening the ecological validity of the case construction pipeline. We address the major comment below and outline targeted revisions to the manuscript.
-
Referee: [Case construction pipeline] Case construction pipeline (abstract and methods): The four perturbation categories and 300 clinician-authored cases are asserted to capture relevant evidential failures for clinical trustworthiness, yet the manuscript provides no external anchoring (e.g., comparison against hospital incident logs, additional blinded clinician surveys on ecological validity, or inter-rater metrics beyond internal consolidation). This assumption is load-bearing for interpreting the reported 14.1-point MCS headroom as clinically actionable rather than benchmark-specific.
Authors: We agree that external anchoring would strengthen claims of clinical relevance. Direct comparison against hospital incident logs is not feasible: such logs are protected under privacy regulations (e.g., HIPAA), rarely document silent failures explicitly, and are not publicly available for research use. The four perturbation categories were instead derived from established radiology error taxonomies in the literature (e.g., perceptual and cognitive errors under incomplete evidence) and instantiated by board-certified radiologists. The construction pipeline already incorporates parallel annotation by two attending radiologists followed by senior consolidation; we will add explicit inter-rater agreement statistics (Cohen’s kappa on answerability and risk-tier labels) to the methods and supplement. We will also expand the limitations section to clarify that the benchmark quantifies headroom under controlled perturbations rather than claiming direct equivalence to real-world incident rates, and note additional blinded surveys as valuable future work. These changes will make the scope and limitations of the 14.1-point gap explicit without overstating generalizability.
Revision: partial
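The inter-rater statistic promised in the rebuttal can be sketched as follows. This is a minimal unweighted Cohen's kappa for two annotators, assuming each label list holds one categorical label per case. The example labels reuse the L1 to L4 risk-tier names, but the values are invented. Because risk tiers are ordinal, a weighted kappa might be the more natural choice in practice; the unweighted form is shown only because it is the simplest.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two annotators over the same cases."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of cases where both annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Invented risk-tier labels from two hypothetical annotators.
annotator_1 = ["L1", "L2", "L3", "L3", "L4", "L4"]
annotator_2 = ["L1", "L2", "L3", "L4", "L4", "L4"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Here the annotators agree on five of six cases (observed agreement 5/6) against a chance agreement of 10/36, giving a kappa of 20/26, or about 0.769.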
Circularity Check
No circularity: benchmark and metrics are independently constructed
The paper introduces a fixed 300-case benchmark suite with clinician-authored perturbations, answerability flags, and risk tiers, then directly evaluates 18 models on seven audit metrics that are aggregated into MCS. No equations, fitted parameters, or self-citations are used to derive the reported scores; the human baseline (MCS 83.3) is produced by an independent radiologist answering the same probes. The construction pipeline and metric definitions stand on their own without reducing to model outputs or prior author results by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The four perturbation types validly simulate clinical scenarios where visual evidence is insufficient or misleading.
A Benchmark Positioning
Table 2 summarizes how MedVIGIL differs from adjacent benchmark families. The key distinction is not only the medical domain, but the combination of paired evidence perturbations, doctor-adjudicated safe targets, and risk-weighted silent-failure scoring. Tabl...
The risk layer stores the clinical risk tier (CRT) together with the binary text-only-answerability flag. The CRT is one of L1 (meta or modality questions, no clinical harm if wrong), L2 (anatomical localisation with no management impact), L3 (presence of finding, delayed care if missed), L4 (significant pathology or characterisation, wrong treatment if mis...
Per-case letter trajectories. To make the aggregate behaviour concrete, Table 14 reports the modal-letter trajectory across the four mask steps for five hand-picked qualitative cases (one per CRT tier, ablation re-queried for each model). Each row is one model on one case; cells coloured green are the doctor-finalised refusal option E, cells coloured red are ...