
arxiv: 2605.07919 · v2 · pith:VQMPFD7Enew · submitted 2026-05-08 · 💻 cs.CV

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

Pith reviewed 2026-05-25 06:16 UTC · model grok-4.3

classification 💻 cs.CV
keywords: medical vision-language models, silent failures, visual evidence perturbations, trustworthy medical AI, VQA benchmark, refusal behavior, radiologist evaluation, corrupted images

The pith

Medical VLMs must refuse to give fluent answers when image evidence is corrupted or falsified, yet the strongest audited model falls 14.1 points short of an independent radiologist on this test.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MedVIGIL, a 300-case suite of vision-required medical questions drawn from four public VQA sources and supervised end to end by board-certified radiologists. Each case pairs the question with one of four evidence breaks (false premise, wording change, knowledge-only rewrite, or ROI-corrupted image) and records whether the model refuses or produces a non-refusal answer. Seven correctness-conditioned metrics are combined into the MedVIGIL Composite Score (MCS); an independent radiologist reaches 83.3 with a 5.8 percent silent-failure rate, while the strongest model reaches 69.2. The benchmark supplies 2556 MCQ probes, 240 counterfactual triplets, risk tiers, ROI boxes, and an open-ended variant, all clinician-authored.
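A minimal sketch of how a silent-failure rate could be tallied from per-probe results. The record fields, the perturbation names, and the use of the letter E as the refusal option are assumptions for illustration (the paper's appendix mentions a refusal option E, but the released harness may encode it differently). This plain ratio also ignores the correctness conditioning that the paper's seven metrics apply.

```python
from dataclasses import dataclass

REFUSAL = "E"  # assumed label for the refusal option


@dataclass
class ProbeResult:
    case_id: str
    perturbation: str  # hypothetical names, e.g. "false_premise", "roi_corrupted"
    gold: str          # clinician-authored target letter
    predicted: str     # letter returned by the model


def silent_failure_rate(results):
    """Share of refusal-warranted probes answered with a non-refusal option."""
    should_refuse = [r for r in results if r.gold == REFUSAL]
    if not should_refuse:
        return 0.0
    silent = sum(1 for r in should_refuse if r.predicted != REFUSAL)
    return silent / len(should_refuse)


# Toy data, not taken from the benchmark.
demo = [
    ProbeResult("c1", "false_premise", "E", "B"),  # silent failure
    ProbeResult("c1", "roi_corrupted", "E", "E"),  # correct refusal
    ProbeResult("c2", "wording", "A", "A"),        # answerable, not counted
    ProbeResult("c2", "roi_corrupted", "E", "C"),  # silent failure
]
print(f"{silent_failure_rate(demo):.3f}")  # 2 of 3 refusal-warranted probes
```

Counting only probes whose gold target is the refusal option keeps answerable probes (such as a benign rewording) out of the denominator; whether the paper draws the line the same way is not confirmed here.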

Core claim

Medical vision-language models must recognise when the evidential basis for an answer has failed and refuse rather than return fluent non-refusal answers. The MedVIGIL suite, built from four public medical VQA sources and supervised end to end by four radiologists, measures this property through silent failures under false-premise, wording, knowledge-only, and ROI-corrupted perturbations. It places the human reference at MCS 83.3, 14.1 points above the best audited model.

What carries the argument

The MedVIGIL Composite Score aggregates seven audit metrics on silent-failure rate, refusal accuracy, and correctness under four clinician-defined perturbation categories.
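The caption of Figure 3 describes the MCS as a harmonic mean over three components: Capability, Safety (defined as 100 − SFRw), and Grounding. The sketch below assumes an unweighted harmonic mean of those three components on a 0 to 100 scale. How the seven audit metrics feed into the three components is not shown on this page, so the function name and its inputs are hypothetical.

```python
from statistics import harmonic_mean


def composite_score(capability, sfr_weighted, grounding):
    """Assumed MCS: unweighted harmonic mean of three 0-100 components."""
    safety = 100.0 - sfr_weighted  # Safety = 100 - SFRw, per the Figure 3 caption
    return harmonic_mean([capability, safety, grounding])


# Toy inputs, not results from the paper.
balanced = composite_score(80.0, 20.0, 80.0)  # components 80, 80, 80
lopsided = composite_score(90.0, 40.0, 90.0)  # components 90, 60, 90
print(round(balanced, 2), round(lopsided, 2))
```

In the second toy call the arithmetic mean of the components is 80, but the harmonic mean is about 77.14: a harmonic mean is pulled toward the weakest component, so a model cannot offset a poor safety score with high capability alone.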

If this is right

  • Models must incorporate explicit detection of broken visual evidence rather than relying on fluency alone.
  • Standard VQA accuracy metrics are insufficient for clinical trustworthiness.
  • The released probes and risk-tier flags allow repeated auditing of new models on the same fixed cases.
  • Human-level refusal under perturbation requires training signals that penalise answers when evidence is absent or contradictory.
  • The open-ended variant and counterfactual triplets test generalisation beyond multiple-choice format.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Deployment of current VLMs in radiology workflows risks silent propagation of answers based on invalid images.
  • The benchmark could be extended by adding temporal or multi-image perturbations to match real clinical workflows.
  • Training objectives that reward explicit uncertainty statements when evidence is missing may close the observed gap.
  • The 14-point headroom indicates that refusal calibration, not raw capability, is the current bottleneck.

Load-bearing premise

The four perturbation types and the clinician-authored cases capture the failure modes that matter for trustworthy clinical use.

What would settle it

An audited model that reaches the independent radiologist's MCS of 83.3, or a silent-failure rate at or below 5.8 percent, on the released 300-case set would close the reported gap on that measure; exceeding the radiologist would reverse it.
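The headroom arithmetic can be checked directly from the figures in the abstract. The sketch below recomputes the 14.1-point gap and tests whether a hypothetical model would match the radiologist baseline. Using the radiologist's own MCS and silent-failure rate as the thresholds, and requiring both at once, is an assumption of this sketch and not a criterion stated by the paper.

```python
RADIOLOGIST_MCS = 83.3  # from the abstract
RADIOLOGIST_SFR = 5.8   # percent, from the abstract
BEST_MODEL_MCS = 69.2   # Claude Opus 4.7, from the abstract


def headroom(model_mcs, reference_mcs=RADIOLOGIST_MCS):
    """Composite-score gap between the human reference and a model."""
    return round(reference_mcs - model_mcs, 1)


def matches_reference(model_mcs, model_sfr):
    """Assumed criterion: at least the radiologist's MCS and at most their SFR."""
    return model_mcs >= RADIOLOGIST_MCS and model_sfr <= RADIOLOGIST_SFR


print(headroom(BEST_MODEL_MCS))      # 14.1
print(matches_reference(80.5, 5.9))  # False: short on both measures
```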

Figures

Figures reproduced from arXiv: 2605.07919 by Hanqi Jiang, Haozhen Gong, Hui Ren, Hyeokjae Kwon, Jinglei Lv, Junhao Chen, Lifeng Chen, Lin Zhao, Mingyu Kang, Quanzheng Li, Ruiyu Yan, Tianming Liu, Weihang You, Xiang Li, Yi Pan.

Figure 1: A vision-required chest X-ray paired with two evidence-perturbations (ROI mask, laterality …
Figure 2: MedVIGIL construction and evaluation pipeline.
Figure 3: Audit summary. Left: MCS component decomposition (Capability / Safety = 100 − SFRw / Grounding / harmonic-mean MCS), ordered by MCS. Right: risk-tier SFR heatmap with Original accuracy (left blue) and SFRw (right red), sorted safest-first.
… knowledge-only accuracy (LPA 99.6%); and aggregate SFR is lowest on Claude Opus 4.7 (22.0%) and Gemini 3.1 Flash-Lite (22.3%) at different overall-accuracy levels. The radi…
Figure 4: Visual information decay on the 300 image-required cases (solid: MCQ accuracy as …
Figure 5: MCS components in two pairwise projections.
Figure 6: Per-CRT-tier accuracy as a function of blur
Figure 7: Visual-token ablation on the example case MVB-0031.
Original abstract

Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper introduces MedVIGIL, a 300-case benchmark drawn from four public medical VQA sources and supervised end to end by four board-certified radiologists, to evaluate medical VLMs on silent failures under four types of broken evidence (false premise, wording perturbation, knowledge-only rewrite, ROI-corrupted image). It defines seven correctness-conditioned audit metrics that aggregate into the MedVIGIL Composite Score (MCS), reports an independent radiologist baseline of MCS 83.3 (silent-failure rate 5.8%) versus 69.2 for the strongest model (Claude Opus 4.7), and publicly releases the 2556 MCQ probes, counterfactual triplets, risk tiers, and evaluation harness.

Significance. If the perturbation categories accurately represent clinical silent-failure modes, the work supplies a needed evaluation framework for trustworthy medical VLMs and quantifies a concrete 14.1-point composite headroom. The end-to-end multi-radiologist construction pipeline, independent fourth-radiologist baseline, and public release of the benchmark and harness are concrete strengths that support reproducibility and further research.

major comments (1)
  1. [Case construction pipeline] Case construction pipeline (abstract and methods): The four perturbation categories and 300 clinician-authored cases are asserted to capture relevant evidential failures for clinical trustworthiness, yet the manuscript provides no external anchoring (e.g., comparison against hospital incident logs, additional blinded clinician surveys on ecological validity, or inter-rater metrics beyond internal consolidation). This assumption is load-bearing for interpreting the reported 14.1-point MCS headroom as clinically actionable rather than benchmark-specific.
minor comments (3)
  1. [Abstract] Abstract: The seven correctness-conditioned audit metrics that compose the MCS are referenced but not enumerated; a one-sentence listing would improve immediate readability.
  2. [Evaluation metrics] Evaluation metrics: The precise aggregation formula or weighting scheme that produces the composite MCS from the seven component metrics is not shown in the main text; an explicit equation or pseudocode would aid verification.
  3. [Results tables] Table/figure captions: Several result tables report model scores to one decimal place while the human baseline is given to one decimal; consistent precision and explicit mention of the number of probes per model would reduce ambiguity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the emphasis on strengthening the ecological validity of the case construction pipeline. We address the major comment below and outline targeted revisions to the manuscript.

Point-by-point responses
  1. Referee: [Case construction pipeline] Case construction pipeline (abstract and methods): The four perturbation categories and 300 clinician-authored cases are asserted to capture relevant evidential failures for clinical trustworthiness, yet the manuscript provides no external anchoring (e.g., comparison against hospital incident logs, additional blinded clinician surveys on ecological validity, or inter-rater metrics beyond internal consolidation). This assumption is load-bearing for interpreting the reported 14.1-point MCS headroom as clinically actionable rather than benchmark-specific.

    Authors: We agree that external anchoring would strengthen claims of clinical relevance. Direct comparison against hospital incident logs is not feasible: such logs are protected under privacy regulations (e.g., HIPAA), rarely document silent failures explicitly, and are not publicly available for research use. The four perturbation categories were instead derived from established radiology error taxonomies in the literature (e.g., perceptual and cognitive errors under incomplete evidence) and instantiated by board-certified radiologists. The construction pipeline already incorporates parallel annotation by two attending radiologists followed by senior consolidation; we will add explicit inter-rater agreement statistics (Cohen’s kappa on answerability and risk-tier labels) to the methods and supplement. We will also expand the limitations section to clarify that the benchmark quantifies headroom under controlled perturbations rather than claiming direct equivalence to real-world incident rates, and note additional blinded surveys as valuable future work. These changes will make the scope and limitations of the 14.1-point gap explicit without overstating generalizability. revision: partial
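For readers unfamiliar with the agreement statistic the rebuttal promises, the following is a minimal from-scratch sketch of Cohen's kappa for two annotators. The tier labels and the toy annotations are illustrative assumptions and are not data from the paper; the authors may compute the statistic with a library routine or a weighted variant.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of items given the same label by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy risk-tier labels from two hypothetical annotators.
a = ["L1", "L2", "L3", "L3", "L4", "L2"]
b = ["L1", "L2", "L3", "L4", "L4", "L2"]
print(round(cohens_kappa(a, b), 3))  # observed 5/6, chance 0.25
```

Kappa corrects raw agreement for the agreement expected by chance, which matters for labels such as risk tiers where a few categories may dominate.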

Circularity Check

0 steps flagged

No circularity: benchmark and metrics are independently constructed

full rationale

The paper introduces a fixed 300-case benchmark suite with clinician-authored perturbations, answerability flags, and risk tiers, then directly evaluates 18 models on seven audit metrics that are aggregated into MCS. No equations, fitted parameters, or self-citations are used to derive the reported scores; the human baseline (MCS 83.3) is produced by an independent radiologist answering the same probes. The construction pipeline and metric definitions stand on their own without reducing to model outputs or prior author results by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the chosen perturbations represent clinically relevant broken evidence and that the multi-radiologist process produces reliable gold labels.

axioms (1)
  • Domain assumption: The four perturbation types validly simulate clinical scenarios where visual evidence is insufficient or misleading.
    Invoked to justify relevance of the silent-failure tests to trustworthy use.

pith-pipeline@v0.9.0 · 5874 in / 1166 out tokens · 46933 ms · 2026-05-25T06:16:44.561083+00:00 · methodology


Excerpts from the paper

A Benchmark Positioning. Table 2 summarizes how MedVIGIL differs from adjacent benchmark families. The key distinction is not only the medical domain, but the combination of paired evidence perturbations, doctor-adjudicated safe targets, and risk-weighted silent-failure scoring. Tabl...

The risk layer stores the clinical risk tier (CRT) together with the binary text-only-answerability flag. The CRT is one of L1 (meta or modality questions, no clinical harm if wrong), L2 (anatomical localisation with no management impact), L3 (presence of finding, delayed care if missed), L4 (significant pathology or characterisation, wrong treatment if mis...

Per-case letter trajectories. To make the aggregate behaviour concrete, Table 14 reports the modal-letter trajectory across the four mask steps for five hand-picked qualitative cases (one per CRT tier, ablation re-queried for each model). Each row is one model on one case; cells coloured green are the doctor-finalised refusal option E, cells coloured red are ...