pith. machine review for the scientific record.

arxiv: 2605.07919 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 03:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical vision-language models · trustworthy AI · silent failures · benchmark evaluation · visual evidence perturbation · radiologist supervision · composite score

The pith

Medical vision-language models fail to refuse answers when visual evidence is broken, trailing radiologists by 14 points on a new composite score.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that trustworthy medical VLMs must detect broken visual evidence and refuse to answer rather than produce fluent but unsupported responses. It presents MedVIGIL, a clinician-supervised benchmark of 300 cases with perturbations including false premises, wording changes, knowledge-only rewrites, and ROI-corrupted images. The evaluation uses seven metrics summarized in the MedVIGIL Composite Score (MCS). An independent radiologist scores 83.3 on MCS with a 5.8% silent failure rate. The strongest model reaches only 69.2, highlighting a significant gap for clinical reliability.

Core claim

MedVIGIL is a 300-case evaluation suite drawn from public medical VQA sources and fully supervised by board-certified radiologists. It contains 2,556 MCQ probes plus open-ended variants, with clinician-authored false-premise traps, ROI boxes, risk tiers, and refusal options. Audits of 16 vision-capable models using the MCS show a maximum score of 69.2, compared to an independent radiologist baseline of 83.3 at 5.8% silent failures.
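
The silent-failure rate at the center of the audit can be made concrete. A minimal sketch, assuming each probe record carries a broken-evidence flag and the model's chosen option letter, with the refusal option taken to be the letter "E" purely for illustration (the paper's actual option layout is not given here):

```python
from dataclasses import dataclass

REFUSAL = "E"  # hypothetical: the clinician-authored refusal option letter

@dataclass
class ProbeResult:
    perturbed: bool  # evidence was broken (false premise, ROI corruption, ...)
    answer: str      # model's chosen option letter

def silent_failure_rate(results):
    """Fraction of broken-evidence probes answered without refusal."""
    broken = [r for r in results if r.perturbed]
    if not broken:
        return 0.0
    silent = sum(1 for r in broken if r.answer != REFUSAL)
    return silent / len(broken)

results = [
    ProbeResult(perturbed=True,  answer="B"),      # silent failure
    ProbeResult(perturbed=True,  answer=REFUSAL),  # correct refusal
    ProbeResult(perturbed=False, answer="A"),      # intact evidence, not counted
]
print(silent_failure_rate(results))  # → 0.5
```

Under this definition, the radiologist's 5.8% corresponds to refusing on roughly 94% of broken-evidence probes.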

What carries the argument

The MedVIGIL Composite Score (MCS), which aggregates seven correctness-conditioned audit metrics to quantify silent failures under four types of broken evidence.
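
The exact aggregation rule is not spelled out on this page (the simulated rebuttal describes an unweighted average of seven normalized metrics, while Figure 3 mentions a harmonic-mean MCS). A hedged sketch of both candidates, with metric names and values invented for illustration only:

```python
from statistics import mean, harmonic_mean

# Seven correctness-conditioned metrics, each normalized to [0, 100].
# Names and values are illustrative stand-ins, not the paper's.
metrics = {
    "capability": 72.0,
    "safety": 78.0,          # e.g. 100 - weighted silent-failure rate
    "grounding": 65.0,
    "consistency": 70.0,
    "refusal_precision": 68.0,
    "risk_weighted_acc": 66.0,
    "counterfactual": 64.0,
}

mcs_mean = mean(metrics.values())
mcs_harmonic = harmonic_mean(metrics.values())  # penalizes any weak component
print(round(mcs_mean, 1), round(mcs_harmonic, 1))
```

The harmonic mean is never above the arithmetic mean, so a model with one weak audit dimension is penalized more under the Figure 3 variant.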

If this is right

  • Models can be systematically audited on the released suite to measure progress in evidence verification.
  • Clinical deployment should incorporate silent-failure thresholds derived from the MCS to limit unsupported answers.
  • The paired open-ended variant allows testing refusal behavior beyond multiple-choice formats.
  • Risk-tier annotations enable focused evaluation on high-stakes medical cases.
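
The deployment implication in the second bullet can be phrased as a simple gate against the radiologist reference. The threshold values here are assumptions for illustration, anchored to the reported human numbers (MCS 83.3, SFR 5.8%):

```python
def clears_deployment_gate(mcs: float, sfr: float,
                           mcs_floor: float = 83.0,
                           sfr_ceiling: float = 0.06) -> bool:
    """Hypothetical gate: require near-radiologist composite score
    and silent-failure rate before clinical use."""
    return mcs >= mcs_floor and sfr <= sfr_ceiling

# Strongest audited model vs. the radiologist baseline (numbers from the page)
print(clears_deployment_gate(69.2, 0.22))   # model: fails the gate
print(clears_deployment_gate(83.3, 0.058))  # radiologist: clears it
```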

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models passing this benchmark may still require additional testing against naturally occurring image corruptions not captured by the four synthetic perturbations.
  • Incorporating MCS-style penalties during training could incentivize VLMs to learn when to withhold answers.
  • Extending the suite to non-radiology imaging domains would check whether the observed gap is modality-specific.
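
The training idea in the second bullet can be sketched as a reward-shaping rule: correct answers on intact evidence score positively, refusals on broken evidence score positively, and fluent non-refusals under broken evidence are penalized hardest. The numeric values are assumptions, not from the paper:

```python
def audit_reward(evidence_broken: bool, refused: bool, correct: bool,
                 silent_failure_penalty: float = 2.0) -> float:
    """Hypothetical MCS-style shaping: penalize silent failures most."""
    if evidence_broken:
        # Any non-refusal here is a silent failure, however fluent.
        return 1.0 if refused else -silent_failure_penalty
    if refused:
        return -0.5  # over-refusal on intact evidence is also undesirable
    return 1.0 if correct else 0.0

print(audit_reward(evidence_broken=True, refused=False, correct=True))  # → -2.0
print(audit_reward(evidence_broken=True, refused=True, correct=False))  # → 1.0
```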

Load-bearing premise

The four chosen perturbation types and the 300 clinician-authored cases sufficiently represent the range of broken visual evidence that occurs in real clinical practice.

What would settle it

Releasing a new model that achieves an MCS of 83 or higher on the full MedVIGIL suite with silent-failure rates near 6% would test whether the performance gap can be closed.

Figures

Figures reproduced from arXiv: 2605.07919 by Hanqi Jiang, Haozhen Gong, Hui Ren, Jinglei Lv, Junhao Chen, Lifeng Chen, Lin Zhao, Quanzheng Li, Ruiyu Yan, Tianming Liu, Weihang You, Xiang Li, Yi Pan.

Figure 1. A vision-required chest X-ray paired with two evidence perturbations (ROI mask, laterality …).
Figure 2. MedVIGIL construction and evaluation pipeline.
Figure 3. Audit summary. Left: MCS component decomposition (Capability / Safety = 100 − SFRw / Grounding / harmonic-mean MCS), ordered by MCS. Right: risk-tier SFR heatmap with Original accuracy (left, blue) and SFRw (right, red), sorted safest-first. Knowledge-only accuracy (LPA 99.6%); aggregate SFR is lowest for Claude Opus 4.7 (22.0%) and Gemini 3.1 Flash-Lite (22.3%) at different overall-accuracy levels. The radi…
Figure 4. Visual information decay on the 300 image-required cases (solid: MCQ accuracy as …).
Figure 5. MCS components in two pairwise projections.
Figure 6. Per-CRT-tier accuracy as a function of blur ….
Figure 7. Visual-token ablation on the example case MVB-0031.
Original abstract

Medical vision–language models (VLMs) are usually evaluated on intact image–question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2,556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MedVIGIL, a 300-case evaluation suite for medical VLMs under broken visual evidence, constructed from four public VQA sources and fully supervised by four board-certified radiologists (parallel attending annotations, senior consolidation, independent fourth-radiologist baseline). Cases incorporate four perturbation types (false premise, wording perturbation, knowledge-only rewrite, and ROI corruption) to induce silent failures in which models give fluent non-refusal answers. The suite includes 2,556 MCQ probes, 240 counterfactual triplets, risk tiers, ROI boxes, and an open-ended variant. Seven correctness-conditioned metrics are aggregated into the MedVIGIL Composite Score (MCS); the independent radiologist baseline achieves MCS 83.3 at a 5.8% silent-failure rate, 14.1 points above the strongest model (Claude Opus 4.7 at 69.2). The benchmark and harness are publicly released.

Significance. If the 300 cases and four perturbation classes adequately proxy real clinical visual-evidence failures, the work is significant for exposing a measurable trustworthiness gap in current medical VLMs and supplying a clinician-authored, publicly released resource for auditing and improving model refusal behavior. The multi-radiologist oversight, independent human baseline, and release of probes plus risk tiers strengthen its utility as a standardized testbed.

major comments (2)
  1. [Abstract] The assumption that the four perturbation types (false premise, wording perturbation, knowledge-only rewrite, and ROI corruption) plus the 300 clinician-authored cases from four VQA sources sufficiently represent the distribution of broken visual evidence in real radiology workflows is load-bearing for interpreting the 14.1-point MCS headroom and 5.8% human silent-failure rate as clinically actionable. No external validation (e.g., mapping to observed PACS artifacts, modality-specific failure statistics, or inter-institutional audits) is provided to confirm coverage or severity calibration.
  2. [Abstract] The aggregation rule that combines the seven correctness-conditioned audit metrics into the single MedVIGIL Composite Score (MCS) is not specified, preventing assessment of whether the reported 83.3 vs. 69.2 gap is robust to alternative weightings or dominated by particular sub-metrics such as silent-failure rate.
minor comments (1)
  1. [Abstract] The notation '2{,}556' is a typographical artifact and should be rendered as '2,556' for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We provide point-by-point responses to the major comments below and have updated the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] The assumption that the four perturbation types (false premise, wording perturbation, knowledge-only rewrite, and ROI corruption) plus the 300 clinician-authored cases from four VQA sources sufficiently represent the distribution of broken visual evidence in real radiology workflows is load-bearing for interpreting the 14.1-point MCS headroom and 5.8% human silent-failure rate as clinically actionable. No external validation (e.g., mapping to observed PACS artifacts, modality-specific failure statistics, or inter-institutional audits) is provided to confirm coverage or severity calibration.

    Authors: We agree that external validation against real-world clinical data would be valuable for establishing the benchmark's representativeness. Our work focuses on creating a clinician-supervised benchmark using perturbations derived from existing public VQA datasets to systematically evaluate silent failures in medical VLMs. The four perturbation types were selected based on common issues in medical VQA and radiologist input to simulate plausible broken evidence scenarios. We have revised the manuscript by adding a new 'Limitations and Future Work' paragraph in the Discussion section. This paragraph acknowledges that the 300 cases serve as a proxy rather than a comprehensive sample of all possible clinical failures, explains the design choices, and explicitly recommends future studies involving PACS artifact analysis and inter-institutional validation to calibrate severity. revision: yes

  2. Referee: [Abstract] The aggregation rule that combines the seven correctness-conditioned audit metrics into the single MedVIGIL Composite Score (MCS) is not specified, preventing assessment of whether the reported 83.3 vs. 69.2 gap is robust to alternative weightings or dominated by particular sub-metrics such as silent-failure rate.

    Authors: We appreciate this observation. The aggregation rule for the MCS was not sufficiently detailed. We have revised the abstract to state that the MCS is the unweighted average of the seven normalized metrics (each scaled to [0,100]). We have also added the explicit formula in the Methods section along with a sensitivity analysis to alternative weightings, confirming the robustness of the reported gap. revision: yes
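
The sensitivity analysis the authors describe can be sketched directly: draw random weightings over the seven metrics and check whether the human–model gap stays positive. The two metric profiles below are invented stand-ins; only their unweighted means (83.3 and 69.2) match the reported scores:

```python
import random

random.seed(0)

# Illustrative seven-metric profiles; unweighted means are 83.3 and 69.2.
human = [90.0, 95.0, 80.0, 85.0, 78.0, 82.0, 73.1]
model = [80.0, 70.0, 60.0, 75.0, 65.0, 68.0, 66.4]

def weighted_score(vals, w):
    """Composite under an arbitrary non-negative weighting."""
    return sum(v * wi for v, wi in zip(vals, w)) / sum(w)

gaps = []
for _ in range(1000):
    w = [random.random() for _ in range(7)]
    gaps.append(weighted_score(human, w) - weighted_score(model, w))

print(min(gaps) > 0)  # here the gap survives every sampled weighting
```

With these stand-in profiles the human leads on every sub-metric, so any positive weighting preserves the gap; the real test is whether the paper's measured per-metric values share that property.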

Circularity Check

0 steps flagged

No circularity: new benchmark with independent annotations yields empirical results

Full rationale

The paper constructs MedVIGIL as a fresh 300-case suite from public VQA sources, with all gold answers, refusal options, risk tiers, and perturbations clinician-authored under multi-radiologist supervision. MCS aggregates seven audit metrics computed directly on model outputs for the released probes; the human baseline (MCS 83.3) comes from a fourth independent radiologist answering the same probes. No equations, fitted parameters, or self-citations reduce any reported quantity to a prior result by construction. The derivation chain (case construction → model auditing → metric aggregation → headroom calculation) is self-contained and externally falsifiable via the public release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that radiologist-authored perturbations and annotations constitute a valid proxy for clinical broken-evidence scenarios. No free parameters are fitted and no new physical entities are postulated.

axioms (1)
  • Domain assumption: Board-certified radiologist annotations provide reliable ground truth for answerability and clinical risk in medical VQA cases.
    All gold answers, refusal options, and risk tiers are stated to be clinician-authored and consolidated by a senior radiologist.

pith-pipeline@v0.9.0 · 5634 in / 1226 out tokens · 44154 ms · 2026-05-11T03:40:45.192215+00:00 · methodology

discussion (0)

