MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

Hanqi Jiang; Haozhen Gong; Hui Ren; Hyeokjae Kwon; Jinglei Lv; Junhao Chen; Lifeng Chen; Lin Zhao; Mingyu Kang; Quanzheng Li

arxiv: 2605.07919 · v2 · pith:VQMPFD7Enew · submitted 2026-05-08 · 💻 cs.CV

Hanqi Jiang, Junhao Chen, Mingyu Kang, Hyeokjae Kwon, Yi Pan, Lifeng Chen, Weihang You, Haozhen Gong, Ruiyu Yan, Jinglei Lv, Lin Zhao, Hui Ren, Quanzheng Li, Tianming Liu, Xiang Li

Pith reviewed 2026-05-25 06:16 UTC · model grok-4.3

classification 💻 cs.CV

keywords medical vision-language models · silent failures · visual evidence perturbations · trustworthy medical AI · VQA benchmark · refusal behavior · radiologist evaluation · corrupted images


The pith

Medical VLMs must refuse fluent answers when image evidence is corrupted or falsified, yet current models fall 14 points short of human radiologists on this test.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates MedVIGIL, a 300-case suite of vision-required medical questions drawn from four public VQA sources and supervised end to end by four board-certified radiologists. Each case pairs the question with one of four evidence breaks (false premise, wording change, knowledge-only rewrite, or ROI-corrupted image) and records whether the model refuses or produces a non-refusal answer. Seven correctness-conditioned metrics are combined into the MedVIGIL Composite Score (MCS): an independent radiologist reaches 83.3 with a 5.8 percent silent-failure rate, while the strongest model, Claude Opus 4.7, reaches 69.2. The benchmark supplies 2556 MCQ probes, 240 counterfactual triplets, risk tiers, ROI boxes, and an open-ended variant, all clinician-authored.

Core claim

Medical vision-language models must recognise when the evidential basis for an answer has failed and refuse rather than return fluent non-refusal answers. The MedVIGIL suite, built from four public medical VQA sources and supervised end to end by four radiologists, measures this property through silent failures under false-premise, wording, knowledge-only, and ROI-corrupted perturbations. It places the human reference at MCS 83.3, 14.1 points above the best audited model.
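To make the silent-failure notion concrete, here is a minimal sketch that counts a probe as a silent failure when refusal is the expected response and the model answers anyway. The record fields, the perturbation labels, the `refusal_expected` flag, and the unweighted rate are all assumptions made for illustration; the paper's seven correctness-conditioned metrics and its risk-weighted SFRw are not specified on this page and are not reproduced here.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ProbeResult:
    case_id: str            # hypothetical identifier, not a real MedVIGIL case
    perturbation: str       # hypothetical label, e.g. "false_premise"
    refusal_expected: bool  # assumed flag: the gold response is the refusal option
    model_refused: bool     # whether the model chose to refuse


def silent_failure_rate(results: Iterable[ProbeResult]) -> float:
    """Share of refusal-expected probes on which the model answered anyway."""
    eligible = [r for r in results if r.refusal_expected]
    if not eligible:
        raise ValueError("no probes where refusal is the expected response")
    silent = sum(1 for r in eligible if not r.model_refused)
    return silent / len(eligible)


# Invented demo records; they do not come from the benchmark.
demo = [
    ProbeResult("case-a", "false_premise", True, True),
    ProbeResult("case-a", "roi_corrupted", True, False),
    ProbeResult("case-b", "false_premise", True, False),
    ProbeResult("case-b", "roi_corrupted", True, True),
    ProbeResult("case-b", "wording", False, False),  # refusal not expected, ignored
]
rate = silent_failure_rate(demo)
print(f"silent-failure rate: {rate:.1%}")
```

Keying the rate on a per-probe flag rather than on the perturbation type keeps the sketch agnostic about which perturbations actually call for refusal, a detail this page does not settle.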

What carries the argument

The MedVIGIL Composite Score aggregates seven audit metrics on silent-failure rate, refusal accuracy, and correctness under four clinician-defined perturbation categories.
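The Figure 3 caption describes the MCS as a harmonic mean over three components: Capability, Safety (defined as 100 − SFRw), and Grounding. The sketch below follows that reading. How the seven audit metrics map onto the three components is not given here, so the function signature, the 0 to 100 scale check, and the input values are assumptions rather than the authors' implementation.

```python
from statistics import harmonic_mean


def composite_score(capability: float, sfr_weighted: float, grounding: float) -> float:
    """Harmonic mean of Capability, Safety (100 - SFRw), and Grounding.

    All inputs are assumed to be percentages on a 0-100 scale.
    """
    safety = 100.0 - sfr_weighted
    components = (capability, safety, grounding)
    if any(c < 0.0 or c > 100.0 for c in components):
        raise ValueError("components must lie in [0, 100]")
    return harmonic_mean(components)


# Hypothetical component values, chosen only to show the behaviour.
balanced = composite_score(capability=80.0, sfr_weighted=20.0, grounding=80.0)
lopsided = composite_score(capability=90.0, sfr_weighted=40.0, grounding=45.0)
print(balanced, lopsided)
```

A harmonic mean is pulled toward the weakest component: the lopsided example has an arithmetic mean of 65 but a harmonic mean of 60, so a model cannot offset weak grounding or safety with raw capability alone.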

If this is right

Models must incorporate explicit detection of broken visual evidence rather than relying on fluency alone.
Standard VQA accuracy metrics are insufficient for clinical trustworthiness.
The released probes and risk-tier flags allow repeated auditing of new models on the same fixed cases.
Human-level refusal under perturbation requires training signals that penalise answers when evidence is absent or contradictory.
The open-ended variant and counterfactual triplets test generalisation beyond multiple-choice format.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Deployment of current VLMs in radiology workflows risks silent propagation of answers based on invalid images.
The benchmark could be extended by adding temporal or multi-image perturbations to match real clinical workflows.
Training objectives that reward explicit uncertainty statements when evidence is missing may close the observed gap.
The 14-point headroom indicates that refusal calibration, not raw capability, is the current bottleneck.

Load-bearing premise

The four perturbation types and the clinician-authored cases capture the failure modes that matter for trustworthy clinical use.

What would settle it

An audited model that matches or exceeds the radiologist's MCS of 83.3, or brings its silent-failure rate to 5.8 percent or below, on the released 300-case set would close or reverse the reported gap to the independent radiologist baseline.

Figures

Figures reproduced from arXiv: 2605.07919 by Hanqi Jiang, Haozhen Gong, Hui Ren, Hyeokjae Kwon, Jinglei Lv, Junhao Chen, Lifeng Chen, Lin Zhao, Mingyu Kang, Quanzheng Li, Ruiyu Yan, Tianming Liu, Weihang You, Xiang Li, Yi Pan.

**Figure 2.** MedVIGIL construction and evaluation pipeline.

**Figure 3.** Audit summary. Left: MCS component decomposition (Capability / Safety = 100 − SFRw / Grounding / harmonic-mean MCS), ordered by MCS. Right: risk-tier SFR heatmap with Original accuracy (left blue) and SFRw (right red), sorted safest-first. … knowledge-only accuracy (LPA 99.6%); and aggregate SFR is lowest on Claude Opus 4.7 (22.0%) and Gemini 3.1 Flash-Lite (22.3%) at different overall-accuracy levels. The radi…

**Figure 4.** Visual information decay on the 300 image-required cases (solid: MCQ accuracy as …

**Figure 5.** MCS components in two pairwise projections.

**Figure 6.** Per-CRT-tier accuracy as a function of blur

**Figure 7.** Visual-token ablation on the example case MVB-0031.

Original abstract

Medical vision-language models (VLMs) are usually evaluated on intact image-question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MedVIGIL gives a clean new benchmark for silent failures in medical VLMs with solid clinician oversight, but the perturbation set still needs external checks against real clinical error patterns.


The main thing here is a new 300-case suite that tests whether medical VLMs refuse when the image or premise is broken. They pull cases from existing VQA sets, add four perturbation types (false premise, wording changes, knowledge-only rewrites, ROI-corrupted images), and have four radiologists build and check everything end-to-end, including risk tiers and answerability flags. A separate radiologist gives the human baseline. They release the full harness and report a composite score (MCS) that puts the best model 14 points behind the human reference.

That setup is new and the public release plus the refusal option are practical additions that prior medical VQA work did not emphasize as directly. The multi-radiologist pipeline and the independent scorer are clear strengths; they reduce the chance that the benchmark is just author-curated. The MCS metric itself looks straightforward once you accept the correctness-conditioned audits.

The soft spot is exactly the one the stress-test flagged: the four perturbation categories are sensible on paper, but nothing in the abstract or described construction shows they match the actual distribution of silent failures that happen in hospitals. No incident-log comparison, no extra clinician survey on ecological fit, and no inter-rater stats beyond the internal consolidation step. That gap does not kill the work, but it means the 14-point headroom claim rests on an unanchored sample. The prompting details and exact metric formulas would also need a close look in the full text to rule out post-hoc choices.

This is the kind of paper that belongs in a reading group focused on evaluation or clinical AI safety; anyone building or auditing medical VLMs will want to see the cases. It is coherent on its own terms and shows honest engagement with the refusal problem, so it deserves a serious referee rather than a desk reject. I would send it out for review with the expectation that the authors add some external anchoring or at least a limitations section on the perturbation coverage.

Referee Report

1 major / 3 minor

Summary. The paper introduces MedVIGIL, a 300-case benchmark drawn from public medical VQA sources and fully supervised by four board-certified radiologists, to evaluate medical VLMs on silent failures under four types of broken evidence (false premise, wording perturbation, knowledge-only rewrite, ROI-corrupted image). It defines seven correctness-conditioned audit metrics that aggregate into the MedVIGIL Composite Score (MCS), reports an independent radiologist baseline of MCS 83.3 (silent-failure rate 5.8%) versus the strongest model (Claude Opus 4.7) at 69.2, and publicly releases the 2556 MCQ probes, counterfactual triplets, risk tiers, and evaluation harness.

Significance. If the perturbation categories accurately represent clinical silent-failure modes, the work supplies a needed evaluation framework for trustworthy medical VLMs and quantifies a concrete 14.1-point composite headroom. The end-to-end multi-radiologist construction pipeline, independent fourth-radiologist baseline, and public release of the benchmark and harness are concrete strengths that support reproducibility and further research.

major comments (1)

Case construction pipeline (abstract and methods): The four perturbation categories and 300 clinician-authored cases are asserted to capture relevant evidential failures for clinical trustworthiness, yet the manuscript provides no external anchoring (e.g., comparison against hospital incident logs, additional blinded clinician surveys on ecological validity, or inter-rater metrics beyond internal consolidation). This assumption is load-bearing for interpreting the reported 14.1-point MCS headroom as clinically actionable rather than benchmark-specific.

minor comments (3)

Abstract: The seven correctness-conditioned audit metrics that compose the MCS are referenced but not enumerated; a one-sentence listing would improve immediate readability.
Evaluation metrics: The precise aggregation formula or weighting scheme that produces the composite MCS from the seven component metrics is not shown in the main text; an explicit equation or pseudocode would aid verification.
Table/figure captions: Result tables should report model scores and the human baseline at the same, explicitly stated precision; mentioning the number of probes per model would further reduce ambiguity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed review and the emphasis on strengthening the ecological validity of the case construction pipeline. We address the major comment below and outline targeted revisions to the manuscript.


Referee: Case construction pipeline (abstract and methods): The four perturbation categories and 300 clinician-authored cases are asserted to capture relevant evidential failures for clinical trustworthiness, yet the manuscript provides no external anchoring (e.g., comparison against hospital incident logs, additional blinded clinician surveys on ecological validity, or inter-rater metrics beyond internal consolidation). This assumption is load-bearing for interpreting the reported 14.1-point MCS headroom as clinically actionable rather than benchmark-specific.

Authors: We agree that external anchoring would strengthen claims of clinical relevance. Direct comparison against hospital incident logs is not feasible: such logs are protected under privacy regulations (e.g., HIPAA), rarely document silent failures explicitly, and are not publicly available for research use. The four perturbation categories were instead derived from established radiology error taxonomies in the literature (e.g., perceptual and cognitive errors under incomplete evidence) and instantiated by board-certified radiologists. The construction pipeline already incorporates parallel annotation by two attending radiologists followed by senior consolidation; we will add explicit inter-rater agreement statistics (Cohen’s kappa on answerability and risk-tier labels) to the methods and supplement. We will also expand the limitations section to clarify that the benchmark quantifies headroom under controlled perturbations rather than claiming direct equivalence to real-world incident rates, and note additional blinded surveys as valuable future work. These changes will make the scope and limitations of the 14.1-point gap explicit without overstating generalizability. revision: partial
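The rebuttal promises Cohen's kappa on answerability and risk-tier labels. As a reminder of what that statistic measures, the sketch below computes the unweighted kappa for two annotators from observed agreement and the agreement expected by chance. The tier labels and annotation lists are invented for illustration; whether the authors would use a weighted variant for ordinal risk tiers is not stated.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("annotators must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: fraction of items given identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical risk-tier annotations from two readers on four cases.
reader_1 = ["L1", "L2", "L3", "L3"]
reader_2 = ["L1", "L2", "L3", "L4"]
kappa = cohens_kappa(reader_1, reader_2)
print(f"kappa = {kappa:.3f}")
```

In this toy example the readers agree on three of four cases (0.75 observed) against 0.25 expected by chance, which gives a kappa of about 0.667.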

Circularity Check

0 steps flagged

No circularity: benchmark and metrics are independently constructed


The paper introduces a fixed 300-case benchmark suite with clinician-authored perturbations, answerability flags, and risk tiers, then directly evaluates 18 models on seven audit metrics that are aggregated into MCS. No equations, fitted parameters, or self-citations are used to derive the reported scores; the human baseline (MCS 83.3) is produced by an independent radiologist answering the same probes. The construction pipeline and metric definitions stand on their own without reducing to model outputs or prior author results by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the chosen perturbations represent clinically relevant broken evidence and that the multi-radiologist process produces reliable gold labels.

axioms (1)

domain assumption: The four perturbation types validly simulate clinical scenarios where visual evidence is insufficient or misleading.
Invoked to justify relevance of the silent-failure tests to trustworthy use.

pith-pipeline@v0.9.0 · 5874 in / 1166 out tokens · 46933 ms · 2026-05-25T06:16:44.561083+00:00 · methodology



[1] [2]

Aofei Chang, Le Huang, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, and Cao Xiao

URLhttps://arxiv.org/abs/2603.21687. Aofei Chang, Le Huang, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, and Cao Xiao. MedHEval: Benchmarking hallucinations and mitigation strategies in medical large vision-language models. arXiv preprint arXiv:2503.02157,

work page arXiv

[2] [3]

Detecting and Evaluating Medical Hallucinations in Large Vision Language Models

URLhttps://arxiv.org/abs/2503.02157. Jiawei Chen, Dingkang Yang, Tong Wu, Yue Jiang, Xiaolu Hou, Mingcheng Li, et al. Detect- ing and evaluating medical hallucinations in large vision language models.arXiv preprint arXiv:2406.10185v1, 2024a. doi: 10.48550/arXiv.2406.10185. URL https://arxiv.org/ abs/2406.10185v1. Junying Chen, Chi Gui, Ruyi Ouyang, Anning...

work page internal anchor Pith review doi:10.48550/arxiv.2406.10185 2024

[3] [4]

URL https://cacm.acm.org/research/ datasheets-for-datasets/

doi: 10.1145/3458723. URL https://cacm.acm.org/research/ datasheets-for-datasets/. Yonatan Geifman and Ran El-Yaniv. Selective classification for deep neural networks. InAdvances in Neural Information Processing Systems,

work page doi:10.1145/3458723

[4] [5]

Gemini: A Family of Highly Capable Multimodal Models

URL https://papers.neurips.cc/paper/ 7073-selective-classification-for-deep-neural-networks. Gemini Team Google. Gemini: A family of highly capable multimodal models.arXiv preprint arXiv:2312.11805,

work page internal anchor Pith review Pith/arXiv arXiv

[5] [6]

Gemini: A Family of Highly Capable Multimodal Models

doi: 10.48550/arXiv.2312.11805. URL https://arxiv.org/abs/ 2312.11805. Tejas Gokhale, Pratyay Banerjee, Chitta Baral, and Yezhou Yang. VQA-LOL: Visual question answering under the lens of logic. InEuropean Conference on Computer Vision,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2312.11805

[6] [7]

10 Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh

URL https://arxiv.org/abs/2002.08325. 10 Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the V in VQA matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6904–6913,

work page arXiv 2002

[7] [8]

URL https://openaccess.thecvf.com/ content_cvpr_2017/html/Goyal_Making_the_v_CVPR_2017_paper.html

doi: 10.1109/CVPR.2017.670. URL https://openaccess.thecvf.com/ content_cvpr_2017/html/Goyal_Making_the_v_CVPR_2017_paper.html. Zishan Gu, Changchang Yin, Fenglin Liu, and Ping Zhang. MedVH: Towards systematic evalu- ation of hallucination for large vision language models in the medical context.arXiv preprint arXiv:2407.02730,

work page doi:10.1109/cvpr.2017.670 2017

[8] [9]

Hao Guan and Mingxia Liu

URLhttps://arxiv.org/abs/2407.02730. Hao Guan and Mingxia Liu. Domain adaptation for medical image analysis: A survey.IEEE Transactions on Biomedical Engineering, 69(3):1173–1185,

work page arXiv

[9] [10]

doi: 10.1109/TBME.2021. 3117407. URLhttps://pmc.ncbi.nlm.nih.gov/articles/PMC9011180/. Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. InInternational Conference on Learning Representations,

work page doi:10.1109/tbme.2021 2021

[10] [11]

Benchmarking Neural Network Robustness to Common Corruptions and Perturbations

doi: 10.48550/arXiv.1903.12261. URLhttps://arxiv.org/abs/1903.12261. Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison. InAAAI Conference on Artificial Intelligence,

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1903.12261 1903

[11] [12]

Alistair E

doi: 10.1609/aaai.v33i01.3301590. Alistair E. W. Johnson, Tom J. Pollard, Seth J. Berkowitz, Nathaniel R. Greenbaum, Matthew P. Lungren, Chih-ying Deng, et al. MIMIC-CXR, a de-identified publicly available database of chest radiographs with free-text reports.Scientific Data, 6:317,

work page doi:10.1609/aaai.v33i01.3301590

[12] [13]

URLhttps://www.nature.com/articles/s41597-019-0322-0

doi: 10.1038/s41597-019-0322-0. URLhttps://www.nature.com/articles/s41597-019-0322-0. Polina Kirichenko, Mark Ibrahim, Kamalika Chaudhuri, and Samuel J. Bell. AbstentionBench: Reasoning LLMs fail on unanswerable questions.arXiv preprint arXiv:2506.09038,

work page doi:10.1038/s41597-019-0322-0

[13] [14]

URL https://arxiv.org/abs/2506.09038. Jason J. Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific Data, 5:180251,

work page arXiv

[14] [15]

URLhttps://www.nature.com/articles/sdata2018251

doi: 10.1038/sdata.2018.251. URLhttps://www.nature.com/articles/sdata2018251. Kang-il Lee, Minbeom Kim, Seunghyun Yoon, Minsung Kim, Dongryeol Lee, Hyukhun Koh, et al. VLind-Bench: Measuring language priors in large vision-language models. InFindings of the Association for Computational Linguistics: NAACL 2025,

URL https://arxiv.org/abs/2406.08702. Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, et al. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. In Advances in Neural Information Processing Systems, Datasets and Benchmarks Track, 2023a. URL https://arxiv.org/abs/2306.00890. Jidong Li, Li...

URL https://arxiv.org/abs/2510.10965. Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 292–305, 2023b. doi: 10.18653/v1/2023.emnlp-main.20. URL https://aclanthology.org/202...

doi: 10.1109/ISBI48211.2021.9434010. URL https://arxiv.org/abs/2102.09542. Nishanth Madhusudhan, Vikas Yadav, and Alexandre Lacoste. Knowing when not to answer: Evaluating abstention in multimodal reasoning systems. arXiv preprint arXiv:2604.14799,

URL https://arxiv.org/abs/2604.14799. MLCommons. Croissant: A metadata format for ML-ready datasets,

Benchmarking Deflection and Hallucination in Large Vision-Language Models

URL https://arxiv.org/abs/2604.12033. Dung Nguyen, Minh Khoi Ho, Huy Ta, Thanh Tam Nguyen, Qi Chen, Kumar Rav, et al. Localizing before answering: A benchmark for grounded medical visual question answering. In Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence (IJCAI),

doi: 10.24963/ijcai.2025/853. URL https://arxiv.org/abs/2505.00744. OpenAI. GPT-4o system card,

doi: 10.1007/978-3-030-01364-6_20. Mahima Pushkarna, Andrew Zaldivar, and Oddur Kjartansson. Data cards: Purposeful and transparent dataset documentation for responsible AI. arXiv preprint arXiv:2204.01075,

doi: 10.48550/arXiv.2204.01075. URL https://arxiv.org/abs/2204.01075. Jie Ren, Yao Zhao, Tu Vu, Peter J. Liu, and Balaji Lakshminarayanan. Self-evaluation improves selective generation in large language models. arXiv preprint arXiv:2312.09300,

doi: 10.48550/arXiv.2312.09300. URL https://arxiv.org/abs/2312.09300. Anna Rohrbach, Lisa Anne Hendricks, Kaylee Burns, Trevor Darrell, and Kate Saenko. Object hallucination in image captioning. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 4035–4045,

doi: 10.18653/v1/D18-1437. URL https://aclanthology.org/D18-1437/. Meet Shah, Xinlei Chen, Marcus Rohrbach, and Devi Parikh. Cycle-consistency for robust visual question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 6649–6658,

URL https://arxiv.org/abs/2507.23486. Bingbing Wen, Jihan Yao, Shangbin Feng, Chenjun Xu, Yulia Tsvetkov, Bill Howe, et al. Know your limits: A survey of abstention in large language models. Transactions of the Association for Computational Linguistics, 12:1412–1430,

URL https://arxiv.org/abs/2407.18418. Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology by leveraging web-scale 2d and 3d medical data. arXiv preprint arXiv:2308.02463,

doi: 10.48550/arXiv.2308.02463. URL https://arxiv.org/abs/2308.02463. Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, et al. PMC-VQA: Visual instruction tuning for medical visual question answering. arXiv preprint arXiv:2305.10415,

doi: 10.48550/arXiv.2305.10415. URL https://arxiv.org/abs/2305.10415.

A Benchmark Positioning

Table 2 summarizes how MedVIGIL differs from adjacent benchmark families. The key distinction is not only the medical domain, but the combination of paired evidence perturbations, doctor-adjudicated safe targets, and risk-weighted silent-failure scoring. Tabl...

The risk layer stores the clinical risk tier (CRT) together with the binary text-only-answerability flag. The CRT is one of L1 (meta or modality questions, no clinical harm if wrong), L2 (anatomical localisation with no management impact), L3 (presence of finding, delayed care if missed), L4 (significant pathology or characterisation, wrong treatment if mis...

Is there evidence of a right apical pneumothorax?

Per-case letter trajectories. To make the aggregate behaviour concrete, Table 14 reports the modal-letter trajectory across the four mask steps for five hand-picked qualitative cases (one per CRT tier, ablation re-queried for each model). Each row is one model on one case; cells coloured green are the doctor-finalised refusal option E, cells coloured red are ...
