pith. machine review for the scientific record.

arxiv: 2605.07919 · v1 · submitted 2026-05-08 · 💻 cs.CV

Recognition: no theorem link

MedVIGIL: Evaluating Trustworthy Medical VLMs Under Broken Visual Evidence

Authors on Pith: no claims yet

Pith reviewed 2026-05-11 03:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical vision-language models · trustworthy AI · silent failures · benchmark evaluation · visual evidence perturbation · radiologist supervision · composite score

The pith

Medical vision-language models fail to refuse answers when visual evidence is broken, trailing radiologists by 14 points on a new composite score.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper establishes that trustworthy medical VLMs must detect broken visual evidence and refuse to answer rather than produce fluent but unsupported responses. It presents MedVIGIL, a clinician-supervised benchmark of 300 cases with perturbations including false premises, wording changes, knowledge-only rewrites, and ROI-corrupted images. The evaluation uses seven metrics summarized in the MedVIGIL Composite Score (MCS). An independent radiologist scores 83.3 on MCS with a 5.8% silent failure rate. The strongest model reaches only 69.2, highlighting a significant gap for clinical reliability.

Core claim

MedVIGIL is a 300-case evaluation suite drawn from public medical VQA sources and fully supervised by board-certified radiologists. It contains 2,556 MCQ probes plus open-ended variants, with clinician-authored false-premise traps, ROI boxes, risk tiers, and refusal options. Audits of 16 vision-capable models using the MCS show a maximum score of 69.2, compared to an independent radiologist baseline of 83.3 at 5.8% silent failures.
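
The silent-failure rate at the center of the audit can be made concrete. A minimal sketch, assuming each probe record carries a broken-evidence flag and the model's chosen option letter, with the refusal option taken to be the letter "E" purely for illustration (the paper's actual option layout is not given here):

```python
from dataclasses import dataclass

REFUSAL = "E"  # hypothetical: the clinician-authored refusal option letter

@dataclass
class ProbeResult:
    perturbed: bool  # evidence was broken (false premise, ROI corruption, ...)
    answer: str      # model's chosen option letter

def silent_failure_rate(results):
    """Fraction of broken-evidence probes answered without refusal."""
    broken = [r for r in results if r.perturbed]
    if not broken:
        return 0.0
    silent = sum(1 for r in broken if r.answer != REFUSAL)
    return silent / len(broken)

results = [
    ProbeResult(perturbed=True,  answer="B"),      # silent failure
    ProbeResult(perturbed=True,  answer=REFUSAL),  # correct refusal
    ProbeResult(perturbed=False, answer="A"),      # intact evidence, not counted
]
print(silent_failure_rate(results))  # → 0.5
```

Under this definition, the radiologist's 5.8% corresponds to refusing on roughly 94% of broken-evidence probes.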

What carries the argument

The MedVIGIL Composite Score (MCS), which aggregates seven correctness-conditioned audit metrics to quantify silent failures under four types of broken evidence.
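
The exact aggregation rule is not spelled out on this page (the simulated rebuttal describes an unweighted average of seven normalized metrics, while Figure 3 mentions a harmonic-mean MCS). A hedged sketch of both candidates, with metric names and values invented for illustration only:

```python
from statistics import mean, harmonic_mean

# Seven correctness-conditioned metrics, each normalized to [0, 100].
# Names and values are illustrative stand-ins, not the paper's.
metrics = {
    "capability": 72.0,
    "safety": 78.0,          # e.g. 100 - weighted silent-failure rate
    "grounding": 65.0,
    "consistency": 70.0,
    "refusal_precision": 68.0,
    "risk_weighted_acc": 66.0,
    "counterfactual": 64.0,
}

mcs_mean = mean(metrics.values())
mcs_harmonic = harmonic_mean(metrics.values())  # penalizes any weak component
print(round(mcs_mean, 1), round(mcs_harmonic, 1))
```

The harmonic mean is never above the arithmetic mean, so a model with one weak audit dimension is penalized more under the Figure 3 variant.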

If this is right

  • Models can be systematically audited on the released suite to measure progress in evidence verification.
  • Clinical deployment should incorporate silent-failure thresholds derived from the MCS to limit unsupported answers.
  • The paired open-ended variant allows testing refusal behavior beyond multiple-choice formats.
  • Risk-tier annotations enable focused evaluation on high-stakes medical cases.
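
The deployment implication in the second bullet can be phrased as a simple gate against the radiologist reference. The threshold values here are assumptions for illustration, anchored to the reported human numbers (MCS 83.3, SFR 5.8%):

```python
def clears_deployment_gate(mcs: float, sfr: float,
                           mcs_floor: float = 83.0,
                           sfr_ceiling: float = 0.06) -> bool:
    """Hypothetical gate: require near-radiologist composite score
    and silent-failure rate before clinical use."""
    return mcs >= mcs_floor and sfr <= sfr_ceiling

# Strongest audited model vs. the radiologist baseline (numbers from the page)
print(clears_deployment_gate(69.2, 0.22))   # model: fails the gate
print(clears_deployment_gate(83.3, 0.058))  # radiologist: clears it
```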

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models passing this benchmark may still require additional testing against naturally occurring image corruptions not captured by the four synthetic perturbations.
  • Incorporating MCS-style penalties during training could incentivize VLMs to learn when to withhold answers.
  • Extending the suite to non-radiology imaging domains would check whether the observed gap is modality-specific.
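
The training idea in the second bullet can be sketched as a reward-shaping rule: correct answers on intact evidence score positively, refusals on broken evidence score positively, and fluent non-refusals under broken evidence are penalized hardest. The numeric values are assumptions, not from the paper:

```python
def audit_reward(evidence_broken: bool, refused: bool, correct: bool,
                 silent_failure_penalty: float = 2.0) -> float:
    """Hypothetical MCS-style shaping: penalize silent failures most."""
    if evidence_broken:
        # Any non-refusal here is a silent failure, however fluent.
        return 1.0 if refused else -silent_failure_penalty
    if refused:
        return -0.5  # over-refusal on intact evidence is also undesirable
    return 1.0 if correct else 0.0

print(audit_reward(evidence_broken=True, refused=False, correct=True))  # → -2.0
print(audit_reward(evidence_broken=True, refused=True, correct=False))  # → 1.0
```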

Load-bearing premise

The four chosen perturbation types and the 300 clinician-authored cases sufficiently represent the range of broken visual evidence that occurs in real clinical practice.

What would settle it

Releasing a new model that achieves an MCS of 83 or higher on the full MedVIGIL suite with silent-failure rates near 6% would test whether the performance gap can be closed.

Figures

Figures reproduced from arXiv: 2605.07919 by Hanqi Jiang, Haozhen Gong, Hui Ren, Jinglei Lv, Junhao Chen, Lifeng Chen, Lin Zhao, Quanzheng Li, Ruiyu Yan, Tianming Liu, Weihang You, Xiang Li, Yi Pan.

Figure 1. A vision-required chest X-ray paired with two evidence perturbations (ROI mask, laterality …).
Figure 2. MedVIGIL construction and evaluation pipeline.
Figure 3. Audit summary. Left: MCS component decomposition (Capability / Safety = 100 − SFRw / Grounding / harmonic-mean MCS), ordered by MCS. Right: risk-tier SFR heatmap with Original accuracy (left, blue) and SFRw (right, red), sorted safest-first. Knowledge-only accuracy (LPA 99.6%); aggregate SFR is lowest for Claude Opus 4.7 (22.0%) and Gemini 3.1 Flash-Lite (22.3%) at different overall-accuracy levels. The radi…
Figure 4. Visual information decay on the 300 image-required cases (solid: MCQ accuracy as …).
Figure 5. MCS components in two pairwise projections.
Figure 6. Per-CRT-tier accuracy as a function of blur ….
Figure 7. Visual-token ablation on the example case MVB-0031.
Original abstract

Medical vision–language models (VLMs) are usually evaluated on intact image–question pairs, but trustworthy clinical use requires a stronger property: a model must recognise when the evidential basis for an answer has failed. We study this through silent failures under perturbed evidence, where a vision-required medical question is paired with a false premise, wording perturbation, knowledge-only rewrite, or ROI-corrupted image, yet the model returns a fluent non-refusal answer. We introduce MedVIGIL, a 300-case evaluation suite drawn from four public medical VQA sources, supervised end to end by four board-certified radiologists: every gold answer, refusal option, candidate-answer set, paraphrase, false-premise trap, ROI box, and clinical risk tier is clinician-authored. Two attending radiologists annotate every case in parallel, a senior radiologist consolidates the released manifest, and a separate fourth radiologist independent of construction answers every probe to provide the human reference baseline. The release contains 2,556 MCQ probes, 240 counterfactual triplets, physician-adjudicated risk-tier and answerability flags, ROI boxes, and a paired open-ended variant. We report seven correctness-conditioned audit metrics that summarise into the MedVIGIL Composite Score (MCS), and audit 16 vision-capable models plus two text-only baselines. The independent radiologist scores MCS 83.3 at silent-failure rate 5.8%, leaving a 14.1-point composite headroom above the strongest audited model (Claude Opus 4.7 at 69.2). The benchmark and evaluation harness are publicly released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces MedVIGIL, a 300-case evaluation suite for medical VLMs under broken visual evidence, constructed from four public VQA sources and fully supervised by four board-certified radiologists (parallel attending annotations, senior consolidation, independent fourth-radiologist baseline). Cases incorporate four perturbation types (false premise, wording perturbation, knowledge-only rewrite, and ROI corruption) to induce silent failures in which models give fluent non-refusal answers. The suite includes 2,556 MCQ probes, 240 counterfactual triplets, risk tiers, ROI boxes, and an open-ended variant. Seven correctness-conditioned metrics are aggregated into the MedVIGIL Composite Score (MCS); the independent radiologist baseline achieves MCS 83.3 at a 5.8% silent-failure rate, 14.1 points above the strongest model (Claude Opus 4.7 at 69.2). The benchmark and harness are publicly released.

Significance. If the 300 cases and four perturbation classes adequately proxy real clinical visual-evidence failures, the work is significant for exposing a measurable trustworthiness gap in current medical VLMs and supplying a clinician-authored, publicly released resource for auditing and improving model refusal behavior. The multi-radiologist oversight, independent human baseline, and release of probes plus risk tiers strengthen its utility as a standardized testbed.

major comments (2)
  1. [Abstract] The assumption that the four perturbation types (false premise, wording perturbation, knowledge-only rewrite, and ROI corruption) plus the 300 clinician-authored cases from four VQA sources sufficiently represent the distribution of broken visual evidence in real radiology workflows is load-bearing for interpreting the 14.1-point MCS headroom and 5.8% human silent-failure rate as clinically actionable. No external validation (e.g., mapping to observed PACS artifacts, modality-specific failure statistics, or inter-institutional audits) is provided to confirm coverage or severity calibration.
  2. [Abstract] The aggregation rule that combines the seven correctness-conditioned audit metrics into the single MedVIGIL Composite Score (MCS) is not specified, preventing assessment of whether the reported 83.3 vs. 69.2 gap is robust to alternative weightings or dominated by particular sub-metrics such as silent-failure rate.
minor comments (1)
  1. [Abstract] The notation '2{,}556' is a typographical artifact and should be rendered as '2,556' for readability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments on our manuscript. We provide point-by-point responses to the major comments below and have updated the manuscript to address the concerns raised.

Point-by-point responses
  1. Referee: [Abstract] The assumption that the four perturbation types (false premise, wording perturbation, knowledge-only rewrite, and ROI corruption) plus the 300 clinician-authored cases from four VQA sources sufficiently represent the distribution of broken visual evidence in real radiology workflows is load-bearing for interpreting the 14.1-point MCS headroom and 5.8% human silent-failure rate as clinically actionable. No external validation (e.g., mapping to observed PACS artifacts, modality-specific failure statistics, or inter-institutional audits) is provided to confirm coverage or severity calibration.

    Authors: We agree that external validation against real-world clinical data would be valuable for establishing the benchmark's representativeness. Our work focuses on creating a clinician-supervised benchmark using perturbations derived from existing public VQA datasets to systematically evaluate silent failures in medical VLMs. The four perturbation types were selected based on common issues in medical VQA and radiologist input to simulate plausible broken evidence scenarios. We have revised the manuscript by adding a new 'Limitations and Future Work' paragraph in the Discussion section. This paragraph acknowledges that the 300 cases serve as a proxy rather than a comprehensive sample of all possible clinical failures, explains the design choices, and explicitly recommends future studies involving PACS artifact analysis and inter-institutional validation to calibrate severity. revision: yes

  2. Referee: [Abstract] The aggregation rule that combines the seven correctness-conditioned audit metrics into the single MedVIGIL Composite Score (MCS) is not specified, preventing assessment of whether the reported 83.3 vs. 69.2 gap is robust to alternative weightings or dominated by particular sub-metrics such as silent-failure rate.

    Authors: We appreciate this observation. The aggregation rule for the MCS was not sufficiently detailed. We have revised the abstract to state that the MCS is the unweighted average of the seven normalized metrics (each scaled to [0,100]). We have also added the explicit formula in the Methods section along with a sensitivity analysis to alternative weightings, confirming the robustness of the reported gap. revision: yes
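
The sensitivity analysis the authors describe can be sketched directly: draw random weightings over the seven metrics and check whether the human–model gap stays positive. The two metric profiles below are invented stand-ins; only their unweighted means (83.3 and 69.2) match the reported scores:

```python
import random

random.seed(0)

# Illustrative seven-metric profiles; unweighted means are 83.3 and 69.2.
human = [90.0, 95.0, 80.0, 85.0, 78.0, 82.0, 73.1]
model = [80.0, 70.0, 60.0, 75.0, 65.0, 68.0, 66.4]

def weighted_score(vals, w):
    """Composite under an arbitrary non-negative weighting."""
    return sum(v * wi for v, wi in zip(vals, w)) / sum(w)

gaps = []
for _ in range(1000):
    w = [random.random() for _ in range(7)]
    gaps.append(weighted_score(human, w) - weighted_score(model, w))

print(min(gaps) > 0)  # here the gap survives every sampled weighting
```

With these stand-in profiles the human leads on every sub-metric, so any positive weighting preserves the gap; the real test is whether the paper's measured per-metric values share that property.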

Circularity Check

0 steps flagged

No circularity: new benchmark with independent annotations yields empirical results

Full rationale

The paper constructs MedVIGIL as a fresh 300-case suite from public VQA sources, with all gold answers, refusal options, risk tiers, and perturbations clinician-authored under multi-radiologist supervision. MCS aggregates seven audit metrics computed directly on model outputs for the released probes; the human baseline (MCS 83.3) comes from a fourth independent radiologist answering the same probes. No equations, fitted parameters, or self-citations reduce any reported quantity to a prior result by construction. The derivation chain (case construction → model auditing → metric aggregation → headroom calculation) is self-contained and externally falsifiable via the public release.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that radiologist-authored perturbations and annotations constitute a valid proxy for clinical broken-evidence scenarios. No free parameters are fitted and no new physical entities are postulated.

axioms (1)
  • Domain assumption: Board-certified radiologist annotations provide reliable ground truth for answerability and clinical risk in medical VQA cases.
    All gold answers, refusal options, and risk tiers are stated to be clinician-authored and consolidated by a senior radiologist.

pith-pipeline@v0.9.0 · 5634 in / 1226 out tokens · 44154 ms · 2026-05-11T03:40:45.192215+00:00 · methodology

discussion (0)

