pith. machine review for the scientific record

arxiv: 2605.05810 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 14:43 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical vision language models · negated option attraction · chest x-ray · polarity reversal · benchmark · inference time verification

The pith

Medical vision-language models often choose negated options like 'no consolidation' even when the X-ray shows consolidation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests medical vision-language models on chest X-ray questions asking which finding is present. Models frequently select answers that negate the finding, creating statements that contradict the image. This negated-option attraction occurs on over 62 percent of presence questions in large protocols, with direct accuracy around 30 percent for two tested models. The authors provide a new benchmark with present and absent finding questions from multiple datasets. They also show that a deterministic verifier called QCCV-Neg can correct the errors at inference time, raising accuracy above 95 percent.

Core claim

CXR-ContraBench reveals that negated-option attraction is a substantial and persistent failure mode in medical VLMs, where models select negated answers on presence questions despite visible findings in the image, and QCCV-Neg provides a question-conditioned consistency check that repairs these polarity reversals without any retraining.

What carries the argument

CXR-ContraBench, a diagnostic benchmark using present-finding questions to detect polarity reversals and absent-finding questions to test for wording copying, together with the QCCV-Neg verifier that enforces consistency between the question and the chosen option.
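
The review does not reproduce QCCV-Neg's exact rule set, so the following Python sketch is only a hedged illustration of what a question-conditioned consistency check could look like. The function names and string heuristics are hypothetical, not the paper's.

    # Hypothetical sketch in the spirit of QCCV-Neg; the actual verifier's
    # rules may differ. On a presence question, a negated surface form
    # ("No X") contradicts the question, so the choice is remapped.

    def option_is_negated(option: str) -> bool:
        # Treat options opening with an explicit negation as negated forms.
        return option.strip().lower().startswith(("no ", "absence of "))

    def question_asks_presence(question: str) -> bool:
        # Crude polarity detection; real templates could be matched exactly.
        q = question.lower()
        return "present" in q and "absent" not in q and "not present" not in q

    def repair_choice(question: str, options: dict, chosen: str) -> str:
        # Remap a negated pick on a presence question to an option that
        # names a finding outright; otherwise leave the answer untouched.
        if not (question_asks_presence(question) and option_is_negated(options[chosen])):
            return chosen  # nothing to repair
        for letter, text in options.items():
            if not option_is_negated(text):
                return letter
        return chosen  # no non-negated option exists; leave the answer alone

    opts = {"A": "Consolidation", "B": "No consolidation"}
    print(repair_choice("Which finding is present?", opts, "B"))  # -> "A"

Because the check is a fixed rule over the question text and answer options, it touches no model weights, which fits the paper's claim of repair without retraining.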

If this is right

  • Chain-of-thought prompting reduces some presence-side reversals but may increase absence-side contradictions.
  • The issue appears across internal and external datasets including OpenI and CheXpert.
  • Standard accuracy metrics can conceal clinically risky inference failures in negation handling.
  • Post-hoc verification offers a way to improve reliability without modifying the underlying model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Similar negation issues could affect model performance in other diagnostic imaging tasks beyond chest X-rays.
  • Deploying these models in clinical settings without such checks risks generating reports that directly contradict the imaging evidence.
  • Future training methods might need to explicitly address polarity in visual question answering.

Load-bearing premise

That selecting a negated option on the benchmark questions corresponds to a real clinical risk of polarity reversal in actual medical decision making.

What would settle it

Running the models on a new set of chest X-rays with expert-annotated present findings and checking if the models still select negated options at similar rates.
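
As a minimal scoring sketch for such a probe, assuming each record carries the gold letter, the model's chosen letter, and the letter of the negated option (these field names are illustrative, not the benchmark's schema):

    # Illustrative scoring for a presence probe: accuracy plus the rate at
    # which the negated option was selected. Record fields are hypothetical.

    def score_presence_probe(records: list) -> dict:
        n = correct = negated = 0
        for r in records:
            n += 1
            correct += r["pred"] == r["gold"]
            negated += r["pred"] == r["negated_letter"]
        return {"accuracy": correct / n, "negated_selection_rate": negated / n}

    demo = [
        {"gold": "A", "pred": "B", "negated_letter": "B"},  # polarity reversal
        {"gold": "A", "pred": "A", "negated_letter": "B"},  # correct pick
    ]
    print(score_presence_probe(demo))
    # {'accuracy': 0.5, 'negated_selection_rate': 0.5}

A negated-selection rate that stays near the reported 62 percent on fresh expert-annotated films would confirm the failure mode; a collapse toward zero would instead point at construction artifacts in the original protocols.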

read the original abstract

When a chest X-ray shows consolidation but the question asks which finding is present, a medical vision-language model may answer "No consolidation." This is more than an incorrect choice: it is a polarity reversal that emits a clinical statement contradicting the image. We study this failure as negated-option attraction, where a model is drawn to a negated answer option even when it conflicts with both the visual evidence and the question. We introduce CXR-ContraBench (Chest X-Ray Contradiction Benchmark), a diagnostic benchmark spanning internal ReXVQA slices and external OpenI and CheXpert protocols. The benchmark centers on present-finding questions, where selecting "No X" despite visible X creates the main clinical risk, and uses absent-finding questions as secondary tests of whether models copy negated wording. Across CheXpert protocols, the failure is substantial and persistent. On a strict direct presence probe, MedGemma and Qwen2.5-VL reach only 31.49% and 30.21% accuracy, respectively; on a matched 135,754-record CheXpert training-split protocol, both models select negated options on over 62% of presence questions. Chain-of-thought prompting reduces some presence-side reversals but does not eliminate them and can amplify absence-side contradictions. Finally, QCCV-Neg (Question-Conditioned Consistency Verifier for Negation) deterministically repairs the measured polarity-confused subset without retraining, raising MedGemma and Qwen2.5-VL to 96.60% and 95.32% accuracy on the direct presence probe. These results show that standard accuracy can hide a clinically meaningful inference-time polarity failure. Source code and benchmark construction scripts are available at https://github.com/fangzr/cxr-contrabench-code.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CXR-ContraBench, a diagnostic benchmark for negated-option attraction in medical VLMs on chest X-ray tasks, using internal ReXVQA slices and external OpenI/CheXpert protocols. It reports that MedGemma and Qwen2.5-VL achieve only 31.49% and 30.21% accuracy on strict direct presence probes and select negated options on over 62% of presence questions in a 135,754-record CheXpert protocol. Chain-of-thought prompting offers partial mitigation, while the proposed QCCV-Neg verifier raises accuracy to 96.60% and 95.32% on the direct probe without retraining. Code and benchmark scripts are released.

Significance. If the benchmark protocols are shown to be free of construction artifacts, the work identifies a clinically consequential inference-time failure mode in medical VLMs: polarity reversals that produce image-contradicting statements. The deterministic QCCV-Neg repair is a practical contribution, and the open release of code plus construction scripts on public datasets (OpenI, CheXpert) enables direct reproducibility and extension. This strengthens the empirical case that standard accuracy metrics can mask safety-relevant defects.

major comments (2)
  1. [§3 (Benchmark Construction)] The templated mapping from CheXpert/OpenI labels to present-finding questions does not specify exclusion criteria for uncertain labels, co-occurring findings, or high-confidence positives, nor does it verify that each question stem clinically requires rejecting the negated option rather than permitting a safe 'none of the above' reading. This is load-bearing for the headline claim of 30-31% accuracy and >62% negated selections on 135k records, as labeling noise could inflate the measured failure rates.
  2. [§4 (Experiments and Results)] The paper should report the fraction of questions that underwent manual validation or inter-annotator checks, and provide concrete examples of question templates alongside their CheXpert label sources to allow assessment of whether the constructed probes accurately reflect clinical polarity-reversal risk.
minor comments (2)
  1. [Abstract] The term 'internal ReXVQA slices' is introduced without definition; the main text should clarify its relation to the external protocols and any differences in question generation.
  2. [Tables and Notation] Ensure consistent use of 'presence probe' vs. 'direct presence probe' across tables and text, and add a column or note clarifying the exact number of questions per protocol.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and have revised the manuscript to incorporate additional details on benchmark construction and validation, as these strengthen the work without altering its core claims.

read point-by-point responses
  1. Referee: [§3 (Benchmark Construction)] The templated mapping from CheXpert/OpenI labels to present-finding questions does not specify exclusion criteria for uncertain labels, co-occurring findings, or high-confidence positives, nor does it verify that each question stem clinically requires rejecting the negated option rather than permitting a safe 'none of the above' reading. This is load-bearing for the headline claim of 30-31% accuracy and >62% negated selections on 135k records, as labeling noise could inflate the measured failure rates.

    Authors: We agree that the original §3 provided insufficient detail on filtering and clinical justification, which could raise questions about noise. In the revised manuscript we have expanded §3 with explicit criteria: uncertain labels ('U') are excluded from presence questions, co-occurring findings are handled by generating independent per-finding questions, and only high-confidence positive labels are retained where scores are available. We have also added a verification paragraph confirming that each present-finding stem is constructed so the negated option constitutes a direct clinical contradiction (no 'none of the above' escape hatch in the primary probes). These additions clarify that the reported accuracy and negation-selection rates are not artifacts of ambiguous labeling. revision: yes

  2. Referee: [§4 (Experiments and Results)] The paper should report the fraction of questions that underwent manual validation or inter-annotator checks, and provide concrete examples of question templates alongside their CheXpert label sources to allow assessment of whether the constructed probes accurately reflect clinical polarity-reversal risk.

    Authors: We accept that the original submission omitted these transparency elements. The revised manuscript now includes a table in §3 with multiple concrete examples mapping CheXpert/OpenI labels to full question templates (e.g., 'Consolidation=1' to 'Which finding is present? A) Consolidation B) No consolidation'). We have also added a report in §4 on the manual validation performed during construction: a representative sample of questions was reviewed for polarity correctness, and the fraction validated together with inter-annotator agreement is now stated explicitly. These changes enable readers to directly evaluate clinical fidelity. revision: yes
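
Read together, the two responses imply a construction pipeline along these lines: drop uncertain labels, emit one independent question per confidently positive finding, and pair each finding with its negated surface form. A minimal Python sketch, with illustrative label encoding and column names (the released scripts may differ):

    # Illustrative CheXpert-label-to-question construction following the
    # criteria stated in the responses above: 'U' marks an uncertain label,
    # 1 a positive; only confident positives yield presence questions.

    FINDINGS = ["Consolidation", "Edema", "Pleural Effusion"]

    def presence_questions(row: dict) -> list:
        # row maps finding name -> label; co-occurring findings each get
        # their own independent question.
        questions = []
        for finding in FINDINGS:
            if row.get(finding) != 1:  # excludes 'U', 0, and missing labels
                continue
            questions.append({
                "question": "Which finding is present?",
                "options": {"A": finding, "B": f"No {finding.lower()}"},
                "gold": "A",
                "negated_letter": "B",
            })
        return questions

    row = {"Consolidation": 1, "Edema": "U", "Pleural Effusion": 0}
    for q in presence_questions(row):
        print(q["question"], q["options"])
    # Which finding is present? {'A': 'Consolidation', 'B': 'No consolidation'}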

Circularity Check

0 steps flagged

No significant circularity: direct empirical measurements on external datasets

full rationale

The paper reports accuracy figures and negated-option selection rates obtained by running existing VLMs on templated questions derived from public CheXpert and OpenI label sets. No equations, fitted parameters, or first-principles derivations appear; the only algorithmic component (QCCV-Neg) is a deterministic post-hoc verifier whose logic is stated explicitly and does not take the measured failure rates as inputs. No self-citations are invoked to justify uniqueness or to close a derivation loop. The construction is therefore grounded in external benchmarks and a public code release, satisfying the default non-circularity expectation for benchmark papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmarking and mitigation study with no mathematical derivations, fitted constants, or postulated entities. No free parameters, axioms, or invented entities are required beyond standard dataset usage.

pith-pipeline@v0.9.0 · 5654 in / 1236 out tokens · 55971 ms · 2026-05-08T14:43:31.757302+00:00 · methodology


    Which option is thenegated surface form(“No X”)? Decision rule If the question asks forabsence, choose the find- ing name rather than the negated form. If the question asks forpresence, choose the present findingand avoid the negated form. Output Final answer:[A/B/C/D] Key difference.Both prompts force explicit polarity disambiguation, but polarity_step a...