Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering

Yixiong Chen, Wenjie Xiao, Pedro RAS Bassi, Xinze Zhou, Sezgin Er, Ibrahim Ethem Hamamci, Zongwei Zhou, Alan Yuille · 2025 · arXiv 2505.18915

4 Pith papers cite this work. Polarity classification is still being indexed.


citation-role summary

background: 2 · dataset: 1

citation-polarity summary

background: 2 · use dataset: 1

representative citing papers

Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models

cs.CV · 2026-05-09 · unverdicted · novelty 7.0

CT-SpatialVQA benchmark shows 3D medical VLMs achieve only 34% average accuracy on semantic-spatial reasoning tasks in CT volumes, often below random chance.

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

cs.CV · 2026-05-07 · conditional · novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

cs.CV · 2026-05-11 · unverdicted · novelty 6.0

RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards for cancer screening.

Beyond Masks: The Case for Medical Image Parsing

cs.CV · 2026-05-12 · unverdicted · novelty 5.0

Medical image parsing is proposed as the central output for the field instead of masks, with an audit showing that none of eleven representative systems produces a well-formed parse containing attributes, relationships, and closure.

citing papers explorer

Showing 4 of 4 citing papers.

Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models

cs.CV · 2026-05-09 · unverdicted · none · ref 5

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

cs.CV · 2026-05-07 · conditional · none · ref 5

RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

cs.CV · 2026-05-11 · unverdicted · none · ref 24

Beyond Masks: The Case for Medical Image Parsing

cs.CV · 2026-05-12 · unverdicted · none · ref 10
