Are Vision Language Models Ready for Clinical Diagnosis? A 3D Medical Benchmark for Tumor-centric Visual Question Answering
4 Pith papers cite this work.
fields: cs.CV (4)
years: 2026 (4)
citing papers explorer
- Lost in Volume: The CT-SpatialVQA Benchmark for Evaluating Semantic-Spatial Understanding of 3D Medical Vision-Language Models
  CT-SpatialVQA benchmark shows 3D medical VLMs achieve only 34% average accuracy on semantic-spatial reasoning tasks in CT volumes, often below random chance.
- CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
  Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
- RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology
  RadThinking releases a large longitudinal CT VQA dataset stratified into foundation perception questions, single-rule reasoning questions, and compositional multi-step chains grounded in clinical reporting standards for cancer screening.
- Beyond Masks: The Case for Medical Image Parsing
  Medical image parsing is proposed as the central output for the field instead of masks, with an audit showing that none of eleven representative systems produces a well-formed parse containing attributes, relationships, and closure.