arxiv: 2605.20525 · v1 · pith:D5MBQX4Jnew · submitted 2026-05-19 · cs.CV · cs.AI · cs.CL · cs.LG · eess.IV

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

Pith reviewed 2026-05-21 06:50 UTC · model grok-4.3

classification: cs.CV, cs.AI, cs.CL, cs.LG, eess.IV
keywords: NeuroQA, 3D brain MRI, visual question answering, medical VQA, vision-language models, Alzheimer's disease, Parkinson's disease, benchmark dataset

The pith

AI models lag behind text-only baselines on a new 3D brain MRI question benchmark

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces NeuroQA, a benchmark of 56,953 QA pairs drawn from 3D brain MRI scans of 12,977 subjects across 12 datasets. It covers five clinical domains and uses 203 templates, split into image-grounded and image-informed types, to generate questions that test 11 reasoning skills. A 38-rule deterministic pipeline plus expert review verifies every pair against measurements and reports, with no same-subject contradictions. In baseline tests, the best zero-shot vision-language model reaches 47.5 percent accuracy on closed-format items, below the 49.4 percent achieved by a text-only majority-template baseline. The release, which includes public QA pairs, generation scripts, and an online leaderboard, aims to measure genuine visual understanding in 3D medical data.

Core claim

NeuroQA supplies 56,953 verified QA pairs from full 3D brain MRI volumes of 12,977 subjects spanning ages 5-104 and five clinical domains. It employs 131 image-grounded templates answerable from a 3-plane viewer and 72 image-informed templates based on volumetry or clinical instruments, all checked by a 38-rule pipeline and two rounds of expert review to ensure zero same-subject contradictions. On closed-format test items the leading zero-shot VLM reaches 47.5 percent accuracy while a supervised 3D CNN baseline reaches 43.7 percent, both below the 49.4 percent text-only majority-template floor.
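
To make the "text-only majority-template floor" concrete, the following is a minimal sketch of such a baseline. It assumes the floor is obtained by predicting, for every test question, the most frequent training answer for that question's template, without looking at the image. The field names (`template_id`, `answer`) and the toy records are hypothetical and are not taken from the NeuroQA release.

```python
from collections import Counter, defaultdict


def fit_majority_by_template(train_items):
    """Map each template id to its most frequent training answer."""
    counts = defaultdict(Counter)
    for item in train_items:
        counts[item["template_id"]][item["answer"]] += 1
    return {tid: c.most_common(1)[0][0] for tid, c in counts.items()}


def majority_template_accuracy(majority, test_items, fallback=None):
    """Accuracy of always answering with the per-template majority answer."""
    if not test_items:
        return 0.0
    correct = sum(
        majority.get(item["template_id"], fallback) == item["answer"]
        for item in test_items
    )
    return correct / len(test_items)


# Toy data (hypothetical): two templates with skewed training answers.
train = [
    {"template_id": "T1", "answer": "yes"},
    {"template_id": "T1", "answer": "yes"},
    {"template_id": "T1", "answer": "no"},
    {"template_id": "T2", "answer": "B"},
    {"template_id": "T2", "answer": "B"},
    {"template_id": "T2", "answer": "A"},
]
test = [
    {"template_id": "T1", "answer": "yes"},
    {"template_id": "T1", "answer": "no"},
    {"template_id": "T2", "answer": "B"},
    {"template_id": "T2", "answer": "C"},
]

majority = fit_majority_by_template(train)
floor = majority_template_accuracy(majority, test)
print(majority, floor)
```

Any model that scores below this number on the same items is doing no better than a lookup table keyed on the question template, which is why the comparison matters for the claim above.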

What carries the argument

The 38-rule deterministic pipeline, combined with answer-distribution refinement and a separate image-grounding protocol, which are jointly meant to force questions to require the MRI volume while preserving clinical validity.

If this is right

  • The benchmark enables systematic testing of 11 reasoning skills in Yes/No, multiple-choice, and open formats using full 3D volumes rather than 2D slices.
  • It supports model development across Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment with subject-level splits to avoid leakage.
  • Public QA pairs for open datasets and reproducible scripts for restricted ones allow broad use while respecting data agreements.
  • Clinician evaluation of 100 frozen test items on a three-plane viewer confirms alignment with real diagnostic practice.
  • The held-out private test set and online leaderboard provide a stable way to track progress on image-grounded medical VQA.
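
The subject-level splits mentioned in the list above can be illustrated with a short sketch. It assumes a deterministic hash of the subject identifier decides the split, so that every QA pair from one subject lands in the same partition. The split fractions, the `seed` value, and the field name `subject_id` are illustrative assumptions, not the paper's actual procedure.

```python
import hashlib


def assign_split(subject_id, val_frac=0.1, test_frac=0.1, seed=42):
    """Deterministically map a subject id to 'train', 'val', or 'test'."""
    digest = hashlib.sha256(f"{seed}:{subject_id}".encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # roughly uniform in [0, 1)
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


def split_by_subject(qa_items):
    """Group QA items by the split assigned to their subject."""
    splits = {"train": [], "val": [], "test": []}
    for item in qa_items:
        splits[assign_split(item["subject_id"])].append(item)
    return splits


def subjects_overlap(splits):
    """Return True if any subject appears in more than one split."""
    seen = {}
    for name, items in splits.items():
        for item in items:
            if seen.setdefault(item["subject_id"], name) != name:
                return True
    return False


# Toy data (hypothetical): 200 QA items from 50 subjects.
items = [
    {"subject_id": f"sub-{i % 50:03d}", "question": f"q{i}"} for i in range(200)
]
splits = split_by_subject(items)
print({name: len(part) for name, part in splits.items()})
print("leakage:", subjects_overlap(splits))
```

Splitting on the subject rather than on the QA pair is what prevents a model from seeing one question about a scan in training and a closely related question about the same scan at test time.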

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Models that improve on NeuroQA could support more reliable automated review of brain scans in radiology workflows.
  • The use of quantitative volumetry for ground truth offers a template for building similar benchmarks in other 3D medical imaging modalities.
  • Persistent gaps may point to the need for architectural changes in VLMs to better process true volumetric data instead of slice projections.
  • The verification approach could be adapted to create image-grounded QA resources for additional clinical prediction tasks.

Load-bearing premise

The answer-distribution refinement and image-grounding protocol successfully remove text-only shortcuts while preserving clinical validity without introducing selection biases that affect model rankings.

What would settle it

Finding that model accuracy on closed-format items stays the same or rises when the actual MRI volumes are replaced by blank or noise images would show the questions do not require visual input.
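
The test described above can be sketched as a small ablation harness. The sketch assumes a model is exposed as a callable `predict(question, volume)` and that a volume is a nested list of voxel intensities; both are hypothetical interfaces chosen to keep the example dependency-free, and the two toy models stand in for real VLMs.

```python
def accuracy(predict, items, transform=lambda volume: volume):
    """Fraction of items answered correctly after transforming each volume."""
    correct = sum(
        predict(item["question"], transform(item["volume"])) == item["answer"]
        for item in items
    )
    return correct / len(items)


def blank_like(volume):
    """Replace every voxel with zero while keeping the 3D shape."""
    return [[[0.0 for _ in row] for row in plane] for plane in volume]


def image_ablation_gap(predict, items):
    """Return (accuracy on real volumes, accuracy on blanks, difference)."""
    real = accuracy(predict, items)
    blank = accuracy(predict, items, transform=blank_like)
    return real, blank, real - blank


def mean_voxel(volume):
    voxels = [v for plane in volume for row in plane for v in row]
    return sum(voxels) / len(voxels)


def text_only_model(question, volume):
    """Toy model that ignores the image entirely."""
    return "yes"


def image_model(question, volume):
    """Toy model whose answer depends on voxel intensities."""
    return "yes" if mean_voxel(volume) > 0.5 else "no"


# Toy items (hypothetical): two 2x2x2 volumes with different intensities.
bright = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
dim = [[[0.2, 0.2], [0.2, 0.2]], [[0.2, 0.2], [0.2, 0.2]]]
items = [
    {"question": "Is the signal high?", "volume": bright, "answer": "yes"},
    {"question": "Is the signal high?", "volume": dim, "answer": "no"},
]

print(image_ablation_gap(text_only_model, items))
print(image_ablation_gap(image_model, items))
```

A gap near zero, as for the toy text-only model, is the signature the paragraph above describes: the score does not depend on the visual input.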

Figures

Figures reproduced from arXiv: 2605.20525 by Adam Turnbull, Bailey Trang, Ehsan Adeli (Stanford University), Favour Nerrise, Gustavo Chau Loo Kung, Ken Chang, Kyan Younes, Merryn Daniel, Mohammad Asadi, Mohammad H. Abbasi, Pavan Pinkesh Shah, Ridvan Yesiloglu, Seena Dehkharghani, Shaurnav Ghosh, Yuncong Mao.

Figure 1. Sample NeuroQA items across six categories, each paired with axial, sagittal, and coronal slices of the 3D volume; T2 is shown alongside T1 where available, and longitudinal items show both prior and current scans. Questions span structural assessment (Anatomy), hemispheric comparison (Location), signal characterization (Signal), temporal change detection (Longitudinal), clinical classification (Diagnosis)…
Figure 2. NeuroQA construction pipeline. Five deterministic stages transform raw neuroimaging data from 12 datasets into 56,953 validated QA pairs through template-based generation, expert review, machine-verified audits, and shortcut elimination. No LLM is used at any stage, and the full pipeline is reproducible under seed=42. Stages 2–3. Expert review. Two rounds of review produce rules R1 through R14. The first r…
Figure 3. Closed-ended VLM accuracy on test-public (…
Figure 4. Image-grounding protocol. A VLM is treated as image-grounded only when all three stress…
Figure 5.
Figure 6. Phase 1 expert review interface. Three domain experts independently reviewed 24 stratified…
Figure 7. Phase 2 clinician review interface with 3-plane NIfTI brain viewer. Clinicians view axial,…
Figure 8. NeuroQA clinician evaluation interface. The 3-plane NIfTI viewer displays brain MRI in axial, sagittal, and coronal planes with a slice navigator. Clinicians answer each question independently. For multi-modality questions, both T1 and T2 scans are shown. For longitudinal questions, current and prior scans are displayed with dates. Where raters diverge most. Location (27.3%/54.5%, 36% agreement): the large…
Figure 9. Open-ended scorer validation summary. Left: F1 vs. rater-consensus calibration as a…
Original abstract

We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from >80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.
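
The "zero same-subject contradictions across templates" property can be pictured as a consistency audit over the generated pairs. The sketch below is one possible form of such a check and is not the paper's 38-rule pipeline, whose rules are not described on this page. It assumes each QA item carries a hypothetical `fact_key` naming the underlying finding and a `fact_value` giving the value the answer implies.

```python
from collections import defaultdict


def find_same_subject_contradictions(qa_items):
    """Return (subject_id, fact_key, values) for facts with conflicting values."""
    seen = defaultdict(set)
    for item in qa_items:
        seen[(item["subject_id"], item["fact_key"])].add(item["fact_value"])
    return sorted(
        (subject, key, sorted(values))
        for (subject, key), values in seen.items()
        if len(values) > 1
    )


# Toy data (hypothetical): sub-02 is asserted both ways by two templates.
qa_items = [
    {"subject_id": "sub-01", "template_id": "T1",
     "fact_key": "hippocampal_atrophy", "fact_value": "present"},
    {"subject_id": "sub-01", "template_id": "T7",
     "fact_key": "hippocampal_atrophy", "fact_value": "present"},
    {"subject_id": "sub-02", "template_id": "T3",
     "fact_key": "ventricles_enlarged", "fact_value": "yes"},
    {"subject_id": "sub-02", "template_id": "T9",
     "fact_key": "ventricles_enlarged", "fact_value": "no"},
]

contradictions = find_same_subject_contradictions(qa_items)
print(contradictions)
```

An empty result over the full set of pairs would correspond to the zero-contradiction property the abstract reports.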

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper presents NeuroQA, a large-scale benchmark for 3D brain MRI visual question answering comprising 56,953 QA pairs from 12,977 subjects across 12 datasets and five clinical domains (Alzheimer's, Parkinson's, tumors, white matter disease, neurodevelopment). It defines 203 templates (131 image-grounded via 3-plane viewer, 72 image-informed via volumetry or clinical scores), generated through a 38-rule deterministic pipeline with two rounds of expert review that guarantees zero same-subject contradictions. Answer-distribution refinement reduces closed-format text-only accuracy from >80% to 44.6%, and baselines show the best zero-shot VLM at 47.5% and a supervised 3D CNN at 43.7%, both below the 49.4% text-only majority-template floor. The work includes a separate image-grounding protocol, clinician evaluation of 100 items, subject-level splits, a held-out private test set, and a two-tier release (public QA for open datasets, reproducible scripts for DUA-restricted data).

Significance. If the construction and validation hold, NeuroQA supplies a valuable, clinically grounded resource that moves beyond 2D-slice or narrow-label medical VQA by pairing every question with full 3D volumes. The 38-rule pipeline, two rounds of expert review, zero same-subject contradictions, and independent clinician assessment on a three-plane viewer provide concrete, reproducible support for data quality. The public/private release strategy, subject-level splits, and online leaderboard further enhance utility and reproducibility for the community. The reported gap between VLMs/CNNs and the majority baseline, if free of refinement artifacts, would usefully quantify current limitations in image-grounded 3D reasoning.

major comments (1)
  1. [Abstract] Abstract and methods description of answer-distribution refinement: the process that reduces closed-format text-only accuracy from >80% to 44.6% and yields the 47.5% VLM vs. 49.4% majority-template comparison is presented at a high level without explicit selection/reweighting criteria or verification that clinical validity and image necessity are preserved. Because this step is load-bearing for the central claim that current models fall short specifically on image-grounded 3D understanding (rather than an artifact of balancing), additional detail or pseudocode would be required to rule out selection biases that could affect VLM vs. text-only rankings.
minor comments (1)
  1. [Abstract] The abstract states that an image-grounding protocol is released with the benchmark; a brief pointer to its location or a one-sentence summary of its procedure in the main text would improve immediate clarity for readers.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of NeuroQA and the constructive feedback on clarifying the answer-distribution refinement. We address the single major comment below and will incorporate the requested details in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract] Abstract and methods description of answer-distribution refinement: the process that reduces closed-format text-only accuracy from >80% to 44.6% and yields the 47.5% VLM vs. 49.4% majority-template comparison is presented at a high level without explicit selection/reweighting criteria or verification that clinical validity and image necessity are preserved. Because this step is load-bearing for the central claim that current models fall short specifically on image-grounded 3D understanding (rather than an artifact of balancing), additional detail or pseudocode would be required to rule out selection biases that could affect VLM vs. text-only rankings.

    Authors: We agree that the abstract presents the refinement at a high level and that explicit criteria are needed to support the central claim. In the revised manuscript we will expand Section 3.3 with the precise selection and reweighting rules, including pseudocode for the iterative per-template distribution adjustment. The added description will document that adjustments are performed only after the two rounds of expert review, that clinical validity is preserved by retaining only pairs that remain consistent with FreeSurfer measurements and radiology reports, and that image necessity is independently verified by the released image-grounding protocol (which flags items answerable without the volume). We will also include a short verification table showing that post-refinement text-only accuracy drops uniformly across domains without altering the relative difficulty ordering between image-grounded and image-informed templates. These additions will confirm that the observed gap (47.5% VLM vs. 49.4% majority baseline) is not an artifact of the balancing procedure.

    revision: yes
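
Since the exchange above turns on what a per-template distribution adjustment might look like, here is a deliberately simple illustration. It is not the authors' refinement rule, which this page does not specify. It assumes the crudest possible variant: within each template, every answer class is randomly downsampled to the size of the rarest class. The field names and the `seed` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict


def balance_template_answers(items, seed=0):
    """Downsample each template's answer classes to the rarest class size."""
    rng = random.Random(seed)
    by_template = defaultdict(lambda: defaultdict(list))
    for item in items:
        by_template[item["template_id"]][item["answer"]].append(item)
    kept = []
    for answers in by_template.values():
        n = min(len(group) for group in answers.values())
        for group in answers.values():
            kept.extend(rng.sample(group, n))
    return kept


def majority_share(items):
    """In-sample accuracy of always guessing each template's top answer."""
    counts = defaultdict(Counter)
    for item in items:
        counts[item["template_id"]][item["answer"]] += 1
    return sum(max(c.values()) for c in counts.values()) / len(items)


# Toy data (hypothetical): one template answered "yes" 80% of the time.
items = (
    [{"template_id": "T1", "answer": "yes", "id": i} for i in range(8)]
    + [{"template_id": "T1", "answer": "no", "id": 8 + i} for i in range(2)]
)

before = majority_share(items)
kept = balance_template_answers(items)
after = majority_share(kept)
print(before, after, len(kept))
```

The toy run also shows the concern the referee raises: balancing discards items (here 6 of 10), so which items survive is a selection step that could in principle shift model rankings. A template with a single answer class is left untouched by this variant and would still be fully guessable from text.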

Circularity Check

0 steps flagged

NeuroQA benchmark construction shows no circularity

The paper presents an empirical benchmark construction using a deterministic 38-rule pipeline, expert review, and answer-distribution refinement to generate and validate 56,953 QA pairs from 3D MRI volumes. No mathematical derivations, equations, or fitted parameters are claimed as predictions; the reported accuracies (e.g., VLM at 47.5% below 49.4% majority floor) are direct empirical measurements on the constructed test set. The refinement step is a data-processing choice to reduce text-only shortcuts, not a self-definitional loop or self-citation that bears the central claim. The work is self-contained as a dataset release with verifiable generation scripts and splits, independent of any prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the accuracy of upstream FreeSurfer volumetry and radiology report fields plus standard assumptions about MRI data quality; no new free parameters or invented entities are introduced to support the central claim.

axioms (1)
  • domain assumption: FreeSurfer measurements and radiology report fields provide reliable ground truth for QA verification.
    Invoked in the description of the 38-rule pipeline and expert review process.

pith-pipeline@v0.9.0 · 5957 in / 1306 out tokens · 44467 ms · 2026-05-21T06:50:51.893213+00:00 · methodology



Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · 11 internal anchors

  1. [1]

    Prediction of ad with mri-based hippocampal volume in mild cognitive impairment.Neurology, 52(7):1397–1397, 1999

    Clifford R Jack Jr, Ronald C Petersen, Yue Cheng Xu, Peter C O’Brien, Glenn E Smith, Robert J Ivnik, Bradley F Boeve, Stephen C Waring, Eric G Tangalos, and Emre Kokmen. Prediction of ad with mri-based hippocampal volume in mild cognitive impairment.Neurology, 52(7):1397–1397, 1999

  2. [2]

    Brain age and other bodily ‘ages’: implications for neuropsychiatry.Molecular psychiatry, 24(2):266–281, 2019

    James H Cole, Riccardo E Marioni, Sarah E Harris, and Ian J Deary. Brain age and other bodily ‘ages’: implications for neuropsychiatry.Molecular psychiatry, 24(2):266–281, 2019

  3. [3]

    White matter hyperintensities and imaging patterns of brain ageing in the general population.Brain, 139(4):1164–1179, 2016

    Mohamad Habes, Guray Erus, Jon B Toledo, Tianhao Zhang, Nick Bryan, Lenore J Launer, Yves Rosseel, Deborah Janowitz, Jimit Doshi, Sandra Van der Auwera, et al. White matter hyperintensities and imaging patterns of brain ageing in the general population.Brain, 139(4):1164–1179, 2016

  4. [4]

    Longitudinal brain volume changes in major depressive disorder

    Dilara Yüksel, Jennifer Engelen, Verena Schuster, Bruno Dietsche, Carsten Konrad, Andreas Jansen, Udo Dannlowski, Tilo Kircher, and Axel Krug. Longitudinal brain volume changes in major depressive disorder. Journal of Neural Transmission, 125(10):1433–1447, 2018

  5. [5]

    Mapping cortical brain asymmetry in 17,141 healthy individuals worldwide via the enigma consortium

    Xiang-Zhen Kong, Samuel R Mathias, Tulio Guadalupe, ENIGMA Laterality Working Group, David C Glahn, Barbara Franke, Fabrice Crivello, Nathalie Tzourio-Mazoyer, Simon E Fisher, Paul M Thompson, et al. Mapping cortical brain asymmetry in 17,141 healthy individuals worldwide via the enigma consortium. Proceedings of the National Academy of Sciences, 115(22):...

  6. [6]

    Foundation models for generalist medical artificial intelligence.Nature, 616 (7956):259–265, 2023

    Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artificial intelligence.Nature, 616 (7956):259–265, 2023. 10

  7. [7]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023

  8. [8]

    Gemini: A Family of Highly Capable Multimodal Models

    Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models.arXiv preprint arXiv:2312.11805, 2023

  9. [9]

    A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1):180251, 2018

    Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1):180251, 2018

  10. [10]

    PathVQA: 30000+ Questions for Medical Visual Question Answering

    Xuehai He, Yichen Zhang, Luntian Mou, Eric Xing, and Pengtao Xie. Pathvqa: 30000+ questions for medical visual question answering.arXiv preprint arXiv:2003.10286, 2020

  11. [11]

    Rexvqa: A large-scale visual question answering benchmark for generalist chest x-ray understanding

    Ankit Pal, Jung-Oh Lee, Xiaoman Zhang, Malaikannan Sankarasubbu, Seunghyeon Roh, Won Jung Kim, Meesun Lee, and Pranav Rajpurkar. Rexvqa: A large-scale visual question answering benchmark for generalist chest x-ray understanding. InBiocomputing 2026: Proceedings of the Pacific Symposium, pages 251–264. World Scientific, 2025

  12. [12]

    arXiv preprint arXiv:2603.21687 , year=

    Mohammad Asadi, Jack W O’Sullivan, Fang Cao, Tahoura Nedaee, Kamyar Fardi, Fei-Fei Li, Ehsan Adeli, and Euan Ashley. Mirage the illusion of visual understanding.arXiv preprint arXiv:2603.21687, 2026

  13. [13]

    Don’t just assume; look and answer: Overcoming priors for visual question answering

    Aishwarya Agrawal, Dhruv Batra, Devi Parikh, and Aniruddha Kembhavi. Don’t just assume; look and answer: Overcoming priors for visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4971–4980, 2018

  14. [14]

    A multimodal llm approach for visual question answering on multiparametric 3d brain mri.arXiv preprint arXiv:2509.25889, 2025

    Arvind Murari Vepa, Yannan Yu, Jingru Gan, Anthony Cuturrufo, Weikai Li, Wei Wang, Fabien Scalzo, and Yizhou Sun. A multimodal llm approach for visual question answering on multiparametric 3d brain mri.arXiv preprint arXiv:2509.25889, 2025

  15. [15]

    Enhancing vision-language models for medical imaging: bridging the 3d gap with innovative slice selection.Advances in Neural Information Processing Systems, 37:99947–99964, 2024

    Yuli Wang, Jian Peng, Yuwei Dai, Craig Jones, Haris Sair, Jinglai Shen, Nicolas Loizou, Jing Wu, Wen-Chi Hsu, Maliha Imami, et al. Enhancing vision-language models for medical imaging: bridging the 3d gap with innovative slice selection.Advances in Neural Information Processing Systems, 37:99947–99964, 2024

  16. [16]

    Omnibrainbench: A comprehensive multimodal benchmark for brain imaging analysis across multi-stage clinical tasks.arXiv preprint arXiv:2511.00846, 2025

    Zhihao Peng, Cheng Wang, Shengyuan Liu, Zhiying Liang, Zanting Ye, Minjie Ju, PeterYM Woo, and Yixuan Yuan. Omnibrainbench: A comprehensive multimodal benchmark for brain imaging analysis across multi-stage clinical tasks.arXiv preprint arXiv:2511.00846, 2025

  17. [17]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

  18. [18]

    An automated labeling system for subdividing the human cerebral cortex on mri scans into gyral based regions of interest.Neuroimage, 31(3):968–980, 2006

    Rahul S Desikan, Florent Ségonne, Bruce Fischl, Brian T Quinn, Bradford C Dickerson, Deborah Blacker, Randy L Buckner, Anders M Dale, R Paul Maguire, Bradley T Hyman, et al. An automated labeling system for subdividing the human cerebral cortex on mri scans into gyral based regions of interest.Neuroimage, 31(3):968–980, 2006

  19. [19]

    Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering

    Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Yang, and Xiao-Ming Wu. Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In2021 IEEE 18th international symposium on biomedical imaging (ISBI), pages 1650–1654. IEEE, 2021

  20. [20]

    Mimic-ext-mimic-cxr-vqa: a complex, diverse, and large-scale visual question answering dataset for chest x-ray images.PhysioNet, 2024

    Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei JI, Eric Chang, Tackeun Kim, et al. Mimic-ext-mimic-cxr-vqa: a complex, diverse, and large-scale visual question answering dataset for chest x-ray images.PhysioNet, 2024

  21. [21]

    Vqa-med: Overview of the medical visual question answering task at imageclef 2019

    Asma Ben Abacha, Sadid A Hasan, Vivek V Datla, Dina Demner-Fushman, and Henning Müller. Vqa-med: Overview of the medical visual question answering task at imageclef 2019. InProceedings of CLEF (Conference and Labs of the Evaluation Forum) 2019 Working Notes. 9-12 September 2019, 2019

  22. [22]

    Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain

    Asma Ben Abacha, Mourad Sarrouti, Dina Demner-Fushman, Sadid A Hasan, and Henning Müller. Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain. InProceedings of the CLEF 2021 Conference and Labs of the Evaluation Forum-working notes. 21-24 September 2021, 2021

  23. [23]

    Microvqa++: High-quality microscopy reasoning dataset with weakly supervised graphs for multimodal large language model.arXiv preprint arXiv:2511.11407, 2025

    Manyu Li, Ruian He, Chenxi Ma, Weimin Tan, and Bo Yan. Microvqa++: High-quality microscopy reasoning dataset with weakly supervised graphs for multimodal large language model.arXiv preprint arXiv:2511.11407, 2025. 11

  24. [24]

    PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

    Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc- vqa: Visual instruction tuning for medical visual question answering.arXiv preprint arXiv:2305.10415, 2023

  25. [25]

    Gemex: A large-scale, groundable, and explainable medical vqa benchmark for chest x-ray diagnosis

    Bo Liu, Ke Zou, Li-Ming Zhan, Zexin Lu, Xiaoyu Dong, Yidi Chen, Chengqiang Xie, Jiannong Cao, Xiao- Ming Wu, and Huazhu Fu. Gemex: A large-scale, groundable, and explainable medical vqa benchmark for chest x-ray diagnosis. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 21310–21320, 2025

  26. [26]

    MedXpertQA: Benchmarking Expert-Level Medical Reasoning and Understanding

    Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. Medxpertqa: Benchmarking expert-level medical reasoning and understanding.arXiv preprint arXiv:2501.18362, 2025

  27. [27]

    Radqa: A question answering dataset to improve comprehension of radiology reports

    Sarvesh Soni, Meghana Gudala, Atieh Pajouhi, and Kirk Roberts. Radqa: A question answering dataset to improve comprehension of radiology reports. InProceedings of the thirteenth language resources and evaluation conference, pages 6250–6259, 2022

  28. [28]

    Medicaldiff-vqa: a large-scale medical dataset for difference visual question answering on chest x-ray images.PhysioNet, 12:13, 2023

    Xinyue Hu, Lin Gu, Qiyuan An, Mengliang Zhang, Liangchen Liu, Kazuma Kobayashi, Tatsuya Harada, R Summers, and Yingying Zhu. Medicaldiff-vqa: a large-scale medical dataset for difference visual question answering on chest x-ray images.PhysioNet, 12:13, 2023

  29. [29]

    Lumen: Longitudinal multi-modal radiology model for prognosis and diagnosis.arXiv preprint arXiv:2602.21142, 2026

    Zhifan Jiang, Dong Yang, Vishwesh Nath, Abhijeet Parida, Nishad P Kulkarni, Ziyue Xu, Daguang Xu, Syed Muhammad Anwar, Holger R Roth, and Marius George Linguraru. Lumen: Longitudinal multi-modal radiology model for prognosis and diagnosis.arXiv preprint arXiv:2602.21142, 2026

  30. [30]

    Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

    Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

  31. [31]

    Visual hallucinations of multi-modal large language models

    Wen Huang, Hongbin Liu, Minxin Guo, and Neil Gong. Visual hallucinations of multi-modal large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 9614–9631, 2024

  32. [32]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

  33. [33]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  34. [34]

    Large language models encode clinical knowledge

    Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

  35. [35]

    Mm-neuroonco: A multimodal benchmark and instruction dataset for mri-based brain tumor diagnosis.arXiv preprint arXiv:2602.22955, 2026

    Feng Guo, Jiaxiang Liu, Yang Li, Qianqian Shi, and Mingkun Xu. Mm-neuroonco: A multimodal benchmark and instruction dataset for mri-based brain tumor diagnosis.arXiv preprint arXiv:2602.22955, 2026

  36. [36]

    3d-rad: A comprehensive 3d radiology med-vqa dataset with multi-temporal analysis and diverse diagnostic tasks.arXiv preprint arXiv:2506.11147, 2025

    Xiaotang Gai, Jiaxiang Liu, Yichen Li, Zijie Meng, Jian Wu, and Zuozhu Liu. 3d-rad: A comprehensive 3d radiology med-vqa dataset with multi-temporal analysis and diverse diagnostic tasks.arXiv preprint arXiv:2506.11147, 2025

  37. [37]

    The claude 3 model family: Opus, sonnet, haiku.Technical Report, 2024

    Anthropic. The claude 3 model family: Opus, sonnet, haiku.Technical Report, 2024

  38. [38]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  39. [39]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Nau- mann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

  40. [40]

    MedGemma Technical Report

    Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201, 2025. 12

  41. [41]

    Towards generalist biomedical ai.Nejm Ai, 1(3): AIoa2300138, 2024

    Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai.Nejm Ai, 1(3): AIoa2300138, 2024

  42. [42]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

  43. [43]

    BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

    Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs.arXiv preprint arXiv:2303.00915, 2023

  44. [44]

    Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452, 2023

    Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. Can generalist foundation models outcompete special-purpose tuning? Case study in medicine. arXiv preprint arXiv:2311.16452, 2023

  45. [45]

    Brain-JEPA: Brain dynamics foundation model with gradient positioning and spatiotemporal masking. Advances in Neural Information Processing Systems, 37:86048–86073, 2024

    Zijian Dong, Ruilin Li, Yilei Wu, Thuan T Nguyen, Joanna S Chong, Fang Ji, Nathanael R Tong, Christopher L Chen, and Juan H Zhou. Brain-JEPA: Brain dynamics foundation model with gradient positioning and spatiotemporal masking. Advances in Neural Information Processing Systems, 37:86048–86073, 2024

  46. [46]

    BrainCLIP: Bridging brain and visual-linguistic representation via CLIP for generic natural visual stimulus decoding. arXiv preprint arXiv:2302.12971, 2023

    Yulong Liu, Yongqiang Ma, Wei Zhou, Guibo Zhu, and Nanning Zheng. BrainCLIP: Bridging brain and visual-linguistic representation via CLIP for generic natural visual stimulus decoding. arXiv preprint arXiv:2302.12971, 2023

  47. [47]

    GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models

    Favour Nerrise, Lucy Yin, Mohammad H. Abbasi, Kilian M. Pohl, and Ehsan Adeli. GeoSAE: Geometric prior-guided layer-wise sparse autoencoder annotation of brain MRI foundation models, 2026. URL https://arxiv.org/abs/2605.01829

  48. [48]

    Unbiased nonlinear average age-appropriate brain templates from birth to adulthood. NeuroImage, 47:S102, 2009

    Vladimir S Fonov, Alan C Evans, Robert C McKinstry, C Robert Almli, and DL Collins. Unbiased nonlinear average age-appropriate brain templates from birth to adulthood. NeuroImage, 47:S102, 2009

  49. [49]

    FreeSurfer. NeuroImage, 62(2):774–781, 2012

    Bruce Fischl. FreeSurfer. NeuroImage, 62(2):774–781, 2012

  50. [50]

    AutoRG-Brain: Grounded report generation for brain MRI. arXiv preprint arXiv:2407.16684, 2024

    Jiayu Lei, Xiaoman Zhang, Chaoyi Wu, Lisong Dai, Ya Zhang, Yanyong Zhang, Yanfeng Wang, Weidi Xie, and Yuehua Li. AutoRG-Brain: Grounded report generation for brain MRI. arXiv preprint arXiv:2407.16684, 2024

  51. [51]

    The Clinical Dementia Rating (CDR): current version and scoring rules. Neurology, 43(11):2412–2412, 1993

    John C Morris. The Clinical Dementia Rating (CDR): current version and scoring rules. Neurology, 43(11):2412–2412, 1993

  52. [52]

    Christopher G Goetz, Barbara C Tilley, Stephanie R Shaftman, Glenn T Stebbins, Stanley Fahn, Pablo Martinez-Martin, Werner Poewe, Cristina Sampaio, Matthew B Stern, Richard Dodel, et al. Movement Disorder Society-sponsored revision of the Unified Parkinson’s Disease Rating Scale (MDS-UPDRS): scale presentation and clinimetric testing results. Movement diso...

  53. [53]

    Data-driven discovery of movement-linked heterogeneity in neurodegenerative diseases. Nature Machine Intelligence, 6(9):1034–1045, 2024

    Mark Endo, Favour Nerrise, Qingyu Zhao, Edith V Sullivan, Li Fei-Fei, Victor W Henderson, Kilian M Pohl, Kathleen L Poston, and Ehsan Adeli. Data-driven discovery of movement-linked heterogeneity in neurodegenerative diseases. Nature Machine Intelligence, 6(9):1034–1045, 2024

  54. [54]

    Alzheimer’s Disease Neuroimaging Initiative (ADNI) clinical characterization. Neurology, 74(3):201–209, 2010

    Ronald Carl Petersen, Paul S Aisen, Laurel A Beckett, Michael C Donohue, Anthony Collins Gamst, Danielle J Harvey, Clifford R Jack Jr, William J Jagust, Leslie M Shaw, Arthur W Toga, et al. Alzheimer’s Disease Neuroimaging Initiative (ADNI) clinical characterization. Neurology, 74(3):201–209, 2010

  55. [55]

    Kathryn A Ellis, Ashley I Bush, David Darby, Daniela De Fazio, Jonathan Foster, Peter Hudson, Nicola T Lautenschlager, Nat Lenzo, Ralph N Martins, Paul Maruff, et al. The australian imaging, biomarkers and lifestyle (aibl) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of alzheimer’s disease...

  56. [56]

    The Parkinson Progression Marker Initiative (PPMI). Progress in Neurobiology, 95(4):629–635, 2011

    Kenneth Marek, Danna Jennings, Shirley Lasch, Andrew Siderowf, Caroline Tanner, Tanya Simuni, Chris Coffey, Karl Kieburtz, Emily Flagg, Sohini Chowdhury, et al. The Parkinson Progression Marker Initiative (PPMI). Progress in Neurobiology, 95(4):629–635, 2011

  57. [57]

    The RSNA-ASNR-MICCAI BraTS 2021 Benchmark on Brain Tumor Segmentation and Radiogenomic Classification

    Ujjwal Baid, Satyam Ghodasara, Suyash Mohan, Michel Bilello, Evan Calabrese, Errol Colak, Keyvan Farahani, Jayashree Kalpathy-Cramer, Felipe C Kitamura, Sarthak Pati, et al. The RSNA-ASNR-MICCAI BraTS 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314, 2021

  58. [58]

    The 2024 Brain Tumor Segmentation Challenge meningioma radiotherapy (BraTS-MEN-RT) dataset. Scientific Data, 2026

    Dominic LaBella, Katherine Schumacher, Michael Mix, Kevin Leu, Shan McBurney-Lin, Pierre Nedelec, Javier Villanueva-Meyer, David R Raleigh, Jonathan Shapey, Tom Vercauteren, et al. The 2024 Brain Tumor Segmentation Challenge meningioma radiotherapy (BraTS-MEN-RT) dataset. Scientific Data, 2026

  59. [59]

    Standardized assessment of automatic segmentation of white matter hyperintensities and results of the wmh segmentation challenge

    Hugo J Kuijf, J Matthijs Biesbroek, Jeroen De Bresser, Rutger Heinen, Simon Andermatt, Mariana Bento, Matt Berseth, Mikhail Belyaev, M Jorge Cardoso, Adria Casamitjana, et al. Standardized assessment of automatic segmentation of white matter hyperintensities and results of the wmh segmentation challenge. IEEE transactions on medical imaging, 38(11):2556–2...

  60. [60]

    An open, multi-vendor, multi-field-strength brain MR dataset and analysis of publicly available skull stripping methods agreement. NeuroImage, 170:482–494, 2018

    Roberto Souza, Oeslle Lucena, Julia Garrafa, David Gobbi, Marina Saluzzi, Simone Appenzeller, Letícia Rittner, Richard Frayne, and Roberto Lotufo. An open, multi-vendor, multi-field-strength brain MR dataset and analysis of publicly available skull stripping methods agreement. NeuroImage, 170:482–494, 2018

  61. [61]

    The WU-Minn Human Connectome Project: an overview. NeuroImage, 80:62–79, 2013

    David C Van Essen, Stephen M Smith, Deanna M Barch, Timothy EJ Behrens, Essa Yacoub, Kamil Ugurbil, WU-Minn HCP Consortium, et al. The WU-Minn Human Connectome Project: an overview. NeuroImage, 80:62–79, 2013

  62. [62]

    The Adolescent Brain Cognitive Development (ABCD) study: imaging acquisition across 21 sites. Developmental Cognitive Neuroscience, 32:43–54, 2018

    Betty Jo Casey, Tariq Cannonier, May I Conley, Alexandra O Cohen, Deanna M Barch, Mary M Heitzeg, Mary E Soules, Theresa Teslovich, Danielle V Dellarco, Hugh Garavan, et al. The Adolescent Brain Cognitive Development (ABCD) study: imaging acquisition across 21 sites. Developmental Cognitive Neuroscience, 32:43–54, 2018

  63. [63]

    Ixi dataset-information extraction from images project (epsrc gr/s21533/02)[internet]

    D Hill, S Williams, D Hawkes, and SM Smith. Ixi dataset-information extraction from images project (epsrc gr/s21533/02)[internet]. 2006 [cited 2013 may 7], 2006

  64. [64]

    sMRI Processing Pipeline: A lightweight, end-to-end workflow for structural brain MRI preprocessing and quality control

    Mohammad H. Abbasi and Ehsan Adeli. sMRI Processing Pipeline: A lightweight, end-to-end workflow for structural brain MRI preprocessing and quality control, 2025. URL https://doi.org/10.5281/zenodo.17503175. Zenodo, doi: 10.5281/zenodo.17503175

  65. [65]

    The Montreal Cognitive Assessment, MoCA: a brief screening tool for mild cognitive impairment. Journal of the American Geriatrics Society, 53(4):695–699, 2005

    Ziad S Nasreddine, Natalie A Phillips, Valérie Bédirian, Simon Charbonneau, Victor Whitehead, Isabelle Collin, Jeffrey L Cummings, and Howard Chertkow. The Montreal Cognitive Assessment, MoCA: a brief screening tool for mild cognitive impairment. Journal of the American Geriatrics Society, 53(4):695–699, 2005

  66. [66]

    Mini-mental state. Journal of Psychiatric Research, 12(3):189–198, 1975

    Marshal F Folstein, Susan E Folstein, and Paul R McHugh. Mini-mental state. Journal of Psychiatric Research, 12(3):189–198, 1975

  67. [67]

    Noelle E Carlozzi, David S Tulsky, Robert V Kail, and Jennifer L Beaumont. VI. NIH Toolbox Cognition Battery (CB): measuring processing speed. Monographs of the Society for Research in Child Development, 78(4):88–102, 2013

  68. [68]

    Manual for the ASEBA school-age forms and profiles. University of Vermont Research Center for Children, Youth, and Families, 2001

    Thomas M Achenbach. Manual for the ASEBA school-age forms and profiles. University of Vermont Research Center for Children, Youth, and Families, 2001

  69. [69]

    Parkinsonism: onset, progression, and mortality. Neurology, 17(5):427–427, 1967

    Margaret M Hoehn and Melvin D Yahr. Parkinsonism: onset, progression, and mortality. Neurology, 17(5):427–427, 1967

  70. [70]

    BERTScore: Evaluating Text Generation with BERT

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019

  71. [71]

    A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960

    Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960

  72. [72]

    MedBookVQA: A systematic and comprehensive medical benchmark derived from open-access book. arXiv preprint arXiv:2506.00855, 2025

    Sau Lai Yip, Sunan He, Yuxiang Nie, Shu Pui Chan, Yilin Ye, Sum Ying Lam, and Hao Chen. MedBookVQA: A systematic and comprehensive medical benchmark derived from open-access book. arXiv preprint arXiv:2506.00855, 2025

  73. [73]

    M3D: Advancing 3D medical image analysis with multi-modal large language models

    Fan Bai, Yuxin Du, Tiejun Huang, Max Q-H Meng, and Bo Zhao. M3D: Advancing 3D medical image analysis with multi-modal large language models. arXiv preprint arXiv:2404.00578, 2024

  74. [74]

    MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

  75. [75]

    The measurement of observer agreement for categorical data

    J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977

    Contents of Appendix/Supplementary Material: A Limitations and Future Work; B Complete Quality Rule List; C Dataset Composition and Splits; D Leaderboard and Public Release; E Expert Review Details; E.1 First Round: Te...

  76. [76]

    Laterality removal (BraTS-GLI/MEN, 786 QA removed). Expert review flagged laterality labels as unreliable. Systematic verification of 220 unilateral BraTS-GLI subjects showed only 49% agreement between report-stated and voxel-based laterality, effectively random, because BraTS reports and NIfTI images may originate from different processing stages or conve...

  77. [77]

    Cerebellopontine angle fix (BraTS-MEN, 28 QA corrected). Fifteen BraTS-MEN subjects with cerebellopontine angle (CPA) lesions were incorrectly labeled as “cerebellar” in Location questions. CPA lesions are extra-axial posterior fossa masses, anatomically distinct from cerebellar parenchymal lesions. Corrected to “posterior fossa.”

    Table 6: Per-dataset c...

  78. [78]

    Subtle asymmetry removal (1,524 QA removed). 57% of laterality questions had less than 10% volume asymmetry between left and right structures, a difference imperceptible on visual inspection. These were removed across 7 datasets; questions with clearly visible asymmetry (>10% difference) were retained.

  79. [79]

    Signal wording standardization (654 QA reworded). The neuroradiologist recommended standard clinical phrasing: “What is the T1 signal intensity of the lesion?” was revised to “What is the signal intensity of the lesion on T1-weighted imaging?” No answers changed; only question text was reworded to match radiology reporting conventions.

    Figure 6: Phase 1...

  80. [80]

    Anatomy questions. Both reviewers rated Anatomy questions as correct and highly relevant. The neuroradiologist noted that axial views are atypical for hippocampal assessment; however, this applies only to the 2D survey visualization. NeuroQA provides full 3D volumetric input to models, including all orientations.
