NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

Adam Turnbull; Bailey Trang; Ehsan Adeli (Stanford University); Favour Nerrise; Gustavo Chau Loo Kung; Ken Chang; Kyan Younes; Merryn Daniel; Mohammad Asadi; Mohammad H. Abbasi

arXiv: 2605.20525 · v1 · pith:D5MBQX4Jnew · submitted 2026-05-19 · cs.CV · cs.AI · cs.CL · cs.LG · eess.IV

Mohammad H. Abbasi, Favour Nerrise, Shaurnav Ghosh, Ridvan Yesiloglu, Yuncong Mao, Bailey Trang, Mohammad Asadi, Merryn Daniel, Gustavo Chau Loo Kung, Ken Chang, Pavan Pinkesh Shah, Adam Turnbull, Kyan Younes, Seena Dehkharghani, Ehsan Adeli (Stanford University)

Pith reviewed 2026-05-21 06:50 UTC · model grok-4.3

keywords: NeuroQA · 3D brain MRI · visual question answering · medical VQA · vision-language models · Alzheimer's disease · Parkinson's disease · benchmark dataset

The pith

AI models lag behind text-only baselines on a new 3D brain MRI question benchmark

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces NeuroQA, a benchmark of 56,953 QA pairs drawn from 3D brain MRI scans of 12,977 subjects across 12 datasets. It covers five clinical domains and uses 203 templates, split into image-grounded and image-informed types, to create questions that test 11 reasoning skills. A 38-rule deterministic pipeline plus two rounds of expert review verify every pair against measurements, metadata, or report fields, with zero same-subject contradictions. In baseline tests, the best zero-shot vision-language model reaches 47.5 percent accuracy on closed-format items, below the 49.4 percent floor set by a text-only majority-template baseline. The release, which combines public QA pairs, generation scripts, and an online leaderboard, aims to measure genuine visual understanding in 3D medical data.

Core claim

NeuroQA supplies 56,953 verified QA pairs from full 3D brain MRI volumes of 12,977 subjects spanning ages 5-104 and five clinical domains. It employs 131 image-grounded templates answerable from a 3-plane viewer and 72 image-informed templates whose ground truth comes from volumetry or clinical instruments, all checked by a 38-rule pipeline and two rounds of expert review to ensure zero same-subject contradictions. On closed-format test items the leading zero-shot VLM reaches 47.5 percent accuracy and a supervised 3D CNN baseline reaches 43.7 percent, both below the 49.4 percent text-only majority-template floor.
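To make the "text-only majority-template floor" concrete, here is a minimal sketch of how such a baseline could be computed. The exact definition used in the paper is not given on this page, so this is an assumption: each test item is answered with the most frequent training answer for its template, with a global fallback for unseen templates. The field names `template_id` and `answer` are hypothetical.

```python
from collections import Counter, defaultdict


def majority_template_accuracy(train_items, test_items):
    """Accuracy of a baseline that never looks at the image.

    Assumed definition (not taken from the paper): predict, for every test
    item, the most common training answer of the item's template.
    """
    per_template = defaultdict(Counter)
    for item in train_items:
        per_template[item["template_id"]][item["answer"]] += 1

    # Fallback for templates that never occur in the training items.
    overall = Counter(item["answer"] for item in train_items)
    fallback = overall.most_common(1)[0][0] if overall else None

    if not test_items:
        return 0.0

    correct = 0
    for item in test_items:
        counts = per_template.get(item["template_id"])
        guess = counts.most_common(1)[0][0] if counts else fallback
        correct += guess == item["answer"]
    return correct / len(test_items)
```

A model that scores below this number on closed-format items is doing no better than a lookup table keyed on the question template, which is why the comparison matters for the core claim.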

What carries the argument

The 38-rule deterministic pipeline, the answer-distribution refinement, and a separate image-grounding protocol together aim to force questions to require the MRI volume while preserving clinical validity.
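The "zero same-subject contradictions" property can be pictured as a simple consistency audit. The sketch below is illustrative only and is not one of the paper's 38 rules: it assumes each QA pair can be reduced to a hypothetical tuple `(subject_id, fact_key, value, template_id)` and flags any subject whose templates assert two different values for the same fact.

```python
from collections import defaultdict


def find_contradictions(assertions):
    """Return {(subject_id, fact_key): {value: [template_ids]}} for conflicts.

    `assertions` is an iterable of (subject_id, fact_key, value, template_id)
    tuples. This tuple layout is an assumption made for illustration.
    """
    seen = defaultdict(dict)
    for subject_id, fact_key, value, template_id in assertions:
        seen[(subject_id, fact_key)].setdefault(value, []).append(template_id)
    # A fact is contradictory when more than one distinct value was asserted.
    return {key: values for key, values in seen.items() if len(values) > 1}
```

An empty result over the full set of pairs would correspond to the zero-contradiction claim, under the assumption that every template maps cleanly onto such a fact key.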

If this is right

The benchmark enables systematic testing of 11 reasoning skills in Yes/No, multiple-choice, and open formats using full 3D volumes rather than 2D slices.
It supports model development across Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment with subject-level splits to avoid leakage.
Public QA pairs for open datasets and reproducible scripts for restricted ones allow broad use while respecting data agreements.
Clinician evaluation of 100 frozen test items on a three-plane viewer confirms alignment with real diagnostic practice.
The held-out private test set and online leaderboard provide a stable way to track progress on image-grounded medical VQA.
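The subject-level splits mentioned above can be sketched as a deterministic assignment keyed on the subject rather than on the QA pair, so every question about one person lands in the same split. This is a minimal sketch, not the paper's procedure: the split fractions, the `subject_id` field, and the hashing scheme are assumptions, and the default seed merely echoes the seed=42 mentioned in the Figure 2 caption.

```python
import hashlib

# Hypothetical split fractions; the paper's actual proportions are not stated here.
DEFAULT_FRACTIONS = (("train", 0.8), ("val", 0.1), ("test", 0.1))


def assign_split(subject_id, seed=42, fractions=DEFAULT_FRACTIONS):
    """Map a subject to a split name, identically on every run."""
    digest = hashlib.sha256(f"{seed}:{subject_id}".encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 0x100000000  # uniform-looking value in [0, 1)
    cumulative = 0.0
    for name, fraction in fractions:
        cumulative += fraction
        if u < cumulative:
            return name
    return fractions[-1][0]  # guards against floating-point shortfall


def split_items(items, seed=42):
    """Group QA items by the split of their subject."""
    splits = {}
    for item in items:
        name = assign_split(item["subject_id"], seed=seed)
        splits.setdefault(name, []).append(item)
    return splits
```

Because the split depends only on the subject identifier and the seed, adding new QA pairs for an existing subject cannot move that subject across splits, which is the leakage the bullet above is concerned with.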

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models that improve on NeuroQA could support more reliable automated review of brain scans in radiology workflows.
The use of quantitative volumetry for ground truth offers a template for building similar benchmarks in other 3D medical imaging modalities.
Persistent gaps may point to the need for architectural changes in VLMs to better process true volumetric data instead of slice projections.
The verification approach could be adapted to create image-grounded QA resources for additional clinical prediction tasks.

Load-bearing premise

The answer-distribution refinement and image-grounding protocol successfully remove text-only shortcuts while preserving clinical validity without introducing selection biases that affect model rankings.

What would settle it

Finding that model accuracy on closed-format items stays the same or rises when the actual MRI volumes are replaced by blank or noise images would show the questions do not require visual input.
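The test described above can be sketched as a small ablation harness. It is a toy under stated assumptions: `model` is a hypothetical callable taking `(volume, question)` and returning an answer string, volumes are plain nested lists rather than real NIfTI arrays, and this is not the image-grounding protocol released with the benchmark.

```python
import random


def fill_like(volume, fill):
    """Build a nested list with the same shape as `volume`, calling fill() per voxel."""
    if isinstance(volume, list):
        return [fill_like(v, fill) for v in volume]
    return fill()


def ablation_report(model, items, seed=0):
    """Compare accuracy on real volumes against blank and noise replacements."""
    if not items:
        raise ValueError("need at least one item")
    rng = random.Random(seed)
    conditions = {
        "real": lambda v: v,
        "blank": lambda v: fill_like(v, lambda: 0.0),
        "noise": lambda v: fill_like(v, rng.random),
    }
    report = {}
    for name, transform in conditions.items():
        correct = sum(
            model(transform(it["volume"]), it["question"]) == it["answer"]
            for it in items
        )
        report[name] = correct / len(items)
    # Accuracy that does not drop without the image points to text-only shortcuts.
    report["image_dependent"] = report["real"] > max(report["blank"], report["noise"])
    return report
```

On a small sample the noise condition can match real-image accuracy by chance, so a meaningful run would need enough items for the gap to be interpretable.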

Figures

Figures reproduced from arXiv: 2605.20525 by Adam Turnbull, Bailey Trang, Ehsan Adeli (Stanford University), Favour Nerrise, Gustavo Chau Loo Kung, Ken Chang, Kyan Younes, Merryn Daniel, Mohammad Asadi, Mohammad H. Abbasi, Pavan Pinkesh Shah, Ridvan Yesiloglu, Seena Dehkharghani, Shaurnav Ghosh, Yuncong Mao.

**Figure 1.** Sample NeuroQA items across six categories, each paired with axial, sagittal, and coronal slices of the 3D volume; T2 is shown alongside T1 where available, and longitudinal items show both prior and current scans. Questions span structural assessment (Anatomy), hemispheric comparison (Location), signal characterization (Signal), temporal change detection (Longitudinal), clinical classification (Diagnosis)…

**Figure 2.** NeuroQA construction pipeline. Five deterministic stages transform raw neuroimaging data from 12 datasets into 56,953 validated QA pairs through template-based generation, expert review, machine-verified audits, and shortcut elimination. No LLM is used at any stage, and the full pipeline is reproducible under seed=42. Stages 2–3. Expert review. Two rounds of review produce rules R1 through R14. The first r…

**Figure 3.** Closed-ended VLM accuracy on test-public (…

**Figure 4.** Image-grounding protocol. A VLM is treated as image-grounded only when all three stress…

**Figure 5.**

**Figure 6.** Phase 1 expert review interface. Three domain experts independently reviewed 24 stratified…

**Figure 7.** Phase 2 clinician review interface with 3-plane NIfTI brain viewer. Clinicians view axial,…

**Figure 8.** NeuroQA clinician evaluation interface. The 3-plane NIfTI viewer displays brain MRI in axial, sagittal, and coronal planes with a slice navigator. Clinicians answer each question independently. For multi-modality questions, both T1 and T2 scans are shown. For longitudinal questions, current and prior scans are displayed with dates. Where raters diverge most. Location (27.3%/54.5%, 36% agreement): the large…

**Figure 9.** Open-ended scorer validation summary. Left: F1 vs. rater-consensus calibration as a…

Original abstract

We present NeuroQA, a large-scale benchmark for visual question answering in 3D brain magnetic resonance imaging (MRI), with 56,953 QA pairs from 12,977 subjects across 12 datasets. It spans ages 5-104 and five clinical domains: Alzheimer's, Parkinson's, tumors, white matter disease, and neurodevelopment. Unlike prior medical Visual Question Answering (VQA) efforts that operate on 2D slices or rely on narrow diagnostic labels, NeuroQA pairs every item with a full 3D volume. It evaluates 11 clinically grounded reasoning skills across Yes/No, multiple-choice, and open-ended formats. Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed (ground truth from quantitative volumetry or clinical instruments). To remove text-only shortcuts, we apply answer-distribution refinement, reducing closed-format text-only accuracy from >80% to 44.6%; image necessity is assessed separately through an image-grounding protocol released with the benchmark. A 38-rule deterministic pipeline and two rounds of expert review verify every QA pair against FreeSurfer measurements, metadata, or radiology report fields, with zero same-subject contradictions across templates. We conduct a clinician evaluation in which two clinicians independently assess 100 frozen test items on a three-plane viewer. On closed-format (Yes/No + multiple-choice) test-public items, the best zero-shot vision-language model and a supervised 3D CNN baseline reach 47.5% and 43.7% accuracy respectively, both below the 49.4% text-only majority-template floor. NeuroQA adopts a two-tier release with public QA pairs for open-access datasets and reproducible generation scripts for datasets restricted by data use agreements (DUAs), plus subject-level splits, a held-out private test set, and an online leaderboard.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

NeuroQA is a carefully built 3D brain MRI VQA benchmark with strong verification steps, but the answer-distribution refinement needs more detail to confirm the reported model gap is not an artifact.

NeuroQA assembles nearly 57,000 QA pairs from full 3D brain MRI volumes of nearly 13,000 subjects across 12 datasets. It covers five clinical areas and a wide age range, with 203 templates split into image-grounded and image-informed questions. The main value is the verification work: a 38-rule pipeline and two rounds of expert review that yield zero same-subject contradictions, plus clinician checks on a three-plane viewer. The authors also release the image-grounding protocol and use subject-level splits plus a held-out test set. That level of care stands out for a benchmark paper and makes the resource more usable than many earlier medical VQA sets that stayed with 2D slices or narrow labels. The release plan for both open and DUA-restricted data is practical too.

The soft spot sits in the answer-distribution refinement. The paper shows text-only accuracy dropping from over 80% to 44.6% after this step, with the best zero-shot VLM at 47.5% and a 3D CNN at 43.7%, both under the 49.4% majority floor. The description stays high-level, so it is not obvious whether the balancing kept or removed items in a way that favors text-only guessing. If it did, the claim that current models lack visual reasoning could be weaker than it looks. More explicit checks on selection bias and the exact image-necessity protocol would tighten this.

This work is aimed at groups building or evaluating medical vision-language models for neurological tasks. Readers who need a reproducible, clinically grounded test set larger than prior efforts will get direct use from it. The construction is substantial enough to merit a serious referee, even if the performance comparison section needs expansion.

Referee Report

1 major / 1 minor

Summary. The paper presents NeuroQA, a large-scale benchmark for 3D brain MRI visual question answering comprising 56,953 QA pairs from 12,977 subjects across 12 datasets and five clinical domains (Alzheimer's, Parkinson's, tumors, white matter disease, neurodevelopment). It defines 203 templates (131 image-grounded via 3-plane viewer, 72 image-informed via volumetry or clinical scores); the resulting pairs are verified by a 38-rule deterministic pipeline and two rounds of expert review, with zero same-subject contradictions reported. Answer-distribution refinement reduces closed-format text-only accuracy from >80% to 44.6%, and baselines show the best zero-shot VLM at 47.5% and a supervised 3D CNN at 43.7%, both below the 49.4% text-only majority-template floor. The work includes a separate image-grounding protocol, clinician evaluation of 100 items, subject-level splits, a held-out private test set, and a two-tier release (public QA for open datasets, reproducible scripts for DUA-restricted data).

Significance. If the construction and validation hold, NeuroQA supplies a valuable, clinically grounded resource that moves beyond 2D-slice or narrow-label medical VQA by pairing every question with full 3D volumes. The 38-rule pipeline, two rounds of expert review, zero same-subject contradictions, and independent clinician assessment on a three-plane viewer provide concrete, reproducible support for data quality. The public/private release strategy, subject-level splits, and online leaderboard further enhance utility and reproducibility for the community. The reported gap between VLMs/CNNs and the majority baseline, if free of refinement artifacts, would usefully quantify current limitations in image-grounded 3D reasoning.

major comments (1)

[Abstract] Abstract and methods description of answer-distribution refinement: the process that reduces closed-format text-only accuracy from >80% to 44.6% and yields the 47.5% VLM vs. 49.4% majority-template comparison is presented at a high level without explicit selection/reweighting criteria or verification that clinical validity and image necessity are preserved. Because this step is load-bearing for the central claim that current models fall short specifically on image-grounded 3D understanding (rather than an artifact of balancing), additional detail or pseudocode would be required to rule out selection biases that could affect VLM vs. text-only rankings.

minor comments (1)

[Abstract] The abstract states that an image-grounding protocol is released with the benchmark; a brief pointer to its location or a one-sentence summary of its procedure in the main text would improve immediate clarity for readers.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of NeuroQA and the constructive feedback on clarifying the answer-distribution refinement. We address the single major comment below and will incorporate the requested details in the revised manuscript.

Referee: [Abstract] Abstract and methods description of answer-distribution refinement: the process that reduces closed-format text-only accuracy from >80% to 44.6% and yields the 47.5% VLM vs. 49.4% majority-template comparison is presented at a high level without explicit selection/reweighting criteria or verification that clinical validity and image necessity are preserved. Because this step is load-bearing for the central claim that current models fall short specifically on image-grounded 3D understanding (rather than an artifact of balancing), additional detail or pseudocode would be required to rule out selection biases that could affect VLM vs. text-only rankings.

Authors: We agree that the abstract presents the refinement at a high level and that explicit criteria are needed to support the central claim. In the revised manuscript we will expand Section 3.3 with the precise selection and reweighting rules, including pseudocode for the iterative per-template distribution adjustment. The added description will document that adjustments are performed only after the two rounds of expert review, that clinical validity is preserved by retaining only pairs that remain consistent with FreeSurfer measurements and radiology reports, and that image necessity is independently verified by the released image-grounding protocol (which flags items answerable without the volume). We will also include a short verification table showing that post-refinement text-only accuracy drops uniformly across domains without altering the relative difficulty ordering between image-grounded and image-informed templates. These additions will confirm that the observed gap (47.5% VLM vs. 49.4% majority baseline) is not an artifact of the balancing procedure.

Revision: yes
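For readers trying to picture what an "iterative per-template distribution adjustment" might look like, the sketch below shows one possible form. It is emphatically not the authors' procedure, which this page does not describe: it assumes refinement works by downsampling, and the `max_share` cap, the `answer` field, and the function name are all hypothetical. A reweighting scheme would behave differently.

```python
import random
from collections import defaultdict


def rebalance_template(items, max_share=0.5, seed=42):
    """Drop items until no single answer exceeds max_share of one template.

    Hypothetical downsampling scheme for illustration only. `items` are the
    QA pairs of a single template; the input list is not modified.
    """
    groups = defaultdict(list)
    for item in items:
        groups[item["answer"]].append(item)
    if len(groups) < 2:
        return list(items)  # a single-answer template cannot be balanced here
    if max_share < 1 / len(groups):
        raise ValueError("max_share is unreachable for this number of answers")

    rng = random.Random(seed)
    for group in groups.values():
        rng.shuffle(group)  # so that pop() removes a seeded-random item

    while True:
        total = sum(len(group) for group in groups.values())
        largest = max(groups.values(), key=len)
        if len(largest) / total <= max_share:
            break
        largest.pop()
    return [item for group in groups.values() for item in group]
```

The sketch also makes the referee's concern tangible: which items get dropped is a selection step, so any correlation between the dropped items and image difficulty could shift the comparison between models and the text-only floor.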

Circularity Check

0 steps flagged

NeuroQA benchmark construction shows no circularity

The paper presents an empirical benchmark construction using a deterministic 38-rule pipeline, expert review, and answer-distribution refinement to generate and validate 56,953 QA pairs from 3D MRI volumes. No mathematical derivations, equations, or fitted parameters are claimed as predictions; the reported accuracies (e.g., VLM at 47.5% below 49.4% majority floor) are direct empirical measurements on the constructed test set. The refinement step is a data-processing choice to reduce text-only shortcuts, not a self-definitional loop or self-citation that bears the central claim. The work is self-contained as a dataset release with verifiable generation scripts and splits, independent of any prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the accuracy of upstream FreeSurfer volumetry and radiology report fields plus standard assumptions about MRI data quality; no new free parameters or invented entities are introduced to support the central claim.

axioms (1)

Domain assumption: FreeSurfer measurements and radiology report fields provide reliable ground truth for QA verification. Invoked in the description of the 38-rule pipeline and expert review process.

pith-pipeline@v0.9.0 · 5957 in / 1306 out tokens · 44467 ms · 2026-05-21T06:50:51.893213+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "We apply answer-distribution refinement, reducing closed-format text-only accuracy from >80% to 44.6%; image necessity is assessed separately through an image-grounding protocol"

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
Paper passage: "Of the 203 templates, 131 are image-grounded (answerable from a 3-plane viewer) and 72 are image-informed"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[41]

Towards generalist biomedical ai.Nejm Ai, 1(3): AIoa2300138, 2024

Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai.Nejm Ai, 1(3): AIoa2300138, 2024

work page 2024
[42]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution.arXiv preprint arXiv:2409.12191, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[43]

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs.arXiv preprint arXiv:2303.00915, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[44]

Can generalist foundation models outcompete special-purpose tuning? case study in medicine.arXiv preprint arXiv:2311.16452,

Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. Can generalist foundation models outcompete special- purpose tuning? case study in medicine.arXiv preprint arXiv:2311.16452, 2023

work page arXiv 2023
[45]

Brain-jepa: Brain dynamics foundation model with gradient positioning and spatiotemporal masking.Advances in Neural Information Processing Systems, 37:86048– 86073, 2024

Zijian Dong, Ruilin Li, Yilei Wu, Thuan T Nguyen, Joanna S Chong, Fang Ji, Nathanael R Tong, Christopher L Chen, and Juan H Zhou. Brain-jepa: Brain dynamics foundation model with gradient positioning and spatiotemporal masking.Advances in Neural Information Processing Systems, 37:86048– 86073, 2024

work page 2024
[46]

Brainclip: Bridging brain and visual-linguistic representation via clip for generic natural visual stimulus decoding.arXiv preprint arXiv:2302.12971, 2023

Yulong Liu, Yongqiang Ma, Wei Zhou, Guibo Zhu, and Nanning Zheng. Brainclip: Bridging brain and visual-linguistic representation via clip for generic natural visual stimulus decoding.arXiv preprint arXiv:2302.12971, 2023

work page arXiv 2023
[47]

GeoSAE: Geometric Prior-Guided Layer-Wise Sparse Autoencoder Annotation of Brain MRI Foundation Models

Favour Nerrise, Lucy Yin, Mohammad H. Abbasi, Kilian M. Pohl, and Ehsan Adeli. Geosae: Geometric prior-guided layer-wise sparse autoencoder annotation of brain mri foundation models, 2026. URL https://arxiv.org/abs/2605.01829

work page internal anchor Pith review Pith/arXiv arXiv 2026
[48]

Unbiased nonlinear average age-appropriate brain templates from birth to adulthood.NeuroImage, 47:S102, 2009

Vladimir S Fonov, Alan C Evans, Robert C McKinstry, C Robert Almli, and DL Collins. Unbiased nonlinear average age-appropriate brain templates from birth to adulthood.NeuroImage, 47:S102, 2009

work page 2009
[49]

Freesurfer.Neuroimage, 62(2):774–781, 2012

Bruce Fischl. Freesurfer.Neuroimage, 62(2):774–781, 2012

work page 2012
[50]

Autorg-brain: Grounded report generation for brain mri.arXiv preprint arXiv:2407.16684, 2024

Jiayu Lei, Xiaoman Zhang, Chaoyi Wu, Lisong Dai, Ya Zhang, Yanyong Zhang, Yanfeng Wang, Weidi Xie, and Yuehua Li. Autorg-brain: Grounded report generation for brain mri.arXiv preprint arXiv:2407.16684, 2024

work page arXiv 2024
[51]

The clinical dementia rating (cdr) current version and scoring rules.Neurology, 43(11): 2412–2412, 1993

John C Morris. The clinical dementia rating (cdr) current version and scoring rules.Neurology, 43(11): 2412–2412, 1993

work page 1993
[52]

Christopher G Goetz, Barbara C Tilley, Stephanie R Shaftman, Glenn T Stebbins, Stanley Fahn, Pablo Martinez-Martin, Werner Poewe, Cristina Sampaio, Matthew B Stern, Richard Dodel, et al. Movement disorder society-sponsored revision of the unified parkinson’s disease rating scale (mds-updrs): scale presentation and clinimetric testing results.Movement diso...

work page 2008
[53]

Data-driven discovery of movement-linked heterogeneity in neurodegenerative diseases.Nature machine intelligence, 6(9):1034–1045, 2024

Mark Endo, Favour Nerrise, Qingyu Zhao, Edith V Sullivan, Li Fei-Fei, Victor W Henderson, Kilian M Pohl, Kathleen L Poston, and Ehsan Adeli. Data-driven discovery of movement-linked heterogeneity in neurodegenerative diseases.Nature machine intelligence, 6(9):1034–1045, 2024

work page 2024
[54]

Alzheimer’s disease neuroimaging initiative (adni) clinical characterization.Neurology, 74(3):201–209, 2010

Ronald Carl Petersen, Paul S Aisen, Laurel A Beckett, Michael C Donohue, Anthony Collins Gamst, Danielle J Harvey, Clifford R Jack Jr, William J Jagust, Leslie M Shaw, Arthur W Toga, et al. Alzheimer’s disease neuroimaging initiative (adni) clinical characterization.Neurology, 74(3):201–209, 2010

work page 2010
[55]

Kathryn A Ellis, Ashley I Bush, David Darby, Daniela De Fazio, Jonathan Foster, Peter Hudson, Nicola T Lautenschlager, Nat Lenzo, Ralph N Martins, Paul Maruff, et al. The australian imaging, biomarkers and lifestyle (aibl) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of alzheimer’s disease...

work page 2009
[56]

The parkinson progression marker initiative (ppmi).Progress in neurobiology, 95(4):629–635, 2011

Kenneth Marek, Danna Jennings, Shirley Lasch, Andrew Siderowf, Caroline Tanner, Tanya Simuni, Chris Coffey, Karl Kieburtz, Emily Flagg, Sohini Chowdhury, et al. The parkinson progression marker initiative (ppmi).Progress in neurobiology, 95(4):629–635, 2011

work page 2011
[57]

The RSNA-ASNR-MICCAI BraTS 2021 Benchmark on Brain Tumor Segmentation and Radiogenomic Classification

Ujjwal Baid, Satyam Ghodasara, Suyash Mohan, Michel Bilello, Evan Calabrese, Errol Colak, Keyvan Farahani, Jayashree Kalpathy-Cramer, Felipe C Kitamura, Sarthak Pati, et al. The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification.arXiv preprint arXiv:2107.02314, 2021. 13

work page internal anchor Pith review Pith/arXiv arXiv 2021
[58]

The 2024 brain tumor segmentation challenge meningioma radiotherapy (brats-men-rt) dataset.Scientific Data, 2026

Dominic LaBella, Katherine Schumacher, Michael Mix, Kevin Leu, Shan McBurney-Lin, Pierre Nedelec, Javier Villanueva-Meyer, David R Raleigh, Jonathan Shapey, Tom Vercauteren, et al. The 2024 brain tumor segmentation challenge meningioma radiotherapy (brats-men-rt) dataset.Scientific Data, 2026

work page 2024
[59]

Standardized assessment of automatic segmentation of white matter hyperintensities and results of the wmh segmentation challenge

Hugo J Kuijf, J Matthijs Biesbroek, Jeroen De Bresser, Rutger Heinen, Simon Andermatt, Mariana Bento, Matt Berseth, Mikhail Belyaev, M Jorge Cardoso, Adria Casamitjana, et al. Standardized assessment of automatic segmentation of white matter hyperintensities and results of the wmh segmentation challenge. IEEE transactions on medical imaging, 38(11):2556–2...

work page 2019
[60]

An open, multi-vendor, multi-field-strength brain mr dataset and analysis of publicly available skull stripping methods agreement.NeuroImage, 170:482–494, 2018

Roberto Souza, Oeslle Lucena, Julia Garrafa, David Gobbi, Marina Saluzzi, Simone Appenzeller, Letícia Rittner, Richard Frayne, and Roberto Lotufo. An open, multi-vendor, multi-field-strength brain mr dataset and analysis of publicly available skull stripping methods agreement.NeuroImage, 170:482–494, 2018

work page 2018
[61]

The wu-minn human connectome project: an overview.Neuroimage, 80: 62–79, 2013

David C Van Essen, Stephen M Smith, Deanna M Barch, Timothy EJ Behrens, Essa Yacoub, Kamil Ugurbil, Wu-Minn HCP Consortium, et al. The wu-minn human connectome project: an overview.Neuroimage, 80: 62–79, 2013

work page 2013
[62]

The adolescent brain cognitive development (abcd) study: imaging acquisition across 21 sites.Developmental cognitive neuroscience, 32:43–54, 2018

Betty Jo Casey, Tariq Cannonier, May I Conley, Alexandra O Cohen, Deanna M Barch, Mary M Heitzeg, Mary E Soules, Theresa Teslovich, Danielle V Dellarco, Hugh Garavan, et al. The adolescent brain cognitive development (abcd) study: imaging acquisition across 21 sites.Developmental cognitive neuroscience, 32:43–54, 2018

work page 2018
[63]

Ixi dataset-information extraction from images project (epsrc gr/s21533/02)[internet]

D Hill, S Williams, D Hawkes, and SM Smith. Ixi dataset-information extraction from images project (epsrc gr/s21533/02)[internet]. 2006 [cited 2013 may 7], 2006

work page 2006
[64]

Abbasi and Ehsan Adeli

Mohammad H. Abbasi and Ehsan Adeli. sMRI Processing Pipeline: A lightweight, end-to-end workflow for structural brain MRI preprocessing and quality control, 2025. URL https://doi.org/10.5281/ zenodo.17503175. Zenodo, doi: 10.5281/zenodo.17503175

work page doi:10.5281/zenodo.17503175 2025
[65]

The montreal cognitive assessment, moca: a brief screening tool for mild cognitive impairment.Journal of the American Geriatrics Society, 53(4):695–699, 2005

Ziad S Nasreddine, Natalie A Phillips, Valérie Bédirian, Simon Charbonneau, Victor Whitehead, Isabelle Collin, Jeffrey L Cummings, and Howard Chertkow. The montreal cognitive assessment, moca: a brief screening tool for mild cognitive impairment.Journal of the American Geriatrics Society, 53(4):695–699, 2005

work page 2005
[66]

Mini-mental state.Journal of psychiatric research, 12(3):189–198, 1975

Marshal F Folstein, Susan E Folstein, and Paul R McHugh. Mini-mental state.Journal of psychiatric research, 12(3):189–198, 1975

work page 1975
[67]

Noelle E Carlozzi, David S Tulsky, Robert V Kail, and Jennifer L Beaumont. Vi. nih toolbox cognition battery (cb): measuring processing speed.Monographs of the Society for Research in Child Development, 78(4):88–102, 2013

work page 2013
[68]

Manual for the aseba school-age forms and profiles.University of Vermont Research Center for Children, Youth, and Families, 2001

Thomas M Achenbach. Manual for the aseba school-age forms and profiles.University of Vermont Research Center for Children, Youth, and Families, 2001

work page 2001
[69]

Parkinsonism: onset, progression, and mortality.Neurology, 17(5): 427–427, 1967

Margaret M Hoehn and Melvin D Yahr. Parkinsonism: onset, progression, and mortality.Neurology, 17(5): 427–427, 1967

work page 1967
[70]

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert.arXiv preprint arXiv:1904.09675, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1904
[71]

A coefficient of agreement for nominal scales.Educational and psychological measurement, 20(1):37–46, 1960

Jacob Cohen. A coefficient of agreement for nominal scales.Educational and psychological measurement, 20(1):37–46, 1960

work page 1960
[72]

Medbookvqa: A systematic and comprehensive medical benchmark derived from open-access book.arXiv preprint arXiv:2506.00855, 2025

Sau Lai Yip, Sunan He, Yuxiang Nie, Shu Pui Chan, Yilin Ye, Sum Ying Lam, and Hao Chen. Medbookvqa: A systematic and comprehensive medical benchmark derived from open-access book.arXiv preprint arXiv:2506.00855, 2025

work page arXiv 2025
[73]

arXiv preprint arXiv:2404.00578 , year=

Fan Bai, Yuxin Du, Tiejun Huang, Max Q-H Meng, and Bo Zhao. M3d: Advancing 3d medical image analysis with multi-modal large language models.arXiv preprint arXiv:2404.00578, 2024

work page arXiv 2024
[74]

Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi

Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9556–9567, 2024

work page 2024
Contents of Appendix/Supplementary Material

A Limitations and Future Work
B Complete Quality Rule List
C Dataset Composition and Splits
D Leaderboard and Public Release
E Expert Review Details
E.1 First Round: Te...

Laterality removal (BraTS-GLI/MEN, 786 QA removed). Expert review flagged laterality labels as unreliable. Systematic verification of 220 unilateral BraTS-GLI subjects showed only 49% agreement between report-stated and voxel-based laterality, effectively random, because BraTS reports and NIfTI images may originate from different processing stages or conve...

Cerebellopontine angle fix (BraTS-MEN, 28 QA corrected). Fifteen BraTS-MEN subjects with cerebellopontine angle (CPA) lesions were incorrectly labeled as “cerebellar” in Location questions. CPA lesions are extra-axial posterior fossa masses, anatomically distinct from cerebellar parenchymal lesions. These labels were corrected to “posterior fossa.”

Table 6: Per-dataset c...

Subtle asymmetry removal (1,524 QA removed). 57% of laterality questions had less than 10% volume asymmetry between left and right structures, a difference imperceptible on visual inspection. These questions were removed across 7 datasets; questions with clearly visible asymmetry (>10% difference) were retained.
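The 10% retention rule can be illustrated with a minimal sketch. The text does not state how the percentage is normalized, so the code below assumes the absolute left-right difference divided by the mean of the two volumes. The function names and example volumes are hypothetical and are not taken from the released generation scripts.

```python
def asymmetry_percent(left_volume: float, right_volume: float) -> float:
    """Absolute left-right volume difference as a percentage of the mean volume.

    ASSUMPTION: normalizing by the mean of the two volumes; the source only
    says "volume asymmetry" and does not define the denominator.
    """
    mean_volume = (left_volume + right_volume) / 2.0
    if mean_volume <= 0:
        raise ValueError("volumes must be positive")
    return abs(left_volume - right_volume) / mean_volume * 100.0


def keep_laterality_question(
    left_volume: float,
    right_volume: float,
    threshold_percent: float = 10.0,
) -> bool:
    """Retain a laterality question only if asymmetry exceeds the threshold."""
    return asymmetry_percent(left_volume, right_volume) > threshold_percent


# Hypothetical structure volumes in mm^3.
subtle = keep_laterality_question(3300.0, 3000.0)   # about 9.5% asymmetry
visible = keep_laterality_question(3600.0, 3000.0)  # about 18.2% asymmetry
```

Under this assumed definition, the first pair falls below the threshold and would be removed, while the second would be retained. A different normalization (for example, dividing by the larger volume) would shift which borderline pairs pass.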

Signal wording standardization (654 QA reworded). The neuroradiologist recommended standard clinical phrasing: “What is the T1 signal intensity of the lesion?” was revised to “What is the signal intensity of the lesion on T1-weighted imaging?” No answers changed; only question text was reworded to match radiology reporting conventions.

Figure 6: Phase 1...

Anatomy questions. Both reviewers rated Anatomy questions as correct and highly relevant. The neuroradiologist noted that axial views are atypical for hippocampal assessment; however, this applies only to the 2D survey visualization. NeuroQA provides full 3D volumetric input to models, including all orientations.


[29] [29]

Lumen: Longitudinal multi-modal radiology model for prognosis and diagnosis.arXiv preprint arXiv:2602.21142, 2026

Zhifan Jiang, Dong Yang, Vishwesh Nath, Abhijeet Parida, Nishad P Kulkarni, Ziyue Xu, Daguang Xu, Syed Muhammad Anwar, Holger R Roth, and Marius George Linguraru. Lumen: Longitudinal multi-modal radiology model for prognosis and diagnosis.arXiv preprint arXiv:2602.21142, 2026

work page arXiv 2026

[30] [30]

Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

Robert Geirhos, Jörn-Henrik Jacobsen, Claudio Michaelis, Richard Zemel, Wieland Brendel, Matthias Bethge, and Felix A Wichmann. Shortcut learning in deep neural networks.Nature Machine Intelligence, 2(11):665–673, 2020

work page 2020

[31] [31]

Visual hallucinations of multi-modal large language models

Wen Huang, Hongbin Liu, Minxin Guo, and Neil Gong. Visual hallucinations of multi-modal large language models. InFindings of the Association for Computational Linguistics: ACL 2024, pages 9614–9631, 2024

work page 2024

[32] [32]

Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

work page 2022

[33] [33]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

work page 2021

[34] [34]

Large language models encode clinical knowledge

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

work page 2023

[35] [35]

Mm-neuroonco: A multimodal benchmark and instruction dataset for mri-based brain tumor diagnosis.arXiv preprint arXiv:2602.22955, 2026

Feng Guo, Jiaxiang Liu, Yang Li, Qianqian Shi, and Mingkun Xu. Mm-neuroonco: A multimodal benchmark and instruction dataset for mri-based brain tumor diagnosis.arXiv preprint arXiv:2602.22955, 2026

work page arXiv 2026

[36] [36]

3d-rad: A comprehensive 3d radiology med-vqa dataset with multi-temporal analysis and diverse diagnostic tasks.arXiv preprint arXiv:2506.11147, 2025

Xiaotang Gai, Jiaxiang Liu, Yichen Li, Zijie Meng, Jian Wu, and Zuozhu Liu. 3d-rad: A comprehensive 3d radiology med-vqa dataset with multi-temporal analysis and diverse diagnostic tasks.arXiv preprint arXiv:2506.11147, 2025

work page arXiv 2025

[37] [37]

The claude 3 model family: Opus, sonnet, haiku.Technical Report, 2024

Anthropic. The claude 3 model family: Opus, sonnet, haiku.Technical Report, 2024

work page 2024

[38] [38]

Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

work page 2023

[39] [39]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Nau- mann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36:28541–28564, 2023

work page 2023

[40] Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. MedGemma technical report. arXiv preprint arXiv:2507.05201, 2025

[41] Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaekermann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards generalist biomedical ai. NEJM AI, 1(3):AIoa2300138, 2024

[42] Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. Qwen2-VL: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024

[43] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. arXiv preprint arXiv:2303.00915, 2023

[44] Harsha Nori, Yin Tat Lee, Sheng Zhang, Dean Carignan, Richard Edgar, Nicolo Fusi, Nicholas King, Jonathan Larson, Yuanzhi Li, Weishung Liu, et al. Can generalist foundation models outcompete special-purpose tuning? case study in medicine. arXiv preprint arXiv:2311.16452, 2023

[45] Zijian Dong, Ruilin Li, Yilei Wu, Thuan T Nguyen, Joanna S Chong, Fang Ji, Nathanael R Tong, Christopher L Chen, and Juan H Zhou. Brain-jepa: Brain dynamics foundation model with gradient positioning and spatiotemporal masking. Advances in Neural Information Processing Systems, 37:86048–86073, 2024

[46] Yulong Liu, Yongqiang Ma, Wei Zhou, Guibo Zhu, and Nanning Zheng. Brainclip: Bridging brain and visual-linguistic representation via clip for generic natural visual stimulus decoding. arXiv preprint arXiv:2302.12971, 2023

[47] Favour Nerrise, Lucy Yin, Mohammad H. Abbasi, Kilian M. Pohl, and Ehsan Adeli. GeoSAE: Geometric prior-guided layer-wise sparse autoencoder annotation of brain mri foundation models, 2026. URL https://arxiv.org/abs/2605.01829

[48] Vladimir S Fonov, Alan C Evans, Robert C McKinstry, C Robert Almli, and DL Collins. Unbiased nonlinear average age-appropriate brain templates from birth to adulthood. NeuroImage, 47:S102, 2009

[49] Bruce Fischl. Freesurfer. NeuroImage, 62(2):774–781, 2012

[50] Jiayu Lei, Xiaoman Zhang, Chaoyi Wu, Lisong Dai, Ya Zhang, Yanyong Zhang, Yanfeng Wang, Weidi Xie, and Yuehua Li. Autorg-brain: Grounded report generation for brain mri. arXiv preprint arXiv:2407.16684, 2024

[51] John C Morris. The clinical dementia rating (cdr) current version and scoring rules. Neurology, 43(11):2412–2412, 1993

[52] Christopher G Goetz, Barbara C Tilley, Stephanie R Shaftman, Glenn T Stebbins, Stanley Fahn, Pablo Martinez-Martin, Werner Poewe, Cristina Sampaio, Matthew B Stern, Richard Dodel, et al. Movement disorder society-sponsored revision of the unified parkinson’s disease rating scale (mds-updrs): scale presentation and clinimetric testing results. Movement diso...

[53] Mark Endo, Favour Nerrise, Qingyu Zhao, Edith V Sullivan, Li Fei-Fei, Victor W Henderson, Kilian M Pohl, Kathleen L Poston, and Ehsan Adeli. Data-driven discovery of movement-linked heterogeneity in neurodegenerative diseases. Nature Machine Intelligence, 6(9):1034–1045, 2024

[54] Ronald Carl Petersen, Paul S Aisen, Laurel A Beckett, Michael C Donohue, Anthony Collins Gamst, Danielle J Harvey, Clifford R Jack Jr, William J Jagust, Leslie M Shaw, Arthur W Toga, et al. Alzheimer’s disease neuroimaging initiative (adni) clinical characterization. Neurology, 74(3):201–209, 2010

[55] Kathryn A Ellis, Ashley I Bush, David Darby, Daniela De Fazio, Jonathan Foster, Peter Hudson, Nicola T Lautenschlager, Nat Lenzo, Ralph N Martins, Paul Maruff, et al. The australian imaging, biomarkers and lifestyle (aibl) study of aging: methodology and baseline characteristics of 1112 individuals recruited for a longitudinal study of alzheimer’s disease...

[56] Kenneth Marek, Danna Jennings, Shirley Lasch, Andrew Siderowf, Caroline Tanner, Tanya Simuni, Chris Coffey, Karl Kieburtz, Emily Flagg, Sohini Chowdhury, et al. The parkinson progression marker initiative (ppmi). Progress in Neurobiology, 95(4):629–635, 2011

[57] Ujjwal Baid, Satyam Ghodasara, Suyash Mohan, Michel Bilello, Evan Calabrese, Errol Colak, Keyvan Farahani, Jayashree Kalpathy-Cramer, Felipe C Kitamura, Sarthak Pati, et al. The rsna-asnr-miccai brats 2021 benchmark on brain tumor segmentation and radiogenomic classification. arXiv preprint arXiv:2107.02314, 2021

[58] Dominic LaBella, Katherine Schumacher, Michael Mix, Kevin Leu, Shan McBurney-Lin, Pierre Nedelec, Javier Villanueva-Meyer, David R Raleigh, Jonathan Shapey, Tom Vercauteren, et al. The 2024 brain tumor segmentation challenge meningioma radiotherapy (brats-men-rt) dataset. Scientific Data, 2026

[59] Hugo J Kuijf, J Matthijs Biesbroek, Jeroen De Bresser, Rutger Heinen, Simon Andermatt, Mariana Bento, Matt Berseth, Mikhail Belyaev, M Jorge Cardoso, Adria Casamitjana, et al. Standardized assessment of automatic segmentation of white matter hyperintensities and results of the wmh segmentation challenge. IEEE Transactions on Medical Imaging, 38(11):2556–2...

[60] Roberto Souza, Oeslle Lucena, Julia Garrafa, David Gobbi, Marina Saluzzi, Simone Appenzeller, Letícia Rittner, Richard Frayne, and Roberto Lotufo. An open, multi-vendor, multi-field-strength brain mr dataset and analysis of publicly available skull stripping methods agreement. NeuroImage, 170:482–494, 2018

[61] David C Van Essen, Stephen M Smith, Deanna M Barch, Timothy EJ Behrens, Essa Yacoub, Kamil Ugurbil, Wu-Minn HCP Consortium, et al. The wu-minn human connectome project: an overview. NeuroImage, 80:62–79, 2013

[62] Betty Jo Casey, Tariq Cannonier, May I Conley, Alexandra O Cohen, Deanna M Barch, Mary M Heitzeg, Mary E Soules, Theresa Teslovich, Danielle V Dellarco, Hugh Garavan, et al. The adolescent brain cognitive development (abcd) study: imaging acquisition across 21 sites. Developmental Cognitive Neuroscience, 32:43–54, 2018

[63] D Hill, S Williams, D Hawkes, and SM Smith. Ixi dataset-information extraction from images project (epsrc gr/s21533/02) [internet]. 2006 [cited 2013 may 7], 2006

[64] Mohammad H. Abbasi and Ehsan Adeli. sMRI Processing Pipeline: A lightweight, end-to-end workflow for structural brain MRI preprocessing and quality control, 2025. URL https://doi.org/10.5281/zenodo.17503175. Zenodo, doi: 10.5281/zenodo.17503175

[65] Ziad S Nasreddine, Natalie A Phillips, Valérie Bédirian, Simon Charbonneau, Victor Whitehead, Isabelle Collin, Jeffrey L Cummings, and Howard Chertkow. The montreal cognitive assessment, moca: a brief screening tool for mild cognitive impairment. Journal of the American Geriatrics Society, 53(4):695–699, 2005

[66] Marshal F Folstein, Susan E Folstein, and Paul R McHugh. Mini-mental state. Journal of Psychiatric Research, 12(3):189–198, 1975

[67] Noelle E Carlozzi, David S Tulsky, Robert V Kail, and Jennifer L Beaumont. Vi. nih toolbox cognition battery (cb): measuring processing speed. Monographs of the Society for Research in Child Development, 78(4):88–102, 2013

[68] Thomas M Achenbach. Manual for the aseba school-age forms and profiles. University of Vermont Research Center for Children, Youth, and Families, 2001

[69] Margaret M Hoehn and Melvin D Yahr. Parkinsonism: onset, progression, and mortality. Neurology, 17(5):427–427, 1967

[70] Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. arXiv preprint arXiv:1904.09675, 2019

[71] Jacob Cohen. A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1):37–46, 1960

[72] Sau Lai Yip, Sunan He, Yuxiang Nie, Shu Pui Chan, Yilin Ye, Sum Ying Lam, and Hao Chen. Medbookvqa: A systematic and comprehensive medical benchmark derived from open-access book. arXiv preprint arXiv:2506.00855, 2025

[73] Fan Bai, Yuxin Du, Tiejun Huang, Max Q-H Meng, and Bo Zhao. M3d: Advancing 3d medical image analysis with multi-modal large language models. arXiv preprint arXiv:2404.00578, 2024

[74] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567, 2024

[75] J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data. Biometrics, pages 159–174, 1977

Contents of Appendix/Supplementary Material

A Limitations and Future Work
B Complete Quality Rule List
C Dataset Composition and Splits
D Leaderboard and Public Release
E Expert Review Details
E.1 First Round: Te...

Laterality removal (BraTS-GLI/MEN, 786 QA removed). Expert review flagged laterality labels as unreliable. Systematic verification of 220 unilateral BraTS-GLI subjects showed only 49% agreement between report-stated and voxel-based laterality, effectively random, because BraTS reports and NIfTI images may originate from different processing stages or conve...

Cerebellopontine angle fix (BraTS-MEN, 28 QA corrected). Fifteen BraTS-MEN subjects with cerebellopontine angle (CPA) lesions were incorrectly labeled as “cerebellar” in Location questions. CPA lesions are extra-axial posterior fossa masses, anatomically distinct from cerebellar parenchymal lesions. Corrected to “posterior fossa.”

Table 6: Per-dataset c...

Subtle asymmetry removal (1,524 QA removed). 57% of laterality questions had less than 10% volume asymmetry between left and right structures, differences imperceptible on visual inspection. Removed across 7 datasets; retained questions with clearly visible asymmetry (>10% difference).

Signal wording standardization (654 QA reworded). The neuroradiologist recommended standard clinical phrasing: “What is the T1 signal intensity of the lesion?” was revised to “What is the signal intensity of the lesion on T1-weighted imaging?” No answers changed; only question text was reworded to match radiology reporting conventions.

Figure 6: Phase 1...

Anatomy questions. Both reviewers rated Anatomy questions as correct and highly relevant. The neuroradiologist noted that axial views are atypical for hippocampal assessment; however, this applies only to the 2D survey visualization. NEUROQA provides full 3D volumetric input to models, including all orientations.