pith. machine review for the scientific record.

arxiv: 2605.10761 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 theorem links · Lean Theorem

RadThinking: A Dataset for Longitudinal Clinical Reasoning in Radiology

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:50 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual question answering · radiology · cancer screening · clinical reasoning · chain of thought · dataset · CT imaging · longitudinal analysis

The pith

RadThinking releases a VQA dataset that supplies explicit reasoning chains for cancer-screening radiology questions at three depths.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RadThinking to turn implicit radiologist reasoning into trainable data. It supplies 20,362 CT scans with VQA pairs grouped into foundation perception questions, single-rule applications, and compositional multi-step problems that reach guideline categories such as LI-RADS-5. Each compositional item comes with the exact chain of atomic questions that solves it, all aligned to clinical reporting standards. A sympathetic reader would care because current AI systems detect findings but rarely demonstrate the longitudinal comparison, prior-scan integration, and guideline-based conclusion that define actual cancer screening. The dataset therefore supplies both atomic supervision and verifiable chain-of-thought traces for reinforcement-learning methods.

Core claim

RadThinking is a Visual Question Answering corpus that stratifies questions by reasoning depth and grounds every compositional question in the rules of the governing clinical reporting standard. Foundation VQAs test atomic perception. Single-step VQAs test one clinical rule. Compositional VQAs require multi-step chain-of-thought to reach a guideline category. For every compositional question the authors release the supporting chain of foundation questions. The corpus covers 20,362 CT scans from 9,131 patients across 43 cancer groups plus 2,077 verified healthy controls with more than one year of follow-up.
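
To make the tier structure concrete, here is a minimal sketch of what a compositional record and its supporting chain might look like as data. The field names, the example questions, and the LI-RADS features shown are illustrative assumptions, not the released schema.

```python
from dataclasses import dataclass, field

@dataclass
class FoundationVQA:
    """Atomic perception question (hypothetical schema)."""
    question: str   # e.g. "What is the observation's diameter?"
    answer: str     # ground-truth answer
    scan_id: str    # CT scan the question is asked against

@dataclass
class CompositionalVQA:
    """Multi-step question whose answer is a guideline category."""
    question: str   # e.g. "Which LI-RADS category applies?"
    answer: str     # e.g. "LI-RADS-5"
    guideline: str  # governing reporting standard
    chain: list = field(default_factory=list)  # released chain of FoundationVQA steps

# Illustrative record, loosely following LI-RADS major features;
# the specific questions are invented, not drawn from the dataset.
item = CompositionalVQA(
    question="Which LI-RADS category applies to the observation in the liver?",
    answer="LI-RADS-5",
    guideline="LI-RADS v2018",
    chain=[
        FoundationVQA("What is the observation's diameter?", "24 mm", "scan_042"),
        FoundationVQA("Is arterial phase hyperenhancement present?", "yes", "scan_042"),
        FoundationVQA("Is washout present on the delayed phase?", "yes", "scan_042"),
    ],
)
print(item.answer, "via", len(item.chain), "foundation steps")
```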

What carries the argument

The tiered VQA structure, in which each compositional question is paired with an explicit chain of foundation questions that follows the rules of the governing clinical reporting standard.

If this is right

  • The foundation tier supplies atomic perception supervision that can be used to train perception modules.
  • The compositional tier supplies chain-of-thought traces that can serve as verifiable rewards for reinforcement-learning recipes (a reward sketch follows this list).
  • Models trained on the dataset can be evaluated on whether they reach the same guideline category as the reference chain.
  • The longitudinal span across multiple scans per patient enables training that compares current findings to prior examinations.
  • The dataset can be used to measure whether an AI system merely detects lesions or actually performs the multi-step reasoning required by screening guidelines.
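
As the second bullet suggests, the released chains could back a verifiable reward. Below is a minimal sketch of one possible design; the matching rule (exact equality on the final guideline category plus partial credit for reproduced intermediate steps) and all names are our assumptions, since the paper does not specify a reward function.

```python
def chain_reward(predicted_category: str,
                 predicted_steps: list[str],
                 reference_category: str,
                 reference_steps: list[str],
                 step_weight: float = 0.5) -> float:
    """Illustrative verifiable reward: exact match on the final guideline
    category, plus partial credit for reproducing reference chain steps."""
    final = 1.0 if predicted_category.strip() == reference_category.strip() else 0.0
    if not reference_steps:
        return final
    matched = sum(1 for step in reference_steps if step in predicted_steps)
    return (1 - step_weight) * final + step_weight * matched / len(reference_steps)

# Example: the model reaches LI-RADS-5 and reproduces 2 of 3 reference steps.
reward = chain_reward(
    "LI-RADS-5", ["diameter 24 mm", "washout present"],
    "LI-RADS-5", ["diameter 24 mm", "APHE present", "washout present"],
)
print(round(reward, 3))  # 0.833
```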

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If the chains prove reliable, the same construction method could be applied to other imaging modalities or organ systems that use standardized reporting lexicons.
  • Success on the compositional tier would indicate that an AI system can produce auditable intermediate steps rather than black-box diagnoses.
  • The dataset's size and stratification allow controlled experiments that isolate the contribution of each reasoning depth to final diagnostic accuracy.

Load-bearing premise

The automatically or semi-automatically generated VQA pairs and reasoning chains accurately reflect genuine clinical reasoning without introducing systematic errors or inconsistencies with real radiologist practice.

What would settle it

A panel of radiologists reviews a random sample of the released reasoning chains. If more than 15 percent of the compositional chains contain a clinically incorrect step or violate the cited reporting guideline, the load-bearing premise fails; a substantially lower error rate would support it.
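
For a sense of what such an audit involves statistically, here is a minimal sketch of an exact binomial test against the 15 percent threshold. The sample size and flagged count are invented for illustration.

```python
from math import comb

def binom_tail(n: int, k: int, p: float) -> float:
    """P(X >= k) for X ~ Binomial(n, p): the chance of flagging k or more
    flawed chains if the true error rate were exactly p."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# Suppose the panel audits 200 randomly sampled compositional chains and
# flags 42 as clinically incorrect (an invented outcome for illustration).
n, flagged = 200, 42
p_value = binom_tail(n, flagged, 0.15)  # H0: true error rate is 15%
print(f"observed rate {flagged / n:.1%}, P(X >= {flagged} | p = 0.15) = {p_value:.4f}")
```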

Figures

Figures reproduced from arXiv: 2605.10761 by Alan L. Yuille, Jakob Wasserthal, Pedro R. A. S. Bassi, Wenxuan Li, Xinze Zhou, Zongwei Zhou.

Figure 1: Overview of RadThinking. Left: the reasoning trajectory of an illustrative hepatocellular carcinoma (HCC) patient. We monitored this patient across 26 CT scans over 11 years (2013–2024). Six selected timepoints show how reasoning complexity evolves: post-resection baseline (Scan 1), a new suspicious lesion (Scan 2), resolution confirming a benign finding (Scan 5), recurrence after 5 years of remission (Sca…
Figure 2: Primary cancer type distribution across cancer-positive patients in RadThinking. Clinically equivalent subtypes are grouped together. For example, colon, rectal, and colorectal NOS are grouped into Colorectal Carcinoma. Bladder and urothelial cancers are grouped into a single category. The result is 43 distinct cancer groups with a pronounced long-tail distribution. Healthy controls (2,077) are omitted for…
Original abstract

Cancer screening is a reasoning task. A radiologist observes findings, compares them to prior scans, integrates clinical context, and reaches a diagnostic conclusion confirmed by pathology. We present RadThinking, a Visual Question Answering (VQA) dataset that makes this reasoning explicit and trainable. RadThinking releases VQA pairs at three difficulty tiers. Foundation VQAs are atomic perception questions. Single-step reasoning VQAs apply one clinical rule. Compositional VQAs require multi-step chain-of-thought to reach a guideline category such as LI-RADS-5. For every compositional VQA, we release the chain of foundation VQAs that solves it. The chain follows the rules of the governing clinical reporting standard. The dataset spans 20,362 CT scans from 9,131 patients across 43 cancer groups, plus 2,077 verified healthy controls with >1-year follow-up. To our knowledge, RadThinking is the first cancer-screening VQA corpus that stratifies questions by reasoning depth and grounds compositions in clinical reporting standards. The foundation tier supplies atomic perception supervision. The compositional tier supplies chain-of-thought data and verifiable rewards for reinforcement-learning recipes such as DeepSeek-R1 and OpenAI o1. RadThinking enables systematic training and evaluation of whether AI systems can reason about cancer, not merely detect it.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents RadThinking, a VQA dataset for longitudinal clinical reasoning in radiology cancer screening. It releases VQA pairs stratified into three tiers: foundation (atomic perception questions), single-step reasoning (applying one clinical rule), and compositional (multi-step chain-of-thought to reach guideline categories such as LI-RADS-5). For every compositional VQA, the corresponding chain of foundation VQAs is released and claimed to follow the rules of the governing clinical reporting standard. The dataset covers 20,362 CT scans from 9,131 patients across 43 cancer groups plus 2,077 verified healthy controls with >1-year follow-up. It positions itself as the first cancer-screening VQA corpus to stratify by reasoning depth and ground compositions in clinical standards, supplying atomic perception supervision and chain-of-thought data for RL-based training.

Significance. If the reasoning chains are shown to accurately reflect clinical practice, RadThinking would be a significant resource for medical AI by enabling explicit supervision of compositional and longitudinal reasoning rather than isolated detection. The tiered design supports progressive model training and provides verifiable rewards for methods such as chain-of-thought RL, while the scale and multi-cancer coverage increase its potential utility for developing systems that integrate findings across time and context.

major comments (1)
  1. [Abstract] The central claim that compositional VQAs 'require multi-step chain-of-thought to reach a guideline category such as LI-RADS-5' and that 'the chain follows the rules of the governing clinical reporting standard' is unsupported by any description of the VQA generation pipeline, expert review, inter-annotator agreement, or quality-control metrics. This detail is load-bearing for the assertion that the dataset grounds compositions in clinical reporting standards and supplies faithful chain-of-thought data.
minor comments (2)
  1. The abstract states the dataset 'spans 20,362 CT scans' but provides no breakdown by tier, cancer group, or reasoning depth; a summary table would improve clarity on scale and balance.
  2. The claim of being 'the first' such corpus would be strengthened by explicit comparison to prior radiology VQA datasets in the related-work section.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. We address the single major comment below and will make the requested revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central claim that compositional VQAs 'require multi-step chain-of-thought to reach a guideline category such as LI-RADS-5' and that 'the chain follows the rules of the governing clinical reporting standard' is unsupported by any description of the VQA generation pipeline, expert review, inter-annotator agreement, or quality-control metrics. This detail is load-bearing for the assertion that the dataset grounds compositions in clinical reporting standards and supplies faithful chain-of-thought data.

    Authors: We agree that the abstract currently states the claim without referencing the supporting methodology, which weakens the manuscript. In the revised version we will (1) update the abstract to briefly note that compositional chains are constructed according to explicit clinical reporting rules and (2) add a dedicated subsection in Methods that fully describes the VQA generation pipeline. This subsection will detail the rule-based chaining procedure, the role of board-certified radiologists in annotation and verification, inter-annotator agreement statistics, and the quality-control steps used to ensure the released chains remain faithful to the governing clinical standards.

    revision: yes
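
The promised inter-annotator agreement statistics could take several forms; a common choice is Cohen's kappa over per-step validity labels. A minimal sketch, assuming a binary valid/invalid labeling scheme that the rebuttal does not specify:

```python
def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Chance-corrected agreement between two annotators on the same items."""
    assert len(labels_a) == len(labels_b) and labels_a
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    categories = set(labels_a) | set(labels_b)
    expected = sum(
        (labels_a.count(c) / n) * (labels_b.count(c) / n) for c in categories
    )
    return (observed - expected) / (1 - expected)

# Two radiologists label each chain step as valid/invalid (invented data).
rater_1 = ["valid", "valid", "invalid", "valid", "valid", "invalid"]
rater_2 = ["valid", "valid", "invalid", "invalid", "valid", "invalid"]
print(f"kappa = {cohens_kappa(rater_1, rater_2):.2f}")  # 0.67
```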

Circularity Check

0 steps flagged

Dataset release paper exhibits no circularity

full rationale

This is a dataset paper that releases VQA pairs stratified by reasoning depth and grounded in clinical standards. It contains no derivations, equations, fitted parameters, predictions, or first-principles results that could reduce to prior quantities by construction. Claims of novelty and composition are descriptive rather than deductive, with no self-citation chains or ansatzes invoked to force any outcome. The generation process is described at a high level but does not involve any circular reduction of results to inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution rests on the domain assumption that clinical reporting standards encode valid reasoning steps and that the constructed VQA pairs faithfully instantiate those steps.

axioms (1)
  • domain assumption: Clinical reporting standards such as LI-RADS provide accurate and complete rules for diagnostic reasoning in cancer screening.
    Compositional VQAs are defined to reach guideline categories using these standards.

pith-pipeline@v0.9.0 · 5557 in / 1268 out tokens · 81037 ms · 2026-05-12T03:50:11.638430+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

120 extracted references · 120 canonical work pages · 17 internal anchors

  1. [1] P. Agrawal, S. Antoniak, E. B. Hanna, B. Bout, D. Chaplot, J. Chudnovsky, D. Costa, B. De Monicault, S. Garg, T. Gervet, et al. Pixtral 12B. arXiv preprint arXiv:2410.07073, 2024.
  2. [2] J. A. Ajani, T. A. D’Amico, K. Almhanna, D. J. Bentrem, J. Chao, P. Das, C. S. Denlinger, P. Fanta, F. Farjah, C. S. Fuchs, et al. Gastric cancer, version 3.2016, NCCN clinical practice guidelines in oncology. Journal of the National Comprehensive Cancer Network, 14(10):1286–1312, 2016.
  3. [3] J. A. Ajani, T. A. D’Amico, D. J. Bentrem, J. Chao, C. Corvera, P. Das, C. S. Denlinger, P. C. Enzinger, P. Fanta, F. Farjah, et al. Esophageal and esophagogastric junction cancers, version 2.2019, NCCN clinical practice guidelines in oncology. Journal of the National Comprehensive Cancer Network, 17(7):855–883, 2019.
  4. [4] S. Al-Katib, G. Gupta, A. Brudvik, S. Ries, J. Krauss, and M. Farah. A practical guide to managing CT findings in the breast. Clinical Imaging, 60(2):274–282, 2020.
  5. [5] F. Amant, M. R. Mirza, M. Koskas, and C. L. Creutzberg. Cancer of the corpus uteri. International Journal of Gynecology & Obstetrics, 143:37–50, 2018.
  6. [6] J. Andreas, M. Rohrbach, T. Darrell, and D. Klein. Neural module networks. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
  7. [7] R. F. Andreotti, D. Timmerman, L. M. Strachowski, W. Froyman, B. R. Benacerraf, G. L. Bennett, T. Bourne, D. L. Brown, B. G. Coleman, M. C. Frates, et al. O-RADS US risk stratification and management system: a consensus guideline from the ACR Ovarian-Adnexal Reporting and Data System Committee. Radiology, 294(1):168–185, 2020.
  8. [8] M. Antonelli, A. Reinke, S. Bakas, K. Farahani, B. A. Landman, G. Litjens, B. Menze, O. Ronneberger, R. M. Summers, B. van Ginneken, et al. The medical segmentation decathlon. arXiv preprint arXiv:2106.05735, 2021.
  9. [9] M. Atri, A. Alabousi, C. Reinhold, E. A. Akin, C. B. Benson, P. R. Bhosale, S. K. Kang, Y. Lakhman, R. Nicola, P. V. Pandharipande, et al. ACR Appropriateness Criteria® clinically suspected adnexal mass, no acute symptoms. Journal of the American College of Radiology, 16(5):S77–S93, 2019.
  10. [10] F. Bai, Y. Du, T. Huang, M. Q.-H. Meng, and B. Zhao. M3D: Advancing 3D medical image analysis with multi-modal large language models. arXiv preprint arXiv:2404.00578, 2024.
  11. [11] S. Bai, K. Chen, X. Liu, J. Wang, W. Ge, S. Song, K. Dang, P. Wang, S. Wang, J. Tang, et al. Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923, 2025.
  12. [12] P. R. Bassi, W. Li, Y. Tang, F. Isensee, Z. Wang, J. Chen, Y.-C. Chou, Y. Kirchhoff, M. Rokuss, Z. Huang, J. Ye, J. He, T. Wald, C. Ulrich, M. Baumgartner, S. Roy, K. H. Maier-Hein, P. Jaeger, Y. Ye, Y. Xie, J. Zhang, Z. Chen, Y. Xia, Z. Xing, L. Zhu, Y. Sadegheih, A. Bozorgpour, P. Kumari, R. Azad, D. Merhof, P. Shi, T. Ma, Y. Du, F. Bai, T. Huan...
  13. [13] P. R. Bassi, W. Li, J. Chen, Z. Zhu, T. Lin, S. Decherchi, A. Cavalli, K. Wang, Y. Yang, A. L. Yuille, and Z. Zhou. Learning segmentation from radiology reports. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 305–315. Springer, 2025. URL https://github.com/MrGiovanni/R-Super.
  14. [14] P. R. Bassi, Q. Wu, W. Li, S. Decherchi, A. Cavalli, A. Yuille, and Z. Zhou. Label Critic: Design data before models. In IEEE International Symposium on Biomedical Imaging (ISBI), pages 1–5. IEEE, 2025. URL https://github.com/PedroRASB/LabelCritic.
  15. [15] P. R. Bassi, M. C. Yavuz, I. E. Hamamci, S. Er, X. Chen, W. Li, B. Menze, S. Decherchi, A. Cavalli, K. Wang, Y. Yang, A. Yuille, and Z. Zhou. RadGPT: Constructing 3D image-text tumor datasets. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 23720–23730, 2025. URL https://github.com/MrGiovanni/RadGPT.
  16. [16] P. R. Bassi, X. Zhou, W. Li, S. Płotka, J. Chen, Q. Chen, Z. Zhu, J. Prządo, I. E. Hamamci, S. Er, X. Chen, M. C. Yavuz, Y.-C. Chou, T. Lin, K. Wang, Y. Tang, J. B. Cwikla, S. Decherchi, A. Cavalli, Y. Yang, A. L. Yuille, and Z. Zhou. Scaling artificial intelligence for multi-tumor early detection with more reports, fewer masks. arXiv preprint arXiv:...
  17. [17] Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In International Conference on Machine Learning (ICML), 2009.
  18. [18] A. B. Benson, A. P. Venook, M. M. Al-Hawary, M. A. Arain, Y.-J. Chen, K. K. Ciombor, S. A. Cohen, H. S. Cooper, D. A. Deming, I. Garrido-Laguna, et al. Small bowel adenocarcinoma, version 1.2020, NCCN clinical practice guidelines in oncology. Journal of the National Comprehensive Cancer Network, 17(9):1109–1133, 2019.
  19. [19] P. Bilic, P. F. Christ, E. Vorontsov, G. Chlebus, H. Chen, Q. Dou, C.-W. Fu, X. Han, P.-A. Heng, J. Hesser, et al. The liver tumor segmentation benchmark (LiTS). arXiv preprint arXiv:1901.04056, 2019.
  20. [20] L. Blankemeier, J. P. Cohen, A. Kumar, D. Van Veen, S. J. S. Gardezi, M. Paschali, Z. Chen, J.-B. Delbrouck, E. Reis, C. Truyts, et al. Merlin: A vision language foundation model for 3D computed tomography. Research Square, pages rs–3, 2024.
  21. [21] J. Chen, Z. Cai, K. Ji, X. Wang, W. Liu, R. Wang, J. Hou, and B. Wang. HuatuoGPT-o1, towards medical complex reasoning with LLMs. arXiv preprint arXiv:2412.18925, 2024.
  22. [22] Q. Chen, X. Chen, H. Song, Z. Xiong, A. Yuille, C. Wei, and Z. Zhou. Towards generalizable tumor synthesis. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 11147–11158, 2024. URL https://github.com/MrGiovanni/DiffTumor.
  23. [23] Q. Chen, X. Zhou, C. Liu, H. Chen, W. Li, Z. Jiang, Z. Huang, Y. Zhao, D. Yu, J. He, Y. Zheng, L. Shao, A. Yuille, and Z. Zhou. Scaling tumor segmentation: Best lessons from real and synthetic data. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 24001–24013, 2025. URL https://github.com/BodyMaps/AbdomenAtlas2.0.
  24. [24] Y. Chen, W. Xiao, P. R. Bassi, X. Zhou, S. Er, I. E. Hamamci, Z. Zhou, and A. Yuille. Are vision language models ready for clinical diagnosis? A 3D medical benchmark for tumor-centric visual question answering. arXiv preprint arXiv:2505.18915, 2025. URL https://github.com/Schuture/DeepTumorVQA.
  25. [25] Y. Chen, Z. Zhou, W. Li, and A. Yuille. Large-scale label quality assessment for medical segmentation via a vision-language judge and synthetic data. arXiv preprint arXiv:2601.14406, 2026.
  26. [26] V. Chernyak, K. J. Fowler, A. Kamaya, A. Z. Kielar, K. M. Elsayes, M. R. Bashir, Y. Kono, R. K. Do, D. G. Mitchell, A. G. Singal, et al. Liver Imaging Reporting and Data System (LI-RADS) version 2018: imaging of hepatocellular carcinoma in at-risk patients. Radiology, 289(3):816–830, 2018.
  27. [27] B. D. Cheson, R. I. Fisher, S. F. Barrington, F. Cavalli, L. H. Schwartz, E. Zucca, and T. A. Lister. Recommendations for initial evaluation, staging, and response assessment of Hodgkin and non-Hodgkin lymphoma: the Lugano classification. Journal of Clinical Oncology, 32(27):3059–3067, 2014.
  28. [28] Y.-C. Chou, B. Li, D.-P. Fan, A. Yuille, and Z. Zhou. Acquiring weak annotations for tumor localization in temporal and volumetric data. Machine Intelligence Research, pages 1–13. URL https://github.com/johnson111788/Drag-Drop.
  30. [30] Y.-C. Chou, Z. Zhou, and A. Yuille. Embracing massive medical data. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 24–35. Springer, 2024. URL https://github.com/MrGiovanni/OnlineLearning.
  31. [31] J. Christensen, A. E. Prosper, C. C. Wu, J. Chung, E. Lee, B. Elicker, A. R. Hunsaker, M. Petranovic, K. L. Sandler, B. Stiles, et al. ACR Lung-RADS v2022: assessment categories and management recommendations. Journal of the American College of Radiology, 21(3):473–488, 2024.
  32. [32] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025.
  33. [33] W. Dai, N. Lee, B. Wang, Z. Yang, Z. Liu, J. Barker, T. Rintamaki, M. Shoeybi, B. Catanzaro, and W. Ping. NVLM: Open frontier-class multimodal LLMs. arXiv preprint arXiv:2409.11402, 2024.
  34. [34] M. de Grauw, E. T. Scholten, E. J. Smit, M. J. Rutten, M. Prokop, B. van Ginneken, and A. Hering. The ULS23 challenge: A baseline model and benchmark dataset for 3D universal lesion segmentation in computed tomography. Medical Image Analysis, 102:103525, 2025.
  35. [35] H. Deng, H. Zhang, M. Ou, Z. Li, J. Liu, H. Wang, and T.-Y. Lin. Boosting the generalization and reasoning of vision language models with curriculum reinforcement learning. arXiv preprint arXiv:2503.07065, 2025.
  36. [36] Y. Deng, H. Bansal, F. Yin, N. Peng, W. Wang, and K.-W. Chang. OpenVLThinker: Complex vision-language reasoning via iterative SFT-RL cycles. arXiv preprint arXiv:2503.17352, 2025.
  37. [37] A. Diaz-Pinto, S. Alle, V. Nath, Y. Tang, A. Ihsani, M. Asad, F. Pérez-García, P. Mehta, W. Li, M. Flores, et al. MONAI Label: A framework for AI-assisted interactive labeling of 3D medical images. Medical Image Analysis, 95:103207, 2024.
  38. [38] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024.
  39. [39] C. D’Orsi, L. Bassett, and S. Feig. Breast Imaging Reporting and Data System (BI-RADS). Oxford University Press, New York, 2018.
  40. [40] E. A. Eisenhauer, P. Therasse, J. Bogaerts, L. H. Schwartz, D. Sargent, R. Ford, J. Dancey, S. Arbuck, S. Gwyther, M. Mooney, et al. New response evaluation criteria in solid tumours: revised RECIST guideline (version 1.1). European Journal of Cancer, 45(2):228–247, 2009.
  41. [41] X. Gai, J. Liu, Y. Li, Z. Meng, J. Wu, and Z. Liu. 3D-RAD: A comprehensive 3D radiology Med-VQA dataset with multi-temporal analysis and diverse diagnostic tasks. arXiv preprint arXiv:2506.11147, 2025.
  42. [42] H. L. Gamage, L. Wijerathne, Y. Wickramasinghe, M. Riegler, and P. Halvorsen. Kvasir-VQA-x1: A multimodal dataset for medical reasoning and robust MedVQA in gastrointestinal endoscopy. arXiv preprint arXiv:2506.09958, 2025.
  43. [43] D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv preprint arXiv:2501.12948, 2025.
  44. [44] T. Gupta and A. Kembhavi. Visual programming: Compositional visual reasoning without training. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.
  45. [45] I. E. Hamamci, S. Er, and B. Menze. CT2Rep: Automated radiology report generation for 3D medical imaging. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 476–486. Springer, 2024.
  46. [46] I. E. Hamamci, S. Er, C. Wang, F. Almas, A. G. Simsek, S. N. Esirgun, I. Dogan, O. F. Durugol, B. Hou, S. Shit, et al. Generalist foundation models from a multimodal dataset for 3D computed tomography. Nature Biomedical Engineering, pages 1–19, 2026.
  47. [47] X. He, Y. Zhang, L. Mou, E. Xing, and P. Xie. PathVQA: 30000+ questions for medical visual question answering. arXiv preprint arXiv:2003.10286, 2020.
  48. [48] M. T. Heller, M. Harisinghani, J. D. Neitlich, P. Yeghiayan, and L. L. Berland. Managing incidental findings on abdominal and pelvic CT and MRI, part 3: white paper of the ACR Incidental Findings Committee II on splenic and nodal findings. Journal of the American College of Radiology, 10(11):833–839, 2013.
  49. [49] N. Heller, N. Sathianathen, A. Kalapara, E. Walczak, K. Moore, H. Kaluzniak, J. Rosenberg, P. Blake, Z. Rengel, M. Oestreich, et al. The KiTS19 challenge data: 300 kidney tumor cases with clinical context, CT semantic segmentations, and surgical outcomes. arXiv preprint arXiv:1904.00445, 2019.
  50. [50] B. R. Herts, S. G. Silverman, N. M. Hindman, R. G. Uzzo, R. P. Hartman, G. M. Israel, D. A. Baumgarten, L. L. Berland, and P. V. Pandharipande. Management of the incidental renal mass on CT: a white paper of the ACR Incidental Findings Committee. Journal of the American College of Radiology, 15(2):264–273, 2018.
  51. [51] J. K. Hoang, J. E. Langer, W. D. Middleton, C. C. Wu, L. W. Hammers, J. J. Cronan, F. N. Tessler, E. G. Grant, and L. L. Berland. Managing incidental thyroid nodules detected on imaging: white paper of the ACR Incidental Thyroid Findings Committee. Journal of the American College of Radiology, 12(2):143–150, 2015.
  52. [52] H. Hong, H. Kim, H. Lee, S. Choi, et al. Decomposing complex visual comprehension into atomic visual skills for vision language models. arXiv preprint arXiv:2505.20021, 2025.
  53. [53] Y. Hu, T. Li, Q. Lu, W. Shao, J. He, Y. Qiao, and P. Luo. OmniMedVQA: A new large-scale comprehensive evaluation benchmark for medical LVLM. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  54. [54] W. Huang, B. Jia, Z. Zhai, S. Cao, Z. Ye, F. Zhao, Y. Hu, and S. Lin. Vision-R1: Incentivizing reasoning capability in multimodal large language models. arXiv preprint arXiv:2503.06749, 2025.
  55. [55] D. A. Hudson and C. D. Manning. GQA: A new dataset for real-world visual reasoning and compositional question answering. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019.
  56. [56] J. Jeong, S. Yun, H. Lim, and J. Kang. Med-PRM: Medical reasoning models with stepwise, guideline-verified process rewards. arXiv preprint arXiv:2506.11474, 2025.
  57. [57] A. E. Johnson, T. J. Pollard, N. R. Greenbaum, M. P. Lungren, C.-y. Deng, Y. Peng, Z. Lu, R. G. Mark, S. J. Berkowitz, and S. Horng. MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs. arXiv preprint arXiv:1901.07042, 2019.
  58. [58] J. Johnson, B. Hariharan, L. van der Maaten, L. Fei-Fei, C. Lawrence Zitnick, and R. Girshick. CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2017.
  59. [59] M. Kang, B. Li, Z. Zhu, Y. Lu, E. K. Fishman, A. Yuille, and Z. Zhou. Label-Assemble: Leveraging multiple datasets with partial labels. In IEEE International Symposium on Biomedical Imaging, pages 1–5. IEEE, 2023. URL https://github.com/MrGiovanni/LabelAssemble.
  60. [60] T. Khot, H. Trivedi, M. Finlayson, Y. Fu, K. Richardson, P. Clark, and A. Sabharwal. Decomposed prompting: A modular approach for solving complex tasks. In International Conference on Learning Representations (ICLR), 2023.
  61. [61] Y. Lai, J. Zhong, M. Li, S. Zhao, and X. Yang. Med-R1: Reinforcement learning for generalizable medical reasoning in vision-language models. arXiv preprint arXiv:2503.13939, 2025.
  62. [62] J. J. Lau, S. Gayen, A. Ben Abacha, and D. Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific Data, 5(1):180251, 2018.
  63. [63] B. Li, Y.-C. Chou, S. Sun, H. Qiao, A. Yuille, and Z. Zhou. Early detection and localization of pancreatic cancer by label-free tumor synthesis. MICCAI Workshop on Big Task Small Data, 1001-AI, 2023. URL https://github.com/MrGiovanni/SyntheticTumors.
  64. [64] B. Li, Y. Zhang, D. Guo, R. Zhang, F. Li, H. Zhang, K. Zhang, Y. Li, Z. Liu, and C. Li. LLaVA-OneVision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024.
  65. [65] C. Li, C. Wong, S. Zhang, N. Usuyama, H. Liu, J. Yang, T. Naumann, H. Poon, and J. Gao. LLaVA-Med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems, 36, 2024.
  66. [66] W. Li, C. Qu, X. Chen, P. R. Bassi, Y. Shi, Y. Lai, Q. Yu, H. Xue, Y. Chen, X. Lin, Y. Tang, Y. Cao, H. Han, Z. Zhang, J. Liu, T. Zhang, Y. Ma, J. Wang, G. Zhang, A. Yuille, and Z. Zhou. AbdomenAtlas: A large-scale, detailed-annotated, & multi-center dataset for efficient transfer learning and open algorithmic benchmarking. Medical Image Analysis, pa...
  67. [67] W. Li, A. Yuille, and Z. Zhou. How well do supervised models transfer to 3D image segmentation? In International Conference on Learning Representations, 2024. URL https://github.com/MrGiovanni/SuPreM.
  68. [68] W. Li, P. R. Bassi, T. Lin, Y.-C. Chou, X. Zhou, Y. Tang, F. Isensee, K. Wang, Q. Chen, X. Xu, J. Ye, Z. Zhu, S. Decherchi, A. Cavalli, A. L. Yuille, and Z. Zhou. ScaleMAI: Accelerating the development of trusted datasets and AI models. arXiv preprint arXiv:2501.03410, 2025. URL https://github.com/MrGiovanni/ScaleMAI.
  69. [69] W. Li, X. Zhou, Q. Chen, T. Lin, P. R. Bassi, X. Chen, C. Ye, Z. Zhu, K. Ding, H. Li, K. Wang, Y. Yang, Y. Tang, D. Xu, A. L. Yuille, and Z. Zhou. PanTS: The pancreatic tumor segmentation dataset. In Conference on Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2025. URL https://github.com/MrGiovanni/PanTS.
  70. [70] W. Li, P. R. A. S. Bassi, L. Wu, X. Zhou, Y. Zhao, Q. Chen, S. Plotka, T. Lin, Z. Zhu, M. Martin, J. Caskey, S. Jiang, X. Chen, J. B. Ćwikła, A. Sankowski, Y. Wu, S. Decherchi, A. Cavalli, C. Lall, C. Tomasetti, Y. Guo, X. Yu, Y. Cai, H. Qiao, J. Bao, C. Hu, X. Wang, A. Sitek, K. Ding, H. Li, M. Wang, D. Yu, G. Zhang, Y. Yang, K. Wang, A. L. Yuille,...
  71. [71] J. Lin, H. Yin, W. Ping, P. Molchanov, M. Shoeybi, and S. Han. VILA: On pre-training for visual language models. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.
  72. [72] B. Liu, L.-M. Zhan, L. Xu, L. Ma, Y. Yang, and X.-M. Wu. SLAKE: A semantically-labeled knowledge-enhanced dataset for medical visual question answering. In IEEE International Symposium on Biomedical Imaging (ISBI), 2021.
  73. [73] J. Liu, A. Yuille, Y. Tang, and Z. Zhou. CLIP-driven universal model for partially labeled organ and pan-cancer segmentation. In MICCAI 2023 FLARE Challenge, 2023. URL https://github.com/ljwztc/CLIP-Driven-Universal-Model.
  74. [74] J. Liu, Y. Zhang, K. Wang, M. C. Yavuz, X. Chen, Y. Yuan, H. Li, Y. Yang, A. Yuille, Y. Tang, and Z. Zhou. Universal and extensible language-vision models for organ segmentation and tumor detection from abdominal computed tomography. Medical Image Analysis, page 103226. URL https://github.com/ljwztc/CLIP-Driven-Universal-Model.
  76. [76] R. Liu, I. Q. Mohiuddin, A. J. Schoeffler, K. Renduchintala, A. Nayak, P. L. Vemu, S. C. Vedak, K. C. Black, J. L. Havlik, I. Ogunmola, S. P. Ma, R. Dhatt, and J. H. Chen. PhysicianBench: Evaluating LLM agents in real-world EHR environments. arXiv preprint arXiv:2605.02240, 2026.
  77. [77] Z. Liu, Z. Sun, Y. Zang, X. Dong, Y. Cao, H. Duan, D. Lin, and J. Wang. Visual-RFT: Visual reinforcement fine-tuning. arXiv preprint arXiv:2503.01785, 2025.
  78. [78] H. Lu, W. Liu, B. Zhang, B. Wang, K. Dong, B. Liu, J. Sun, T. Ren, Z. Li, H. Yang, et al. DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525, 2024.
  79. [79] A. Lubonja, P. R. Bassi, W. Li, H. Qiao, R. Burns, A. L. Yuille, and Z. Zhou. Auditing significance, metric choice, and demographic fairness in medical AI challenges. arXiv preprint arXiv:2512.19091, 2025. URL https://github.com/ariellubonja/RankInsight.
  80. [80] W. W. Mayo-Smith, J. H. Song, G. L. Boland, I. R. Francis, G. M. Israel, P. J. Mazzaglia, L. L. Berland, and P. V. Pandharipande. Management of incidental adrenal masses: a white paper of the ACR Incidental Findings Committee. Journal of the American College of Radiology, 14(8):1038–1044, 2017.

Showing first 80 references.