pith. machine review for the scientific record.

arxiv: 2605.09679 · v1 · submitted 2026-05-10 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links · Lean Theorem

DeepTumorVQA: A Hierarchical 3D CT Benchmark for Stage-Wise Evaluation of Medical VLMs and Tool-Augmented Agents

Yixiong Chen , Wenjie Xiao , Pedro R. A. S. Bassi , Boyan Wang , Liang He , Xinze Zhou , Sezgin Er , Ibrahim Ethem Hamamci , Zongwei Zhou , Alan Yuille

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:58 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords: 3D CT · medical VQA · tumor diagnosis · hierarchical benchmark · vision-language models · tool-augmented agents · stage-wise evaluation · medical reasoning

The pith

A hierarchical benchmark reveals that accurate measurement is the main barrier for medical vision-language models analyzing 3D CT tumor scans, with tool augmentation providing substantial relief.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper builds a large-scale benchmark for evaluating how well vision-language models and agents handle 3D CT images in tumor diagnosis. It breaks the task into four stages that mirror clinical workflows: first spotting the tumor, then measuring its size and properties, then visually interpreting those measurements, and finally applying medical knowledge to reach a diagnosis. Testing many models shows that mistakes in the measurement stage prevent good performance on the reasoning stages that follow. Allowing models to use external tools for segmentation and calculation helps them overcome this bottleneck and improve on higher-level questions. The result is a more detailed diagnosis of where current medical AI systems need work.

Core claim

DeepTumorVQA decomposes 3D CT tumor diagnosis into four stages (recognition, measurement, visual reasoning, and medical reasoning) with 476K questions across 42 subtypes on 9,262 volumes. Ground-truth evidence chains defined over lower-level primitives link the stages, so higher-level questions remain independently scorable. Experiments across more than 30 model configurations demonstrate that quantitative measurement is the primary bottleneck for VLMs in direct reasoning mode. Tool-augmented agents that call segmentation models, measurement programs, and knowledge modules achieve better results, and supervising agents on ground-truth tool-use traces further reduces errors.
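To make the stage linkage concrete, here is a minimal sketch of how a higher-level question might carry its evidence chain of lower-level primitives. The field names and example values are illustrative assumptions, not the released DeepTumorVQA schema; the splenomegaly thresholds echo the paper's appendix.

```python
# Minimal sketch of a stage-linked QA record. Field names are illustrative
# assumptions, not the released DeepTumorVQA schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class QARecord:
    stage: str        # "recognition" | "measurement" | "visual_reasoning" | "medical_reasoning"
    subtype: str      # one of the 42 clinical subtypes
    question: str
    answer: str       # independently scorable ground truth
    evidence: List["QARecord"] = field(default_factory=list)  # lower-level primitives

# A medical-reasoning question whose ground truth is composed from
# recognition and measurement primitives lower in the chain:
splenomegaly_q = QARecord(
    stage="medical_reasoning",
    subtype="splenomegaly_grading",
    question="What is the splenomegaly grade?",
    answer="Mild",  # 4-grade thresholds at 314.5/500/800 cm^3 per the appendix
    evidence=[
        QARecord("recognition", "organ_presence", "Is the spleen visible?", "Yes"),
        QARecord("measurement", "organ_volume", "What is the splenic volume in cm^3?", "420.0"),
    ],
)
```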

What carries the argument

The hierarchical four-stage evidence chain for tumor diagnosis, supported by tool-interaction environments that let agents call external segmentation, measurement, and knowledge tools.

Load-bearing premise

The four stages accurately capture independent steps in real clinical tumor diagnosis workflows, and the automatically generated questions and annotations do not contain systematic biases that would skew the results.

What would settle it

Running the same models on a version of the benchmark where measurement ground truth is provided directly as input, and finding no improvement in later-stage reasoning accuracy, would refute the claim that measurement is the primary bottleneck; a large improvement would support it.
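A sketch of that test under assumed interfaces: the model call, the benchmark record fields, and the injection format are all hypothetical placeholders, not the paper's evaluation harness.

```python
# Sketch of the oracle-measurement falsification test. `model`, `benchmark`,
# and all field names are hypothetical placeholders, not the paper's harness.

def accuracy(model, questions, inject_measurements=False):
    correct = 0
    for q in questions:
        prompt = q["question"]
        if inject_measurements:
            # Prepend ground-truth measurement primitives from the evidence chain.
            facts = "; ".join(q["measurement_evidence"])
            prompt = f"Known measurements: {facts}\n{prompt}"
        correct += model.answer(q["volume"], prompt) == q["answer"]
    return correct / len(questions)

def run_test(model, benchmark):
    later = [q for q in benchmark if q["stage"] in ("visual_reasoning", "medical_reasoning")]
    base = accuracy(model, later)
    oracle = accuracy(model, later, inject_measurements=True)
    # A near-zero gap would refute the measurement-bottleneck claim;
    # a large gap supports it.
    print(f"baseline: {base:.3f}  with oracle measurements: {oracle:.3f}")
```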

Figures

Figures reproduced from arXiv: 2605.09679 by Alan Yuille, Boyan Wang, Ibrahim Ethem Hamamci, Liang He, Pedro R. A. S. Bassi, Sezgin Er, Wenjie Xiao, Xinze Zhou, Yixiong Chen, Zongwei Zhou.

Figure 1. Overview of the DeepTumorVQA benchmark. The benchmark covers four diagnostic task …

Figure 2. Compositional question design. (A) Structured metadata extracted from segmentation masks and radiology reports. (B) Modular logic programs generate questions across four types and 42 subtypes. Each medical reasoning subtype is composed from recognition and measurement primitives, with clinical literature grounding. … CLEVR-Ref+ [39], and Super-CLEVR [38] but grounded in clinical workflows (we discuss the compositi…

Figure 3. Statistics of DeepTumorVQA. Left: distribution of QA pairs across four main types and 42 …

Figure 4. Error landscape transformation. (a) Error rate by task type across three paradigms. Oracle tools halve Measurement errors (70%→36%) and sharply reduce Medical Reasoning errors (72%→49%); SFT further lowers all categories. (b) Agent failure decomposition: action loops (step-limit hits, red) account for 17–18% of all questions in oracle agents, concentrated in multi-step Visual Reasoning; SFT virtually elimi…

Figure 5. Per-subtype accuracy heatmap for all models. Rows are models sorted by overall accuracy; …

Figure 6. Accuracy by question type across demographic and imaging factors.

Figure 7. The Pearson correlation between median distractor gap and accuracy drop under 10% noise is r = 0.25 (p = 0.45), indicating no statistically significant relationship. Subtypes with the tightest spacing (organ aggregation: median gap 7.2%) show large noise drops (−38.6pp), but subtypes with moderate spacing (tumor burden: 22.3%) also show substantial drops. The lack of correlation indicates …
Original abstract

Medical vision-language models (VLMs) and AI agents have made significant progress in learning to analyze and reason about clinical images. However, existing medical visual question answering (VQA) benchmarks collapse model capabilities into a single accuracy score, obscuring where and why models fail. We propose DeepTumorVQA, a hierarchical benchmark that follows the multi-stage evidence chain in tumor diagnosis and decomposes 3D CT reasoning into four stages: recognition, measurement, visual reasoning, and medical reasoning. Higher-level questions remain independently scorable, while their ground-truth evidence chains are defined over lower-level primitives. The benchmark contains 476K questions across 42 clinical subtypes on 9,262 3D CT volumes. In addition to a direct reasoning mode for VLMs, DeepTumorVQA provides tool-interaction environments for agent evaluation, where a model can call external tools, including segmentation models, measurement programs, and medical knowledge modules, before answering the question. Evaluating over 30 model configurations, we find that reliable quantitative measurement is the primary bottleneck, making later-stage visual and medical reasoning harder for VLMs, while tool augmentation substantially mitigates this issue. When tools are available, leveraging medical knowledge and tools to reason about medical images becomes a new challenge. We further show that ground-truth step-by-step tool-use traces from DeepTumorVQA can supervise agents and reduce tool-use and reasoning failures. This stage-wise progression from recognition to measurement to visual and medical reasoning provides a concrete roadmap for future medical VLM and AI agent studies. All data and code are released at https://github.com/Schuture/DeepTumorVQA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper introduces DeepTumorVQA, a hierarchical 3D CT VQA benchmark that decomposes tumor diagnosis into four stages (recognition, measurement, visual reasoning, medical reasoning) with independently scorable higher-level questions whose ground-truth evidence chains build on lower-level primitives. It generates 476K questions across 42 subtypes from 9,262 volumes, supports both direct VLM evaluation and tool-augmented agent environments (with segmentation, measurement, and knowledge tools), and evaluates over 30 model configurations to conclude that quantitative measurement is the primary bottleneck while tool augmentation substantially mitigates it; ground-truth tool-use traces are also released for agent supervision.

Significance. If the four-stage decomposition is a faithful model of clinical workflows, the benchmark provides a valuable, actionable roadmap for medical VLM and agent development by isolating specific failure points rather than collapsing performance into a single score. Strengths include the large scale, release of code and data for reproducibility, extensive empirical testing across model configurations, and the provision of tool-interaction environments that enable falsifiable comparisons of direct vs. augmented reasoning.

major comments (1)
  1. [Benchmark construction] Benchmark construction (implied in §3 and abstract): limited detail is provided on the question generation pipeline, annotation validation procedures (e.g., inter-annotator agreement or expert review), and potential selection effects or biases across the 42 subtypes. Because the central claim—that measurement is the primary bottleneck and tools mitigate later-stage failures—rests on the stages forming a valid, unbiased evidence chain matching real clinical tumor diagnosis, explicit validation metrics or sensitivity analyses for these aspects are needed to confirm the reported accuracy patterns are not artifacts of question design.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the recommendation for revision. We appreciate the focus on benchmark construction and will revise the manuscript to provide greater transparency on the question generation process, validation steps, and potential biases.

Point-by-point responses
  1. Referee: Benchmark construction (implied in §3 and abstract): limited detail is provided on the question generation pipeline, annotation validation procedures (e.g., inter-annotator agreement or expert review), and potential selection effects or biases across the 42 subtypes. Because the central claim—that measurement is the primary bottleneck and tools mitigate later-stage failures—rests on the stages forming a valid, unbiased evidence chain matching real clinical tumor diagnosis, explicit validation metrics or sensitivity analyses for these aspects are needed to confirm the reported accuracy patterns are not artifacts of question design.

    Authors: We agree that expanding the description of benchmark construction will strengthen the paper. In the revised §3 we will add a detailed account of the question generation pipeline: questions are programmatically derived from expert-annotated 3D CT volumes and tumor masks in the source public datasets (with the four-stage hierarchy explicitly encoded so that higher-level questions reference lower-level primitives). Because generation is deterministic and rule-based from these existing annotations, traditional inter-annotator agreement on the VQA pairs themselves is not applicable; we will clarify this and note that the underlying CT annotations follow standard radiological protocols. To address potential selection effects, we will include (i) the exact subtype distribution across the 9,262 volumes, (ii) a sensitivity analysis showing that the measurement-bottleneck pattern holds when subsets of subtypes are held out, and (iii) a brief discussion of how the 42 subtypes were chosen to reflect clinical prevalence. These additions will make the evidence chain and its robustness explicit without altering the reported empirical findings. revision: yes
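The promised hold-out sensitivity analysis could take roughly the following shape: recompute per-stage error rates with each subtype excluded in turn and check that the measurement-bottleneck pattern persists. A sketch under an assumed result-row format (stage, subtype, correct); this is not the authors' released code.

```python
# Sketch of a subtype hold-out sensitivity check. The result-row format
# ("stage", "subtype", "correct") is an assumption, not the released code.
from collections import defaultdict

def stage_error_rates(rows):
    tally = defaultdict(lambda: [0, 0])  # stage -> [errors, total]
    for r in rows:
        tally[r["stage"]][0] += not r["correct"]
        tally[r["stage"]][1] += 1
    return {stage: errs / total for stage, (errs, total) in tally.items()}

def holdout_sensitivity(results):
    """Report per-stage error rates with each subtype excluded in turn.

    The bottleneck pattern is robust if measurement error stays high
    relative to recognition under every hold-out.
    """
    for held_out in sorted({r["subtype"] for r in results}):
        kept = [r for r in results if r["subtype"] != held_out]
        rates = stage_error_rates(kept)
        print(held_out, {s: round(e, 3) for s, e in rates.items()})
```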

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper constructs a new hierarchical benchmark (DeepTumorVQA) with 476K questions across four explicitly defined stages and reports empirical performance of external VLMs and agents on those questions. The central claim—that quantitative measurement is the primary bottleneck and tool augmentation mitigates it—follows directly from observed accuracy patterns and deltas on independently scored questions, with no internal parameters fitted to the target results, no self-referential definitions, and no load-bearing self-citations that reduce the findings to prior author work by construction. The stage decomposition is a design choice whose validity is an external assumption, not a circular derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work introduces no free parameters, no new physical or mathematical entities, and relies only on standard assumptions about clinical diagnostic workflows and VLM evaluation practices.

axioms (1)
  • domain assumption: Tumor diagnosis in 3D CT follows a multi-stage evidence chain from recognition through measurement to visual and medical reasoning.
    This assumption structures the benchmark hierarchy and is taken from standard medical practice without independent verification in the paper.

pith-pipeline@v0.9.0 · 5653 in / 1285 out tokens · 78715 ms · 2026-05-12T03:58:24.898408+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

77 extracted references · 77 canonical work pages · 4 internal anchors

  1. [1]

3dland: 3d lesion abdominal anomaly localization dataset. arXiv preprint arXiv:2602.12820, 2026

    Mehran Advand, Zahra Dehghanian, Navid Faraji, Reza Barati, Seyed Amir Ahmad Safavi-Naini, and Hamid R Rabiee. 3dland: 3d lesion abdominal anomaly localization dataset. arXiv preprint arXiv:2602.12820, 2026

  2. [2]

    Differentiating renal neoplasms from simple cysts on contrast-enhanced ct on the basis of attenuation and homogeneity.American Journal of Roentgenology, 208(4):801–804, 2017

    Nnenaya Agochukwu, Steffen Huber, Michael Spektor, Alexander Goehler, and Gary M Israel. Differentiating renal neoplasms from simple cysts on contrast-enhanced ct on the basis of attenuation and homogeneity.American Journal of Roentgenology, 208(4):801–804, 2017

  3. [3]

    Cystic lesions of the pancreas.Annals of Surgery, 254:680–686, 2011

    Peter J Allen, Michael D’Angelica, Mithat Gonen, David P Jaques, Daniel G Coit, William R Jarnagin, Ronald DeMatteo, Yuman Fong, Leslie H Blumgart, and Murray F Brennan. Cystic lesions of the pancreas.Annals of Surgery, 254:680–686, 2011

  4. [4]

    Medagentsim: Self-evolving multi-agent simulations for realistic clinical interactions

    Mohammad Almansoori, Komal Kumar, and Hisham Cholakkal. Medagentsim: Self-evolving multi-agent simulations for realistic clinical interactions. InInternational Conference on Medical Image Computing and Computer-Assisted Intervention, pages 362–372. Springer, 2025

  5. [5]

    Ajcc cancer staging manual, 2017

    American Joint Committee on Cancer. Ajcc cancer staging manual, 2017

  6. [6]

    The medical segmentation decathlon.arXiv preprint arXiv:2106.05735, 2021

    Michela Antonelli, Annika Reinke, Spyridon Bakas, Keyvan Farahani, Bennett A Landman, Geert Litjens, Bjoern Menze, Olaf Ronneberger, Ronald M Summers, Bram van Ginneken, et al. The medical segmentation decathlon.arXiv preprint arXiv:2106.05735, 2021

  7. [7]

M3d: Advancing 3d medical image analysis with multi-modal large language models

    Fan Bai, Yuxin Du, Tiejun Huang, Max Q-H Meng, and Bo Zhao. M3d: Advancing 3d medical image analysis with multi-modal large language models.arXiv preprint arXiv:2404.00578, 2024

  8. [8]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report.arXiv preprint arXiv:2502.13923, 2025

  9. [9]

    Radgpt: Constructing 3d image-text tumor datasets.arXiv preprint arXiv:2501.04678, 2025

Pedro RAS Bassi, Mehmet Can Yavuz, Kang Wang, Xiaoxi Chen, Wenxuan Li, Sergio Decherchi, Andrea Cavalli, Yang Yang, Alan Yuille, and Zongwei Zhou. Radgpt: Constructing 3d image-text tumor datasets. arXiv preprint arXiv:2501.04678, 2025

  10. [10]

    Determination of splenomegaly by ct: is there a place for a single measurement?American Journal of Roentgenology, 184(5):1510–1513, 2005

    Arlindo Sprenger Bezerra, Giuseppe D’Ippolito, Luzia Akiko Frias, Mauro Ozaki, Jacob Szejnfeld, and Marcelo Fuchs. Determination of splenomegaly by ct: is there a place for a single measurement?American Journal of Roentgenology, 184(5):1510–1513, 2005

  11. [11]

    The liver tumor segmentation benchmark (lits).arXiv preprint arXiv:1901.04056, 2019

Patrick Bilic, Patrick Ferdinand Christ, Eugene Vorontsov, Grzegorz Chlebus, Hao Chen, Qi Dou, Chi-Wing Fu, Xiao Han, Pheng-Ann Heng, Jürgen Hesser, et al. The liver tumor segmentation benchmark (lits). arXiv preprint arXiv:1901.04056, 2019

  12. [12]

    Merlin: A vision language foundation model for 3d computed tomography

    Louis Blankemeier, Joseph Paul Cohen, Ashwin Kumar, Dave Van Veen, Syed Jamal Safdar Gardezi, Magdalini Paschali, Zhihong Chen, Jean-Benoit Delbrouck, Eduardo Reis, Cesar Truyts, et al. Merlin: A vision language foundation model for 3d computed tomography. Research Square, pages rs–3, 2024

  13. [13]

HuatuoGPT-Vision: Towards Injecting Medical Visual Knowledge into Multimodal LLMs at Scale

    Junying Chen, Ruyi Gui, Anningzhe Chi, Anqi Peng, Zhongzhi Zhang, Xiang Wan, et al. Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale. arXiv preprint arXiv:2406.19280, 2024

  14. [14]

Meissa: Multi-modal medical agentic intelligence

    Yixiong Chen, Pedro R. A. S. Bassi, Xinze Zhou, Wenjie Xiao, Zongwei Zhou, and Alan Yuille. Meissa: Multi-modal medical agentic intelligence. arXiv preprint arXiv:2603.09018, 2026

  15. [15]

    Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks

    Zhe Chen, Jiannan Wu, Wenhai Wang, Weijie Su, Guo Chen, Sen Xing, Muyan Zhong, Qinglong Zhang, Xizhou Zhu, Lewei Lu, et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024

  16. [16]

    Rsna 2023 abdominal trauma detection, 2023

    Errol Colak, Hui-Ming Lin, Robyn Ball, Melissa Davis, Adam Flanders, Sabeena Jalal, Kirti Magudia, Brett Marinelli, Savvas Nicolaou, Luciano Prevedello, Jeff Rudie, George Shih, Maryam Vazirabad, and John Mongan. Rsna 2023 abdominal trauma detection, 2023. URL https://kaggle.com/competitions/rsna-2023-abdominal-trauma-detection

  17. [17]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning.arXiv preprint arXiv:2307.08691, 2023

  18. [18]

    Medrax: Medical reasoning agent for chest x-ray

    Adibvafa Eltawil, Fares Cheema, Yong Suk Paul Shin, Jonathan Rubin, Sheyang Chen, et al. Medrax: Medical reasoning agent for chest x-ray. InProceedings of the International Conference on Machine Learning, 2025

  19. [19]

    Medchain: Bridging the gap between llm agents and clinical practice through interactive sequential benchmarking

    Jie Gao, Yuanyuan Chen, Guozheng Wang, Hongyi Li, Jianye Yang, et al. Medchain: Bridging the gap between llm agents and clinical practice through interactive sequential benchmarking. InAdvances in Neural Information Processing Systems, 2025

  20. [20]

    Gemini 3 pro model card

    Google DeepMind. Gemini 3 pro model card. Google DeepMind Model Card.

  21. [21]

    URL https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Model-Card.pdf

  22. [22]

Medgemma: Our most capable open models for health ai development

    Google Research. Medgemma: Our most capable open models for health ai development. Google Research Blog, 2025. URL https://research.google/blog/medgemma-our-most-capable-open-models-for-health-ai-development/

  23. [23]

    Computed tomography evaluation of pancreatic steatosis: correlation with covid-19 prognosis.Future Virology, 17(4): 231–237, 2022

    Serkan Guneyli, Hakan Dogan, Omer Tarik Esengur, and Hur Hassoy. Computed tomography evaluation of pancreatic steatosis: correlation with covid-19 prognosis.Future Virology, 17(4): 231–237, 2022

  24. [24]

    Developing generalist foundation models from a multimodal dataset for 3d computed tomography.arXiv preprint arXiv:2403.17834, 2024

    Ibrahim Ethem Hamamci, Sezgin Er, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Muhammed Furkan Dasdelen, Omer Faruk Durugol, Bastian Wittmann, Tamaz Amiranashvili, et al. Developing generalist foundation models from a multimodal dataset for 3d computed tomography.arXiv preprint arXiv:2403.17834, 2024

  25. [25]

    Diagnosis of cirrhosis based on regional changes in hepatic morphology: a radiological and pathological analysis.Radiology, 135(2):273–283, 1980

    William P Harbin, Nicholas J Robert, and Joseph T Ferrucci Jr. Diagnosis of cirrhosis based on regional changes in hepatic morphology: a radiological and pathological analysis.Radiology, 135(2):273–283, 1980

  26. [26]

An international challenge to use artificial intelligence to define the state-of-the-art in kidney and kidney tumor segmentation in ct imaging, 2020

    Nicholas Heller, Sean McSweeney, Matthew Thomas Peterson, Sarah Peterson, Jack Rickman, Bethany Stai, Resha Tejpaul, Makinna Oestreich, Paul Blake, Joel Rosenberg, et al. An international challenge to use artificial intelligence to define the state-of-the-art in kidney and kidney tumor segmentation in ct imaging, 2020

  27. [27]

    LoRA: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In International Conference on Learning Representations, 2022

  28. [28]

Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm

    Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22170–22183, 2024

  29. [29]

    Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation.Advances in Neural Information Processing Systems, 35:36722–36732, 2022

    Yuanfeng Ji, Haotian Bai, Chongjian Ge, Jie Yang, Ye Zhu, Ruimao Zhang, Zhen Li, Lingyan Zhang, Wanling Ma, Xiang Wan, et al. Amos: A large-scale abdominal multi-organ benchmark for versatile medical image segmentation.Advances in Neural Information Processing Systems, 35:36722–36732, 2022

  30. [30]

    Incentivizing tool-augmented thinking with images for medical image analysis.arXiv preprint arXiv:2512.14157, 2025

Yankai Jiang, Yujie Zhang, Peng Zhang, Yichen Li, Jintai Chen, Xiaoming Shi, and Shihui Zhen. Incentivizing tool-augmented thinking with images for medical image analysis. arXiv preprint arXiv:2512.14157, 2025

  31. [31]

    Clevr: A diagnostic dataset for compositional language and elementary visual reasoning

    Justin Johnson, Bharath Hariharan, Laurens Van Der Maaten, Li Fei-Fei, C Lawrence Zitnick, and Ross Girshick. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 2901–2910, 2017

  32. [32]

    Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

    Yubin Kim, Chanwoo Park, Hyewon Jeong, Yik S Chan, Xuhai Xu, Daniel McDuff, Hyeon- hoon Lee, Marzyeh Ghassemi, Cynthia Breazeal, and Hae W Park. Mdagents: An adaptive collaboration of llms for medical decision-making.Advances in Neural Information Processing Systems, 37:79410–79452, 2024

  33. [33]

    Comparison of ct methods for determining the fat content of the liver.American Journal of Roentgenology, 188(5):1307–1312, 2007

    Yuji Kodama, Connie S Ng, Tsung-Teh Wu, Gregory D Ayers, Steven A Curley, Eddie K Abdalla, Jean-Nicolas Vauthey, and Chusilp Charnsangavej. Comparison of ct methods for determining the fat content of the liver.American Journal of Roentgenology, 188(5):1307–1312, 2007

  34. [34]

    Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge

    Bennett Landman, Zhoubing Xu, J Igelsias, Martin Styner, T Langerak, and Arno Klein. Miccai multi-atlas labeling beyond the cranial vault–workshop and challenge. InProc. MICCAI Multi-Atlas Labeling Beyond Cranial Vault—Workshop Challenge, volume 5, page 12, 2015

  35. [35]

    A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1): 1–10, 2018

    Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images.Scientific data, 5(1): 1–10, 2018

  36. [36]

    Mmedagent: Learning to use medical tools with multi-modal agent

    Binxu Li, Tiankai Yan, Yuanting Pan, Jie Luo, Ruiyang Ji, Jiayuan Ding, Zhe Xu, Shilong Liu, Haoyu Dong, Zihao Lin, et al. Mmedagent: Learning to use medical tools with multi-modal agent. InFindings of the Association for Computational Linguistics: EMNLP 2024, pages 8745–8760, 2024

  37. [37]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

  38. [38]

    Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36: 28541–28564, 2023

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day.Advances in Neural Information Processing Systems, 36: 28541–28564, 2023

  39. [39]

    Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning

    Zhuowan Li, Xingrui Wang, Elias Stengel-Eskin, Adam Kortylewski, Wufei Ma, Benjamin Van Durme, and Alan L Yuille. Super-clevr: A virtual benchmark to diagnose domain robustness in visual reasoning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 14963–14973, 2023

  40. [40]

    Clevr-ref+: Diagnosing visual reasoning with referring expressions

    Runtao Liu, Chenxi Liu, Yutong Bai, and Alan L Yuille. Clevr-ref+: Diagnosing visual reasoning with referring expressions. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4185–4194, 2019

  41. [41]

    Word: Revisiting organs segmentation in the whole abdominal region

    Xiangde Luo, Wenjun Liao, Jianghong Xiao, Tao Song, Xiaofan Zhang, Kang Li, Guotai Wang, and Shaoting Zhang. Word: Revisiting organs segmentation in the whole abdominal region. arXiv preprint arXiv:2111.02403, 2021

  42. [42]

    Abdomenct-1k: Is abdominal organ segmentation a solved problem.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021

    Jun Ma, Yao Zhang, Song Gu, Cheng Zhu, Cheng Ge, Yichi Zhang, Xingle An, Congcong Wang, Qiyuan Wang, Xin Liu, et al. Abdomenct-1k: Is abdominal organ segmentation a solved problem.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2021

  43. [43]

    Fast and low-gpu-memory abdomen ct organ segmentation: the flare challenge.Medical Image Analysis, 82:102616, 2022

    Jun Ma, Yao Zhang, Song Gu, Xingle An, Zhihe Wang, Cheng Ge, Congcong Wang, Fan Zhang, Yu Wang, Yinan Xu, et al. Fast and low-gpu-memory abdomen ct organ segmentation: the flare challenge.Medical Image Analysis, 82:102616, 2022

  44. [44]

    Ct-agent: A multimodal-LLM agent for 3d CT radiology question answering, 2025

    Yuren Mao, Wenyi Xu, Yuyang Qin, and Yunjun Gao. Ct-agent: a multimodal-llm agent for 3d ct radiology question answering.arXiv preprint arXiv:2505.16229, 2025

  45. [45]

Nccn clinical practice guidelines in oncology: Pancreatic adenocarcinoma, 2024

    National Comprehensive Cancer Network. Nccn clinical practice guidelines in oncology: Pancreatic adenocarcinoma, 2024. URL https://www.nccn.org/guidelines/guidelines-detail?category=1&id=1455

  46. [46]

    Qwen3 Technical Report

    Qwen Team. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  47. [47]

    Ct-org, a new dataset for multiple organ segmentation in computed tomography.Scientific Data, 7(1): 1–9, 2020

    Blaine Rister, Darvin Yi, Kaushik Shivakumar, Tomomi Nobashi, and Daniel L Rubin. Ct-org, a new dataset for multiple organ segmentation in computed tomography.Scientific Data, 7(1): 1–9, 2020

  48. [48]

    Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation

Holger R Roth, Le Lu, Amal Farag, Hoo-Chang Shin, Jiamin Liu, Evrim B Turkbey, and Ronald M Summers. Deeporgan: Multi-level deep convolutional networks for automated pancreas segmentation. In International conference on medical image computing and computer-assisted intervention, pages 556–564. Springer, 2015

  49. [49]

Agentclinic: A multimodal agent benchmark to evaluate ai in simulated clinical environments

    Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Pontes Reis, Sheshank Yarlagadda, et al. Agentclinic: A multimodal agent benchmark to evaluate ai in simulated clinical environments. arXiv preprint arXiv:2405.07960, 2024

  50. [50]

Medagentbench: Dataset for benchmarking llms as agents in medical applications

    Samuel Schmidgall, Baekjin Kim, Hyunjin Goh, Albert Q. Jiang, Alex Grabowski, Aarav Chauhan, et al. Medagentbench: Dataset for benchmarking llms as agents in medical applications. NEJM AI, 2025

  51. [51]

    Bosniak classification of cystic renal masses, version 2019: an update proposal and needs assessment

    Stuart G Silverman, Ivan Pedrosa, James H Ellis, Nicole M Hindman, Nicola Schieda, Andrew D Smith, Erick M Remer, Atul B Shinagare, Natalie E Curci, Lucia Manganaro, et al. Bosniak classification of cystic renal masses, version 2019: an update proposal and needs assessment. Radiology, 292(2):475–488, 2019

  52. [52]

    Revisions of international consensus fukuoka guidelines for the management of ipmn of the pancreas

Masao Tanaka, Carlos Fernández-del Castillo, Volkan Adsay, Suresh Chari, Massimo Falconi, Jin-Young Jang, Wataru Kimura, Philip Levy, Martha B Pitman, C Max Schmidt, et al. Revisions of international consensus fukuoka guidelines for the management of ipmn of the pancreas. Pancreatology, 12(3):183–197, 2012

  53. [53]

    3d-rad: A large-scale 3d radiology med-vqa dataset

    Xiaoxiao Tang et al. 3d-rad: A large-scale 3d radiology med-vqa dataset. InAdvances in Neural Information Processing Systems, 2025

  54. [54]

    Multi-modal learning from unpaired images: Application to multi-organ segmentation in ct and mri

    Vanya V Valindria, Nick Pawlowski, Martin Rajchl, Ioannis Lavdas, Eric O Aboagye, Andrea G Rockall, Daniel Rueckert, and Ben Glocker. Multi-modal learning from unpaired images: Application to multi-organ segmentation in ct and mri. In2018 IEEE winter conference on applications of computer vision (WACV), pages 547–556. IEEE, 2018

  55. [55]

Totalsegmentator: Robust segmentation of 104 anatomic structures in ct images. Radiology: Artificial Intelligence, 5(5):e230024, 2023

    Jakob Wasserthal, Hanns-Christian Breit, Manfred T Meyer, Maurice Pradella, Daniel Hinck, Alexander W Saez, Tobias Semler, Florian Stritzel, Martin Segeroth, and Joshy Josi. Totalsegmentator: Robust segmentation of 104 anatomic structures in ct images. Radiology: Artificial Intelligence, 5(5):e230024, 2023

  56. [56]

    Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data.arXiv preprint arXiv:2308.02463, 2023

    Chaoyi Wu, Xiaoman Zhang, Ya Zhang, Yanfeng Wang, and Weidi Xie. Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data.arXiv preprint arXiv:2308.02463, 2023

  57. [57]

    Advancing multimodal medical capabilities of gemini.arXiv preprint arXiv:2405.03162, 2024

    Lin Yang, Shawn Xu, Andrew Sellergren, Timo Kohlberger, Yuchen Zhou, Ira Ktena, Atilla Kiraly, Faruk Ahmed, Farhad Hormozdiari, Tiam Jaroensri, et al. Advancing multimodal medical capabilities of gemini.arXiv preprint arXiv:2405.03162, 2024

  58. [58]

    React: Synergizing reasoning and acting in language models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. InInternational Conference on Learning Representations, 2023

  59. [59]

    Medsg-bench: Benchmarking medical image sequences grounding capability of multimodal large language models

    Jingkun Yu, Yixiao Ge, Rui Wang, Shan Ying, et al. Medsg-bench: Benchmarking medical image sequences grounding capability of multimodal large language models. InAdvances in Neural Information Processing Systems, 2025

  60. [60]

    Computed tomography scans in the evaluation of fatty liver disease in a population based study: the multi-ethnic study of atherosclerosis.Academic radiology, 19(7):811–818, 2012

Irfan Zeb, Dong Li, Khurram Nasir, Ronit Katz, Vahid N Larijani, and Matthew J Budoff. Computed tomography scans in the evaluation of fatty liver disease in a population based study: the multi-ethnic study of atherosclerosis. Academic radiology, 19(7):811–818, 2012

  61. [61]

    Pmc-vqa: Visual instruction tuning for medical visual question answering.arXiv preprint arXiv:2305.10415, 2023

    Xiaoman Zhang, Chaoyi Wu, Ziheng Zhao, Weixiong Lin, Ya Zhang, Yanfeng Wang, and Weidi Xie. Pmc-vqa: Visual instruction tuning for medical visual question answering.arXiv preprint arXiv:2305.10415, 2023

  62. [62]

    Ct-bench: A benchmark for multimodal lesion understanding in computed tomography.arXiv preprint arXiv:2602.14879, 2026

Qingqing Zhu, Qiao Jin, Tejas S Mathai, Yin Fang, Zhizheng Wang, Yifan Yang, Maame Sarfo-Gyamfi, Benjamin Hou, Ran Gu, Praveen TS Balamuralikrishna, et al. Ct-bench: A benchmark for multimodal lesion understanding in computed tomography. arXiv preprint arXiv:2602.14879, 2026

  63.–67. [63–67]

    Appendix A fragments (Data Sources, Table 6): the 17 public abdominal CT source datasets: CHAOS [2018] (20 volumes), Pancreas-CT [2015] (42), BTCV [2015] (47), LiTS [2019] (131), CT-ORG [2020] (140), WORD [2021] (120), AMOS22 [2022] (200), KiTS [2020] (489), MSD CT Tasks [2021] (945), AbdomenCT-1K [2021] (1,050), FLARE'23 [2022] (4,100), Trauma Detection [2023] (4,711). All 17 source datasets are publicly available under their respective licenses (links in Table 6); users are responsible for complying with each source dataset's terms of use. The DeepTumorVQA annotations, structured metadata, and benchmark questions contributed by this work are releas…

  68.–72. [68–72]

    Appendix fragments (question generation logic, with reference example QA). Visual Reasoning: pancreas enlargement (Q: Is pancreas enlarged? A: No); liver lesion clustering (top-1 segment ≥50% of lesions: highly clustered; top-2 segments ≥75%: somewhat; else widely; Q: Liver lesions clustered? A: Yes); bilateral kidney asymmetry (largest-lesion-location vote, ties broken by volume ratio >1.3×; Q: Which kidney has more lesions? A: Right); multi-organ burden (Cro…). Medical Reasoning: fatty liver grade (Q: Fatty liver grade? A: Light); hepatic steatosis grade (4-grade HU thresholds 58/51/39 [32]; Q: Steatosis grade? A: Moderate); pancreatic steatosis (P/S HU ratio <0.7 [22]; Q: Pancreatic steatosis? A: No); splenomegaly grading (4-grade thresholds 314.5/500/800 cm³ [10]; Q: Splenomegaly grade? A: Mild); PDAC vs. PNET (Q: PDAC or PNET? A: PDAC); portal hypertension (splenic volume + liver HU composite [24]; Q: Portal hypertension? A: Yes); renal mass characterization (simplified Bosniak: HU ≤20 simple cyst, ≥70 hyperattenuating, else indeterminate/solid; Q: Mass classification? A: Simple cyst); lesion type classification (largest-volume kidney lesion, cyst vs. tumor label from radiologist annotation; Q: Cyst or tumor? A: Tumor); pseudocyst determination (HU >14.5 threshold [3]; Q: Is it a pseudocyst? A: Yes); pancreatic T staging (T1–T4 from annotations [5]; Q: Tumor T-stage? A: T2); cyst resectability (cyst volume >3.0 cm³ [51]; Q: Lesion resectable? A: No).

  73. [73]

    Appendix fragment (compositional design vs. template matching): our use of deterministic generation programs invites comparison to CLEVR [30], but DeepTumorVQA differs in three fundamental ways. First, our visual inputs are real CT scans with real pathology, not synthetic scenes; the visual grounding challenge is genuine and domain-specific. Se…

  74. [74]

    Appendix fragment (volume preprocessing): we preprocess each NIfTI volume by (1) clipping HU values to [−1024, 1024]; (2) resizing the volume to [64, 256, 256] using trilinear interpolation (scipy.ndimage.zoom); (3) normalizing to [0, 1] by (x + 1024)/2048. The 3D ViT encoder processes this volume as a sequence of 2×16×16 patches, yielding 256 visual tokens. Resizing from native resolution (typical…
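A minimal runnable sketch of those three steps, assuming nibabel for NIfTI loading (the loading library is our assumption; the clip, resize, and normalize constants are the paper's).

```python
# Sketch of the stated preprocessing: clip HU, resize, normalize.
# nibabel for loading is an assumption; the constants follow the appendix.
import numpy as np
import nibabel as nib
from scipy.ndimage import zoom

def preprocess_volume(path, target_shape=(64, 256, 256)):
    vol = nib.load(path).get_fdata().astype(np.float32)
    vol = np.clip(vol, -1024, 1024)                    # (1) clip HU values
    factors = [t / s for t, s in zip(target_shape, vol.shape)]
    vol = zoom(vol, factors, order=1)                  # (2) trilinear resize
    return (vol + 1024.0) / 2048.0                     # (3) normalize to [0, 1]
```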

  75. [75]

    Appendix fragment (tool API): measure(target, type) returns a ground-truth quantitative measurement for a specified target and measurement type. Input: target (string), the anatomical target; type (string), one of volume, mean_HU, diameter, count. Output: JSON with fields value (float) and unit (string). Supported types: volume (cm³), mean_HU (Hounsfield units), diameter (cm, largest les…

  76. [76]

    Appendix fragment (tool API): lookup_medical_knowledge(query) retrieves clinical criteria from a curated 27-entry knowledge base covering diagnostic thresholds, grading systems, and classification criteria. Input: query (string), the clinical topic (e.g., "fatty liver", "splenomegaly grading"). Output: JSON with field entries (a list of dicts with criterion, threshold, source). Query mat…

  77. [77]

    Appendix fragment (tool API): crop_region(organ) returns organ-focused multi-slice crops for visual inspection (vision mode only). Input: organ (string), the target organ (e.g., "liver", "kidney", "pancreas"). Output: JSON with field image_base64 (string), a base64-encoded PNG of 5 representative axial slices tiled in a single image with abdominal windowing. Crop details: the bounding bo…
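A minimal sketch of how an agent turn could call this interface. The signatures above are the documented ones; the stub bodies, return values, and dispatch scaffolding below are our assumptions for illustration.

```python
# Sketch of tool dispatch against the documented signatures.
# Stub bodies and values are illustrative assumptions.
import json

def measure(target: str, type: str) -> str:
    # A real tool would return the ground-truth measurement for this volume.
    return json.dumps({"value": 420.0, "unit": "cm3"})

def lookup_medical_knowledge(query: str) -> str:
    # A real tool would query the curated 27-entry knowledge base.
    return json.dumps({"entries": [{"criterion": "splenomegaly grading",
                                    "threshold": "314.5/500/800 cm3",
                                    "source": "[10]"}]})

TOOLS = {"measure": measure, "lookup_medical_knowledge": lookup_medical_knowledge}

def run_tool(action: dict) -> dict:
    """Dispatch one {"tool": ..., "args": {...}} action emitted by the agent."""
    return json.loads(TOOLS[action["tool"]](**action["args"]))

# e.g. grading splenomegaly from a measurement plus retrieved criteria:
volume = run_tool({"tool": "measure", "args": {"target": "spleen", "type": "volume"}})
criteria = run_tool({"tool": "lookup_medical_knowledge",
                     "args": {"query": "splenomegaly grading"}})
print(volume["value"], criteria["entries"][0]["threshold"])
```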