pith. machine review for the scientific record.

arxiv: 2604.13756 · v1 · submitted 2026-04-15 · 💻 cs.CL · cs.CV

Recognition: unknown

MedRCube: A Multidimensional Framework for Fine-Grained and In-Depth Evaluation of MLLMs in Medical Imaging

Chenhui Zhang, Fangke Chen, Jiajie Peng, Licheng Bao, Wei Chen, Zhijie Bao, Zhongyu Wei

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 12:43 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords MedRCube · MLLM evaluation · medical imaging · reasoning credibility · shortcut behavior · fine-grained assessment · multimodal models · clinical reliability

The pith

MedRCube is a multidimensional evaluation framework for MLLMs in medical imaging that reveals previously hidden insights and a strong link between shortcut reasoning and higher diagnostic scores.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedRCube to evaluate multimodal large language models on medical imaging tasks along multiple fine-grained dimensions instead of a single overall score. It builds the benchmark through a two-stage pipeline meant to match actual clinical practice and adds a dedicated subset that measures the credibility of the models' reasoning steps. When run on 33 MLLMs, the framework shows Lingshu-32B performing at the top while exposing patterns that simpler tests had missed. It also finds a clear positive association between models taking shortcuts and their success on diagnostic tasks, which raises questions about how trustworthy these systems would be in real medical settings.

Core claim

MedRCube is a multidimensional framework for fine-grained evaluation of MLLMs in medical imaging, constructed via a two-stage systematic pipeline. It benchmarks 33 models with Lingshu-32B achieving top-tier results across dimensions, exposes insights unavailable under prior coarse-grained metrics, and through its credibility evaluation subset demonstrates a highly significant positive association between shortcut behavior and diagnostic task performance.

What carries the argument

The two-stage systematic construction pipeline that generates aligned evaluation dimensions and a credibility subset to quantify reasoning reliability.

If this is right

  • Coarse single-metric evaluations fail to capture critical aspects of model behavior needed for clinical support.
  • Shortcut behavior shows a highly significant positive association with diagnostic task performance.
  • Lingshu-32B ranks highest across the fine-grained dimensions of the framework.
  • Credibility checks are required to assess reliability for trustworthy clinical deployment.
  • Multidimensional evaluation uncovers reasoning patterns that prior methods could not detect.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the shortcut association holds across more models, training objectives may need explicit penalties against shortcut learning to improve reliability.
  • Medical deployment decisions should prioritize models that pass credibility tests over those with high diagnostic scores alone.
  • The framework could be adapted to test whether the same shortcut-performance link appears in non-medical multimodal tasks.

Load-bearing premise

The two-stage pipeline produces evaluation dimensions that accurately reflect real-world medical imaging practice and the credibility subset measures reasoning without its own biases.

What would settle it

Re-running the credibility subset on a new set of MLLMs and finding no significant correlation between shortcut behavior and diagnostic performance would undermine the reported association.
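The settling experiment described above amounts to a correlation test across a fresh model set. A minimal sketch of that test, in pure Python — the scoring values, model count, and the choice of Pearson correlation with a permutation p-value are all illustrative assumptions, not details taken from the paper:

```python
import random
import statistics


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def permutation_pvalue(xs, ys, n_perm=10_000, seed=0):
    """Two-sided permutation test for the null of no association:
    shuffle one variable and count how often the shuffled correlation
    is at least as extreme as the observed one."""
    rng = random.Random(seed)
    observed = abs(pearson(xs, ys))
    ys = list(ys)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(ys)
        if abs(pearson(xs, ys)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


if __name__ == "__main__":
    # One row per re-benchmarked model: (shortcut score, diagnostic accuracy).
    # These numbers are placeholders, not results from MedRCube.
    shortcut = [0.12, 0.35, 0.41, 0.22, 0.58, 0.30, 0.49, 0.15]
    accuracy = [0.61, 0.72, 0.75, 0.66, 0.81, 0.70, 0.78, 0.63]
    r = pearson(shortcut, accuracy)
    p = permutation_pvalue(shortcut, accuracy)
    print(f"r = {r:.3f}, permutation p = {p:.4f}")
```

A non-significant p-value on a new model cohort would be the outcome that undermines the reported association; a replicated significant positive r would strengthen it.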

Figures

Figures reproduced from arXiv: 2604.13756 by Chenhui Zhang, Fangke Chen, Jiajie Peng, Licheng Bao, Wei Chen, Zhijie Bao, Zhongyu Wei.

Figure 1
Figure 1: Paradigm shift required for evaluating multimodal large language models in medical imaging. …support tools and to incorporate them into radiology workflows (Hou et al., 2025; Wada et al., 2025). Given the safety-critical nature of clinical decision-making, this development highlights the need for systematic and rigorous evaluation frameworks that are aligned with the requirements and constraints of real-w… view at source ↗
Figure 2
Figure 2: Overview of MedRCube. The framework evaluates models along Anatomical, Modality, and Task… view at source ↗
Figure 3
Figure 3: Correlation Analysis. (a) Hierarchical cluster… view at source ↗
Figure 4
Figure 4: The left panel shows the distribution of models with a quadrant of Rationality score and accuracy. The… view at source ↗
Figure 5
Figure 5: Statistical representation of samples in MedRCube. view at source ↗
Figure 6
Figure 6: Illustrative examples of perceptual and semantic tasks in MedRCube. (a) Modality Recognition: Classifies… view at source ↗
Figure 7
Figure 7: Illustrative examples of semantic and cognitive reasoning tasks in MedRCube. (a) Region of Interest… view at source ↗
read the original abstract

The potential of Multimodal Large Language Models (MLLMs) in the domain of medical imaging raises the demand for systematic and rigorous evaluation frameworks that are aligned with real-world medical imaging practice. Existing practices that report single or coarse-grained metrics lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms. To address this, we propose a paradigm shift toward multidimensional, fine-grained and in-depth evaluation. Based on a two-stage systematic construction pipeline designed for this paradigm, we instantiate it with MedRCube. We benchmark 33 MLLMs, with Lingshu-32B achieving top-tier performance. Crucially, MedRCube exposes a series of pronounced insights inaccessible under prior evaluation settings. Furthermore, we introduce a credibility evaluation subset to quantify reasoning credibility, uncovering a highly significant positive association between shortcut behavior and diagnostic task performance and raising concerns for clinically trustworthy deployment. The resources of this work can be found at https://github.com/F1mc/MedRCube.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 0 minor

Summary. The manuscript presents MedRCube, a multidimensional framework for fine-grained evaluation of MLLMs in medical imaging, constructed via a two-stage systematic pipeline. It benchmarks 33 MLLMs (with Lingshu-32B achieving top performance), claims to expose insights inaccessible under prior coarse metrics, and introduces a credibility evaluation subset that reveals a highly significant positive association between shortcut behavior and diagnostic task performance.

Significance. If the framework alignment and the reported association are substantiated, the work could meaningfully advance evaluation practices for medical MLLMs by moving beyond single or coarse metrics toward reliability-focused, granular analysis. The broad benchmarking of 33 models provides a useful empirical reference, and the association finding would highlight risks for clinical deployment if confirmed. The contribution is limited by the absence of validation evidence for the pipeline and subset.

major comments (3)
  1. [Credibility evaluation subset] The subset used to quantify reasoning credibility and the headline finding of a highly significant positive association between shortcut behavior and diagnostic performance lack any reported details on construction criteria, proxy measures (e.g., attention patterns or error types), inter-rater reliability, or external validation against clinical ground truth. This association is load-bearing for the central claim of new insights inaccessible to prior settings.
  2. [Two-stage systematic construction pipeline] The assertion that the pipeline yields dimensions aligned with real-world medical imaging practice is presented without supporting evidence such as expert review, alignment metrics, or comparison to clinical workflows. This alignment is foundational to the framework's claimed advantages over existing practices.
  3. [Benchmarking and statistical analysis] The results from evaluating 33 MLLMs and the statistical controls for the association are stated without specifics on data splits, metric validation procedures, confounder controls (e.g., model scale or training data overlap), or error analysis, leaving the empirical findings unverified.
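The confounder control the third comment asks for has a standard minimal form: regress both variables on a proxy for the confounder (here, log parameter count as a stand-in for model scale) and correlate the residuals, i.e., compute a partial correlation. A sketch under those assumptions — the data and the choice of proxy are illustrative, not from the paper:

```python
import math
import statistics


def residuals(ys, xs):
    """Residuals of a simple least-squares regression of ys on xs."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def partial_correlation(a, b, confounder):
    """Correlation of a and b after removing the confounder's linear effect.
    Residuals have zero mean, so the plain cosine of the residual vectors
    equals their Pearson correlation."""
    ra, rb = residuals(a, confounder), residuals(b, confounder)
    sa = sum(x * x for x in ra) ** 0.5
    sb = sum(x * x for x in rb) ** 0.5
    return sum(x * y for x, y in zip(ra, rb)) / (sa * sb)


if __name__ == "__main__":
    params_b = [7, 8, 13, 32, 34, 72]          # model sizes in billions (made up)
    scale = [math.log(p) for p in params_b]
    shortcut = [0.20, 0.25, 0.31, 0.44, 0.40, 0.55]
    accuracy = [0.62, 0.64, 0.69, 0.77, 0.74, 0.83]
    print(f"partial r = {partial_correlation(shortcut, accuracy, scale):.3f}")
```

If the shortcut–accuracy association shrinks toward zero once scale is partialed out, the raw correlation was largely a size effect; if it survives, the referee's confound is addressed for that variable.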

Simulated Author's Rebuttal

3 responses · 2 unresolved

We thank the referee for the insightful and constructive comments on our manuscript. We address each of the major comments in detail below, providing clarifications and outlining planned revisions to enhance the rigor and transparency of the work.

read point-by-point responses
  1. Referee: Credibility evaluation subset: The subset used to quantify reasoning credibility and the headline finding of a highly significant positive association between shortcut behavior and diagnostic performance lack any reported details on construction criteria, proxy measures (e.g., attention patterns or error types), inter-rater reliability, or external validation against clinical ground truth. This association is load-bearing for the central claim of new insights inaccessible to prior settings.

    Authors: We recognize the importance of providing comprehensive details on the credibility evaluation subset to support the association finding. In the revised manuscript, we will add a new subsection detailing the construction criteria, including how the subset was selected from the broader dataset. We will specify the proxy measures for shortcut behavior, such as reliance on superficial patterns in image descriptions or error types in reasoning chains. Inter-rater reliability will be reported based on the agreement among annotators. For the statistical analysis, we will include more details on controls for confounders like model size. However, external validation against clinical ground truth was beyond the scope of this study and would require additional expert input; we will explicitly state this limitation and its implications for the findings. revision: partial

  2. Referee: Two-stage systematic construction pipeline: The assertion that the pipeline yields dimensions aligned with real-world medical imaging practice is presented without supporting evidence such as expert review, alignment metrics, or comparison to clinical workflows. This alignment is foundational to the framework's claimed advantages over existing practices.

    Authors: The two-stage pipeline was designed based on a systematic review of medical imaging evaluation literature and clinical practice guidelines to ensure relevance. To substantiate the alignment claim, we will include in the revision additional justification, such as mappings to standard clinical workflows and any quantitative alignment metrics computed during development. If formal expert review was not conducted, we will note this and discuss how the dimensions were validated internally. This will better support the framework's advantages. revision: yes

  3. Referee: Benchmarking and statistical analysis: The results from evaluating 33 MLLMs and the statistical controls for the association are stated without specifics on data splits, metric validation procedures, confounder controls (e.g., model scale or training data overlap), or error analysis, leaving the empirical findings unverified.

    Authors: We agree that more specifics are needed for the benchmarking results to allow full verification. The revised version will include explicit information on the data splits (e.g., train/test proportions if applicable, though evaluation is zero-shot in many cases), metric validation procedures, confounder controls including model scale and training data considerations, and a detailed error analysis. These elements will be presented in an expanded results section with supplementary tables if necessary. revision: yes

standing simulated objections (unresolved)
  • External validation of the credibility evaluation subset against clinical ground truth
  • Formal expert review or quantitative alignment metrics for the two-stage pipeline construction

Circularity Check

0 steps flagged

No significant circularity detected in MedRCube derivation or findings

full rationale

The paper constructs MedRCube via an independent two-stage systematic pipeline for multidimensional evaluation aligned with medical practice. The credibility evaluation subset is introduced separately to quantify reasoning credibility, and the reported highly significant positive association between shortcut behavior and diagnostic task performance is presented as an empirical result obtained by benchmarking 33 MLLMs. No equations, self-citations, or definitional steps are shown that reduce any claim to its own inputs by construction. The framework and association remain self-contained as construction plus external benchmarking rather than tautological or fitted outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

Based on abstract only; central claim rests on the premise that prior single/coarse metrics are insufficient and that the new pipeline and subset deliver aligned, credible insights. No explicit free parameters named. Invented entities are the framework and subset themselves.

axioms (2)
  • domain assumption Existing practices that report single or coarse-grained metrics lack the granularity required for specialized clinical support and fail to assess the reliability of reasoning mechanisms.
    Directly stated in the abstract as the motivation for the new paradigm.
  • ad hoc to paper A two-stage systematic construction pipeline produces a framework aligned with real-world medical imaging practice.
    The paper instantiates MedRCube based on this pipeline.
invented entities (2)
  • MedRCube framework no independent evidence
    purpose: To enable multidimensional, fine-grained and in-depth evaluation of MLLMs in medical imaging.
    Newly proposed and instantiated in this work.
  • Credibility evaluation subset no independent evidence
    purpose: To quantify reasoning credibility and uncover associations with shortcut behavior.
    Introduced as part of MedRCube in this paper.

pith-pipeline@v0.9.0 · 5501 in / 1575 out tokens · 39680 ms · 2026-05-10T12:43:10.494332+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

86 extracted references · 26 canonical work pages · 5 internal anchors

  1. [1]

    Marah Abdin and 1 others. 2024. https://arxiv.org/abs/2404.14219 Phi-3 technical report: A highly capable language model locally on your phone . Preprint, arXiv:2404.14219. Updated version covers Phi-3.5 Vision

  2. [2]

    Walid Al-Dhabyani, Mohammed Gomaa, Hussien Khaled, and Aly Fahmy. 2020. https://doi.org/10.1016/j.dib.2019.104863 Dataset of breast ultrasound images . Data in Brief, 28:104863

  3. [3]

    Anthropic. 2025. https://www.anthropic.com/news/claude-opus-4-5 Claude opus 4.5

  4. [4]

    Amanullah Asraf and Zabirul Islam. 2021. https://doi.org/10.17632/jctsfj2sfn.1 COVID19 , pneumonia and normal chest X -ray PA dataset . https://doi.org/10.17632/jctsfj2sfn.1

  5. [5]

    Seongsu Bae, Daeun Kyung, Jaehee Ryu, Eunbyeol Cho, Gyubok Lee, Sunjun Kweon, Jungwoo Oh, Lei Ji, Eric Chang, Tackeun Kim, and 1 others. 2024. Ehrxqa: A multi-modal question answering dataset for electronic health records with chest x-ray images. Advances in Neural Information Processing Systems, 36

  6. [6]

    Spyridon Bakas, Hamed Akbari, Aristeidis Sotiras, Michel Bilello, Martin Rozycki, Justin S Kirby, John B Freymann, Keyvan Farahani, and Christos Davatzikos. 2017. Advancing the cancer genome atlas glioma mri collections with expert segmentation labels and radiomic features. Scientific data, 4(1):1--13

  7. [7]

    Spyridon Bakas, Mauricio Reyes, Andras Jakab, Stefan Bauer, Markus Rempfler, Alessandro Crimi, Russell Takeshi Shinohara, Christoph Berger, Sung Min Ha, Martin Rozycki, and 1 others. 2018. Identifying the best machine learning algorithms for brain tumor segmentation, progression assessment, and overall survival prediction in the brats challenge. arXiv pre...

  8. [8]

    Asma Ben Abacha, Mourad Sarrouti, Dina Demner-Fushman, Sadid A. Hasan, and Henning Müller. 2021. Overview of the vqa-med task at imageclef 2021: Visual question answering and generation in the medical domain. In CLEF 2021 Working Notes, CEUR Workshop Proceedings, Bucharest, Romania. CEUR-WS.org

  9. [9]

    Christian Bluethgen, Dave Van Veen, Cyril Zakka, Katherine E Link, Aaron Hunter Fanous, Roxana Daneshjou, Thomas Frauenfelder, Curtis P Langlotz, Sergios Gatidis, and Akshay Chaudhari. 2025. Best practices for large language models in radiology. Radiology, 315(1):e240528

  10. [10]

    Victor M Campello, Polyxeni Gkontra, Cristian Izquierdo, Carlos Martin-Isla, Alireza Sojoudi, Peter M Full, Klaus Maier-Hein, Yao Zhang, Zhiqiang He, Jun Ma, and 1 others. 2021. Multi-centre, multi-vendor and multi-disease cardiac segmentation: the m&ms challenge. IEEE Transactions on Medical Imaging, 40(12):3543--3554

  11. [11]

    Sema Candemir, Stefan Jaeger, Kannappan Palaniappan, Jonathan P Musco, Rahul K Singh, Zhiyun Xue, Alexandros Karargyris, Sameer Antani, George Thoma, and Clement J McDonald. 2013. Lung segmentation in chest radiographs using anatomical atlases with nonrigid registration. IEEE transactions on medical imaging, 33(2):577--590

  12. [12]

    Junying Chen, Chi Gui, Ruyi Ouyang, Anningzhe Gao, Shunian Chen, Guiming Hardy Chen, Xidong Wang, Zhenyang Cai, Ke Ji, Xiang Wan, and 1 others. 2024 a . Towards injecting medical visual knowledge into multimodal llms at scale. In Proceedings of the 2024 conference on empirical methods in natural language processing, pages 7346--7370

  13. [13]

    Junying Chen, Ruyi Ouyang, Anningzhe Gao, and 1 others. 2024 b . https://arxiv.org/abs/2406.19280 Huatuogpt-vision, towards injecting medical visual knowledge into multimodal llms at scale . Preprint, arXiv:2406.19280

  14. [14]

    Muhammad EH Chowdhury, Tawsifur Rahman, Amith Khandakar, Rashid Mazhar, Muhammad Abdul Kadir, Zaid Bin Mahbub, Khandakar Reajul Islam, Muhammad Salman Khan, Atif Iqbal, Nasser Al Emadi, and 1 others. 2020. Can ai help in screening viral and covid-19 pneumonia? Ieee Access, 8:132665--132676

  15. [15]

    Joseph Paul Cohen, Paul Morrison, and Lan Dao. 2020. Covid-19 image data collection. arXiv preprint arXiv:2003.11597

  16. [16]

    Google Deepmind. 2025. https://deepmind.google/models/gemini/pro/ Gemini 3 pro

  17. [17]

    Aysen Degerli, Mete Ahishali, Mehmet Yamac, Serkan Kiranyaz, Muhammad EH Chowdhury, Khalid Hameed, Tahir Hamid, Rashid Mazhar, and Moncef Gabbouj. 2021. Covid-19 infection map generation and detection from chest x-ray images. Health information science and systems, 9(1):15

  18. [18]

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, and 1 others. 2024. https://arxiv.org/abs/2407.21783 The llama 3 herd of models . Preprint, arXiv:2407.21783

  19. [19]

    Mengjie Fang, Zipei Wang, Sitian Pan, Xin Feng, Yunpeng Zhao, Dongzhi Hou, Ling Wu, Xuebin Xie, Xu-Yao Zhang, Jie Tian, and 1 others. 2025. Large models in medical imaging: Advances and prospects. Chinese Medical Journal, pages 10--1097

  20. [20]

    Praveen Govi. 2020. CoronaHack -chest X -ray-dataset. https://www.kaggle.com/datasets/praveengovi/coronahack-chest-xraydataset

  21. [21]

    Gowthaman Gunabushanam, Caroline R Taylor, Mahan Mathur, Jamal Bokhari, and Leslie M Scoutt. 2019. Automated test-item generation system for retrieval practice in radiology education. Academic Radiology, 26(6):851--859

  22. [22]

    Alessa Hering, Lasse Hansen, Tony CW Mok, Albert CS Chung, Hanna Siebert, Stephanie Häger, Annkristin Lange, Sven Kuckertz, Stefan Heldmann, Wei Shao, and 1 others. 2022. Learn2reg: comprehensive multi-task medical image registration challenge, dataset and evaluation in the era of deep learning. IEEE Transactions on Medical Imaging, 42(3):697--712

  23. [23]

    Benjamin Hou, Pritam Mukherjee, Vivek Batheja, Kenneth C Wang, Ronald M Summers, and Zhiyong Lu. 2025. One year on: assessing progress of multimodal large language model performance on rsna 2024 case of the day questions. Radiology, 316(2):e250617

  24. [24]

    Murtadha Hssayeni, M Croock, A Salman, H Al-khafaji, Z Yahya, and B Ghoraani. 2020. Computed tomography images for intracranial hemorrhage detection and segmentation. Intracranial hemorrhage segmentation using a deep convolutional model. Data, 5(1):14

  25. [25]

    Yutao Hu, Tianbin Li, Quanfeng Lu, Wenqi Shao, Junjun He, Yu Qiao, and Ping Luo. 2024. Omnimedvqa: A new large-scale comprehensive evaluation benchmark for medical lvlm. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22170--22183

  26. [26]

    Stefan Jaeger, Sema Candemir, Sameer Antani, Yì-Xiáng J Wáng, Pu-Xuan Lu, and George Thoma. 2014. Two public chest x-ray datasets for computer-aided screening of pulmonary diseases. Quantitative imaging in medicine and surgery, 4(6):475

  27. [27]

    Stefan Jaeger, Alexandros Karargyris, Sema Candemir, Les Folio, Jenifer Siegelman, Fiona Callaghan, Zhiyun Xue, Kannappan Palaniappan, Rahul K Singh, Sameer Antani, and 1 others. 2013. Automatic tuberculosis screening using chest radiographs. IEEE transactions on medical imaging, 33(2):233--245

  28. [28]

    Di Jin, Eileen Pan, Nassim Oufattole, Wei-Hung Weng, Hanyi Fang, and Peter Szolovits. 2021. What disease does this patient have? a large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421

  29. [29]

    Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William Cohen, and Xinghua Lu. 2019. Pubmedqa: A dataset for biomedical research question answering. In Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2567--2577

  30. [30]

    Daniel Kermany. 2018. Labeled optical coherence tomography (oct) and chest x-ray images for classification. Mendeley data

  31. [31]

    Daniel S Kermany, Michael Goldbaum, Wenjia Cai, Carolina CS Valentim, Huiying Liang, Sally L Baxter, Alex McKeown, Ge Yang, Xiaokang Wu, Fangbing Yan, and 1 others. 2018. Identifying medical diagnoses and treatable diseases by image-based deep learning. cell, 172(5):1122--1131

  32. [32]

    Sebastian Köhler, Michael Gargano, Nicolas Matentzoglu, Leigh C Carmody, David Lewis-Smith, Nicole A Vasilevsky, Daniel Danis, Ganna Balagura, Gareth Baynam, Amy M Brower, and 1 others. 2021. The human phenotype ontology in 2021. Nucleic acids research, 49(D1):D1207--D1217

  33. [33]

    Zoé Lambert, Caroline Petitjean, Bernard Dubray, and Su Kuan. 2020. Segthor: Segmentation of thoracic organs at risk in ct images. In 2020 Tenth International Conference on Image Processing Theory, Tools and Applications (IPTA), pages 1--6. Ieee

  34. [34]

    Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. 2018. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1--10

  35. [35]

    Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Yang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. 2024. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. In Advances in Neural Information Processing Systems (NeurIPS)

  36. [36]

    Tianwei Lin, Wenqiao Zhang, Sijing Li, and 1 others. 2025. https://arxiv.org/abs/2502.09838 Healthgpt: A medical large vision-language model for unifying comprehension and generation via heterogeneous knowledge adaptation . Preprint, arXiv:2502.09838

  37. [37]

    Bo Liu, Li-Ming Zhan, Li Xu, Lin Ma, Yan Fang Yang, and Xiao-Ming Wu. 2021. https://api.semanticscholar.org/CorpusID:231951663 Slake: A semantically-labeled knowledge-enhanced dataset for medical visual question answering . 2021 IEEE 18th International Symposium on Biomedical Imaging (ISBI), pages 1650--1654

  38. [38]

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2024. Improved baselines with visual instruction tuning. In CVPR. LLaVA-v1.5

  39. [39]

    Jingyu Liu, Jie Lian, and Yizhou Yu. 2020. Chestx-det10: chest x-ray dataset on detection of thoracic abnormalities. arXiv preprint arXiv:2006.10550

  40. [40]

    Kevin Mader. 2017. Finding and measuring lungs in CT data. https://www.kaggle.com/datasets/kmader/finding-lungs-in-ct-data

  41. [41]

    Carlos Martín-Isla, Víctor M Campello, Cristian Izquierdo, Kaisar Kushibar, Carla Sendra-Balcells, Polyxeni Gkontra, Alireza Sojoudi, Mitchell J Fulton, Tewodros Weldebirhan Arega, Kumaradevan Punithakumar, and 1 others. 2023. Deep learning segmentation of the right ventricle in cardiac mri: the m&ms challenge. IEEE Journal of Biomedical and Health ...

  42. [42]

    Bjoern H Menze, Andras Jakab, Stefan Bauer, Jayashree Kalpathy-Cramer, Keyvan Farahani, Justin Kirby, Yuliya Burren, Nicole Porz, Johannes Slotboom, Roland Wiest, and 1 others. 2014. The multimodal brain tumor image segmentation benchmark (brats). IEEE transactions on medical imaging, 34(10):1993--2024

  43. [43]

    National Board of Medical Examiners . 2024. https://www.nbme.org/sites/default/files/2021-02/NBME_Item Philadelphia, PA. Accessed: 2025-10-08

  44. [44]

    OpenAI. 2025. https://platform.openai.com/docs/models/gpt-5.1 Gpt-5.1

  45. [45]

    OpenGVLab. 2025. https://arxiv.org/abs/2508.18265 Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency . Preprint, arXiv:2508.18265

  46. [46]

    Ankit Pal, Jung-Oh Lee, Xiaoman Zhang, Malaikannan Sankarasubbu, Seunghyeon Roh, Won Jung Kim, Meesun Lee, and Pranav Rajpurkar. 2025. Rexvqa: A large-scale visual question answering benchmark for generalist chest x-ray understanding. arXiv preprint arXiv:2506.04353

  47. [47]

    Ankit Pal, Logesh Kumar Umapathi, and Malaikannan Sankarasubbu. 2022. Medmcqa: A large-scale multi-subject multi-choice dataset for medical domain question answering. In Conference on health, inference, and learning, pages 248--260. PMLR

  48. [48]

    Jiazhen Pan, Che Liu, Junde Wu, Fenglin Liu, and 1 others. 2025. https://arxiv.org/abs/2502.19634 Medvlm-r1: Incentivizing medical reasoning capability of vision-language models (vlms) via reinforcement learning . Preprint, arXiv:2502.19634

  49. [49]

    Hieu H Pham, Ngoc H Nguyen, Thanh T Tran, Tuan NM Nguyen, and Ha Q Nguyen. 2023. Pedicxr: an open, large-scale chest radiograph dataset for interpretation of common thoracic diseases in children. Scientific Data, 10(1):240

  50. [50]

    Project Imaging-X Contributors . 2025. https://github.com/uni-medical/Project-Imaging-X Project imaging-x: A survey of 1000+ open-access medical imaging datasets for foundation model development

  51. [51]

    Radiological Society of North America . 2018. RSNA pneumonia detection challenge. https://www.kaggle.com/c/rsna-pneumonia-detection-challenge

  52. [52]

    Radiological Society of North America . 2024. http://radlex.org/ Radlex radiology lexicon . Accessed: 2025-10-08

  53. [53]

    Tawsifur Rahman, Amith Khandakar, Yazan Qiblawey, Anas Tahir, Serkan Kiranyaz, Saad Bin Abul Kashem, Mohammad Tariqul Islam, Somaya Al Maadeed, Susu M Zughaier, Muhammad Salman Khan, and 1 others. 2021. Exploring the effect of image enhancement techniques on covid-19 detection using chest x-ray images. Computers in biology and medicine, 132:104319

  54. [54]

    Pranav Raikokte. 2020. Covid-19 image dataset, 3 way classification- covid-19, viral pneumonia, normal . https://www.kaggle.com/datasets/pranavraikokte/covid19-image-dataset

  55. [55]

    Alvin Rajkomar, Sneha Lingam, Andrew G Taylor, Michael Blum, and John Mongan. 2017. High-throughput classification of radiographs using deep convolutional neural networks. Journal of digital imaging, 30(1):95--101

  56. [56]

    Holger R Roth, Ziyue Xu, Carlos Tor-Díez, Ramon Sanchez Jacob, Jonathan Zember, Jose Molto, Wenqi Li, Sheng Xu, Baris Turkbey, Evrim Turkbey, and 1 others. 2022. Rapid artificial intelligence solutions in a pandemic—the covid-19-20 lung ct lesion segmentation challenge. Medical image analysis, 82:102605

  57. [57]

    Johannes Rückert, Louise Bloch, Raphael Brüngel, Ahmad Idrissi-Yaghir, Henning Schäfer, Cynthia S Schmidt, Sven Koitka, Obioma Pelka, Asma Ben Abacha, Alba G. Seco de Herrera, and 1 others. 2024. Rocov2: Radiology objects in context version 2, an updated multimodal image dataset. Scientific Data, 11(1):688

  58. [58]

    Rebecca Sawyer-Lee, Francisco Gimenez, Assaf Hoogi, and Daniel Rubin. 2016. Curated breast imaging subset of digital database for screening mammography (cbis-ddsm). (No Title)

  59. [59]

    German Gonzalez Serrano. 2019. https://doi.org/10.21227/9bw7-6823 Cad-pe

  60. [60]

    Junji Shiraishi, Shigehiko Katsuragawa, Junpei Ikezoe, Tsuneo Matsumoto, Takeshi Kobayashi, Ken-ichi Komatsu, Mitate Matsui, Hiroshi Fujita, Yoshie Kodera, and Kunio Doi. 2000. Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists' detection of pulmonary nod...

  61. [61]

    Amber L Simpson, Michela Antonelli, Spyridon Bakas, Michel Bilello, Keyvan Farahani, Bram Van Ginneken, Annette Kopp-Schneider, Bennett A Landman, Geert Litjens, Bjoern Menze, and 1 others. 2019. A large annotated medical image dataset for the development and evaluation of segmentation algorithms. arXiv preprint arXiv:1902.09063

  62. [62]

    John Suckling, J Parker, D Dance, S Astley, I Hutt, C Boggis, I Ricketts, E Stamatakis, N Cerneaz, S Kok, and 1 others. 2015. Mammographic image analysis society (mias) database v1. 21. (No Title)

  63. [63]

    He Sun and 1 others. 2024. Meddr: Diagnosis-guided bootstrapping for large-scale medical vision-language learning. arXiv preprint arXiv:2404.15127

  64. [64]

    Siham Tabik, Anabel Gómez-Ríos, José Luis Martín-Rodríguez, Iván Sevillano-García, Manuel Rey-Area, David Charte, Emilio Guirado, Juan-Luis Suárez, Julián Luengo, MA Valero-González, and 1 others. 2020. Covidgr dataset and covid-sdnet methodology for predicting covid-19 based on chest x-ray images. IEEE journal of biomed...

  65. [65]

    Anas M. Tahir, Muhammad E. H. Chowdhury, Yazan Qiblawey, Amith Khandakar, Tawsifur Rahman, Serkan Kiranyaz, Uzair Khurshid, Nabil Ibtehaz, Sakib Mahmud, and Maymouna Ezeddin. 2021 a . https://doi.org/10.34740/kaggle/dsv/3122958 COVID-QU-Ex dataset . https://www.kaggle.com/datasets/anasmohammedtahir/covidqu

  66. [66]

    Anas M Tahir, Muhammad EH Chowdhury, Amith Khandakar, Tawsifur Rahman, Yazan Qiblawey, Uzair Khurshid, Serkan Kiranyaz, Nabil Ibtehaz, M Sohel Rahman, Somaya Al-Maadeed, and 1 others. 2021 b . Covid-19 infection localization and severity grading from chest x-ray images. Computers in biology and medicine, 139:105002

  67. [67]

    Hulu-Med Team. 2025 a . https://arxiv.org/abs/2510.08668 Hulu-med: A transparent generalist model towards holistic medical vision-language understanding . Preprint, arXiv:2510.08668

  68. [68]

    LASA Team and 1 others. 2025. https://arxiv.org/abs/2506.07044 Lingshu: A generalist foundation model for unified multimodal medical understanding and reasoning . Preprint, arXiv:2506.07044

  69. [70]

    Qwen Team. 2025 b . Qwen3-vl technical report. Technical Report. See also Qwen2.5-VL arXiv:2409.12191 if Qwen3 specific paper is unavailable

  70. [71]

    Catalina Tobon-Gomez, Arjan J Geers, Jochen Peters, Jürgen Weese, Karen Pinto, Rashed Karim, Mohammed Ammar, Abdelaziz Daoudi, Jan Margeta, Zulma Sandoval, and 1 others. 2015. Benchmark for algorithms segmenting the left atrium from 3d ct and mri datasets. IEEE transactions on medical imaging, 34(7):1460--1473

  71. [72]

    KAY H VYDARENY, CAROLINE E BLANE, and JUDITH G CALHOUN. 1986. Guidelines for writing multiple-choice questions in radiology courses. Investigative Radiology, 21(11):871--876

  72. [73]

    Akihiko Wada, Yuya Tanaka, Mitsuo Nishizawa, Akira Yamamoto, Toshiaki Akashi, Akifumi Hagiwara, Yayoi Hayakawa, Junko Kikuta, Keigo Shimoji, Katsuhiro Sano, and 1 others. 2025. Retrieval-augmented generation elevates local llm quality in radiology contrast media consultation. npj Digital Medicine, 8(1):395

  73. [74]

    Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. 2017. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2097--2106

  74. [75]

    World Health Organization . 2022. https://icd.who.int/en International Statistical Classification of Diseases and Related Health Problems (11th ed.) . World Health Organization. Released in 2018; officially in effect as of January 2022

  75. [76]

    Chengyue Wu and 1 others. 2024. https://arxiv.org/abs/2410.13848 Janus: Decoupling visual encoding for unified multimodal understanding and generation . Preprint, arXiv:2410.13848. Cite DeepSeek-AI for Janus-Pro updates

  76. [77]

    Yifan Wu, Hayden Gunraj, Chi-en Amy Tai, and Alexander Wong. 2023. Covidx cxr-4: An expanded multi-institutional open-source benchmark dataset for chest x-ray image-based computer-aided covid-19 diagnostics. arXiv preprint arXiv:2311.17677

  77. [78]

    Yiming Xiao, Hassan Rivaz, Matthieu Chabanas, Maryse Fortin, Ines Machado, Yangming Ou, Mattias P Heinrich, Julia A Schnabel, Xia Zhong, Andreas Maier, and 1 others. 2019. Evaluation of mri to ultrasound registration methods for brain shift correction: the curious2018 challenge. IEEE transactions on medical imaging, 39(3):777--786

  78. [79]

    Lin Yang, Shawn Xu, Andrew Sellergren, Timo Kohlberger, Yuchen Zhou, Ira Ktena, Atilla Kiraly, Faruk Ahmed, Farhad Hormozdiari, Tiam Jaroensri, and 1 others. 2024. Advancing multimodal medical capabilities of gemini. arXiv preprint arXiv:2405.03162

  79. [80]

    Xingyi Yang, Xuehai He, Jinyu Zhao, Yichen Zhang, Shanghang Zhang, and Pengtao Xie. 2020. Covid-ct-dataset: a ct scan dataset about covid-19. arXiv preprint arXiv:2003.13865

  80. [81]

    Yuan Yao, Tianyu Yu, Ao Zhang, and 1 others. 2024. https://arxiv.org/abs/2408.01800 Minicpm-v: A gpt-4v level mllm on your phone . Preprint, arXiv:2408.01800. Covers MiniCPM-V 2.6 and MiniCPM-o series

Showing first 80 references.