MedGemma 1.5 Technical Report
Pith reviewed 2026-05-10 19:31 UTC · model grok-4.3
The pith
MedGemma 1.5 4B adds 3D CT/MRI volumes and whole-slide pathology to a single model, with 11% accuracy gain on MRI classification and 47% F1 improvement on pathology.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MedGemma 1.5 4B adds 3D volumetric imaging and whole-slide pathology to a single model through new training data, long-context 3D volume slicing, and whole-slide pathology sampling. Relative to MedGemma 1 4B, the reported gains are 11% accuracy on 3D MRI condition classification and 3% on 3D CT classification (both absolute), 47% macro F1 on whole-slide pathology imaging, 35% IoU on chest X-ray anatomical localization, and 4% macro accuracy on multi-timepoint chest X-ray analysis. Text-based clinical knowledge also improves, by 5% on MedQA and 22% on EHRQA.
What carries the argument
Long-context 3D volume slicing and whole-slide pathology sampling strategies that process high-dimensional inputs within a single unified architecture.
Load-bearing premise
The performance gains arise from the new data and specialized slicing methods rather than overfitting to the chosen evaluation sets.
What would settle it
Running MedGemma 1.5 4B on an independent set of 3D MRI volumes acquired after the model's training cutoff and observing no accuracy improvement over MedGemma 1 would falsify the claim that the reported gains generalize.
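One way to make this test concrete is to score both models on the same independent cases and put an uncertainty interval around the accuracy difference. The sketch below is an illustration only, not a procedure from the paper: the function name `paired_bootstrap_accuracy_gain`, the percentile bootstrap, and the 95% level are assumptions chosen for the example.

```python
import random


def paired_bootstrap_accuracy_gain(labels, preds_old, preds_new,
                                   n_resamples=2000, seed=0):
    """Return (observed_gain, (ci_low, ci_high)) for accuracy(new) - accuracy(old).

    Hypothetical helper: both models must be scored on the same cases,
    so resampling is done over cases (a paired bootstrap).
    """
    n = len(labels)
    if n == 0 or len(preds_old) != n or len(preds_new) != n:
        raise ValueError("labels and both prediction lists must be non-empty and equal length")

    # Per-case change in correctness: +1 (fixed), 0 (unchanged), -1 (broken).
    diffs = [int(pn == y) - int(po == y)
             for y, po, pn in zip(labels, preds_old, preds_new)]
    observed = sum(diffs) / n

    rng = random.Random(seed)
    gains = []
    for _ in range(n_resamples):
        # Resample cases with replacement and recompute the mean gain.
        gains.append(sum(diffs[rng.randrange(n)] for _ in range(n)) / n)
    gains.sort()

    # Percentile interval at an assumed 95% level.
    low = gains[int(0.025 * n_resamples)]
    high = gains[min(n_resamples - 1, int(0.975 * n_resamples))]
    return observed, (low, high)
```

Under these assumptions, an interval that comfortably includes zero on a post-cutoff MRI set would be the kind of evidence against generalization that this section describes.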
Original abstract
We introduce MedGemma 1.5 4B, the latest model in the MedGemma collection. MedGemma 1.5 expands on MedGemma 1 by integrating additional capabilities: high-dimensional medical imaging (CT/MRI volumes and histopathology whole slide images), anatomical localization via bounding boxes, multi-timepoint chest X-ray analysis, and improved medical document understanding (lab reports, electronic health records). We detail the innovations required to enable these modalities within a single architecture, including new training data, long-context 3D volume slicing, and whole-slide pathology sampling. Compared to MedGemma 1 4B, MedGemma 1.5 4B demonstrates significant gains in these new areas, improving 3D MRI condition classification accuracy by 11% and 3D CT condition classification by 3% (absolute improvements). In whole slide pathology imaging, MedGemma 1.5 4B achieves a 47% macro F1 gain. Additionally, it improves anatomical localization with a 35% increase in Intersection over Union on chest X-rays and achieves a 4% macro accuracy for longitudinal (multi-timepoint) chest x-ray analysis. Beyond its improved multimodal performance over MedGemma 1, MedGemma 1.5 improves on text-based clinical knowledge and reasoning, improving by 5% on MedQA accuracy and 22% on EHRQA accuracy. It also achieves an average of 18% macro F1 on 4 different lab report information extraction datasets (EHR Datasets 2, 3, 4, and Mendeley Clinical Laboratory Test Reports). Taken together, MedGemma 1.5 serves as a robust, open resource for the community, designed as an improved foundation on which developers can create the next generation of medical AI systems. Resources and tutorials for building upon MedGemma 1.5 can be found at https://goo.gle/medgemma.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MedGemma 1.5 4B as an extension of MedGemma 1, adding support for 3D CT/MRI volumes, histopathology whole-slide images, anatomical localization via bounding boxes, multi-timepoint chest X-ray analysis, and improved medical document understanding. It describes new training data, long-context 3D slicing, and whole-slide sampling strategies, and claims absolute gains over the prior 4B model including +11% 3D MRI accuracy, +3% 3D CT accuracy, +47% macro F1 on pathology, +35% IoU on localization, +4% longitudinal CXR accuracy, +5% MedQA, +22% EHRQA, and 18% average macro F1 on lab-report extraction tasks.
Significance. If the gains prove robust, the work supplies an open multimodal medical foundation model with expanded 3D and pathology capabilities that could serve as a useful base for downstream medical AI systems. The explicit release of resources and tutorials is a positive contribution to reproducibility and community use.
major comments (3)
- [Abstract] All headline improvements are reported exclusively as deltas relative to MedGemma 1 (e.g., “improving 3D MRI condition classification accuracy by 11%”) with no absolute accuracy, F1, or IoU numbers supplied for either model on the same test sets. Without these values it is impossible to judge whether the new capabilities represent meaningful progress or merely modest shifts from low baselines.
- [Abstract] No information is given on evaluation datasets, train/test splits, statistical significance, or explicit checks that 3D volumes, whole-slide images, and longitudinal reports used for testing are disjoint from the new training data. These omissions directly undermine the claim that the reported gains reflect genuine generalization rather than overfitting or leakage.
- [Abstract] The paper states that long-context 3D volume slicing and whole-slide pathology sampling are key innovations, yet provides no concrete description of slice aggregation at inference, patch-level aggregation for whole slides, or how these procedures differ from standard practices, preventing assessment of their contribution to the claimed gains.
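The disjointness check requested in the second major comment can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' pipeline: the record layout `(patient_id, raw_bytes)` and the names `content_fingerprint` and `find_leakage` are hypothetical, and exact-byte hashing only catches verbatim duplicates, not re-encoded or resampled copies.

```python
import hashlib


def content_fingerprint(data: bytes) -> str:
    """SHA-256 hex digest of the raw file bytes (exact-duplicate detector)."""
    return hashlib.sha256(data).hexdigest()


def find_leakage(train_records, test_records):
    """Report overlap between two splits.

    Each record is an assumed (patient_id, raw_bytes) pair. Returns a dict with
    the patient IDs and the content fingerprints that occur in both splits.
    """
    train_records = list(train_records)
    test_records = list(test_records)

    train_patients = {pid for pid, _ in train_records}
    train_hashes = {content_fingerprint(blob) for _, blob in train_records}

    # Patient-level overlap: the same person appearing in both splits.
    shared_patients = sorted({pid for pid, _ in test_records if pid in train_patients})
    # Content-level overlap: identical bytes filed under a different ID.
    test_hashes = {content_fingerprint(blob) for _, blob in test_records}
    shared_content = sorted(test_hashes & train_hashes)

    return {"patients": shared_patients, "content": shared_content}
```

Checking at the patient level as well as the file level matters for longitudinal data, where different studies from one patient can otherwise land on both sides of a split.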
minor comments (1)
- [Abstract] The phrase “an average of 18% macro F1 on 4 different lab report information extraction datasets” is ambiguous; it is unclear whether this is a single macro F1 computed over the pooled examples of all four datasets or the mean of per-dataset macro F1 scores, and no per-dataset numbers are supplied.
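The two readings of "average macro F1" can give different numbers, which is why the ambiguity matters. The sketch below uses toy labels invented for illustration (no values from the paper) and hypothetical helper names; `macro_f1` takes the unweighted mean of per-class F1 over every class seen in the labels or predictions.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all classes seen in either list."""
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


def mean_of_per_dataset_macro_f1(datasets):
    """Reading 1: compute macro F1 on each dataset, then average the scores."""
    return sum(macro_f1(t, p) for t, p in datasets) / len(datasets)


def pooled_macro_f1(datasets):
    """Reading 2: concatenate all datasets, then compute one macro F1."""
    all_true = [t for ts, _ in datasets for t in ts]
    all_pred = [p for _, ps in datasets for p in ps]
    return macro_f1(all_true, all_pred)


# Toy data: one perfectly solved dataset and one small, fully wrong dataset.
toy = [
    (["a", "a", "b", "b"], ["a", "a", "b", "b"]),
    (["a", "b"], ["b", "a"]),
]
print(mean_of_per_dataset_macro_f1(toy))  # 0.5
print(pooled_macro_f1(toy))               # about 0.667
```

Because dataset sizes differ, pooling lets the larger dataset dominate, while the per-dataset mean weights each dataset equally; reporting per-dataset scores removes the ambiguity either way.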
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which highlight important areas for improving clarity and transparency in the manuscript. We agree that the abstract requires strengthening to better support the claims of improvement. We will make targeted revisions to the abstract and, where appropriate, the main text to address each point. Our point-by-point responses follow.
Point-by-point responses
- Referee: [Abstract] All headline improvements are reported exclusively as deltas relative to MedGemma 1 (e.g., “improving 3D MRI condition classification accuracy by 11%”) with no absolute accuracy, F1, or IoU numbers supplied for either model on the same test sets. Without these values it is impossible to judge whether the new capabilities represent meaningful progress or merely modest shifts from low baselines.
  Authors: We agree that absolute metrics are necessary for readers to assess the practical significance of the reported gains. While the full manuscript contains tables reporting absolute accuracy, F1, and IoU values for both MedGemma 1 and MedGemma 1.5 on the shared test sets, the abstract relies solely on deltas. We will revise the abstract to include key absolute numbers (e.g., the baseline and new 3D MRI accuracy, pathology macro F1, and localization IoU) alongside the deltas, ensuring the improvements can be evaluated in context.
  Revision: yes
- Referee: [Abstract] No information is given on evaluation datasets, train/test splits, statistical significance, or explicit checks that 3D volumes, whole-slide images, and longitudinal reports used for testing are disjoint from the new training data. These omissions directly undermine the claim that the reported gains reflect genuine generalization rather than overfitting or leakage.
  Authors: We recognize the critical need for explicit details on evaluation protocols to demonstrate generalization. The full manuscript describes the datasets, sources, and train/test splits for the 3D imaging, pathology, and clinical QA tasks, and states that test data were held out. We will add concise statements to the abstract confirming the use of disjoint test sets, the evaluation datasets, and any statistical significance checks performed. A dedicated paragraph on data partitioning and leakage prevention will also be added or expanded in the methods section.
  Revision: yes
- Referee: [Abstract] The paper states that long-context 3D volume slicing and whole-slide pathology sampling are key innovations, yet provides no concrete description of slice aggregation at inference, patch-level aggregation for whole slides, or how these procedures differ from standard practices, preventing assessment of their contribution to the claimed gains.
  Authors: We agree that the abstract should briefly convey the technical distinctions of our slicing and sampling approaches. The manuscript details the long-context 3D volume slicing strategy and whole-slide patch sampling in the methods, including how slices are aggregated at inference and how patch-level predictions are combined for slide-level outputs. We will revise the abstract to include a short description of these procedures and their differences from standard fixed-context or random-patch baselines. Expanded pseudocode and ablation results on aggregation choices will be added to the main text or supplementary material.
  Revision: yes
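The exchange above turns on how patch-level predictions become a slide-level output, which this review does not specify. As a point of comparison only, the sketch below shows two generic aggregation rules, mean probability and majority vote, and a toy case where they disagree. Neither is claimed to be the paper's method; the function names, the dict-of-probabilities input, and the labels are assumptions made for illustration.

```python
from collections import Counter


def aggregate_mean_probability(patch_probs):
    """Slide label = class with the highest mean probability across patches.

    patch_probs: list of dicts mapping label -> probability (assumed format).
    """
    if not patch_probs:
        raise ValueError("need at least one patch prediction")
    totals = {}
    for probs in patch_probs:
        for label, p in probs.items():
            totals[label] = totals.get(label, 0.0) + p
    means = {label: s / len(patch_probs) for label, s in totals.items()}
    # sorted() makes tie-breaking deterministic (alphabetical).
    return max(sorted(means), key=means.get)


def aggregate_majority_vote(patch_probs):
    """Slide label = most frequent per-patch argmax label (ties: alphabetical)."""
    if not patch_probs:
        raise ValueError("need at least one patch prediction")
    votes = Counter(max(sorted(p), key=p.get) for p in patch_probs)
    top = max(votes.values())
    return min(label for label, count in votes.items() if count == top)


# Toy slide: one confident "tumor" patch, two weakly "normal" patches.
patches = [
    {"tumor": 0.90, "normal": 0.10},
    {"tumor": 0.40, "normal": 0.60},
    {"tumor": 0.45, "normal": 0.55},
]
print(aggregate_mean_probability(patches))  # tumor
print(aggregate_majority_vote(patches))     # normal
```

That the two rules can flip the slide-level label on the same patches is the reason an ablation over aggregation choices, as the authors promise, would help attribute the reported gains.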
Circularity Check
No significant circularity in claimed improvements
Full rationale
The paper is an empirical technical report describing new training data, 3D slicing strategies, and whole-slide sampling to extend MedGemma 1 into additional medical imaging modalities. Reported gains are presented as measured outcomes on evaluation tasks rather than quantities derived by construction from the model inputs or prior results. No equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations that reduce the central claims to tautology appear in the text. External benchmarks and prior model comparisons supply independent content, keeping the derivation chain self-contained.
Axiom & Free-Parameter Ledger
free parameters (2)
- model size (4B parameters)
- training data mixture and sampling rates for 3D and pathology
axioms (1)
- domain assumption: Standard supervised fine-tuning and evaluation on held-out medical benchmarks will reflect real-world utility.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear: "We detail the innovations required to enable these modalities within a single architecture, including new training data, long-context 3D volume slicing, and whole-slide pathology sampling."
- IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · unclear: "3D CT and MR image volumes were preprocessed to sequences of individual 2D axial images... capped the number of axial slices per query to a maximum of 85"
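The quoted passage says volumes become sequences of 2D axial slices with at most 85 slices per query, but the excerpt elides how slices are chosen when a volume is deeper than that. The sketch below assumes evenly spaced subsampling, which is one plausible choice and not confirmed by the source; `select_axial_slices` is a hypothetical name, and only the cap of 85 comes from the quote.

```python
def select_axial_slices(num_slices, max_slices=85):
    """Return indices of the axial slices to send in one query.

    Assumption: when the volume has more than max_slices slices, pick evenly
    spaced indices that include the first and last slice. The cap of 85 is
    from the quoted text; the spacing rule is illustrative only.
    """
    if num_slices <= 0 or max_slices <= 0:
        raise ValueError("num_slices and max_slices must be positive")
    if num_slices <= max_slices:
        return list(range(num_slices))  # short volume: keep every slice
    if max_slices == 1:
        return [num_slices // 2]        # degenerate cap: take the middle slice
    step = (num_slices - 1) / (max_slices - 1)  # > 1, so indices never repeat
    return [round(i * step) for i in range(max_slices)]


indices = select_axial_slices(300)
print(len(indices), indices[0], indices[-1])  # 85 0 299
```

A fixed cap like this bounds the context length per query at the cost of through-plane resolution on deep volumes, which is one reason the referee's request for details on slice aggregation is relevant.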
Forward citations
Cited by 3 Pith papers
- Large Language Models Lack Temporal Awareness of Medical Knowledge
  LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
- EHR-RAGp: Retrieval-Augmented Prototype-Guided Foundation Model for Electronic Health Records
  EHR-RAGp is a retrieval-augmented EHR foundation model that employs prototype-guided retrieval to dynamically integrate relevant historical patient context, outperforming prior models on clinical prediction tasks.
- CXRMate-2: Structured Multimodal Temporal Embeddings and Tractable Reinforcement Learning for Clinically Acceptable Chest X-ray Radiology Report Generation
  CXRMate-2 improves chest X-ray report generation via temporal embeddings and tractable RL, delivering metric gains and 45% acceptability in radiologist review with no significant preference difference on most findings.
Reference graph
Works this paper leans on
-
[1]
URL https://doi.org/10.17632/bygfmk4rx9.2. Faruk Ahmed, Lin Yang, Tiam Jaroensri, Andrew Sellergren, Yossi Matias, Avinatan Hassidim, Greg S Corrado, Dale R Webster, Shravya Shetty, Shruthi Prabhakara, et al. Polypath: Adapting a large multimodal model for multi-slide pathology report generation. arXiv preprint arXiv:2502.10536,
-
[2]
Learning to exploit temporal structure for biomedical vision-language processing
Shruthi Bannur, Stephanie Hyland, Qianchu Liu, Fernando Perez-Garcia, Maximilian Ilse, Daniel C Castro, Benedikt Boecking, Harshita Sharma, Kenza Bouzid, Anja Thieme, et al. Learning to exploit temporal structure for biomedical vision-language processing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15016–1502...
-
[3]
Noel C. F. Codella, David Gutman, M. Emre Celebi, Brian Helba, Michael A. Marchetti, Stephen W. Dusza, Aadi Kalloo, Konstantinos Liopyris, Nabin Mishra, Harald Kittler, and Allan Halpern. Skin lesion analysis toward melanoma detection: A challenge at the 2017 international symposium on biomedical imaging (isbi), hosted by the international skin imaging co...
2017
-
[4]
Jorge Cuadros and George Bresnick
URL https://arxiv.org/abs/1710.05006. Jorge Cuadros and George Bresnick. Eyepacs: an adaptable telemedicine system for diabetic retinopathy screening. Journal of diabetes science and technology, 3(3):509–516,
-
[5]
A. Goldberger, L. Amaral, L. Glass, J. Hausdorff, P. C. Ivanov, R. Mark, and H. E. Stanley. Physiobank, physiotoolkit, and physionet: Components of a new research resource for complex physiologic signals. Circulation, 101(23):e215–e220, 2000a. Online; RRID:SCR_007345. Ary L Goldberger, Luis AN Amaral, Leon Glass, Jeffrey M Hausdorff, Plamen Ch Ivanov, Roge...
-
[6]
Ibrahim Ethem Hamamci, Sezgin Er, Chenyu Wang, Furkan Almas, Ayse Gulnihan Simsek, Sevval Nil Esirgun, Irem Dogan, Omer Faruk Durugol, Benjamin Hou, Suprosanna Shit, et al. Developing generalist foundation models from a multimodal dataset for 3d computed tomography. arXiv preprint arXiv:2403.17834,
-
[7]
Measuring Massive Multitask Language Understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300,
-
[8]
Stephanie L Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C Castro, Mercy Ranjit, Anton Schwaighofer, Fernando Pérez-García, Valentina Salvatelli, Shaury Srivastav, Anja Thieme, et al. Maira-1: A specialised large multimodal model for radiology report generation. arXiv preprint arXiv:2311.13668,
-
[9]
arXiv preprint arXiv:2106.14463 (2021)
Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven QH Truong, Du Nguyen Duong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew P Lungren, Andrew Y Ng, et al. Radgraph: Extracting clinical entities and relations from radiology reports. arXiv preprint arXiv:2106.14463,
-
[10]
Qiao Jin, Bhuwan Dhingra, Zhengping Liu, William W Cohen, and Xinghua Lu. Pubmedqa: A dataset for biomedical research question answering. arXiv preprint arXiv:1909.06146,
-
[11]
MIMIC-CXR database (version 2.0
A Johnson, T Pollard, R Mark, S Berkowitz, and S Horng. MIMIC-CXR database (version 2.0.0). PhysioNet, 2019a. Alistair Johnson, Matthew Lungren, Yifan Peng, Zhiyong Lu, Roger Mark, Seth Berkowitz, and Steven Horng. Mimic-cxr-jpg - chest radiographs with structured labels, November 2019b. URL https://doi.org/10.13026/8360-t248. Alistair EW Johnson, Tom J ...
-
[12]
Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman
URL https://doi.org/10.13026/acga-ht95. Jason J Lau, Soumya Gayen, Asma Ben Abacha, and Dina Demner-Fushman. A dataset of clinically generated visual questions and answers about radiology images. Scientific data, 5(1):1–10,
-
[13]
Tobi Olatunji, Charles Nimo, Abraham Owodunni, Tassallah Abdullahi, Emmanuel Ayodele, Mardhiyah Sanni, Chinemelu Aka, Folafunmi Omofoye, Foutse Yuehgoh, Timothy Faniran, et al. Afrimed-qa: A pan-african, multi-specialty, medical question-answering benchmark dataset. arXiv preprint arXiv:2411.15640,
-
[14]
Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, et al. Medgemma technical report. arXiv preprint arXiv:2507.05201,
-
[15]
Ryutaro Tanno, David Barrett, Andrew Sellergren, Sumedh Ghaisas, Sumanth Dathathri, Abigail See, Johannes Welbl, Karan Singhal, Shekoofeh Azizi, Tao Tu, et al. Consensus, dissensus and synergy between clinicians and specialist foundation models in radiology report generation. arXiv preprint arXiv:2311.18260,
-
[16]
URL https://arxiv.org/abs/2503.19786. Philipp Tschandl, Cliff Rosendahl, and Harald Kittler. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1), August
-
[17]
ISSN 2052-4463. doi: 10.1038/sdata.2018.161. URL http://dx.doi.org/10.1038/sdata.2018.161. Jason Walonoski, Mark Kramer, Joseph Nichols, Andre Quina, Chris Moesel, Dylan Hall, Carlton Duffett, Kudakwashe Dube, Thomas Gallagher, and Scott McLachlan. Synthea: An approach, method, and software mechanism for generating synthetic patients and the synthetic ele...
-
[18]
Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers
doi: 10.1093/jamia/ocx079. Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2097–2106,
-
[19]
Advancing multimodal medical capabilities of gemini. arXiv preprint arXiv:2405.03162, 2024
Lin Yang, Shawn Xu, Andrew Sellergren, Timo Kohlberger, Yuchen Zhou, Ira Ktena, Atilla Kiraly, Faruk Ahmed, Farhad Hormozdiari, Tiam Jaroensri, et al. Advancing multimodal medical capabilities of gemini. arXiv preprint arXiv:2405.03162,
-
[20]
Yuxin Zuo, Shang Qu, Yifei Li, Zhangren Chen, Xuekai Zhu, Ermo Hua, Kaiyan Zhang, Ning Ding, and Bowen Zhou. Medxpertqa: Benchmarking expert-level medical reasoning and understanding. arXiv preprint arXiv:2501.18362,
-
[21]
A. CT-RATE Evaluation. We additionally evaluated a portion of our models on the CT-RATE dataset (Hamamci et al., 2024), where we process accordingly (without resampling) per Section 2.3.1 with results summarized in Table
2024
-
[22]
medical prior,
Unlike specialized, custom-built CT architectures that are optimized to yield multi-label predictions in a single forward pass, applying generalist vision-language models to this high-dimensional task required a more granular inference strategy. Specifically, our framework necessitated querying the model 18 times per condition to accurately parse the di...
2025