MedGemma Technical Report

Aishwarya Kamath; Alek Andreev; Alexandre Ramé; Anand Rao; Andrew Sellergren; Armand Joulin; Atilla Kiraly; Avinatan Hassidim; Bram Sterling; Can Kirmizibayrak

arXiv: 2507.05201 · v4 · submitted 2025-07-07 · cs.AI · cs.CL · cs.CV

Andrew Sellergren, Sahar Kazemzadeh, Tiam Jaroensri, Atilla Kiraly, Madeleine Traverse, Timo Kohlberger, Shawn Xu, Fayaz Jamil, Cían Hughes, Charles Lau, Justin Chen, Fereshteh Mahvar, Liron Yatziv, Tiffany Chen, Bram Sterling, Stefanie Anna Baby, Susanna Maria Baby, Jeremy Lai, Samuel Schmidgall, Lu Yang, Kejia Chen, Per Bjornsson, Shashir Reddy, Ryan Brush, Kenneth Philbrick, Mercy Asiedu, Ines Mezerreg, Howard Hu, Howard Yang, Richa Tiwari, Sunny Jansen, Preeti Singh, Yun Liu, Shekoofeh Azizi, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Riviere, Louis Rouillard, Thomas Mesnard, Geoffrey Cideron, Jean-bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Elena Buchatskaya, Jean-Baptiste Alayrac, Dmitry Lepikhin, Vlad Feinberg, Sebastian Borgeaud, Alek Andreev, Cassidy Hardin, Robert Dadashi, Léonard Hussenot, Armand Joulin, Olivier Bachem, Yossi Matias, Katherine Chou, Avinatan Hassidim, Kavi Goel, Clement Farabet, Joelle Barral, Tris Warkentin, Jonathon Shlens, David Fleet, Victor Cotruta, Omar Sanseviero, Gus Martins, Phoebe Kirk, Anand Rao, Shravya Shetty, David F. Steiner, Can Kirmizibayrak, Rory Pilgrim, Daniel Golden, Lin Yang

Pith reviewed 2026-05-19 05:53 UTC · model grok-4.3

classification cs.AI · cs.CL · cs.CV

keywords MedGemma · medical vision-language models · foundation models · Gemma 3 · medical AI · image classification · question answering · vision encoder

The pith

MedGemma models built on Gemma 3 achieve strong medical image and text performance while retaining broad capabilities.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MedGemma as a collection of vision-language models derived from the Gemma 3 family to tackle the challenges of diverse medical data and privacy needs in healthcare AI. It establishes that these models deliver advanced understanding and reasoning across images and text, outperforming other generative models of similar size and nearing the results of models built for single tasks. This matters because foundation models that require less custom tuning data could speed up the creation of new medical applications without starting from scratch each time. The work also includes MedSigLIP, a vision encoder tuned on medical data that supports the models' visual performance and stands alone as competitive with specialized encoders.

Core claim

MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models.

What carries the argument

MedGemma, the collection of medical vision-language foundation models based on Gemma 3 4B and 27B and powered by the MedSigLIP medically-tuned vision encoder.

If this is right

MedGemma achieves 2.6-10% improvement over base models on out-of-distribution medical multimodal question answering tasks.
It shows 15.5-18.1% improvement on chest X-ray finding classification.
Agentic evaluations improve by 10.8% compared to the base models.
Further fine-tuning cuts errors in electronic health record information retrieval by 50% and matches existing specialized methods on pneumothorax classification and histopathology patch classification.
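
The figures above mix two kinds of quantity: improvements in a score and a reduction in errors. The sketch below shows how the two are computed so they are not confused. It is an illustration only: the function names and the example numbers are hypothetical, and this page does not state how the paper defines each metric.

```python
def absolute_gain(base_score: float, tuned_score: float) -> float:
    """Difference between two scores, in the same units as the scores."""
    return tuned_score - base_score


def relative_error_reduction(base_error: float, tuned_error: float) -> float:
    """Fraction of the baseline error that the tuned model removes."""
    if base_error <= 0:
        raise ValueError("base_error must be positive")
    return (base_error - tuned_error) / base_error


# Hypothetical numbers, not taken from the paper.
base_error, tuned_error = 0.20, 0.10
print("absolute gain in accuracy:", absolute_gain(1 - base_error, 1 - tuned_error))
print("relative error reduction:", relative_error_reduction(base_error, tuned_error))
```

With these made-up values, halving the error rate corresponds to an accuracy gain of only about ten points, which is why it helps to know which of the two a reported percentage refers to.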

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same base-plus-tuning pattern could shorten the path from general models to usable medical tools across additional domains like pathology or radiology reporting.
Retaining general capabilities alongside medical gains may support hybrid assistants that handle both clinical questions and everyday language tasks.
MedSigLIP could function as a reusable component for other medical imaging pipelines that need strong visual features without full model retraining.

Load-bearing premise

Gains measured on the reported medical benchmarks will hold up in real clinical workflows and on new data distributions without further domain-specific safeguards or validation.

What would settle it

A controlled test of MedGemma on a fresh set of real hospital cases never seen in training or benchmarks, with direct accuracy and error rates compared against both base models and human specialists.
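
One way such a comparison could be analyzed is a paired bootstrap over per-case correctness, sketched below. This is an assumption about how a settling test might be scored, not a procedure described in the paper; the function name, the 95% interval, and the resample count are all illustrative choices.

```python
import random


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000, seed=0):
    """Accuracy difference (A minus B) with a 95% percentile bootstrap interval.

    correct_a and correct_b are equal-length lists of 0/1 flags, one per case,
    for two systems evaluated on the same cases.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("inputs must be non-empty and of equal length")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample case indices with replacement, keeping the pairing intact.
        idx = [rng.randrange(n) for _ in range(n)]
        acc_a = sum(correct_a[i] for i in idx) / n
        acc_b = sum(correct_b[i] for i in idx) / n
        diffs.append(acc_a - acc_b)
    diffs.sort()
    lower = diffs[int(0.025 * n_resamples)]
    upper = diffs[int(0.975 * n_resamples) - 1]
    observed = (sum(correct_a) - sum(correct_b)) / n
    return observed, lower, upper
```

If the interval excludes zero on cases that were never seen in training or benchmarks, the gain is unlikely to be a sampling artifact. The same function can be reused to compare the model against human specialists scored on the same cases.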

Original abstract

Artificial intelligence (AI) has significant potential in healthcare applications, but its training and deployment faces challenges due to healthcare's diverse data, complex tasks, and the need to preserve privacy. Foundation models that perform well on medical tasks and require less task-specific tuning data are critical to accelerate the development of healthcare AI applications. We introduce MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B. MedGemma demonstrates advanced medical understanding and reasoning on images and text, significantly exceeding the performance of similar-sized generative models and approaching the performance of task-specific models, while maintaining the general capabilities of the Gemma 3 base models. For out-of-distribution tasks, MedGemma achieves 2.6-10% improvement on medical multimodal question answering, 15.5-18.1% improvement on chest X-ray finding classification, and 10.8% improvement on agentic evaluations compared to the base models. Fine-tuning MedGemma further improves performance in subdomains, reducing errors in electronic health record information retrieval by 50% and reaching comparable performance to existing specialized state-of-the-art methods for pneumothorax classification and histopathology patch classification. We additionally introduce MedSigLIP, a medically-tuned vision encoder derived from SigLIP. MedSigLIP powers the visual understanding capabilities of MedGemma and as an encoder achieves comparable or better performance than specialized medical image encoders. Taken together, the MedGemma collection provides a strong foundation of medical image and text capabilities, with potential to significantly accelerate medical research and development of downstream applications. The MedGemma collection, including tutorials and model weights, can be found at https://goo.gle/medgemma.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MedGemma, a collection of medical vision-language foundation models based on Gemma 3 4B and 27B, along with MedSigLIP, a medically-tuned vision encoder derived from SigLIP. It claims that MedGemma achieves advanced medical understanding and reasoning on images and text, significantly exceeding similar-sized generative models and approaching task-specific models while preserving general Gemma 3 capabilities. Reported gains on out-of-distribution tasks include 2.6-10% on medical multimodal QA, 15.5-18.1% on chest X-ray finding classification, and 10.8% on agentic evaluations; further fine-tuning yields 50% error reduction in EHR retrieval and comparable performance to specialized SOTA methods on pneumothorax and histopathology classification. Model weights and tutorials are released.

Significance. If the gains reflect genuine generalization rather than contamination, MedGemma would provide a useful general-purpose foundation for medical AI, lowering barriers to task-specific adaptation and supporting downstream applications in healthcare. The public release of weights strengthens reproducibility and enables community validation.

major comments (2)

[Abstract] Abstract: the claim of out-of-distribution improvements (2.6-10% on medical multimodal QA, 15.5-18.1% on chest X-ray classification, 10.8% on agentic evaluations) is load-bearing for the central assertion of learned medical reasoning, yet the abstract provides no information on training-data composition, decontamination protocols, overlap statistics with benchmark test sets, or statistical significance testing. Without these, the magnitude of gains cannot be distinguished from possible leakage of public medical datasets into pretraining or fine-tuning.
[Methods / Experiments] The manuscript does not describe the medical fine-tuning corpus or evaluation splits in sufficient detail to allow assessment of whether the reported benchmarks are truly held-out; this directly affects the validity of the 'out-of-distribution' framing and the claim that general capabilities are preserved.
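
The overlap statistics requested in the major comments could take the form of a simple n-gram check between training text and benchmark test items, sketched below. This is a hypothetical illustration, not the authors' decontamination protocol: the function names, whitespace tokenization, and the default 8-token window are assumptions, and the check covers text only, not images.

```python
def ngrams(text, n=8):
    """Set of lowercase whitespace-token n-grams in a string."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(test_items, train_docs, n=8):
    """Fraction of test items sharing at least one n-gram with any training doc."""
    if not test_items:
        return 0.0
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = [item for item in test_items if ngrams(item, n) & train_grams]
    return len(flagged) / len(test_items)


# Toy strings for illustration only.
train = ["the chest radiograph shows a small left apical pneumothorax"]
test = [
    "Which finding is present? a small left apical pneumothorax",
    "What stain is used in this histopathology patch?",
]
print(overlap_fraction(test, train, n=4))
```

Reporting this fraction per benchmark, together with results after removing the flagged items, would let readers judge how much of a gain could be explained by leakage.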

minor comments (2)

[Abstract] Abstract: the phrase 'approaching the performance of task-specific models' would benefit from explicit comparison tables or cited baselines for each task.
The link https://goo.gle/medgemma should be accompanied by a permanent DOI or Hugging Face repository reference for long-term accessibility.

Simulated Authors' Rebuttal

2 responses · 1 unresolved

We thank the referee for their insightful comments, which have helped us improve the clarity and transparency of our manuscript. We address each of the major comments below and have prepared revisions accordingly.

Point-by-point responses

Referee: [Abstract] Abstract: the claim of out-of-distribution improvements (2.6-10% on medical multimodal QA, 15.5-18.1% on chest X-ray classification, 10.8% on agentic evaluations) is load-bearing for the central assertion of learned medical reasoning, yet the abstract provides no information on training-data composition, decontamination protocols, overlap statistics with benchmark test sets, or statistical significance testing. Without these, the magnitude of gains cannot be distinguished from possible leakage of public medical datasets into pretraining or fine-tuning.

Authors: We appreciate the referee's concern regarding the substantiation of our out-of-distribution claims. To address this, we will revise the abstract to briefly note the use of curated medical datasets with decontamination steps applied to minimize overlap with evaluation benchmarks. We will also add statistical significance testing for the reported gains. A new section in the revised manuscript will provide detailed information on the training data composition and overlap statistics for the fine-tuning phase. We acknowledge that complete details on the base model's pretraining data are subject to the original development constraints and cannot be fully disclosed, but we focus on the additional medical adaptation steps.

Revision: yes

Referee: [Methods / Experiments] The manuscript does not describe the medical fine-tuning corpus or evaluation splits in sufficient detail to allow assessment of whether the reported benchmarks are truly held-out; this directly affects the validity of the 'out-of-distribution' framing and the claim that general capabilities are preserved.

Authors: We agree that greater detail is required to validate the held-out nature of the benchmarks. In the revised manuscript, we have substantially expanded the Methods section to describe the medical fine-tuning corpus, including data sources, volumes, curation processes, and specific decontamination protocols used to ensure no overlap with test sets. We also detail the evaluation splits and provide evidence that the reported benchmarks were held out. For the preservation of general capabilities, we include additional results on non-medical tasks to support this claim.

Revision: yes

standing simulated objections not resolved

Full disclosure of the pretraining data composition and decontamination protocols for the base Gemma 3 models, as these details are proprietary and were established prior to this work.

Circularity Check

0 steps flagged

No circularity: purely empirical model report

Full rationale

The paper is a technical report on training and benchmarking MedGemma (fine-tuned from Gemma 3) and MedSigLIP. All claims consist of reported accuracy deltas on external medical QA, classification, and agentic benchmarks versus base models and task-specific systems. No equations, derivations, fitted parameters presented as predictions, or self-citation chains appear in the provided text. Performance numbers are direct empirical measurements, not reductions to inputs by construction. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claims rest on standard supervised fine-tuning of large vision-language models using medical image-text pairs; no new axioms or invented entities are introduced beyond the usual assumptions of transfer learning.

free parameters (1)

fine-tuning hyperparameters and data mixture
Learning rate, epochs, and medical data selection are chosen to adapt the base model; these are not enumerated in the abstract.

axioms (1)

domain assumption: Medical image-text data can be used to improve performance on medical tasks without destroying general capabilities.
Invoked by the decision to fine-tune Gemma 3 on medical data while claiming retention of base-model abilities.

pith-pipeline@v0.9.0 · 6198 in / 1251 out tokens · 35715 ms · 2026-05-19T05:53:47.941877+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: MedGemma 4B multimodal model utilized ... vision encoder enhancement ... multimodal decoder pretraining ... post-training with distillation and reinforcement learning

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: MedSigLIP ... fine-tuned ... using over 33M medical image-text pairs ... mixed with 2% weight

The tag on each entry describes the relation between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding
cs.CV 2026-05 accept novelty 8.0

NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.
Large Language Models Lack Temporal Awareness of Medical Knowledge
cs.LG 2026-05 unverdicted novelty 8.0

LLMs lack temporal awareness of medical knowledge, showing gradual performance decline on up-to-date facts, much lower accuracy on historical knowledge (25-54% relative), and inconsistent year-to-year predictions.
CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
cs.CV 2026-05 accept novelty 8.0

CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.
MedHorizon: Towards Long-context Medical Video Understanding in the Wild
cs.CV 2026-05 unverdicted novelty 8.0

MedHorizon benchmark reveals current multimodal LLMs achieve only 41.1% accuracy on long medical videos due to failures in sparse evidence retrieval and procedural reasoning.
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
cs.CV 2026-04 unverdicted novelty 8.0

MMRareBench provides 1,756 QA pairs and 7,958 images from PMC rare-disease cases to evaluate 23 MLLMs, revealing low treatment-planning scores and medical models underperforming general models on multi-image tasks due...
MMRareBench: A Rare-Disease Multimodal and Multi-Image Medical Benchmark
cs.CV 2026-04 unverdicted novelty 8.0

MMRareBench is the first rare-disease benchmark for multimodal and multi-image clinical evaluation of MLLMs, revealing fragmented capabilities, low treatment-planning scores, and medical models underperforming general...
Dental-TriageBench: Benchmarking Multimodal Reasoning for Hierarchical Dental Triage
cs.CL 2026-03 unverdicted novelty 8.0

Dental-TriageBench is the first expert-annotated multimodal benchmark for hierarchical dental triage and shows a substantial performance gap between 19 MLLMs and junior dentists, especially on multi-domain referral cases.
Neural Signals Generate Clinical Notes in the Wild
cs.LG 2026-01 unverdicted novelty 8.0

CELM is the first EEG-to-language foundation model that generates clinical reports from variable-length EEG recordings using a new dataset of 9,922 reports paired with 11,000 hours of data from 9,048 patients.
FETAL-GAUGE: A Benchmark for Assessing Vision-Language Models in Fetal Ultrasound
cs.CV 2025-12 unverdicted novelty 8.0

Fetal-Gauge benchmark shows state-of-the-art vision-language models reach only 55% accuracy on fetal ultrasound tasks, well below clinical needs and highlighting the requirement for domain-adapted models.
DDX-TRACE: A Benchmark for Medical Diagnostic Trajectories in VLMs
cs.CV 2026-05 unverdicted novelty 7.0

DDX-TRACE is a physician-adjudicated benchmark for evaluating VLMs on evidence-supported diagnostic trajectories rather than final answers alone in multimodal neuroradiology.
Towards Clinically Interpretable Ophthalmic VQA via Spatially-Grounded Lesion Evidence
cs.CV 2026-05 unverdicted novelty 7.0

FundusGround is a new benchmark with 10,719 fundus images, 15,595 ETDRS-grid localized lesions, and 72,706 VQA questions to support clinically interpretable ophthalmic visual question answering.
JMed48k: A Multi-Profession Japanese Medical Licensing Benchmark for Vision-Language Model Evaluation
cs.CV 2026-05 conditional novelty 7.0

JMed48k is a new large-scale benchmark of Japanese medical licensing exams with images that reveals proprietary VLMs benefit more from visuals than medical-specific models, with large variation across professions.
What Do Biomedical NER and Entity Linking Benchmarks Measure? A Corpus-Centric Diagnostic Framework
cs.CL 2026-05 accept novelty 7.0

A corpus-centric framework diagnoses scale, structure, overlap, metadata, and terminology properties across nine biomedical NER/EL corpora, showing substantial differences that common statistics fail to capture.
Fully Open Meditron: An Auditable Pipeline for Clinical LLMs
cs.AI 2026-05 conditional novelty 7.0

Presents the first fully open pipeline for clinical LLMs that unifies eight public QA datasets with clinician-vetted synthetic data from guidelines and vignettes, achieving improved performance on medical benchmarks w...
RxEval: A Prescription-Level Benchmark for Evaluating LLM Medication Recommendation
cs.LG 2026-05 unverdicted novelty 7.0

RxEval benchmark shows frontier LLMs reach at most 46.10% exact match on prescription-level medication, dose, and route selection from real patient trajectories.
Overview of the MedHopQA track at BioCreative IX: track description, participation and evaluation of systems for multi-hop medical question answering
cs.CL 2026-05 unverdicted novelty 7.0

MedHopQA introduces a 1,000-question two-hop biomedical QA benchmark where retrieval-augmented systems reach 89% conceptual accuracy, outperforming zero-shot baselines by over 20 points.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
cs.CL 2026-05 unverdicted novelty 7.0

Checkup2Action is a new multimodal dataset and benchmark for generating patient-oriented action cards from real-world clinical check-up reports.
Checkup2Action: A Multimodal Clinical Check-up Report Dataset for Patient-Oriented Action Card Generation
cs.CL 2026-05 conditional novelty 7.0

Checkup2Action is a new multimodal dataset and benchmark for generating safe, prioritized action cards from real-world clinical check-up reports using large language models.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
cs.AI 2026-05 conditional novelty 7.0

EpiGraph creates a heterogeneous epilepsy knowledge graph that boosts LLM performance on clinical reasoning tasks by 30-41% in pharmacogenomics when used with Graph-RAG.
EpiGraph: Building Generalists for Evidence-Intensive Epilepsy Reasoning in the Wild
cs.AI 2026-05 unverdicted novelty 7.0

EpiGraph is a new epilepsy knowledge graph with 24,324 entities and 32,009 triplets that improves LLM performance on clinical tasks by up to 41% when used in Graph-RAG.
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
cs.CV 2026-05 conditional novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models
cs.CV 2026-05 unverdicted novelty 7.0

iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).
The Cost of Context: Mitigating Textual Bias in Multimodal Retrieval-Augmented Generation
cs.CL 2026-05 unverdicted novelty 7.0

Recorruption arises from visual attention suppression and positional bias in multimodal RAG; BAIR mitigates it via bottleneck attention intervention at inference time.
ReLay: Personalized LLM-Generated Plain-Language Summaries for Better Understanding, but at What Cost?
cs.CL 2026-05 unverdicted novelty 7.0

Personalized LLM-generated plain language summaries improve lay readers' comprehension and quality ratings but increase risks of reinforcing biases and introducing hallucinations compared to static expert summaries.
Learning from Compressed CT: Feature Attention Style Transfer and Structured Factorized Projections for Resource-Efficient Medical Image Analysis
cs.CV 2026-05 unverdicted novelty 7.0

CT-Lite combines Feature Attention Style Transfer (FAST) and Structured Factorized Projections (SFP) with contrastive learning to reach AUROC within 5-7% of uncompressed baselines on compressed CT volumes across three...
X-PCR: A Benchmark for Cross-modality Progressive Clinical Reasoning in Ophthalmic Diagnosis
cs.CV 2026-04 unverdicted novelty 7.0

X-PCR is a new benchmark of 26,415 images and 177,868 expert VQA pairs that evaluates MLLMs on six-stage progressive reasoning and cross-modality integration in ophthalmology.
SurgCoT: Advancing Spatiotemporal Reasoning in Surgical Videos through a Chain-of-Thought Benchmark
cs.CV 2026-04 unverdicted novelty 7.0

SurgCoT is a new benchmark that evaluates chain-of-thought spatiotemporal reasoning in multimodal large language models on surgical videos using five defined dimensions and an annotation protocol of Question-Option-Kn...
Beyond a Single Frame: Multi-Frame Spatially Grounded Reasoning Across Volumetric MRI
cs.CV 2026-04 unverdicted novelty 7.0

A new multi-frame VQA benchmark on volumetric MRI demonstrates that bounding-box supervised fine-tuning improves spatial grounding in VLMs over zero-shot baselines.
ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
cs.LG 2026-04 unverdicted novelty 7.0

ECHO is a one-step block diffusion VLM for chest X-ray reports that improves RaTE and SemScore by over 60% while delivering 8x faster inference than autoregressive baselines.
SiMing-Bench: Evaluating Procedural Correctness from Continuous Interactions in Clinical Skill Videos
cs.CV 2026-04 unverdicted novelty 7.0

SiMing-Bench shows current MLLMs have weak agreement with physicians on procedural correctness in clinical videos, with intermediate step judgments remaining poor even when overall scores look acceptable.
Lost in the Hype: Revealing and Dissecting the Performance Degradation of Medical Multimodal Large Language Models in Image Classification
cs.CV 2026-04 unverdicted novelty 7.0

Medical MLLMs degrade on image classification due to four failure modes in visual representation quality, connector projection fidelity, LLM comprehension, and semantic mapping alignment, quantified by feature probing...
BAS: A Decision-Theoretic Approach to Evaluating Large Language Model Confidence
cs.CL 2026-04 unverdicted novelty 7.0

BAS aggregates utility from an answer-or-abstain model across risk thresholds and is uniquely maximized by truthful confidence estimates.
SetFlow: Generating Structured Sets of Representations for Multiple Instance Learning
cs.LG 2026-03 unverdicted novelty 7.0

SetFlow is a flow-matching generative model for permutation-invariant MIL bags in representation space that produces synthetic data improving classification performance and enabling training on synthetic data alone.
CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models
cs.CV 2026-03 unverdicted novelty 7.0

CoDA chains clinically plausible acquisition, reconstruction, display, and delivery shifts to substantially degrade zero-shot performance of medical vision-language models, with a post-hoc token-space repair partially...
IBISAgent: Reinforcing Pixel-Level Visual Reasoning in MLLMs for Universal Biomedical Object Referring and Segmentation
cs.CV 2026-01 conditional novelty 7.0

IBISAgent enables MLLMs to perform iterative pixel-level visual reasoning for biomedical object referring and segmentation via text-based clicks and agentic RL, outperforming prior SOTA methods without model modifications.
Democratising Pathology Co-Pilots: An Open Pipeline and Dataset for Whole-Slide Vision-Language Modelling
cs.CV 2025-12 conditional novelty 7.0

A new open pipeline and dataset enable training of a vision-language model for whole-slide pathology VQA that outperforms MedGemma on tissue identification, neoplasm detection, and differential diagnosis.
When to Trust the Answer: Question-Aligned Semantic Nearest Neighbor Entropy for Safer Surgical VQA
cs.CV 2025-11 conditional novelty 7.0

QA-SNNE adds question-answer alignment via bilateral gating to semantic nearest neighbor entropy, yielding higher AUROC for uncertainty detection in surgical VQA models under both standard and rephrased questions.
Beyond Classification Accuracy: Neural-MedBench and the Need for Deeper Reasoning Benchmarks
cs.CV 2025-09 unverdicted novelty 7.0

Neural-MedBench reveals sharp performance drops in state-of-the-art VLMs on reasoning-intensive neurology tasks compared to conventional classification benchmarks, with reasoning failures dominating errors.
PathNavigate: A Training-Free Pathology Agent with Surprise-Guided Scan and Shared Slide Memory for Whole-Slide Image VQA
cs.CV 2026-05 unverdicted novelty 6.0

PathNavigate introduces a scan-search-readout routine with surprise-guided low-mag scanning and shared slide memory to improve training-free WSI-VQA accuracy and efficiency.
VIHD: Visual Intervention-based Hallucination Detection for Medical Visual Question Answering
cs.CV 2026-05 unverdicted novelty 6.0

VIHD detects hallucinations in medical MLLMs by identifying visually dominant decoder layers via probing and applying visual token masking to calibrate semantic entropy as a detection signal.
ClinSeekAgent: Automating Multimodal Evidence Seeking for Agentic Clinical Reasoning
cs.CL 2026-05 unverdicted novelty 6.0

ClinSeekAgent automates active multimodal evidence seeking for clinical reasoning, improving LLM performance on raw EHR and CXR tasks while enabling distillation into smaller models.
Rethinking Visual Attribution for Chest X-ray Reasoning in Large Vision Language Models
cs.CV 2026-05 unverdicted novelty 6.0

Existing visual attribution methods often fail to identify the visual evidence used by LVLMs in chest X-ray reasoning, while MedFocus using unbalanced optimal transport and targeted interventions substantially outperf...
How Good LLMs Are at Answering Bangla Medical Visual Questions? Dataset and Benchmarking
cs.CL 2026-05 unverdicted novelty 6.0

Introduces BanglaMedVQA dataset of clinically validated image-question-answer pairs and benchmarks foundation models, finding substantially lower performance than on English MedVQA especially on diagnostic questions.
SurgLQA: Scalable Long-Horizon Surgical Video Question Answering
cs.CV 2026-05 unverdicted novelty 6.0

SurgLQA introduces FTC for compact long-range video representations and TMS for adaptive test-time scaling, reporting gains on restructured Colon-LQA and REAL-Colon-VQA benchmarks.
JANUS: Anatomy-Conditioned Gating for Robust CT Triage Under Distribution Shift
cs.CV 2026-05 unverdicted novelty 6.0

JANUS conditions Vision Transformer embeddings on macro-radiomic priors via anatomically guided gating, reaching macro-AUROC 0.88 on an internal test set of 5082 cases and 0.87 on an external set of 2000 cases while i...
MILM: Large Language Models for Multimodal Irregular Time Series with Informative Sampling
cs.LG 2026-05 unverdicted novelty 6.0

MILM fine-tunes LLMs on XML-encoded multimodal irregular time series via a two-stage process that exploits informative sampling patterns to achieve top performance on EHR classification datasets.
CRAFT: Clinical Reward-Aligned Finetuning for Medical Image Synthesis
cs.CV 2026-05 unverdicted novelty 6.0

CRAFT adapts diffusion models to medical images via clinical reward alignment from LLMs and VLMs, improving alignment scores and cutting low-quality generations by 20.4% on average across modalities.
Verification Mirage: Mapping the Reliability Boundary of Self-Verification in Medical VQA
cs.CV 2026-05 unverdicted novelty 6.0

Self-verification in medical VQA creates a verification mirage where verifiers exhibit high error and agreement bias on wrong answers, with reliability strongly conditioned on task type.
CLR-voyance: Reinforcing Open-Ended Reasoning for Inpatient Clinical Decision Support with Outcome-Aware Rubrics
cs.CL 2026-05 unverdicted novelty 6.0

CLR-voyance reformulates inpatient reasoning as POMDP with clinician-validated outcome rubrics, yielding an 8B model that outperforms larger frontier models on the authors' new benchmark.
MedExAgent: Training LLM Agents to Ask, Examine, and Diagnose in Noisy Clinical Environments
cs.CL 2026-05 unverdicted novelty 6.0

MedExAgent models clinical diagnosis as a POMDP with patient and exam noise, then uses supervised fine-tuning followed by DAPO optimization to train an agent that matches larger models on diagnostic accuracy while con...
MedSynapse-V: Bridging Visual Perception and Clinical Intuition via Latent Memory Evolution
cs.CV 2026-04 unverdicted novelty 6.0

MedSynapse-V evolves latent diagnostic memories via meta queries, causal counterfactual refinement with RL, and dual-branch memory transition to outperform prior medical VLM methods in diagnostic accuracy.
From Natural Language to Verified Code: Toward AI Assisted Problem-to-Code Generation with Dafny-Based Formal Verification
cs.SE 2026-04 unverdicted novelty 6.0

Open-weight LLMs reach 81-91% success generating formally verified Dafny code for complex algorithmic problems when given structural signatures and self-healing verifier feedback.
Differentially Private De-identification of Dutch Clinical Notes: A Comparative Evaluation
cs.CR 2026-04 unverdicted novelty 6.0

Hybrid DP with LLM or NER preprocessing significantly improves the privacy-utility trade-off for Dutch clinical note de-identification compared to standalone DP.
Autonomous Skeletal Landmark Localization towards Agentic C-Arm Control
cs.CV 2026-04 unverdicted novelty 6.0

Fine-tuned MLLMs achieve competitive skeletal landmark localization on synthetic and real X-ray datasets compared to deep learning baselines and demonstrate reasoning for sequential C-arm navigation.
Hybrid Decision Making via Conformal VLM-generated Guidance
cs.AI 2026-04 unverdicted novelty 6.0

ConfGuide uses conformal risk control to generate targeted guidance sets in a learning-to-guide hybrid decision framework and demonstrates it on multi-label medical diagnosis.
Representation geometry shapes task performance in vision-language modeling for CT enterography
cs.CV 2026-04 unverdicted novelty 6.0

Mean pooling and multi-window RGB encoding optimize vision-language performance on CT enterography, with retrieval-augmented generation substantially improving automated report severity accuracy over fine-tuning alone.
MedConcept: Unsupervised Concept Discovery for Interpretability in Medical VLMs
cs.CV 2026-04 unverdicted novelty 6.0

MedConcept extracts reusable medical concepts from VLMs via sparse neuron activations, translates them to pseudo-reports, and scores them for semantic alignment using an independent medical LLM.
How Robust Are Large Language Models for Clinical Numeracy? An Empirical Study on Numerical Reasoning Abilities in Clinical Contexts
cs.CL 2026-04 unverdicted novelty 6.0

ClinicNumRobBench shows LLMs excel at value retrieval from clinical notes but struggle with relational comparisons and aggregations, with performance dropping under note-style variations and after medical fine-tuning.
Detecting HIV-Related Stigma in Clinical Narratives Using Large Language Models
cs.CL 2026-04 unverdicted novelty 6.0

An LLM-based NLP tool was developed and tested to identify four types of HIV stigma in clinical notes, achieving up to 0.62 micro F1 score with GatorTron-large.
Scaling Recurrence-aware Foundation Models for Clinical Records via Next-Visit Prediction
cs.LG 2026-03 unverdicted novelty 6.0

RAVEN pretrains on over one million EHR sequences via recurrence-aware next-visit event prediction, enabling zero-shot disease incidence forecasting that rivals fine-tuned models and generalizes across cohorts.

Reference graph

Works this paper leans on

16 extracted references · 16 canonical work pages · cited by 102 Pith papers · 1 internal anchor

[1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

URL https://arxiv.org/abs/2501.12948. Aaron Defazio, Francis Bach, and Simon Lacoste-Julien. Saga: A fast incremental gradient method with support for non-strongly convex composite objectives. Advances in neural information processing systems, 27, 2014. Nicolas Deperrois, Hidetoshi Matsuo, Samuel Ruipérez-Campillo, Moritz Vande...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.1609/aaai.v38i20.30205 2014
[2]

Ejection Fraction: The patient’s EF is 35%, which meets the criteria (≤ 40%)

work page
[3]

This is the target population for Entresto

Symptom Status: The patient is symptomatic (dyspnea on exertion) despite being on foundational guideline-directed medical therapy (GDMT) including an ACE inhibitor (lisinopril), a beta-blocker (carvedilol), and an MRA (eplerenone). This is the target population for Entresto

work page
[4]

The lisinopril would need to be stopped, typically with a 36-hour washout period, before starting Entresto

Contraindications: • ACE Inhibitor Use: Entresto should not be used concurrently with an ACE inhibitor (like lisinopril) due to increased risk of angioedema, hypotension, and hyperkalemia. The lisinopril would need to be stopped, typically with a 36-hour washout period, before starting Entresto. • Angioedema History: No history of angioedema related to ACE i...

work page 2025
[5]

The illness began with flu-like symptoms two weeks ago and has progressively worsened

Clinical Presentation Analysis: The patient presents with a severe, multi-system illness characterized by fever, fatigue, shortness of breath, headache, abdominal discomfort, jaundice (icterus), hypoxia, leukocytosis, thrombocytopenia, acute kidney injury (elevated BUN/Cr), hepatitis (elevated ALT/AST/bilirubin), and possible myocarditis (reduced EF on ech...

work page
[6]

• Multi-organ Involvement: The combination of pulmonary infiltrates, jaundice, AKI, and myocarditis suggests a systemic process, often infectious

Key Clues: • Travel History: Recent travel to Hawaii is a significant clue, pointing towards infections endemic to that region. • Multi-organ Involvement: The combination of pulmonary infiltrates, jaundice, AKI, and myocarditis suggests a systemic process, often infectious. • Specific Lab Findings: Leukocytosis with thrombocytopenia, elevated liver enzymes, ...

work page
[7]

Causes fever, headache, myalgia, conjunctival suffusion (not mentioned but common), jaundice, renal failure, pulmonary hemorrhage/ARDS, and myocarditis

Differential Diagnosis (Focusing on Hawaii-Endemic Infections): • Leptospirosis: Highly consistent with the presentation. Causes fever, headache, myalgia, conjunctival suffusion (not mentioned but common), jaundice, renal failure, pulmonary hemorrhage/ARDS, and myocarditis. Thrombocytopenia is common; leukocytosis can occur. Hawaii is an endemic area. • D...

work page
[8]

Therefore, specific testing for Leptospirosis should be prioritized

Prioritizing Diagnostic Tests: Based on the clinical picture and travel history, Leptospirosis is the most likely diagnosis. Therefore, specific testing for Leptospirosis should be prioritized. Dengue and Hantavirus are also important considerations due to endemicity and overlapping features

work page
[9]

Microscopic Agglutination Test (MAT) is the gold standard but takes longer

Specific Diagnostic Tests: • Leptospirosis: – Serology: ELISA for IgM antibodies is often the first step for faster results. Microscopic Agglutination Test (MAT) is the gold standard but takes longer. Paired sera (acute and convalescent) are often needed for definitive diagnosis. – PCR: Detects Leptospira DNA in blood (early) or urine (can be positive later...

work page
[10]

Conclusion: The most crucial next steps are to order specific diagnostic tests for the most likely endemic infections given the patient’s presentation and travel history, particularly Leptospirosis, Dengue, and Hantavirus. Final Answer: The final answer is Order specific serological and PCR testing for Leptospirosis, Dengue, and Hantavirus Continued on nex...

work page
[11]

Leptospirosis serology (IgM ELISA/MAT)

work page
[12]

Hantavirus serology (IgM)

work page
[13]

Dengue serology (IgM/IgG, NS1 antigen)

work page
[14]

Blood and urine cultures

work page
[15]

Cardiac enzymes (Troponin)

work page
[16]

Commentary by a UK Specialist Registrar in infectious diseases. Strengths: Correctly lists confirmatory diagnostic tests for each suggested differential

Coagulation studies. Commentary by a UK Specialist Registrar in infectious diseases. Strengths: Correctly lists confirmatory diagnostic tests for each suggested differential. Covers both specific confirmatory tests (to pinpoint a particular condition) and more generalized diagnostics (e.g. coagulation studies) that would yield more useful information to gu...

work page 2000
