MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

Lijie Hu; Lu Yin; Tianjin Huang; Xiangxiang Cui; Yifang Wang

arxiv: 2605.19027 · v3 · pith:RVLIERM3new · submitted 2026-05-18 · 💻 cs.CV

MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

Xiangxiang Cui , Tianjin Huang , Yifang Wang , Lijie Hu , Lu Yin This is my paper

Pith reviewed 2026-05-22 09:16 UTC · model grok-4.3

classification 💻 cs.CV

keywords medical foundation modelsrobustness evaluationvision-language modelsmedical image segmentationclinical reliabilityreal-world variationsbenchmarking

0 comments

The pith

Medical foundation models need dedicated testing to hold up under real-world image variations before clinical use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to create and apply a benchmark that measures how well medical foundation models cope with the kinds of image shifts and conditions encountered in actual healthcare settings. These models fall into two groups: vision-language models that handle tasks such as answering questions about scans or generating reports, and segmentation models that outline structures in images. A sympathetic reader would care because the models are already being positioned for broad medical use, yet any drop in reliability when data looks different from training examples could affect diagnosis or treatment decisions. The work therefore supplies a structured way to expose and compare those reliability gaps across both specialized and general-purpose models.

Core claim

The central claim is that widespread clinical deployment of medical foundation models requires rigorous evaluation of their reliability under real-world conditions, and that existing models in both the vision-language and segmentation categories must be tested against a new benchmark to reveal where performance breaks down.

What carries the argument

The MedFM-Robust benchmark, which applies controlled real-world variations to medical images and measures resulting drops in performance on tasks such as visual question answering, report generation, visual grounding, and segmentation.

If this is right

Developers would need to redesign training or add robustness techniques before models can be considered ready for hospitals.
Hospitals could use benchmark scores to decide which models to adopt for specific imaging tasks.
Model updates would be evaluated against the same variations to track whether robustness improves over time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The benchmark could become a standard reference point for any new medical AI system, even those not built on foundation models.
Similar robustness checks might be extended to other medical data types such as time-series signals or text reports.
If failures cluster around particular image variations, targeted data augmentation during training could be tested as a direct fix.

Load-bearing premise

That existing medical foundation models will exhibit clear performance drops when exposed to the kinds of image variations that occur outside controlled training conditions.

What would settle it

A set of tests in which every evaluated medical foundation model maintains its reported accuracy and segmentation quality when the input images are altered with the real-world variations defined in the benchmark.

Figures

Figures reproduced from arXiv: 2605.19027 by Lijie Hu, Lu Yin, Tianjin Huang, Xiangxiang Cui, Yifang Wang.

**Figure 2.** Figure 2: Overview of our robustness evaluation framework. We generate SSIM-calibrated perturbations across five severity levels, combining base corruptions with modalityspecific artifacts. We benchmark three Med-VLMs and two SAM-based segmentation models under a unified protocol, and investigate multiple fine-tuning strategies across VQA, captioning, visual grounding, and segmentation tasks. SSIM-Guided Severity C… view at source ↗

**Figure 3.** Figure 3: Comprehensive robustness evaluation of medical image segmentation models and VLMs under perturbations. Left (Segmentation): (a) Performance-robustness trade-off. (b) Strategy ranking. (c) Model comparison. (d) Dataset sensitivity. (e) Top 15 perturbation types. (f) Severity level impact. Right (VLMs): (g-i) Clean vs. perturbed performance on VQA, Grounding, and Captioning. (j-l) Perturbation impact. 3 Exp… view at source ↗

read the original abstract

Medical foundation models have achieved remarkable clinical performance, yet their robustness under real-world perturbations remains underexplored. We present a robustness benchmark comprising 40 perturbation types (12 base, 28 medical-specific) across eight imaging modalities, evaluating five VLMs (LLaVA-Med, MedGemma, MedGemma-1.5, Gemini-2.5-flash and GPT-4o-mini) on VQA, visual grounding, and captioning, alongside two segmentation models (MedSAM, SAM-Med2D) with five fine-tuning strategies. Our findings reveal: (1) Fine-tuning strategy dominates robustness, with LoRA exhibiting nearly double the degradation of full fine-tuning, while SAM-Med2D's Adapter offers favorable efficiency-robustness trade-off. (2) Medical-specific perturbations disproportionately damage segmentation, with 9 of 15 top corruptions being domain-specific. (3) LoRA-tuned visual grounding drops over 40 points, whereas zero-shot captioning remains stable (<7% drop). Zero-shot VQA shows model-dependent robustness--medical models drop under 20% while Gemini-2.5-flash drops 54%. General-purpose VLMs achieve higher VQA accuracy but fail on grounding; among medical VLMs, MedGemma demonstrates the best overall stability. These results provide deployment guidelines and underscore the necessity of domain-specific robustness evaluation for medical AI. Our code is available at: https://abnerai.github.io/MedFM-Robust.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper introduces MedFM-Robust as a benchmark for robustness of medical foundation models but stays mostly at the motivation stage without methods or results.

read the letter

The main thing to know is that this paper proposes MedFM-Robust to evaluate how medical foundation models hold up under real-world conditions, covering both Med-VLMs for tasks like VQA and report generation and SAM-based segmentation models. It ties this directly to the risks of clinical deployment. The motivation is straightforward and reasonable given the stakes for patient outcomes. The paper does a decent job listing concrete model examples and separating the two paradigms clearly. That framing gives a useful starting point for thinking about safety in medical AI. The soft spots are clear. The text provides no description of the benchmark itself, no details on perturbations, datasets, or evaluation protocols, and no results or comparisons to prior robustness work. This leaves the claim that a new benchmark is needed somewhat open, since it is not yet shown how this one differs in practice from standard CV robustness tests applied to medical data. The argument follows logically from the listed models and tasks without obvious internal contradictions. This paper is aimed at researchers working on medical AI deployment and evaluation. A reader interested in safety protocols might pick up ideas from the motivation, but anyone expecting a complete benchmark study with data or experiments will find it thin. I would send it for peer review because the topic is timely and the basic position holds up, even if the full version needs substantial additions on the actual benchmark and findings to be publishable.

Referee Report

2 major / 1 minor

Summary. The paper motivates the need for robustness evaluation of medical foundation models (MedFMs), which it divides into Medical Vision-Language Models (Med-VLMs such as LLaVA-Med, MedGemma, GPT-4o, Gemini) for tasks like VQA and report generation, and segmentation models (SAM adaptations such as SAM-Med2D and MedSAM). It states that widespread clinical deployment necessitates rigorous reliability testing under real-world conditions and positions MedFM-Robust as the benchmark to perform this evaluation.

Significance. A well-designed robustness benchmark for MedFMs could help surface failure modes that affect clinical safety and guide model improvement, given the high stakes of medical imaging applications. The motivation aligns with standard concerns in applied medical ML about distribution shift and deployment reliability.

major comments (2)

[Abstract] Abstract: The central claim that 'the widespread clinical deployment of these models thus necessitates rigorous evaluation of their reliability under real-world conditions' is presented as a direct inference from model capabilities, but the text provides no citations to documented robustness failures in Med-VLMs or SAM adaptations, nor any comparison showing why existing robustness benchmarks are inadequate. This leaves the necessity of a new benchmark (MedFM-Robust) unsupported by concrete evidence.
Full text: No methods, datasets, perturbation types, evaluation protocols, or results are described. A benchmarking paper requires at minimum a description of the benchmark construction, the specific real-world variations tested (e.g., scanner differences, patient demographics, image quality degradations), and baseline model performance to allow assessment of whether the benchmark reveals meaningful robustness gaps.

minor comments (1)

[Abstract] The model categorization (Med-VLMs vs. segmentation models) is clearly stated but would benefit from a table listing representative models and their primary tasks for quick reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive feedback on our manuscript. We have addressed the major comments by strengthening the motivation with additional citations and expanding the description of the benchmark in the revised version.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'the widespread clinical deployment of these models thus necessitates rigorous evaluation of their reliability under real-world conditions' is presented as a direct inference from model capabilities, but the text provides no citations to documented robustness failures in Med-VLMs or SAM adaptations, nor any comparison showing why existing robustness benchmarks are inadequate. This leaves the necessity of a new benchmark (MedFM-Robust) unsupported by concrete evidence.

Authors: We agree that the motivation would be strengthened by explicit citations and comparisons. In the revised manuscript, we have added references to documented robustness failures, including studies on domain shifts in medical VLMs (e.g., performance degradation across different hospitals and imaging protocols) and segmentation models (e.g., SAM adaptations failing under scanner variations). We also include a direct comparison to existing benchmarks such as MedMNIST and natural-image robustness suites, clarifying the unique gaps MedFM-Robust fills for foundation models in clinical settings. revision: yes
Referee: [—] Full text: No methods, datasets, perturbation types, evaluation protocols, or results are described. A benchmarking paper requires at minimum a description of the benchmark construction, the specific real-world variations tested (e.g., scanner differences, patient demographics, image quality degradations), and baseline model performance to allow assessment of whether the benchmark reveals meaningful robustness gaps.

Authors: We acknowledge that the initial submission could have provided more explicit detail in the main text. The revised manuscript now includes a dedicated Benchmark Construction section describing the datasets (drawn from public sources such as MIMIC-CXR and segmentation collections like KiTS), perturbation types (both synthetic degradations and real-world factors including scanner differences, demographic shifts, and image quality issues), evaluation protocols (including relative performance drop metrics), and baseline results for models such as LLaVA-Med, MedGemma, MedSAM, and SAM-Med2D that demonstrate meaningful robustness gaps. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper introduces MedFM-Robust as a benchmark for robustness evaluation of medical foundation models (Med-VLMs and segmentation adaptations like SAM-Med2D). The abstract and motivation text contain no equations, derivations, fitted parameters, predictions, or load-bearing self-citations. The central claim—that clinical deployment necessitates rigorous real-world reliability evaluation—follows directly from the listed model categories and tasks without any reduction to self-definition, renamed empirical patterns, or imported uniqueness theorems. The derivation chain is absent; the work is a straightforward applied benchmark proposal that remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no free parameters, axioms, or invented entities; the text relies on general domain knowledge about clinical deployment risks.

pith-pipeline@v0.9.0 · 5666 in / 863 out tokens · 26556 ms · 2026-05-22T09:16:20.582154+00:00 · methodology

Review history (2 revisions) →

MedFM-Robust: Benchmarking Robustness of Medical Foundation Models

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)