BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn · 2023 · cs.CV · arXiv 2303.00915

Canonical reference, cited by 77 Pith papers. Of the citations classified by role, 70% (7 of 10) cite this work as background.

abstract

Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks from retrieval to classification to visual question-answering (VQA). BiomedCLIP achieved new state-of-the-art results in a wide range of standard datasets, substantially outperforming prior approaches. Intriguingly, by large-scale pretraining on diverse biomedical image types, BiomedCLIP even outperforms state-of-the-art radiology-specific models such as BioViL in radiology-specific tasks such as RSNA pneumonia detection. In summary, BiomedCLIP is a fully open-access foundation model that achieves state-of-the-art performance on various biomedical tasks, paving the way for transformative multimodal biomedical discovery and applications. We release our models at https://aka.ms/biomedclip to facilitate future research in multimodal biomedical AI.

citation-role summary

background: 7 · method: 2 · baseline: 1

citation-polarity summary

background: 7 · use method: 2 · baseline: 1

representative citing papers

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

cs.CV · 2026-05-19 · accept · novelty 8.0

NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.

CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography

cs.CV · 2026-05-11 · accept · novelty 8.0

CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.

SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

SonoCLIP presents a mask-guided region-aware vision-language foundation model pretrained on 1.44M fetal ultrasound images, demonstrating superior zero-shot performance.

When Does Synthetic CT Transfer? A Label-Free Donor/Host Diagnostic for Medical Vision-Language Model Routing on Real Lung CT

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

Donor-driven nodule properties in synthetic CT transfer to real lung CT vision-language tasks while host-driven anatomy properties do not, enabling a label-free diagnostic for model routing.

Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI

cs.CV · 2026-06-27 · unverdicted · novelty 7.0

MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.

Revealing Training Data Exposure in Vision Language Large Models via Parameter Gradients

cs.CV · 2026-06-23 · unverdicted · novelty 7.0

GradAudit detects training data exposure in VLLMs by analyzing gradient stability on image-text pairs and outperforms baselines on medical and general datasets.

PHOEBI: An Open-World Benchmark for Bacterial Identification in Phase-Contrast Microscopy

cs.CV · 2026-06-22 · unverdicted · novelty 7.0

PHOEBI is a benchmark dataset and LCO evaluation protocol for open-world multi-label bacterial species identification from phase-contrast microscopy of polymicrobial samples.

MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Introduces MMBU benchmark for VLMs in biomedicine and demonstrates that established benchmarks mask perception deficiencies in evaluated models.

EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

EchoPilot delivers state-of-the-art training-free ultrasound video segmentation from a single point prompt by introducing scale-space semantic prompting via S.E.E.D. and reliability-gated memory updates.

EchoVQA: Enabling Conversational Assistance for Point-of-Care Cardiac Ultrasound

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

EchoVQA is the first large-scale VQA dataset for echocardiography spanning high- and low-quality images across views, with acquisition guidance questions, paired with a low-parameter multimodal prompt model that reports SOTA on several benchmarks.

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

cs.CV · 2026-05-19 · conditional · novelty 7.0

HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that cuts fabrication by up to 84.8%.

MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

MedCRP-CL discovers semantic modalities online via CRP from text prompts and maintains modality-specific LoRA adapters with intra-modality EWC, achieving 73.3% Dice and 4.1% forgetting on 16 tasks while using 6x fewer parameters than the best baseline.

Next-Acceleration-Scale Prediction for Autoregressive MRI Reconstruction

eess.IV · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

Next-acceleration-scale autoregressive prediction in discrete latent space with on-policy privileged information distillation yields improved MRI reconstructions from sparse measurements on the fastMRI benchmark.

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

cs.CV · 2026-05-07 · conditional · novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).

CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models

cs.CV · 2026-03-19 · unverdicted · novelty 7.0

CoDA chains clinically plausible acquisition, reconstruction, display, and delivery shifts to substantially degrade zero-shot performance of medical vision-language models, with a post-hoc token-space repair partially recovering accuracy.

CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?

cs.CV · 2025-10-01 · unverdicted · novelty 7.0

CardioBench is a new public benchmark that standardizes eight echocardiography datasets into four regression and five classification tasks to evaluate foundation model generalization.

GRAPE: Graph-Augmented Prototype Explanations for Interactive Medical Image Diagnosis

cs.CV · 2026-06-29 · unverdicted · novelty 6.0 · 2 refs

GRAPE augments prototype medical image classifiers with graph attention for co-occurrence, a mismatch safety check, and open-vocabulary anchoring to support incremental addition of findings from single examples.

Jolia: Concept-Level Vision-Language Alignment for 3D CT Contrastive Learning

cs.CV · 2026-06-23 · unverdicted · novelty 6.0

ConQuer augments global CLIP alignment with independent per-concept contrastive losses on anatomical regions extracted from reports, producing Jolia, which outperforms CLIP baselines on classification, report generation, and transfer.

Benchmarking Vision-Language Models for Microscopic Plant Image Understanding

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

PlantMicro benchmark shows current VLMs achieve low accuracy (e.g. GPT-5 at 34.93% on pathogen classification) on fine-grained microscopic plant image tasks.

Cohort-Anchored Foundation Models for Electronic Health Records: From Risk Scores to Auditable Peer Cohorts

cs.LG · 2026-06-20 · unverdicted · novelty 6.0

CAFM is a four-stage framework that anchors EHR foundation models to patient cohorts via deviation-aware curation, cohort-conditioned pretraining, multimodal alignment, and clinician refinement to improve interpretability and trustworthiness.

Zero-Shot Vision-Language Models for Classroom Engagement Recognition: A Benchmark Study of Prompt Sensitivity and Cross-Dataset Generalization

cs.CV · 2026-06-20 · unverdicted · novelty 6.0

Benchmark study shows zero-shot VLMs achieve near-random results (kappa <=0.10) on individual student videos but moderate agreement (kappa ~0.60) on scene-level images, with up to 32-point accuracy swings from prompt changes alone.

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

AI rewriting tasks that standardize radiology reports erode cross-modal image-text alignment more than they erode clinical entities or hedging language, creating a dissociation termed the slop paradox.

Frozen Foundation-Model Embeddings Discard Small-Lesion Signal in Chest Radiography: Implications for Pre-Deployment Evaluation

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

Frozen ViT embeddings in chest radiography suppress small-lesion signal at the CLS token but recover it via patch-local pooling on the same forward pass across multiple models and large cohorts.

citing papers explorer

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

cs.CV · 2026-05-19 · conditional · none · ref 14 · internal anchor

HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that cuts fabrication by up to 84.8%.

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

cs.CV · 2026-05-07 · conditional · none · ref 43 · internal anchor

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification

cs.CV · 2026-05-19 · conditional · none · ref 11 · internal anchor

Contrastive pretraining on mammography atlas image-text pairs improves BI-RADS classification F1 by 1-14% especially in low-label regimes, outperforming equivalent numbers of direct labels in some settings.

Improving Medical VQA through Trajectory-Aware Process Supervision

cs.LG · 2026-04-10 · conditional · none · ref 34 · internal anchor

A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.

VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance

cs.CV · 2025-10-08 · conditional · none · ref 32 · internal anchor

VA-Adapter adapts ultrasound foundation models for echocardiography probe guidance by embedding a vision-action module that infers individual 3D cardiac anatomy from historical sequences, outperforming prior methods with roughly 33 times fewer trainable parameters on a 1.31 million sample dataset.

PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering

cs.CV · 2023-05-17 · conditional · none · ref 64 · internal anchor

PMC-VQA dataset and MedVInT model achieve better generative performance on medical VQA benchmarks by visual instruction tuning on a newly constructed large-scale dataset.
