BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs
25 Pith papers cite this work.
abstract
Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks from retrieval to classification to visual question-answering (VQA). BiomedCLIP achieved new state-of-the-art results in a wide range of standard datasets, substantially outperforming prior approaches. Intriguingly, by large-scale pretraining on diverse biomedical image types, BiomedCLIP even outperforms state-of-the-art radiology-specific models such as BioViL in radiology-specific tasks such as RSNA pneumonia detection. In summary, BiomedCLIP is a fully open-access foundation model that achieves state-of-the-art performance on various biomedical tasks, paving the way for transformative multimodal biomedical discovery and applications. We release our models at https://aka.ms/biomedclip to facilitate future research in multimodal biomedical AI.
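The abstract's core idea — aligning image and text embeddings learned from parallel image-text pairs — follows the CLIP recipe: a symmetric contrastive (InfoNCE) objective where each pair's own caption is the positive and the rest of the batch are negatives. The following is a minimal illustrative NumPy sketch of that objective, not the released BiomedCLIP training code; embedding dimensions and the temperature value are arbitrary choices for the example.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit sphere so dot products are cosine similarities.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of parallel image-text pairs.

    Row i of img_emb and row i of txt_emb come from the same pair (the
    positive); every other row in the batch serves as a negative.
    """
    img = l2_normalize(img_emb)
    txt = l2_normalize(txt_emb)
    logits = img @ txt.T / temperature           # (batch, batch) similarity matrix
    labels = np.arange(len(logits))              # matching pairs lie on the diagonal

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # Average the image->text and text->image directions, as in CLIP.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

# Toy batch: 4 pairs of orthogonal 8-dim embeddings.
img = np.eye(4, 8)
loss_matched = clip_contrastive_loss(img, img)                       # aligned pairs
loss_mismatched = clip_contrastive_loss(img, np.roll(img, 1, axis=0))  # shuffled pairs
assert loss_matched < loss_mismatched
```

The loss is near zero when every image embedding matches its own caption and large when pairs are shuffled, which is the signal that drives the encoders toward a shared vision-language space.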
2026 · 25 representative citing papers
- CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
  CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.
- CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs
  Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
- iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models
  iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).
- MedLayBench-V: A Large-Scale Benchmark for Expert-Lay Semantic Alignment in Medical Vision Language Models
  MedLayBench-V is the first large-scale multimodal benchmark for expert-lay semantic alignment in medical vision-language models, constructed via a Structured Concept-Grounded Refinement pipeline that uses UMLS CUIs to enforce equivalence.
- A General Bézier Tree Encoding Counterfactual Framework for Retinal-Vessel-Mediated Disease Analysis
  BTECF encodes retinal vessels as Bézier trees to enable targeted, parameter-level counterfactual interventions on vessel geometry for causal analysis of vascular diseases.
- CLEF: EEG Foundation Model for Learning Clinical Semantics
  CLEF, a long-context EEG foundation model using 3D multitaper spectrograms and contrastive alignment with reports and EHR, beats prior models on 229 of 234 clinical tasks and raises mean AUROC from 0.65 to 0.74.
- MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation
  MSD-Score introduces multi-scale distributional scoring on von Mises-Fisher mixtures to evaluate image captions without references and reports state-of-the-art correlation with human judgments.
- CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging
  CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
- Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction?
  Natural-domain foundation models provide competitive and more robust priors than task-specific models for accelerated cardiac MRI reconstruction in cross-domain settings.
- REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction
  REVEAL uses vision-language alignment of retinal morphometry and clinical risk narratives plus group contrastive learning to predict AD and dementia about 8 years early.
- Adapting in the Dark: Efficient and Stable Test-Time Adaptation for Black-Box Models
  BETA adapts black-box models at test time using a local steering model and regularization techniques to achieve accuracy improvements without additional API queries or high latency.
- Improving Medical VQA through Trajectory-Aware Process Supervision
  A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.
- Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks
  LLaBIT is a single instruction-finetuned LLM that performs report generation, VQA, segmentation, and translation on brain MRI images while outperforming task-specific models.
- An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis
  A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.
- Any2Any 3D Diffusion Models with Knowledge Transfer: A Radiotherapy Planning Study
  DiffKT3D transfers priors from video diffusion models to 3D radiotherapy dose prediction via modality-specific embeddings and clinically guided RL, reducing voxel MAE from 2.07 to 1.93 and claiming SOTA over the GDP-HMM challenge winner.
- Cross-Modal Semantic-Enhanced Diffusion Framework for Diabetic Retinopathy Grading
  CGSD framework reaches 87.5% accuracy and 0.731 macro F1 on APTOS 2019 by conditioning diffusion denoising on dot-product vectors from image features and DR-grade text descriptions.
- MultiMedVision: Multi-Modal Medical Vision Framework
  A unified Sparse Vision Transformer learns joint 2D/3D medical image representations via self-supervision and achieves competitive AUROC on chest X-ray and CT benchmarks with 5x less data than modality-specific models.
- CapCLIP: A Vision-Language Representation Alignment Approach for Wireless Capsule Endoscopy Analysis
  CapCLIP uses pathology-aware text captions to align WCE images in a vision-language space, outperforming standard models in zero-shot classification and retrieval on unseen data.
- Pan-FM: A Pan-Organ Foundation Model with Saliency-Guided Masking for Missing Robustness
  Pan-FM learns balanced representations across seven organs by adaptively masking dominant organs during pre-training, yielding stronger disease prediction and missing-organ robustness than single-organ or naive multimodal baselines on UK Biobank.
- Learning from Medical Entity Trees: An Entity-Centric Medical Data Engineering Framework for MLLMs
  A Medical Entity Tree organizes medical knowledge to engineer higher-quality training data that boosts general MLLMs on medical benchmarks.
- Align then Refine: Text-Guided 3D Prostate Lesion Segmentation
  A text-guided multi-encoder U-Net with alignment loss, heatmap calibration, and confidence-gated cross-attention refiner sets new state-of-the-art 3D prostate lesion segmentation performance on the PI-CAI dataset.
- T-Gated Adapter: A Lightweight Temporal Adapter for Vision-Language Medical Segmentation
  A temporal adapter injects adjacent-slice context into VLM token representations, raising mean Dice from 0.498 to 0.704 on FLARE22 and reducing cross-domain drop from 38% to 24.9%.
- A Utility-preserving De-identification Pipeline for Cross-hospital Radiology Data Sharing
  The UPDP pipeline filters privacy terms and generates de-identified radiology images that preserve diagnostic pathology information, enabling models with competitive disease detection accuracy but reduced identity leakage and improved cross-hospital performance.
- CoRE: Concept-Reasoning Expansion for Continual Brain Lesion Segmentation
  CoRE aligns image tokens to a hierarchical concept library to simulate clinical reasoning for expert routing and demand-based growth in continual brain lesion segmentation, achieving SOTA on 12 tasks.
- Structure-Augmented Standard Plane Detection with Temporal Aggregation in Blind-Sweep Fetal Ultrasound
  Structure augmentation via segmentation prior plus temporal aggregation stabilizes keyframe detection of fetal abdomen planes in blind-sweep ultrasound.