BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn · 2023 · cs.CV · arXiv 2303.00915

Canonical reference. Among classified citations from Pith papers, 70% cite this work as background.

71 Pith papers citing it

Background: 70% of classified citations

abstract

Biomedical data is inherently multimodal, comprising physical measurements and natural language narratives. A generalist biomedical AI model needs to simultaneously process different modalities of data, including text and images. Therefore, training an effective generalist biomedical model requires high-quality multimodal data, such as parallel image-text pairs. Here, we present PMC-15M, a novel dataset that is two orders of magnitude larger than existing biomedical multimodal datasets such as MIMIC-CXR, and spans a diverse range of biomedical image types. PMC-15M contains 15 million biomedical image-text pairs collected from 4.4 million scientific articles. Based on PMC-15M, we have pretrained BiomedCLIP, a multimodal foundation model, with domain-specific adaptations tailored to biomedical vision-language processing. We conducted extensive experiments and ablation studies on standard biomedical imaging tasks from retrieval to classification to visual question-answering (VQA). BiomedCLIP achieved new state-of-the-art results in a wide range of standard datasets, substantially outperforming prior approaches. Intriguingly, by large-scale pretraining on diverse biomedical image types, BiomedCLIP even outperforms state-of-the-art radiology-specific models such as BioViL in radiology-specific tasks such as RSNA pneumonia detection. In summary, BiomedCLIP is a fully open-access foundation model that achieves state-of-the-art performance on various biomedical tasks, paving the way for transformative multimodal biomedical discovery and applications. We release our models at https://aka.ms/biomedclip to facilitate future research in multimodal biomedical AI.
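The abstract does not spell out the training objective. As the model name suggests, pretraining on parallel image-text pairs is typically done with a CLIP-style symmetric contrastive loss. The following is a minimal sketch of that general technique in plain Python, not the authors' implementation. The function names, the toy two-dimensional embeddings, and the temperature value of 0.07 are illustrative assumptions.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    Row k of image_embs is assumed to be paired with row k of text_embs.
    The temperature is an illustrative assumption, not a value from the paper.
    """
    images = [l2_normalize(v) for v in image_embs]
    texts = [l2_normalize(v) for v in text_embs]
    n = len(images)

    # Similarity of every image to every text, sharpened by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in texts]
        for img in images
    ]

    # Image-to-text direction: each image should pick out its own caption.
    image_to_text = -sum(log_softmax(logits[k])[k] for k in range(n)) / n

    # Text-to-image direction: same idea over the columns of the matrix.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    text_to_image = -sum(log_softmax(columns[k])[k] for k in range(n)) / n

    return 0.5 * (image_to_text + text_to_image)


# Toy batch: two image embeddings and two caption embeddings.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy batch the correctly paired embeddings give a loss near zero, while swapping the captions gives a much larger loss. That gap is the signal the objective uses to pull matching image-text pairs together.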

citation-role summary

background 7 · method 2 · baseline 1
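The 70% background figure appears to follow from these role counts rather than from all 71 citing papers. The arithmetic can be sketched as below, assuming the three counts listed here are the full set of classified citations; the variable and function names are hypothetical.

```python
# Role counts as listed in the citation-role summary.
role_counts = {"background": 7, "method": 2, "baseline": 1}


def role_shares(counts):
    """Return each role's share of the classified citations, in percent."""
    total = sum(counts.values())
    return {role: 100 * n / total for role, n in counts.items()}


shares = role_shares(role_counts)
classified_total = sum(role_counts.values())
```

Under that assumption, 10 citations are classified and 7 of them are background, which matches the 70% stated for classified citations.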

citation-polarity summary

background 7 · use method 2 · baseline 1

representative citing papers

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding

cs.CV · 2026-05-19 · accept · novelty 8.0

NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.

CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography

cs.CV · 2026-05-11 · accept · novelty 8.0

CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.

SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

SonoCLIP presents a mask-guided region-aware vision-language foundation model pretrained on 1.44M fetal ultrasound images, demonstrating superior zero-shot performance.

When Does Synthetic CT Transfer? A Label-Free Donor/Host Diagnostic for Medical Vision-Language Model Routing on Real Lung CT

cs.CV · 2026-06-28 · unverdicted · novelty 7.0

Donor-driven nodule properties in synthetic CT transfer to real lung CT vision-language tasks while host-driven anatomy properties do not, enabling a label-free diagnostic for model routing.

Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI

cs.CV · 2026-06-27 · unverdicted · novelty 7.0

MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.

MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models

cs.CV · 2026-06-04 · unverdicted · novelty 7.0

Introduces MMBU benchmark for VLMs in biomedicine and demonstrates that established benchmarks mask perception deficiencies in evaluated models.

EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory

cs.CV · 2026-05-25 · unverdicted · novelty 7.0

EchoPilot delivers state-of-the-art training-free ultrasound video segmentation from a single point prompt by introducing scale-space semantic prompting via S.E.E.D. and reliability-gated memory updates.

EchoVQA: Enabling Conversational Assistance for Point-of-Care Cardiac Ultrasound

cs.CV · 2026-05-22 · unverdicted · novelty 7.0

EchoVQA is the first large-scale VQA dataset for echocardiography spanning high- and low-quality images across views, with acquisition guidance questions, paired with a low-parameter multimodal prompt model that reports SOTA on several benchmarks.

HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation

cs.CV · 2026-05-19 · conditional · novelty 7.0

HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that cuts fabrication by up to 84.8%.

MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery

cs.CV · 2026-05-19 · unverdicted · novelty 7.0

MedCRP-CL discovers semantic modalities online via CRP from text prompts and maintains modality-specific LoRA adapters with intra-modality EWC, achieving 73.3% Dice and 4.1% forgetting on 16 tasks while using 6x fewer parameters than the best baseline.

Next-Acceleration-Scale Prediction for Autoregressive MRI Reconstruction

eess.IV · 2026-05-19 · unverdicted · novelty 7.0 · 2 refs

Next-acceleration-scale autoregressive prediction in discrete latent space with on-policy privileged information distillation yields improved MRI reconstructions from sparse measurements on the fastMRI benchmark.

CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

cs.CV · 2026-05-07 · conditional · novelty 7.0

Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).

CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models

cs.CV · 2026-03-19 · unverdicted · novelty 7.0

CoDA chains clinically plausible acquisition, reconstruction, display, and delivery shifts to substantially degrade zero-shot performance of medical vision-language models, with a post-hoc token-space repair partially recovering accuracy.

CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab?

cs.CV · 2025-10-01 · unverdicted · novelty 7.0

CardioBench is a new public benchmark that standardizes eight echocardiography datasets into four regression and five classification tasks to evaluate foundation model generalization.

GRAPE: Graph-Augmented Prototype Explanations for Interactive Medical Image Diagnosis

cs.CV · 2026-06-29 · unverdicted · novelty 6.0 · 2 refs

GRAPE augments prototype medical image classifiers with graph attention for co-occurrence, a mismatch safety check, and open-vocabulary anchoring to support incremental addition of findings from single examples.

Benchmarking Vision-Language Models for Microscopic Plant Image Understanding

cs.CV · 2026-06-21 · unverdicted · novelty 6.0

PlantMicro benchmark shows current VLMs achieve low accuracy (e.g. GPT-5 at 34.93% on pathogen classification) on fine-grained microscopic plant image tasks.

Cohort-Anchored Foundation Models for Electronic Health Records: From Risk Scores to Auditable Peer Cohorts

cs.LG · 2026-06-20 · unverdicted · novelty 6.0

CAFM is a four-stage framework that anchors EHR foundation models to patient cohorts via deviation-aware curation, cohort-conditioned pretraining, multimodal alignment, and clinician refinement to improve interpretability and trustworthiness.

Zero-Shot Vision-Language Models for Classroom Engagement Recognition: A Benchmark Study of Prompt Sensitivity and Cross-Dataset Generalization

cs.CV · 2026-06-20 · unverdicted · novelty 6.0

Benchmark study shows zero-shot VLMs achieve near-random results (kappa <=0.10) on individual student videos but moderate agreement (kappa ~0.60) on scene-level images, with up to 32-point accuracy swings from prompt changes alone.

The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports

cs.CL · 2026-06-16 · unverdicted · novelty 6.0

AI rewriting tasks that standardize radiology reports erode cross-modal image-text alignment more than they erode clinical entities or hedging language, creating a dissociation termed the slop paradox.

Frozen Foundation-Model Embeddings Discard Small-Lesion Signal in Chest Radiography: Implications for Pre-Deployment Evaluation

cs.CV · 2026-06-10 · unverdicted · novelty 6.0

Frozen ViT embeddings in chest radiography suppress small-lesion signal at the CLS token but recover it via patch-local pooling on the same forward pass across multiple models and large cohorts.

Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models

cs.CV · 2026-06-03 · unverdicted · novelty 6.0

OGKD injects inter-class geometry into teacher targets for two distillation losses (GAD on global tokens, LGD on patches) and reports 1.7-2.8% average accuracy gains over prior VLM adaptation methods on 11 medical datasets.

Detect Before You Leap: Mirage Detection in Vision-Language Models

cs.CV · 2026-05-29 · unverdicted · novelty 6.0

TC-LIA detects mirage in VLMs via layer-wise image patch to question alignment in CLIP encoders, reaching 94.6-94.7% three-class accuracy and under 3% mirage rate across five domains and twelve backbones.

Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification

cs.LG · 2026-05-23 · unverdicted · novelty 6.0

StenCE uses cross-modal contrastive learning on paired ECG-angiography data to learn ECG features that classify severe coronary stenosis, reporting the first high performance on this task.

citing papers explorer

Showing 50 of 71 citing papers.

NeuroQA: A Large-Scale Image-Grounded Benchmark for 3D Brain MRI Understanding cs.CV · 2026-05-19 · accept · none · ref 43 · internal anchor
NeuroQA is a large-scale 3D brain MRI visual question answering benchmark with verified image-grounded QA pairs, multi-domain coverage, and baseline evaluations showing current models lag behind text-only performance.
CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography cs.CV · 2026-05-11 · accept · none · ref 21 · internal anchor
CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.
SonoCLIP: Mask-Guided Region-Aware Vision-Language Pretraining for Fetal Ultrasound Analysis cs.CV · 2026-06-28 · unverdicted · none · ref 22 · internal anchor
SonoCLIP presents a mask-guided region-aware vision-language foundation model pretrained on 1.44M fetal ultrasound images, demonstrating superior zero-shot performance.
When Does Synthetic CT Transfer? A Label-Free Donor/Host Diagnostic for Medical Vision-Language Model Routing on Real Lung CT cs.CV · 2026-06-28 · unverdicted · none · ref 30 · internal anchor
Donor-driven nodule properties in synthetic CT transfer to real lung CT vision-language tasks while host-driven anatomy properties do not, enabling a label-free diagnostic for model routing.
Learning from Acquisition: Metadata-driven Multimodal Pre-training for Cardiac MRI cs.CV · 2026-06-27 · unverdicted · none · ref 16 · internal anchor
MetaCLIP-CMR applies CLIP-style contrastive learning to cardiac MRI by treating acquisition metadata as text labels, delivering 86.8% modality and 86.5% view accuracy plus top Dice scores on ACDC/M&Ms segmentation with far less pre-training data than recent large-scale CMR models.
MMBU: A Massive Multi-modal Biomedical Understanding Benchmark to Probe the Perception Capabilities of Vision-Language Models cs.CV · 2026-06-04 · unverdicted · none · ref 31 · internal anchor
Introduces MMBU benchmark for VLMs in biomedicine and demonstrates that established benchmarks mask perception deficiencies in evaluated models.
EchoPilot: Training-Free Ultrasound Video Segmentation via Scale-Space Semantic Prompting and Reliability-Gated Memory cs.CV · 2026-05-25 · unverdicted · none · ref 29 · internal anchor
EchoPilot delivers state-of-the-art training-free ultrasound video segmentation from a single point prompt by introducing scale-space semantic prompting via S.E.E.D. and reliability-gated memory updates.
EchoVQA: Enabling Conversational Assistance for Point-of-Care Cardiac Ultrasound cs.CV · 2026-05-22 · unverdicted · none · ref 26 · internal anchor
EchoVQA is the first large-scale VQA dataset for echocardiography spanning high- and low-quality images across views, with acquisition guidance questions, paired with a low-parameter multimodal prompt model that reports SOTA on several benchmarks.
HalluCXR: Benchmarking and Mitigating Hallucinations in Medical Vision-Language Models for Chest Radiograph Interpretation cs.CV · 2026-05-19 · conditional · none · ref 14 · internal anchor
HalluCXR benchmark shows 61.9-82.3% hallucination rates across VLMs on MIMIC-CXR images, identifies patterns such as length-based risk and over-fabrication of common findings, and demonstrates ensemble mitigation that cuts fabrication by up to 84.8%.
MedCRP-CL: Continual Medical Image Segmentation via Bayesian Nonparametric Semantic Modality Discovery cs.CV · 2026-05-19 · unverdicted · none · ref 11 · internal anchor
MedCRP-CL discovers semantic modalities online via CRP from text prompts and maintains modality-specific LoRA adapters with intra-modality EWC, achieving 73.3% Dice and 4.1% forgetting on 16 tasks while using 6x fewer parameters than the best baseline.
Next-Acceleration-Scale Prediction for Autoregressive MRI Reconstruction eess.IV · 2026-05-19 · unverdicted · none · ref 54 · 2 links · internal anchor
Next-acceleration-scale autoregressive prediction in discrete latent space with on-policy privileged information distillation yields improved MRI reconstructions from sparse measurements on the fastMRI benchmark.
CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs cs.CV · 2026-05-07 · conditional · none · ref 43 · internal anchor
Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.
iTRIALSPACE: Programmable Virtual Lesion Trials for Controlled Evaluation of Lung CT Models cs.CV · 2026-05-07 · unverdicted · none · ref 24 · internal anchor
iTRIALSPACE generates realistic virtual lesion trials on lung CTs that isolate performance drivers and show strong transfer of model rankings to real clinical data (ρ=0.93).
CoDA: Exploring Chain-of-Distribution Attacks and Post-Hoc Token-Space Repair for Medical Vision-Language Models cs.CV · 2026-03-19 · unverdicted · none · ref 47 · internal anchor
CoDA chains clinically plausible acquisition, reconstruction, display, and delivery shifts to substantially degrade zero-shot performance of medical vision-language models, with a post-hoc token-space repair partially recovering accuracy.
CardioBench: Do Echocardiography Foundation Models Generalize Beyond the Lab? cs.CV · 2025-10-01 · unverdicted · none · ref 10 · internal anchor
CardioBench is a new public benchmark that standardizes eight echocardiography datasets into four regression and five classification tasks to evaluate foundation model generalization.
GRAPE: Graph-Augmented Prototype Explanations for Interactive Medical Image Diagnosis cs.CV · 2026-06-29 · unverdicted · none · ref 30 · 2 links · internal anchor
GRAPE augments prototype medical image classifiers with graph attention for co-occurrence, a mismatch safety check, and open-vocabulary anchoring to support incremental addition of findings from single examples.
Benchmarking Vision-Language Models for Microscopic Plant Image Understanding cs.CV · 2026-06-21 · unverdicted · none · ref 65 · internal anchor
PlantMicro benchmark shows current VLMs achieve low accuracy (e.g. GPT-5 at 34.93% on pathogen classification) on fine-grained microscopic plant image tasks.
Cohort-Anchored Foundation Models for Electronic Health Records: From Risk Scores to Auditable Peer Cohorts cs.LG · 2026-06-20 · unverdicted · none · ref 57 · internal anchor
CAFM is a four-stage framework that anchors EHR foundation models to patient cohorts via deviation-aware curation, cohort-conditioned pretraining, multimodal alignment, and clinician refinement to improve interpretability and trustworthiness.
Zero-Shot Vision-Language Models for Classroom Engagement Recognition: A Benchmark Study of Prompt Sensitivity and Cross-Dataset Generalization cs.CV · 2026-06-20 · unverdicted · none · ref 22 · internal anchor
Benchmark study shows zero-shot VLMs achieve near-random results (kappa <=0.10) on individual student videos but moderate agreement (kappa ~0.60) on scene-level images, with up to 32-point accuracy swings from prompt changes alone.
The Slop Paradox: How Synthetic Standardization Erodes Clinical Uncertainty and Cross-Modal Alignment in AI-Rewritten Radiology Reports cs.CL · 2026-06-16 · unverdicted · none · ref 11 · internal anchor
AI rewriting tasks that standardize radiology reports erode cross-modal image-text alignment more than they erode clinical entities or hedging language, creating a dissociation termed the slop paradox.
Frozen Foundation-Model Embeddings Discard Small-Lesion Signal in Chest Radiography: Implications for Pre-Deployment Evaluation cs.CV · 2026-06-10 · unverdicted · none · ref 17 · internal anchor
Frozen ViT embeddings in chest radiography suppress small-lesion signal at the CLS token but recover it via patch-local pooling on the same forward pass across multiple models and large cohorts.
Geometry-Aware Distillation for Prompt Tuning Biomedical Vision-Language Models cs.CV · 2026-06-03 · unverdicted · none · ref 46 · internal anchor
OGKD injects inter-class geometry into teacher targets for two distillation losses (GAD on global tokens, LGD on patches) and reports 1.7-2.8% average accuracy gains over prior VLM adaptation methods on 11 medical datasets.
Detect Before You Leap: Mirage Detection in Vision-Language Models cs.CV · 2026-05-29 · unverdicted · none · ref 10 · internal anchor
TC-LIA detects mirage in VLMs via layer-wise image patch to question alignment in CLIP encoders, reaching 94.6-94.7% three-class accuracy and under 3% mirage rate across five domains and twelve backbones.
Cross-Modal Contrastive Learning of ECG and Angiography Representations for Severe Stenosis Classification cs.LG · 2026-05-23 · unverdicted · none · ref 23 · internal anchor
StenCE uses cross-modal contrastive learning on paired ECG-angiography data to learn ECG features that classify severe coronary stenosis, reporting the first high performance on this task.
Rethinking Noise-Robust Training for Frozen Vision Foundation Models: A Cross-Dataset Benchmark with a Case Study of Small-Loss Failure cs.CV · 2026-05-21 · unverdicted · none · ref 15 · internal anchor
Large-scale benchmark of noisy-label methods on frozen VFMs reveals no universal winner, with ELR and CUFIT performing differently, and demonstrates small-loss assumption failure via 53-61% loss overlap under asymmetric noise.
MAM-CLIP: Vision-Language Pretraining on Mammography Atlases for BI-RADS Classification cs.CV · 2026-05-19 · conditional · none · ref 11 · internal anchor
Contrastive pretraining on mammography atlas image-text pairs improves BI-RADS classification F1 by 1-14% especially in low-label regimes, outperforming equivalent numbers of direct labels in some settings.
A General Bézier Tree Encoding Counterfactual Framework for Retinal-Vessel-Mediated Disease Analysis eess.IV · 2026-05-13 · unverdicted · none · ref 59 · internal anchor
BTECF encodes retinal vessels as Bézier trees to enable targeted, parameter-level counterfactual interventions on vessel geometry for causal analysis of vascular diseases.
CLEF: EEG Foundation Model for Learning Clinical Semantics cs.AI · 2026-05-11 · unverdicted · none · ref 22 · internal anchor
CLEF, a long-context EEG foundation model using 3D multitaper spectrograms and contrastive alignment with reports and EHR, beats prior models on 229 of 234 clinical tasks and raises mean AUROC from 0.65 to 0.74.
MSD-Score: Multi-Scale Distributional Scoring for Reference-Free Image Caption Evaluation cs.CV · 2026-05-07 · unverdicted · none · ref 44 · internal anchor
MSD-Score introduces multi-scale distributional scoring on von Mises-Fisher mixtures to evaluate image captions without references and reports state-of-the-art correlation with human judgments.
DIYHealth Suite: Dataset, Model, and Benchmark for Health Management at Home cs.CY · 2026-05-01 · unverdicted · none · ref 99 · internal anchor
DIYHealth Suite introduces a large home-care dataset, DIYHealthGPT model with Hybrid Hyper Low-Rank Adaptation, and DIYHealthBench, claiming SOTA results on 11 tasks over general and medical baselines.
CheXmix: Unified Generative Pretraining for Vision Language Models in Medical Imaging cs.CV · 2026-04-24 · unverdicted · none · ref 46 · internal anchor
CheXmix combines masked autoencoder pretraining with early-fusion generative modeling to outperform prior models on chest X-ray classification by up to 8.6% AUROC, inpainting by 51%, and report generation by 45% on GREEN.
Are Natural-Domain Foundation Models Effective for Accelerated Cardiac MRI Reconstruction? eess.IV · 2026-04-24 · unverdicted · none · ref 41 · internal anchor
Natural-domain foundation models provide competitive and more robust priors than task-specific models for accelerated cardiac MRI reconstruction in cross-domain settings.
REVEAL: Multimodal Vision-Language Alignment of Retinal Morphometry and Clinical Risks for Incident AD and Dementia Prediction cs.CV · 2026-04-20 · unverdicted · none · ref 1 · internal anchor
REVEAL uses vision-language alignment of retinal morphometry and clinical risk narratives plus group contrastive learning to predict AD and dementia about 8 years early.
Adapting in the Dark: Efficient and Stable Test-Time Adaptation for Black-Box Models cs.LG · 2026-04-17 · unverdicted · none · ref 5 · internal anchor
BETA adapts black-box models at test time using a local steering model and regularization techniques to achieve accuracy improvements without additional API queries or high latency.
Improving Medical VQA through Trajectory-Aware Process Supervision cs.LG · 2026-04-10 · conditional · none · ref 34 · internal anchor
A trajectory-aware process reward using DTW on sentence embeddings, combined with exact-match in GRPO after SFT, raises mean medical VQA accuracy from 0.598 to 0.689 across six benchmarks.
Visual Instruction-Finetuned Language Model for Versatile Brain MR Image Tasks cs.CV · 2026-04-03 · unverdicted · none · ref 56 · internal anchor
LLaBIT is a single instruction-finetuned LLM that performs report generation, VQA, segmentation, and translation on brain MRI images while outperforming task-specific models.
An Explainable Vision-Language Model Framework with Adaptive PID-Tversky Loss for Lumbar Spinal Stenosis Diagnosis cs.CV · 2026-04-02 · unverdicted · none · ref 53 · internal anchor
A VLM framework with spatial patch cross-attention and adaptive PID-Tversky loss reports 90.69% classification accuracy, 0.9512 Dice score, and 92.80 CIDEr for LSS diagnosis plus automated report generation.
Are Video Models Emerging as Zero-Shot Learners and Reasoners in Medical Imaging? cs.CV · 2025-10-11 · unverdicted · none · ref 38 · internal anchor
A video-trained large vision model achieves competitive zero-shot performance on organ segmentation, denoising, super-resolution, and 4D CT motion prediction in medical imaging, outperforming some specialized baselines on patient data from 122 cases.
VA-Adapter: Adapting Ultrasound Foundation Model to Echocardiography Probe Guidance cs.CV · 2025-10-08 · conditional · none · ref 32 · internal anchor
VA-Adapter adapts ultrasound foundation models for echocardiography probe guidance by embedding a vision-action module that infers individual 3D cardiac anatomy from historical sequences, outperforming prior methods with roughly 33 times fewer trainable parameters on a 1.31 million sample dataset.
RA-RRG: Multimodal Retrieval-Augmented Radiology Report Generation with Key Phrase Extraction cs.CV · 2025-04-10 · unverdicted · none · ref 58 · internal anchor
RA-RRG extracts key phrases with LLMs, retrieves them via multimodal similarity, and conditions report generation on them to achieve SOTA CheXbert scores and competitive RadGraph F1 on MIMIC-CXR and IU X-ray while supporting multi-view inputs.
LLaVA-Med: Training a Large Language-and-Vision Assistant for Biomedicine in One Day cs.CV · 2023-06-01 · unverdicted · none · ref 49 · internal anchor
LLaVA-Med is created via curriculum fine-tuning on PubMed figure-caption pairs and GPT-4 self-instructed data, achieving competitive or better results than prior supervised models on three biomedical VQA benchmarks.
PMC-VQA: Visual Instruction Tuning for Medical Visual Question Answering cs.CV · 2023-05-17 · conditional · none · ref 64 · internal anchor
PMC-VQA dataset and MedVInT model achieve better generative performance on medical VQA benchmarks by visual instruction tuning on a newly constructed large-scale dataset.
REVEAL++: Differentiable Phenotypic Grouping for Vision-Language Retinal Modeling of Alzheimer's Disease Risk cs.AI · 2026-06-17 · unverdicted · none · ref 20 · internal anchor
REVEAL++ replaces discrete phenotypic groups with differentiable soft multi-positive weighting derived from intra-modality embeddings in contrastive learning, outperforming prior discrete and baseline methods on UK Biobank incident AD prediction.
Hallucination Detection and Correction in Medical VLMs via Counter-Evidence Verification cs.CV · 2026-06-17 · unverdicted · none · ref 26 · internal anchor
CoEV is a plug-and-play bidirectional verification method that maps text statements to visual evidence regions, assigns them to a four-quadrant factuality-grounding map, and uses this to detect and correct hallucinations in medical VLMs without retraining.
A multi-agent system for spine MRI report generation from multi-sequence imaging cs.CV · 2026-06-08 · unverdicted · none · ref 42 · internal anchor
SpineAgent combines multi-sequence MRI embeddings from DINOv3 encoders with 37 specialized agents and an end-to-end Medical Report Agent to achieve SOTA automated spine MRI report generation on a large clinical dataset.
PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining cs.CL · 2026-05-31 · unverdicted · none · ref 24 · internal anchor
PMC-InterCPT builds a context-grounded biomedical interleaved corpus from PMC literature and shows it improves multimodal performance on Qwen3.5-4B-Base after CPT and SFT while using fewer tokens.
VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs cs.CV · 2026-05-27 · unverdicted · none · ref 4 · internal anchor
VITAL adds visual-semantic dual supervision during training of medical MLLMs for latent reasoning, yielding SOTA results on 7 benchmarks with a new 61K multi-modality dataset while enabling post-hoc textual and visual explanations at zero inference overhead.
Universal CT Representations from Anatomy to Disease Phenotype through Agglomerative Pretraining cs.CV · 2026-05-21 · unverdicted · none · ref 16 · 2 links · internal anchor
FlexiCT provides CT foundation models via agglomerative pretraining on 266227 volumes from 56 datasets that match or exceed task-specific models on five task families while organizing embeddings along tumor-stage gradients.
A Human-in-the-Loop Framework for Efficient Prompt Selection in Microscopy Vision-Language Models cs.CV · 2026-05-19 · unverdicted · none · ref 31 · internal anchor
A target-driven active learning approach for building efficient prompt sets in microscopy VLMs reaches 100% test accuracy with an average of 20 expert-verified images, outperforming random selection.
Rad-VLSM: A Cross-Modal Framework with Semantics-Assisted Prompting for Medical Segmentation and Diagnosis cs.CV · 2026-05-18 · unverdicted · none · ref 50 · internal anchor
Rad-VLSM is a cross-modal two-stage framework that converts semantic guidance from BLIP-2 into box prompts for SAM-based lesion segmentation and then uses the resulting masks as spatial priors in a visual-radiomics fusion head for diagnosis.
