CheXpert: A Large Chest Radiograph Dataset with Uncertainty Labels and Expert Comparison

Andrew Y. Ng; Behzad Haghgoo; Bhavik N. Patel; Chris Chute; Curtis P. Langlotz; David A. Mong; David B. Larson; Henrik Marklund; Jayne Seekins; Jeremy Irvin

arXiv: 1901.07031 · v1 · pith:RGDC7DJQnew · submitted 2019-01-21 · cs.CV · cs.AI · cs.LG · eess.IV

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, Andrew Y. Ng

classification: cs.CV, cs.AI, cs.LG, eess.IV

keywords: chest, dataset, chexpert, different, large, model, performance, radiograph

Abstract

Large, labeled datasets have driven deep learning methods to achieve expert-level performance on a variety of medical imaging tasks. We present CheXpert, a large dataset that contains 224,316 chest radiographs of 65,240 patients. We design a labeler to automatically detect the presence of 14 observations in radiology reports, capturing uncertainties inherent in radiograph interpretation. We investigate different approaches to using the uncertainty labels for training convolutional neural networks that output the probability of these observations given the available frontal and lateral radiographs. On a validation set of 200 chest radiographic studies which were manually annotated by 3 board-certified radiologists, we find that different uncertainty approaches are useful for different pathologies. We then evaluate our best model on a test set composed of 500 chest radiographic studies annotated by a consensus of 5 board-certified radiologists, and compare the performance of our model to that of 3 additional radiologists in the detection of 5 selected pathologies. On Cardiomegaly, Edema, and Pleural Effusion, the model ROC and PR curves lie above all 3 radiologist operating points. We release the dataset to the public as a standard benchmark to evaluate performance of chest radiograph interpretation models. The dataset is freely available at https://stanfordmlgroup.github.io/competitions/chexpert.
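The abstract does not spell out the "different approaches to using the uncertainty labels", so the following is only a minimal sketch of what such approaches could look like. It assumes an uncertain label is encoded as -1, and it uses three hypothetical policies (treat uncertain as positive, treat it as negative, or drop it from the loss). The names `UNCERTAIN`, `apply_policy`, and `masked_bce` are illustrative and do not come from the paper or its code.

```python
import math

UNCERTAIN = -1  # assumed encoding for an uncertain label (not stated in the abstract)


def apply_policy(label, policy):
    """Map one raw label to a (target, weight) pair under a hypothetical policy."""
    if label != UNCERTAIN:
        return float(label), 1.0   # certain labels are used unchanged
    if policy == "ones":
        return 1.0, 1.0            # treat uncertain as positive
    if policy == "zeros":
        return 0.0, 1.0            # treat uncertain as negative
    if policy == "ignore":
        return 0.0, 0.0            # weight 0 removes the label from the loss
    raise ValueError(f"unknown policy: {policy}")


def masked_bce(probs, labels, policy):
    """Mean binary cross-entropy over the labels that the policy keeps."""
    total, kept = 0.0, 0.0
    for p, y in zip(probs, labels):
        target, weight = apply_policy(y, policy)
        p = min(max(p, 1e-7), 1 - 1e-7)  # clamp to avoid log(0)
        loss = -(target * math.log(p) + (1 - target) * math.log(1 - p))
        total += weight * loss
        kept += weight
    return total / kept if kept else 0.0


# Predicted probabilities for one observation across three studies,
# where the third study carries an uncertain label.
probs = [0.9, 0.2, 0.6]
labels = [1, 0, UNCERTAIN]
for policy in ("ones", "zeros", "ignore"):
    print(policy, round(masked_bce(probs, labels, policy), 4))
```

Because the policy is applied per label, a setup like this could in principle use a different policy for each of the 14 observations, which is one way to read the finding that different uncertainty approaches are useful for different pathologies.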

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CheXTemporal: A Dataset for Temporally-Grounded Reasoning in Chest Radiography
cs.CV · 2026-05 · accept · novelty 8.0

CheXTemporal supplies paired chest X-rays with explicit temporal progression taxonomy and spatial grounding to benchmark and improve models on longitudinal reasoning tasks.

CheXthought: A global multimodal dataset of clinical chain-of-thought reasoning and visual attention for chest X-ray interpretation
cs.CV · 2026-04 · unverdicted · novelty 7.0

CheXthought supplies large-scale expert chain-of-thought reasoning and synchronized visual attention data for chest X-rays to train more accurate and interpretable clinical vision-language models.

Scaling Vision Models Does Not Consistently Improve Localisation-Based Explanation Quality
cs.CV · 2026-05 · accept · novelty 6.0

Scaling vision models by depth and parameter count does not consistently improve localisation-based explanation quality across architectures, datasets, and post-hoc methods; smaller models often perform comparably or better.

M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation
cs.CV · 2024-08 · unverdicted · novelty 5.0

M4CXR is a multi-modal large language model that performs multiple tasks in chest X-ray analysis including report generation with claimed SOTA clinical accuracy using chain-of-thought prompting.

On-Device Continual Learning with Dual-Stage Buffer and Dynamic Loss for Point-of-Care Pneumonia Diagnosis
cs.LG · 2026-05 · unverdicted · novelty 4.0

PneumoNet uses a lightweight CNN, dual-stage balanced buffer, and dynamic class-weighted loss for domain-incremental learning on simulated PneumoniaMNIST shifts, reporting 86.6% accuracy and 1.4% forgetting.

Automatic Radiology Report Generation based on Multi-view Image Fusion and Medical Concept Enrichment
eess.IV · 2019-07 · unverdicted · novelty 4.0

An encoder-decoder model with multi-view late fusion and medical concept attention achieves claimed state-of-the-art performance on chest X-ray report generation using the Indiana University dataset.