pith. machine review for the scientific record.

arxiv: 2604.15555 · v1 · submitted 2026-04-16 · 💻 cs.CV

Recognition: unknown

CXR-LT 2026 Challenge: Multi-Center Long-Tailed and Zero Shot Chest X-ray Classification

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 10:44 UTC · model grok-4.3

classification 💻 cs.CV
keywords chest X-ray classification · long-tailed distribution · zero-shot learning · multi-center data · vision-language models · rare disease detection · open-world generalization · radiologist annotations

The pith

Vision-language foundation models improve chest X-ray classification on both known and unseen rare classes in a multi-center setting, though detecting rare findings under center shift remains challenging.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the CXR-LT 2026 challenge as a benchmark for long-tailed multi-label classification and zero-shot generalization in chest X-ray interpretation using a new multi-center dataset. Over 145,000 images from the PadChest and NIH Chest X-ray datasets are annotated by radiologists rather than labeled from reports, creating a more reliable evaluation. Two tasks are defined: robust classification of 30 known classes and open-world generalization to 6 unseen rare classes. Analysis of participating teams shows that vision-language foundation models improve performance on both in-distribution and zero-shot tasks. However, the results highlight ongoing difficulties in detecting rare findings when data shifts across medical centers.

Core claim

By providing a multi-center dataset with radiologist annotations and splitting it into 30 known classes for robust multi-label classification and 6 unseen rare classes for open-world generalization, the challenge reveals that vision-language foundation models improve both in-distribution and zero-shot performance, but detecting rare findings under multi-center shift remains challenging.

What carries the argument

The two-task benchmark of robust multi-label classification on 30 known pathology classes and open-world generalization to 6 unseen rare disease classes, backed by a multi-center dataset of over 145,000 radiologist-annotated images from the PadChest and NIH Chest X-ray datasets.
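The open-world half of the benchmark rests on zero-shot classification with a vision-language model: an image is scored against text prompts for classes never seen in training. The paper does not prescribe a specific procedure, so the following is a generic CLIP-style sketch with random stand-in embeddings; the 512-dimensional size, the prompt wording in the comment, and the 0.07 temperature are common defaults, not values taken from the paper.

```python
import numpy as np

def zero_shot_scores(image_emb, class_text_embs):
    """Score one image against per-class text prompts by cosine similarity.

    Generic CLIP-style sketch, not the procedure used by any specific team.
    Returns independent per-class probabilities, matching the multi-label
    setting (a chest X-ray can show several findings at once).
    """
    img = image_emb / np.linalg.norm(image_emb)
    txt = class_text_embs / np.linalg.norm(class_text_embs, axis=1, keepdims=True)
    sims = txt @ img                       # cosine similarity per class
    return 1 / (1 + np.exp(-sims / 0.07))  # sigmoid with a typical CLIP temperature

rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)
# In practice, prompts like "a chest x-ray showing pneumothorax" would be
# encoded by the model's text tower; random vectors stand in here.
class_text_embs = rng.normal(size=(6, 512))  # 6 unseen rare classes
scores = zero_shot_scores(image_emb, class_text_embs)
```

Because the classes are unseen, no classifier head is trained; everything the model knows about a rare finding must come from the text prompt, which is why prompt design mattered to participating teams.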

If this is right

  • Vision-language models provide measurable gains for both in-distribution and zero-shot tasks on known and rare chest X-ray pathologies.
  • Multi-center data shifts create persistent accuracy gaps specifically for rare disease classes.
  • Direct radiologist annotations yield a more trustworthy benchmark than report-derived labels for clinical evaluation.
  • AI development for chest X-ray must prioritize robustness to long-tailed distributions and novel findings across institutions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Models that succeed here are more likely to handle the variability seen in actual hospital networks with different scanners and populations.
  • The benchmark could be extended with temporal sequences or other modalities to probe generalization further.
  • Techniques focused on domain adaptation or targeted augmentation for rare classes may be needed to close the remaining multi-center gaps.

Load-bearing premise

Radiologist annotations on the combined multi-center dataset create a substantially more reliable and clinically relevant evaluation than labels extracted from radiology reports, and the 30 known plus 6 unseen class split with center divisions adequately represents real-world long-tailed open-world conditions.

What would settle it

If a model achieved high accuracy on the 6 unseen rare classes across all centers, with no notable drop relative to single-center tests, or if vision-language models showed no advantage over prior methods on this data, either result would test whether the multi-center shift challenge for rare findings is fundamental.

Figures

Figures reproduced from arXiv: 2604.15555 by Adam E. Flanders, Aina Tur-Serrano, Alan Clint Legasto, Ang Zu, Dohui Kim, Fengnian Zhao, Gabriel Moyà-Alcover, George Shih, Ha-Hieu Pham, Hao Chen, Hexin Dong, Huy-Hieu Pham, Huy Le Pham, Juno Cho, Justin Namuk Kim, Ky Trung Nguyen, Mingeon Kim, Mingquan Lin, Nikhileswara Rao Sulake, Pengyu Zhou, Ronald M. Summers, Ruichi Zhang, Sunwoo Kwak, Thanh-Huy Nguyen, Yifan Peng, Yi Lin, Yuzhe Yang, Zhiyong Lu.

Figure 1. The dataset of CXR-LT 2026. (a) The co-occurrence of labels in the training set. (b) The label …
Figure 2. Validation progress over time for Task 1 and Task 2.
Figure 3. Radar plots summarizing the main results of CXR-LT 2026 for the two challenge tasks.
Figure 4. Robustness analysis under test-time perturbations. Each panel compares model performance under …
Figure 5. Evaluation on held-out subsets for Task 1 and Task 2. Each panel compares model performance …
Figure 6. Generalization analysis across disease frequency and clinical centers. (a) Head-to-tail performance …
Original abstract

Chest X-ray (CXR) interpretation is hindered by the long-tailed distribution of pathologies and the open-world nature of clinical environments. Existing benchmarks often rely on closed-set classes from a single institution, failing to capture the prevalence of rare diseases or the appearance of novel findings. To address this, we present the CXR-LT challenge. The first event, CXR-LT 2023, established a large-scale benchmark for long-tailed multi-label CXR classification and identified key challenges in rare disease recognition. CXR-LT 2024 further expanded the label space and introduced a zero-shot task to study generalization to unseen findings. Building on the success of CXR-LT 2023 and 2024, this third iteration of the benchmark introduces a multi-center dataset comprising over 145,000 images from PadChest and NIH Chest X-ray datasets. Additionally, all development and test sets in CXR-LT 2026 are annotated by radiologists, providing a more reliable and clinically grounded evaluation than report-derived labels. The challenge defines two core tasks this year: (1) Robust Multi-Label Classification on 30 known classes and (2) Open-World Generalization to 6 unseen (out-of-distribution) rare disease classes. This paper summarizes the overview of the CXR-LT 2026 challenge. We describe the data collection and annotation procedures, analyze solution strategies adopted by participating teams, and evaluate head-versus-tail performance, calibration, and cross-center generalization gaps. Our results show that vision-language foundation models improve both in-distribution and zero-shot performance, but detecting rare findings under multi-center shift remains challenging. Our study provides a foundation for developing and evaluating AI systems in realistic long-tailed and open-world clinical conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the CXR-LT 2026 challenge, which provides a multi-center dataset of over 145,000 chest X-ray images from PadChest and NIH Chest X-ray, with all labels provided by radiologist annotations rather than report-derived NLP labels. It defines two tasks: (1) robust multi-label classification over 30 known classes and (2) open-world generalization to 6 unseen rare disease classes. The manuscript describes the data collection and annotation process, summarizes strategies from participating teams, and analyzes performance on head-versus-tail classes, calibration, and cross-center generalization gaps, concluding that vision-language foundation models improve both in-distribution and zero-shot performance while rare findings under multi-center shift remain challenging.

Significance. If the radiologist-annotated multi-center benchmark holds up under scrutiny, this work offers a valuable advance over prior single-center or report-derived CXR benchmarks by explicitly targeting long-tailed distributions and open-world generalization. The emphasis on cross-center shift and zero-shot rare classes, combined with analysis of VL model strengths and persistent tail-class failures, can usefully guide development of clinically deployable systems. The challenge format itself, with public participant outcomes, adds reproducibility value.

major comments (2)
  1. [Data Collection and Annotation Procedures] Data Collection and Annotation Procedures section: the claim that 'all development and test sets in CXR-LT 2026 are annotated by radiologists, providing a more reliable and clinically grounded evaluation than report-derived labels' is load-bearing for the central claim that observed performance gaps reflect model capability rather than benchmark artifacts. No supporting quantitative evidence is referenced, such as inter-rater agreement (Cohen's or Fleiss' kappa), number of annotators per image, adjudication protocol, or direct comparison against report labels on overlapping cases. Without this, the superiority of the new labels over prior work cannot be established.
  2. [Results and participant analysis] Results and participant analysis (abstract and evaluation sections): the summary states that 'vision-language foundation models improve both in-distribution and zero-shot performance' and that 'detecting rare findings under multi-center shift remains challenging,' yet provides no specific metrics, confidence intervals, statistical tests, or aggregation details for participant submissions. This leaves the magnitude and robustness of the reported improvements difficult to assess and weakens the evidential basis for the conclusions.
minor comments (2)
  1. [Abstract] Abstract: the exact total image count and the split between PadChest and NIH sources are given only approximately ('over 145,000'); providing the precise numbers and per-center breakdowns would improve clarity.
  2. [Challenge definition] The 30+6 class split and center-based train/test division are presented as representative of real-world long-tailed open-world conditions, but no supporting prevalence statistics or comparison to clinical distributions are supplied; a brief justification or reference would help.
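The referee's first major comment asks for inter-rater agreement such as Cohen's kappa. As a concrete illustration of what the requested statistic looks like for a single binary finding, here is a minimal implementation; the two raters' labels are hypothetical, not data from the challenge.

```python
import numpy as np

def cohens_kappa(a, b):
    """Cohen's kappa for two raters' binary labels on the same cases.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement
    and p_e is the agreement expected by chance from each rater's
    marginal label frequencies.
    """
    a, b = np.asarray(a), np.asarray(b)
    po = np.mean(a == b)                      # observed agreement
    pe = (np.mean(a == 1) * np.mean(b == 1)   # chance agreement
          + np.mean(a == 0) * np.mean(b == 0))
    return (po - pe) / (1 - pe)

# Hypothetical per-image presence/absence calls for one finding.
rater1 = [1, 1, 0, 0, 1, 0, 1, 0]
rater2 = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(rater1, rater2)  # 0.5 here: 75% raw agreement, 50% by chance
```

For more than two radiologists per image, Fleiss' kappa generalizes the same chance-correction idea, which is why the referee names both.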

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which helps us improve the clarity and rigor of the manuscript. We address each major comment point by point below and have incorporated revisions to strengthen the evidential basis where possible.

Point-by-point responses
  1. Referee: Data Collection and Annotation Procedures section: the claim that 'all development and test sets in CXR-LT 2026 are annotated by radiologists, providing a more reliable and clinically grounded evaluation than report-derived labels' is load-bearing for the central claim that observed performance gaps reflect model capability rather than benchmark artifacts. No supporting quantitative evidence is referenced, such as inter-rater agreement (Cohen's or Fleiss' kappa), number of annotators per image, adjudication protocol, or direct comparison against report labels on overlapping cases. Without this, the superiority of the new labels over prior work cannot be established.

    Authors: We agree that quantitative annotation quality metrics would provide stronger support for the reliability claim. The manuscript describes the radiologist annotation process for the multi-center dataset but does not include inter-rater statistics or direct comparisons. In the revision, we will expand the Data Collection and Annotation Procedures section to detail the number of board-certified radiologists involved, the standardized annotation protocol (including adjudication for disagreements), and any available agreement metrics from the process. We will also add a quantitative comparison of radiologist labels versus report-derived NLP labels on overlapping cases from the source datasets to demonstrate reduced noise. If full kappa values across all 145k images are not feasible due to scale, we will explicitly note this and reference supporting literature on the known error rates of report-derived labels. revision: yes

  2. Referee: Results and participant analysis (abstract and evaluation sections): the summary states that 'vision-language foundation models improve both in-distribution and zero-shot performance' and that 'detecting rare findings under multi-center shift remains challenging,' yet provides no specific metrics, confidence intervals, statistical tests, or aggregation details for participant submissions. This leaves the magnitude and robustness of the reported improvements difficult to assess and weakens the evidential basis for the conclusions.

    Authors: We concur that the absence of specific quantitative details weakens the conclusions. The current manuscript offers a high-level summary of participant strategies and qualitative trends from the challenge. In the revised version, we will expand the Results and participant analysis section (and update the abstract accordingly) to report concrete metrics, including mean AUC and F1 scores for vision-language models versus other approaches on the 30-class in-distribution task, zero-shot performance on the 6 unseen rare classes, head-versus-tail breakdowns, calibration errors, and cross-center gaps. We will include 95% confidence intervals, details on aggregation across submissions, and any statistical tests performed to support claims of improvement and persistent challenges. revision: yes
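The rebuttal commits to reporting 95% confidence intervals for metrics such as AUC. A standard way to obtain them is a nonparametric bootstrap over test cases; the sketch below uses a pairwise rank formulation of AUC and synthetic scores, since the challenge data and the authors' exact resampling protocol are not available here.

```python
import numpy as np

def auc(y_true, y_score):
    """AUC as the fraction of (positive, negative) pairs ranked correctly,
    counting ties as half. Equivalent to the Mann-Whitney U formulation."""
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    gt = (pos[:, None] > neg[None, :]).mean()
    eq = (pos[:, None] == neg[None, :]).mean()
    return gt + 0.5 * eq

def bootstrap_ci(y_true, y_score, n_boot=1000, seed=0):
    """95% percentile bootstrap CI for AUC, resampling cases with replacement."""
    rng = np.random.default_rng(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, n)
        yt, ys = y_true[idx], y_score[idx]
        if yt.min() == yt.max():  # resample drew only one class; skip it
            continue
        stats.append(auc(yt, ys))
    return np.percentile(stats, [2.5, 97.5])

# Synthetic scores: positives shifted up by one standard deviation.
rng = np.random.default_rng(1)
y_true = np.array([0] * 50 + [1] * 50)
y_score = np.concatenate([rng.normal(0, 1, 50), rng.normal(1, 1, 50)])
lo, hi = bootstrap_ci(y_true, y_score)
```

Resampling whole cases, rather than per-class predictions, preserves the correlation between labels on the same image, which matters in a multi-label setting.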

Circularity Check

0 steps flagged

No circularity: descriptive challenge overview with no derivations or self-referential reductions

Full rationale

The paper is a benchmark challenge description that defines tasks, reports data collection/annotation procedures, and summarizes participant-submitted results on in-distribution and zero-shot performance. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text or abstract. Claims about radiologist annotations providing more reliable labels than report-derived ones are unsupported assertions (a potential correctness gap), but they do not reduce any result to the inputs by construction, nor do they rely on self-citation chains, uniqueness theorems, or ansatzes. Prior CXR-LT iterations are referenced only for context, not as load-bearing justification for the current claims. The central statements about VL models improving performance are empirical summaries of external submissions, not internally forced quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark overview paper containing no mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.0 · 5743 in / 1125 out tokens · 36239 ms · 2026-05-10T10:44:06.042454+00:00 · methodology

discussion (0)

