A unified multi-task framework enables interpretable chest radiograph analysis

Hongsheng Li; Lijian Xu; Shaoting Zhang; Xiaosong Wang; Xinglong Liu; Ziyu Ni

arxiv: 2606.03417 · v1 · pith:TCJZQU7Wnew · submitted 2026-06-02 · 💻 cs.CV

A unified multi-task framework enables interpretable chest radiograph analysis

Lijian Xu , Ziyu Ni , Xinglong Liu , Xiaosong Wang , Hongsheng Li , Shaoting Zhang This is my paper

Pith reviewed 2026-06-28 10:54 UTC · model grok-4.3

classification 💻 cs.CV

keywords chest X-raymulti-task learningtransformerinterpretable AIradiology report generationdisease classificationlesion localizationanatomical segmentation

0 comments

The pith

A unified transformer sequentially executes four chest X-ray tasks to create traceable diagnostic pathways from images to reports.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a framework that runs disease classification, lesion localization, anatomical segmentation, and report generation through one transformer model tuned on medical instructions. This sequence is meant to copy the steps a radiologist takes, so each output can be traced back to earlier steps rather than appearing as an opaque guess. The authors show the model reaches competitive scores on ten standard benchmarks whether used directly or after fine-tuning. In a test with 160 real reports from four centers, radiologists judged two-thirds of the model's reports as equal to or clearer than the originals. The central goal is to move AI from isolated black-box predictions toward outputs that match how clinicians actually reason.

Core claim

The central claim is that a single transformer architecture, after medical-domain instruction tuning, can carry out multi-label disease classification, lesion localization, anatomical segmentation, and radiology report generation in sequence, thereby producing evidence-integrated reports whose decision pathways can be followed from anatomical findings to final conclusions.

What carries the argument

The unified transformer that sequentially executes the four tasks after medical-domain instruction tuning, linking image features through intermediate outputs to the final report.

Load-bearing premise

That running the four tasks in sequence inside one model creates genuinely traceable pathways without hidden interference between tasks or leakage from training data.

What would settle it

A review in which radiologists examine the model's intermediate localization and segmentation maps and find they do not support the reasoning stated in the generated report.

Figures

Figures reproduced from arXiv: 2606.03417 by Hongsheng Li, Lijian Xu, Shaoting Zhang, Xiaosong Wang, Xinglong Liu, Ziyu Ni.

**Figure 1.** Figure 1: (a) Comparative clinical workflows: Conventional chest X-ray diagnosis relies on radiologists’ systematic evaluation of all anatomical structures (normal/abnormal) to prevent oversight, whereas the proposed AI-simulated workflow emulates this cognitive rigor through three stages: 1) Disease identification; 2) Attribute analysis (dashed box: disease characteristics [size/location/severity]; solid box: the s… view at source ↗

**Figure 2.** Figure 2: Overview of IMT-CXR framework. (a) Multimodal inputs (image and task-specific instructions) and multitask outputs (classification, localization, segmentation, report generation). (b) CXR-VQA training dataset: Disease attributes (severity, location) extracted from radiology reports. (c) Performance comparison across datasets and tasks: Metrics outside/inside the circle denote superiority/inferiority to ben… view at source ↗

**Figure 3.** Figure 3: (a) In domain evaluation of 26 disease entities on the MIMIC-CXR dataset (upper panel) and attribute classification task in disease severity level and location (lower panel). (b) Out-of-domain generalization evaluation between IMT-CXR and other classification models (i.e., ConVIRT, GloRIA, BioViL, and MedKLIP) on the ChestXray14 dataset. AUC score and F1 are utilized to assess the classification task. The… view at source ↗

**Figure 4.** Figure 4: Assessment of the lesion localization task with ACC and mIoU metrics on the MS-CXR and ChestX-ray14 datasets with 20-shot fine-tuning setting. (a) Examples with bounding box prediction and (b) comparisons between IMT-CXR and other methods. Baseline methods incorporate location-aware queries for fair comparison [51, 52]. Error bars indicate 95% confidence intervals and bar centers represent mean ACC/mIoU va… view at source ↗

**Figure 5.** Figure 5: IMT-CXR segmentation performance for disease severity assessment [61] to quantify cardiomegaly severity. (b) Pneumothorax segmentation: Pneumothorax ratio (PCR) is derived from segmentation to quantify pneumothorax severity. Left panels: Box plots compare CTR/PCR predictions (IMT-CXR vs.GloRIA) against ground-truth distributions across severity grades. 2.4 IMT-CXR enhances the interpretability of report g… view at source ↗

**Figure 6.** Figure 6: (a) Impact of task-specific prompts on report generation. Case studies: pneumothorax and pleural effusion reports generated with vs. without clinical attribute prompts (category, location, severity). Highlighted regions: light blue/yellow = correct predictions (e.g., ’small left apical pneumothorax’); magenta = errors/missing attributes (e.g., severity omitted in baseline reports). (b) Quantitative compari… view at source ↗

**Figure 7.** Figure 7: Side-by-side comparison. Three radiologists reviewed and scored the clinically derived reports from four centers and reports generated by IMT-CXR. The error bars show 95% confidence intervals, and the bars center represents the mean value of the average scores. Multi-center Radiologist Evaluation To assess the clinical interpretation, three radiologists, ranging from junior or senior levels are invited to… view at source ↗

**Figure 8.** Figure 8: Independent evaluation of omissions and clinical errors for reports generated by IMT-CXR. Significant errors are related to the presence, location, or severity of clinical findings, which are identified by radiologists. The error bars show 95% confidence intervals, and the bars center represents the mean value. AI-generated reports being rated equal or superior to radiologist-authored counterparts in blind… view at source ↗

read the original abstract

While multimodal deep learning has advanced medical imaging analysis, existing black-box systems \textcolor{black}{may remain confined to isolated tasks, often overlooking} the trust-sensitive nature of clinical diagnosis as a multi-task process. We propose IMT-CXR (Interpretable Multi-task Transformer for Chest X-ray Analysis), a framework that emulates radiologists' diagnostic workflow through three evidence-driven stages: 1) Disease recognition; 2) Attribute characterization (e.g., size, location, severity quantification); 3) Evidence-integrated report generation with traceable decision pathways. The framework employs a unified transformer architecture optimized via medical-domain instruction tuning, sequentially executing four clinical tasks: multi-label disease classification, lesion localization, anatomical segmentation, and radiology report generation. Experimental validation demonstrates competitive performance on ten CXR benchmarks under direct inference and fine-tuning settings. In a blinded evaluation of 160 historical reports from four medical centers, three radiologists rated 66\% of AI-generated reports as comparable to or surpassing original clinical reports in diagnostic clarity, highlighting the framework's translational potential. By establishing traceable diagnostic pathways from anatomical findings to conclusions, this work bridges the gap between AI technical metrics and clinical utility, advancing trustworthy AI systems in medical imaging.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives a practical multi-task CXR setup with sequential tasks and a multi-center human study on report quality, but the evidence for traceable pathways without interference is thin.

read the letter

The paper's key offering is a multi-task transformer called IMT-CXR that sequences four tasks for chest X-rays—multi-label classification, lesion localization, anatomical segmentation, and report generation—using a unified architecture with medical instruction tuning. It shows competitive results on ten benchmarks and gets a 66% thumbs up from radiologists on the quality of its generated reports compared to real ones in a blinded multi-center study.

The human evaluation is the strongest part. Pulling 160 historical reports from four medical centers and having three radiologists compare the AI versions for diagnostic clarity is a solid way to test translational value. The 66% rate where AI matches or beats the originals gives a practical sense of how close it is to usable.

The three-stage framing—disease recognition, attribute characterization, then evidence-integrated reports—is a reasonable way to structure the workflow and try to make the outputs more traceable.

The main weakness is the lack of supporting details for the interpretability claims. The abstract and setup describe the sequential execution as producing traceable pathways, but there are no ablations shown for joint versus separate training, no metrics on whether tasks interfere with each other, and no checks like attention maps aligning with actual lesions. This leaves the "emulating radiologists without hidden biases" part resting on the architecture rather than evidence. The stress-test note flags exactly this gap.

This work is for people in medical imaging AI who are trying to move beyond single-task models toward integrated systems that output reports. A reader focused on clinical applicability would find the human study useful even if the technical novelty is incremental.

I would bring it to a reading group for the evaluation discussion. It deserves peer review because the human validation adds real information, though referees will likely ask for more quantitative backing on the multi-task benefits and interpretability.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes IMT-CXR, a unified transformer framework for chest X-ray analysis that emulates radiologists' workflow via sequential execution of four tasks (multi-label disease classification, lesion localization, anatomical segmentation, and radiology report generation) using medical-domain instruction tuning. It claims this yields traceable decision pathways from anatomical findings to conclusions, with competitive performance on ten CXR benchmarks under direct inference and fine-tuning, and a blinded study where three radiologists rated 66% of 160 AI-generated reports from four centers as comparable or superior to original clinical reports in diagnostic clarity.

Significance. If substantiated, the work could advance trustworthy AI in medical imaging by connecting multi-task technical performance to clinical interpretability and workflow emulation, addressing the trust gap in black-box systems.

major comments (2)

[Abstract] Abstract: The statement that 'experimental validation demonstrates competitive performance on ten CXR benchmarks' supplies no quantitative metrics, baselines, error bars, or optimization details for the multi-task setup, preventing verification that the data support the central claims of competitiveness and translational potential.
[Experimental validation] Experimental validation paragraph: No ablation (joint vs. sequential training) or proxy metric (e.g., attention alignment with lesion masks, path consistency across tasks) is reported to isolate whether the unified transformer produces interference-free traceable pathways, which is load-bearing for the claim that sequential execution emulates radiologist workflow without hidden biases or data leakage.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point-by-point below, proposing revisions where they strengthen the manuscript without misrepresenting our results.

read point-by-point responses

Referee: [Abstract] Abstract: The statement that 'experimental validation demonstrates competitive performance on ten CXR benchmarks' supplies no quantitative metrics, baselines, error bars, or optimization details for the multi-task setup, preventing verification that the data support the central claims of competitiveness and translational potential.

Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised manuscript we will expand the abstract to report key metrics (e.g., AUC ranges, mAP, Dice scores, and report-generation BLEU/METEOR scores) together with the primary baselines used, while preserving the existing word limit and overall claims. revision: yes
Referee: [Experimental validation] Experimental validation paragraph: No ablation (joint vs. sequential training) or proxy metric (e.g., attention alignment with lesion masks, path consistency across tasks) is reported to isolate whether the unified transformer produces interference-free traceable pathways, which is load-bearing for the claim that sequential execution emulates radiologist workflow without hidden biases or data leakage.

Authors: The referee is correct that the current manuscript does not contain an explicit joint-versus-sequential ablation or quantitative proxy metrics for pathway consistency. While the multi-task benchmark results and the blinded radiologist study provide indirect support for the absence of catastrophic interference, we acknowledge that direct evidence would strengthen the interpretability claim. We will therefore add (i) a controlled ablation comparing joint multi-task training against sequential task-specific fine-tuning and (ii) a proxy analysis of attention-map overlap with lesion localization masks in the revised experimental section. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework with no derivation chain

full rationale

The manuscript describes a multi-task transformer framework (IMT-CXR) that sequentially performs disease classification, lesion localization, segmentation, and report generation via medical instruction tuning. All central claims rest on reported benchmark scores across ten CXR datasets and a blinded radiologist preference study (66% of generated reports rated comparable or better). No equations, fitted parameters renamed as predictions, self-citation load-bearing uniqueness theorems, or ansatzes appear in the abstract or described experiments. The architecture is presented as an engineering choice whose value is assessed externally by performance metrics, rendering the work self-contained with no reduction of outputs to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no information on free parameters, axioms, or invented entities.

pith-pipeline@v0.9.1-grok · 5756 in / 1036 out tokens · 32918 ms · 2026-06-28T10:54:28.521075+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learnable Token Sparsification for Efficient Gigapixel Whole Slide Image Reasoning
cs.CV 2026-06 unverdicted novelty 6.0

Learnable sparsification framework compresses WSI visual tokens to 32 (0.78% of original) via SparseLearn, achieving 73.32% accuracy on SlideBench (TCGA) and outperforming baselines.

Reference graph

Works this paper leans on

84 extracted references · 6 linked inside Pith · cited by 1 Pith paper

[1]

A foundation model for generalizable disease diagnosis in chest x-ray images

Lijian Xu, Ziyu Ni, Hao Sun, Hongsheng Li, and Shaoting Zhang. A foundation model for generalizable disease diagnosis in chest x-ray images. arXiv preprint arXiv:2410.08861, 2024

arXiv 2024
[2]

Medvilam: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation

Lijian Xu, Hao Sun, Ziyu Ni, Hongsheng Li, and Shaoting Zhang. Medvilam: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation. arXiv preprint arXiv:2409.19684, 2024

arXiv 2024
[3]

Efﬁcient chest x-ray representation learning via semantic- partitioned contrastive learning

Wangyu Feng, Shawn Y oung, and Lijian Xu. Efﬁcient chest x-ray representation learning via semantic- partitioned contrastive learning. arXiv preprint arXiv:2603.07113, 2026. 18 LEARNING A MULTI -TASK TRANSFORMER FOR CHEST RADIOGRAPH INTERPRETATION

arXiv 2026
[4]

Xrayclaw: Cooperative-competitive multi-agent alignment for trustworthy chest x-ray diagnosis

Shawn Y oung and Lijian Xu. Xrayclaw: Cooperative-competitive multi-agent alignment for trustworthy chest x-ray diagnosis. arXiv preprint arXiv:2604.02695, 2026

Pith/arXiv arXiv 2026
[5]

An image is worth 16x16 words: Transformers for image recognition at scale, 2021

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021

2021
[6]

Swin trans- former: Hierarchical vision transformer using shifted windows

Ze Liu, Y utong Lin, Y ue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin trans- former: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision , pages 10012–10022, 2021

2021
[7]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

Pith/arXiv arXiv 2018
[8]

Llama: Open and efﬁcient foundation language models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Bap- tiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efﬁcient foundation language models. arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023
[9]

Palm: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research , 24(240):1–113, 2023

2023
[10]

Glam: Efﬁcient scaling of language models with mixture-of-experts

Nan Du, Y anping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Y uanzhong Xu, Maxim Krikun, Y anqi Zhou, Adams Wei Y u, Orhan Firat, et al. Glam: Efﬁcient scaling of language models with mixture-of-experts. In International Conference on Machine Learning , pages 5547–5569. PMLR, 2022

2022
[11]

Glm-130b: An open bilingual pre-trained model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Y ang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414 , 2022

Pith/arXiv arXiv 2022
[12]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sand- hini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744, 2022

2022
[13]

Towards expert-level medical question answering with large language models

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023

Pith/arXiv arXiv 2023
[14]

One leaf reveals the season: Occlusion-based contrastive learning with semantic-aware views for efﬁcient visual representation

Xiaoyu Y ang, Lijian Xu, Hongsheng Li, and Shaoting Zhang. One leaf reveals the season: Occlusion-based contrastive learning with semantic-aware views for efﬁcient visual representation. In International Conference on Machine Learning, pages 71425–71440, 2025

2025
[15]

Fewer tokens, greater scaling: Self-adaptive visual bases for efﬁcient and expansive representation learning

Shawn Y oung, Xingyu Zeng, and Lijian Xu. Fewer tokens, greater scaling: Self-adaptive visual bases for efﬁcient and expansive representation learning. arXiv preprint arXiv:2511.19515, 2026

arXiv 2026
[16]

The model knows which tokens matter:automatic token selection via noise gating

Landi He, Xiaoyu Y ang, and Lijian Xu. The model knows which tokens matter:automatic token selection via noise gating. arXiv preprint arXiv:2603.07135, 2026

arXiv 2026
[17]

Segmentation and vascular vectorization for coronary artery by geometry-based cascaded neural network

Xiaoyu Y ang, Lijian Xu, Simon Y u, Qing Xia, Hongsheng Li, and Shaoting Zhang. Segmentation and vascular vectorization for coronary artery by geometry-based cascaded neural network. IEEE Transactions on Medical Imaging, 44(1):259–269, 2024

2024
[18]

Zhao, Kelvin Guu, Adams Wei Y u, Brian Lester, Nan Du, Andrew M

Jason Wei, Maarten Bosma, Vincent Y . Zhao, Kelvin Guu, Adams Wei Y u, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V . Le. Finetuned language models are zero-shot learners, 2022

2022
[19]

Scaling instruction-ﬁnetuned language models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Y unxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-ﬁnetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024

2024
[20]

Zerosense: How vision matters in long context compression

Y onghan Gao, Zehong Chen, Lijian Xu, Jingzhi Chen, Jingwei Guan, and Xingyu Zeng. Zerosense: How vision matters in long context compression. arXiv preprint arXiv:2603.11846, 2026

arXiv 2026
[21]

Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

2023
[22]

Scalar: Spatial- concept alignment for robust vision in harsh open world

Xiaoyu Y ang, Lijian Xu, Xingyu Zeng, Xiaosong Wang, Hongsheng Li, and Shaoting Zhang. Scalar: Spatial- concept alignment for robust vision in harsh open world. Pattern Recognition, page 113203, 2026

2026
[23]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Y ong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 19 LEARNING A MULTI -TASK TRANSFORMER FOR CHEST RADIOGRAPH INTERPRETATION

2024
[24]

Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction

Laila Rasmy, Y ang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi. Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ digital medicine, 4(1):86, 2021

2021
[25]

Large language models encode clinical knowledge

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

2023
[26]

Foundation models for generalist medical artiﬁcial intelligence

Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artiﬁcial intelligence. Nature, 616(7956):259– 265, 2023

2023
[27]

A generalist vision–language foundation model for diverse biomedical tasks

Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Y an, Yixin Liu, Jun Y u, Zhengliang Liu, Xun Chen, Brian D Davison, Hui Ren, et al. A generalist vision–language foundation model for diverse biomedical tasks. Nature Medicine, pages 1–13, 2024

2024
[28]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Y ang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems , 36, 2024

2024
[29]

Mimic-cxr, a de-identiﬁed publicly available database of chest radiographs with free-text reports

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih- ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identiﬁed publicly available database of chest radiographs with free-text reports. Scientiﬁc data, 6(1):317, 2019

2019
[30]

Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations

Ha Q Nguyen, Khanh Lam, Linh T Le, Hieu H Pham, Dat Q Tran, Dung B Nguyen, Dung D Le, Chi M Pham, Hang TT Tong, Diep H Dinh, et al. Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations. Scientiﬁc Data, 9(1):429, 2022

2022
[31]

A structure-aware relation network for thoracic diseases detection and segmentation

Jie Lian, Jingyu Liu, Shu Zhang, Kai Gao, Xiaoqing Liu, Dingwen Zhang, and Yizhou Y u. A structure-aware relation network for thoracic diseases detection and segmentation. IEEE Transactions on Medical Imaging , 40(8):2042–2052, 2021

2042
[32]

Chexmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images

Nicolas Gaggion, Candelaria Mosquera, Lucas Mansilla, Julia Mariel Saidman, Martina Aineseder, Diego H Milone, and Enzo Ferrante. Chexmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images. Scientiﬁc Data, 11(1):511, 2024

2024
[33]

Siim-acr pneumothorax segmentation

Anna Zawacki, Carol Wu, George Shih, Julia Elliott, Mikhail Fomitchev, Mohannad Hussain, ParasLakhani, Phil Culliton, and Shunxing Bao. Siim-acr pneumothorax segmentation. Kaggle, 2019

2019
[34]

Chestx- ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classiﬁcation and localization of common thorax diseases

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx- ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classiﬁcation and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 2097–2106, 2017

2097
[35]

Chexpert: A large chest radiograph dataset with uncer- tainty labels and expert comparison

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Y u, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncer- tainty labels and expert comparison. In Proceedings of the AAAI conference on artiﬁcial intelligence , volume 33, pages 590–597, 2019

2019
[36]

Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia

George Shih, Carol C Wu, Safwan S Halabi, Marc D Kohli, Luciano M Prevedello, Tessa S Cook, Arjun Sharma, Judith K Amorosa, V eronica Arteaga, Maya Galperin-Aizenberg, et al. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiology: Artiﬁcial Intelligence, 1(1):e180041, 2019

2019
[37]

Junji Shiraishi, Shigehiko Katsuragawa, Junpei Ikezoe, Tsuneo Matsumoto, Takeshi Kobayashi, Ken-ichi Ko- matsu, Mitate Matsui, Hiroshi Fujita, Y oshie Kodera, and Kunio Doi. Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists’ detection of pulmonary nodule...

2000
[38]

Making the most of text semantics to improve biomedical vision–language processing

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-V alle, et al. Making the most of text semantics to improve biomedical vision–language processing. In European conference on computer vision , pages 1–21. Springer, 2022

2022
[39]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607. PMLR, 2020

2020
[40]

Contrastive learning of medical visual representations from paired images and text

Y uhao Zhang, Hang Jiang, Y asuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. In Machine Learning for Healthcare Conference , pages 2–25. PMLR, 2022. 20 LEARNING A MULTI -TASK TRANSFORMER FOR CHEST RADIOGRAPH INTERPRETATION

2022
[41]

Lungren, and Serena Y eung

Shih-Cheng Huang, Liyue Shen, Matthew P . Lungren, and Serena Y eung. Gloria: A multimodal global-local representation learning framework for label-efﬁcient medical image recognition. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3922–3931, 2021

2021
[42]

Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports

Hong-Y u Zhou, Xiaoyu Chen, Yinghao Zhang, Ruibang Luo, Liansheng Wang, and Yizhou Y u. Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports. Nature Machine Intelligence, 4(1):32–40, 2022

2022
[43]

Multimodal masked autoencoders learn transferable representations, 2022

Xinyang Geng, Hao Liu, Lisa Lee, Dale Schuurmans, Sergey Levine, and Pieter Abbeel. Multimodal masked autoencoders learn transferable representations, 2022

2022
[44]

Multi-granularity cross-modal alignment for generalized medical visual representation learning

Fuying Wang, Y uyin Zhou, Shujun Wang, V arut V ardhanabhuti, and Lequan Y u. Multi-granularity cross-modal alignment for generalized medical visual representation learning. Advances in Neural Information Processing Systems, 35:33536–33549, 2022

2022
[45]

Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis

Chaoyi Wu, Xiaoman Zhang, Y a Zhang, Y anfeng Wang, and Weidi Xie. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. In ICCV, pages 21372–21383, 2023

2023
[46]

Delving into masked autoencoders for multi-label thorax disease classiﬁcation

Junfei Xiao, Y utong Bai, Alan Y uille, and Zongwei Zhou. Delving into masked autoencoders for multi-label thorax disease classiﬁcation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3588–3600, 2023

2023
[47]

Advancing radiograph representation learning with masked record modeling

Hong-Y u Zhou, Chenyu Lian, Liansheng Wang, and Yizhou Y u. Advancing radiograph representation learning with masked record modeling. In The Eleventh International Conference on Learning Representations , 2023

2023
[48]

Visual grounding with transformers.2022 IEEE International Conference on Multimedia and Expo , pages 1–6, 2022

Y e Du, Zehua Fu, Qingjie Liu, and Y unhong Wang. Visual grounding with transformers.2022 IEEE International Conference on Multimedia and Expo , pages 1–6, 2022

2022
[49]

Seqtr: A simple yet universal network for visual grounding

Chaoyang Zhu, Yiyi Zhou, Y unhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, and Rongrong Ji. Seqtr: A simple yet universal network for visual grounding. In European Conference on Computer Vision, pages 598–615. Springer, 2022

2022
[50]

Transvg: End-to-end visual grounding with transformers

Jiajun Deng, Zhengyuan Y ang, Tianlang Chen, Wengang Zhou, and Houqiang Li. Transvg: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1769–1779, October 2021

2021
[51]

Medical phrase grounding with region-phrase context contrastive alignment

Zhihao Chen, Y ang Zhou, Anh Tran, Junting Zhao, Liang Wan, Gideon Su Kai Ooi, Lionel Tim-Ee Cheng, Choon Hua Thng, Xinxing Xu, Y ong Liu, et al. Medical phrase grounding with region-phrase context contrastive alignment. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 371–381. Springer, 2023

2023
[52]

Chex: Interactive localization and region description in chest x-rays

Philip Muller, Georgios Kaissis, and Daniel Rueckert. Chex: Interactive localization and region description in chest x-rays. In European Conference on Computer Vision, pages 92–111. Springer, 2025

2025
[53]

Knowledge-enhanced visual-language pre-training on chest radiology images

Xiaoman Zhang, Chaoyi Wu, Y a Zhang, Weidi Xie, and Y anfeng Wang. Knowledge-enhanced visual-language pre-training on chest radiology images. Nature Communications, 14(1):4542, 2023

2023
[54]

Referring transformer: A one-step approach to multi-task visual grounding

Muchen Li and Leonid Sigal. Referring transformer: A one-step approach to multi-task visual grounding. Ad- vances in neural information processing systems , 34:19652–19664, 2021

2021
[55]

Bottom-up and top-down attention for image captioning and visual question answering

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077– 6086, 2018

2018
[56]

Self-critical sequence training for image captioning

Steven J Rennie, Etienne Marcheret, Y oussef Mroueh, Jerret Ross, and V aibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7008–7024, 2017

2017
[57]

Cross-modal memory networks for radiology report generation

Zhihong Chen, Y aling Shen, Y an Song, and Xiang Wan. Cross-modal memory networks for radiology report generation. In ACL-IJCNLP V olume 1: Long Papers, pages 5904–5914, 2021

2021
[58]

Con- trastive attention for automatic chest x-ray report generation, 2023

Fenglin Liu, Changchang Yin, Xian Wu, Shen Ge, Y uexian Zou, Ping Zhang, Y uexian Zou, and Xu Sun. Con- trastive attention for automatic chest x-ray report generation, 2023

2023
[59]

Interactive and explainable region-guided radiology report generation

Tim Tanida, Philip Muller, Georgios Kaissis, and Daniel Rueckert. Interactive and explainable region-guided radiology report generation. In CVPR, pages 7433–7442, 2023

2023
[60]

Multimodal image-text matching improves retrieval-based chest x-ray report generation

Jaehwan Jeong, Katherine Tian, Andrew Li, Sina Hartung, Subathra Adithan, Fardad Behzadi, Juan Calle, David Osayande, Michael Pohlen, and Pranav Rajpurkar. Multimodal image-text matching improves retrieval-based chest x-ray report generation. In Medical Imaging with Deep Learning , pages 978–990. PMLR, 2024. 21 LEARNING A MULTI -TASK TRANSFORMER FOR CHEST...

2024
[61]

A deep-learning-based framework for identifying and localizing multiple abnormalities and assessing cardiomegaly in chest x-ray

Weijie Fan, Yi Y ang, Jing Qi, Qichuan Zhang, Cuiwei Liao, Li Wen, Shuang Wang, Guangxian Wang, Y u Xia, Qihua Wu, et al. A deep-learning-based framework for identifying and localizing multiple abnormalities and assessing cardiomegaly in chest x-ray. Nature Communications, 15(1):1347, 2024

2024
[62]

Maira-2: Grounded radiology report generation

Shruthi Bannur, Kenza Bouzid, Daniel C Castro, Anton Schwaighofer, Sam Bond-Taylor, Maximilian Ilse, Fer- nando Perez-Garcia, V alentina Salvatelli, Harshita Sharma, Felix Meissen, et al. Maira-2: Grounded radiology report generation. arXiv preprint arXiv:2406.04449, 2024

arXiv 2024
[63]

Llava-rad mimic-cxr annotations

Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Y anbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Y ujia Xie, Mahmoud Khademi, Ziyi Y ang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Y ang, Chunyuan Li, Jianfeng Gao, Y u Gu, Cliff Wong, Mu-Hsin Wei, Tristan Naumann, Muhao Chen, Matthew Lun- gren, Akshay Chaudhari, Serena Y eung, Curtis Langlot...

2025
[64]

Cxr-llava: a multimodal large language model for interpreting chest x-ray images

Seowoo Lee, Jiwon Y oun, Hyungjin Kim, Mansu Kim, and Soon Ho Y oon. Cxr-llava: a multimodal large language model for interpreting chest x-ray images. European Radiology, pages 1–13, 2025

2025
[65]

Chexagent: Towards a foundation model for chest x-ray interpretation

Zhihong Chen, Maya V arma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave V an V een, Jeya Maria Jose V alanarasu, Alaa Y oussef, Joseph Paul Cohen, Eduardo Pontes Reis, et al. Chexagent: Towards a foundation model for chest x-ray interpretation. In AAAI 2024 Spring Symposium on Clinical F oundation Models, 2024

2024
[66]

A deepseek-powered ai system for automated chest radiograph interpretation in clinical practice

Y aowei Bai, Ruiheng Zhang, Y u Lei, Xuhua Duan, Jingfeng Y ao, Shuguang Ju, Chaoyang Wang, Wei Y ao, Yiwan Guo, Guilin Zhang, et al. A deepseek-powered ai system for automated chest radiograph interpretation in clinical practice. arXiv preprint arXiv:2512.20344, 2025

arXiv 2025
[67]

A generalist learner for multifaceted medical image interpretation

Hong-Y u Zhou, Subathra Adithan, Julian Nicolas Acosta, Eric J Topol, and Pranav Rajpurkar. A generalist learner for multifaceted medical image interpretation. arXiv preprint arXiv:2405.07988, 2024

arXiv 2024
[68]

Tc-ssa: Token compression via semantic slot aggregation for gigapixel pathology reasoning

Zhuo Chen, Shawn Y oung, and Lijian Xu. Tc-ssa: Token compression via semantic slot aggregation for gigapixel pathology reasoning. arXiv preprint arXiv:2603.01143, 2026

arXiv 2026
[69]

Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework, 2022

Peng Wang, An Y ang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Y ang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework, 2022

2022
[70]

Pix2seq: A language modeling framework for object detection

Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In International Conference on Learning Representations , 2021

2021
[71]

Uniﬁed-io: A uniﬁed model for vision, language, and multi-modal tasks

Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Uniﬁed-io: A uniﬁed model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations, 2022

2022
[72]

Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, V es Stoy- anov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019

Pith/arXiv arXiv 1910
[73]

Improving language understanding by generative pre-training

Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018

2018
[74]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages 311–318, 2002

2002
[75]

Rouge: A package for automatic evaluation of summaries

Chin-Y ew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004

2004
[76]

Evaluating progress in automatic chest x-ray radiology report generation

Feiyang Y u, Mark Endo, Rayan Krishnan, Ian Pan, Andy Tsai, Eduardo Pontes Reis, Eduardo Kaiser Uru- rahy Nunes Fonseca, Henrique Min Ho Lee, Zahra Shakeri Hossein Abad, Andrew Y Ng, et al. Evaluating progress in automatic chest x-ray radiology report generation. Patterns, 4(9), 2023

2023
[77]

Elixr: Towards a general purpose x-ray artiﬁcial in- telligence system through alignment of large language models and radiology vision encoders

Shawn Xu, Lin Y ang, Christopher Kelly, Marcin Sieniek, Timo Kohlberger, Martin Ma, Wei-Hung Weng, At- tila Kiraly, Sahar Kazemzadeh, Zakkai Melamed, et al. Elixr: Towards a general purpose x-ray artiﬁcial in- telligence system through alignment of large language models and radiology vision encoders. arXiv preprint arXiv:2308.01317, 2023

arXiv 2023
[78]

A foundation model for generalizable disease detection from retinal images

Y ukun Zhou, Mark A Chia, Siegfried K Wagner, Murat S Ayhan, Dominic J Williamson, Robbert R Struyven, Timing Liu, Moucheng Xu, Mateo G Lozano, Peter Woodward-Court, et al. A foundation model for generalizable disease detection from retinal images. Nature, 622(7981):156–163, 2023. 22 LEARNING A MULTI -TASK TRANSFORMER FOR CHEST RADIOGRAPH INTERPRETATION A...

2023
[79]

What disease does this image have?

Disease Classiﬁcation Dataset includes entity information across 193 categories and 0.24M images. For the entity classiﬁcation task, the instruction is "What disease does this image have?". The answer includes all possible diseases present in the data, such as "pneumonia" and "atelectasis.", "Is Pneumonia in this image?". The response can be either "yes" ...
[80]

Give the accurate bounding box of {}

Lesion Localization Dataset incorporates VinDR-CXR, ChestX-Det, and In-house datasets, consisting of 0.09M images and corresponding BBox for 12 diseases. The instruction given for the lesion localization task is "Give the accurate bounding box of {}.". Here, the placeholder {} represents the category of the speciﬁc disease, such as "pneumonia, in the lowe...

Showing first 80 references.

[1] [1]

A foundation model for generalizable disease diagnosis in chest x-ray images

Lijian Xu, Ziyu Ni, Hao Sun, Hongsheng Li, and Shaoting Zhang. A foundation model for generalizable disease diagnosis in chest x-ray images. arXiv preprint arXiv:2410.08861, 2024

arXiv 2024

[2] [2]

Medvilam: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation

Lijian Xu, Hao Sun, Ziyu Ni, Hongsheng Li, and Shaoting Zhang. Medvilam: A multimodal large language model with advanced generalizability and explainability for medical data understanding and generation. arXiv preprint arXiv:2409.19684, 2024

arXiv 2024

[3] [3]

Efﬁcient chest x-ray representation learning via semantic- partitioned contrastive learning

Wangyu Feng, Shawn Y oung, and Lijian Xu. Efﬁcient chest x-ray representation learning via semantic- partitioned contrastive learning. arXiv preprint arXiv:2603.07113, 2026. 18 LEARNING A MULTI -TASK TRANSFORMER FOR CHEST RADIOGRAPH INTERPRETATION

arXiv 2026

[4] [4]

Xrayclaw: Cooperative-competitive multi-agent alignment for trustworthy chest x-ray diagnosis

Shawn Y oung and Lijian Xu. Xrayclaw: Cooperative-competitive multi-agent alignment for trustworthy chest x-ray diagnosis. arXiv preprint arXiv:2604.02695, 2026

Pith/arXiv arXiv 2026

[5] [5]

An image is worth 16x16 words: Transformers for image recognition at scale, 2021

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021

2021

[6] [6]

Swin trans- former: Hierarchical vision transformer using shifted windows

Ze Liu, Y utong Lin, Y ue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin trans- former: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision , pages 10012–10022, 2021

2021

[7] [7]

Bert: Pre-training of deep bidirectional transformers for language understanding

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018

Pith/arXiv arXiv 2018

[8] [8]

Llama: Open and efﬁcient foundation language models

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Bap- tiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efﬁcient foundation language models. arXiv preprint arXiv:2302.13971, 2023

Pith/arXiv arXiv 2023

[9] [9]

Palm: Scaling language modeling with pathways

Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research , 24(240):1–113, 2023

2023

[10] [10]

Glam: Efﬁcient scaling of language models with mixture-of-experts

Nan Du, Y anping Huang, Andrew M Dai, Simon Tong, Dmitry Lepikhin, Y uanzhong Xu, Maxim Krikun, Y anqi Zhou, Adams Wei Y u, Orhan Firat, et al. Glam: Efﬁcient scaling of language models with mixture-of-experts. In International Conference on Machine Learning , pages 5547–5569. PMLR, 2022

2022

[11] [11]

Glm-130b: An open bilingual pre-trained model

Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Y ang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. Glm-130b: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414 , 2022

Pith/arXiv arXiv 2022

[12] [12]

Training language models to follow instructions with human feedback

Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sand- hini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems , 35:27730–27744, 2022

2022

[13] [13]

Towards expert-level medical question answering with large language models

Karan Singhal, Tao Tu, Juraj Gottweis, Rory Sayres, Ellery Wulczyn, Le Hou, Kevin Clark, Stephen Pfohl, Heather Cole-Lewis, Darlene Neal, et al. Towards expert-level medical question answering with large language models. arXiv preprint arXiv:2305.09617, 2023

Pith/arXiv arXiv 2023

[14] [14]

One leaf reveals the season: Occlusion-based contrastive learning with semantic-aware views for efﬁcient visual representation

Xiaoyu Y ang, Lijian Xu, Hongsheng Li, and Shaoting Zhang. One leaf reveals the season: Occlusion-based contrastive learning with semantic-aware views for efﬁcient visual representation. In International Conference on Machine Learning, pages 71425–71440, 2025

2025

[15] [15]

Fewer tokens, greater scaling: Self-adaptive visual bases for efﬁcient and expansive representation learning

Shawn Y oung, Xingyu Zeng, and Lijian Xu. Fewer tokens, greater scaling: Self-adaptive visual bases for efﬁcient and expansive representation learning. arXiv preprint arXiv:2511.19515, 2026

arXiv 2026

[16] [16]

The model knows which tokens matter:automatic token selection via noise gating

Landi He, Xiaoyu Y ang, and Lijian Xu. The model knows which tokens matter:automatic token selection via noise gating. arXiv preprint arXiv:2603.07135, 2026

arXiv 2026

[17] [17]

Segmentation and vascular vectorization for coronary artery by geometry-based cascaded neural network

Xiaoyu Y ang, Lijian Xu, Simon Y u, Qing Xia, Hongsheng Li, and Shaoting Zhang. Segmentation and vascular vectorization for coronary artery by geometry-based cascaded neural network. IEEE Transactions on Medical Imaging, 44(1):259–269, 2024

2024

[18] [18]

Zhao, Kelvin Guu, Adams Wei Y u, Brian Lester, Nan Du, Andrew M

Jason Wei, Maarten Bosma, Vincent Y . Zhao, Kelvin Guu, Adams Wei Y u, Brian Lester, Nan Du, Andrew M. Dai, and Quoc V . Le. Finetuned language models are zero-shot learners, 2022

2022

[19] [19]

Scaling instruction-ﬁnetuned language models

Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Y unxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-ﬁnetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024

2024

[20] [20]

Zerosense: How vision matters in long context compression

Y onghan Gao, Zehong Chen, Lijian Xu, Jingzhi Chen, Jingwei Guan, and Xingyu Zeng. Zerosense: How vision matters in long context compression. arXiv preprint arXiv:2603.11846, 2026

arXiv 2026

[21] [21]

Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning, 2023

2023

[22] [22]

Scalar: Spatial- concept alignment for robust vision in harsh open world

Xiaoyu Y ang, Lijian Xu, Xingyu Zeng, Xiaosong Wang, Hongsheng Li, and Shaoting Zhang. Scalar: Spatial- concept alignment for robust vision in harsh open world. Pattern Recognition, page 113203, 2026

2026

[23] [23]

Visual instruction tuning

Haotian Liu, Chunyuan Li, Qingyang Wu, and Y ong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024. 19 LEARNING A MULTI -TASK TRANSFORMER FOR CHEST RADIOGRAPH INTERPRETATION

2024

[24] [24]

Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction

Laila Rasmy, Y ang Xiang, Ziqian Xie, Cui Tao, and Degui Zhi. Med-bert: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ digital medicine, 4(1):86, 2021

2021

[25] [25]

Large language models encode clinical knowledge

Karan Singhal, Shekoofeh Azizi, Tao Tu, S Sara Mahdavi, Jason Wei, Hyung Won Chung, Nathan Scales, Ajay Tanwani, Heather Cole-Lewis, Stephen Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023

2023

[26] [26]

Foundation models for generalist medical artiﬁcial intelligence

Michael Moor, Oishi Banerjee, Zahra Shakeri Hossein Abad, Harlan M Krumholz, Jure Leskovec, Eric J Topol, and Pranav Rajpurkar. Foundation models for generalist medical artiﬁcial intelligence. Nature, 616(7956):259– 265, 2023

2023

[27] [27]

A generalist vision–language foundation model for diverse biomedical tasks

Kai Zhang, Rong Zhou, Eashan Adhikarla, Zhiling Y an, Yixin Liu, Jun Y u, Zhengliang Liu, Xun Chen, Brian D Davison, Hui Ren, et al. A generalist vision–language foundation model for diverse biomedical tasks. Nature Medicine, pages 1–13, 2024

2024

[28] [28]

Llava-med: Training a large language-and-vision assistant for biomedicine in one day

Chunyuan Li, Cliff Wong, Sheng Zhang, Naoto Usuyama, Haotian Liu, Jianwei Y ang, Tristan Naumann, Hoifung Poon, and Jianfeng Gao. Llava-med: Training a large language-and-vision assistant for biomedicine in one day. Advances in Neural Information Processing Systems , 36, 2024

2024

[29] [29]

Mimic-cxr, a de-identiﬁed publicly available database of chest radiographs with free-text reports

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih- ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de-identiﬁed publicly available database of chest radiographs with free-text reports. Scientiﬁc data, 6(1):317, 2019

2019

[30] [30]

Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations

Ha Q Nguyen, Khanh Lam, Linh T Le, Hieu H Pham, Dat Q Tran, Dung B Nguyen, Dung D Le, Chi M Pham, Hang TT Tong, Diep H Dinh, et al. Vindr-cxr: An open dataset of chest x-rays with radiologist’s annotations. Scientiﬁc Data, 9(1):429, 2022

2022

[31] [31]

A structure-aware relation network for thoracic diseases detection and segmentation

Jie Lian, Jingyu Liu, Shu Zhang, Kai Gao, Xiaoqing Liu, Dingwen Zhang, and Yizhou Y u. A structure-aware relation network for thoracic diseases detection and segmentation. IEEE Transactions on Medical Imaging , 40(8):2042–2052, 2021

2042

[32] [32]

Chexmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images

Nicolas Gaggion, Candelaria Mosquera, Lucas Mansilla, Julia Mariel Saidman, Martina Aineseder, Diego H Milone, and Enzo Ferrante. Chexmask: a large-scale dataset of anatomical segmentation masks for multi-center chest x-ray images. Scientiﬁc Data, 11(1):511, 2024

2024

[33] [33]

Siim-acr pneumothorax segmentation

Anna Zawacki, Carol Wu, George Shih, Julia Elliott, Mikhail Fomitchev, Mohannad Hussain, ParasLakhani, Phil Culliton, and Shunxing Bao. Siim-acr pneumothorax segmentation. Kaggle, 2019

2019

[34] [34]

Chestx- ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classiﬁcation and localization of common thorax diseases

Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M Summers. Chestx- ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classiﬁcation and localization of common thorax diseases. In Proceedings of the IEEE conference on computer vision and pattern recognition , pages 2097–2106, 2017

2097

[35] [35]

Chexpert: A large chest radiograph dataset with uncer- tainty labels and expert comparison

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Y u, Silviana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, et al. Chexpert: A large chest radiograph dataset with uncer- tainty labels and expert comparison. In Proceedings of the AAAI conference on artiﬁcial intelligence , volume 33, pages 590–597, 2019

2019

[36] [36]

Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia

George Shih, Carol C Wu, Safwan S Halabi, Marc D Kohli, Luciano M Prevedello, Tessa S Cook, Arjun Sharma, Judith K Amorosa, V eronica Arteaga, Maya Galperin-Aizenberg, et al. Augmenting the national institutes of health chest radiograph dataset with expert annotations of possible pneumonia. Radiology: Artiﬁcial Intelligence, 1(1):e180041, 2019

2019

[37] [37]

Junji Shiraishi, Shigehiko Katsuragawa, Junpei Ikezoe, Tsuneo Matsumoto, Takeshi Kobayashi, Ken-ichi Ko- matsu, Mitate Matsui, Hiroshi Fujita, Y oshie Kodera, and Kunio Doi. Development of a digital image database for chest radiographs with and without a lung nodule: receiver operating characteristic analysis of radiologists’ detection of pulmonary nodule...

2000

[38] [38]

Making the most of text semantics to improve biomedical vision–language processing

Benedikt Boecking, Naoto Usuyama, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Stephanie Hyland, Maria Wetscherek, Tristan Naumann, Aditya Nori, Javier Alvarez-V alle, et al. Making the most of text semantics to improve biomedical vision–language processing. In European conference on computer vision , pages 1–21. Springer, 2022

2022

[39] [39]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, pages 1597–1607. PMLR, 2020

2020

[40] [40]

Contrastive learning of medical visual representations from paired images and text

Y uhao Zhang, Hang Jiang, Y asuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. In Machine Learning for Healthcare Conference , pages 2–25. PMLR, 2022. 20 LEARNING A MULTI -TASK TRANSFORMER FOR CHEST RADIOGRAPH INTERPRETATION

2022

[41] [41]

Lungren, and Serena Y eung

Shih-Cheng Huang, Liyue Shen, Matthew P . Lungren, and Serena Y eung. Gloria: A multimodal global-local representation learning framework for label-efﬁcient medical image recognition. 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 3922–3931, 2021

2021

[42] [42]

Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports

Hong-Y u Zhou, Xiaoyu Chen, Yinghao Zhang, Ruibang Luo, Liansheng Wang, and Yizhou Y u. Generalized radiograph representation learning via cross-supervision between images and free-text radiology reports. Nature Machine Intelligence, 4(1):32–40, 2022

2022

[43] [43]

Multimodal masked autoencoders learn transferable representations, 2022

Xinyang Geng, Hao Liu, Lisa Lee, Dale Schuurmans, Sergey Levine, and Pieter Abbeel. Multimodal masked autoencoders learn transferable representations, 2022

2022

[44] [44]

Multi-granularity cross-modal alignment for generalized medical visual representation learning

Fuying Wang, Y uyin Zhou, Shujun Wang, V arut V ardhanabhuti, and Lequan Y u. Multi-granularity cross-modal alignment for generalized medical visual representation learning. Advances in Neural Information Processing Systems, 35:33536–33549, 2022

2022

[45] [45]

Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis

Chaoyi Wu, Xiaoman Zhang, Y a Zhang, Y anfeng Wang, and Weidi Xie. Medklip: Medical knowledge enhanced language-image pre-training for x-ray diagnosis. In ICCV, pages 21372–21383, 2023

2023

[46] [46]

Delving into masked autoencoders for multi-label thorax disease classiﬁcation

Junfei Xiao, Y utong Bai, Alan Y uille, and Zongwei Zhou. Delving into masked autoencoders for multi-label thorax disease classiﬁcation. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 3588–3600, 2023

2023

[47] [47]

Advancing radiograph representation learning with masked record modeling

Hong-Y u Zhou, Chenyu Lian, Liansheng Wang, and Yizhou Y u. Advancing radiograph representation learning with masked record modeling. In The Eleventh International Conference on Learning Representations , 2023

2023

[48] [48]

Visual grounding with transformers.2022 IEEE International Conference on Multimedia and Expo , pages 1–6, 2022

Y e Du, Zehua Fu, Qingjie Liu, and Y unhong Wang. Visual grounding with transformers.2022 IEEE International Conference on Multimedia and Expo , pages 1–6, 2022

2022

[49] [49]

Seqtr: A simple yet universal network for visual grounding

Chaoyang Zhu, Yiyi Zhou, Y unhang Shen, Gen Luo, Xingjia Pan, Mingbao Lin, Chao Chen, Liujuan Cao, Xiaoshuai Sun, and Rongrong Ji. Seqtr: A simple yet universal network for visual grounding. In European Conference on Computer Vision, pages 598–615. Springer, 2022

2022

[50] [50]

Transvg: End-to-end visual grounding with transformers

Jiajun Deng, Zhengyuan Y ang, Tianlang Chen, Wengang Zhou, and Houqiang Li. Transvg: End-to-end visual grounding with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1769–1779, October 2021

2021

[51] [51]

Medical phrase grounding with region-phrase context contrastive alignment

Zhihao Chen, Y ang Zhou, Anh Tran, Junting Zhao, Liang Wan, Gideon Su Kai Ooi, Lionel Tim-Ee Cheng, Choon Hua Thng, Xinxing Xu, Y ong Liu, et al. Medical phrase grounding with region-phrase context contrastive alignment. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 371–381. Springer, 2023

2023

[52] [52]

Chex: Interactive localization and region description in chest x-rays

Philip Muller, Georgios Kaissis, and Daniel Rueckert. Chex: Interactive localization and region description in chest x-rays. In European Conference on Computer Vision, pages 92–111. Springer, 2025

2025

[53] [53]

Knowledge-enhanced visual-language pre-training on chest radiology images

Xiaoman Zhang, Chaoyi Wu, Y a Zhang, Weidi Xie, and Y anfeng Wang. Knowledge-enhanced visual-language pre-training on chest radiology images. Nature Communications, 14(1):4542, 2023

2023

[54] [54]

Referring transformer: A one-step approach to multi-task visual grounding

Muchen Li and Leonid Sigal. Referring transformer: A one-step approach to multi-task visual grounding. Ad- vances in neural information processing systems , 34:19652–19664, 2021

2021

[55] [55]

Bottom-up and top-down attention for image captioning and visual question answering

Peter Anderson, Xiaodong He, Chris Buehler, Damien Teney, Mark Johnson, Stephen Gould, and Lei Zhang. Bottom-up and top-down attention for image captioning and visual question answering. In CVPR, pages 6077– 6086, 2018

2018

[56] [56]

Self-critical sequence training for image captioning

Steven J Rennie, Etienne Marcheret, Y oussef Mroueh, Jerret Ross, and V aibhava Goel. Self-critical sequence training for image captioning. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 7008–7024, 2017

2017

[57] [57]

Cross-modal memory networks for radiology report generation

Zhihong Chen, Y aling Shen, Y an Song, and Xiang Wan. Cross-modal memory networks for radiology report generation. In ACL-IJCNLP V olume 1: Long Papers, pages 5904–5914, 2021

2021

[58] [58]

Con- trastive attention for automatic chest x-ray report generation, 2023

Fenglin Liu, Changchang Yin, Xian Wu, Shen Ge, Y uexian Zou, Ping Zhang, Y uexian Zou, and Xu Sun. Con- trastive attention for automatic chest x-ray report generation, 2023

2023

[59] [59]

Interactive and explainable region-guided radiology report generation

Tim Tanida, Philip Muller, Georgios Kaissis, and Daniel Rueckert. Interactive and explainable region-guided radiology report generation. In CVPR, pages 7433–7442, 2023

2023

[60] [60]

Multimodal image-text matching improves retrieval-based chest x-ray report generation

Jaehwan Jeong, Katherine Tian, Andrew Li, Sina Hartung, Subathra Adithan, Fardad Behzadi, Juan Calle, David Osayande, Michael Pohlen, and Pranav Rajpurkar. Multimodal image-text matching improves retrieval-based chest x-ray report generation. In Medical Imaging with Deep Learning , pages 978–990. PMLR, 2024. 21 LEARNING A MULTI -TASK TRANSFORMER FOR CHEST...

2024

[61] [61]

A deep-learning-based framework for identifying and localizing multiple abnormalities and assessing cardiomegaly in chest x-ray

Weijie Fan, Yi Y ang, Jing Qi, Qichuan Zhang, Cuiwei Liao, Li Wen, Shuang Wang, Guangxian Wang, Y u Xia, Qihua Wu, et al. A deep-learning-based framework for identifying and localizing multiple abnormalities and assessing cardiomegaly in chest x-ray. Nature Communications, 15(1):1347, 2024

2024

[62] [62]

Maira-2: Grounded radiology report generation

Shruthi Bannur, Kenza Bouzid, Daniel C Castro, Anton Schwaighofer, Sam Bond-Taylor, Maximilian Ilse, Fer- nando Perez-Garcia, V alentina Salvatelli, Harshita Sharma, Felix Meissen, et al. Maira-2: Grounded radiology report generation. arXiv preprint arXiv:2406.04449, 2024

arXiv 2024

[63] [63]

Llava-rad mimic-cxr annotations

Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Y anbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Y ujia Xie, Mahmoud Khademi, Ziyi Y ang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Y ang, Chunyuan Li, Jianfeng Gao, Y u Gu, Cliff Wong, Mu-Hsin Wei, Tristan Naumann, Muhao Chen, Matthew Lun- gren, Akshay Chaudhari, Serena Y eung, Curtis Langlot...

2025

[64] [64]

Cxr-llava: a multimodal large language model for interpreting chest x-ray images

Seowoo Lee, Jiwon Y oun, Hyungjin Kim, Mansu Kim, and Soon Ho Y oon. Cxr-llava: a multimodal large language model for interpreting chest x-ray images. European Radiology, pages 1–13, 2025

2025

[65] [65]

Chexagent: Towards a foundation model for chest x-ray interpretation

Zhihong Chen, Maya V arma, Jean-Benoit Delbrouck, Magdalini Paschali, Louis Blankemeier, Dave V an V een, Jeya Maria Jose V alanarasu, Alaa Y oussef, Joseph Paul Cohen, Eduardo Pontes Reis, et al. Chexagent: Towards a foundation model for chest x-ray interpretation. In AAAI 2024 Spring Symposium on Clinical F oundation Models, 2024

2024

[66] [66]

A deepseek-powered ai system for automated chest radiograph interpretation in clinical practice

Y aowei Bai, Ruiheng Zhang, Y u Lei, Xuhua Duan, Jingfeng Y ao, Shuguang Ju, Chaoyang Wang, Wei Y ao, Yiwan Guo, Guilin Zhang, et al. A deepseek-powered ai system for automated chest radiograph interpretation in clinical practice. arXiv preprint arXiv:2512.20344, 2025

arXiv 2025

[67] [67]

A generalist learner for multifaceted medical image interpretation

Hong-Y u Zhou, Subathra Adithan, Julian Nicolas Acosta, Eric J Topol, and Pranav Rajpurkar. A generalist learner for multifaceted medical image interpretation. arXiv preprint arXiv:2405.07988, 2024

arXiv 2024

[68] [68]

Tc-ssa: Token compression via semantic slot aggregation for gigapixel pathology reasoning

Zhuo Chen, Shawn Y oung, and Lijian Xu. Tc-ssa: Token compression via semantic slot aggregation for gigapixel pathology reasoning. arXiv preprint arXiv:2603.01143, 2026

arXiv 2026

[69] [69]

Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework, 2022

Peng Wang, An Y ang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Y ang. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework, 2022

2022

[70] [70]

Pix2seq: A language modeling framework for object detection

Ting Chen, Saurabh Saxena, Lala Li, David J Fleet, and Geoffrey Hinton. Pix2seq: A language modeling framework for object detection. In International Conference on Learning Representations , 2021

2021

[71] [71]

Uniﬁed-io: A uniﬁed model for vision, language, and multi-modal tasks

Jiasen Lu, Christopher Clark, Rowan Zellers, Roozbeh Mottaghi, and Aniruddha Kembhavi. Uniﬁed-io: A uniﬁed model for vision, language, and multi-modal tasks. In The Eleventh International Conference on Learning Representations, 2022

2022

[72] [72]

Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension

Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, V es Stoy- anov, and Luke Zettlemoyer. Bart: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. arXiv preprint arXiv:1910.13461, 2019

Pith/arXiv arXiv 1910

[73] [73]

Improving language understanding by generative pre-training

Alec Radford and Karthik Narasimhan. Improving language understanding by generative pre-training. 2018

2018

[74] [74]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics , pages 311–318, 2002

2002

[75] [75]

Rouge: A package for automatic evaluation of summaries

Chin-Y ew Lin. Rouge: A package for automatic evaluation of summaries. In Text summarization branches out, pages 74–81, 2004

2004

[76] [76]

Evaluating progress in automatic chest x-ray radiology report generation

Feiyang Y u, Mark Endo, Rayan Krishnan, Ian Pan, Andy Tsai, Eduardo Pontes Reis, Eduardo Kaiser Uru- rahy Nunes Fonseca, Henrique Min Ho Lee, Zahra Shakeri Hossein Abad, Andrew Y Ng, et al. Evaluating progress in automatic chest x-ray radiology report generation. Patterns, 4(9), 2023

2023

[77] [77]

Elixr: Towards a general purpose x-ray artiﬁcial in- telligence system through alignment of large language models and radiology vision encoders

Shawn Xu, Lin Y ang, Christopher Kelly, Marcin Sieniek, Timo Kohlberger, Martin Ma, Wei-Hung Weng, At- tila Kiraly, Sahar Kazemzadeh, Zakkai Melamed, et al. Elixr: Towards a general purpose x-ray artiﬁcial in- telligence system through alignment of large language models and radiology vision encoders. arXiv preprint arXiv:2308.01317, 2023

arXiv 2023

[78] [78]

A foundation model for generalizable disease detection from retinal images

Y ukun Zhou, Mark A Chia, Siegfried K Wagner, Murat S Ayhan, Dominic J Williamson, Robbert R Struyven, Timing Liu, Moucheng Xu, Mateo G Lozano, Peter Woodward-Court, et al. A foundation model for generalizable disease detection from retinal images. Nature, 622(7981):156–163, 2023. 22 LEARNING A MULTI -TASK TRANSFORMER FOR CHEST RADIOGRAPH INTERPRETATION A...

2023

[79] [79]

What disease does this image have?

Disease Classiﬁcation Dataset includes entity information across 193 categories and 0.24M images. For the entity classiﬁcation task, the instruction is "What disease does this image have?". The answer includes all possible diseases present in the data, such as "pneumonia" and "atelectasis.", "Is Pneumonia in this image?". The response can be either "yes" ...

[80] [80]

Give the accurate bounding box of {}

Lesion Localization Dataset incorporates VinDR-CXR, ChestX-Det, and In-house datasets, consisting of 0.09M images and corresponding BBox for 12 diseases. The instruction given for the lesion localization task is "Give the accurate bounding box of {}.". Here, the placeholder {} represents the category of the speciﬁc disease, such as "pneumonia, in the lowe...