RA-RRG: Multimodal Retrieval-Augmented Radiology Report Generation with Key Phrase Extraction

Byungmu Yoon; Jonggwon Park; Kyoyun Choi; Soobum Kim

arxiv: 2504.07415 · v2 · submitted 2025-04-10 · 💻 cs.CV · cs.CL· cs.LG

RA-RRG: Multimodal Retrieval-Augmented Radiology Report Generation with Key Phrase Extraction

Jonggwon Park , Byungmu Yoon , Soobum Kim , Kyoyun Choi This is my paper

Pith reviewed 2026-05-22 21:05 UTC · model grok-4.3

classification 💻 cs.CV cs.CLcs.LG

keywords radiology report generationretrieval-augmented generationkey phrase extractionchest X-rayhallucination reductionmultimodal retrievallarge language modelsMIMIC-CXR

0 comments

The pith

Retrieval of key phrases from similar X-ray reports lets LLMs generate accurate radiology reports with fewer hallucinations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents RA-RRG, a framework that first uses LLMs to pull clinically essential key phrases out of existing radiology reports. Given a new chest X-ray image, the system retrieves the most relevant of those phrases and feeds them to an LLM that then writes the report. This retrieval step is intended to steer the model away from invented content while avoiding the heavy compute and data needs of full multimodal large language models. On the MIMIC-CXR and IU X-ray datasets the approach reaches state-of-the-art scores on CheXbert clinical accuracy metrics and remains competitive on RadGraph F1. The same phrase-aggregation mechanism also works when multiple views of one patient are available.

Core claim

RA-RRG extracts key phrases from training reports with LLMs, retrieves image-relevant phrases via multimodal similarity, and conditions an LLM on these phrases to produce the radiology report. This suppresses hallucinations while achieving state-of-the-art CheXbert metrics and competitive RadGraph F1 scores on MIMIC-CXR and IU X-ray, and supports multi-view aggregation.

What carries the argument

Multimodal retrieval of LLM-extracted key phrases from a report database, used to condition the report-generating LLM.

If this is right

Achieves state-of-the-art results on CheXbert metrics compared with multimodal LLMs.
Maintains competitive RadGraph F1 scores on standard benchmarks.
Naturally extends to multi-view report generation by aggregating phrases across images.
Reduces hallucinations and computational cost relative to full multimodal training.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The phrase-retrieval step could let smaller or cheaper LLMs reach usable accuracy in other medical imaging tasks if similar phrase databases are constructed.
Retrieved phrases might serve as explicit evidence that clinicians can inspect alongside the generated report.
The method could be tested on longitudinal patient data to check whether repeated retrievals improve consistency across visits.

Load-bearing premise

The key phrases extracted by LLMs from training reports capture all clinically essential information and image-based retrieval finds phrases relevant enough to guide accurate report generation without omissions or new errors.

What would settle it

On a new test set, reports generated after conditioning on retrieved phrases show higher rates of clinical omissions or factual errors than reports produced by direct multimodal LLMs.

Figures

Figures reproduced from arXiv: 2504.07415 by Byungmu Yoon, Jonggwon Park, Kyoyun Choi, Soobum Kim.

**Figure 2.** Figure 2: (a) Key phrase extraction using an LLM. (b) The multimodal retriever architecture. (c) Inference process of RA-RRG. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Example of single-view RRG. The baseline is model E1 from Table [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗

**Figure 4.** Figure 4: Example of multi-view RRG. At the top are the frontal and lateral images with their predicted key phrases. Below the original [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: Impact of threshold on example-based average CheXbert [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: LLM prompt for key phrase extraction. The LLM extracts key phrases as a list by leveraging the original radiology report and [PITH_FULL_IMAGE:figures/full_fig_p015_6.png] view at source ↗

**Figure 7.** Figure 7: Example of retrieval target extraction from same radiology report as (a) sentences, (b) RadGraph phrases, and (c) key phrases. [PITH_FULL_IMAGE:figures/full_fig_p016_7.png] view at source ↗

**Figure 8.** Figure 8: Single-view RAG prompt for RRG. Key phrases are provided as input to generate a radiology report. [PITH_FULL_IMAGE:figures/full_fig_p017_8.png] view at source ↗

**Figure 9.** Figure 9: Multi-view RAG prompt for RRG. Key phrases retrieved from the frontal and lateral images are separately provided as input to [PITH_FULL_IMAGE:figures/full_fig_p017_9.png] view at source ↗

**Figure 10.** Figure 10: An example of key phrase retrieval results and the generated radiology report. Descriptions with the same meaning are high [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Comparison of single-view RRG results. Positive findings are highlighted with different colors. The sample is sourced from the [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Comparison of single-view RRG results. Positive findings are highlighted with different colors, and phrases considered to be [PITH_FULL_IMAGE:figures/full_fig_p019_12.png] view at source ↗

**Figure 13.** Figure 13: Comparison of (a) single-view and (b) multi-view RRG results for the same study. The report for MAIRA-2 was generated [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗

read the original abstract

Automated radiology report generation (RRG) holds potential to reduce the workload of radiologists, and recent advances in multimodal large language models (MLLMs) have enabled multimodal chest X-ray (CXR) report generation. However, existing MLLMs are computationally expensive, require large-scale training data, and may produce hallucinated content, limiting their practical deployment. To address these limitations, we propose RA-RRG, a retrieval-augmented RRG framework that combines multimodal retrieval with large language models (LLMs) to generate radiology reports while reducing hallucinations and computational demands. RA-RRG uses LLMs to extract clinically essential key phrases from radiology reports and retrieves relevant phrases given an input image. By conditioning LLMs on the retrieved phrases, RA-RRG effectively suppresses hallucinations while maintaining strong report generation performance. Experiments on the MIMIC-CXR and IU X-ray datasets show state-of-the-art results on CheXbert metrics and competitive RadGraph F1 scores compared to MLLMs. Furthermore, RA-RRG naturally generalizes to multi-view RRG by aggregating phrases retrieved from multiple images, highlighting its broad applicability to real-world clinical scenarios. Code is available at https://github.com/deepnoid-ai/RA-RRG.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RA-RRG adds LLM-driven key-phrase extraction plus image retrieval to condition report generation, but the numbers do not isolate whether retrieval actually cuts hallucinations or just trades one set of errors for another.

read the letter

The core idea is straightforward: run an LLM over training reports to pull out key phrases, retrieve the most relevant ones from an image, and feed those phrases to condition the final LLM output. This specific combination of phrase-level extraction and multimodal retrieval for RRG is not a routine extension of earlier work in the area. The authors report SOTA CheXbert scores and competitive RadGraph F1 on MIMIC-CXR and IU X-ray, plus a clean extension to multi-view inputs by simple aggregation. Releasing the code is a practical plus for anyone who wants to test the pipeline themselves. Those are the concrete contributions. The main gap is that the central claim about hallucination suppression rests on aggregate generation metrics without a non-retrieval baseline or any breakdown of false positives versus omissions. There is also no reported detail on the retrieval model itself, the exact phrase selection rules, or checks against data leakage between training reports and test images. Those omissions make it difficult to tell whether the retrieved phrases are reliably steering the model or occasionally introducing new factual problems. The paper is aimed at groups working on lower-compute radiology report systems who already have access to public CXR datasets. A reader in that niche can extract implementation ideas from the released code, but the work does not yet supply the controlled evidence needed to treat the hallucination benefit as established. I would send it to peer review; the empirical results and open code give it enough substance for referees to evaluate and request the missing controls.

Referee Report

2 major / 2 minor

Summary. The paper proposes RA-RRG, a retrieval-augmented framework for radiology report generation. LLMs extract key phrases from training reports; given an input CXR image, relevant phrases are retrieved and used to condition an LLM for report generation. The approach is claimed to suppress hallucinations, lower computational cost relative to MLLMs, achieve SOTA CheXbert metrics and competitive RadGraph F1 on MIMIC-CXR and IU X-ray, and naturally extend to multi-view inputs. Code is released.

Significance. If the central claim is substantiated, RA-RRG would offer a lighter-weight, retrieval-based alternative to full multimodal LLMs for factual radiology report generation. The public code release is a positive contribution for reproducibility.

major comments (2)

[Abstract] Abstract: The central claim that 'By conditioning LLMs on the retrieved phrases, RA-RRG effectively suppresses hallucinations' is not directly tested. Reported results are aggregate CheXbert and RadGraph scores; no ablation against a non-retrieval LLM baseline, no hallucination-specific metric (e.g., entity-level error analysis), and no quantification of omissions or new errors introduced by imperfect retrieval are provided.
[Methods] Methods (retrieval and phrase extraction subsections): No description is given of the multimodal retrieval model architecture, the exact LLM prompt and selection criteria used to extract key phrases from training reports, or any controls for data leakage between retrieved training phrases and test images. These omissions are load-bearing for assessing whether the reported metrics reflect genuine hallucination reduction or retrieval artifacts.

minor comments (2)

[Experiments] Experiments: Specify the exact train/validation/test splits, any additional preprocessing, and the precise CheXbert and RadGraph evaluation protocols to allow direct comparison with prior work.
[Figure 1] Figure 1 or equivalent diagram: Clarify the flow from image to phrase retrieval to LLM conditioning with explicit notation for the retrieval similarity function.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and have revised the manuscript accordingly to strengthen the presentation of our results and methods.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'By conditioning LLMs on the retrieved phrases, RA-RRG effectively suppresses hallucinations' is not directly tested. Reported results are aggregate CheXbert and RadGraph scores; no ablation against a non-retrieval LLM baseline, no hallucination-specific metric (e.g., entity-level error analysis), and no quantification of omissions or new errors introduced by imperfect retrieval are provided.

Authors: We agree that direct evidence for hallucination suppression would strengthen the central claim. The reported CheXbert and RadGraph improvements provide indirect support via standard factual metrics in the field, but we acknowledge the value of explicit testing. In the revised manuscript we have added an ablation comparing RA-RRG against a non-retrieval LLM baseline that uses the identical underlying model and prompt structure. We have also included a qualitative entity-level error analysis in the results section that illustrates reduced hallucinations, together with a brief discussion of retrieval-induced omissions in the limitations and supplementary material. revision: yes
Referee: [Methods] Methods (retrieval and phrase extraction subsections): No description is given of the multimodal retrieval model architecture, the exact LLM prompt and selection criteria used to extract key phrases from training reports, or any controls for data leakage between retrieved training phrases and test images. These omissions are load-bearing for assessing whether the reported metrics reflect genuine hallucination reduction or retrieval artifacts.

Authors: We thank the referee for highlighting these omissions. In the revised manuscript we have substantially expanded the retrieval subsection to describe the multimodal retrieval architecture, which uses a frozen vision-language encoder to compute cosine similarity between the input CXR embedding and phrase embeddings. The precise LLM prompt template and selection criteria (clinical relevance, frequency, and length constraints) for key-phrase extraction are now provided verbatim in a new appendix. We have also added an explicit statement that the retrieval index is constructed exclusively from the training split, with no test or validation images or reports used in index construction or retrieval, thereby eliminating data leakage. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical framework evaluated on public benchmarks

full rationale

The paper describes an empirical retrieval-augmented pipeline that extracts key phrases via LLM, performs image-based retrieval, and conditions report generation on the retrieved phrases. No equations, fitted parameters, or derivations are presented that reduce reported metrics (CheXbert, RadGraph F1) to quantities defined by the authors' own inputs. Evaluation uses standard public datasets (MIMIC-CXR, IU X-ray) and released code; results are aggregate generation metrics with no self-definitional or self-citation load-bearing steps. The central claim rests on external benchmarks rather than internal construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on the unverified premise that LLM-extracted phrases capture all clinically necessary information and that retrieval similarity in image space reliably surfaces useful context; no free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption LLMs can reliably extract clinically essential key phrases from radiology reports
Stated as the first step of the pipeline in the abstract
domain assumption Image-feature similarity retrieves phrases that are relevant for guiding accurate report generation
Core mechanism described for suppressing hallucinations

pith-pipeline@v0.9.0 · 5762 in / 1240 out tokens · 56246 ms · 2026-05-22T21:05:44.126019+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

RA-RRG uses LLMs to extract clinically essential key phrases from radiology reports and retrieves relevant phrases given an input image. By conditioning LLMs on the retrieved phrases, RA-RRG effectively suppresses hallucinations
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We introduce several training techniques... in-batch contrastive loss to align text and semantic embeddings

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 8 internal anchors

[1]

Maira-2: Grounded radiology report gener- ation.arXiv preprint arXiv:2406.04449, 2024

Shruthi Bannur, Kenza Bouzid, Daniel C Castro, Anton Schwaighofer, Sam Bond-Taylor, Maximilian Ilse, Fernando P´erez-Garc´ıa, Valentina Salvatelli, Harshita Sharma, Felix Meissen, et al. Maira-2: Grounded radiology report gener- ation.arXiv preprint arXiv:2406.04449, 2024. 2, 4, 5, 6, 7, 8

work page arXiv 2024
[2]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 3

work page 2020
[3]

Chexpert plus: Hundreds of thousands of aligned radiology texts, im- ages and patients.arXiv preprint arXiv:2405.19538, 2024

Pierre Chambon, Jean-Benoit Delbrouck, Thomas Sounack, Shih-Cheng Huang, Zhihong Chen, Maya Varma, Steven QH Truong, Chu The Chuong, and Curtis P Langlotz. Chexpert plus: Hundreds of thousands of aligned radiology texts, im- ages and patients.arXiv preprint arXiv:2405.19538, 2024. 5

work page arXiv 2024
[4]

Lungren, Akshay Chaudhari, Ser- ena Yeung-Levy, Curtis P

Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Yang, Chunyuan Li, Jian- feng Gao, Yu Gu, Cliff Wong, Mu Wei, Tristan Naumann, Muhao Chen, Matthew P. Lungren, Akshay Chaudhari, Ser- ena Yeung-Levy, Curtis P. Langl...

work page arXiv 2024
[5]

Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W. Cohen. Re-imagen: Retrieval-augmented text-to-image gen- erator.arXiv preprint arXiv:2209.14491, 2022. 2

work page arXiv 2022
[6]

Generating radiology reports via memory-driven trans- former

Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory-driven trans- former. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1439–1449, Online, 2020. Association for Computational Linguistics. 2, 3

work page 2020
[7]

Tsai, Andrew Johnston, Cameron Olsen, Tanishq Mathew Abraham, Sergios Gatidis, Akshay S

Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Mag- dalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, Emily B. Tsai, Andrew Johnston, Cameron Olsen, Tanishq Mathew Abraham, Sergios Gatidis, Akshay S. Chaudhari, and Curtis Langlotz. Chexagent: Towards a foundation model fo...

work page arXiv 2024
[8]

Preparing a collection of radiology examinations for distribution and re- trieval.Journal of the American Medical Informatics Asso- ciation, 23(2):304–310, 2016

Dina Demner-Fushman, Marc D Kohli, Marc B Rosen- man, Steven E Shooshan, Louis Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. Preparing a collection of radiology examinations for distribution and re- trieval.Journal of the American Medical Informatics Asso- ciation, 23(2):304–310, 2016. 5, 2

work page 2016
[9]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Retrieval-based chest x-ray report generation using a pre-trained contrastive language-image model

Mark Endo, Rayan Krishnan, Viswesh Krishna, Andrew Y Ng, and Pranav Rajpurkar. Retrieval-based chest x-ray report generation using a pre-trained contrastive language-image model. InMachine Learning for Health, pages 209–219. PMLR, 2021. 1, 2

work page 2021
[11]

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[12]

HippoRAG: Neurobiologically in- spired long-term memory for large language models

Bernal Jimenez Gutierrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically in- spired long-term memory for large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 2

work page 2024
[13]

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A sur- vey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.arXiv preprint arXiv:2311.05232, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C

Stephanie L. Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Mercy Ranjit, Anton Schwaighofer, Fer- nando P ´erez-Garc´ıa, Valentina Salvatelli, Shaury Srivas- tav, Anja Thieme, Noel Codella, Matthew P. Lungren, Maria Teodora Wetscherek, Ozan Oktay, and Javier Alvarez- Valle. Maira-1: A specialised large multimodal model for ra- diology report gen...

work page arXiv
[16]

Mong, Safwan S

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Sil- viana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y . Ng. Chexpert: a large chest radio...

work page 2019
[17]

Bartold- son, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein

Neel Jain, Ping yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartold- son, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. NEF- Tune: Noisy embeddings improve instruction finetuning. In The Twelfth International Conference on Learning Represen- tations, 2024. 3

work page 2024
[18]

Radgraph: Extracting clinical entities and relations from ra- diology reports

Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven Truong, Du Nguyen Duong Nguyen Duong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew Lungren, Andrew Ng, Curtis Langlotz, Pranav Rajpurkar, and Pranav Rajpurkar. Radgraph: Extracting clinical entities and relations from ra- diology reports. InProceedings of the Neural Informa- tion Processing Systems Trac...

work page
[19]

From clip to dino: Visual encoders shout in multi-modal large language models,

Dongsheng Jiang, Yuchen Liu, Songlin Liu, Jin’e Zhao, Hao Zhang, Zhen Gao, Xiaopeng Zhang, Jin Li, and Hongkai Xiong. From clip to dino: Visual encoders shout in multi-modal large language models.arXiv preprint arXiv:2310.08825, 2023. 3

work page arXiv 2023
[20]

Promptmrg: Diagnosis-driven prompts for medical report generation

Haibo Jin, Haoxuan Che, Yi Lin, and Hao Chen. Promptmrg: Diagnosis-driven prompts for medical report generation. Proceedings of the AAAI Conference on Artificial Intelli- gence, 38(3):2607–2615, 2024. 5, 6, 2, 3

work page 2024
[21]

Mimic-cxr, a de- identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de- identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019. 4

work page 2019
[22]

MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs

Alistair EW Johnson, Tom J Pollard, Nathaniel R Green- baum, Matthew P Lungren, Chih-ying Deng, Yifan Peng, Zhiyong Lu, Roger G Mark, Seth J Berkowitz, and Steven Horng. Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs.arXiv preprint arXiv:1901.07042,

work page internal anchor Pith review Pith/arXiv arXiv 1901
[23]

Transq: Transformer-based semantic query for med- ical report generation

Ming Kong, Zhengxing Huang, Kun Kuang, Qiang Zhu, and Fei Wu. Transq: Transformer-based semantic query for med- ical report generation. InMedical Image Computing and Computer Assisted Intervention – MICCAI 2022, pages 610– 620, Cham, 2022. Springer Nature Switzerland. 1, 2, 3, 5, 6

work page 2022
[24]

The hungarian method for the assignment problem.Naval research logistics quarterly, 2(1-2):83–97,

Harold W Kuhn. The hungarian method for the assignment problem.Naval research logistics quarterly, 2(1-2):83–97,

work page
[25]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InPro- ceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 1

work page 2023
[26]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K¨uttler, Mike Lewis, Wen-tau Yih, Tim Rockt ¨aschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020. 1, 2

work page 2020
[27]

Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021. 2

work page 2021
[28]

Evcap: Retrieval-augmented image captioning with external visual-name memory for open-world compre- hension

Jiaxuan Li, Duc Minh V o, Akihiro Sugimoto, and Hideki Nakayama. Evcap: Retrieval-augmented image captioning with external visual-name memory for open-world compre- hension. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13733– 13742, 2024. 2

work page 2024
[29]

Dynamic graph enhanced contrastive learning for chest x-ray report generation

Mingjie Li, Bingqian Lin, Zicong Chen, Haokun Lin, Xi- aodan Liang, and Xiaojun Chang. Dynamic graph enhanced contrastive learning for chest x-ray report generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3334–3343, 2023. 5, 6, 3

work page 2023
[30]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004. 5

work page 2004
[31]

Bootstrapping large language models for radiology report generation.Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):18635–18643,

Chang Liu, Yuanhe Tian, Weidong Chen, Yan Song, and Yongdong Zhang. Bootstrapping large language models for radiology report generation.Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):18635–18643,

work page
[32]

Im- proving chest X-ray report generation by leveraging warm starting.Artificial Intelligence in Medicine, 144:102633,

Aaron Nicolson, Jason Dowling, and Bevan Koopman. Im- proving chest X-ray report generation by leveraging warm starting.Artificial Intelligence in Medicine, 144:102633,

work page
[33]

Im- proving chest x-ray report generation by leveraging warm starting.Artificial intelligence in medicine, 144:102633,

Aaron Nicolson, Jason Dowling, and Bevan Koopman. Im- proving chest x-ray report generation by leveraging warm starting.Artificial intelligence in medicine, 144:102633,

work page
[34]

Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patr...

work page 2024
[35]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,

work page
[36]

M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation

Jonggwon Park, Soobum Kim, Byungmu Yoon, Jihun Hyun, and Kyoyun Choi. M4cxr: Exploring multi-task potentials of multi-modal large language models for chest x-ray inter- pretation.arXiv preprint arXiv:2408.16213, 2024. 2, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

Rad-dino: Exploring scalable medical image encoders beyond text supervision.arXiv preprint arXiv:2401.10815, 2024

Fernando P ´erez-Garc´ıa, Harshita Sharma, Sam Bond-Taylor, Kenza Bouzid, Valentina Salvatelli, Maximilian Ilse, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Matthew P Lungren, et al. Rad-dino: Exploring scalable medical image encoders beyond text supervision.arXiv preprint arXiv:2401.10815, 2024. 5

work page arXiv 2024
[38]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 4

work page 2021
[39]

Im- proving radiology report generation systems by removing hallucinated references to non-existent priors

Vignav Ramesh, Nathan A Chi, and Pranav Rajpurkar. Im- proving radiology report generation systems by removing hallucinated references to non-existent priors. InMachine Learning for Health, pages 456–473. PMLR, 2022. 1, 2

work page 2022
[40]

Smallcap: lightweight image captioning prompted with retrieval augmentation

Rita Ramos, Bruno Martins, Desmond Elliott, and Yova Ke- mentchedjhieva. Smallcap: lightweight image captioning prompted with retrieval augmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2840–2849, 2023. 2

work page 2023
[41]

Retrieval augmented chest x-ray report gen- eration using openai gpt models

Mercy Ranjit, Gopinath Ganapathy, Ranjit Manuel, and Tanuja Ganu. Retrieval augmented chest x-ray report gen- eration using openai gpt models. InMachine Learning for Healthcare Conference, pages 650–666. PMLR, 2023. 2

work page 2023
[42]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

N Reimers. Sentence-bert: Sentence embeddings using siamese bert-networks.arXiv preprint arXiv:1908.10084,

work page internal anchor Pith review Pith/arXiv arXiv 1908
[43]

Retrieval-augmented transformer for image caption- ing

Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cuc- chiara. Retrieval-augmented transformer for image caption- ing. InProceedings of the 19th international conference on content-based multimedia indexing, pages 1–7, 2022. 2

work page 2022
[44]

Moein Shariatnia

M. Moein Shariatnia. Simple CLIP, 2021.https:// github.com/moein-shariatnia/OpenAI-CLIP. 4

work page 2021
[45]

Eagle: Exploring the design space for multimodal LLMs with mixture of encoders

Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, Yilin Zhao, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catan- zaro, Andrew Tao, Jan Kautz, Zhiding Yu, and Guilin Liu. Eagle: Exploring the design space for multimodal LLMs with mixture of encoders. InThe Thirteenth International Conference on Learning Re...

work page 2025
[46]

Chexbert: Combin- ing automatic labelers and expert annotations for accurate radiology report labeling using bert

Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y Ng, and Matthew P Lungren. Chexbert: Combin- ing automatic labelers and expert annotations for accurate radiology report labeling using bert. InEMNLP 2020-2020 Conference on Empirical Methods in Natural Language Pro- cessing, Proceedings of the Conference, pages 1500–1519,

work page 2020
[47]

Fact-aware multimodal retrieval augmentation for accu- rate medical radiology report generation.arXiv preprint arXiv:2407.15268, 2024

Liwen Sun, James Zhao, Megan Han, and Chenyan Xiong. Fact-aware multimodal retrieval augmentation for accu- rate medical radiology report generation.arXiv preprint arXiv:2407.15268, 2024. 2, 5, 6

work page arXiv 2024
[48]

Interactive and explainable region-guided radi- ology report generation

Tim Tanida, Philip M ¨uller, Georgios Kaissis, and Daniel Rueckert. Interactive and explainable region-guided radi- ology report generation. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 7433–7442. IEEE, 2023. 3

work page 2023
[49]

Towards gen- eralist biomedical ai.NEJM AI, 1(3):AIoa2300138, 2024

Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaeker- mann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards gen- eralist biomedical ai.NEJM AI, 1(3):AIoa2300138, 2024. 1, 2, 5, 6, 7

work page 2024
[50]

Metransformer: Radiology report generation by transformer with multiple learnable expert tokens

Zhanyu Wang, Lingqiao Liu, Lei Wang, and Luping Zhou. Metransformer: Radiology report generation by transformer with multiple learnable expert tokens. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11558–11567, 2023. 5, 6

work page 2023
[51]

Distribution-balanced loss for multi-label classification in long-tailed datasets

Tong Wu, Qingqiu Huang, Ziwei Liu, Yu Wang, and Dahua Lin. Distribution-balanced loss for multi-label classification in long-tailed datasets. InComputer Vision – ECCV 2020, pages 162–178, Cham, 2020. Springer International Publish- ing. 4

work page 2020
[52]

Retrieval-augmented egocentric video captioning

Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Retrieval-augmented egocentric video captioning. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 13525–13536, 2024. 2

work page 2024
[53]

Style- aware radiology report generation with radgraph and few- shot prompting

Benjamin Yan, Ruochen Liu, David Kuo, Subathra Adithan, Eduardo Reis, Stephen Kwak, Vasantha Venugopal, Chloe O’Connell, Agustina Saenz, Pranav Rajpurkar, et al. Style- aware radiology report generation with radgraph and few- shot prompting. InFindings of the Association for Computa- tional Linguistics: EMNLP 2023, pages 14676–14688, 2023. 2, 5, 6

work page 2023
[54]

Advancing multi- modal medical capabilities of gemini.arXiv preprint arXiv:2405.03162, 2024

Lin Yang, Shawn Xu, Andrew Sellergren, Timo Kohlberger, Yuchen Zhou, Ira Ktena, Atilla Kiraly, Faruk Ahmed, Farhad Hormozdiari, Tiam Jaroensri, et al. Advancing multi- modal medical capabilities of gemini.arXiv preprint arXiv:2405.03162, 2024. 1, 2, 5, 6, 8

work page arXiv 2024
[55]

Kevin Zhou, and Li Xiao

Shuxin Yang, Xian Wu, Shen Ge, Zhuozhao Zheng, S. Kevin Zhou, and Li Xiao. Radiology report generation with a learned knowledge base and multi-modal alignment.Med- ical Image Analysis, 86:102798, 2023. 3

work page 2023
[56]

Retrieval-augmented multimodal language modeling

Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Richard James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-Tau Yih. Retrieval-augmented multimodal language modeling. InInternational Conference on Machine Learning, pages 39755–39769. PMLR, 2023. 2

work page 2023
[57]

Evaluating progress in automatic chest x-ray radiology report generation.Patterns, 4(9), 2023

Feiyang Yu, Mark Endo, Rayan Krishnan, Ian Pan, Andy Tsai, Eduardo Pontes Reis, Eduardo Kaiser Ururahy Nunes Fonseca, Henrique Min Ho Lee, Zahra Shakeri Hossein Abad, Andrew Y Ng, et al. Evaluating progress in automatic chest x-ray radiology report generation.Patterns, 4(9), 2023. 5

work page 2023
[58]

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs.arXiv preprint arXiv:2303.00915,

work page internal anchor Pith review Pith/arXiv arXiv
[59]

Topicwise separable sentence retrieval for med- ical report generation.IEEE Transactions on Medical Imag- ing, 2024

Junting Zhao, Yang Zhou, Zhihao Chen, Huazhu Fu, and Liang Wan. Topicwise separable sentence retrieval for med- ical report generation.IEEE Transactions on Medical Imag- ing, 2024. 1, 2 Leveraging LLMs for Multimodal Retrieval-Augmented Radiology Report Generation via Key Phrase Extraction Supplementary Material A. Implementation Details A.1. External Sou...

work page 2024
[60]

new”, “improved

used the entire IU X-Ray dataset as the test set, treating each frontal and lateral image as an independent sample and excluding a portion of normal images to maintain a 10% normal image ratio. This subset of 4,168 images is publicly available8 and is also used in our evaluation to assess the performance of single-view RRG. B.2. Results Table 5 shows our ...

work page

[1] [1]

Maira-2: Grounded radiology report gener- ation.arXiv preprint arXiv:2406.04449, 2024

Shruthi Bannur, Kenza Bouzid, Daniel C Castro, Anton Schwaighofer, Sam Bond-Taylor, Maximilian Ilse, Fernando P´erez-Garc´ıa, Valentina Salvatelli, Harshita Sharma, Felix Meissen, et al. Maira-2: Grounded radiology report gener- ation.arXiv preprint arXiv:2406.04449, 2024. 2, 4, 5, 6, 7, 8

work page arXiv 2024

[2] [2]

End-to- end object detection with transformers

Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. End-to- end object detection with transformers. InEuropean confer- ence on computer vision, pages 213–229. Springer, 2020. 3

work page 2020

[3] [3]

Chexpert plus: Hundreds of thousands of aligned radiology texts, im- ages and patients.arXiv preprint arXiv:2405.19538, 2024

Pierre Chambon, Jean-Benoit Delbrouck, Thomas Sounack, Shih-Cheng Huang, Zhihong Chen, Maya Varma, Steven QH Truong, Chu The Chuong, and Curtis P Langlotz. Chexpert plus: Hundreds of thousands of aligned radiology texts, im- ages and patients.arXiv preprint arXiv:2405.19538, 2024. 5

work page arXiv 2024

[4] [4]

Lungren, Akshay Chaudhari, Ser- ena Yeung-Levy, Curtis P

Juan Manuel Zambrano Chaves, Shih-Cheng Huang, Yanbo Xu, Hanwen Xu, Naoto Usuyama, Sheng Zhang, Fei Wang, Yujia Xie, Mahmoud Khademi, Ziyi Yang, Hany Awadalla, Julia Gong, Houdong Hu, Jianwei Yang, Chunyuan Li, Jian- feng Gao, Yu Gu, Cliff Wong, Mu Wei, Tristan Naumann, Muhao Chen, Matthew P. Lungren, Akshay Chaudhari, Ser- ena Yeung-Levy, Curtis P. Langl...

work page arXiv 2024

[5] [5]

Wenhu Chen, Hexiang Hu, Chitwan Saharia, and William W. Cohen. Re-imagen: Retrieval-augmented text-to-image gen- erator.arXiv preprint arXiv:2209.14491, 2022. 2

work page arXiv 2022

[6] [6]

Generating radiology reports via memory-driven trans- former

Zhihong Chen, Yan Song, Tsung-Hui Chang, and Xiang Wan. Generating radiology reports via memory-driven trans- former. InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1439–1449, Online, 2020. Association for Computational Linguistics. 2, 3

work page 2020

[7] [7]

Tsai, Andrew Johnston, Cameron Olsen, Tanishq Mathew Abraham, Sergios Gatidis, Akshay S

Zhihong Chen, Maya Varma, Jean-Benoit Delbrouck, Mag- dalini Paschali, Louis Blankemeier, Dave Van Veen, Jeya Maria Jose Valanarasu, Alaa Youssef, Joseph Paul Cohen, Eduardo Pontes Reis, Emily B. Tsai, Andrew Johnston, Cameron Olsen, Tanishq Mathew Abraham, Sergios Gatidis, Akshay S. Chaudhari, and Curtis Langlotz. Chexagent: Towards a foundation model fo...

work page arXiv 2024

[8] [8]

Preparing a collection of radiology examinations for distribution and re- trieval.Journal of the American Medical Informatics Asso- ciation, 23(2):304–310, 2016

Dina Demner-Fushman, Marc D Kohli, Marc B Rosen- man, Steven E Shooshan, Louis Rodriguez, Sameer Antani, George R Thoma, and Clement J McDonald. Preparing a collection of radiology examinations for distribution and re- trieval.Journal of the American Medical Informatics Asso- ciation, 23(2):304–310, 2016. 5, 2

work page 2016

[9] [9]

The Llama 3 Herd of Models

Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

work page internal anchor Pith review Pith/arXiv arXiv

[10] [10]

Retrieval-based chest x-ray report generation using a pre-trained contrastive language-image model

Mark Endo, Rayan Krishnan, Viswesh Krishna, Andrew Y Ng, and Pranav Rajpurkar. Retrieval-based chest x-ray report generation using a pre-trained contrastive language-image model. InMachine Learning for Health, pages 209–219. PMLR, 2021. 1, 2

work page 2021

[11] [11]

Retrieval-Augmented Generation for Large Language Models: A Survey

Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Meng Wang, and Haofen Wang. Retrieval-augmented generation for large language models: A survey.arXiv preprint arXiv:2312.10997, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[12] [12]

HippoRAG: Neurobiologically in- spired long-term memory for large language models

Bernal Jimenez Gutierrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. HippoRAG: Neurobiologically in- spired long-term memory for large language models. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. 2

work page 2024

[13] [13]

A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, and Ting Liu. A sur- vey on hallucination in large language models: Principles, taxonomy, challenges, and open questions.arXiv preprint arXiv:2311.05232, 2023. 2

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024

[15] [15]

Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C

Stephanie L. Hyland, Shruthi Bannur, Kenza Bouzid, Daniel C. Castro, Mercy Ranjit, Anton Schwaighofer, Fer- nando P ´erez-Garc´ıa, Valentina Salvatelli, Shaury Srivas- tav, Anja Thieme, Noel Codella, Matthew P. Lungren, Maria Teodora Wetscherek, Ozan Oktay, and Javier Alvarez- Valle. Maira-1: A specialised large multimodal model for ra- diology report gen...

work page arXiv

[16] [16]

Mong, Safwan S

Jeremy Irvin, Pranav Rajpurkar, Michael Ko, Yifan Yu, Sil- viana Ciurea-Ilcus, Chris Chute, Henrik Marklund, Behzad Haghgoo, Robyn Ball, Katie Shpanskaya, Jayne Seekins, David A. Mong, Safwan S. Halabi, Jesse K. Sandberg, Ricky Jones, David B. Larson, Curtis P. Langlotz, Bhavik N. Patel, Matthew P. Lungren, and Andrew Y . Ng. Chexpert: a large chest radio...

work page 2019

[17] [17]

Bartold- son, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein

Neel Jain, Ping yeh Chiang, Yuxin Wen, John Kirchenbauer, Hong-Min Chu, Gowthami Somepalli, Brian R. Bartold- son, Bhavya Kailkhura, Avi Schwarzschild, Aniruddha Saha, Micah Goldblum, Jonas Geiping, and Tom Goldstein. NEF- Tune: Noisy embeddings improve instruction finetuning. In The Twelfth International Conference on Learning Represen- tations, 2024. 3

work page 2024

[18] [18]

Radgraph: Extracting clinical entities and relations from ra- diology reports

Saahil Jain, Ashwin Agrawal, Adriel Saporta, Steven Truong, Du Nguyen Duong Nguyen Duong, Tan Bui, Pierre Chambon, Yuhao Zhang, Matthew Lungren, Andrew Ng, Curtis Langlotz, Pranav Rajpurkar, and Pranav Rajpurkar. Radgraph: Extracting clinical entities and relations from ra- diology reports. InProceedings of the Neural Informa- tion Processing Systems Trac...

work page

[19] [19]

From clip to dino: Visual encoders shout in multi-modal large language models,

Dongsheng Jiang, Yuchen Liu, Songlin Liu, Jin’e Zhao, Hao Zhang, Zhen Gao, Xiaopeng Zhang, Jin Li, and Hongkai Xiong. From clip to dino: Visual encoders shout in multi-modal large language models.arXiv preprint arXiv:2310.08825, 2023. 3

work page arXiv 2023

[20] [20]

Promptmrg: Diagnosis-driven prompts for medical report generation

Haibo Jin, Haoxuan Che, Yi Lin, and Hao Chen. Promptmrg: Diagnosis-driven prompts for medical report generation. Proceedings of the AAAI Conference on Artificial Intelli- gence, 38(3):2607–2615, 2024. 5, 6, 2, 3

work page 2024

[21] [21]

Mimic-cxr, a de- identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019

Alistair EW Johnson, Tom J Pollard, Seth J Berkowitz, Nathaniel R Greenbaum, Matthew P Lungren, Chih-ying Deng, Roger G Mark, and Steven Horng. Mimic-cxr, a de- identified publicly available database of chest radiographs with free-text reports.Scientific data, 6(1):317, 2019. 4

work page 2019

[22] [22]

MIMIC-CXR-JPG, a large publicly available database of labeled chest radiographs

Alistair EW Johnson, Tom J Pollard, Nathaniel R Green- baum, Matthew P Lungren, Chih-ying Deng, Yifan Peng, Zhiyong Lu, Roger G Mark, Seth J Berkowitz, and Steven Horng. Mimic-cxr-jpg, a large publicly available database of labeled chest radiographs.arXiv preprint arXiv:1901.07042,

work page internal anchor Pith review Pith/arXiv arXiv 1901

[23] [23]

Transq: Transformer-based semantic query for med- ical report generation

Ming Kong, Zhengxing Huang, Kun Kuang, Qiang Zhu, and Fei Wu. Transq: Transformer-based semantic query for med- ical report generation. InMedical Image Computing and Computer Assisted Intervention – MICCAI 2022, pages 610– 620, Cham, 2022. Springer Nature Switzerland. 1, 2, 3, 5, 6

work page 2022

[24] [24]

The hungarian method for the assignment problem.Naval research logistics quarterly, 2(1-2):83–97,

Harold W Kuhn. The hungarian method for the assignment problem.Naval research logistics quarterly, 2(1-2):83–97,

work page

[25] [25]

Gonzalez, Hao Zhang, and Ion Stoica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. InPro- ceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles, 2023. 1

work page 2023

[26] [26]

Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich K¨uttler, Mike Lewis, Wen-tau Yih, Tim Rockt ¨aschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020. 1, 2

work page 2020

[27] [27]

Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021

Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: Vision and language representation learn- ing with momentum distillation.Advances in neural infor- mation processing systems, 34:9694–9705, 2021. 2

work page 2021

[28] [28]

Evcap: Retrieval-augmented image captioning with external visual-name memory for open-world compre- hension

Jiaxuan Li, Duc Minh V o, Akihiro Sugimoto, and Hideki Nakayama. Evcap: Retrieval-augmented image captioning with external visual-name memory for open-world compre- hension. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13733– 13742, 2024. 2

work page 2024

[29] [29]

Dynamic graph enhanced contrastive learning for chest x-ray report generation

Mingjie Li, Bingqian Lin, Zicong Chen, Haokun Lin, Xi- aodan Liang, and Xiaojun Chang. Dynamic graph enhanced contrastive learning for chest x-ray report generation. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3334–3343, 2023. 5, 6, 3

work page 2023

[30] [30]

Rouge: A package for automatic evaluation of summaries

Chin-Yew Lin. Rouge: A package for automatic evaluation of summaries. InText summarization branches out, pages 74–81, 2004. 5

work page 2004

[31] [31]

Bootstrapping large language models for radiology report generation.Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):18635–18643,

Chang Liu, Yuanhe Tian, Weidong Chen, Yan Song, and Yongdong Zhang. Bootstrapping large language models for radiology report generation.Proceedings of the AAAI Conference on Artificial Intelligence, 38(17):18635–18643,

work page

[32] [32]

Im- proving chest X-ray report generation by leveraging warm starting.Artificial Intelligence in Medicine, 144:102633,

Aaron Nicolson, Jason Dowling, and Bevan Koopman. Im- proving chest X-ray report generation by leveraging warm starting.Artificial Intelligence in Medicine, 144:102633,

work page

[33] [33]

Im- proving chest x-ray report generation by leveraging warm starting.Artificial intelligence in medicine, 144:102633,

Aaron Nicolson, Jason Dowling, and Bevan Koopman. Im- proving chest x-ray report generation by leveraging warm starting.Artificial intelligence in medicine, 144:102633,

work page

[34] [34]

Maxime Oquab, Timoth ´ee Darcet, Th´eo Moutakanni, Huy V . V o, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel HAZIZA, Francisco Massa, Alaaeldin El-Nouby, Mido Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Herve Je- gou, Julien Mairal, Patr...

work page 2024

[35] [35]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,

work page

[36] [36]

M4CXR: Exploring Multi-task Potentials of Multi-modal Large Language Models for Chest X-ray Interpretation

Jonggwon Park, Soobum Kim, Byungmu Yoon, Jihun Hyun, and Kyoyun Choi. M4cxr: Exploring multi-task potentials of multi-modal large language models for chest x-ray inter- pretation.arXiv preprint arXiv:2408.16213, 2024. 2, 5, 6, 7

work page internal anchor Pith review Pith/arXiv arXiv 2024

[37] [37]

Rad-dino: Exploring scalable medical image encoders beyond text supervision.arXiv preprint arXiv:2401.10815, 2024

Fernando P ´erez-Garc´ıa, Harshita Sharma, Sam Bond-Taylor, Kenza Bouzid, Valentina Salvatelli, Maximilian Ilse, Shruthi Bannur, Daniel C Castro, Anton Schwaighofer, Matthew P Lungren, et al. Rad-dino: Exploring scalable medical image encoders beyond text supervision.arXiv preprint arXiv:2401.10815, 2024. 5

work page arXiv 2024

[38] [38]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PMLR, 2021. 3, 4

work page 2021

[39] [39]

Im- proving radiology report generation systems by removing hallucinated references to non-existent priors

Vignav Ramesh, Nathan A Chi, and Pranav Rajpurkar. Im- proving radiology report generation systems by removing hallucinated references to non-existent priors. InMachine Learning for Health, pages 456–473. PMLR, 2022. 1, 2

work page 2022

[40] [40]

Smallcap: lightweight image captioning prompted with retrieval augmentation

Rita Ramos, Bruno Martins, Desmond Elliott, and Yova Ke- mentchedjhieva. Smallcap: lightweight image captioning prompted with retrieval augmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2840–2849, 2023. 2

work page 2023

[41] [41]

Retrieval augmented chest x-ray report gen- eration using openai gpt models

Mercy Ranjit, Gopinath Ganapathy, Ranjit Manuel, and Tanuja Ganu. Retrieval augmented chest x-ray report gen- eration using openai gpt models. InMachine Learning for Healthcare Conference, pages 650–666. PMLR, 2023. 2

work page 2023

[42] [42]

Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks

N Reimers. Sentence-bert: Sentence embeddings using siamese bert-networks.arXiv preprint arXiv:1908.10084,

work page internal anchor Pith review Pith/arXiv arXiv 1908

[43] [43]

Retrieval-augmented transformer for image caption- ing

Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cuc- chiara. Retrieval-augmented transformer for image caption- ing. InProceedings of the 19th international conference on content-based multimedia indexing, pages 1–7, 2022. 2

work page 2022

[44] [44]

Moein Shariatnia

M. Moein Shariatnia. Simple CLIP, 2021.https:// github.com/moein-shariatnia/OpenAI-CLIP. 4

work page 2021

[45] [45]

Eagle: Exploring the design space for multimodal LLMs with mixture of encoders

Min Shi, Fuxiao Liu, Shihao Wang, Shijia Liao, Subhashree Radhakrishnan, Yilin Zhao, De-An Huang, Hongxu Yin, Karan Sapra, Yaser Yacoob, Humphrey Shi, Bryan Catan- zaro, Andrew Tao, Jan Kautz, Zhiding Yu, and Guilin Liu. Eagle: Exploring the design space for multimodal LLMs with mixture of encoders. InThe Thirteenth International Conference on Learning Re...

work page 2025

[46] [46]

Chexbert: Combin- ing automatic labelers and expert annotations for accurate radiology report labeling using bert

Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y Ng, and Matthew P Lungren. Chexbert: Combin- ing automatic labelers and expert annotations for accurate radiology report labeling using bert. InEMNLP 2020-2020 Conference on Empirical Methods in Natural Language Pro- cessing, Proceedings of the Conference, pages 1500–1519,

work page 2020

[47] [47]

Fact-aware multimodal retrieval augmentation for accu- rate medical radiology report generation.arXiv preprint arXiv:2407.15268, 2024

Liwen Sun, James Zhao, Megan Han, and Chenyan Xiong. Fact-aware multimodal retrieval augmentation for accu- rate medical radiology report generation.arXiv preprint arXiv:2407.15268, 2024. 2, 5, 6

work page arXiv 2024

[48] [48]

Interactive and explainable region-guided radi- ology report generation

Tim Tanida, Philip M ¨uller, Georgios Kaissis, and Daniel Rueckert. Interactive and explainable region-guided radi- ology report generation. In2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), page 7433–7442. IEEE, 2023. 3

work page 2023

[49] [49]

Towards gen- eralist biomedical ai.NEJM AI, 1(3):AIoa2300138, 2024

Tao Tu, Shekoofeh Azizi, Danny Driess, Mike Schaeker- mann, Mohamed Amin, Pi-Chuan Chang, Andrew Carroll, Charles Lau, Ryutaro Tanno, Ira Ktena, et al. Towards gen- eralist biomedical ai.NEJM AI, 1(3):AIoa2300138, 2024. 1, 2, 5, 6, 7

work page 2024

[50] [50]

Metransformer: Radiology report generation by transformer with multiple learnable expert tokens

Zhanyu Wang, Lingqiao Liu, Lei Wang, and Luping Zhou. Metransformer: Radiology report generation by transformer with multiple learnable expert tokens. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11558–11567, 2023. 5, 6

work page 2023

[51] [51]

Distribution-balanced loss for multi-label classification in long-tailed datasets

Tong Wu, Qingqiu Huang, Ziwei Liu, Yu Wang, and Dahua Lin. Distribution-balanced loss for multi-label classification in long-tailed datasets. InComputer Vision – ECCV 2020, pages 162–178, Cham, 2020. Springer International Publish- ing. 4

work page 2020

[52] [52]

Retrieval-augmented egocentric video captioning

Jilan Xu, Yifei Huang, Junlin Hou, Guo Chen, Yuejie Zhang, Rui Feng, and Weidi Xie. Retrieval-augmented egocentric video captioning. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 13525–13536, 2024. 2

work page 2024

[53] [53]

Style- aware radiology report generation with radgraph and few- shot prompting

Benjamin Yan, Ruochen Liu, David Kuo, Subathra Adithan, Eduardo Reis, Stephen Kwak, Vasantha Venugopal, Chloe O’Connell, Agustina Saenz, Pranav Rajpurkar, et al. Style- aware radiology report generation with radgraph and few- shot prompting. InFindings of the Association for Computa- tional Linguistics: EMNLP 2023, pages 14676–14688, 2023. 2, 5, 6

work page 2023

[54] [54]

Advancing multi- modal medical capabilities of gemini.arXiv preprint arXiv:2405.03162, 2024

Lin Yang, Shawn Xu, Andrew Sellergren, Timo Kohlberger, Yuchen Zhou, Ira Ktena, Atilla Kiraly, Faruk Ahmed, Farhad Hormozdiari, Tiam Jaroensri, et al. Advancing multi- modal medical capabilities of gemini.arXiv preprint arXiv:2405.03162, 2024. 1, 2, 5, 6, 8

work page arXiv 2024

[55] [55]

Kevin Zhou, and Li Xiao

Shuxin Yang, Xian Wu, Shen Ge, Zhuozhao Zheng, S. Kevin Zhou, and Li Xiao. Radiology report generation with a learned knowledge base and multi-modal alignment.Med- ical Image Analysis, 86:102798, 2023. 3

work page 2023

[56] [56]

Retrieval-augmented multimodal language modeling

Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Richard James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-Tau Yih. Retrieval-augmented multimodal language modeling. InInternational Conference on Machine Learning, pages 39755–39769. PMLR, 2023. 2

work page 2023

[57] [57]

Evaluating progress in automatic chest x-ray radiology report generation.Patterns, 4(9), 2023

Feiyang Yu, Mark Endo, Rayan Krishnan, Ian Pan, Andy Tsai, Eduardo Pontes Reis, Eduardo Kaiser Ururahy Nunes Fonseca, Henrique Min Ho Lee, Zahra Shakeri Hossein Abad, Andrew Y Ng, et al. Evaluating progress in automatic chest x-ray radiology report generation.Patterns, 4(9), 2023. 5

work page 2023

[58] [58]

BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs

Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, et al. Biomedclip: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs.arXiv preprint arXiv:2303.00915,

work page internal anchor Pith review Pith/arXiv arXiv

[59] [59]

Topicwise separable sentence retrieval for med- ical report generation.IEEE Transactions on Medical Imag- ing, 2024

Junting Zhao, Yang Zhou, Zhihao Chen, Huazhu Fu, and Liang Wan. Topicwise separable sentence retrieval for med- ical report generation.IEEE Transactions on Medical Imag- ing, 2024. 1, 2 Leveraging LLMs for Multimodal Retrieval-Augmented Radiology Report Generation via Key Phrase Extraction Supplementary Material A. Implementation Details A.1. External Sou...

work page 2024

[60] [60]

new”, “improved

used the entire IU X-Ray dataset as the test set, treating each frontal and lateral image as an independent sample and excluding a portion of normal images to maintain a 10% normal image ratio. This subset of 4,168 images is publicly available8 and is also used in our evaluation to assess the performance of single-view RRG. B.2. Results Table 5 shows our ...

work page