Exploring the Capabilities of Large Language Model Encoders for Image-Text Retrieval in Chest X-rays

Changi Kim; Chang Min Park; Dongheon Lee; Donguk Kim; Gihun Cho; Hanbin Ko; Inhyeok Baek; Joonbeom Koo

arxiv: 2509.15234 · v2 · pith:TZ5I2NXVnew · submitted 2025-09-17 · 💻 cs.CV

Hanbin Ko, Gihun Cho, Inhyeok Baek, Donguk Kim, Joonbeom Koo, Changi Kim, Dongheon Lee, Chang Min Park

classification 💻 cs.CV

keywords: reports, chest, large, learning, retrieval, text, alignment, bidirectional

Multimodal learning from paired medical images and clinical text is a central challenge in medical data-driven informatics, where effective cross-modal alignment is critical for scalable analysis and retrieval. In chest radiography, vision-language pretraining is constrained by heterogeneous radiology reports that contain abbreviations, impression-only notes, and institution-specific writing styles. Unlike general-domain settings, naively aggregating large collections of noisy reports can plateau or even degrade multimodal learning when reporting styles differ substantially. We propose a domain-adapted bidirectional large language model text encoder for chest radiograph reports, trained with masked token prediction and supervised contrastive learning on stylistically diverse but clinically equivalent report variants to produce robust, generalizable text embeddings. We then integrate this encoder into a dual-tower contrastive vision-language framework using parameter-efficient adaptation to improve image-text alignment. Across 1.6 million paired studies from public datasets and a de-identified hospital cohort, the proposed models improve bidirectional retrieval accuracy and external generalization, achieving GREEN scores of 0.308 on MIMIC-CXR and 0.618 on Open-I, while reducing the degradation observed when abbreviation-rich, impression-only hospital reports are added to training.
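The abstract describes a dual-tower contrastive framework evaluated on bidirectional retrieval, but it does not state the exact loss or retrieval metric. The following is a minimal sketch, assuming a CLIP-style symmetric cross-entropy over cosine similarities and a Recall@K metric; the function names, the temperature of 0.07, and the toy embeddings are illustrative assumptions, not the authors' implementation.

```python
import math

def normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def similarity_matrix(img_embs, txt_embs):
    """Cosine similarity between every image embedding and every text embedding."""
    imgs = [normalize(v) for v in img_embs]
    txts = [normalize(v) for v in txt_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def _cross_entropy_diag(rows):
    """Mean of -log softmax(row)[i], where the target for row i is column i."""
    total = 0.0
    for i, row in enumerate(rows):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(rows)

def symmetric_contrastive_loss(img_embs, txt_embs, temperature=0.07):
    """Average of image-to-text and text-to-image losses (assumed loss form).

    Pair i of img_embs and txt_embs is treated as the matching study.
    """
    sims = similarity_matrix(img_embs, txt_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))

def recall_at_k(query_embs, gallery_embs, k=1):
    """Fraction of queries whose paired item ranks in the top k (assumed metric)."""
    sims = similarity_matrix(query_embs, gallery_embs)
    hits = 0
    for i, row in enumerate(sims):
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        hits += i in ranked[:k]
    return hits / len(sims)

# Toy embeddings standing in for the image-tower and text-tower outputs.
images = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
reports = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 0.5]]

loss = symmetric_contrastive_loss(images, reports)
image_to_text = recall_at_k(images, reports, k=1)
text_to_image = recall_at_k(reports, images, k=1)
```

Swapping the query and gallery arguments gives the two retrieval directions, which is one common reading of "bidirectional retrieval"; the GREEN scores quoted in the abstract are a separate metric and are not computed here.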

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Temporal Inversion for Learning Interval Change in Chest X-Rays
cs.CV · 2026-04 · unverdicted · novelty 7.0

TILA uses temporal inversion of image pairs as a supervisory signal to make existing temporal vision-language models more sensitive to directional interval changes in chest X-rays.