Region-Aware Multimodal Large Language Model via SlowFast Tokenization and Pseudo-Mask Guidance for 3D CT Report Generation

Dongyeong Kim; Hyungbin Park; Hyunseok Lim; JiHyun Kim; Jimin Sung; Jinyoung Seo; Namkug Kim; Sunggu Kyung; Wooyoung Jo; Yoojin Nam

arXiv: 2506.23102 · v2 · pith:QOLZPOJFnew · submitted 2025-06-29 · eess.IV · cs.CV

Sunggu Kyung, Jinyoung Seo, Hyunseok Lim, Dongyeong Kim, Hyungbin Park, Jimin Sung, Jihyun Kim, Wooyoung Jo, Yoojin Nam, Namkug Kim

keywords: generation, model, report, medregion-ct, slowfast, clinically, framework, global

Current CT report generation frameworks predominantly rely on global feature representations, often failing to capture region-specific details and potentially missing certain abnormalities. To overcome this limitation, we propose MedRegion-CT, a region-focused multimodal large language model framework featuring three key innovations. First, we revisit the SlowFast strategy to jointly model global and fine-grained information and adapt it to the medical domain via a Region-based SlowFast Tokenizer that extracts tokens guided by clinically meaningful regions. Second, generated pseudo-masks guide the model to attend to diagnostically important anatomical regions, facilitating a systematic understanding of the overall scan context. Third, quantitative lesion information, including size, diameter, and spatial location, is encoded as structured textual prompts, enabling context-aware and clinically informed report generation. To enable rigorous evaluation, we validate our framework on multi-institutional structured report generation benchmarks. Experimental results demonstrate that MedRegion-CT achieves state-of-the-art performance, outperforming existing approaches in both linguistic quality and clinical accuracy. All code is publicly available at: https://github.com/babbu3682/MedRegion-CT.
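The third component in the abstract, encoding quantitative lesion information as structured textual prompts, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Lesion` container, the `lesions_to_prompt` function, the field names, and the prompt wording are all hypothetical, chosen only to show how size, diameter, and spatial location might be rendered as text for a language model.

```python
from dataclasses import dataclass


@dataclass
class Lesion:
    """Hypothetical container for one lesion's measurements (not from the paper)."""
    kind: str           # lesion type, e.g. "nodule"
    volume_mm3: float   # lesion size expressed as a volume
    diameter_mm: float  # longest diameter
    location: str       # free-text anatomical location


def lesions_to_prompt(lesions):
    """Render lesion measurements as one structured text line per lesion."""
    if not lesions:
        return "Lesions: none detected."
    lines = ["Lesions:"]
    for i, les in enumerate(lesions, start=1):
        lines.append(
            f"{i}. type={les.kind}; size={les.volume_mm3:.1f} mm3; "
            f"diameter={les.diameter_mm:.1f} mm; location={les.location}"
        )
    return "\n".join(lines)


# Illustrative values only; they do not come from the paper or any dataset.
example = [Lesion("nodule", 523.6, 10.0, "right upper lobe")]
prompt = lesions_to_prompt(example)
print(prompt)
```

A fixed key=value layout like this keeps each measurement on a predictable line, which is one plausible way to make the numbers easy for a model to reference; the actual prompt format used by MedRegion-CT should be checked against the released code.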

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning
cs.CV 2026-05 unverdicted novelty 6.0

A unified autoregressive vision-language framework integrates segmentation, detection, and appearance reasoning for CT images via task-routing tokens and progressive refinement, with gains on public benchmarks.

Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance
cs.CV 2026-04 unverdicted novelty 6.0

DCP-PD improves macro F1 scores on CT report generation benchmarks and introduces a hierarchical location-aware evaluation protocol that reveals ongoing challenges in pathology spatial grounding.