
arxiv: 2506.23102 · v2 · pith:QOLZPOJFnew · submitted 2025-06-29 · 📡 eess.IV · cs.CV

Region-Aware Multimodal Large Language Model via SlowFast Tokenization and Pseudo-Mask Guidance for 3D CT Report Generation

classification 📡 eess.IV cs.CV
keywords generation model report medregion-ct slowfast clinically framework global

Current CT report generation frameworks predominantly rely on global feature representations, often failing to capture region-specific details and potentially missing certain abnormalities. To overcome this limitation, we propose MedRegion-CT, a region-focused multimodal large language model framework featuring three key innovations. First, we revisit the SlowFast strategy to jointly model global and fine-grained information and adapt it to the medical domain via a Region-based SlowFast Tokenizer that extracts tokens guided by clinically meaningful regions. Second, generated pseudo-masks guide the model to attend to diagnostically important anatomical regions, facilitating a systematic understanding of the overall scan context. Third, quantitative lesion information, including size, diameter, and spatial location, is encoded as structured textual prompts, enabling context-aware and clinically informed report generation. To enable rigorous evaluation, we validate our framework on multi-institutional structured report generation benchmarks. Experimental results demonstrate that MedRegion-CT achieves state-of-the-art performance, outperforming existing approaches in both linguistic quality and clinical accuracy. All code is publicly available at: https://github.com/babbu3682/MedRegion-CT.
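The third component, encoding quantitative lesion information as structured textual prompts, can be illustrated with a minimal sketch. This is not the authors' implementation: the `Lesion` fields, the `lesion_prompt` function, the use of volume as the "size" measure, and the prompt wording are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Lesion:
    # All field names are hypothetical; the abstract only names
    # size, diameter, and spatial location as the encoded attributes.
    kind: str            # e.g. "nodule"
    volume_mm3: float    # "size", assumed here to be a volume
    diameter_mm: float
    location: str        # e.g. an anatomical region name


def lesion_prompt(lesions):
    """Render lesion measurements as a plain-text prompt (illustrative format)."""
    if not lesions:
        return "No lesions detected."
    lines = [f"Detected lesions: {len(lesions)}"]
    for i, les in enumerate(lesions, 1):
        lines.append(
            f"{i}. {les.kind}; location: {les.location}; "
            f"diameter: {les.diameter_mm:.1f} mm; "
            f"volume: {les.volume_mm3:.0f} mm^3"
        )
    return "\n".join(lines)


if __name__ == "__main__":
    print(lesion_prompt([Lesion("nodule", 523.6, 10.0, "right upper lobe")]))
```

A string of this kind would be concatenated with the instruction text given to the language model; how MedRegion-CT actually formats and places its prompts is documented in the linked repository, not here.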



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Segmentation, Detection and Explanation: A Unified Framework for CT Appearance Reasoning

    cs.CV · 2026-05 · unverdicted · novelty 6.0

    A unified autoregressive vision-language framework integrates segmentation, detection, and appearance reasoning for CT images via task-routing tokens and progressive refinement, with gains on public benchmarks.

  2. Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance

    cs.CV · 2026-04 · unverdicted · novelty 6.0

    DCP-PD improves macro F1 scores on CT report generation benchmarks and introduces a hierarchical location-aware evaluation protocol that reveals ongoing challenges in pathology spatial grounding.