Enhancing Fine-Grained Spatial Grounding in 3D CT Report Generation via Discriminative Guidance
Pith reviewed 2026-05-10 15:51 UTC · model grok-4.3
The pith
DCP-PD distills fine-grained cues from text reports and applies prompt dropout to improve spatial grounding in 3D CT report generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DCP-PD distills fine-grained cues from free-text reports to guide report generation while using prompt dropout to mitigate shortcut learning, achieving state-of-the-art macro F1 of 0.603 on CT-RATE and raising out-of-distribution F1 from 0.266 to 0.503 on Rad-ChestCT; the same framework introduces a presence-laterality-lobe question protocol that reveals persistent challenges in fine-grained spatial localization even among high-scoring models.
What carries the argument
Discriminative Cue-Prompting with Prompt Dropout (DCP-PD), a plug-and-play framework that extracts fine-grained pathology cues from reports to supply location-specific supervision and prevents shortcut learning through prompt dropout.
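The framework is described only at this level of abstraction, but the two ingredients can be sketched. In the toy version below, the cue vocabulary, the regex-based extraction, and the 0.3 dropout rate are illustrative assumptions standing in for the paper's actual distillation and injection machinery:

```python
import random
import re

LATERALITY = ("left", "right", "bilateral")
LOBES = ("upper lobe", "middle lobe", "lower lobe", "lingula")

def extract_cues(report: str) -> list[str]:
    """Pull coarse location cues (laterality, lobe) from free-text findings.
    A real distillation step would use an LLM or a trained parser; keyword
    matching stands in for it here."""
    text = report.lower()
    cues = [w for w in LATERALITY if re.search(rf"\b{w}\b", text)]
    cues += [l for l in LOBES if l in text]
    return cues

def build_prompt(cues: list[str], p_drop: float = 0.3, training: bool = True) -> str:
    """Prompt dropout: during training, each distilled cue is independently
    dropped with probability p_drop, so the generator cannot lean on the
    text-side shortcut alone and must consult the volume."""
    kept = [c for c in cues if not (training and random.random() < p_drop)]
    return "Locations to verify in the volume: " + (", ".join(kept) or "none")

report = "Ground-glass opacity in the right upper lobe; left pleural effusion."
print(extract_cues(report))  # ['left', 'right', 'upper lobe']
```

At inference (`training=False`) all cues pass through; the regularization only perturbs the training-time prompt.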
If this is right
- Macro F1 on CT-RATE rises from 0.501 to 0.603.
- Out-of-distribution F1 on Rad-ChestCT nearly doubles from 0.266 to 0.503.
- Models show measurable improvement on hierarchical location questions covering presence, laterality, and lobe.
- Shortcut reliance is reduced without harming overall report quality.
- The hierarchical evaluation protocol provides a more diagnostic check for spatial grounding than existing lexical or entity-overlap metrics.
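For concreteness, macro F1 (the metric behind both headline numbers) is the unweighted mean of per-label F1 scores, so a rare pathology weighs as much as a common one; that is why distribution shift moves it so sharply. A minimal sketch with made-up counts:

```python
def f1(tp: int, fp: int, fn: int) -> float:
    """F1 from true-positive, false-positive, false-negative counts."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def macro_f1(per_label_counts: dict[str, tuple[int, int, int]]) -> float:
    """Unweighted mean of per-label F1: every label counts equally,
    regardless of how often it occurs."""
    scores = [f1(*c) for c in per_label_counts.values()]
    return sum(scores) / len(scores)

counts = {
    "effusion":    (80, 10, 10),  # common label, high F1
    "atelectasis": (5, 5, 15),    # rare label, low F1 drags the macro mean
}
print(round(macro_f1(counts), 3))  # 0.611
```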
Where Pith is reading between the lines
- The cue-distillation idea could be transferred to other volumetric imaging modalities such as MRI or PET where free-text reports also contain unexploited location detail.
- Prompt dropout may serve as a general regularizer for other vision-language report generators to reduce text-only bias.
- The presence-laterality-lobe protocol offers a template for creating location-specific test suites in any medical VLM benchmark.
Load-bearing premise
Fine-grained cues distilled from free-text reports supply accurate, unbiased supervision for the actual locations of pathologies in the corresponding CT volumes.
What would settle it
A test set in which report text is deliberately edited to contain incorrect laterality or lobe information while the CT volumes remain unchanged; if the model still produces reports that match the altered text instead of the image, the cue-distillation claim is falsified.
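Such a counterfactual set could be built mechanically by swapping location tokens in the reports while leaving the volumes untouched. The swap table below is an illustrative assumption; a real protocol would also need to handle case, negation, and phrases like "bilateral":

```python
import re

# Symmetric swap table: applying the corruption twice restores the original.
SWAP = {"left": "right", "right": "left",
        "upper lobe": "lower lobe", "lower lobe": "upper lobe"}

def corrupt_locations(report: str) -> str:
    """Swap laterality/lobe tokens in a single regex pass, so that
    'left' -> 'right' is not immediately undone by 'right' -> 'left'."""
    pattern = re.compile(r"\b(" + "|".join(sorted(SWAP, key=len, reverse=True)) + r")\b")
    return pattern.sub(lambda m: SWAP[m.group(1)], report)

print(corrupt_locations("Nodule in the left upper lobe."))
# Nodule in the right lower lobe.
```

If a model trained with distilled cues reproduces the corrupted locations from such reports, it is copying text rather than grounding in the image.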
Original abstract
Vision-language models (VLMs) for radiology report generation (RRG) can produce long-form chest CT reports from volumetric scans and show strong potential to improve radiology workflow efficiency and consistency. However, existing methods face two key limitations: (i) training supervision is often coarse, aligning a whole CT volume with a full free-text report without explicit alignment for fine-grained attributes or pathology locations; and (ii) evaluation is typically holistic (lexical overlap, entity matching, or LLM-as-a-judge scores) and not diagnostic for spatial grounding. We propose Discriminative Cue-Prompting with Prompt Dropout (DCP-PD), a plug-and-play framework that distills fine-grained cues from free-text reports and uses them to guide report generation while mitigating shortcut reliance via prompt dropout. DCP-PD achieves state-of-the-art performance on CT-RATE, improving macro F1 from 0.501 to 0.603 (20% relative), and substantially boosts out-of-distribution performance on Rad-ChestCT, from F1 = 0.266 to 0.503 (89% relative). Finally, we introduce a hierarchical, location-aware question-set protocol (presence → laterality → lobe) to directly assess pathology-location grounding, showing that fine-grained spatial localization remains challenging even for models that score highly on current benchmarks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DCP-PD (Discriminative Cue-Prompting with Prompt Dropout), a plug-and-play framework that distills fine-grained cues from free-text radiology reports to provide explicit supervision for pathology locations and laterality in 3D CT report generation. Prompt dropout is introduced to reduce shortcut learning. The work reports a state-of-the-art macro F1 on CT-RATE (0.501 → 0.603) and large out-of-distribution gains on Rad-ChestCT (0.266 → 0.503), while introducing a new hierarchical location-aware question-set evaluation protocol (presence, laterality, lobe) to diagnose spatial grounding.
Significance. If the reported gains are attributable to improved spatial grounding rather than incidental effects, the framework and especially the new evaluation protocol could become useful tools for developing and assessing fine-grained VLMs in radiology. The large relative OOD improvement is noteworthy and, if reproducible, would strengthen claims about robustness.
major comments (3)
- [Abstract / Methods] The central performance claims rest on the unvalidated assumption that cues automatically distilled from free-text reports supply accurate, unbiased supervision for pathology locations and laterality. Free-text reports frequently omit explicit laterality or lobe information or use ambiguous phrasing; without an independent validation (e.g., comparison of distilled cues against expert-annotated bounding boxes or a held-out set of location labels), the 20% and 89% relative F1 gains cannot be confidently attributed to better spatial grounding.
- [Experiments / Ablation studies] Prompt dropout is presented as the mechanism that prevents shortcut learning, yet the manuscript provides no ablation that isolates its contribution to both the new hierarchical grounding metrics and standard report-quality scores (RadGraph F1, clinical correctness). Without this, it remains possible that the observed improvements stem from generic regularization rather than the intended discriminative guidance.
- [Evaluation Protocol] The hierarchical question-set protocol is a constructive addition, but its reliability depends on how questions are generated from reports. The manuscript should detail the exact prompting or parsing procedure used to create presence/laterality/lobe questions and report inter-annotator or consistency statistics; otherwise the protocol itself risks inheriting the same ambiguities present in the original reports.
minor comments (2)
- [Abstract] The abstract states 'macro F1' without specifying the exact label set or averaging procedure; this should be clarified in the main text and tables for reproducibility.
- [Methods] Implementation details (exact distillation prompt templates, dropout rate schedule, and how the cue embeddings are injected into the VLM) are referenced but not fully specified; adding them would aid replication.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below with clarifications and commitments to revisions that strengthen the attribution of gains to spatial grounding and the reliability of our contributions.
Point-by-point responses
Referee: [Abstract / Methods] The central performance claims rest on the unvalidated assumption that cues automatically distilled from free-text reports supply accurate, unbiased supervision for pathology locations and laterality. Free-text reports frequently omit explicit laterality or lobe information or use ambiguous phrasing; without an independent validation (e.g., comparison of distilled cues against expert-annotated bounding boxes or a held-out set of location labels), the 20% and 89% relative F1 gains cannot be confidently attributed to better spatial grounding.
Authors: We recognize that free-text reports can contain omissions and ambiguities regarding laterality and lobe. DCP-PD distills cues directly from these reports to supply explicit location and laterality supervision during training, and the hierarchical evaluation protocol (presence → laterality → lobe) is introduced precisely to diagnose whether these cues translate into improved spatial grounding in generated reports. The large relative OOD gains on Rad-ChestCT support that the improvements generalize beyond dataset-specific patterns. We agree that direct comparison to expert bounding boxes would provide stronger evidence; since such annotations are unavailable in the benchmarks, we will add an explicit limitations paragraph discussing reliance on report-derived cues. revision: partial
Referee: [Experiments / Ablation studies] Prompt dropout is presented as the mechanism that prevents shortcut learning, yet the manuscript provides no ablation that isolates its contribution to both the new hierarchical grounding metrics and standard report-quality scores (RadGraph F1, clinical correctness). Without this, it remains possible that the observed improvements stem from generic regularization rather than the intended discriminative guidance.
Authors: We agree that an ablation isolating prompt dropout is required to confirm its specific role. In the revised manuscript we will add ablation experiments that separately quantify the contribution of prompt dropout to the hierarchical location-aware metrics (presence, laterality, lobe) as well as to standard report-quality metrics including RadGraph F1 and clinical correctness. These results will clarify whether gains arise from the discriminative guidance mechanism rather than generic regularization. revision: yes
Referee: [Evaluation Protocol] The hierarchical question-set protocol is a constructive addition, but its reliability depends on how questions are generated from reports. The manuscript should detail the exact prompting or parsing procedure used to create presence/laterality/lobe questions and report inter-annotator or consistency statistics; otherwise the protocol itself risks inheriting the same ambiguities present in the original reports.
Authors: We will expand the methods section with the exact prompting templates and parsing rules used to derive the presence, laterality, and lobe questions from the source reports. We will also report consistency statistics (e.g., agreement across repeated parsing runs and sensitivity to prompt variations) to demonstrate the protocol's reliability and mitigate concerns about inheriting report ambiguities. revision: yes
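One such consistency statistic is exact agreement across repeated parsing runs (a Cohen's-kappa-style chance correction would be stricter). A minimal sketch, with invented run data:

```python
def agreement(runs: list[list[str]]) -> float:
    """Fraction of items on which every repeated parsing run produced the
    same answer: the most basic consistency statistic a revision could report."""
    n = len(runs[0])
    return sum(len({r[i] for r in runs}) == 1 for i in range(n)) / n

# Two parsing runs over three laterality questions; they disagree on the last.
run_a = ["right", "left", "right"]
run_b = ["right", "left", "left"]
print(round(agreement([run_a, run_b]), 3))  # 0.667
```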
- Not committed to revision: independent validation of distilled cues against expert-annotated bounding boxes or held-out location labels, as no such annotations exist in the CT-RATE or Rad-ChestCT datasets.
Circularity Check
No significant circularity in derivation or claims
Full rationale
The paper proposes an empirical framework (DCP-PD) that distills cues from reports and applies prompt dropout for CT report generation, with all central claims consisting of measured performance gains on external benchmarks (CT-RATE macro F1 0.501→0.603; Rad-ChestCT F1 0.266→0.503) and a new evaluation protocol. No equations, derivations, or self-citations are present that reduce any result to fitted inputs by construction, rename known patterns, or make the core improvement self-definitional. The method is presented as a plug-and-play addition whose value is demonstrated through ablation and out-of-distribution testing rather than analytic closure.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: vision-language models trained on paired CT volumes and reports can be improved by additional fine-grained cue supervision.
- Domain assumption: prompt dropout during training prevents shortcut learning while preserving report quality.
invented entities (1)
- DCP-PD framework: no independent evidence