Region-Grounded Report Generation for 3D Medical Imaging: A Fine-Grained Dataset and Graph-Enhanced Framework

Aditya Narayan Sankaran; Cong Huy Nguyen; Guanlin Li; Mai Hong Son; Mai Huy Thong; Noel Crespi; Phi Le Nguyen; Reza Farahbakhsh; Son Dinh Nguyen; Thanh Trung Nguyen; Tuan Dung Nguyen

arxiv: 2604.18145 · v2 · pith:IJNFNYZOnew · submitted 2026-04-20 · 💻 cs.CV · cs.AI


Pith reviewed 2026-05-19 18:08 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords medical report generation · 3D PET/CT imaging · regions of interest · graph-based models · clinical evaluation metrics · fine-grained annotations · low-resource language datasets


The pith

Graph-based framework with region annotations generates more reliable reports from 3D PET/CT scans.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses automated medical report generation for high-dimensional 3D PET/CT volumes, which current methods handle poorly by mapping entire scans to text without localized analysis. It introduces VietPET-RoI, the first large-scale dataset with 600 samples and 1,960 manually annotated regions of interest paired with clinical reports for a low-resource language. The HiRRA framework applies graph-based relational modules to model dependencies among attributes of these regions, following the step-by-step diagnostic process of radiologists rather than global pattern matching. This produces reports with higher alignment to actual clinical findings. A sympathetic reader would care because the shift enables more trustworthy automated assistance in interpreting complex volumetric images where hallucinations can affect patient care.

Core claim

By grounding report generation in fine-grained RoI annotations and using graph-based relational modules to capture dependencies between RoI attributes, the framework shifts from whole-volume mapping to localized clinical reasoning, achieving state-of-the-art results with gains of 19.7% in BLEU, 4.7% in ROUGE-L, and 45.8% in new clinical metrics that quantify RoI coverage and description quality.
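For readers unfamiliar with the text-overlap scores cited above, the following is a minimal sketch of a ROUGE-L style F1 score based on the longest common subsequence. It assumes whitespace tokenization and equal weighting of precision and recall; the paper's exact scoring configuration is not stated here, so treat the function names and choices as illustrative only.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for tok_a in a:
        curr = [0]
        for j, tok_b in enumerate(b, start=1):
            if tok_a == tok_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L style F1 over whitespace tokens (an assumed tokenization)."""
    ref, cand = reference.split(), candidate.split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical report fragments, for illustration only.
reference = "focal FDG uptake in the cecum"
candidate = "FDG uptake in cecum"
print(round(rouge_l_f1(reference, candidate), 3))
```

A score like this rewards shared word order but says nothing about whether the right region was described, which is the gap the clinical metrics are meant to cover.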

What carries the argument

Graph-based relational modules that capture dependencies between RoI attributes, enabling the model to mimic radiologist analysis of localized findings instead of global volume patterns.
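To make the idea of a graph-based relational module concrete, here is a minimal single-head sketch of a GATv2-style attention update (GATv2 is the layer named in the Figure 4 caption). It is written in plain Python under stated assumptions: the node features, edges, projection matrices `W_self` and `W_nbr`, and attention vector `a` are all made-up values, and this is not the HiRRA implementation.

```python
import math


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def gatv2_layer(h, neighbors, W_self, W_nbr, a, slope=0.2):
    """One single-head GATv2-style update over RoI nodes.

    h         : list of node feature vectors (one per RoI)
    neighbors : dict mapping node index -> non-empty list of neighbor indices
    W_self    : projection applied to the receiving node
    W_nbr     : projection applied to each neighbor
    a         : attention vector scoring each (node, neighbor) pair
    """
    proj_self = [matvec(W_self, x) for x in h]
    proj_nbr = [matvec(W_nbr, x) for x in h]
    dim = len(proj_nbr[0])
    outputs, attention = [], []
    for i in range(len(h)):
        scores = []
        for j in neighbors[i]:
            # GATv2 applies the nonlinearity before the attention vector.
            z = [s + n for s, n in zip(proj_self[i], proj_nbr[j])]
            z = [v if v > 0 else slope * v for v in z]
            scores.append(sum(ak * zk for ak, zk in zip(a, z)))
        # Softmax over the neighborhood (shifted for numerical stability).
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        # Attention-weighted sum of projected neighbor features.
        outputs.append([
            sum(w * proj_nbr[j][k] for w, j in zip(alpha, neighbors[i]))
            for k in range(dim)
        ])
        attention.append(alpha)
    return outputs, attention


# Three hypothetical RoIs with 2-d features; the edges are made up.
h = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
neighbors = {0: [0, 1], 1: [0, 1, 2], 2: [1, 2]}
W_self = [[1.0, 0.0], [0.0, 1.0]]
W_nbr = [[0.5, 0.0], [0.0, 0.5]]
a = [1.0, -1.0]
outputs, attention = gatv2_layer(h, neighbors, W_self, W_nbr, a)
```

Each row of `attention` sums to one, so every RoI's updated feature is a weighted mixture of its neighbors' projected features; that is the sense in which such a layer can model dependencies between regions.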

If this is right

Reports exhibit greater fidelity to specific localized findings, lowering the rate of unsupported statements.
The introduced RoI Coverage and RoI Quality Index metrics offer a more targeted way to assess clinical reliability beyond text overlap scores.
Performance gains indicate the method can support report generation in data-scarce settings for languages with limited medical corpora.
The dataset enables training and benchmarking of future models that explicitly reason over annotated regions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graph approach could be tested on other 3D modalities such as MRI to check whether RoI grounding transfers without new manual annotations.
Pairing the framework with automated RoI detectors might remove the need for human-provided region labels at inference time.
Extending the clinical metrics to measure consistency across multiple radiologist reports could further validate reduced hallucination.

Load-bearing premise

The assumption that graph-based relational modules accurately capture clinically meaningful dependencies between RoI attributes and that LLM-based extraction for RoI Coverage and RoI Quality Index provides an unbiased measure of report fidelity.

What would settle it

If disabling the graph relational modules in HiRRA produces no measurable drop in RoI Coverage and RoI Quality Index scores on the held-out test set, the claimed benefit of modeling attribute dependencies would not hold.
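One way to decide whether an ablation produces a "measurable drop" is a paired bootstrap over per-report scores. The sketch below assumes each test report has one score from the full model and one from the ablated model; the score lists are hypothetical, and the paper does not describe this particular test.

```python
import random


def paired_bootstrap(full_scores, ablated_scores, n_resamples=2000, seed=0):
    """Mean paired difference with a 95% percentile bootstrap interval."""
    if len(full_scores) != len(ablated_scores):
        raise ValueError("score lists must be paired")
    diffs = [f - a for f, a in zip(full_scores, ablated_scores)]
    n = len(diffs)
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample reports with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int(0.025 * n_resamples)]
    upper = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Hypothetical per-report metric scores, for illustration only.
full = [0.62, 0.70, 0.55, 0.81, 0.66, 0.73, 0.59, 0.77]
ablated = [0.60, 0.64, 0.56, 0.72, 0.61, 0.70, 0.52, 0.69]
mean_diff, lower, upper = paired_bootstrap(full, ablated)
print(mean_diff, lower, upper)
```

If the interval comfortably excludes zero, the drop is unlikely to be resampling noise; an interval straddling zero would support the "no measurable drop" outcome described above.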

Figures

Figures reproduced from arXiv: 2604.18145 by Aditya Narayan Sankaran, Cong Huy Nguyen, Guanlin Li, Mai Hong Son, Mai Huy Thong, Noel Crespi, Phi Le Nguyen, Reza Farahbakhsh, Son Dinh Nguyen, Thanh Trung Nguyen, Tuan Dung Nguyen.

**Figure 1.** Illustration of VietPET-RoI annotation. Following doctors’ conventional workflow, VietPET-RoI provides hierarchical annotations at both region-level and RoI-level with structured clinical attributes.

**Figure 2.** Overview of the VietPET-RoI dataset. The figure displays (top) the multimodal data samples including 3D PET/CT volumes, structured RoI descriptions, and clinical reports; and (bottom) the four-stage curation pipeline, spanning from raw data acquisition to expert-level annotation.

**Figure 3.** Data distribution across the six cancer types.

**Figure 4.** The overall architecture of HiRRA. The framework processes paired PET/CT volumes through a Dual Encoder and a Hierarchical Feature Extractor. The Global Context is captured via Q-former, while the Local Context is using SPP-RoI extraction and GATv2. Finally, the LLM generates the report using a semantic-injected prompt.

**Figure 5.** Overview of our proposed clinical evaluation protocol. We utilize an LLM-based framework to extract structured clinical attributes from reports. RoI Coverage is quantified by aligning predicted and ground-truth RoIs via embedding-based Hungarian matching. For aligned pairs, the RoI Quality Index (RoIQ) measures semantic fidelity, strictly enforcing anatomical and lesion-type correctness.
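The matching step described in the Figure 5 caption can be sketched as follows. This is an illustration under several assumptions: the embeddings are toy 2-d vectors, the similarity is cosine, the 0.8 threshold is arbitrary, and coverage is taken to be the fraction of ground-truth RoIs with an accepted match (the paper's exact formula is not given here). For brevity the sketch swaps the Hungarian algorithm for an exhaustive search, which returns the same optimal assignment on small inputs; in practice `scipy.optimize.linear_sum_assignment` would be the usual choice.

```python
import math
from itertools import permutations


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def best_assignment(sim):
    """Exhaustively find the one-to-one pairing with maximal total similarity.

    sim[i][j] is the similarity between ground-truth RoI i and predicted RoI j.
    Returns a list of (gt_index, pred_index) pairs.
    """
    n_gt = len(sim)
    n_pred = len(sim[0]) if sim else 0
    if n_gt == 0 or n_pred == 0:
        return []
    if n_gt <= n_pred:
        candidates = (list(zip(range(n_gt), cols))
                      for cols in permutations(range(n_pred), n_gt))
    else:
        candidates = (list(zip(rows, range(n_pred)))
                      for rows in permutations(range(n_gt), n_pred))
    return max(candidates, key=lambda pairs: sum(sim[i][j] for i, j in pairs))


def roi_coverage(gt_embeddings, pred_embeddings, threshold=0.8):
    """Assumed definition: share of ground-truth RoIs with an accepted match."""
    if not gt_embeddings:
        return 0.0, []
    sim = [[cosine(g, p) for p in pred_embeddings] for g in gt_embeddings]
    pairs = [(i, j) for i, j in best_assignment(sim) if sim[i][j] >= threshold]
    return len(pairs) / len(gt_embeddings), pairs


# Toy embeddings: two ground-truth RoIs, three predicted RoIs.
gt = [[1.0, 0.0], [0.0, 1.0]]
pred = [[0.9, 0.1], [0.1, 0.9], [1.0, 1.0]]
coverage, pairs = roi_coverage(gt, pred)
print(coverage, pairs)
```

Only the matched pairs would then be passed on to a quality score such as RoIQ, so an unmatched ground-truth RoI lowers coverage without being scored for description fidelity.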

Original abstract

Automated medical report generation for 3D PET/CT imaging is fundamentally challenged by the high-dimensional nature of volumetric data and a critical scarcity of annotated datasets, particularly for low-resource languages. Current black-box methods map whole volumes to reports, ignoring the clinical workflow of analyzing localized Regions of Interest (RoIs) to derive diagnostic conclusions. In this paper, we bridge this gap by introducing VietPET-RoI, the first large-scale 3D PET/CT dataset with fine-grained RoI annotation for a low-resource language, comprising 600 PET/CT samples and 1,960 manually annotated RoIs, paired with corresponding clinical reports. Furthermore, to demonstrate the utility of this dataset, we propose HiRRA, a novel framework that mimics the professional radiologist diagnostic workflow by employing graph-based relational modules to capture dependencies between RoI attributes. This approach shifts from global pattern matching toward localized clinical findings. Additionally, we introduce new clinical evaluation metrics, namely RoI Coverage and RoI Quality Index, that measure both RoI localization accuracy and attribute description fidelity using LLM-based extraction. Extensive evaluation demonstrates that our framework achieves SOTA performance, surpassing existing models by 19.7% in BLEU and 4.7% in ROUGE-L, while achieving a remarkable 45.8% improvement in clinical metrics, indicating enhanced clinical reliability and reduced hallucination. Our code and dataset are available on GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The new RoI-annotated 3D PET/CT dataset for a low-resource language is the concrete contribution here, but the large clinical metric gains depend on unvalidated LLM extraction.


The key point is that this paper releases the first sizable 3D PET/CT dataset with region-of-interest annotations for a low-resource language, along with a graph-based model meant to link attributes across those regions. The dataset, called VietPET-RoI, has 600 scans and 1,960 manually marked RoIs with corresponding reports. That fills a real gap, since most medical report generation work stays in English and ignores the step-by-step regional analysis radiologists do. The HiRRA framework tries to copy that workflow by building graphs over RoI features and attributes. Releasing both the data and the code is the clearest positive here; it gives the community something concrete to test and extend.

The reported results show gains on BLEU and ROUGE-L, which are standard, plus a larger jump on two new clinical metrics that the authors define with an LLM. Those new metrics aim to check whether the generated report covers the right regions and describes their attributes accurately. The problem is that the paper does not appear to validate the LLM extractor against radiologists. If the LLM is off on what counts as good coverage or quality, the 45.8 percent improvement does not tell us much about actual clinical reliability. The graph modules are presented as capturing meaningful dependencies, but without an external check or an ablation that isolates their effect, it is hard to judge how much they add.

Overall the work is aimed at groups working on automated reporting for volumetric scans in settings where English data is scarce. Readers who need datasets for non-English medical vision-language tasks will get direct value from the release. The ideas are worth discussing even if the evaluation has gaps. I would recommend sending this to peer review. The dataset contribution is solid enough to justify referee time, provided the authors add validation for the LLM metrics and clearer ablations on the graph components.

Referee Report

3 major / 1 minor

Summary. The paper introduces VietPET-RoI, the first large-scale 3D PET/CT dataset with fine-grained RoI annotations (600 samples, 1,960 RoIs) paired with clinical reports in a low-resource language. It proposes the HiRRA framework, which uses graph-based relational modules to model dependencies between RoI attributes in a manner that mimics radiologist diagnostic workflow. New clinical metrics (RoI Coverage and RoI Quality Index) are defined via LLM-based extraction of attributes from generated reports. The work claims state-of-the-art results with gains of 19.7% in BLEU, 4.7% in ROUGE-L, and 45.8% in the new clinical metrics, along with reduced hallucination.

Significance. If validated, the dataset would address a clear gap in annotated 3D medical imaging data for low-resource languages, and the region-grounded graph approach could shift the field away from whole-volume black-box methods. The new clinical metrics offer a potential way to quantify localization and fidelity beyond standard NLG scores. Releasing code and data supports reproducibility, which is a strength.

major comments (3)

[Evaluation section] The 45.8% improvement in clinical metrics (RoI Coverage and RoI Quality Index) is defined via LLM-based attribute extraction and comparison to ground-truth RoIs, yet no validation of the LLM extractor (e.g., agreement with radiologists, precision/recall on attribute extraction, or error analysis) is reported. This is load-bearing for the claim of enhanced clinical reliability and reduced hallucination.
[Framework section] The graph-based relational modules are asserted to capture clinically meaningful dependencies between RoI attributes, but no independent check (such as expert review of learned relations or targeted ablation isolating their contribution to the reported gains) is provided.
[Experimental Setup] Full details on data splits, complete ablation studies, and error analysis are absent, preventing independent verification of the SOTA claims on BLEU, ROUGE-L, and clinical metrics.
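The extractor validation requested in the first major comment could take a form like the sketch below: precision and recall of LLM-extracted attributes against radiologist-annotated ones, plus Cohen's kappa for agreement on a categorical label. The attribute tuples and label values are hypothetical, and the exact-match comparison is an assumption; a real study would need a matching rule suited to free-text clinical attributes.

```python
from collections import Counter


def precision_recall(predicted, reference):
    """Exact-match precision and recall between two collections of items."""
    predicted, reference = set(predicted), set(reference)
    true_positives = len(predicted & reference)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical (anatomy, finding) attributes for one report.
llm_attrs = [("liver", "lesion"), ("lung", "nodule")]
radiologist_attrs = [("liver", "lesion")]
print(precision_recall(llm_attrs, radiologist_attrs))

# Hypothetical lesion-type labels assigned to four RoIs by each rater.
llm_labels = ["malignant", "benign", "malignant", "benign"]
radiologist_labels = ["malignant", "benign", "benign", "benign"]
print(cohen_kappa(llm_labels, radiologist_labels))
```

Reporting both views is useful because precision and recall show what the extractor misses or invents, while kappa discounts agreement that would occur by chance.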

minor comments (1)

[Abstract] The language of the reports is described only as 'low-resource' without naming it explicitly.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and constructive comments. We address each of the major comments below, indicating the changes we will make to the manuscript.


Referee: [Evaluation section] The 45.8% improvement in clinical metrics (RoI Coverage and RoI Quality Index) is defined via LLM-based attribute extraction and comparison to ground-truth RoIs, yet no validation of the LLM extractor (e.g., agreement with radiologists, precision/recall on attribute extraction, or error analysis) is reported. This is load-bearing for the claim of enhanced clinical reliability and reduced hallucination.

Authors: We agree that validating the LLM extractor is important for substantiating the clinical metrics. In the revised manuscript, we will add a validation subsection where we report inter-rater agreement between the LLM extractor and radiologist annotations on a sample of reports, along with precision and recall for attribute extraction. This will support the claims of reduced hallucination and enhanced reliability.
revision: yes

Referee: [Framework section] The graph-based relational modules are asserted to capture clinically meaningful dependencies between RoI attributes, but no independent check (such as expert review of learned relations or targeted ablation isolating their contribution to the reported gains) is provided.

Authors: We will enhance the framework section with a targeted ablation study that isolates the impact of the graph-based relational modules on the performance gains. Additionally, we will include qualitative examples or analysis of the learned relations to demonstrate their clinical relevance.
revision: yes

Referee: [Experimental Setup] Full details on data splits, complete ablation studies, and error analysis are absent, preventing independent verification of the SOTA claims on BLEU, ROUGE-L, and clinical metrics.

Authors: We acknowledge the need for more comprehensive experimental details. In the revision, we will include explicit descriptions of the data splits (including ratios and stratification criteria), present complete ablation studies for all components, and add an error analysis section discussing common failure modes and how the proposed method addresses them.
revision: yes
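As an illustration of what "ratios and stratification criteria" could look like in practice, here is a minimal sketch of a stratified train/validation/test split. It assumes stratification by cancer type (Figure 3 reports six types) and an 80/10/10 ratio; both are assumptions for the example, as are the sample IDs and type labels, and none of them is taken from the paper.

```python
import random
from collections import defaultdict


def stratified_split(sample_ids, strata, ratios=(0.8, 0.1, 0.1), seed=0):
    """Split IDs into train/val/test while preserving per-stratum proportions."""
    groups = defaultdict(list)
    for sample_id, stratum in zip(sample_ids, strata):
        groups[stratum].append(sample_id)
    rng = random.Random(seed)
    train, val, test = [], [], []
    for stratum in sorted(groups):
        ids = groups[stratum]
        rng.shuffle(ids)
        n_train = int(round(ratios[0] * len(ids)))
        n_val = int(round(ratios[1] * len(ids)))
        train += ids[:n_train]
        val += ids[n_train:n_train + n_val]
        # Whatever remains goes to test; tiny strata may leave a split empty.
        test += ids[n_train + n_val:]
    return train, val, test


# Hypothetical case IDs and cancer-type labels.
sample_ids = [f"case_{i:03d}" for i in range(20)]
strata = ["type_a"] * 10 + ["type_b"] * 10
train, val, test = stratified_split(sample_ids, strata)
print(len(train), len(val), len(test))
```

Fixing the seed and publishing the resulting ID lists alongside the ratios would let others reproduce the exact partition rather than only its proportions.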

Circularity Check

0 steps flagged

No significant circularity in derivation or evaluation chain


The paper introduces an external dataset (VietPET-RoI) and a new framework (HiRRA) whose graph modules are motivated by clinical workflow description rather than by the target metrics. Reported gains are measured with standard external metrics (BLEU, ROUGE-L) plus newly defined clinical scores; these scores are computed from LLM extraction applied after generation and are not used as training objectives or fitted parameters. No equations, self-citations, or ansatzes are shown to reduce the claimed improvements to quantities defined inside the paper itself. The evaluation therefore remains an independent empirical comparison against prior models.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no explicit free parameters, axioms or invented entities are detailed beyond standard deep-learning assumptions for medical imaging and the introduction of the new dataset and graph modules.

pith-pipeline@v0.9.0 · 5832 in / 1185 out tokens · 56087 ms · 2026-05-19T18:08:18.597459+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear


HiRRA... employs graph-based relational modules to capture dependencies between RoI attributes... GATv2... edges based on spatial proximity and morphological similarity
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear


new clinical evaluation metrics, namely RoI Coverage and RoI Quality Index, that measure both RoI localization accuracy and attribute description fidelity using LLM-based extraction

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

3 extracted references · 3 canonical work pages

[1]

Sergios Gatidis, Tobias Hepp, Marcel Früh, Christian La Fougère, Konstantin Nikolaou, Christina Pfannenberg, Bernhard Schölkopf, Thomas Küstner, Clemens Cyran, and Daniel Rubin

Roi-based compression strategy of 3d mri brain datasets for wireless communications. IRBM, 42(3):146–153. Sergios Gatidis, Tobias Hepp, Marcel Früh, Christian La Fougère, Konstantin Nikolaou, Christina Pfannenberg, Bernhard Schölkopf, Thomas Küstner, Clemens Cyran, and Daniel Rubin. 2022. A whole-body fdg-pet/ct dataset with manually annotated tumor les...

work page arXiv 2022
[2]

Gender,” “Examination Date,

Med3dvlm: An efficient vision–language model for 3d medical image analysis. IEEE Journal of Biomedical and Health Informatics. To appear. Junsan Zhang, Ming Cheng, Qiaoqiao Cheng, Xiuxuan Shen, Yao Wan, Jie Zhu, and Mengxuan Liu. 2024. Hierarchical medical image report adversarial generation with hybrid discriminator. Artificial Intelligence in Medicine, ...

work page arXiv 2024
[3]

extraction_text

- [Very intense focal FDG uptake (SUVmax 12.3) in the cecum. Highly suggestive of colon cancer...] B.2 Information Extraction Framework To evaluate generation quality, we employ LangExtract (Goel, 2025), an LLM-based extraction framework designed to parse unstructured generated reports into structured RoI objects. Specifically, we utilize Gemini-2.5-P...

work page 2025
