MRI2Rep: Autoregressive Structured Report Generation for 3D Liver MRI

Annabella Shewarega; James S. Duncan; Julius Chapiro; Junlin Yang; Lawrence H. Staib; Xinran Li; Zongwei Zhou

arxiv: 2606.25279 · v1 · pith:XF7T3HMZnew · submitted 2026-06-24 · 💻 cs.CV

Xinran Li, Junlin Yang, Annabella Shewarega, Zongwei Zhou, Julius Chapiro, James S. Duncan, Lawrence H. Staib

Pith reviewed 2026-06-25 20:53 UTC · model grok-4.3

classification 💻 cs.CV

keywords: liver MRI, structured report generation, autoregressive model, LI-RADS, medical vision-language model, report canonicalization, 3D volumetric imaging, clinical report automation

The pith

An autoregressive model generates LI-RADS structured reports directly from 3D liver MRI volumes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MRI2Rep to automate the creation of structured diagnostic reports from volumetric liver MRI scans. It first converts existing free-text reports into closed-vocabulary sequences via a canonicalization step, then trains an autoregressive model on the paired image-sequence data. On held-out data, the resulting system reaches 29.4 percent lesion-level F1, against at most 8.3 percent for adapted medical vision-language baselines, and two radiologists rated 75 and 70 percent of its reports clinically acceptable. This targets the bottleneck of manual 3D MRI reporting by delivering consistent, machine-readable reports without requiring manual lesion-level annotations.

Core claim

MRI2Rep is presented as the first end-to-end autoregressive system for generating LI-RADS-structured reports from 3D liver MRI. Using 3,929 real-world MRI-report pairs, a Report-to-Label Canonicalization module produces training targets from free-text reports, enabling the model to reach 76.0 percent case-level sensitivity, 29.4 percent lesion-level F1, and 82.4 percent liver-level accuracy on held-out data, with two radiologists rating 75 and 70 percent of generated reports clinically acceptable.
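
To make the lesion-level and case-level numbers easier to interpret, here is a minimal sketch of how such metrics could be computed from structured labels. The abstract does not define the matching rules, so everything here is an assumption: lesions are scored by exact match on a (type, position) pair, F1 is micro-averaged over cases, and a case counts as detected if the model predicts at least one lesion when the reference contains one. The function names and the example labels are hypothetical.

```python
from collections import Counter


def lesion_f1(cases):
    """Micro-averaged F1 over (lesion type, position) labels.

    `cases` is a list of (predicted, reference) pairs, each a list of
    (type, position) tuples. Exact-match scoring is an assumption;
    the paper's actual matching rule is not stated in the abstract.
    """
    tp = fp = fn = 0
    for predicted, reference in cases:
        pred, ref = Counter(predicted), Counter(reference)
        matched = sum((pred & ref).values())  # multiset intersection
        tp += matched
        fp += sum(pred.values()) - matched
        fn += sum(ref.values()) - matched
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def case_sensitivity(cases):
    """Fraction of lesion-positive reference cases with any predicted lesion.

    This definition is an assumption made for illustration.
    """
    positive_preds = [pred for pred, ref in cases if ref]
    if not positive_preds:
        raise ValueError("no lesion-positive reference cases")
    return sum(1 for pred in positive_preds if pred) / len(positive_preds)


# Hypothetical toy data: (predicted, reference) per case.
example_cases = [
    ([("LR-5", "S7"), ("cyst", "S2")], [("LR-5", "S7")]),  # one hit, one false positive
    ([], [("hemangioma", "S4")]),                          # one missed lesion
    ([], []),                                              # lesion-free case
]

print(lesion_f1(example_cases))
print(case_sensitivity(example_cases))
```

Under these assumptions a model can score well at the case level while missing or misplacing individual lesions, which is one way a 76.0 percent case-level sensitivity can coexist with a 29.4 percent lesion-level F1.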

What carries the argument

The Report-to-Label Canonicalization (RLC) module that transforms free-text reports into structured, closed-vocabulary diagnostic sequences to supervise the autoregressive vision-to-report model.
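
As an illustration of what canonicalization means in practice, the sketch below maps free-text sentences to closed-vocabulary findings with simple keyword matching. It is a toy under stated assumptions, not the paper's RLC module: the vocabulary, the keyword lists, the negation rule, and the names `TYPE_KEYWORDS`, `Finding`, and `canonicalize` are all hypothetical, and the quantity field mentioned in the Figure 2 excerpt is omitted. It does mirror two properties described on this page: each label carries the verbatim evidence sentence, and no label is emitted when evidence is absent.

```python
from __future__ import annotations

import re
from typing import NamedTuple

# Hypothetical closed vocabulary; the paper's real label set is not given here.
TYPE_KEYWORDS = {
    "LR-5": ["lr-5", "li-rads 5"],
    "cyst": ["cyst"],
    "hemangioma": ["hemangioma"],
}
POSITION = re.compile(r"segment\s+([1-8])", re.IGNORECASE)
NEGATION = re.compile(r"\b(no|without)\b", re.IGNORECASE)  # deliberately crude


class Finding(NamedTuple):
    lesion_type: str
    position: str
    evidence: str  # verbatim sentence that licenses the label


def canonicalize(report: str) -> list[Finding]:
    """Turn a free-text report into a list of closed-vocabulary findings."""
    findings = []
    # Naive sentence split on terminal punctuation followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", report.strip()):
        if NEGATION.search(sentence):
            continue  # skip negated sentences
        lowered = sentence.lower()
        for label, keywords in TYPE_KEYWORDS.items():
            if any(keyword in lowered for keyword in keywords):
                match = POSITION.search(sentence)
                position = f"S{match.group(1)}" if match else "unspecified"
                findings.append(Finding(label, position, sentence))
    return findings  # empty list means abstention: no evidence, no label


report = (
    "There is a 2 cm LR-5 observation in segment 7. "
    "No cyst is seen. The spleen is normal."
)
for finding in canonicalize(report):
    print(finding)
```

A rule set like this is brittle against phrasing it has never seen, which is why the fidelity of the real module matters so much for both training and evaluation.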

If this is right

Structured LI-RADS reports become available immediately after image acquisition without additional manual dictation.
Lesion-level detection performance improves over generic vision-language baselines when the training targets are canonicalized sequences.
An LLM-based judge can serve as a conservative proxy for human reader studies on report quality.
The same pipeline supplies a scalable route to consistent, machine-readable diagnostic sequences across large retrospective cohorts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The canonicalization step could be reused to create training data for structured reporting in other organs or modalities where only free-text reports exist.
Performance numbers may shift when the model encounters scanner vendors or patient populations absent from the original 10-year single-site cohort.
Pairing the generated reports with downstream decision-support tools could standardize care pathways that currently rely on variable free-text documentation.

Load-bearing premise

The Report-to-Label Canonicalization module reliably produces accurate structured sequences from free-text reports without lesion-level annotations, and the single-institution cohort is representative enough for the observed performance to hold more broadly.

What would settle it

Testing the trained model on a multi-institution collection of 3D liver MRI cases and measuring whether case-level sensitivity stays above 70 percent and radiologist acceptability stays above 65 percent.

Figures

Figures reproduced from arXiv: 2606.25279 by Annabella Shewarega, James S. Duncan, Julius Chapiro, Junlin Yang, Lawrence H. Staib, Xinran Li, Zongwei Zhou.

**Figure 1.** Overview of MRI2Rep: given a 3D liver MRI volume, the model autoregressively generates a structured radiology report by predicting a sequence of diagnostic findings directly from the image. An auxiliary classification head (detailed in §3.2) provides an additional lesion-level supervision signal to encourage the visual encoder to retain discriminative features.

**Figure 2.** Output vocabulary: location sectors $Y^{\text{pos}}$ (left, Couinaud-derived) and lesion types $Y^{\text{type}}$ with representative MRI examples (right). … instructions [19, 30] and, per candidate observation, emits a $(y^{\text{type}}_k, y^{\text{pos}}_k, y^{\text{qty}}_k)$ triplet with the verbatim evidence sentence $e_k \subset R$ licensing it; triplets are assembled deterministically into $y$ and abstained when evidence is absent, making every label reproducible…

**Figure 3.** Qualitative examples from the held-out test set. Each row shows the ground-truth and predicted structured labels (left), the rendered report (centre), and the reference report (right), with key clinical terms highlighted. (a) Full success: all predicted labels match the ground truth; the rendered report captures the essential diagnostic content. (b) Partial success: liver background and two lesions are co…

Original abstract

Manual reporting of 3D MRI studies is time-consuming, yet end-to-end structured report generation for 3D liver MRI remains underexplored due to volumetric complexity and scarce paired data. We propose MRI2Rep, an autoregressive framework for liver MRI report generation. From 3,929 real-world MRI-report pairs acquired over a 10-year single-institution cohort, a Report-to-Label Canonicalization (RLC) module converts free-text reports into structured, closed-vocabulary diagnostic sequences without lesion-level annotations. On a held-out test set, MRI2Rep achieves 76.0% case-level sensitivity, 29.4% lesion-level F1, compared with no more than 8.3% for adapted medical vision-language baselines, and 82.4% liver-level accuracy. In a blinded reader study, two radiologists rated 75% and 70% of AI-generated reports as clinically acceptable, compared with 95% and 100% for original reports. Our automated LLM-based judge, LLM-Eval, rated 61.8% of AI-generated reports as acceptable, applying a stricter standard and supporting its use as a conservative proxy. To our knowledge, this is the first end-to-end LI-RADS-structured reporting system for 3D liver MRI.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MRI2Rep gives a first end-to-end pipeline for LI-RADS structured reports from 3D liver MRI using real paired data, but the unvalidated Report-to-Label Canonicalization step undercuts how much the numbers can be trusted.

The paper's main contribution is showing that an autoregressive vision-language model can be trained on 3,929 single-institution MRI-report pairs to output closed-vocabulary LI-RADS sequences, beating adapted baselines on case-level sensitivity (76%) and lesion F1 (29.4%) while getting 70-75% clinical acceptability from two radiologists.

The practical move is the Report-to-Label Canonicalization module that turns free-text reports into structured targets without lesion-level annotations. That lets them avoid expensive extra labeling and still run an end-to-end system, which is a reasonable engineering choice for this data-scarce setting. The blinded reader study and the LLM-Eval comparison also give a bit more clinical context than pure metric papers usually provide.

The soft spot is exactly what the stress-test note flags: the RLC module creates both the training targets and the evaluation labels, yet the abstract (and the available description) gives no accuracy numbers, inter-rater agreement, or lesion-level validation for it. Any consistent mismatch there directly affects the reported metrics. The 29.4% lesion F1 is already low, and a single-site 10-year cohort with no external hold-out makes generalization claims tentative. No ablations or error analysis are described, so it's hard to tell how much the architecture itself drives the gains versus the data pipeline.

This is for groups working on radiology report generation or liver imaging AI who want to see how autoregressive models handle volumetric data and structured output. A reader who cares about data curation tricks or clinical reader studies could extract useful ideas, but the work needs the full methods and label validation details to be convincing.

It deserves peer review so the RLC fidelity and split details can be checked; the core idea is concrete enough to be worth referee time even if revisions are needed.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces MRI2Rep, an autoregressive framework for end-to-end generation of LI-RADS-structured reports from 3D liver MRI. From 3,929 single-institution MRI-report pairs, a Report-to-Label Canonicalization (RLC) module produces closed-vocabulary diagnostic sequences from free-text reports without lesion-level annotations. On a held-out test set the model reports 76.0% case-level sensitivity, 29.4% lesion-level F1 and 82.4% liver-level accuracy, substantially above adapted vision-language baselines; a blinded reader study finds two radiologists rating 75% and 70% of generated reports clinically acceptable (versus 95-100% for originals), with an LLM-Eval proxy at 61.8%. The work claims to be the first such system for 3D liver MRI.

Significance. If the results hold, the work would be a meaningful advance in medical vision-language modeling by showing that autoregressive generation can produce clinically usable structured reports for volumetric imaging where manual reporting is burdensome. The blinded reader study with two radiologists and the introduction of LLM-Eval as a stricter automated proxy are concrete strengths that provide direct human-grounded evidence beyond automatic metrics.

major comments (2)

[Methods (RLC module)] Methods section describing the Report-to-Label Canonicalization (RLC) module: no quantitative validation, accuracy metrics, or inter-rater study is reported for the RLC outputs despite their use as both training targets and ground truth for all reported metrics (76.0% sensitivity, 29.4% lesion F1, 82.4% liver accuracy). Systematic mismatches between RLC sequences and true LI-RADS findings would directly corrupt both training and evaluation.
[Results (Evaluation)] Results (held-out test set evaluation): performance is reported solely on a single-institution 10-year cohort with no external validation set, multi-center testing, or cross-institution RLC fidelity check. Because RLC canonicalization depends on institution-specific report phrasing, this directly limits the strength of the generalization and clinical-utility claims.

minor comments (1)

[Abstract] Abstract: the phrase 'compared with no more than 8.3% for adapted medical vision-language baselines' does not list the individual baseline scores or adaptation details; adding these would improve transparency without altering the central claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the work's potential significance. We address each major comment below and indicate the planned revisions to the manuscript.

Referee: [Methods (RLC module)] Methods section describing the Report-to-Label Canonicalization (RLC) module: no quantitative validation, accuracy metrics, or inter-rater study is reported for the RLC outputs despite their use as both training targets and ground truth for all reported metrics (76.0% sensitivity, 29.4% lesion F1, 82.4% liver accuracy). Systematic mismatches between RLC sequences and true LI-RADS findings would directly corrupt both training and evaluation.

Authors: We agree this is a valid concern, as the RLC outputs serve as both training targets and evaluation ground truth. The RLC module uses rule-based parsing combined with LI-RADS-specific keyword and phrase matching to convert free-text reports into closed-vocabulary sequences without requiring lesion-level annotations. In the revised manuscript, we will expand the Methods section to provide a detailed description of the RLC rules with illustrative examples. Additionally, we will conduct and report an inter-rater agreement study on a subset of 100 randomly selected reports, comparing RLC outputs against independent manual canonicalization by two radiologists, including accuracy metrics and disagreement analysis. revision: yes
Referee: [Results (Evaluation)] Results (held-out test set evaluation): performance is reported solely on a single-institution 10-year cohort with no external validation set, multi-center testing, or cross-institution RLC fidelity check. Because RLC canonicalization depends on institution-specific report phrasing, this directly limits the strength of the generalization and clinical-utility claims.

Authors: We concur that the single-institution nature of the 3,929-pair cohort limits the strength of generalization claims, particularly given potential institution-specific phrasing in reports that affects RLC. Our dataset spans a 10-year period with natural variations in reporting styles, but we lack access to external multi-center datasets due to privacy regulations and data-sharing restrictions. In the revision, we will add an explicit limitations paragraph in the Discussion section acknowledging this constraint, clarifying that all metrics (including 76.0% case-level sensitivity) are cohort-specific, and outlining plans for future multi-center studies. We will also note the RLC's dependence on local report phrasing as a factor requiring site-specific adaptation. revision: partial
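
The agreement study proposed in the first response could be summarized with a chance-corrected statistic. The sketch below computes Cohen's kappa between automated labels and one radiologist's manual labels; it is an illustration only, since the rebuttal does not say which agreement statistic would be used, and the function name and example labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of categorical labels."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("label sequences must be non-empty and equal in length")
    n = len(labels_a)
    # Observed agreement: fraction of items labelled identically.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each rater's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical per-report labels: automated output vs. manual canonicalization.
automated = ["LR-5", "LR-5", "cyst", "cyst"]
manual = ["LR-5", "cyst", "cyst", "cyst"]
print(cohens_kappa(automated, manual))
```

Raw percent agreement would overstate fidelity when one label dominates the cohort, which is the main reason to prefer a chance-corrected measure for a check of this kind.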

Circularity Check

0 steps flagged

No circularity in derivation or evaluation chain

The paper presents an empirical ML pipeline for autoregressive report generation. The RLC module generates training targets from free-text reports, and performance is reported on a held-out test set using those targets. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the abstract or described process. The central claims rest on standard supervised training and external reader study evaluation rather than any reduction of outputs to inputs by construction. This is the expected non-finding for a data-driven vision-language model paper without mathematical derivation steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; all modeling choices remain opaque.

pith-pipeline@v0.9.1-grok · 5788 in / 1154 out tokens · 20510 ms · 2026-06-25T20:53:08.644555+00:00 · methodology

Reference graph

Works this paper leans on

32 extracted references · 9 canonical work pages

[1] Anthropic: Claude 3.5 sonnet (2024), https://www.anthropic.com/news/claude-3-5-sonnet, announcements, Jun 21, 2024

[2] Bai, F., Du, Y., Huang, T., Meng, M.Q.H., Zhao, B.: M3d: Advancing 3d medical image analysis with multi-modal large language models. arXiv preprint arXiv:2404.00578 (2024)

[3] Bassi, P.R., Yavuz, M.C., Hamamci, I.E., Er, S., Chen, X., Li, W., Menze, B., Decherchi, S., Cavalli, A., Wang, K., et al.: Radgpt: Constructing 3d image-text tumor datasets. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23720–23730 (2025)

[4] Bhargavan, M., Kaye, A.H., Forman, H.P., Sunshine, J.H.: Workload of radiologists in united states in 2006–2007 and trends since 1991–1992. Radiology 252(2), 458–467 (2009)

[5] Blankemeier, L., Cohen, J.P., Kumar, A., Van Veen, D., Gardezi, S.J.S., Paschali, M., Chen, Z., Delbrouck, J.B., Reis, E., Truyts, C., Bluethgen, C., Jensen, M.E.K., Ostmeier, S., Varma, M., Valanarasu, J.M.J., Fang, Z., Huo, Z., Nabulsi, Z., Ardila, D., Weng, W.H., Amaro Junior, E., Ahuja, N., Fries, J., Shah, N.H., Johnston, A., Boutin, R.D., Wentland, ... https://doi.org/10.48550/arXiv.2406.06512

[6] Castro, D.C., Bustos, A., Bannur, S., et al.: Padchest-gr: A bilingual chest x-ray dataset for grounded radiology report generation. arXiv preprint arXiv:2411.05085 (2024)

[7] Chen, Z., Song, Y., Chang, T.H., Wan, X.: Generating radiology reports via memory-driven transformer. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP). pp. 1439–1449 (2020)

[8] Chetlen, A.L., Chan, T.L., Ballard, D.H., Frigini, L.A., Hildebrand, A., Kim, S., et al.: Addressing burnout in radiologists. Academic Radiology 26(4), 526–533 (2019)

[9] Cui, Y., Jia, M., Lin, T.Y., Song, Y., Belongie, S.: Class-balanced loss based on effective number of samples. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2019)

[10] Germain, T., Favelier, S., Cercueil, J.P., Denys, A., Krause, D., Guiu, B.: Liver segmentation: practical tips. Diagnostic and Interventional Imaging 95(11), 1003–1016 (2014). https://doi.org/10.1016/j.diii.2013.12.005

[11] Guo, D., Yang, D., Zhang, H., et al.: Deepseek-r1 incentivizes reasoning in llms through reinforcement learning. Nature 645, 633–638 (2025). https://doi.org/10.1038/s41586-025-09422-z

[12] Hamamci, I.E., et al.: Ct2rep: Automated radiology report generation for 3d medical imaging. In: International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI). Springer (2024)

[13] Hou, W., Xu, K., Cheng, Y., Li, W., Liu, J.: Organ: Observation-guided radiology report generation via tree reasoning. arXiv preprint arXiv:2306.06466 (2023)

[14] Irvin, J., Rajpurkar, P., Ko, M., Yu, Y., Ciurea-Ilcus, S., Chute, C., Marklund, H., Haghgoo, B., Ball, R., Shpanskaya, K., et al.: Chexpert: A large chest radiograph dataset with uncertainty labels and expert comparison. In: Proceedings of the AAAI conference on artificial intelligence. vol. 33, pp. 590–597 (2019)

[15] Iyyer, M., Manjunatha, V., Boyd-Graber, J., Daumé III, H.: Deep unordered composition rivals syntactic methods for text classification. In: Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics (ACL) (2015)

[16] Jain, S., Agrawal, A., Saporta, A., Truong, S., Duong, D.N., Bui, T., Chambon, P., Zhang, Y., Lungren, M.P., Ng, A.Y., Langlotz, C.P., Rajpurkar, P.: RadGraph: Extracting clinical entities and relations from radiology reports. In: Proceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks (NeurIPS Datasets and Benchmarks). v...

[17] Jing, B., Xie, P., Xing, E.: On the automatic generation of medical imaging reports. In: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2577–2586. Association for Computational Linguistics, Melbourne, Australia (2018). https://doi.org/10.18653/v1/P18-1240

[18] Johnson, A.E., Pollard, T.J., Berkowitz, S.J., Greenbaum, N.R., Lungren, M.P., Deng, C.y., Mark, R.G., Horng, S.: Mimic-cxr, a de-identified publicly available database of chest radiographs with free-text reports. Scientific data 6(1), 317 (2019)

[19] Kojima, T., Gu, S.S., Reid, M., Matsuo, Y., Iwasawa, Y.: Large language models are zero-shot reasoners. In: Advances in Neural Information Processing Systems. vol. 35, pp. 22199–22213 (2022)

[20] Liang, J.: Multi-task learning for radiology report generation with structured findings consistency. Computer Science Bulletin 8(1), 477–489 (2025). https://doi.org/10.71465/csb178

[21] Lou, M., Ying, H., Liu, X., Zhou, H.Y., Zhang, Y., Yu, Y.: Sdr-former: A siamese dual-resolution transformer for liver lesion classification using 3d multi-phase imaging. Neural Networks p. 107228 (2025)

[22] Ma, J., Yang, Z., Kim, S., Chen, B., Baharoon, M., Fallahpour, A., Asakereh, R., Lyu, H., Wang, B.: Medsam2: Segment anything in 3d medical images and videos. arXiv preprint arXiv:2504.03600 (2025)

[23] OpenAI: GPT-5 system card. OpenAI technical report (2025), https://openai.com/index/gpt-5-system-card/

[24] Peng, Y., Wang, X., Lu, L., Bagheri, M., Summers, R., Lu, Z.: Negbio: a high-performance tool for negation and uncertainty detection in radiology reports. AMIA Summits on Translational Science Proceedings 2018, 188 (2018)

[25] Santillan, C., Fowler, K., Kono, Y., Chernyak, V.: Li-rads major features: Ct, mri with extracellular agents, and mri with hepatobiliary agents. Abdominal Radiology 43(1), 75–81 (2018). https://doi.org/10.1007/s00261-017-1291-4

[26] Song, J., Hu, Y., Wang, H., Chen, Y.W.: Liver-vlm: Enhancing focal liver lesion classification with self-supervised vision-language pretraining. Applied Sciences 15(23), 12578 (2025). https://doi.org/10.3390/app152312578

[27] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 30 (2017)

[28] Wang, L., et al.: A generative vision-language model for holistic pathological assessment using preoperative imaging in hepatocellular carcinoma. eBioMedicine 122, 106060 (2025). https://doi.org/10.1016/j.ebiom.2025.106060

[29] Wang, S., Safari, M., Li, Q., Chang, C.W., Qiu, R.L.J., Roper, J., Yu, D.S., Yang, X.: Vision foundation model for 3d magnetic resonance imaging segmentation, classification, and registration. Medical Image Analysis 110, 103992 (2026). https://doi.org/10.1016/j.media.2026.103992

[30] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems 35, 24824–24837 (2022)

[31] Wu, C., Zhang, X., Zhang, Y., Hui, H., Wang, Y., Xie, W.: Towards generalist foundation model for radiology by leveraging web-scale 2d&3d medical data. Nature Communications 16(1), 7866 (2025). https://doi.org/10.1038/s41467-025-62385-7

[32] Yang, Z., DSouza, N., Megyeri, I., et al.: Decipher-mr: A vision-language foundation model for 3d mri representations. arXiv preprint arXiv:2509.21249 (2025)
