Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model

Amos Pomp; Antoine Manenti; Esther E. Bron; Farog Faghir; Francesco Mattace-Raso; Frank J. Wolters; Harro Seelaar; Joy Martens; Kaouther Mouheb; Meike W. Vernooij

arxiv: 2606.07721 · v1 · pith:IUU6K5Y2new · submitted 2026-06-05 · 💻 cs.AI

Kaouther Mouheb, Amos Pomp, Antoine Manenti, Romy de Haan, Farog Faghir, Joy Martens, Harro Seelaar, Francesco Mattace-Raso, Meike W. Vernooij, Frank J. Wolters, Stefan Klein, Esther E. Bron

Pith reviewed 2026-06-27 21:50 UTC · model grok-4.3

classification 💻 cs.AI

keywords large language models · information extraction · radiology reports · brain MRI · Dutch language · few-shot prompting · neuroradiology · structured data

The pith

LLaMA 3.1 extracts structured data from Dutch brain MRI reports with high accuracy on visual ratings and improved numerical counts via few-shot prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether an open-weight large language model can automatically pull thirty structured variables from free-text Dutch neuroradiology reports. It evaluates performance across zero-shot and few-shot settings as well as original Dutch versus English translation inputs. High accuracies on visual scores like atrophy and Fazekas ratings indicate the method could scale data extraction for research without manual review of every report. Few-shot prompting helps close the gap on counting microbleeds and infarcts. The evaluation uses real clinical reports from a memory clinic to demonstrate practical utility.

Core claim

LLaMA 3.1 demonstrates high zero-shot performance for visual rating scores: medial temporal atrophy at 90 percent (left) and 96 percent (right), global cortical atrophy at 87 percent, and Fazekas at 94 percent. Microbleed mentions are detected with 93 percent accuracy and infarct mentions with 82 percent. Numerical variables fare worse in the zero-shot setting, at 80 percent for microbleed counts and 66 percent for infarct counts, but few-shot prompting with structural similarity-based example selection raises these to 92 percent and 81 percent. Results remain comparable when reports are translated to English.

What carries the argument

LLaMA 3.1 large language model using zero-shot and few-shot prompting on Dutch and translated English reports, with performance measured by balanced accuracy for categorical variables, accuracy and mean absolute error for counts, and text similarity for free-text fields.
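The three numeric metrics named above can be illustrated with a minimal sketch. The function names and the toy labels are assumptions made for illustration; they are not the authors' code or data. The example is chosen so that plain accuracy and balanced accuracy differ, which is the reason balanced accuracy is preferred for imbalanced categorical variables.

```python
from collections import defaultdict

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

def accuracy(y_true, y_pred):
    """Fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between reference and predicted counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy, made-up labels (not data from the paper): 0 = absent, 1 = present.
mention_true = [0, 0, 0, 0, 0, 0, 1, 1]
mention_pred = [0, 0, 0, 0, 0, 0, 1, 0]
plain_acc = accuracy(mention_true, mention_pred)          # 7 of 8 correct
bal_acc = balanced_accuracy(mention_true, mention_pred)   # recall 1.0 and 0.5

# Toy, made-up lesion counts.
count_true = [0, 2, 5, 1]
count_pred = [0, 2, 3, 1]
count_acc = accuracy(count_true, count_pred)
count_mae = mean_absolute_error(count_true, count_pred)
```

With one of two positive cases missed, plain accuracy stays at 0.875 while balanced accuracy drops to 0.75; for the counts, a single miss by two lesions gives an accuracy of 0.75 and a mean absolute error of 0.5.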

If this is right

Automatic extraction from existing reports enables large-scale research studies without proportional increases in manual labor.
Few-shot prompting with structural similarity example selection improves accuracy on numerical variables such as microbleed and infarct counts.
Performance is largely unaffected by translating Dutch reports to English before model input.
Location-specific variables continue to present extraction challenges even after prompting adjustments.
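Similarity-based example selection for few-shot prompting can be sketched as follows. The abstract does not define "structural similarity", so `similarity` below is a placeholder built on the standard library's `difflib.SequenceMatcher`; the prompt wording, the helper names, and the English toy reports are likewise hypothetical and do not reproduce the paper's prompt or data.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    # Placeholder measure (assumption): token-sequence overlap,
    # NOT the paper's structural similarity definition.
    return SequenceMatcher(None, a.lower().split(), b.lower().split()).ratio()

def select_examples(query: str, pool: list, k: int = 2) -> list:
    """Return the k annotated reports in `pool` most similar to `query`."""
    ranked = sorted(pool, key=lambda ex: similarity(query, ex["report"]), reverse=True)
    return ranked[:k]

def build_prompt(query: str, examples: list) -> str:
    """Assemble instruction, worked examples, and the unanswered query report."""
    parts = ["Extract the number of microbleeds from the report."]
    for ex in examples:
        parts.append(f"Report: {ex['report']}\nAnswer: {ex['answer']}")
    parts.append(f"Report: {query}\nAnswer:")
    return "\n\n".join(parts)

# Toy, made-up annotated pool (hypothetical).
pool = [
    {"report": "Two microbleeds in the left frontal lobe.", "answer": 2},
    {"report": "No microbleeds. Fazekas grade 1.", "answer": 0},
    {"report": "Global cortical atrophy grade 2, no infarcts.", "answer": 0},
]
query = "Three microbleeds in the left temporal lobe."
chosen = select_examples(query, pool, k=1)
prompt = build_prompt(query, chosen)
```

Here the first pool entry is selected because it shares the phrasing of the query report. The intuition is that an example phrased like the query shows the model how that phrasing maps to a count; whether this is why the paper's selection strategy helps is not established by the abstract.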

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prompting approach could transfer to radiology reports written in other languages without requiring model retraining or fine-tuning.
Integration into hospital data pipelines might reduce reliance on manual chart review for populating research databases.
Applying the method to reports from multiple institutions would test whether performance holds when report style and terminology vary.

Load-bearing premise

Annotations produced by trained medical students provide accurate ground truth for all thirty variables, supported by the reliability check on the 100 double-annotated reports.

What would settle it

Re-annotation of a subset of the 947 reports by consultant neuroradiologists that reveals substantial disagreement with the student annotations on the numerical count variables would falsify the reported performance levels.

Original abstract

Objectives: Automatic data extraction from free-text radiology reports enables large-scale research, but few studies assessed the performance of large language models (LLMs) on Dutch neuroradiology reports.

Methods: We analyzed 947 brain MRI reports from a tertiary memory clinic (2016-2021), authored by consultant neuroradiologists. Trained medical students annotated thirty variables; 100 reports were double-annotated to assess inter-rater reliability. We evaluated the performance of the open-weight LLM LLaMA 3.1 using different languages (Dutch vs. English translation) and few-shot prompting with different example selection strategies. Performance was evaluated using balanced accuracy for categorical variables, accuracy and mean absolute error for counts, and text similarity for free-text. Metrics were computed across 10 random splits of the 947 reports.

Results: LLaMA 3.1 demonstrated high zero-shot performance for visual rating scores (mean [95%-CI]): Medial Temporal Atrophy: 90% [77-100%] on the left and 96% [94-99%] on the right, Global Cortical Atrophy: 87% [83-91%], and Fazekas: 94% [93-96%]. Microbleed mentions were detected with 93% accuracy [92-95%] and infarct mentions with 82% [80-84%]. Text similarity for lesion location reached 0.95 [0.95-0.96]. Performance was lower for numerical variables: 80% [78-82%] for the number of microbleeds and 66% [63-68%] for infarcts. English translation yielded comparable results. Few-shot prompting improved performance for numerical variables, achieving 92% [90-93%] for microbleeds and 81% [77-85%] for infarcts using structural similarity-based selection.

Conclusion: LLaMA 3.1 shows strong potential for extracting data from Dutch neuroradiology reports. Few-shot prompting enhances performance for numerical variables, whereas challenges remain for location-specific variables.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LLaMA 3.1 extracts most variables from Dutch MRI reports at usable accuracy, but the student annotations lack the per-variable agreement numbers needed to trust the count results.

LLaMA 3.1 gets decent results extracting 30 variables from Dutch brain MRI reports, with balanced accuracies around 90% for visual ratings and improvements to 92% and 81% on microbleed and infarct counts using few-shot prompting. The main limitation is that we don't see the inter-rater agreement numbers for the student annotations that serve as ground truth.

The work applies an existing model with standard techniques to a new language and domain. They use 947 reports from one clinic, have trained students annotate them, double-annotate 100 of them, and test zero-shot versus few-shot prompting with different example selection strategies. They run 10 random splits and give confidence intervals, which is better than many similar papers. English translation works about the same as Dutch input.

What stands out is the focus on numerical variables like counts, where zero-shot is weaker but few-shot with structural similarity selection helps. Text similarity for locations is high at 0.95.

The soft spot is the ground truth. Only 100 reports are double-annotated, and the abstract gives no per-variable stats like kappa or agreement percentages. If agreement on the count variables is low, say below 80%, then the model's performance claims are hard to interpret because we can't separate model error from label noise. The stress-test note flags this correctly. If the full paper has those numbers and they are high, this concern goes away.

This is useful for people doing medical information extraction in languages other than English or in neuroradiology specifically. It gives a practical baseline for what an open model can do without fine-tuning.

It deserves a serious referee because the evaluation is empirical and they tried to address prompting variations. I'd recommend sending it for review, with a request to include the full inter-rater results if they're not already there.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates the open-weight LLaMA 3.1 model for extracting 30 structured variables (visual ratings, mentions, counts, and locations) from 947 Dutch brain MRI reports. It compares zero-shot performance against few-shot prompting with different example-selection strategies and English translation, reporting balanced accuracy, accuracy/MAE, and text similarity averaged over 10 random splits with 95% CIs. The central claim is that the model shows strong potential, particularly with few-shot prompting improving numerical count extraction.
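Reporting a mean with a 95% CI over 10 splits can be sketched as below. The abstract does not state how the intervals were computed, so this assumes a plain Student-t interval over the per-split scores; that assumption ignores the overlap between random splits and may understate the true uncertainty. The per-split scores are made up for illustration.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 9 degrees of freedom (10 splits).
T_CRIT_DF9 = 2.262

def mean_ci(scores, t_crit=T_CRIT_DF9):
    """Return (mean, lower, upper) of a t-based interval over per-split scores."""
    m = statistics.mean(scores)
    half_width = t_crit * statistics.stdev(scores) / math.sqrt(len(scores))
    return m, m - half_width, m + half_width

# Toy, made-up accuracies for 10 random splits (not results from the paper).
split_scores = [0.80, 0.82, 0.78, 0.81, 0.79, 0.80, 0.83, 0.77, 0.80, 0.80]
mean_score, ci_low, ci_high = mean_ci(split_scores)
```

For these toy scores the mean is 0.80 and the interval is roughly 0.79 to 0.81. `T_CRIT_DF9` is only valid for exactly 10 scores; a different number of splits needs a different critical value.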

Significance. If the student annotations constitute reliable ground truth, the work provides useful empirical evidence that open-weight LLMs can extract structured data from non-English neuroradiology reports at scale, with concrete gains from few-shot prompting on count variables. The multi-split evaluation with confidence intervals and focus on practical prompting choices are positive features for reproducibility and applicability.

major comments (2)

[Methods] Methods (annotation protocol): Only 100 of 947 reports were double-annotated, yet the manuscript provides no per-variable inter-rater statistics (Cohen’s kappa, ICC, or raw agreement) for the 30 variables. This directly undermines interpretation of all performance numbers (e.g., 80% [78-82%] zero-shot microbleed count accuracy and the few-shot gains to 92%/81%), because it is impossible to determine whether observed model accuracies exceed, match, or fall below human annotation noise, especially for the count variables highlighted as most challenging.
[Results] Results and Abstract: Performance claims for numerical variables and the conclusion that “few-shot prompting enhances performance” rest on comparison to the student annotations, but without the missing inter-rater numbers the headline result (90-96% on visual ratings, 92%/81% few-shot on counts) cannot be distinguished from annotation variability. This is load-bearing for the “strong potential” claim.
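The per-variable agreement statistic requested in both major comments can be computed with very little code. The sketch below implements Cohen's kappa for one categorical variable from two raters' labels; the labels are made up, and the function is an illustration rather than the authors' analysis.

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    """Chance-corrected agreement between two raters on the same items."""
    n = len(rater1)
    observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    # Expected agreement if both raters labelled independently at their own base rates.
    expected = sum(c1[label] * c2[label] for label in c1.keys() | c2.keys()) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Toy, made-up double annotations for a "microbleeds mentioned" variable.
r1 = ["present", "present", "absent", "absent", "absent",
      "absent", "present", "absent", "absent", "absent"]
r2 = ["present", "absent", "absent", "absent", "absent",
      "absent", "present", "absent", "absent", "present"]
raw_agreement = sum(a == b for a, b in zip(r1, r2)) / len(r1)
kappa = cohens_kappa(r1, r2)
```

In this toy case the raters agree on 80% of reports, yet kappa is only about 0.52 because most of that agreement is expected by chance when one class dominates. This is why raw agreement alone would not settle the concern raised above.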

minor comments (2)

[Abstract] Abstract and Methods: Exact prompting templates, system prompts, and the precise definitions of the “structural similarity-based selection” and other few-shot strategies are not provided, limiting reproducibility.
[Methods] The manuscript should clarify how location-specific free-text variables were scored for text similarity and whether any post-processing was applied to model outputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for inter-rater reliability metrics. We agree these are essential to interpret model performance against annotation variability and will revise the manuscript to include them.

Referee: [Methods] Methods (annotation protocol): Only 100 of 947 reports were double-annotated, yet the manuscript provides no per-variable inter-rater statistics (Cohen’s kappa, ICC, or raw agreement) for the 30 variables. This directly undermines interpretation of all performance numbers (e.g., 80% [78-82%] zero-shot microbleed count accuracy and the few-shot gains to 92%/81%), because it is impossible to determine whether observed model accuracies exceed, match, or fall below human annotation noise, especially for the count variables highlighted as most challenging.

Authors: We agree that per-variable inter-rater statistics are required to contextualize the results. Although 100 reports were double-annotated, the manuscript omitted the computed metrics. We will add a table reporting Cohen’s kappa for categorical variables, ICC for count variables, and raw agreement for all 30 variables on the double-annotated set. This revision will be placed in the Methods section and referenced in Results.

revision: yes

Referee: [Results] Results and Abstract: Performance claims for numerical variables and the conclusion that “few-shot prompting enhances performance” rest on comparison to the student annotations, but without the missing inter-rater numbers the headline result (90-96% on visual ratings, 92%/81% few-shot on counts) cannot be distinguished from annotation variability. This is load-bearing for the “strong potential” claim.

Authors: We accept that the absence of inter-rater statistics limits interpretation of whether model performance exceeds annotation noise. Including the metrics will allow readers to make this comparison directly. We will update the Abstract, Results, and Conclusion to reference the inter-rater agreement values alongside model accuracies and few-shot improvements.

revision: yes

Circularity Check

0 steps flagged

No circularity: direct empirical evaluation against external annotations

full rationale

The paper reports an empirical evaluation of LLaMA 3.1 on 947 Dutch MRI reports, measuring balanced accuracy, accuracy, MAE, and text similarity against annotations produced by trained medical students (with 100 double-annotated for reliability assessment). No equations, fitted parameters, predictions derived from inputs, self-citations forming a load-bearing chain, or ansatzes are present. Performance metrics are computed across 10 random splits and reported with 95% CIs; conclusions follow directly from these measurements. The study is self-contained against the human-annotation benchmark and contains no derivation chain that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper contains no mathematical derivations, fitted parameters, or postulated entities; it is an empirical benchmarking study whose claims rest only on the assumption that student annotations match expert interpretation of the reports.

pith-pipeline@v0.9.1-grok · 5967 in / 1074 out tokens · 16486 ms · 2026-06-27T21:50:43.181597+00:00 · methodology

[1] [1]

Insights Imaging 9:1–7

(2018) ESR paper on structured reporting in radiology. Insights Imaging 9:1–7. https://doi.org/10.1007/s13244-017-0588-8

work page doi:10.1007/s13244-017-0588-8 2018

[2] [2]

Eur Radiol

Visser JJ (2024) The unquestionable marriage between AI and structured reporting. Eur Radiol. https://doi.org/10.1007/s00330-024-11038-2

work page doi:10.1007/s00330-024-11038-2 2024

[3] [3]

Eur Radiol 35:2589–2602

Busch F, Hoffmann L, dos Santos DP, et al (2025) Large language models for structured reporting in radiology: past, present, and future. Eur Radiol 35:2589–2602. https://doi.org/10.1007/s00330-024-11107-6

work page doi:10.1007/s00330-024-11107-6 2025

[4] [4]

Insights Imaging 14:199

dos Santos DP, Kotter E, Mildenberger P, et al (2023) ESR paper on structured reporting in radiology—update 2023. Insights Imaging 14:199. https://doi.org/10.1186/s13244-023-01560-0

work page doi:10.1186/s13244-023-01560-0 2023

[5] [5]

medRxiv preprint, https://doi.org/10.1101/2024.04.25.24306380

Ralevski A, Taiyab N, Nossal M, et al (2024) Using Large Language Models to Annotate Complex Cases of Social Determinants of Health in Longitudinal Clinical Records. medRxiv preprint, https://doi.org/10.1101/2024.04.25.24306380

work page doi:10.1101/2024.04.25.24306380 2024

[6] [6]

PLOS ONE 20:e0317084

Jansen JA, Manukyan A, Khoury NA, Akalin A (2025) Leveraging large language models for data analysis automation. PLOS ONE 20:e0317084. https://doi.org/10.1371/journal.pone.0317084

work page doi:10.1371/journal.pone.0317084 2025

[7] [7]

Radiology

Bhayana R (2024) Chatbots and Large Language Models in Radiology: A Practical Primer for Clinical and Research Applications. Radiology. https://doi.org/10.1148/radiol.232756

work page doi:10.1148/radiol.232756 2024

[8] [8]

Radiology 311:e232741

Lehnen NC, Dorn F, Wiest IC, et al (2024) Data Extraction from Free-Text Reports on Mechanical Thrombectomy in Acute Ischemic Stroke Using ChatGPT: A Retrospective Analysis. Radiology 311:e232741. https://doi.org/10.1148/radiol.232741

work page doi:10.1148/radiol.232741 2024

[9] [9]

Radiology 307:e230725

Adams LC, Truhn D, Busch F, et al (2023) Leveraging GPT-4 for Post Hoc Transformation of Free-text Radiology Reports into Structured Reporting: A Multilingual Feasibility Study. Radiology 307:e230725. https://doi.org/10.1148/radiol.230725 15 Paper submitted to European Radiology 11. Sasaki F, Tatekawa H, Mitsuyama Y, et al (2024) Bridging Language and Sty...

work page doi:10.1148/radiol.230725 2023

[10] [10]

Radiology 311:e232133

Cozzi A, Pinker K, Hidber A, et al (2024) BI-RADS Category Assignments by GPT-3.5, GPT-4, and Google Bard: A Multilanguage Study. Radiology 311:e232133. https://doi.org/10.1148/radiol.232133

work page doi:10.1148/radiol.232133 2024

[11] [11]

Radiology 314:e241073

Savage CH, Kanhere A, Parekh V, et al (2025) Open-Source Large Language Models in Radiology: A Review and Tutorial for Practical Research and Clinical Deployment. Radiology 314:e241073. https://doi.org/10.1148/radiol.241073

work page doi:10.1148/radiol.241073 2025

[12] [12]

JAMIA Open 8:ooaf109

Builtjes L, Bosma J, Prokop M, et al (2025) Leveraging open-source large language models for clinical information extraction in resource-constrained settings. JAMIA Open 8:ooaf109. https://doi.org/10.1093/jamiaopen/ooaf109

work page doi:10.1093/jamiaopen/ooaf109 2025

[13] [13]

Crit Care 29:72

Workum JD, Volkers BWS, van de Sande D, et al (2025) Comparative evaluation and performance of large language models on expert level critical care questions: a benchmark study. Crit Care 29:72. https://doi.org/10.1186/s13054-025-05302-0

work page doi:10.1186/s13054-025-05302-0 2025

[14] [14]

Nat Lang Process J 10:100124

Nazi ZA, Hossain MdR, Mamun FA (2025) Evaluation of open and closed-source LLMs for low-resource language with zero-shot, few-shot, and chain-of-thought prompting. Nat Lang Process J 10:100124. https://doi.org/10.1016/j.nlp.2024.100124

work page doi:10.1016/j.nlp.2024.100124 2025

[15] [15]

Nat Commun 15:2050

Sandmann S, Riepenhausen S, Plagwitz L, Varghese J (2024) Systematic analysis of ChatGPT, Google search and Llama 2 for clinical decision support tasks. Nat Commun 15:2050. https://doi.org/10.1038/s41467-024-46411-8

work page doi:10.1038/s41467-024-46411-8 2024

[16] [16]

Eur J Ther

Çamur E, Güne ş YC (2025) Evaluation of the Performance of ChatGPT 4.5 in LI-RADS Categorization and Management Suggestion: Zero-shot versus Few-shot Prompting Method. Eur J Ther. https://doi.org/10.58600/eurjther2699

work page doi:10.58600/eurjther2699 2025

[17] [17]

Cahyawijaya S, Lovenia H, Fung P (2024) LLMs Are Few-Shot In-Context Low-Resource Language Learners. In: Duh K, Gomez H, Bethard S (eds) Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers). Association for Computational Linguistics, Mexico Ci...

work page doi:10.18653/v1/2024.naacl-long.24 2024

[18] [18]

In-context Learning: A Fair Comparison and Evaluation

Mosbach M, Pimentel T, Ravfogel S, et al (2023) Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation. In: Rogers A, Boyd-Graber J, Okazaki N (eds) Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, Toronto, Canada, pp 12284–12314, https://doi.org/10.18653/v1/2023.findin...

work page doi:10.18653/v1/2023.findings-acl.779 2023

[19] Grattafiori A, Dubey A, Jauhri A, et al (2024) The Llama 3 Herd of Models, arXiv preprint, https://doi.org/10.48550/arXiv.2407.21783

[20] Meeus M, Rathé A, Remy F, et al (2024) ChocoLlama: Lessons Learned From Teaching Llamas Dutch, arXiv preprint, https://doi.org/10.48550/arXiv.2412.07633

Paper submitted to European Radiology

Tiedemann J, Aulamo M, Bakshandaeva D, et al (2024) Democratizing neural machine translation with OPUS-MT. Lang Resour Eval 58:713–755. https://doi.org/10.1007...

[21] Reimers N, Gurevych I (2019) Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks, arXiv preprint, https://doi.org/10.48550/arXiv.1908.10084

[22] Delobelle P, Winters T, Berendt B (2020) RobBERT: a Dutch RoBERTa-based Language Model. In: Cohn T, He Y, Liu Y (eds) Findings of the Association for Computational Linguistics: EMNLP 2020. Association for Computational Linguistics, Online, pp 3255–3265, https://doi.org/10.18653/v1/2020.findings-emnlp.292

[23] Zhang T, Kishore V, Wu F, et al (2020) BERTScore: Evaluating Text Generation with BERT, arXiv preprint, https://doi.org/10.48550/arXiv.1904.09675

[24] Lachenbruch PA (2014) McNemar Test. In: Wiley StatsRef: Statistics Reference Online. John Wiley & Sons, Ltd, https://doi.org/10.1002/9781118445112.stat04876

[25] Shabankhani B (2020) Assessing the inter-rater reliability for nominal, categorical and ordinal data in medical sciences. Arch Pharm Pract 11:144–148

[26] McHugh ML (2012) Interrater reliability: the kappa statistic. Biochem Medica 22:276–282, https://doi.org/10.11613/BM.2012.031

[27] Koo TK, Li MY (2016) A Guideline of Selecting and Reporting Intraclass Correlation Coefficients for Reliability Research. J Chiropr Med 15:155–163. https://doi.org/10.1016/j.jcm.2016.02.012

[28] Nadeau C, Bengio Y (2003) Inference for the Generalization Error. Mach Learn 52:239–281. https://doi.org/10.1023/A:1024068626366

[29] Friedman M (1937) The Use of Ranks to Avoid the Assumption of Normality Implicit in the Analysis of Variance. J Am Stat Assoc 32:675–701. https://doi.org/10.2307/2279372

[30] Demšar J (2006) Statistical Comparisons of Classifiers over Multiple Data Sets. J Mach Learn Res 7:1–30, https://dl.acm.org/doi/10.5555/1248547.1248548

[31] De Cocker LJL, Geerlings MI, Hartkamp NS, et al (2015) Cerebellar infarct patterns: The SMART-Medea study. NeuroImage Clin 8:314–321. https://doi.org/10.1016/j.nicl.2015.02.001

[32] Le T-D, Nguyen TT, Ha VN, et al (2024) The Impact of LoRA Adapters on LLMs for Clinical Text Classification Under Computational and Data Constraints. arXiv preprint, https://arxiv.org/abs/2407.19299v3. Accessed 27 Nov 2025

[33] Onitiu D, Wachter S, Mittelstadt B (2025) Walking Backward to Ensure Risk Management of Large Language Models in Medicine. J Law Med Ethics 53:454–464. https://doi.org/10.1017/jme.2025.10132

[34] Zhu M, Lin H, Jiang J, et al (2025) Large language model trained on clinical oncology data predicts cancer progression. Npj Digit Med 8:397. https://doi.org/10.1038/s41746-025-01780-2

Liu L, Lian L, Hao Y, et al (2025) Human level information extraction from clinical reports with finetuned language models. Sci ...

[35] Lewis P, Perez E, Piktus A, et al (2020) Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In: Advances in Neural Information Processing Systems. Curran Associates, Inc., pp 9459–9474, https://dl.acm.org/doi/abs/10.5555/3495724.3496517

[36] Spiess C, Vaziri M, Mandel L, Hirzel M (2025) AutoPDL: Automatic Prompt Optimization for LLM Agents, arXiv preprint, https://doi.org/10.48550/arXiv.2504.04365

[37] Keicher M, Zaripova K, Czempiel T, et al (2024) FlexR: Few-shot Classification with Language Embeddings for Structured Reporting of Chest X-rays. In: Medical Imaging with Deep Learning. PMLR, pp 1493–1508, https://proceedings.mlr.press/v227/keicher24a.html

Appendix A. Prompt

The prompt used in this work along with ...


Missing" Label in Categorical Variables For categorical variables, the

Consensus reference: the model predictions were compared to the consensus annotations of the bootstrap-sampled reports.

6. Single-rater reference: for each bootstrap-sampled report, one of the two raters (R1 or R2) was selected randomly and independently, and the model prediction was compared to that rater’s annota...


Merging missing with negative: labels coded as "missing" were recoded to the negative finding class ("absent" or "0" depending on the variable, see Table S1) in both the reference and predicted labels, and balanced accuracy was computed on the full dataset. Both schemes were applied for the structured similarity-based few-shot prompting method per fold of...