Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model
Pith reviewed 2026-06-27 21:50 UTC · model grok-4.3
The pith
LLaMA 3.1 extracts structured data from Dutch brain MRI reports with high accuracy on visual ratings and improved numerical counts via few-shot prompting.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LLaMA 3.1 demonstrates high zero-shot performance for visual rating scores such as medial temporal atrophy at 90-96 percent, global cortical atrophy at 87 percent, and Fazekas at 94 percent, along with 93 percent accuracy for microbleed mentions and 82 percent for infarct mentions. Numerical variables show lower performance at 80 percent for microbleed counts and 66 percent for infarct counts, but few-shot prompting with structural similarity-based example selection raises these to 92 percent and 81 percent. Results remain comparable when reports are translated to English.
What carries the argument
The argument rests on LLaMA 3.1 applied with zero-shot and few-shot prompting to Dutch reports and their English translations. Performance is measured by balanced accuracy for categorical variables, accuracy and mean absolute error for counts, and text similarity for free-text fields.
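To make the metrics concrete, here is a minimal sketch of balanced accuracy, accuracy, and mean absolute error in plain Python. The variable names and toy labels are hypothetical and are not taken from the paper. Balanced accuracy is computed here as the unweighted mean of per-class recall, which is the usual definition; the summary does not say which implementation the authors used.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


def accuracy(y_true, y_pred):
    """Fraction of exact matches."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between annotated and predicted counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical toy data, not from the study.
fazekas_true = [0, 0, 0, 1, 1, 2]
fazekas_pred = [0, 0, 1, 1, 1, 2]
microbleeds_true = [0, 2, 5, 1]
microbleeds_pred = [0, 2, 3, 1]

print(balanced_accuracy(fazekas_true, fazekas_pred))
print(accuracy(microbleeds_true, microbleeds_pred))
print(mean_absolute_error(microbleeds_true, microbleeds_pred))
```

Balanced accuracy weights every class equally, so a rare rating such as a high Fazekas grade counts as much as a common one. Exact-match accuracy on counts is strict: a prediction of 3 instead of 5 is simply wrong, while mean absolute error records how far off it was.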
If this is right
- Automatic extraction from existing reports enables large-scale research studies without proportional increases in manual labor.
- Few-shot prompting with structural similarity example selection improves accuracy on numerical variables such as microbleed and infarct counts.
- Performance is largely unaffected by translating Dutch reports to English before model input.
- Location-specific variables continue to present extraction challenges even after prompting adjustments.
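The few-shot idea in the list above can be sketched as follows. This summary does not define "structural similarity", so the sketch substitutes a hypothetical proxy: digits are masked and reports are compared with Python's `difflib.SequenceMatcher`. The function names, the prompt wording, and the English toy reports are all assumptions for illustration, not the authors' method or data.

```python
import re
from difflib import SequenceMatcher


def structure_key(report):
    # Hypothetical proxy for report structure: lowercase and mask digits,
    # so "4 microbleeds" and "2 microbleeds" look alike.
    return re.sub(r"\d+", "#", report.lower())


def select_examples(query, pool, k=2):
    """Return the k (report, annotation) pairs whose reports best match the query."""
    key = structure_key(query)
    ranked = sorted(
        pool,
        key=lambda pair: SequenceMatcher(None, key, structure_key(pair[0])).ratio(),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, examples):
    """Assemble a few-shot prompt; the instruction text is illustrative only."""
    parts = ["Extract the number of microbleeds from the report."]
    for report, annotation in examples:
        parts.append(f"Report: {report}\nAnswer: {annotation}")
    parts.append(f"Report: {query}\nAnswer:")
    return "\n\n".join(parts)


# Hypothetical annotated pool and query.
pool = [
    ("4 microbleeds in the left frontal lobe.", "4"),
    ("No microbleeds. One lacunar infarct in the left thalamus.", "0"),
    ("Fazekas 1. MTA 2 on both sides.", "missing"),
]
query = "2 microbleeds in the right frontal lobe."

print(build_prompt(query, select_examples(query, pool, k=2)))
```

The design point is that the examples shown to the model are chosen per report rather than fixed, so a report phrased as a count sees other reports phrased as counts. Any other similarity measure could be swapped in for the `SequenceMatcher` ratio.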
Where Pith is reading between the lines
- The same prompting approach could transfer to radiology reports written in other languages without requiring model retraining or fine-tuning.
- Integration into hospital data pipelines might reduce reliance on manual chart review for populating research databases.
- Applying the method to reports from multiple institutions would test whether performance holds when report style and terminology vary.
Load-bearing premise
Annotations produced by trained medical students provide accurate ground truth for all thirty variables, supported by the reliability check on the 100 double-annotated reports.
What would settle it
If consultant neuroradiologists re-annotated a subset of the 947 reports and disagreed substantially with the student annotations on the numerical count variables, the reported performance levels would be falsified.
Original abstract
Objectives: Automatic data extraction from free-text radiology reports enables large-scale research, but few studies assessed the performance of large language models (LLMs) on Dutch neuroradiology reports. Methods: We analyzed 947 brain MRI reports from a tertiary memory clinic (2016-2021), authored by consultant neuroradiologists. Trained medical students annotated thirty variables; 100 reports were double-annotated to assess inter-rater reliability. We evaluated the performance of the open-weight LLM LLaMA 3.1 using different languages (Dutch vs. English translation) and few-shot prompting with different example selection strategies. Performance was evaluated using balanced accuracy for categorical variables, accuracy and mean absolute error for counts, and text similarity for free-text. Metrics were computed across 10 random splits of the 947 reports. Results: LLaMA 3.1 demonstrated high zero-shot performance for visual rating scores (mean [95%-CI]): Medial Temporal Atrophy: 90% [77-100%] on the left and 96% [94-99%] on the right, Global Cortical Atrophy: 87% [83-91%], and Fazekas: 94% [93-96%]. Microbleed mentions were detected with 93% accuracy [92-95%] and infarct mentions with 82% [80-84%]. Text similarity for lesion location reached 0.95 [0.95-0.96]. Performance was lower for numerical variables: 80% [78-82%] for the number of microbleeds and 66% [63-68%] for infarcts. English translation yielded comparable results. Few-shot prompting improved performance for numerical variables, achieving 92% [90-93%] for microbleeds and 81% [77-85%] for infarcts using structural similarity-based selection. Conclusion: LLaMA 3.1 shows strong potential for extracting data from Dutch neuroradiology reports. Few-shot prompting enhances performance for numerical variables, whereas challenges remain for location-specific variables.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates the open-weight LLaMA 3.1 model for extracting 30 structured variables (visual ratings, mentions, counts, and locations) from 947 Dutch brain MRI reports. It compares zero-shot performance against few-shot prompting with different example-selection strategies and English translation, reporting balanced accuracy, accuracy/MAE, and text similarity averaged over 10 random splits with 95% CIs. The central claim is that the model shows strong potential, particularly with few-shot prompting improving numerical count extraction.
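As an illustration of how a "mean [95%-CI]" figure can be produced from 10 splits, the sketch below computes a t-based interval over per-split scores. This is an assumption: the summary does not state how the authors derived their intervals, and the scores are invented. The critical value 2.262 is the two-sided 95% t quantile for 9 degrees of freedom, so it only applies to exactly 10 splits.

```python
import math
import statistics


def mean_ci(scores, t_crit=2.262):
    """Mean and t-based 95% CI; t_crit=2.262 assumes 10 scores (df = 9)."""
    mean = statistics.mean(scores)
    sem = statistics.stdev(scores) / math.sqrt(len(scores))
    return mean, mean - t_crit * sem, mean + t_crit * sem


# Hypothetical per-split accuracies, not from the study.
split_scores = [0.80, 0.82, 0.78, 0.81, 0.79, 0.80, 0.83, 0.77, 0.80, 0.80]

mean, lower, upper = mean_ci(split_scores)
print(f"{mean:.0%} [{lower:.0%}-{upper:.0%}]")
```

Random splits of the same 947 reports overlap, so the per-split scores are not independent and a plain t interval like this one tends to be too narrow. Corrections for resampled estimates exist, but whether the paper applied one is not stated here.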
Significance. If the student annotations constitute reliable ground truth, the work provides useful empirical evidence that open-weight LLMs can extract structured data from non-English neuroradiology reports at scale, with concrete gains from few-shot prompting on count variables. The multi-split evaluation with confidence intervals and focus on practical prompting choices are positive features for reproducibility and applicability.
major comments (2)
- [Methods] Methods (annotation protocol): Only 100 of 947 reports were double-annotated, yet the manuscript provides no per-variable inter-rater statistics (Cohen’s kappa, ICC, or raw agreement) for the 30 variables. This directly undermines interpretation of all performance numbers (e.g., 80% [78-82%] zero-shot microbleed count accuracy and the few-shot gains to 92%/81%), because it is impossible to determine whether observed model accuracies exceed, match, or fall below human annotation noise, especially for the count variables highlighted as most challenging.
- [Results] Results and Abstract: Performance claims for numerical variables and the conclusion that “few-shot prompting enhances performance” rest on comparison to the student annotations, but without the missing inter-rater numbers the headline result (90-96% on visual ratings, 92%/81% few-shot on counts) cannot be distinguished from annotation variability. This is load-bearing for the “strong potential” claim.
minor comments (2)
- [Abstract] Abstract and Methods: Exact prompting templates, system prompts, and the precise definitions of the “structural similarity-based selection” and other few-shot strategies are not provided, limiting reproducibility.
- [Methods] The manuscript should clarify how location-specific free-text variables were scored for text similarity and whether any post-processing was applied to model outputs.
Simulated Author's Rebuttal
We thank the referee for highlighting the need for inter-rater reliability metrics. We agree these are essential to interpret model performance against annotation variability and will revise the manuscript to include them.
Point-by-point responses
- Referee: [Methods] Methods (annotation protocol): Only 100 of 947 reports were double-annotated, yet the manuscript provides no per-variable inter-rater statistics (Cohen’s kappa, ICC, or raw agreement) for the 30 variables. This directly undermines interpretation of all performance numbers (e.g., 80% [78-82%] zero-shot microbleed count accuracy and the few-shot gains to 92%/81%), because it is impossible to determine whether observed model accuracies exceed, match, or fall below human annotation noise, especially for the count variables highlighted as most challenging.
  Authors: We agree that per-variable inter-rater statistics are required to contextualize the results. Although 100 reports were double-annotated, the manuscript omitted the computed metrics. We will add a table reporting Cohen’s kappa for categorical variables, ICC for count variables, and raw agreement for all 30 variables on the double-annotated set. This revision will be placed in the Methods section and referenced in Results.
  Revision: yes
- Referee: [Results] Results and Abstract: Performance claims for numerical variables and the conclusion that “few-shot prompting enhances performance” rest on comparison to the student annotations, but without the missing inter-rater numbers the headline result (90-96% on visual ratings, 92%/81% few-shot on counts) cannot be distinguished from annotation variability. This is load-bearing for the “strong potential” claim.
  Authors: We accept that the absence of inter-rater statistics limits interpretation of whether model performance exceeds annotation noise. Including the metrics will allow readers to make this comparison directly. We will update the Abstract, Results, and Conclusion to reference the inter-rater agreement values alongside model accuracies and few-shot improvements.
  Revision: yes
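The inter-rater statistics at issue in this exchange are simple to compute once two raters have labelled the same reports. The sketch below shows raw agreement and Cohen's kappa for one categorical variable; the labels are hypothetical and do not come from the study's 100 double-annotated reports.

```python
from collections import Counter


def raw_agreement(rater_a, rater_b):
    """Fraction of reports on which both raters gave the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Agreement corrected for the agreement expected by chance."""
    n = len(rater_a)
    p_observed = raw_agreement(rater_a, rater_b)
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    p_expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if p_expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical annotations of "microbleeds mentioned" for 10 reports.
rater_1 = ["present"] * 4 + ["absent"] * 6
rater_2 = ["present"] * 3 + ["absent"] * 7

print(raw_agreement(rater_1, rater_2))
print(cohens_kappa(rater_1, rater_2))
```

In this toy case the raters agree on 9 of 10 reports, yet kappa is noticeably lower than 0.9 because much of that agreement is expected by chance when one label dominates. That gap is why the referee asks for kappa per variable rather than raw agreement alone.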
Circularity Check
No circularity: direct empirical evaluation against external annotations
Full rationale
The paper reports an empirical evaluation of LLaMA 3.1 on 947 Dutch MRI reports, measuring balanced accuracy, accuracy, MAE, and text similarity against annotations produced by trained medical students (with 100 double-annotated for reliability assessment). No equations, fitted parameters, predictions derived from inputs, self-citations forming a load-bearing chain, or ansatzes are present. Performance metrics are computed across 10 random splits and reported with 95% CIs; conclusions follow directly from these measurements. The study is self-contained against the human-annotation benchmark and contains no derivation chain that reduces to its own inputs by construction.