
arxiv: 2606.07721 · v1 · pith:IUU6K5Y2new · submitted 2026-06-05 · 💻 cs.AI

Automatic Extraction of Structured Information from Brain MRI Reports Using an Open-Weight Large Language Model

Pith reviewed 2026-06-27 21:50 UTC · model grok-4.3

classification 💻 cs.AI
keywords large language models · information extraction · radiology reports · brain MRI · Dutch language · few-shot prompting · neuroradiology · structured data

The pith

LLaMA 3.1 extracts structured data from Dutch brain MRI reports with high accuracy on visual ratings and improved numerical counts via few-shot prompting.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether an open-weight large language model can automatically pull thirty structured variables from free-text Dutch neuroradiology reports. It evaluates performance across zero-shot and few-shot settings as well as original Dutch versus English translation inputs. High accuracies on visual scores like atrophy and Fazekas ratings indicate the method could scale data extraction for research without manual review of every report. Few-shot prompting helps close the gap on counting microbleeds and infarcts. The evaluation uses real clinical reports from a memory clinic to demonstrate practical utility.

Core claim

LLaMA 3.1 demonstrates high zero-shot performance for visual rating scores: medial temporal atrophy at 90 percent (left) and 96 percent (right), global cortical atrophy at 87 percent, and Fazekas at 94 percent. Microbleed mentions are detected with 93 percent accuracy and infarct mentions with 82 percent. Numerical variables fare worse in the zero-shot setting, at 80 percent for microbleed counts and 66 percent for infarct counts, but few-shot prompting with structural similarity-based example selection raises these to 92 percent and 81 percent, respectively. Results remain comparable when reports are translated to English.

What carries the argument

LLaMA 3.1 large language model using zero-shot and few-shot prompting on Dutch and translated English reports, with performance measured by balanced accuracy for categorical variables, accuracy and mean absolute error for counts, and text similarity for free-text fields.
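The metrics named above can be made concrete with a small sketch. This is an illustration only, not the paper's evaluation code: the function names and the toy labels are assumptions introduced here, and balanced accuracy is taken in its usual sense of mean per-class recall.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so rare classes weigh as much as common ones."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


def count_accuracy(y_true, y_pred):
    """Fraction of reports where the extracted count matches exactly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between extracted and reference counts."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels for illustration (not data from the paper).
fazekas_true = [0, 0, 0, 1, 2, 3]
fazekas_pred = [0, 0, 1, 1, 2, 2]
microbleeds_true = [0, 2, 5, 1]
microbleeds_pred = [0, 2, 3, 1]

print(balanced_accuracy(fazekas_true, fazekas_pred))
print(count_accuracy(microbleeds_true, microbleeds_pred))
print(mean_absolute_error(microbleeds_true, microbleeds_pred))
```

Reporting mean absolute error next to exact-match accuracy is useful for counts because a prediction of 3 against a reference of 5 is wrong under both metrics, but only the error magnitude shows how far off it is.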

If this is right

  • Automatic extraction from existing reports enables large-scale research studies without proportional increases in manual labor.
  • Few-shot prompting with structural similarity example selection improves accuracy on numerical variables such as microbleed and infarct counts.
  • Performance is largely unaffected by translating Dutch reports to English before model input.
  • Location-specific variables continue to present extraction challenges even after prompting adjustments.
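Similarity-based example selection for few-shot prompting can be sketched as follows. The paper's "structural similarity" measure is not defined in the material summarized here, so this sketch substitutes a generic character-level similarity from Python's standard library (`difflib.SequenceMatcher`). The function names `select_examples` and `build_prompt`, the prompt layout, and the sample reports are all hypothetical.

```python
from difflib import SequenceMatcher


def select_examples(query_report, annotated_pool, k=2):
    """Return the k (report, annotation) pairs most similar to query_report.

    Similarity here is difflib's ratio, a stand-in for the paper's
    unspecified structural similarity measure.
    """
    def similarity(pair):
        return SequenceMatcher(None, query_report, pair[0]).ratio()

    return sorted(annotated_pool, key=similarity, reverse=True)[:k]


def build_prompt(query_report, examples):
    """Place the selected examples before the unanswered query report."""
    parts = [f"Report: {report}\nAnswer: {annotation}" for report, annotation in examples]
    parts.append(f"Report: {query_report}\nAnswer:")
    return "\n\n".join(parts)


# Made-up annotated pool: (report text, number of microbleeds).
pool = [
    ("Two microbleeds in the left frontal lobe.", "2"),
    ("No infarcts. Fazekas grade 1.", "0"),
    ("Three microbleeds in the right frontal lobe.", "3"),
]
query = "Four microbleeds in the left parietal lobe."
print(build_prompt(query, select_examples(query, pool, k=2)))
```

In a real evaluation the pool would need to exclude the report being queried and any report from the test split, otherwise the few-shot gain would partly reflect leakage.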

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same prompting approach could transfer to radiology reports written in other languages without requiring model retraining or fine-tuning.
  • Integration into hospital data pipelines might reduce reliance on manual chart review for populating research databases.
  • Applying the method to reports from multiple institutions would test whether performance holds when report style and terminology vary.

Load-bearing premise

Annotations produced by trained medical students provide accurate ground truth for all thirty variables, supported by the reliability check on the 100 double-annotated reports.

What would settle it

Re-annotation of a subset of the 947 reports by consultant neuroradiologists that reveals substantial disagreement with the student annotations on the numerical count variables would falsify the reported performance levels.

Original abstract

Objectives: Automatic data extraction from free-text radiology reports enables large-scale research, but few studies assessed the performance of large language models (LLMs) on Dutch neuroradiology reports.

Methods: We analyzed 947 brain MRI reports from a tertiary memory clinic (2016-2021), authored by consultant neuroradiologists. Trained medical students annotated thirty variables; 100 reports were double-annotated to assess inter-rater reliability. We evaluated the performance of the open-weight LLM LLaMA 3.1 using different languages (Dutch vs. English translation) and few-shot prompting with different example selection strategies. Performance was evaluated using balanced accuracy for categorical variables, accuracy and mean absolute error for counts, and text similarity for free-text. Metrics were computed across 10 random splits of the 947 reports.

Results: LLaMA 3.1 demonstrated high zero-shot performance for visual rating scores (mean [95%-CI]): Medial Temporal Atrophy: 90% [77-100%] on the left and 96% [94-99%] on the right, Global Cortical Atrophy: 87% [83-91%], and Fazekas: 94% [93-96%]. Microbleed mentions were detected with 93% accuracy [92-95%] and infarct mentions with 82% [80-84%]. Text similarity for lesion location reached 0.95 [0.95-0.96]. Performance was lower for numerical variables: 80% [78-82%] for the number of microbleeds and 66% [63-68%] for infarcts. English translation yielded comparable results. Few-shot prompting improved performance for numerical variables, achieving 92% [90-93%] for microbleeds and 81% [77-85%] for infarcts using structural similarity-based selection.

Conclusion: LLaMA 3.1 shows strong potential for extracting data from Dutch neuroradiology reports. Few-shot prompting enhances performance for numerical variables, whereas challenges remain for location-specific variables.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates the open-weight LLaMA 3.1 model for extracting 30 structured variables (visual ratings, mentions, counts, and locations) from 947 Dutch brain MRI reports. It compares zero-shot performance against few-shot prompting with different example-selection strategies and English translation, reporting balanced accuracy, accuracy/MAE, and text similarity averaged over 10 random splits with 95% CIs. The central claim is that the model shows strong potential, particularly with few-shot prompting improving numerical count extraction.

Significance. If the student annotations constitute reliable ground truth, the work provides useful empirical evidence that open-weight LLMs can extract structured data from non-English neuroradiology reports at scale, with concrete gains from few-shot prompting on count variables. The multi-split evaluation with confidence intervals and focus on practical prompting choices are positive features for reproducibility and applicability.
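A mean with a 95% confidence interval over 10 splits can be computed along the following lines. How the paper derives its intervals is not stated in this summary, so the Student's t interval below is an assumption, as are the function name `mean_with_ci` and the example scores. Random splits of the same 947 reports are not independent, so a plain t interval like this one may be narrower than it should be.

```python
import math
import statistics

# Two-sided 95% critical value of Student's t with 9 degrees of freedom
# (appropriate for 10 splits); hard-coded to avoid a SciPy dependency.
T_CRIT_95_DF9 = 2.262


def mean_with_ci(split_scores, t_crit=T_CRIT_95_DF9):
    """Return (mean, lower, upper) for a naive t-based confidence interval."""
    n = len(split_scores)
    mean = statistics.mean(split_scores)
    standard_error = statistics.stdev(split_scores) / math.sqrt(n)
    half_width = t_crit * standard_error
    return mean, mean - half_width, mean + half_width


# Made-up per-split accuracies for illustration (not results from the paper).
scores = [0.80, 0.82, 0.80, 0.82, 0.80, 0.82, 0.80, 0.82, 0.80, 0.82]
mean, lower, upper = mean_with_ci(scores)
print(f"{mean:.1%} [{lower:.1%}-{upper:.1%}]")
```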

major comments (2)
  1. [Methods] Methods (annotation protocol): Only 100 of 947 reports were double-annotated, yet the manuscript provides no per-variable inter-rater statistics (Cohen’s kappa, ICC, or raw agreement) for the 30 variables. This directly undermines interpretation of all performance numbers (e.g., 80% [78-82%] zero-shot microbleed count accuracy and the few-shot gains to 92%/81%), because it is impossible to determine whether observed model accuracies exceed, match, or fall below human annotation noise, especially for the count variables highlighted as most challenging.
  2. [Results] Results and Abstract: Performance claims for numerical variables and the conclusion that “few-shot prompting enhances performance” rest on comparison to the student annotations, but without the missing inter-rater numbers the headline result (90-96% on visual ratings, 92%/81% few-shot on counts) cannot be distinguished from annotation variability. This is load-bearing for the “strong potential” claim.
minor comments (2)
  1. [Abstract] Abstract and Methods: Exact prompting templates, system prompts, and the precise definitions of the “structural similarity-based selection” and other few-shot strategies are not provided, limiting reproducibility.
  2. [Methods] The manuscript should clarify how location-specific free-text variables were scored for text similarity and whether any post-processing was applied to model outputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for highlighting the need for inter-rater reliability metrics. We agree these are essential to interpret model performance against annotation variability and will revise the manuscript to include them.

Point-by-point responses
  1. Referee: [Methods] Methods (annotation protocol): Only 100 of 947 reports were double-annotated, yet the manuscript provides no per-variable inter-rater statistics (Cohen’s kappa, ICC, or raw agreement) for the 30 variables. This directly undermines interpretation of all performance numbers (e.g., 80% [78-82%] zero-shot microbleed count accuracy and the few-shot gains to 92%/81%), because it is impossible to determine whether observed model accuracies exceed, match, or fall below human annotation noise, especially for the count variables highlighted as most challenging.

    Authors: We agree that per-variable inter-rater statistics are required to contextualize the results. Although 100 reports were double-annotated, the manuscript omitted the computed metrics. We will add a table reporting Cohen’s kappa for categorical variables, ICC for count variables, and raw agreement for all 30 variables on the double-annotated set. This revision will be placed in the Methods section and referenced in Results.

    revision: yes

  2. Referee: [Results] Results and Abstract: Performance claims for numerical variables and the conclusion that “few-shot prompting enhances performance” rest on comparison to the student annotations, but without the missing inter-rater numbers the headline result (90-96% on visual ratings, 92%/81% few-shot on counts) cannot be distinguished from annotation variability. This is load-bearing for the “strong potential” claim.

    Authors: We accept that the absence of inter-rater statistics limits interpretation of whether model performance exceeds annotation noise. Including the metrics will allow readers to make this comparison directly. We will update the Abstract, Results, and Conclusion to reference the inter-rater agreement values alongside model accuracies and few-shot improvements.

    revision: yes
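The agreement statistics discussed in this exchange are straightforward to compute on the double-annotated subset. Below is a minimal sketch of raw agreement and unweighted Cohen's kappa for one categorical variable; the function names and the two raters' labels are made up for illustration and are not taken from the paper.

```python
from collections import Counter


def raw_agreement(rater_a, rater_b):
    """Fraction of reports on which the two raters gave the same label."""
    return sum(a == b for a, b in zip(rater_a, rater_b)) / len(rater_a)


def cohens_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: agreement corrected for chance agreement."""
    n = len(rater_a)
    observed = raw_agreement(rater_a, rater_b)
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    # Chance agreement from each rater's marginal label frequencies.
    expected = sum(freq_a[c] * freq_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label throughout
    return (observed - expected) / (1 - expected)


# Made-up annotations of "microbleeds mentioned" by two raters.
rater_1 = ["present", "present", "absent", "absent"]
rater_2 = ["present", "absent", "absent", "absent"]
print(raw_agreement(rater_1, rater_2))
print(cohens_kappa(rater_1, rater_2))
```

Comparing these per-variable values with the model's accuracy against the same annotations is what would show whether the model performs above, at, or below the level of human annotation noise.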

Circularity Check

0 steps flagged

No circularity: direct empirical evaluation against external annotations


The paper reports an empirical evaluation of LLaMA 3.1 on 947 Dutch MRI reports, measuring balanced accuracy, accuracy, MAE, and text similarity against annotations produced by trained medical students (with 100 double-annotated for reliability assessment). No equations, fitted parameters, predictions derived from inputs, self-citations forming a load-bearing chain, or ansatzes are present. Performance metrics are computed across 10 random splits and reported with 95% CIs; conclusions follow directly from these measurements. The study is self-contained against the human-annotation benchmark and contains no derivation chain that reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper contains no mathematical derivations, fitted parameters, or postulated entities; it is an empirical benchmarking study whose claims rest only on the assumption that student annotations match expert interpretation of the reports.


