arxiv: 2605.08045 · v1 · submitted 2026-05-08 · 💻 cs.CL

Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs

Yi Yu, Parker Martin, Zhenyu Bu, Yixuan Liu, Yi-Yu Zheng, Orlando Simonetti, Yuchi Han, Yuan Xue

Pith reviewed 2026-05-11 02:13 UTC · model grok-4.3

classification 💻 cs.CL

keywords: structured data extraction, CMR reports, LLM distillation, uncertainty estimation, medical natural language processing, clinical data curation, report to data conversion

The pith

A lightweight distilled LLM framework extracts structured data from cardiac magnetic resonance reports at 99.65% accuracy while estimating per-field confidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The authors develop CMR-EXTR to convert free-text reports from cardiac magnetic resonance imaging into structured, machine-readable data. This addresses a key barrier in building research cohorts and supporting clinical decisions. The approach relies on distilling knowledge from a large model into a smaller one for efficient offline operation with little labeled data. Three uncertainty measures based on data distribution, sampling consistency, and field relationships help identify which extractions likely need human verification. The system reaches 99.65 percent variable-level accuracy, showing both precision in extraction and usefulness of the confidence indicators.

Core claim

CMR-EXTR converts free-text CMR reports into structured data and assigns per-field confidence for quality control. A teacher-student distillation pipeline enables fully offline inference while limiting manual annotation. Uncertainty integrates three complementary principles—distribution plausibility, sampling stability, and cross-field consistency—to triage human review. Experiments show that CMR-EXTR achieves 99.65% variable-level accuracy, demonstrating both reliable extraction and informative confidence scores.

What carries the argument

Teacher-student distillation pipeline combined with uncertainty estimation from distribution plausibility, sampling stability, and cross-field consistency.
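
To make the three signals concrete, here is a minimal sketch of how each one could be scored for a single numeric field. Everything in it is an assumption for illustration: the function names, the plausible range of 50 to 250, the example values, and in particular the `min` aggregation, since the summary above does not state how the paper combines the signals. The only domain fact used is that stroke volume equals end-diastolic volume minus end-systolic volume.

```python
from collections import Counter
from typing import Hashable, Sequence


def sampling_stability(samples: Sequence[Hashable]) -> float:
    """Share of repeated decodes that agree with the most common value."""
    if not samples:
        return 0.0
    _, top_count = Counter(samples).most_common(1)[0]
    return top_count / len(samples)


def distribution_plausibility(value: float, low: float, high: float) -> float:
    """1.0 inside [low, high]; falls linearly to 0.0 one range-width outside."""
    if low <= value <= high:
        return 1.0
    distance = (low - value) if value < low else (value - high)
    return max(0.0, 1.0 - distance / (high - low))


def stroke_volume_consistency(edv: float, esv: float, sv: float,
                              rel_tol: float = 0.1) -> float:
    """1.0 if the reported stroke volume matches EDV - ESV within rel_tol."""
    return 1.0 if abs((edv - esv) - sv) <= rel_tol * abs(sv) else 0.0


def field_confidence(plausibility: float, stability: float,
                     consistency: float) -> float:
    """Hypothetical aggregation: the weakest signal decides the confidence."""
    return min(plausibility, stability, consistency)


# Hypothetical example: four sampled extractions of LVEDV, one of which disagrees.
lvedv_samples = [103.0, 103.0, 103.0, 130.0]
confidence = field_confidence(
    distribution_plausibility(103.0, low=50.0, high=250.0),  # range is made up
    sampling_stability(lvedv_samples),
    stroke_volume_consistency(edv=103.0, esv=45.0, sv=58.0),
)
print(confidence)
```

Taking the minimum is a conservative choice: a field is only as trustworthy as its weakest check, so a single failed signal is enough to route it to human review.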

Load-bearing premise

The three uncertainty principles can reliably detect when an extraction is incorrect enough to require human review, and the distillation process keeps accuracy high on CMR reports.

What would settle it

A dataset of CMR reports where many low-accuracy extractions receive high confidence scores from the system, or where the distilled model shows substantially lower accuracy than the teacher.

Original abstract

Converting free-text cardiac magnetic resonance (CMR) reports into auditable structured data remains a bottleneck for cohort assembly, longitudinal curation, and clinical decision support. We present CMR-EXTR, a lightweight framework that converts free-text CMR reports into structured data and assigns per-field confidence for quality control. A teacher-student distillation pipeline enables fully offline inference while limiting manual annotation. Uncertainty integrates three complementary principles -- distribution plausibility, sampling stability, and cross-field consistency -- to triage human review. Experiments show that CMR-EXTR achieves 99.65% variable-level accuracy, demonstrating both reliable extraction and informative confidence scores. To our knowledge, this is the first CMR-specific extraction system with integrated confidence estimation. The code is available at https://github.com/yuyi1005/CMR-EXTR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

CMR-EXTR gives a practical offline pipeline for structuring CMR reports with three-principle uncertainty, but the evidence that uncertainty actually predicts errors is missing.

The paper gives you a ready-to-run offline extractor for CMR reports that includes uncertainty estimates, but the evidence that those estimates actually catch errors is still thin. They distill a teacher model down to a student that runs locally, then layer on three uncertainty checks: whether the output distribution looks plausible, whether repeated samples agree, and whether fields are consistent with each other. The claim is 99.65% accuracy at the variable level, and they put the code on GitHub.

What they do well is address a practical need. CMR reports are long and variable, and turning them into tables for studies or decisions takes time. An offline system with some built-in quality flags could save effort, especially if the uncertainty part works. The distillation step is sensible for deployment.

The soft spot is the lack of direct tests for the uncertainty component. We don't see numbers showing that low-confidence fields have higher error rates, or how well the three principles predict mistakes compared to a simple baseline. Dataset details and teacher-student comparisons are also light in the summary. If the full paper has those, it strengthens the case; otherwise the high accuracy stands alone without proof that the confidence scores add value.

This is for people who need to process cardiac imaging reports or similar medical text at scale. It won't change the broader field but could be a handy tool for cardiology data teams. I'd send it for peer review. The idea is grounded and the code is there, so referees can check the missing pieces.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces CMR-EXTR, a teacher-student distillation framework that converts free-text cardiac magnetic resonance (CMR) reports into structured data while assigning per-field confidence scores. Uncertainty is computed from three principles (distribution plausibility, sampling stability, cross-field consistency) to triage human review. The central empirical claim is 99.65% variable-level accuracy on the target task, presented as the first CMR-specific extraction system with integrated confidence estimation. Code is released at the provided GitHub link.

Significance. If the headline accuracy and uncertainty triage hold under scrutiny, the work addresses a genuine clinical bottleneck in cohort curation and decision support. The fully offline inference path enabled by distillation is a practical advantage for deployment. The three-principle uncertainty design is conceptually sound, but the absence of direct quantitative validation on error stratification and distillation fidelity reduces the immediate assessed impact.

major comments (3)

[Experiments] Experiments section: the 99.65% variable-level accuracy is stated without dataset size, number of reports or variables, train/test split details, baseline comparisons, or explicit definition of how accuracy was measured (exact match, partial credit, etc.), rendering the central performance claim unverifiable from the reported evidence.
[Uncertainty Estimation] Uncertainty module: no AUROC, error-rate stratification by confidence bin, or precision-recall curves are provided to show that fields flagged by the three principles (distribution plausibility, sampling stability, cross-field consistency) exhibit materially higher error rates than high-confidence fields.
[Distillation Pipeline] Distillation pipeline: no side-by-side teacher versus student F1 or accuracy numbers on the held-out CMR test set are reported, leaving the claim that distillation preserves extraction quality without large degradation unquantified.

minor comments (2)

[Methods] Clarify the exact aggregation rule that combines the three uncertainty signals into a single per-field confidence score.
[Abstract] The abstract could explicitly state the number of CMR reports and variables used for the 99.65% figure.
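
The uncertainty validation requested in the major comments (error rate per confidence bin, plus an AUROC for error detection) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' evaluation code: the bin edges, function names, and toy data are hypothetical, and the AUROC is computed directly as the pairwise rank statistic (the probability that an erroneous field receives lower confidence than a correct one, with ties counted as one half).

```python
from typing import List, Optional, Sequence, Tuple


def error_rate_by_bin(
    confidences: Sequence[float],
    is_error: Sequence[bool],
    edges: Sequence[float] = (0.0, 0.7, 0.9, 1.0),  # hypothetical bin edges
) -> List[Tuple[float, float, int, Optional[float]]]:
    """Return (low, high, count, error_rate) per bin; the last bin includes its upper edge."""
    rows = []
    last = len(edges) - 2
    for i, (lo, hi) in enumerate(zip(edges[:-1], edges[1:])):
        in_bin = [
            e for c, e in zip(confidences, is_error)
            if lo <= c < hi or (i == last and c == hi)
        ]
        rate = sum(in_bin) / len(in_bin) if in_bin else None
        rows.append((lo, hi, len(in_bin), rate))
    return rows


def auroc_error_detection(
    confidences: Sequence[float], is_error: Sequence[bool]
) -> Optional[float]:
    """P(confidence of an erroneous field < confidence of a correct field)."""
    wrong = [c for c, e in zip(confidences, is_error) if e]
    right = [c for c, e in zip(confidences, is_error) if not e]
    if not wrong or not right:
        return None  # undefined when only one class is present
    wins = sum(
        1.0 if cw < cr else 0.5 if cw == cr else 0.0
        for cw in wrong for cr in right
    )
    return wins / (len(wrong) * len(right))


# Toy data, invented for the example.
conf = [0.95, 0.92, 0.60, 0.99, 0.65, 0.85]
err = [False, False, True, False, False, True]

for low, high, n, rate in error_rate_by_bin(conf, err):
    print(f"[{low:.1f}, {high:.1f}] n={n} error_rate={rate}")
print("AUROC:", auroc_error_detection(conf, err))
```

An AUROC near 0.5 would mean the confidence scores carry no information about errors, while a value near 1.0 would mean low-confidence fields are almost always the wrong ones.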

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed review. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional analyses.

Referee: [Experiments] Experiments section: the 99.65% variable-level accuracy is stated without dataset size, number of reports or variables, train/test split details, baseline comparisons, or explicit definition of how accuracy was measured (exact match, partial credit, etc.), rendering the central performance claim unverifiable from the reported evidence.

Authors: We agree that the current Experiments section omits critical details needed to verify the central claim. In the revised manuscript we will add the total number of reports and variables in the dataset, the train/test split sizes and ratios, relevant baseline comparisons, and an explicit definition of accuracy as exact-match on normalized variable values. These additions will render the 99.65% figure fully verifiable. revision: yes
Referee: [Uncertainty Estimation] Uncertainty module: no AUROC, error-rate stratification by confidence bin, or precision-recall curves are provided to show that fields flagged by the three principles (distribution plausibility, sampling stability, cross-field consistency) exhibit materially higher error rates than high-confidence fields.

Authors: The referee correctly identifies the lack of quantitative validation for the uncertainty module. We will add AUROC values for error prediction, error-rate tables stratified by confidence bins, and precision-recall curves that compare fields flagged by the three principles against high-confidence fields, thereby demonstrating the triage utility of the uncertainty estimates. revision: yes
Referee: [Distillation Pipeline] Distillation pipeline: no side-by-side teacher versus student F1 or accuracy numbers on the held-out CMR test set are reported, leaving the claim that distillation preserves extraction quality without large degradation unquantified.

Authors: We acknowledge that direct teacher-student performance numbers are not reported. The revised manuscript will include side-by-side F1 and accuracy metrics for the teacher and student models on the held-out test set, allowing readers to quantify any degradation introduced by distillation. revision: yes
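
The accuracy definition promised in the first response (exact match on normalized variable values) and the teacher-versus-student comparison promised in the third can both be computed with one helper. The sketch below is an assumption-laden illustration: the normalization rule (two-decimal rounding, upper-cased strings), the reading of report-level accuracy as "every variable in the report is correct", and all names and sample values are hypothetical rather than taken from the paper.

```python
from typing import Any, Dict, List, Tuple


def normalize(value: Any, ndigits: int = 2) -> Any:
    """Map equivalent surface forms ("158.480", 158.48) to one comparable value."""
    if value is None:
        return None
    if isinstance(value, str):
        text = value.strip()
        try:
            return round(float(text), ndigits)
        except ValueError:
            return text.upper()  # categorical field such as a disease label
    return round(float(value), ndigits)


def accuracy(
    preds: List[Dict[str, Any]], golds: List[Dict[str, Any]]
) -> Tuple[float, float]:
    """Return (variable-level, report-level) exact-match accuracy."""
    if not golds or len(preds) != len(golds):
        raise ValueError("preds and golds must be non-empty and aligned")
    var_hits = var_total = report_hits = 0
    for pred, gold in zip(preds, golds):
        matches = [normalize(pred.get(k)) == normalize(gold[k]) for k in gold]
        var_hits += sum(matches)
        var_total += len(matches)
        report_hits += all(matches)  # assumed: a report counts only if all fields match
    return var_hits / var_total, report_hits / len(golds)


# Invented two-report example; the second report has one wrong SBP value.
golds = [
    {"HEIGHT": 158.48, "SBP": 148.0, "RVEDV": None},
    {"HEIGHT": 170.0, "SBP": 120.0, "RVEDV": 140.0},
]
student = [
    {"HEIGHT": "158.480", "SBP": 148, "RVEDV": None},
    {"HEIGHT": 170.0, "SBP": 102.0, "RVEDV": 140.0},
]
var_acc, report_acc = accuracy(student, golds)
print(f"variable-level={var_acc:.4f} report-level={report_acc:.4f}")
```

Running the same function on the teacher's outputs for the same held-out reports would give the side-by-side numbers the referee asks for.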

Circularity Check

0 steps flagged

No significant circularity; performance metrics and uncertainty scores are computed from held-out data and model outputs

The paper reports empirical accuracy (99.65% variable-level) on CMR reports and derives per-field confidence from three post-hoc principles (distribution plausibility, sampling stability, cross-field consistency) applied to the distilled model's outputs. No step equates a fitted parameter to a claimed prediction, renames a known result as a derivation, or reduces the central claim to a self-citation chain. The teacher-student pipeline and uncertainty triage are presented as engineering choices whose effectiveness is asserted via measured performance rather than by construction. This is the normal case of a self-contained empirical system whose headline numbers are not forced by the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that LLM distillation transfers extraction capability effectively to the CMR domain and that the three uncertainty heuristics correlate with actual extraction errors.

axioms (2)

domain assumption: Teacher-student distillation preserves high extraction accuracy on the target medical report domain.
The pipeline assumes effective knowledge transfer from teacher to student model without domain-specific degradation.
domain assumption: The three uncertainty principles (distribution plausibility, sampling stability, cross-field consistency) are valid proxies for extraction reliability.
These heuristics are used to triage human review but their correlation with true errors is not independently validated in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: Uncertainty integrates three complementary principles -- distribution plausibility, sampling stability, and cross-field consistency -- to triage human review.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Paper passage: Experiments show that CMR-EXTR achieves 99.65% variable-level accuracy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 3 internal anchors

[1]

Uncertainty-Aware Structured Data Extraction from Full CMR Reports via Distilled LLMs

INTRODUCTION Cardiac magnetic resonance (CMR) is a reference standard for biventricular function, chamber size, and tissue characterization, and its free-text reports remain the authoritative record of quantitative measurements and diagnostic impressions. Turning these reports into structured data enables scalable cohort assembly, longitudinal resea...

[2]

Structured data for downstream tasks ➠ Better consistency {"HEIGHT": 158.48, "SBP": 148.0, "DBP": 88.0, "LVEDV": 103.0, "RVEDV": null, ...}

[3]

Confidence score for quality control ➠ Better quality {"HEIGHT": 0.99, "SBP": 0.92, "DBP": 0.92, "LVEDV": 0.95, "RVEDV": 0.94, ...}

[4]

Report classification for case retrieval ➠ Better accuracy {"CATEGORY": "EBSTEIN"} VITALS =============================================== HEIGHT: 68.00 in WEIGHT: 152.00 lbs BP: 133 / 78 mmHg BASELINE HR: 68 BPM .---------------------------------------------. | | | LV | Ref | RV | Ref | +-----+-------+-----+---------+-----+---------+ | EDV | ml | 150 | 10...

[5]

METHODOLOGY The overall framework of our work, illustrated in Fig. 2, follows a knowledge distillation paradigm with human-in-the-loop quality control. Different from existing work [7, 8, 9, 11] directly using manual annotation, we utilize the zero-shot ability of a powerful teacher model to provide the knowledge. During this process, 52 values are e...

[6]

EXPERIMENTS Experimental setup. The model is trained on an NVIDIA A100 GPU, using Llama-3.2-1B as the initialization for our ...

Table 1. Quantitative results for CMR report extraction. Accuracy is given at variable level and report level; the remaining columns break down error counts.

Model | Variable-Level | Report-Level | Omission | Inexact | Confusion | Invalid | Total
GPT-OSS-20B (FREE-TEXT) | 90.78% | 46.82% | 166 | 5 | 104 | 780 | 1055
GPT-OSS-20B(STRUCTUR...

[7]

2) Inexact. The extracted value differs slightly (within 10%) from the ground truth.
3) Confusion. The value is incorrectly taken from another field.
4) Invalid. The output fails to follow the required format and cannot be parsed as JSON.

Disease classification. Our model also surpasses the baselines (configured to “STRUCTURED”), achieving 97.04% accuracy ac...

[8]

Quality control. We sample 100 variables with confidence scores below 0.7 and find an error rate of 42%. An example is shown in Fig. 3. In contrast, among 100 randomly selected variables with confidence scores above 0.7, the error rate is only 1%. This clear discrepancy validates the effectiveness of the proposed quality assessment principles

[9]

CONCLUSION This paper presents CMR-EXTR, a model developed to extract structured data from CMR reports for research data curation and clinical software development. The training pipeline combines knowledge distillation from a large teacher model to a compact student model with human-in-the-loop quality control, effectively reducing annotation effort a...

[10]

COMPLIANCE WITH ETHICAL STANDARDS This retrospective study used de-identified CMR reports collected previously. The protocol was reviewed and approved by the Institutional Review Board with a waiver of informed consent, given minimal risk and no direct subject contact

[11]

ACKNOWLEDGMENTS This work was supported in part by the National Institutes of Health (R01 HL148103). The authors declare no competing interests

[12]

Yan-Ran Wang, Kai Yang, Yi Wen, Pengcheng Wang, Yuepeng Hu, et al., “Screening and diagnosis of cardiovascular disease using artificial intelligence-enabled cardiac magnetic resonance imaging,” Nature Medicine, vol. 30, no. 5, pp. 1471–1480, 2024

[14]

OpenAI, Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, et al., “gpt-oss-120b & gpt-oss-20b model card,” arXiv preprint arXiv:2508.10925, 2025

[15]

Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, et al., “The llama 3 herd of models,” arXiv preprint arXiv:2407.21783, 2024

[16]

Yvonne Wieland-Jorna, Daan van Kooten, Robert A Verheij, Yvonne de Man, Anneke L Francke, et al., “Natural language processing systems for extracting information from electronic health records about activities of daily living. a systematic review,” JAMIA Open, vol. 7, no. 2, pp. ooae044, 05 2024

[17]

Miguel Nunes, Joao Bone, Joao C Ferreira, and Luis B Elvas, “Health care language models and their fine-tuning for information extraction: Scoping review,” JMIR Medical Informatics, vol. 12, pp. e60164, 2024

[18]

Aleksa Bisercic, Mladen Nikolic, Mihaela van der Schaar, Boris Delibasic, Pietro Lio, and Andrija Petrovic, “Interpretable medical diagnostics with structured data extraction by large language models,” arXiv preprint arXiv:2306.05052, 2023

[19]

Hammaad Adam, Junjing Lin, Jianchang Lin, Hillary Keenan, Ashia Wilson, and Marzyeh Ghassemi, “Clinical information extraction with large language models: A case study on organ procurement,” AMIA Annual Symposium Proceedings, pp. 115–123, 2024

[20]

Ely Erez, Sedem Dankwa, McKenzie Tuttle, Afsheen Nasir, Prashanth Vallabhajosyula, Eric B Schneider, Roland Assi, and Chin Siang Ong, “Instruction-tuned large language models for clinical data extraction: Creating an aortic measurement database from ct radiology reports,” Journal of Healthcare Informatics Research, pp. 1–19, 2025

[21]

Kexin Huang, Jaan Altosaar, and Rajesh Ranganath, “Clinicalbert: Modeling clinical notes and predicting hospital readmission,” arXiv preprint arXiv:1904.05342, 2019

[22]

Pulkit Singh, Julian Haimovich, Christopher Reeder, Shaan Khurshid, Emily S Lau, et al., “One clinician is all you need–cardiac magnetic resonance imaging measurement extraction: Deep learning algorithm development,” JMIR Medical Informatics, vol. 10, no. 9, pp. e38178, 2022

[23]

Sina Amirrajab, Volker Vehof, Michael Bietenbeck, and Ali Yilmaz, “Comparative analysis of privacy-preserving open-source LLMs regarding extraction of diagnostic information from clinical CMR imaging reports,” arXiv preprint arXiv:2506.00060, 2025

[24]

Parker Martin, Christopher Crabtree, Matthew S. Tong, Orlando P. Simonetti, and Yuan Xue, “Efficient CMR report classification through synthetic data distillation and metadata fusion,” in IEEE 22nd International Symposium on Biomedical Imaging, 2025, pp. 1–5

[25]

Nadine Kawel-Boehm, Scott J. Hetzel, Bharath Ambale-Venkatesh, Gabriella Captur, Calvin W.L. Chin, et al., “Society for cardiovascular magnetic resonance reference values (“normal values”) in cardiovascular magnetic resonance: 2025 update,” Journal of Cardiovascular Magnetic Resonance, vol. 27, no. 1, pp. 101853, 2025

[26]

Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen, “LoRA: Low-rank adaptation of large language models,” in International Conference on Learning Representations, 2022