Towards Multidisciplinary Summarization of Hospital Stays: Efficient Sentence-Level Clinical Provenance Categorization

Amanda Karstens; Andrew D. Boyd; Angie Tipton; Barbara Di Eugenio; Baris Karacan; Brianna Clarahan; Catherine K. Craven; Danielle Hitzel; Emily Spellman; Jaewon Bae

arxiv: 2606.02487 · v1 · pith:6HKVCA5Wnew · submitted 2026-06-01 · 💻 cs.CL

Baris Karacan, Vaibhav Bhargava, Barbara Di Eugenio, Natalie Parde, Mary Khetani, Yu-Shan Tseng, Vanessa Barbosa, Julie Vignato, Lindsey Knake, Rajashree Dahal, Emily Spellman, Danielle Hitzel, Janine Petitgout, Kristi Haughey, Amanda Karstens, Brianna Clarahan, Rachel Dawson, Lauren Boyd, Mackenzie Weis, Angie Tipton, Jaewon Bae, Catherine K. Craven, Karen Dunn Lopez, Andrew D. Boyd

Pith reviewed 2026-06-28 14:21 UTC · model grok-4.3

classification 💻 cs.CL

keywords: clinical provenance categorization · LLM fine-tuning · quantization · cross-domain transfer · NICU summarization · sentence-level annotation · MedSecId corpus

The pith

Fine-tuned quantized 70B LLMs categorize clinical provenance better than full-precision versions on cross-domain NICU notes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether supervised fine-tuning of Llama-3 models on adult ICU notes labeled with clinical provenance headers transfers to sentence-level categorization in multi-disciplinary NICU summaries. It reports that the 70B model gains substantially from fine-tuning while the 8B model does not, and that the quantized fine-tuned 70B version exceeds the full-precision baseline on a 227-sentence gold-standard set. A sympathetic reader would care because accurate provenance tagging is required before heterogeneous notes from physicians, nurses, and therapists can be aggregated into coherent hospital-stay summaries. The results indicate that model capacity aids semantic flexibility during transfer and that quantization can lower compute costs without sacrificing accuracy.

Core claim

After supervised fine-tuning on the MedSecId corpus of 2,002 adult ICU notes annotated with provenance headers, both 8B and 70B Llama-3 models reach in-domain Macro F1 above 92 percent; on the separate 227-sentence NICU evaluation set the 70B model improves by 7 percent Macro F1 while the 8B model shows only marginal change, and the quantized fine-tuned 70B model then outperforms its full-precision counterpart while reducing computational requirements.
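Because every headline number here is a Macro F1 score, it helps to see how the metric behaves on a small, skewed label set. The following is a minimal sketch, not the authors' evaluation code: the label names and the toy predictions are hypothetical, and the convention of averaging over every class seen in either the gold or predicted labels is an assumption.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over classes seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[g] += 1  # gold class gains a false negative
    scores = []
    for label in sorted(set(gold) | set(pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical provenance labels, for illustration only.
gold = ["history", "history", "history", "history", "medications", "assessment"]
pred = ["history", "history", "history", "history", "medications", "history"]

accuracy = sum(g == p for g, p in zip(gold, pred)) / len(gold)
score = macro_f1(gold, pred)
print(f"accuracy={accuracy:.3f} macro_f1={score:.3f}")
```

In this toy case a single error on the only "assessment" sentence sets that class's F1 to zero, so Macro F1 falls well below accuracy. On a 227-sentence set with rare classes, a handful of sentences can move the metric in the same way.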

What carries the argument

Supervised fine-tuning of Llama-3 models on MedSecId provenance header annotations, with subsequent quantization, for cross-domain sentence categorization.

If this is right

Model scale controls whether fine-tuning produces meaningful gains when moving from adult ICU to neonatal ICU provenance labels.
Quantization can raise accuracy on this transfer task while lowering compute cost.
Provenance categorization becomes practical as a preprocessing step for multi-disciplinary summarization.
Adult ICU annotations can function as a usable source domain for neonatal sentence labeling.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same quantized adaptation pattern may apply to other clinical subdomains that lack large labeled sets provided model capacity is adequate.
Evaluating the approach on complete hospital-stay note collections rather than curated summary excerpts would test whether the 227-sentence sample captures real label shift.
Embedding the categorization step inside an end-to-end summarizer would reveal whether provenance tags measurably improve summary coherence.

Load-bearing premise

The 227-sentence NICU gold-standard set represents the full distribution of multi-disciplinary notes and the MedSecId labels transfer without major shift.
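One way to probe this premise is to compare how often each provenance label occurs in the source and target sets. The sketch below uses Jensen-Shannon divergence as the shift measure; this choice, the label names, and the counts are all illustrative assumptions, since the paper reports no label distributions.

```python
import math
from collections import Counter

def distribution(labels, classes):
    """Relative frequency of each class, in a fixed class order."""
    counts = Counter(labels)
    return [counts[c] / len(labels) for c in classes]

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0 for identical, 1 for disjoint."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical label counts, for illustration only.
source = ["history"] * 50 + ["medications"] * 30 + ["plan"] * 20
target = ["history"] * 20 + ["medications"] * 10 + ["plan"] * 70
classes = sorted(set(source) | set(target))

shift = js_divergence(distribution(source, classes), distribution(target, classes))
print(f"label shift (JS divergence, bits): {shift:.3f}")
```

A value near zero would support treating the adult ICU labels as a reasonable proxy for the NICU labels; a large value would suggest that part of any cross-domain gap comes from label frequencies rather than from language differences.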

What would settle it

A larger and more varied collection of NICU notes on which the quantized fine-tuned 70B model no longer exceeds the full-precision baseline Macro F1.

Original abstract

Effective "all-team" summarization in high-complexity settings like the Neonatal Intensive Care Unit (NICU) requires aggregating insights from diverse disciplines (physicians, nurses, therapists) spread across hundreds of clinical free-text notes. Simply pooling heterogeneous text often leads to incoherent outputs. Structured summarization therefore first requires accurate categorization of sentence-level provenance across multi-source notes. This pilot study introduces a clinical provenance categorization pipeline using supervised fine-tuning (SFT) of large language models (LLMs). We adapted two Llama-3 models (8B and 70B) to MedSecId, a corpus of 2,002 MIMIC-III (Adult ICU) notes annotated with clinical provenance headers, achieving in-domain Macro F1 scores above 92% for both models. To evaluate cross-domain generalization, we assessed model capacity (8B vs. 70B) and quantization on a gold-standard dataset of 227 sentence-level spans derived from three multi-disciplinary NICU summaries. Experimental results demonstrate a scale-dependent transfer effect: while SFT produced only marginal changes for the 8B model, it substantially improved the 70B model, increasing Macro F1 by 7%. Notably, the quantized fine-tuned 70B model outperformed its full-precision baseline while substantially reducing computational requirements. These findings suggest that sufficient model capacity is critical for preserving semantic flexibility during cross-domain clinical transfer and that efficient quantized adaptation can enable structured provenance modeling for downstream summarization.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Fine-tuning the 70B model transfers to the NICU set with a 7% Macro F1 gain and the quantized version beats full precision, but everything rests on 227 sentences from three summaries.

The main takeaway is that supervised fine-tuning on the MedSecId adult ICU provenance data produces a noticeable lift for the 70B Llama-3 on the NICU test sentences while the 8B model barely moves, and the quantized fine-tuned 70B actually scores higher than its full-precision counterpart.

The paper handles the in-domain part cleanly: both models reach above 92% Macro F1 in-domain on MedSecId, the 2,002-note corpus used for fine-tuning, which shows the task itself is tractable. The scale-dependent transfer result and the practical efficiency note on quantization are the concrete observations that stand out.

The soft spot is the evaluation size. The cross-domain claim uses 227 sentence spans drawn from only three multi-disciplinary NICU summaries. There are no details on sampling, no inter-annotator agreement, no comparison of label distributions, and no statistical tests or variance numbers. At that scale a 7% Macro F1 improvement (the abstract does not say whether this is relative or absolute) and the quantization reversal are difficult to read as robust evidence rather than possible artifacts of the particular notes chosen.

This is a pilot aimed at researchers who need sentence-level provenance as a first step toward multi-source clinical summarization. The work is straightforward and the motivation is clear, but the numbers need more backing before they can be treated as reliable guidance.

I would bring it to a reading group on clinical NLP applications. I would not cite it yet. It deserves peer review because the task is real and the setup is described plainly enough for referees to ask the right follow-up questions on test construction and generalization.

Referee Report

2 major / 1 minor

Summary. The paper introduces a supervised fine-tuning pipeline for sentence-level clinical provenance categorization using Llama-3 8B and 70B models on the MedSecId corpus (2,002 MIMIC-III notes). It reports in-domain Macro F1 >92% and evaluates cross-domain transfer on a 227-sentence gold-standard set from three NICU summaries, claiming a scale-dependent 7% Macro F1 gain for the 70B model after SFT and that the quantized fine-tuned 70B outperforms its full-precision baseline while lowering compute costs.

Significance. If the cross-domain results hold under stronger evaluation, the work would indicate that sufficient model capacity preserves semantic flexibility for clinical provenance transfer and that quantized adaptation can support efficient structured summarization in high-complexity settings such as the NICU. The in-domain performance on MedSecId provides a useful starting point for provenance-aware clinical NLP.

major comments (2)

[Cross-domain evaluation (NICU gold-standard dataset)] The headline cross-domain claim (7% Macro F1 gain and quantized FT 70B outperforming full-precision baseline) rests exclusively on a 227-sentence NICU gold-standard set drawn from only three summaries. No inter-annotator agreement, sampling frame for note types or disciplines, or label-shift analysis between MedSecId headers and NICU annotations is reported, and no variance estimates or significance tests are supplied; with this n the observed reversal could be driven by the particular three summaries rather than model capacity or quantization robustness.
[Experimental results and abstract] Abstract and results sections report Macro F1 numbers and a 7% gain but supply no baseline comparisons (e.g., zero-shot or non-SFT variants), statistical tests, or error analysis. This omission is load-bearing for the scale-dependent transfer and quantization claims.
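The variance estimates requested in these comments could be produced with a percentile bootstrap over the 227 evaluation sentences. The sketch below is illustrative only: the data are synthetic, the label names and the roughly 80% agreement rate are made up, and `bootstrap_ci` is a hypothetical helper rather than anything from the paper.

```python
import random
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over classes seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    labels = sorted(set(gold) | set(pred))
    return sum(2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in labels) / len(labels)

def bootstrap_ci(gold, pred, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for Macro F1, resampling sentences."""
    rng = random.Random(seed)
    n = len(gold)
    stats = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]  # sample with replacement
        stats.append(macro_f1([gold[i] for i in idx], [pred[i] for i in idx]))
    stats.sort()
    low = stats[int(alpha / 2 * n_resamples)]
    high = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high

# Synthetic stand-in for a 227-sentence gold set (not the paper's data).
rng = random.Random(42)
labels = ["history", "assessment", "medications", "plan"]
gold = rng.choices(labels, weights=[50, 30, 15, 5], k=227)
pred = [g if rng.random() < 0.8 else rng.choice(labels) for g in gold]

low, high = bootstrap_ci(gold, pred)
print(f"macro F1 = {macro_f1(gold, pred):.3f}, 95% CI = [{low:.3f}, {high:.3f}]")
```

Resampling individual sentences treats them as independent. Since the real sentences come from only three summaries, an interval computed this way would likely understate the true uncertainty, and with three clusters a summary-level bootstrap is not feasible.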

minor comments (1)

[Abstract] The abstract states that the quantized model 'substantially reduc[es] computational requirements' without quantitative details on memory footprint, inference latency, or hardware; adding these numbers would strengthen the efficiency claim.
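For a sense of scale, the weight-storage part of the efficiency claim can be estimated with simple arithmetic. This is a back-of-the-envelope sketch under stated assumptions: a nominal 70 billion parameters, a 16-bit full-precision baseline, and 4-bit quantized weights. The abstract does not state the bit widths used, and the estimate ignores activations, the KV cache, and quantization metadata.

```python
def weight_memory_gb(n_params, bits_per_param):
    """Approximate storage for model weights alone, in decimal gigabytes."""
    return n_params * bits_per_param / 8 / 1e9

N_PARAMS = 70e9  # nominal parameter count (assumption)

full_precision_gb = weight_memory_gb(N_PARAMS, 16)  # assumed 16-bit baseline
quantized_gb = weight_memory_gb(N_PARAMS, 4)        # assumed 4-bit weights

print(f"16-bit weights: {full_precision_gb:.0f} GB")
print(f" 4-bit weights: {quantized_gb:.0f} GB")
print(f"reduction: {full_precision_gb / quantized_gb:.0f}x")
```

Under these assumptions the weights shrink from about 140 GB to about 35 GB. Measured memory footprint and latency on the authors' hardware would still be needed to substantiate the claim.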

Simulated Author's Rebuttal

2 responses · 2 unresolved

We thank the referee for the constructive feedback on our pilot study. We address the major comments point-by-point below, providing the strongest honest defense of the work while acknowledging its limitations as a preliminary investigation into cross-domain clinical provenance categorization.

Referee: [Cross-domain evaluation (NICU gold-standard dataset)] The headline cross-domain claim (7% Macro F1 gain and quantized FT 70B outperforming full-precision baseline) rests exclusively on a 227-sentence NICU gold-standard set drawn from only three summaries. No inter-annotator agreement, sampling frame for note types or disciplines, or label-shift analysis between MedSecId headers and NICU annotations is reported, and no variance estimates or significance tests are supplied; with this n the observed reversal could be driven by the particular three summaries rather than model capacity or quantization robustness.

Authors: We acknowledge the small size of the NICU gold-standard set (227 sentences from three multi-disciplinary summaries) as a core limitation of this pilot study, and agree that the results should be interpreted as preliminary evidence of scale-dependent transfer rather than conclusive proof. The set was constructed by clinical experts using the identical provenance schema as MedSecId to enable direct comparison. In revision we will add: (1) label distribution comparison to quantify shift, (2) bootstrap-derived confidence intervals and variance estimates for all reported F1 scores, and (3) a non-parametric significance test for the observed 7% gain. We will also expand the methods section with the annotation protocol and note selection rationale. However, inter-annotator agreement was not collected because each summary was annotated by a single expert; this cannot be retroactively computed without new annotations. revision: partial
Referee: [Experimental results and abstract] Abstract and results sections report Macro F1 numbers and a 7% gain but supply no baseline comparisons (e.g., zero-shot or non-SFT variants), statistical tests, or error analysis. This omission is load-bearing for the scale-dependent transfer and quantization claims.

Authors: The 7% gain already reflects the difference between the base (non-SFT) 70B model and its fine-tuned counterpart on the NICU set; the manuscript therefore does contain a non-SFT baseline. To strengthen the presentation we will add explicit zero-shot results for both 8B and 70B models, include statistical tests (paired bootstrap), and append a concise error analysis of the most frequent misclassifications. These elements will be incorporated into the results section and referenced in the abstract. The quantized vs. full-precision comparison is already present and will be supported by the new statistical framing. revision: yes
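The paired bootstrap mentioned in this response could look roughly like the following. It is a sketch of one common variant, not the authors' planned analysis: both systems are scored on the same resampled sentences, and the function reports the fraction of resamples in which system B fails to beat system A. All data are synthetic and the function names are hypothetical.

```python
import random
from collections import Counter

def macro_f1(gold, pred, classes):
    """Unweighted mean of per-class F1 over the given classes."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    return sum(2 * tp[c] / (2 * tp[c] + fp[c] + fn[c]) for c in classes) / len(classes)

def paired_bootstrap(gold, pred_a, pred_b, n_resamples=1000, seed=0):
    """Fraction of resamples where B does not score higher than A."""
    rng = random.Random(seed)
    n = len(gold)
    not_better = 0
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        g = [gold[i] for i in idx]
        classes = sorted(set(g))  # score both systems on the same classes
        score_a = macro_f1(g, [pred_a[i] for i in idx], classes)
        score_b = macro_f1(g, [pred_b[i] for i in idx], classes)
        if score_b - score_a <= 0:
            not_better += 1
    return not_better / n_resamples

# Synthetic stand-ins for a base (A) and fine-tuned (B) system.
rng = random.Random(7)
labels = ["history", "assessment", "medications", "plan"]
gold = rng.choices(labels, weights=[50, 30, 15, 5], k=227)
pred_a = [g if rng.random() < 0.75 else rng.choice(labels) for g in gold]
pred_b = [g if rng.random() < 0.82 else rng.choice(labels) for g in gold]

rate = paired_bootstrap(gold, pred_a, pred_b)
print(f"share of resamples where B does not beat A: {rate:.3f}")
```

A small share would suggest the observed gain is unlikely to be a resampling artifact of these particular sentences. It would not address whether the three source summaries are representative.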

standing simulated objections not resolved

Inter-annotator agreement scores for the NICU gold-standard set (single-expert annotation precludes computation without new data collection)
Comprehensive sampling frame or discipline-level stratification details for the three NICU summaries beyond what is already described

Circularity Check

0 steps flagged

No circularity: empirical held-out evaluation on separate gold set

The paper reports standard supervised fine-tuning of Llama-3 models on the MedSecId corpus followed by evaluation on an independent 227-sentence NICU gold-standard set. No equations, fitted parameters, or self-definitional steps are present. Reported Macro F1 scores are conventional held-out metrics with no indication that any quantity is defined in terms of itself or reduced by construction to training inputs. Self-citations, if any, are not load-bearing for the central claim.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work relies on standard supervised fine-tuning of existing LLMs and the assumption that sentence-level provenance headers are a stable and useful signal; no new mathematical objects or free parameters beyond ordinary training hyperparameters are introduced.

Reference graph

Works this paper leans on

8 extracted references · 1 linked inside Pith

[1] Denny, J.C., Miller, R.A., Johnson, K.B., Spickard III, A.: Development and evaluation of a clinical note section header terminology. In: AMIA Annual Symposium Proceedings 2008, pp. 156. (2008)

[2] Zhang, F., Laish, I., Benjamini, A., Feder, A.: Section classification in clinical notes with multi-task transformers. In: Proceedings of the 13th International Workshop on Health Text Mining and Information Analysis (LOUHI), pp. 54–59 (2022)

[3] Saleh, M., Baghdadi, S., Paquelet, S.: TocBERT: Medical document structure extraction using bidirectional transformers. arXiv preprint arXiv:2406.19526 (2024)

[4] Karacan, B., Di Eugenio, B., Thornton, P.: Bridging the domain divide: Supervised vs. zero-shot clinical section segmentation from MIMIC-III to obstetrics. In: Proceedings of the 2026 International Conference on Language Resources and Evaluation (LREC) (2026) (Accepted for publication)

[5] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al.: The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

[6] Landes, P., Patel, K., Huang, S.S., Webb, A., Di Eugenio, B., Caragea, C.: A new public corpus for clinical section identification: MedSecId. In: Proceedings of the 29th International Conference on Computational Linguistics (COLING), pp. 3709–3721 (2022)

[7] Dettmers, T., Pagnoni, A., Holtzman, A., Zettlemoyer, L.: QLoRA: Efficient fine-tuning of quantized LLMs. Advances in Neural Information Processing Systems 36, 10088–10115 (2023)

[8] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., et al.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (ICLR) (2022)
