Clinically Grounded Privacy Evaluation of Medical LMs

Ayin Vala; Emily Alsentzer; Jordan Li Cahoon; Lena Stempfle; Marzyeh Ghassemi; Nathaniel Hendrix; Sana Tonekaboni; Sasha Ronaghi; Vivian Utti

arxiv: 2606.09590 · v1 · pith:FQWXF265new · submitted 2026-06-08 · 💻 cs.CL · cs.CR


Pith reviewed 2026-06-27 16:52 UTC · model grok-4.3

keywords: privacy evaluation · medical language models · memorization · sensitive diagnoses · clinical notes · adversarial access · patient timeline

The pith

Routine encounter metadata triggers high rates of verbatim memorization and sensitive diagnosis recovery in medical language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an evaluation method that tests privacy leakage in medical LMs across increasing levels of attacker knowledge, from public demographics up to partial note text. It demonstrates that routine details such as patient name, date of birth, provider, and visit date are enough to elicit verbatim text from across a patient's clinical timeline and to recover specific sensitive diagnoses with high discriminative performance (AUROC 0.91 for abortion, 0.81 for HIV). The work also shows that standard memorization checks can count templated text as leakage, which inflates the apparent risk. A sympathetic reader would care because medical LMs are increasingly trained on real patient notes, and this method makes the disclosure risks concrete rather than abstract.

Core claim

Applying the graded framework to an LM pretrained on 378k clinical notes shows that routine encounter metadata elicits high rates of verbatim memorization across a patient's timeline, together with semantic recovery of sensitive diagnoses at AUROC 0.91 for abortion and 0.81 for HIV. The paper also reports that 36 percent of memorized tokens are templated documentation rather than unique patient content.

What carries the argument

A graded axis of adversarial access, ranging from publicly inferable demographics to leaked note fragments, that measures both verbatim memorization of patient-specific text and semantic leakage of diagnoses at each tier.
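The graded axis can be pictured as a table that maps each access tier to the fields an adversary is assumed to know. The sketch below is illustrative only: the tier names come from the Figure 2 caption, and the encounter fields follow the metadata listed there, but the `age` and `sex` fields standing in for "publicly inferable demographics", the cumulative composition of tiers, the field keys, and the `build_prior` helper are all assumptions rather than the paper's specification.

```python
# Tier names follow the Figure 2 caption. Field lists are illustrative
# assumptions, not the paper's exact prior definitions.
ENCOUNTER_FIELDS = [
    "patient_name",
    "date_of_birth",
    "visit_date",
    "provider_name",
    "practice_location",
]

TIERS = {
    # Assumption: age and sex stand in for "publicly inferable demographics".
    "PUBLIC": ["age", "sex"],
    "PUBLIC + NAME + MEDS": ["age", "sex", "patient_name", "medications"],
    "ENCOUNTER INFO": ENCOUNTER_FIELDS,
    "ENCOUNTER INFO+CHIEF COMPLAINT+HPI": ENCOUNTER_FIELDS + ["chief_complaint", "hpi"],
}


def build_prior(record, tier):
    """Render only the fields the adversary is assumed to know at this tier."""
    known = [f"{field}: {record[field]}" for field in TIERS[tier] if field in record]
    # The trailing cue asks the model to continue with a note (hypothetical format).
    return "\n".join(known) + "\nNote:\n"
```

Calling `build_prior(record, tier)` for each key in `TIERS` yields one prompt per access level, so leakage can be compared tier by tier on the same patient.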

If this is right

Training on longitudinal clinical notes creates extractable patient timelines even from metadata alone.
Exact-match memorization counts overstate disclosure, since 36 percent of memorized tokens come from templated documentation.
Privacy evaluations must test multiple realistic access levels rather than only full training-text recovery.
Models that memorize across a patient's full timeline increase the chance of linking separate visits to one individual.
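To make the exact-match point concrete, here is a minimal sketch of how a verbatim memorization rate and a cross-patient duplication count could be computed. It assumes notes are already tokenized into lists of strings and that a span counts as memorized when a window of `tau` consecutive tokens also appears in the patient's training notes; the Figure 8 caption mentions τ=30, while the function names and the data layout are hypothetical. The count `K` mirrors the patient-specificity axis in Figure 4 (K=1 unique to the target patient, K>1 duplicated across patients). Deciding whether a span is templated or clinically revealing is a separate annotation step that this sketch does not attempt.

```python
def ngrams(tokens, n):
    """Return the set of all length-n token windows in a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def memorized_mask(generated, training_notes, tau=30):
    """Mark each generated token covered by a tau-gram found in the training notes."""
    train_grams = set()
    for note in training_notes:
        train_grams |= ngrams(note, tau)
    mask = [False] * len(generated)
    for i in range(len(generated) - tau + 1):
        if tuple(generated[i:i + tau]) in train_grams:
            for j in range(i, i + tau):
                mask[j] = True
    return mask


def memorized_fraction(generated, training_notes, tau=30):
    """Fraction of generated tokens that fall inside a matched tau-gram."""
    mask = memorized_mask(generated, training_notes, tau)
    return sum(mask) / len(mask) if mask else 0.0


def patient_count(span, notes_by_patient):
    """K: number of patients with at least one note containing the span verbatim."""
    span = tuple(span)
    n = len(span)
    return sum(
        any(span in ngrams(note, n) for note in notes)
        for notes in notes_by_patient.values()
    )
```

A span with `patient_count(...) > 1` is shared across patients, which is one signal that an exact match may reflect common documentation rather than a disclosure specific to the target patient.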

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same graded testing approach could be used to check privacy leakage in LMs trained on other longitudinal records such as financial or educational histories.
Hospitals considering fine-tuning LMs on their own notes would need to measure leakage at the metadata tier before deployment.
If the framework is adopted, model cards for medical LMs could include tiered leakage scores rather than a single memorization rate.

Load-bearing premise

The chosen levels of adversarial access, from public demographics to note fragments, accurately reflect realistic ways an attacker could query a deployed medical language model.

What would settle it

A replication on a different medical LM or dataset that finds AUROC below 0.6 for the same sensitive diagnoses when only routine encounter metadata is supplied would undercut the generality of the reported leakage.
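The threshold check described above can be sketched as follows. AUROC is computed here through its standard pairwise interpretation: the probability that a randomly chosen positive patient receives a higher score than a randomly chosen negative one, with ties counted as one half. How a per-patient leakage score is derived from the generated notes is not specified in this summary, so the `scores` input and the `fails_replication` helper are assumptions for illustration.

```python
def auroc(labels, scores):
    """AUROC as the fraction of positive/negative pairs ranked correctly (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fails_replication(labels, scores, threshold=0.6):
    """True if diagnosis recovery falls below the 0.6 threshold named in the text."""
    return auroc(labels, scores) < threshold
```

The pairwise form is quadratic in cohort size, which is acceptable for a sketch; a rank-based implementation would be preferable for large cohorts.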

Figures

Figures reproduced from arXiv: 2606.09590 by Ayin Vala, Emily Alsentzer, Jordan Li Cahoon, Lena Stempfle, Marzyeh Ghassemi, Nathaniel Hendrix, Sana Tonekaboni, Sasha Ronaghi, Vivian Utti.

**Figure 1.** Overview of clinically grounded privacy evaluation framework. We construct adversarial priors ranging from publicly inferable demographics to privileged note fragments, then prompt the LM to generate a note at each access level. We assess privacy leakage along two dimensions: patient-specific memorization of clinical text, distinguishing clinically revealing spans from templated or cross-patient documentat…

**Figure 2.** Shows how verbatim memorization scales with adversarial access, rising from 4.8% of the generated note memorized under the PUBLIC prior to 85.6% with ENCOUNTER INFO+CHIEF COMPLAINT+HPI. The largest increase in memorization occurs between the PUBLIC + NAME + MEDS and ENCOUNTER INFO tiers, indicating that routine visit metadata (patient name, date of birth, visit date, provider name, and practice location) …

**Figure 3.** Training-attributable sensitive diagnosis leakage, measured as the AUROC difference between the matched training and non-training arms of the evaluation cohort. Positive values indicate higher diagnosis-recovery performance for patients included in training. The black line shows the mean delta across the six sensitive diagnoses for each prior; colored points show per-diagnosis deltas. See …

**Figure 4.** Composition of memorized tokens under the ENCOUNTER INFO prior across ten note sections covering 80.1% of memorized content. Bars are decomposed by content type (clinically revealing in orange, templated in blue) and patient specificity (K=1 solid, unique to target patient; K>1 hatched, duplicated across patients). Appendix I contains example spans of each content category and Appendix J contains the defin…

**Figure 5.** Prior Extraction Pipeline. We use regular expression from the patient’s notes to extract information to construct the prior. Encounter information includes the patient’s name. As shown in …

**Figure 6.** Prompt used to annotate each generation for sensitive diagnosis leakage.

**Figure 7.** Prompt used to annotate each patient for sensitive diagnoses before inclusion in the training cohort.

**Figure 8.** Reports the fraction of generations containing at least one τ=30-gram span matching the patient’s training notes, complementing the mean-volume view in …

**Figure 9.** Per-diagnosis ROC curves for train cohort. Each panel plots the true-positive rate against the false-positive …

**Figure 10.** Per-diagnosis ROC curves for non-train cohort (patients whose notes do …

**Figure 11.** Training-attributable diagnosis leakage measured by positive predictive value (PPV), shown as the PPV difference between the training arm and the matched non-training arm of the evaluation cohort. The bolded black line shows the mean delta across the six sensitive diagnoses for each prior tier; colored dots show per-diagnosis deltas.

**Figure 12.** Copy-pasting of memorized clinically-revealing spans. For each memorized clinically-revealing region recovered under the K=1 setting, the distribution of how many of the patient’s own training notes contain the matching verbatim text.
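Figures 3 and 11 both report a training-attributable delta: a metric computed on the training arm minus the same metric on the matched non-training arm, then averaged across diagnoses. A minimal sketch of that arithmetic, using PPV as the example metric, might look like the following. It assumes binary labels and binary predictions per patient; the function names and the `arms_by_diagnosis` layout are hypothetical, and the propensity-score matching that defines the two arms is not shown.

```python
from statistics import mean


def ppv(labels, predictions):
    """Positive predictive value: true positives over all positive predictions."""
    tp = sum(1 for y, p in zip(labels, predictions) if p == 1 and y == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if p == 1 and y == 0)
    if tp + fp == 0:
        raise ValueError("PPV is undefined when nothing is predicted positive")
    return tp / (tp + fp)


def arm_delta(metric, train_arm, non_train_arm):
    """Metric on the training arm minus the metric on the matched non-training arm."""
    return metric(*train_arm) - metric(*non_train_arm)


def mean_delta(metric, arms_by_diagnosis):
    """Average the per-diagnosis deltas, as the black line in the figures does."""
    return mean(
        arm_delta(metric, train_arm, non_train_arm)
        for train_arm, non_train_arm in arms_by_diagnosis.values()
    )
```

Because `arm_delta` takes the metric as an argument, the same helper works for any function with the signature `metric(labels, outputs)`, including an AUROC function that takes continuous scores.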

Original abstract

Medical language models (LMs) can memorize and reproduce protected health information, but privacy evaluations often focus on recovery of training text rather than disclosure under realistic threat models. We introduce a clinically grounded framework that evaluates leakage along a graded axis of adversarial access, ranging from publicly inferable demographics to leaked note fragments. At each tier, we measure verbatim memorization of patient-specific text and semantic leakage of sensitive diagnoses. Applying the framework to an LM pretrained on 378k clinical notes, we find that routine encounter metadata (i.e. name, date of birth, provider, practice, visit date) elicits high rates of verbatim memorization across a patient's timeline and sensitive-diagnosis recovery (AUROC 0.91 for abortion, 0.81 for HIV). At the same time, exact-match memorization can overstate disclosure: 36% of memorized tokens reflect templated documentation. Our work highlights the risks of training on longitudinal clinical data, providing a practical framework for contextual privacy evaluation of medical LMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper presents a graded privacy evaluation framework for medical LMs, shows that routine metadata can drive high leakage, and adds a useful caveat about templated tokens.

The main point is a new graded framework that tests privacy leakage in medical LMs along tiers of adversarial access, from basic demographics to note fragments, while tracking both verbatim memorization and semantic recovery of sensitive diagnoses.

They apply it to a model trained on 378k notes and report that routine metadata like name, DOB, and visit date pulls out patient timelines plus strong recovery of conditions (AUROC 0.91 for abortion, 0.81 for HIV). The 36% templated-token finding is a clear plus because it shows standard memorization checks can overstate real disclosure.

The framework is the actual new piece, moving past simple text recovery to tiered, clinically relevant measures.

The soft spot is the threat-model tiers themselves. The abstract does not show how they were validated against real-world attacker capabilities, so it is not yet clear how well they capture practical risks. Results also come from a single model, which limits how far the numbers travel.

This is for researchers working on medical AI safety and privacy audits. Anyone building or reviewing these systems would get concrete value from the evaluation approach and the templated-token observation.

It deserves peer review so the methods and tier construction can be checked in detail.

Referee Report

0 major / 2 minor

Summary. The paper introduces a clinically grounded framework for privacy evaluation of medical LMs that measures both verbatim memorization and semantic leakage of sensitive diagnoses along a graded axis of adversarial access, ranging from publicly inferable demographics to leaked note fragments. Applied to an LM pretrained on 378k clinical notes, the work reports that routine encounter metadata (name, DOB, provider, practice, visit date) triggers high rates of verbatim memorization across patient timelines, that diagnosis recovery reaches AUROC 0.91 for abortion and 0.81 for HIV, and that 36% of memorized tokens reflect templated documentation rather than unique patient information.

Significance. If the empirical results hold under the reported evaluation setup, the framework supplies a practical, context-aware alternative to standard membership-inference or exact-match tests for medical LMs. The explicit separation of templated versus non-templated memorization and the use of clinically relevant sensitive-diagnosis recovery metrics constitute concrete strengths that could inform both model auditing and data-handling policies for longitudinal clinical corpora.

minor comments (2)

[Abstract] The 36% templated-token figure is presented without a definition of 'templated documentation' or an example token sequence; a short parenthetical or footnote would improve immediate clarity.
[Framework description] The description of the graded adversarial-access tiers would benefit from an explicit table or enumerated list that maps each tier to the exact input features supplied to the model (e.g., which metadata fields are included at the 'public demographics' level).

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary and recommendation of minor revision. No specific major comments were raised in the report.

Circularity Check

0 steps flagged

No significant circularity

The paper introduces an empirical evaluation framework for privacy leakage in medical LMs and reports direct measurements (AUROC values, verbatim memorization rates, 36% templated tokens) on a pretrained model, comparing matched training and non-training arms of an evaluation cohort. No equations, parameter fits, or derivations are described that reduce claims to inputs by construction. The framework is applied without self-citation chains or ansatzes that the results depend on. This is a standard empirical measurement study whose central findings do not collapse to self-definition or fitted-input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the representativeness of the single evaluated model pretrained on 378k notes and the assumption that the chosen threat-model tiers reflect realistic clinical privacy threats.

axioms (1)

domain assumption: The language model was pretrained on 378k clinical notes.
Stated directly in the abstract as the basis for applying the framework.

pith-pipeline@v0.9.1-grok · 5732 in / 1120 out tokens · 25535 ms · 2026-06-27T16:52:21.573163+00:00 · methodology
