pith. machine review for the scientific record.

arxiv: 2605.03301 · v1 · submitted 2026-05-05 · 💻 cs.CL · cs.AI

Recognition: unknown

SHIELD: A Diverse Clinical Note Dataset and Distilled Small Language Models for Enterprise-Scale De-identification

David Love, Jose D. Posada, Priya Desai, Somalee Datta

Pith reviewed 2026-05-07 16:48 UTC · model grok-4.3

keywords clinical de-identification · protected health information · small language models · model distillation · electronic health records · PHI extraction · clinical notes · natural language processing

The pith

Small language models distilled from large ones match their teacher on structured PHI categories in clinical notes and reach micro-averaged span-level precision of 0.88 and recall of 0.86 on standard workstation hardware.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SHIELD, a dataset of 1,394 clinical notes containing 10,505 gold-standard protected health information spans across nine categories, constructed via set-cover diversity sampling and human adjudication to better reflect modern electronic health records. It evaluates large language models as a performance ceiling and then distills their extraction capabilities into smaller locally runnable models. The strongest distilled model matches the teacher on structured categories including DATE, DOCTOR, ID, PATIENT, and PHONE while delivering micro-averaged span-level precision of 0.88 and recall of 0.86. Distributional comparisons show SHIELD occupies a different region of embedding space than older benchmarks such as i2b2. Cross-dataset tests indicate that diversity-trained models transfer well on universal structured entities but struggle with institution-specific ones, pointing toward hybrid deployment strategies.

Core claim

SHIELD supplies 1,394 notes and 10,505 annotated PHI spans; after large language models establish an upper bound, distilled small language models achieve equivalent results to the teacher on structured PHI categories and reach micro-averaged span-level precision of 0.88 with recall of 0.86 on ordinary workstation hardware.
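To make the headline metric concrete, here is a minimal sketch of micro-averaged span-level precision and recall. It assumes exact matching on note, character offsets, and category; the matching criterion the paper actually uses is not stated on this page, and the span tuple layout and example spans are invented for illustration.

```python
from typing import Iterable, Set, Tuple

# A span is (note_id, start_char, end_char, category); this layout is an assumption.
Span = Tuple[str, int, int, str]


def micro_precision_recall(gold: Iterable[Span], predicted: Iterable[Span]) -> Tuple[float, float]:
    """Pool spans across all notes and categories, then score exact matches."""
    gold_set: Set[Span] = set(gold)
    pred_set: Set[Span] = set(predicted)
    true_positives = len(gold_set & pred_set)
    precision = true_positives / len(pred_set) if pred_set else 0.0
    recall = true_positives / len(gold_set) if gold_set else 0.0
    return precision, recall


# Invented example spans, not taken from SHIELD.
gold = [
    ("note1", 0, 8, "PATIENT"),
    ("note1", 20, 30, "DATE"),
    ("note2", 5, 17, "PHONE"),
    ("note2", 40, 48, "ID"),
]
predicted = [
    ("note1", 0, 8, "PATIENT"),   # exact match
    ("note1", 20, 30, "DATE"),    # exact match
    ("note2", 5, 16, "PHONE"),    # boundary off by one: counted as an error
    ("note2", 40, 48, "ID"),      # exact match
    ("note2", 60, 66, "DOCTOR"),  # spurious span
]

precision, recall = micro_precision_recall(gold, predicted)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=0.60 recall=0.75
```

Micro-averaging pools every span before dividing, so frequent categories dominate the score; a per-category breakdown is needed to see where a student model falls short of its teacher.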

What carries the argument

Knowledge distillation from large language models into small language models trained on the SHIELD dataset, which itself is assembled by set-cover diversity sampling plus human-in-the-loop adjudication.
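One step in this kind of teacher-student pipeline is turning the teacher's extracted spans into token-level labels that a small encoder can be trained on. The sketch below shows that conversion under loud assumptions: whitespace tokenisation stands in for the student's real subword tokenizer, the `locate` helper is hypothetical, and the teacher output is invented. It is not the authors' pipeline.

```python
import re
from typing import List, Tuple

CharSpan = Tuple[int, int, str]  # (start_char, end_char, category); layout is an assumption


def spans_to_bio(text: str, spans: List[CharSpan]) -> List[Tuple[str, str]]:
    """Turn character-level spans into (token, BIO tag) pairs for a token classifier."""
    tagged = []
    previous_span = None  # index of the span the previous token belonged to
    for match in re.finditer(r"\S+", text):  # whitespace tokens, for illustration only
        tok_start, tok_end = match.span()
        tag, current_span = "O", None
        for index, (start, end, category) in enumerate(spans):
            if tok_start < end and tok_end > start:  # token overlaps this span
                prefix = "I" if previous_span == index else "B"
                tag, current_span = f"{prefix}-{category}", index
                break
        tagged.append((match.group(), tag))
        previous_span = current_span
    return tagged


def locate(text: str, surface: str, category: str) -> CharSpan:
    """Hypothetical helper: find the first occurrence of a teacher-extracted string."""
    start = text.index(surface)
    return (start, start + len(surface), category)


text = "Seen by Dr. Jane Roe on 03/14/2024."
# Invented teacher output; a real teacher would produce these from a prompt.
teacher_spans = [locate(text, "Jane Roe", "DOCTOR"), locate(text, "03/14/2024", "DATE")]

tagged = spans_to_bio(text, teacher_spans)
for token, tag in tagged:
    print(f"{token}\t{tag}")
```

Note that the final token keeps its trailing period because of the whitespace split; a subword tokenizer with offset mappings would label the date and the period separately.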

If this is right

  • De-identification can run entirely on local hardware without transmitting protected health information to external APIs.
  • Models trained on diverse data generalize reliably across institutions for universal structured PHI categories.
  • Institution-specific PHI entities require additional specialized models or adaptation for high-volume notes.
  • Public release of the dataset and DeBERTa v3 model enables further fine-tuning and benchmarking by other groups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same distillation pipeline could be reused for privacy-sensitive text in legal or financial domains.
  • If the diversity claim holds, SHIELD could serve as a replacement benchmark for clinical natural language processing tasks.
  • Persistent gaps on institution-specific entities suggest that periodic retraining or federated updates will be needed in real deployments.

Load-bearing premise

The set-cover diversity sampling combined with human-in-the-loop adjudication yields a dataset representative of modern clinical narratives that supports generalization beyond the sampled notes and institutions.
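For readers unfamiliar with set-cover sampling, the standard greedy approximation can be sketched as follows: repeatedly pick the note that covers the most still-uncovered features. The feature names, the note identifiers, and the `budget` parameter are assumptions for illustration; this page does not say which features or which set-cover variant the authors used.

```python
from typing import Dict, List, Set


def greedy_set_cover(note_features: Dict[str, Set[str]], budget: int) -> List[str]:
    """Greedily select up to `budget` notes that together cover as many features as possible."""
    uncovered: Set[str] = set().union(*note_features.values())
    selected: List[str] = []
    while uncovered and len(selected) < budget:
        # Sort candidates so ties are broken deterministically by note id.
        candidates = sorted(n for n in note_features if n not in selected)
        if not candidates:
            break
        best = max(candidates, key=lambda n: len(note_features[n] & uncovered))
        if not note_features[best] & uncovered:
            break  # no remaining note adds new coverage
        selected.append(best)
        uncovered -= note_features[best]
    return selected


# Invented feature sets; real features might be note type, department, or vocabulary buckets.
notes = {
    "note_a": {"type:discharge", "dept:cardiology", "len:long"},
    "note_b": {"type:discharge", "dept:cardiology"},
    "note_c": {"type:radiology", "dept:imaging", "len:short"},
    "note_d": {"type:progress", "len:short"},
}

selected = greedy_set_cover(notes, budget=3)
print(selected)  # ['note_a', 'note_c', 'note_d']
```

The redundant `note_b` is never chosen because everything it offers is already covered by `note_a`. The same mechanism shows why the premise matters: the procedure can only cover features that occur in the source pool, so it cannot add variation the pool lacks.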

What would settle it

Testing the distilled model on a large collection of notes drawn from institutions excluded from the original sampling and observing a drop in micro-averaged precision or recall below 0.80 would falsify the generalization claim.
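A sketch of how such a test might be scored, assuming per-note counts are available for the held-out institution. Resampling notes with a percentile bootstrap and comparing the upper bound of the interval against 0.80 is a design choice made here for illustration, not a procedure described in the paper; the counts below are invented.

```python
import random
from typing import List, Tuple

# Per-note counts: (true positives, predicted spans, gold spans). Layout is an assumption.
NoteCounts = Tuple[int, int, int]


def micro_pr(counts: List[NoteCounts]) -> Tuple[float, float]:
    """Micro-averaged precision and recall from pooled per-note counts."""
    tp = sum(c[0] for c in counts)
    pred = sum(c[1] for c in counts)
    gold = sum(c[2] for c in counts)
    return (tp / pred if pred else 0.0, tp / gold if gold else 0.0)


def bootstrap_ci(counts: List[NoteCounts], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap intervals for micro precision and recall, resampling notes."""
    rng = random.Random(seed)
    precisions, recalls = [], []
    for _ in range(n_resamples):
        sample = [rng.choice(counts) for _ in counts]
        p, r = micro_pr(sample)
        precisions.append(p)
        recalls.append(r)

    def interval(values: List[float]) -> Tuple[float, float]:
        ordered = sorted(values)
        low = ordered[int((alpha / 2) * len(ordered))]
        high = ordered[int((1 - alpha / 2) * len(ordered)) - 1]
        return low, high

    return interval(precisions), interval(recalls)


# Invented held-out results, for illustration only.
held_out = [(8, 10, 10), (7, 10, 9), (9, 10, 10), (6, 10, 10), (8, 9, 10), (7, 9, 9)]

point_precision, point_recall = micro_pr(held_out)
precision_ci, recall_ci = bootstrap_ci(held_out)

THRESHOLD = 0.80  # the cut-off named in the text above
clearly_below = precision_ci[1] < THRESHOLD or recall_ci[1] < THRESHOLD
print(f"precision={point_precision:.3f} CI={precision_ci}")
print(f"recall={point_recall:.3f} CI={recall_ci}")
print("clearly below threshold:", clearly_below)
```

Requiring the whole interval to sit below the threshold guards against declaring failure on a small, noisy sample; with only a handful of notes the interval will usually be too wide to settle anything.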

Figures

Figures reproduced from arXiv: 2605.03301 by David Love, Jose D. Posada, Priya Desai, Somalee Datta.

Figure 1: SHIELD annotation pipeline. Clinical notes are sampled from STARR-OMOP via diversity ...
Figure 2: Teacher-student distillation process. A small labeled sample is used to create prompts ...
Figure 3: Dataset overview. (a) Statistics for the three evaluation corpora. (b) PHI category distribu...
Figure 4: Corpus divergence analysis. (a) Fréchet Text Distance decomposed into mean shift and ...
Figure 5: LLM benchmark on SHIELD. Span-level radar profiles showing precision (left) and recall ...
Figure 6: Span-level radar comparison of two distilled student models (DeBERTa v3, BioModern) ...
Figure 7: Span-level distillation comparison on SHIELD: Gemini 2.5 Flash (Teacher) vs. DeBERTa ...
Figure 8: Span-level precision (left) and recall (right) with bootstrap 95% CIs for four transformer ...
Figure 8: ... AIMI v2 precision/recall of DOCTOR 1.00/1.00, ID 1.00/1.00, HOSPITAL 0.94/0.99)
Figure 9: Span-level radar comparison of four transformer models on i2b2 2014 (cross-dataset).
Figure 10: Span-level radar comparison of four transformer models on AIMI (cross-dataset). AIMI ...
Original abstract

De-identification of clinical text remains essential for secondary use of electronic health records (EHRs), yet public benchmarks such as i2b2 2006/2014 are over a decade old and lack the semantic and demographic diversity of modern narratives. While Large Language Models (LLMs) achieve state-of-the-art zero-shot extraction, enterprise deployment is hindered by compute costs and governance restricting Protected Health Information (PHI) from cloud APIs. We introduce SHIELD (Synthetic Human-annotated Identifier-replaced Entries for Learning and De-identification), a diverse dataset of 1,394 notes with 10,505 gold-standard PHI spans across 9 categories, built via set-cover diversity sampling with human-in-the-loop adjudication. We evaluate four LLMs (two proprietary, two open-weight) to establish a performance ceiling, then distill these capabilities into locally deployable Small Language Models (SLMs). Distributional analysis using Frechet Text Distance and Jensen-Shannon Divergence confirms SHIELD occupies a distinct region of biomedical embedding and vocabulary space versus legacy benchmarks. Our best distilled model matches its teacher on structured PHI categories (DATE, DOCTOR, ID, PATIENT, PHONE) and achieves micro-averaged span-level precision of 0.88 and recall of 0.86 on standard workstation hardware. Cross-dataset evaluation shows diversity-trained models generalize well on universal structured PHI, while institution-specific entities remain hard to transfer, suggesting optimal deployment combines broad-coverage models with specialized models for high-volume notes. We publicly release the SHIELD dataset and the distilled DeBERTa v3 model.
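The two distributional metrics named in the abstract can be sketched in a few lines. Jensen-Shannon Divergence is computed here over unigram vocabulary distributions, and the Fréchet distance uses the closed form for two Gaussians, simplified to diagonal covariances so the example needs only the standard library. Both simplifications, the toy tokens, and the toy embeddings are assumptions; the paper's embedding model and exact estimator are not given on this page.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def jensen_shannon_divergence(tokens_a: Sequence[str], tokens_b: Sequence[str]) -> float:
    """JSD in bits (range 0 to 1) between the unigram distributions of two corpora."""
    counts_a, counts_b = Counter(tokens_a), Counter(tokens_b)
    total_a, total_b = sum(counts_a.values()), sum(counts_b.values())
    jsd = 0.0
    for word in set(counts_a) | set(counts_b):
        p = counts_a[word] / total_a
        q = counts_b[word] / total_b
        m = 0.5 * (p + q)
        if p > 0:
            jsd += 0.5 * p * math.log2(p / m)
        if q > 0:
            jsd += 0.5 * q * math.log2(q / m)
    return jsd


def frechet_distance_diagonal(xs: List[List[float]], ys: List[List[float]]) -> Dict[str, float]:
    """Squared Fréchet distance between Gaussians fitted to xs and ys.

    Assumes diagonal covariances, so the trace term reduces to
    sum_d (sqrt(var_x[d]) - sqrt(var_y[d]))**2. The full formula needs a matrix square root.
    """
    dim = len(xs[0])

    def mean_and_variance(rows: List[List[float]]):
        n = len(rows)
        means = [sum(r[d] for r in rows) / n for d in range(dim)]
        variances = [sum((r[d] - means[d]) ** 2 for r in rows) / n for d in range(dim)]
        return means, variances

    mu_x, var_x = mean_and_variance(xs)
    mu_y, var_y = mean_and_variance(ys)
    mean_shift = sum((a - b) ** 2 for a, b in zip(mu_x, mu_y))
    covariance_term = sum(vx + vy - 2 * math.sqrt(vx * vy) for vx, vy in zip(var_x, var_y))
    return {"mean_shift": mean_shift, "covariance_term": covariance_term,
            "total": mean_shift + covariance_term}


# Toy inputs, invented for illustration.
legacy_tokens = "patient seen in clinic patient stable".split()
modern_tokens = "pt seen via telehealth pt stable".split()
print(f"JSD = {jensen_shannon_divergence(legacy_tokens, modern_tokens):.3f}")

legacy_embeddings = [[0.0, 0.0], [2.0, 0.0]]
modern_embeddings = [[0.0, 0.0], [4.0, 0.0]]
print(frechet_distance_diagonal(legacy_embeddings, modern_embeddings))
```

Both quantities measure how far apart two corpora are. A large value says the corpora differ; it does not by itself say that either corpus covers the variation found at other institutions.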

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents SHIELD, a dataset of 1,394 clinical notes with 10,505 gold-standard PHI spans across 9 categories, constructed via set-cover diversity sampling and human-in-the-loop adjudication to overcome limitations of legacy benchmarks such as i2b2. It evaluates four LLMs (proprietary and open-weight) to establish performance ceilings, distills these into locally deployable small language models, and reports that the best distilled model matches its teacher on structured PHI categories (DATE, DOCTOR, ID, PATIENT, PHONE) while achieving micro-averaged span-level precision of 0.88 and recall of 0.86 on standard hardware. Distributional metrics (Fréchet Text Distance, Jensen-Shannon Divergence) position SHIELD apart from prior data, and cross-dataset tests indicate strong transfer on universal structured PHI but weaker results on institution-specific entities, leading to a recommendation for hybrid deployment; the dataset and a distilled DeBERTa v3 model are publicly released.

Significance. If the representativeness claims hold, the work is significant for providing a modern, diverse clinical de-identification resource that addresses the semantic and demographic gaps in decade-old public benchmarks. The distillation approach yielding competitive SLM performance on workstation hardware is practically relevant for privacy-sensitive enterprise settings where cloud LLMs are restricted. Public release of the dataset and model supports reproducibility and community follow-up, while the concrete metrics and distributional comparisons strengthen the contribution to applied NLP for healthcare.

major comments (2)
  1. [Dataset construction and distributional analysis] The central claim that SHIELD supports generalization beyond sampled notes and institutions for enterprise deployment rests on set-cover diversity sampling yielding representativeness, yet the manuscript provides no direct evidence that the chosen cover features (embeddings, metadata, vocabulary) span demographic, semantic, and stylistic variation across institutions; the Fréchet Text Distance and Jensen-Shannon Divergence results only show separation from i2b2 rather than positive coverage or absence of source-specific bias (dataset construction and distributional analysis sections).
  2. [Cross-dataset evaluation] Cross-dataset evaluation results (strong on universal structured PHI, poor on institution-specific) are presented as supporting a hybrid deployment strategy, but this pattern is the expected outcome if the sampling pool is narrow and directly weakens the abstract's claim that the distilled model is ready for general workstation deployment (cross-dataset evaluation section).
minor comments (2)
  1. [Abstract] The abstract states '9 categories' without enumerating them; listing the PHI categories explicitly would improve immediate clarity.
  2. [Results] Ensure uniform terminology for 'micro-averaged span-level precision and recall' across results tables and text to prevent minor ambiguity in metric reporting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, providing clarifications on the scope of our claims and proposing targeted revisions to improve precision without overstating the results.

Point-by-point responses
  1. Referee: The central claim that SHIELD supports generalization beyond sampled notes and institutions for enterprise deployment rests on set-cover diversity sampling yielding representativeness, yet the manuscript provides no direct evidence that the chosen cover features (embeddings, metadata, vocabulary) span demographic, semantic, and stylistic variation across institutions; the Fréchet Text Distance and Jensen-Shannon Divergence results only show separation from i2b2 rather than positive coverage or absence of source-specific bias (dataset construction and distributional analysis sections).

    Authors: The set-cover diversity sampling was applied within our single-institution note pool to maximize coverage of the observed embedding, metadata, and vocabulary features, producing a subset more representative of that pool than random sampling. The FTD and JSD metrics establish that SHIELD occupies a distinct region relative to i2b2, supporting its utility as a modern benchmark. We acknowledge that these steps do not constitute direct evidence of spanning all demographic, semantic, or stylistic variations across institutions, as the source data is institution-specific. We will revise the dataset construction and distributional analysis sections to explicitly limit the representativeness claim to the sampled pool and to note that broader generalization requires further validation or adaptation. revision: partial

  2. Referee: Cross-dataset evaluation results (strong on universal structured PHI, poor on institution-specific) are presented as supporting a hybrid deployment strategy, but this pattern is the expected outcome if the sampling pool is narrow and directly weakens the abstract's claim that the distilled model is ready for general workstation deployment (cross-dataset evaluation section).

    Authors: The cross-dataset results indeed reflect the single-institution sampling pool, with strong transfer on universal structured categories and weaker performance on institution-specific ones. This pattern directly motivates the hybrid deployment recommendation already present in the abstract and discussion. The manuscript reports workstation performance while explicitly suggesting combination with specialized models rather than claiming the distilled model is ready for unrestricted general deployment. We will revise the abstract to sharpen this distinction and foreground the hybrid strategy as a core practical takeaway. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results rest on independent gold-standard annotations and standard sampling

Full rationale

The paper's derivation chain consists of dataset construction via set-cover sampling plus human adjudication, LLM evaluation to set a ceiling, distillation to SLMs, and direct span-level precision/recall measurement against the human-created gold labels. These steps do not reduce by construction to fitted parameters or self-referential definitions. Distributional metrics (Fréchet Text Distance, Jensen-Shannon Divergence) are computed independently on embeddings and vocabulary. There are no load-bearing self-citations, no uniqueness theorems imported from prior author work, and no equations that make the reported 0.88/0.86 scores tautological. Performance claims are falsifiable against external benchmarks and the released dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Relies on standard NLP assumptions about annotation reliability and distillation effectiveness; no free parameters, invented entities, or ad-hoc axioms are introduced in the abstract.

axioms (2)
  • domain assumption: Human-in-the-loop adjudication produces reliable gold-standard PHI annotations.
    Invoked in dataset construction via human review of set-cover sampled notes.
  • domain assumption: Distillation from large LLMs to small models preserves performance on structured extraction tasks.
    Central to the claim that the distilled model matches the teacher on specific PHI categories.

pith-pipeline@v0.9.0 · 5597 in / 1276 out tokens · 50108 ms · 2026-05-07T16:48:41.935714+00:00 · methodology

