Recognition: unknown
Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction
Pith reviewed 2026-05-08 17:17 UTC · model grok-4.3
The pith
Small language models can self-generate prompts to accurately extract clinical entities from unstructured dental notes while running locally for privacy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors show that a self-prompting pipeline combined with QLoRA fine-tuning and direct preference optimization allows selected small language models to achieve micro F1 scores above 0.8 on multi-entity extraction from dental notes, with Qwen2.5-14B-Instruct reaching 0.864 after adaptation.
What carries the argument
The self-prompting mechanism in which the model generates candidate prompts for each clinical entity, verifies them against annotations, refines them iteratively, and then uses an ensemble of the best prompts for inference.
If this is right
- Locally deployable models reduce privacy risks in handling patient data.
- Task-specific prompt optimization outperforms reliance on general model capabilities.
- Preference optimization like DPO can further boost performance on domain-specific extraction tasks.
- Multi-prompt ensembles provide robustness for unstructured clinical text.
Where Pith is reading between the lines
- Similar self-prompting could extend to other medical specialties with unstructured notes, such as radiology or pathology.
- Further iterations of self-refinement might reduce the need for large annotated datasets.
- Integration with electronic health record systems could enable real-time entity extraction without cloud access.
Load-bearing premise
The 1,200 annotated dental notes are representative of notes from other clinics and that the self-generated prompts will not cause privacy leaks or performance drops on new data.
What would settle it
Evaluating the adapted models on a held-out set of dental notes from a different institution and observing whether the F1 scores remain above 0.75 or drop substantially.
Figures
read the original abstract
Clinical named entity recognition from dental progress notes is challenging because documentation is highly unstructured, domain-specific, and often privacy-sensitive. We developed a locally deployable framework that enables small language models to self-generate, verify, refine, and evaluate entity-specific prompts for extracting multiple clinical entities from dental notes. Using 1,200 annotated notes, we evaluated candidate open-weight models with multi-prompt ensemble inference and further adapted selected models using QLoRA-based supervised fine-tuning and direct preference optimization. Model performance varied substantially, highlighting the need for task-specific evaluation rather than reliance on generic benchmarks. Qwen2.5-14B-Instruct achieved the strongest baseline performance. After DPO, Qwen2.5-14B-Instruct and Llama-3.1-8B-Instruct achieved micro/macro F1 scores of 0.864/0.837 and 0.806/0.797, respectively. These findings suggest that automated prompt optimization combined with lightweight preference-based post-training can support scalable clinical information extraction using locally deployed small language models.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a locally deployable framework in which small open-weight language models self-generate, verify, and refine entity-specific prompts for clinical named-entity recognition on unstructured dental progress notes. Using 1,200 annotated notes, the authors evaluate several models with multi-prompt ensemble inference, then apply QLoRA-based supervised fine-tuning followed by direct preference optimization (DPO). Post-DPO results are reported as micro/macro F1 of 0.864/0.837 for Qwen2.5-14B-Instruct and 0.806/0.797 for Llama-3.1-8B-Instruct, with the claim that the pipeline supports scalable, privacy-preserving clinical information extraction.
Significance. If the reported F1 scores can be shown to arise from a leakage-free pipeline and to generalize beyond the single-source 1,200-note collection, the work would provide concrete evidence that automated prompt optimization plus lightweight preference tuning can produce usable clinical NER performance with locally runnable models. The emphasis on real dental notes and open-weight models is a practical strength.
major comments (3)
- [Abstract and Evaluation] Abstract and Evaluation section: the headline F1 numbers are presented without any baseline comparisons (e.g., standard clinical NER tools, fine-tuned encoder-only models, or zero-shot prompting with larger models), so the incremental benefit of self-prompting plus DPO cannot be quantified.
- [Data and Methods] Data and Methods: no description is given of how the 1,200 notes were partitioned, whether prompt-candidate generation or preference-pair construction used data disjoint from the final test set, or whether any temporal or external hold-out set was employed. Without these controls the reported scores risk optimistic bias from leakage.
- [Results] Results: the abstract states concrete micro/macro F1 values after DPO but supplies neither per-entity breakdowns, error analysis, nor discussion of failure modes, leaving the clinical reliability of the 0.864/0.837 and 0.806/0.797 figures unassessable.
minor comments (1)
- [Abstract] Abstract: the phrase 'multi-prompt ensemble inference' is used without indicating the number of prompts or the aggregation rule.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.
read point-by-point responses
-
Referee: [Abstract and Evaluation] Abstract and Evaluation section: the headline F1 numbers are presented without any baseline comparisons (e.g., standard clinical NER tools, fine-tuned encoder-only models, or zero-shot prompting with larger models), so the incremental benefit of self-prompting plus DPO cannot be quantified.
Authors: We agree that baseline comparisons are needed to quantify the incremental benefit. In the revised manuscript we will add a dedicated comparison table in the Evaluation section reporting results from standard clinical NER tools (MedSpaCy), fine-tuned encoder-only models (BioBERT, ClinicalBERT), and zero-shot prompting with larger models (Llama-3.1-70B). This will allow direct assessment of the gains from self-prompting and DPO. revision: yes
-
Referee: [Data and Methods] Data and Methods: no description is given of how the 1,200 notes were partitioned, whether prompt-candidate generation or preference-pair construction used data disjoint from the final test set, or whether any temporal or external hold-out set was employed. Without these controls the reported scores risk optimistic bias from leakage.
Authors: We will expand the Methods section to explicitly describe the partitioning: the 1,200 notes were randomly split into 70% training, 15% validation, and 15% test sets. Prompt-candidate generation, verification, and DPO preference-pair construction were performed only on the training and validation portions; the test set remained completely unseen during all optimization steps. No temporal or external hold-out was used because the collection is from a single source and time period; we will add this as an explicit limitation and discuss implications for generalization. revision: yes
-
Referee: [Results] Results: the abstract states concrete micro/macro F1 values after DPO but supplies neither per-entity breakdowns, error analysis, nor discussion of failure modes, leaving the clinical reliability of the 0.864/0.837 and 0.806/0.797 figures unassessable.
Authors: We will revise the Results section to include a per-entity F1 breakdown table for all models and a new error-analysis subsection. The subsection will categorize common failure modes (boundary errors, ambiguous abbreviations, false positives on non-entity terms) with examples and discuss their clinical implications, thereby making the reliability of the reported scores more assessable. revision: yes
Circularity Check
No significant circularity; purely empirical results on held-out data
full rationale
The paper reports an empirical pipeline for clinical NER: annotation of 1,200 dental notes, self-prompt generation and ensemble inference, followed by QLoRA SFT and DPO on selected models, with final micro/macro F1 scores measured on held-out notes. No equations, derivations, or self-referential definitions appear in the provided text. Reported performance metrics are externally computed against human annotations rather than reduced to inputs by construction, fitted parameters renamed as predictions, or load-bearing self-citations. The work is self-contained against standard ML evaluation benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Hillestad, R. et al. Can Electronic Medical Record Systems Transform Health Care? Potential Health Benefits, Savings, And Costs. Health Affairs 24, 1103–1117 (2005). 21
2005
-
[2]
Hou, J. et al. Generate Analysis-Ready Data for Real-world Evidence: Tutorial for Harnessing Electronic Health Records With Advanced Informatic Technologies. Journal of Medical Internet Research 25, e45662 (2023)
2023
-
[3]
Moy, A. J. et al. Measurement of clinical documentation burden among physicians and nurses using electronic health records: a scoping review. J Am Med Inform Assoc 28, 998–1008 (2021)
2021
-
[4]
Moy, A. J. et al. Understanding the perceived role of electronic health records and workflow fragmentation on clinician documentation burden in emergency departments. J Am Med Inform Assoc 30, 797–808 (2023)
2023
-
[5]
Johnson, S. B. et al. An Electronic Health Record Based on Structured Narrative. J Am Med Inform Assoc 15, 54–64 (2008)
2008
-
[6]
E., Walji, M., Lam, W
Schwendicke, F., Uribe, S. E., Walji, M., Lam, W. & Tichy, A. Electronic Health Records in Dentistry: Relevance, Challenges and Policy Directions. International Dental Journal 75, 103964 (2025)
2025
-
[7]
Schleyer, T. et al. Electronic dental record use and clinical information management patterns among practitioner-investigators in The Dental Practice-Based Research Network. J Am Dent Assoc 144, 49–58 (2013)
2013
-
[8]
& Schleyer, T
Song, M., Liu, K., Abromitis, R. & Schleyer, T. L. Reusing Electronic Patient Data for Dental Clinical Research: A Review of Current Status. J Dent 41, 1148–1163 (2013)
2013
-
[9]
Bhardwaj, A. et al. Measuring up: Implementing a dental quality measure in the electronic health record context. J Am Dent Assoc 147, 35–40 (2016). 22
2016
-
[10]
Chuang, Y.-S. et al. Cross-institutional dental electronic health record entity extraction via generative artificial intelligence and synthetic notes. JAMIA Open 8, ooaf061 (2025)
2025
-
[11]
Wang, Y. et al. Clinical information extraction applications: A literature review. Journal of Biomedical Informatics 77, 34–49 (2018)
2018
-
[12]
Kreimeyer, K. et al. Natural language processing systems for capturing and standardizing unstructured clinical information: A systematic review. Journal of biomedical informatics 73, 14–29 (2017)
2017
-
[13]
& Huang, Y
Sezgin, E., Hussain, S.-A., Rust, S. & Huang, Y. Extracting Medical Information From Free-Text and Unstructured Patient-Generated Health Data Using Natural Language Processing Methods: Feasibility Study With Real-world Data. JMIR Formative Research 7, e43014 (2023)
2023
-
[14]
& Dunn, A
Pethani, F. & Dunn, A. G. Natural language processing for clinical notes in dentistry: A systematic review. J Biomed Inform 138, 104282 (2023)
2023
-
[15]
C., Xuesi Zhou,Ji Wu,Yongsheng
Zhou, Q. C., Xuesi Zhou,Ji Wu,Yongsheng. Structuring electronic dental records through deep learning for a clinical decision support system - Qingxiao Chen, Xuesi Zhou, Ji Wu, Yongsheng Zhou, 2021. Health Informatics Journal https://journals.sagepub.com/doi/10.1177/1460458220980036 (2021)
-
[16]
& Schwendicke, F
Büttner, M., Leser, U., Schneider, L. & Schwendicke, F. Natural Language Processing: Chances and Challenges in Dentistry. Journal of Dentistry 141, 104796 (2024)
2024
-
[17]
Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). 23
2023
-
[18]
Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med 30, 1134–1142 (2024)
2024
-
[19]
& Sontag, D
Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are few-shot clinical information extractors. in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 1998–2022 (2022)
2022
-
[20]
Yao, S. et al. REACT: SYNERGIZING REASONING AND ACTING IN LANGUAGE MODELS. in (2023)
2023
-
[21]
Schick, T. et al. Toolformer: Language Models Can Teach Themselves to Use Tools. in (2023)
2023
-
[22]
Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K. R. & Yao, S. Reflexion: language agents with verbal reinforcement learning. in (2023)
2023
-
[23]
Madaan, A. et al. Self-Refine: Iterative Refinement with Self-Feedback. in (2023)
2023
-
[24]
Huang, H. et al. ChatGPT for shaping the future of dentistry: the potential of multi- modal large language model. Int J Oral Sci 15, 29 (2023)
2023
-
[25]
M., Schmerder, M
Dennstädt, F., Hastings, J., Putora, P. M., Schmerder, M. & Cihoric, N. Implementing large language models in healthcare while balancing control, collaboration, costs and security. npj Digit. Med. 8, 143 (2025)
2025
-
[26]
M., Venkatesh, K
Raza, M. M., Venkatesh, K. P. & Kvedar, J. C. Generative AI and large language models in health care: pathways to implementation. NPJ Digit Med 7, 62 (2024)
2024
-
[27]
Nagarajan, R. et al. Economics and Equity of Large Language Models: Health Care Perspective. Journal of Medical Internet Research 26, e64226 (2024). 24
2024
-
[28]
Zhong, X. et al. Considerations for Patient Privacy of Large Language Models in Health Care: Scoping Review. Journal of Medical Internet Research 27, e76571 (2025)
2025
-
[29]
& Wong, Z
Jonnagaddala, J. & Wong, Z. S.-Y. Privacy preserving strategies for electronic health records in the era of large language models. npj Digit. Med. 8, 34 (2025)
2025
-
[30]
Y., Helzer, J., Pfeffer, M
Ng, M. Y., Helzer, J., Pfeffer, M. A., Seto, T. & Hernandez-Boussard, T. Development of secure infrastructure for advancing generative artificial intelligence research in healthcare at an academic medical center. J Am Med Inform Assoc 32, 586–588 (2025)
2025
-
[31]
Chuang, Y.-S., Sarkar, A. R., Hsu, Y.-C., Mohammed, N. & Jiang, X. Robust Privacy Amidst Innovation with Large Language Models Through a Critical Assessment of the Risks. Preprint at https://doi.org/10.48550/arXiv.2407.16166 (2024)
-
[32]
Wiest, I. C. et al. Privacy-preserving large language models for structured medical information retrieval. npj Digit. Med. 7, 257 (2024)
2024
-
[33]
Kwon, J. et al. Validation of deep-learning-based triage and acuity score using a large national dataset. PLoS One 13, e0205836 (2018)
2018
-
[34]
Zhou, Y. et al. Large Language Models are Human-Level Prompt Engineers. in (2022)
2022
-
[35]
A., Feier, A
Kocaman, V., Kaya, M. A., Feier, A. M. & Talby, D. Clinical Large Language Model Evaluation by Expert Review (CLEVER): Framework Development and Validation. JMIR AI 4, e72153 (2025)
2025
-
[36]
Ma, Z. et al. Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2508.04325 (2025). 25
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.04325 2025
-
[37]
Ceballos-Arroyo, A. M. et al. Open (Clinical) LLMs are Sensitive to Instruction Phrasings. in Proceedings of the 23rd Workshop on Biomedical Natural Language Processing (eds. Demner-Fushman, D., Ananiadou, S., Miwa, M., Roberts, K. & Tsujii, J.) 50–71 (Association for Computational Linguistics, Bangkok, Thailand, 2024). doi:10.18653/v1/2024.bionlp-1.5
-
[38]
Cui, W. et al. A Survey of Automatic Prompt Optimization with Instruction-focused Heuristic-based Search Algorithm. Preprint at https://doi.org/10.48550/arXiv.2502.18746 (2025)
-
[39]
D., Wong, R
Zamfirescu-Pereira, J. D., Wong, R. Y., Hartmann, B. & Yang, Q. Why Johnny can’t prompt: How non-AI experts try (and fail) to design LLM prompts. in Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems 1–21 (ACM, New York, NY, USA, 2023)
2023
-
[40]
Wang, Y. et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions. in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds. Rogers, A., Boyd-Graber, J. & Okazaki, N.) 13484–13508 (Association for Computational Linguistics, Toronto, Canada, 2023). doi:10.18653/v1/2023.ac...
-
[41]
Guo, Q. et al. EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers. Preprint at https://doi.org/10.48550/arXiv.2309.08532 (2025)
-
[42]
Zehle, T., Schlager, M., Heiß, T. & Feurer, M. CAPO: Cost-Aware Prompt Optimization. Preprint at https://doi.org/10.48550/arXiv.2504.16005 (2025). 26
-
[43]
Agarwal, E. et al. PromptWizard: Task-Aware Prompt Optimization Framework. Preprint at https://doi.org/10.48550/arXiv.2405.18369 (2024)
-
[44]
Li, J., Papay, S. & Klinger, R. Are Humans as Brittle as Large Language Models? in Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (eds. Inui, K. et al.) 2130–2155 (The Asian Federation of Natural Language Processing an...
-
[45]
Soylu, D., Potts, C. & Khattab, O. Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together. in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (eds. Al-Onaizan, Y., Bansal, M. & Chen, Y.-N.) 10696–10710 (Association for Computational Linguistics, Miami, Florida, USA, 2024). doi:10.18653/v1/2024...
-
[46]
Wang, K. et al. Neurosymbolic LoRA: Why and When to Tune Weights vs. Rewrite Prompts. Preprint at https://doi.org/10.48550/arXiv.2601.12711 (2026)
-
[47]
Srivastava, S. et al. Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models. in (2026)
2026
-
[48]
Cheng, Z., Kasai, J. & Yu, T. Batch Prompting: Efficient Inference with Large Language Model APIs. in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track (eds. Wang, M. & Zitouni, I.) 792–810 (Association for Computational Linguistics, Singapore, 2023). doi:10.18653/v1/2023.emnlp-industry.74. 27
-
[49]
Zhang, T. M. et al. UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction. in Proceedings of the 7th Clinical Natural Language Processing Workshop (eds. Ben Abacha, A., Bethard, S., Bitterman, D., Naumann, T. & Roberts, K.) 40–56 (Association for Computational Linguistics, Virtual, 2025)
2025
-
[50]
Bandara, E. et al. Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support. Preprint at https://doi.org/10.48550/arXiv.2604.18302 (2026)
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.18302 2026
-
[51]
Ansari, M. S., Khan, M. S. A., Revankar, S., Varma, A. & Mokhade, A. S. Lightweight Clinical Decision Support System using QLoRA-Fine-Tuned LLMs and Retrieval-Augmented Generation. Preprint at https://doi.org/10.48550/arXiv.2505.03406 (2025)
-
[52]
Li, R. et al. Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced(R$^2$)GRPO. Preprint at https://doi.org/10.48550/arXiv.2505.22068 (2025)
-
[53]
Gupta, A., Kumar, D. & Sinha, Y. BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection. Preprint at https://doi.org/10.48550/arXiv.2604.11121 (2026)
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.11121 2026
-
[54]
**White**
Rafailov, R. et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems 36, 53728– 53741 (2023). 28 Figure legends Figure 1. Prompt inference performance on the 200-note training set and 1000-note gold standard across prompt-generation configurations. Figure 2. Overview of the s...
2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.