arxiv: 2605.04221 · v1 · submitted 2026-05-05 · 💻 cs.CL · cs.AI

Recognition: unknown

Self-Prompting Small Language Models for Privacy-Sensitive Clinical Information Extraction

Yao-Shun Chuang , Tushti Mody , Uday Pratap Singh , Shirindokht Shiraz , Chun-Teh Lee , Ryan Brandon , Muhammad F Walji , Xiaoqian Jiang

show 1 more author

Bunmi Tokede

Authors on Pith no claims yet

Pith reviewed 2026-05-08 17:17 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords clinical named entity recognitionself-promptingsmall language modelsdental notesdirect preference optimizationprivacy preservationlocal deploymentinformation extraction

0 comments

The pith

Small language models can self-generate prompts to accurately extract clinical entities from unstructured dental notes while running locally for privacy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a framework where small open-weight language models generate, verify, and refine their own prompts for named entity recognition in dental progress notes. Using 1,200 annotated notes, they test models with ensemble inference and then adapt them through supervised fine-tuning and direct preference optimization. Strongest results come from Qwen2.5-14B-Instruct and Llama-3.1-8B-Instruct after DPO, reaching micro F1 scores of 0.864 and 0.806 respectively. This approach matters because dental notes are highly unstructured and privacy-sensitive, so keeping everything local and automated avoids sending data to external services. Performance varies by model, showing that generic benchmarks do not predict task success.

Core claim

The authors show that a self-prompting pipeline combined with QLoRA fine-tuning and direct preference optimization allows selected small language models to achieve micro F1 scores above 0.8 on multi-entity extraction from dental notes, with Qwen2.5-14B-Instruct reaching 0.864 after adaptation.

What carries the argument

The self-prompting mechanism in which the model generates candidate prompts for each clinical entity, verifies them against annotations, refines them iteratively, and then uses an ensemble of the best prompts for inference.

If this is right

Locally deployable models reduce privacy risks in handling patient data.
Task-specific prompt optimization outperforms reliance on general model capabilities.
Preference optimization like DPO can further boost performance on domain-specific extraction tasks.
Multi-prompt ensembles provide robustness for unstructured clinical text.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar self-prompting could extend to other medical specialties with unstructured notes, such as radiology or pathology.
Further iterations of self-refinement might reduce the need for large annotated datasets.
Integration with electronic health record systems could enable real-time entity extraction without cloud access.

Load-bearing premise

The 1,200 annotated dental notes are representative of notes from other clinics and that the self-generated prompts will not cause privacy leaks or performance drops on new data.

What would settle it

Evaluating the adapted models on a held-out set of dental notes from a different institution and observing whether the F1 scores remain above 0.75 or drop substantially.

Figures

Figures reproduced from arXiv: 2605.04221 by Bunmi Tokede, Chun-Teh Lee, Muhammad F Walji, Ryan Brandon, Shirindokht Shiraz, Tushti Mody, Uday Pratap Singh, Xiaoqian Jiang, Yao-Shun Chuang.

**Figure 1.** Figure 1: Prompt inference performance on the 200-note training set and 1000-note gold standard across prompt-generation configurations view at source ↗

read the original abstract

Clinical named entity recognition from dental progress notes is challenging because documentation is highly unstructured, domain-specific, and often privacy-sensitive. We developed a locally deployable framework that enables small language models to self-generate, verify, refine, and evaluate entity-specific prompts for extracting multiple clinical entities from dental notes. Using 1,200 annotated notes, we evaluated candidate open-weight models with multi-prompt ensemble inference and further adapted selected models using QLoRA-based supervised fine-tuning and direct preference optimization. Model performance varied substantially, highlighting the need for task-specific evaluation rather than reliance on generic benchmarks. Qwen2.5-14B-Instruct achieved the strongest baseline performance. After DPO, Qwen2.5-14B-Instruct and Llama-3.1-8B-Instruct achieved micro/macro F1 scores of 0.864/0.837 and 0.806/0.797, respectively. These findings suggest that automated prompt optimization combined with lightweight preference-based post-training can support scalable clinical information extraction using locally deployed small language models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Self-prompting and DPO let small open models hit 0.86 micro F1 on dental NER while staying local, but the single 1200-note set leaves generalization untested.

read the letter

The paper shows that small open-weight models can be adapted with self-generated prompts and DPO to extract clinical entities from dental notes at around 0.86 micro F1 while staying local for privacy. That's the main practical result. They start with 1,200 annotated notes, have the model create and verify its own prompts, run multi-prompt ensembles, then apply QLoRA fine-tuning followed by direct preference optimization on Qwen2.5-14B and Llama-3.1-8B. The post-DPO scores are the headline numbers. This combination for the dentistry domain is new enough, and the work does a solid job highlighting that off-the-shelf performance varies and task-specific adaptation matters. Using open models and keeping everything local is a clear strength for real-world clinical use where data can't leave the premises. The soft spots are in the evaluation. There are no comparisons to standard baselines like rule-based extractors or simpler prompting without the self-generation step, so it's difficult to quantify the gain from the full pipeline. The abstract does not spell out the train-test split details or confirm that prompt creation and preference data stayed strictly out of the final test set. As the stress-test note points out, without a temporal split or notes from another clinic, the F1 scores could be optimistic for new data. No equations or formal claims here, just reported metrics, so the empirical nature is straightforward. This paper is for researchers and engineers working on privacy-preserving clinical NLP, especially in specialized areas like dentistry. Someone looking for concrete examples of adapting small models to domain-specific extraction would find the numbers and setup useful. It should go to peer review. The core idea is sound and the results are concrete enough to warrant referee feedback on the evaluation design and any additional experiments they might suggest.

Referee Report

3 major / 1 minor

Summary. The paper presents a locally deployable framework in which small open-weight language models self-generate, verify, and refine entity-specific prompts for clinical named-entity recognition on unstructured dental progress notes. Using 1,200 annotated notes, the authors evaluate several models with multi-prompt ensemble inference, then apply QLoRA-based supervised fine-tuning followed by direct preference optimization (DPO). Post-DPO results are reported as micro/macro F1 of 0.864/0.837 for Qwen2.5-14B-Instruct and 0.806/0.797 for Llama-3.1-8B-Instruct, with the claim that the pipeline supports scalable, privacy-preserving clinical information extraction.

Significance. If the reported F1 scores can be shown to arise from a leakage-free pipeline and to generalize beyond the single-source 1,200-note collection, the work would provide concrete evidence that automated prompt optimization plus lightweight preference tuning can produce usable clinical NER performance with locally runnable models. The emphasis on real dental notes and open-weight models is a practical strength.

major comments (3)

[Abstract and Evaluation] Abstract and Evaluation section: the headline F1 numbers are presented without any baseline comparisons (e.g., standard clinical NER tools, fine-tuned encoder-only models, or zero-shot prompting with larger models), so the incremental benefit of self-prompting plus DPO cannot be quantified.
[Data and Methods] Data and Methods: no description is given of how the 1,200 notes were partitioned, whether prompt-candidate generation or preference-pair construction used data disjoint from the final test set, or whether any temporal or external hold-out set was employed. Without these controls the reported scores risk optimistic bias from leakage.
[Results] Results: the abstract states concrete micro/macro F1 values after DPO but supplies neither per-entity breakdowns, error analysis, nor discussion of failure modes, leaving the clinical reliability of the 0.864/0.837 and 0.806/0.797 figures unassessable.

minor comments (1)

[Abstract] Abstract: the phrase 'multi-prompt ensemble inference' is used without indicating the number of prompts or the aggregation rule.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and Evaluation] Abstract and Evaluation section: the headline F1 numbers are presented without any baseline comparisons (e.g., standard clinical NER tools, fine-tuned encoder-only models, or zero-shot prompting with larger models), so the incremental benefit of self-prompting plus DPO cannot be quantified.

Authors: We agree that baseline comparisons are needed to quantify the incremental benefit. In the revised manuscript we will add a dedicated comparison table in the Evaluation section reporting results from standard clinical NER tools (MedSpaCy), fine-tuned encoder-only models (BioBERT, ClinicalBERT), and zero-shot prompting with larger models (Llama-3.1-70B). This will allow direct assessment of the gains from self-prompting and DPO. revision: yes
Referee: [Data and Methods] Data and Methods: no description is given of how the 1,200 notes were partitioned, whether prompt-candidate generation or preference-pair construction used data disjoint from the final test set, or whether any temporal or external hold-out set was employed. Without these controls the reported scores risk optimistic bias from leakage.

Authors: We will expand the Methods section to explicitly describe the partitioning: the 1,200 notes were randomly split into 70% training, 15% validation, and 15% test sets. Prompt-candidate generation, verification, and DPO preference-pair construction were performed only on the training and validation portions; the test set remained completely unseen during all optimization steps. No temporal or external hold-out was used because the collection is from a single source and time period; we will add this as an explicit limitation and discuss implications for generalization. revision: yes
Referee: [Results] Results: the abstract states concrete micro/macro F1 values after DPO but supplies neither per-entity breakdowns, error analysis, nor discussion of failure modes, leaving the clinical reliability of the 0.864/0.837 and 0.806/0.797 figures unassessable.

Authors: We will revise the Results section to include a per-entity F1 breakdown table for all models and a new error-analysis subsection. The subsection will categorize common failure modes (boundary errors, ambiguous abbreviations, false positives on non-entity terms) with examples and discuss their clinical implications, thereby making the reliability of the reported scores more assessable. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical results on held-out data

full rationale

The paper reports an empirical pipeline for clinical NER: annotation of 1,200 dental notes, self-prompt generation and ensemble inference, followed by QLoRA SFT and DPO on selected models, with final micro/macro F1 scores measured on held-out notes. No equations, derivations, or self-referential definitions appear in the provided text. Reported performance metrics are externally computed against human annotations rather than reduced to inputs by construction, fitted parameters renamed as predictions, or load-bearing self-citations. The work is self-contained against standard ML evaluation benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper rests on standard supervised fine-tuning assumptions and the availability of 1,200 human-annotated notes. No new mathematical axioms, free parameters central to a derivation, or invented physical entities are introduced.

pith-pipeline@v0.9.0 · 5519 in / 1157 out tokens · 51940 ms · 2026-05-08T17:17:43.961483+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

54 extracted references · 17 canonical work pages · 3 internal anchors

[1]

Hillestad, R. et al. Can Electronic Medical Record Systems Transform Health Care? Potential Health Benefits, Savings, And Costs. Health Affairs 24, 1103–1117 (2005). 21

2005
[2]

Hou, J. et al. Generate Analysis-Ready Data for Real-world Evidence: Tutorial for Harnessing Electronic Health Records With Advanced Informatic Technologies. Journal of Medical Internet Research 25, e45662 (2023)

2023
[3]

Moy, A. J. et al. Measurement of clinical documentation burden among physicians and nurses using electronic health records: a scoping review. J Am Med Inform Assoc 28, 998–1008 (2021)

2021
[4]

Moy, A. J. et al. Understanding the perceived role of electronic health records and workflow fragmentation on clinician documentation burden in emergency departments. J Am Med Inform Assoc 30, 797–808 (2023)

2023
[5]

Johnson, S. B. et al. An Electronic Health Record Based on Structured Narrative. J Am Med Inform Assoc 15, 54–64 (2008)

2008
[6]

E., Walji, M., Lam, W

Schwendicke, F., Uribe, S. E., Walji, M., Lam, W. & Tichy, A. Electronic Health Records in Dentistry: Relevance, Challenges and Policy Directions. International Dental Journal 75, 103964 (2025)

2025
[7]

Schleyer, T. et al. Electronic dental record use and clinical information management patterns among practitioner-investigators in The Dental Practice-Based Research Network. J Am Dent Assoc 144, 49–58 (2013)

2013
[8]

& Schleyer, T

Song, M., Liu, K., Abromitis, R. & Schleyer, T. L. Reusing Electronic Patient Data for Dental Clinical Research: A Review of Current Status. J Dent 41, 1148–1163 (2013)

2013
[9]

Bhardwaj, A. et al. Measuring up: Implementing a dental quality measure in the electronic health record context. J Am Dent Assoc 147, 35–40 (2016). 22

2016
[10]

Chuang, Y.-S. et al. Cross-institutional dental electronic health record entity extraction via generative artificial intelligence and synthetic notes. JAMIA Open 8, ooaf061 (2025)

2025
[11]

Wang, Y. et al. Clinical information extraction applications: A literature review. Journal of Biomedical Informatics 77, 34–49 (2018)

2018
[12]

Kreimeyer, K. et al. Natural language processing systems for capturing and standardizing unstructured clinical information: A systematic review. Journal of biomedical informatics 73, 14–29 (2017)

2017
[13]

& Huang, Y

Sezgin, E., Hussain, S.-A., Rust, S. & Huang, Y. Extracting Medical Information From Free-Text and Unstructured Patient-Generated Health Data Using Natural Language Processing Methods: Feasibility Study With Real-world Data. JMIR Formative Research 7, e43014 (2023)

2023
[14]

& Dunn, A

Pethani, F. & Dunn, A. G. Natural language processing for clinical notes in dentistry: A systematic review. J Biomed Inform 138, 104282 (2023)

2023
[15]

C., Xuesi Zhou,Ji Wu,Yongsheng

Zhou, Q. C., Xuesi Zhou,Ji Wu,Yongsheng. Structuring electronic dental records through deep learning for a clinical decision support system - Qingxiao Chen, Xuesi Zhou, Ji Wu, Yongsheng Zhou, 2021. Health Informatics Journal https://journals.sagepub.com/doi/10.1177/1460458220980036 (2021)

work page doi:10.1177/1460458220980036 2021
[16]

& Schwendicke, F

Büttner, M., Leser, U., Schneider, L. & Schwendicke, F. Natural Language Processing: Chances and Challenges in Dentistry. Journal of Dentistry 141, 104796 (2024)

2024
[17]

Singhal, K. et al. Large language models encode clinical knowledge. Nature 620, 172–180 (2023). 23

2023
[18]

Van Veen, D. et al. Adapted large language models can outperform medical experts in clinical text summarization. Nat Med 30, 1134–1142 (2024)

2024
[19]

& Sontag, D

Agrawal, M., Hegselmann, S., Lang, H., Kim, Y. & Sontag, D. Large language models are few-shot clinical information extractors. in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing 1998–2022 (2022)

2022
[20]

Yao, S. et al. REACT: SYNERGIZING REASONING AND ACTING IN LANGUAGE MODELS. in (2023)

2023
[21]

Schick, T. et al. Toolformer: Language Models Can Teach Themselves to Use Tools. in (2023)

2023
[22]

Shinn, N., Cassano, F., Gopinath, A., Narasimhan, K. R. & Yao, S. Reflexion: language agents with verbal reinforcement learning. in (2023)

2023
[23]

Madaan, A. et al. Self-Refine: Iterative Refinement with Self-Feedback. in (2023)

2023
[24]

Huang, H. et al. ChatGPT for shaping the future of dentistry: the potential of multi- modal large language model. Int J Oral Sci 15, 29 (2023)

2023
[25]

M., Schmerder, M

Dennstädt, F., Hastings, J., Putora, P. M., Schmerder, M. & Cihoric, N. Implementing large language models in healthcare while balancing control, collaboration, costs and security. npj Digit. Med. 8, 143 (2025)

2025
[26]

M., Venkatesh, K

Raza, M. M., Venkatesh, K. P. & Kvedar, J. C. Generative AI and large language models in health care: pathways to implementation. NPJ Digit Med 7, 62 (2024)

2024
[27]

Nagarajan, R. et al. Economics and Equity of Large Language Models: Health Care Perspective. Journal of Medical Internet Research 26, e64226 (2024). 24

2024
[28]

Zhong, X. et al. Considerations for Patient Privacy of Large Language Models in Health Care: Scoping Review. Journal of Medical Internet Research 27, e76571 (2025)

2025
[29]

& Wong, Z

Jonnagaddala, J. & Wong, Z. S.-Y. Privacy preserving strategies for electronic health records in the era of large language models. npj Digit. Med. 8, 34 (2025)

2025
[30]

Y., Helzer, J., Pfeffer, M

Ng, M. Y., Helzer, J., Pfeffer, M. A., Seto, T. & Hernandez-Boussard, T. Development of secure infrastructure for advancing generative artificial intelligence research in healthcare at an academic medical center. J Am Med Inform Assoc 32, 586–588 (2025)

2025
[31]

R., Hsu, Y.-C., Mohammed, N

Chuang, Y.-S., Sarkar, A. R., Hsu, Y.-C., Mohammed, N. & Jiang, X. Robust Privacy Amidst Innovation with Large Language Models Through a Critical Assessment of the Risks. Preprint at https://doi.org/10.48550/arXiv.2407.16166 (2024)

work page doi:10.48550/arxiv.2407.16166 2024
[32]

Wiest, I. C. et al. Privacy-preserving large language models for structured medical information retrieval. npj Digit. Med. 7, 257 (2024)

2024
[33]

Kwon, J. et al. Validation of deep-learning-based triage and acuity score using a large national dataset. PLoS One 13, e0205836 (2018)

2018
[34]

Zhou, Y. et al. Large Language Models are Human-Level Prompt Engineers. in (2022)

2022
[35]

A., Feier, A

Kocaman, V., Kaya, M. A., Feier, A. M. & Talby, D. Clinical Large Language Model Evaluation by Expert Review (CLEVER): Framework Development and Validation. JMIR AI 4, e72153 (2025)

2025
[36]

Ma, Z. et al. Beyond the Leaderboard: Rethinking Medical Benchmarks for Large Language Models. Preprint at https://doi.org/10.48550/arXiv.2508.04325 (2025). 25

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2508.04325 2025
[37]

Ceballos-Arroyo, A. M. et al. Open (Clinical) LLMs are Sensitive to Instruction Phrasings. in Proceedings of the 23rd Workshop on Biomedical Natural Language Processing (eds. Demner-Fushman, D., Ananiadou, S., Miwa, M., Roberts, K. & Tsujii, J.) 50–71 (Association for Computational Linguistics, Bangkok, Thailand, 2024). doi:10.18653/v1/2024.bionlp-1.5

work page doi:10.18653/v1/2024.bionlp-1.5 2024
[38]

Cui, W. et al. A Survey of Automatic Prompt Optimization with Instruction-focused Heuristic-based Search Algorithm. Preprint at https://doi.org/10.48550/arXiv.2502.18746 (2025)

work page doi:10.48550/arxiv.2502.18746 2025
[39]

D., Wong, R

Zamfirescu-Pereira, J. D., Wong, R. Y., Hartmann, B. & Yang, Q. Why Johnny can’t prompt: How non-AI experts try (and fail) to design LLM prompts. in Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems 1–21 (ACM, New York, NY, USA, 2023)

2023
[40]

Wang, Y. et al. Self-Instruct: Aligning Language Models with Self-Generated Instructions. in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) (eds. Rogers, A., Boyd-Graber, J. & Okazaki, N.) 13484–13508 (Association for Computational Linguistics, Toronto, Canada, 2023). doi:10.18653/v1/2023.ac...

work page doi:10.18653/v1/2023.acl-long.754 2023
[41]

Guo, Q. et al. EvoPrompt: Connecting LLMs with Evolutionary Algorithms Yields Powerful Prompt Optimizers. Preprint at https://doi.org/10.48550/arXiv.2309.08532 (2025)

work page doi:10.48550/arxiv.2309.08532 2025
[42]

& Feurer, M

Zehle, T., Schlager, M., Heiß, T. & Feurer, M. CAPO: Cost-Aware Prompt Optimization. Preprint at https://doi.org/10.48550/arXiv.2504.16005 (2025). 26

work page doi:10.48550/arxiv.2504.16005 2025
[43]

Agarwal, E. et al. PromptWizard: Task-Aware Prompt Optimization Framework. Preprint at https://doi.org/10.48550/arXiv.2405.18369 (2024)

work page doi:10.48550/arxiv.2405.18369 2024
[44]

& Klinger, R

Li, J., Papay, S. & Klinger, R. Are Humans as Brittle as Large Language Models? in Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (eds. Inui, K. et al.) 2130–2155 (The Asian Federation of Natural Language Processing an...

work page doi:10.18653/v1/2025.ijcnlp-long.116 2025
[45]

& Khattab, O

Soylu, D., Potts, C. & Khattab, O. Fine-Tuning and Prompt Optimization: Two Great Steps that Work Better Together. in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (eds. Al-Onaizan, Y., Bansal, M. & Chen, Y.-N.) 10696–10710 (Association for Computational Linguistics, Miami, Florida, USA, 2024). doi:10.18653/v1/2024...

work page doi:10.18653/v1/2024.emnlp-main.597 2024
[46]

Wang, K. et al. Neurosymbolic LoRA: Why and When to Tune Weights vs. Rewrite Prompts. Preprint at https://doi.org/10.48550/arXiv.2601.12711 (2026)

work page doi:10.48550/arxiv.2601.12711 2026
[47]

Srivastava, S. et al. Reasoning Under Constraint: How Batch Prompting Suppresses Overthinking in Reasoning Models. in (2026)

2026
[48]

Cheng, Z., Kasai, J. & Yu, T. Batch Prompting: Efficient Inference with Large Language Model APIs. in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track (eds. Wang, M. & Zitouni, I.) 792–810 (Association for Computational Linguistics, Singapore, 2023). doi:10.18653/v1/2023.emnlp-industry.74. 27

work page doi:10.18653/v1/2023.emnlp-industry.74 2023
[49]

Zhang, T. M. et al. UW-BioNLP at ChemoTimelines 2025: Thinking, Fine-Tuning, and Dictionary-Enhanced LLM Systems for Chemotherapy Timeline Extraction. in Proceedings of the 7th Clinical Natural Language Processing Workshop (eds. Ben Abacha, A., Bethard, S., Bitterman, D., Naumann, T. & Roberts, K.) 40–56 (Association for Computational Linguistics, Virtual, 2025)

2025
[50]

Bandara, E. et al. Toward Zero-Egress Psychiatric AI: On-Device LLM Deployment for Privacy-Preserving Mental Health Decision Support. Preprint at https://doi.org/10.48550/arXiv.2604.18302 (2026)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.18302 2026
[51]

S., Khan, M

Ansari, M. S., Khan, M. S. A., Revankar, S., Varma, A. & Mokhade, A. S. Lightweight Clinical Decision Support System using QLoRA-Fine-Tuned LLMs and Retrieval-Augmented Generation. Preprint at https://doi.org/10.48550/arXiv.2505.03406 (2025)

work page doi:10.48550/arxiv.2505.03406 2025
[52]

Li, R. et al. Beyond path selection: Better LLMs for Scientific Information Extraction with MimicSFT and Relevance and Rule-induced(R$^2$)GRPO. Preprint at https://doi.org/10.48550/arXiv.2505.22068 (2025)

work page doi:10.48550/arxiv.2505.22068 2025
[53]

BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection

Gupta, A., Kumar, D. & Sinha, Y. BITS Pilani at SemEval-2026 Task 9: Structured Supervised Fine-Tuning with DPO Refinement for Polarization Detection. Preprint at https://doi.org/10.48550/arXiv.2604.11121 (2026)

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.11121 2026
[54]

**White**

Rafailov, R. et al. Direct Preference Optimization: Your Language Model is Secretly a Reward Model. Advances in Neural Information Processing Systems 36, 53728– 53741 (2023). 28 Figure legends Figure 1. Prompt inference performance on the 200-note training set and 1000-note gold standard across prompt-generation configurations. Figure 2. Overview of the s...

2023