Transferable Self-Harm Surveillance from Emergency Department Triage Notes Using an Evidence-Augmented Machine Learning Approach
Pith reviewed 2026-06-28 14:53 UTC · model grok-4.3
The pith
An evidence-augmented machine learning model detects self-harm in emergency department triage notes, identifies the primary method with 95 percent accuracy, and maintains performance across different hospitals without retraining.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The evidence-augmented machine learning approach identifies self-harm presentations in ED triage notes with internal and external AUPRCs of 0.887 and 0.884, sustains AUPRCs of 0.881, 0.879, and 0.816 at the development and two external sites in prospective use without site-specific retraining, and recovers the primary self-harm method with 95 percent accuracy.
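For readers less familiar with the headline metric, the following is a minimal sketch of how an AUPRC and an accompanying spread could be computed. It assumes AUPRC is taken as step-wise average precision and that the spread comes from bootstrap resampling; the abstract does not say how the reported ± values were derived, so both choices are illustrative assumptions. The function names `average_precision` and `bootstrap_ap` are hypothetical.

```python
import random
from statistics import mean, stdev


def average_precision(y_true, scores):
    """Step-wise area under the precision-recall curve (average precision)."""
    total_pos = sum(y_true)
    if total_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank notes from highest to lowest score; ties keep their input order.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits, ap = 0, 0.0
    for rank, i in enumerate(order, start=1):
        if y_true[i] == 1:
            hits += 1
            ap += hits / rank  # precision at the rank of each positive
    return ap / total_pos


def bootstrap_ap(y_true, scores, n_boot=200, seed=0):
    """Mean and standard deviation of average precision over resamples."""
    rng = random.Random(seed)
    n = len(y_true)
    values = []
    while len(values) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if sum(ys) == 0:
            continue  # skip resamples with no positives (metric undefined)
        values.append(average_precision(ys, [scores[i] for i in idx]))
    return mean(values), stdev(values)


# Toy labels and scores, invented for illustration only.
labels = [1, 0, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.6, 0.2, 0.1]
print(average_precision(labels, scores))
print(bootstrap_ap(labels, scores))
```

Because self-harm presentations are a minority class, a precision-recall summary of this kind is more informative than accuracy alone, which is presumably why the paper reports it.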
What carries the argument
A three-stage approach that augments traditional machine learning with large language model-based screening and evidence extraction.
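The shape of such a pipeline can be sketched as below. This is only an illustration under stated assumptions: the ordering of stages (screen, then extract evidence, then classify), the `Evidence` and `Result` types, and every function name are hypothetical, and simple keyword stand-ins replace both the large language model and the trained classifier. It is not the authors' implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Evidence:
    spans: List[str] = field(default_factory=list)  # phrases found in the note
    method: Optional[str] = None                    # primary method, if any


@dataclass
class Result:
    score: float
    evidence: Evidence


def run_pipeline(
    note: str,
    screen: Callable[[str], bool],
    extract: Callable[[str], Evidence],
    classify: Callable[[str, Evidence], float],
) -> Result:
    """Assumed stage order: screen, extract evidence, then score."""
    if not screen(note):
        return Result(score=0.0, evidence=Evidence())
    evidence = extract(note)
    return Result(score=classify(note, evidence), evidence=evidence)


# Hypothetical stand-ins; a real system would call an LLM and a trained model.
TOY_TERMS = {"overdose": "poisoning", "self-inflicted laceration": "cutting"}


def toy_screen(note: str) -> bool:
    text = note.lower()
    return any(term in text for term in TOY_TERMS)


def toy_extract(note: str) -> Evidence:
    text = note.lower()
    spans = [term for term in TOY_TERMS if term in text]
    return Evidence(spans=spans, method=TOY_TERMS[spans[0]] if spans else None)


def toy_classify(note: str, evidence: Evidence) -> float:
    # More supporting spans give a higher score, capped at 1.0.
    return min(1.0, 0.5 + 0.25 * len(evidence.spans))


result = run_pipeline("Intentional overdose of paracetamol",
                      toy_screen, toy_extract, toy_classify)
print(result.score, result.evidence.method)
```

Passing the stages in as callables keeps the sketch agnostic about which model fills each role, which matters here because the summary does not specify them.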
If this is right
- The method supports surveillance of self-harm that is more sensitive than diagnostic codes alone.
- It supplies the primary self-harm method for each case, enabling granular rather than binary tracking.
- The model transfers to new hospital sites at comparable accuracy without any retraining on local data.
- It operates prospectively on live triage notes at both the original and external locations.
Where Pith is reading between the lines
- Hospitals could embed the model in existing electronic record systems to generate automated alerts for mental-health follow-up teams.
- Aggregated method patterns across sites might reveal regional differences in self-harm behavior that are invisible to code-based statistics.
- The same evidence-extraction stage could be tested on other conditions where triage notes already record patient intent, such as substance-use presentations.
- Wider deployment would require safeguards for how sensitive note content is stored and audited after model processing.
Load-bearing premise
Triage notes contain enough explicit information about self-harm intent and method that the evidence-augmented classifier can recover it reliably.
What would settle it
Performance falling below an AUPRC of 0.75 or method-identification accuracy below 80 percent when the same model is applied to triage notes from a fourth hospital whose note style or documentation practices differ from the three tested sites.
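The settling criterion above can be written as a simple check. The two thresholds come from the text; the function name and the example inputs for a fourth site are hypothetical.

```python
AUPRC_FLOOR = 0.75          # threshold stated in the text
METHOD_ACCURACY_FLOOR = 0.80  # threshold stated in the text


def transfer_claim_refuted(auprc: float, method_accuracy: float) -> bool:
    """True if either metric at a new site falls below its floor."""
    return auprc < AUPRC_FLOOR or method_accuracy < METHOD_ACCURACY_FLOOR


# Hypothetical fourth-site results, for illustration only.
print(transfer_claim_refuted(auprc=0.74, method_accuracy=0.93))
print(transfer_claim_refuted(auprc=0.82, method_accuracy=0.91))
```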
Original abstract
Self-harm is a major public health concern, but current surveillance relying on hospital presentations is inadequate due to the low sensitivity of diagnostic codes. Emergency Department (ED) triage notes, recorded at the initial point of contact, provide a succinct summary of presentations and an opportunity to identify self-harm. We developed a three-stage approach, augmenting traditional machine learning with large language model-based screening and evidence extraction to detect self-harm in ED triage notes. We assessed model transferability across three Australian hospitals. Our approach showed AUPRCs of 0.887 +/- 0.016 and 0.884 +/- 0.012 during internal and external validation. Prospectively, it achieved AUPRC of 0.881 +/- 0.008 at the development site, and 0.879 +/- 0.012 and 0.816 +/- 0.015 at two external sites without site-specific retraining. A key advantage of the approach is that it enables identification of the primary self-harm method with an accuracy of 95%, supporting more granular surveillance beyond binary classification.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a three-stage evidence-augmented machine learning pipeline that combines traditional classifiers with LLM-based screening and evidence extraction to identify self-harm from ED triage notes. It reports AUPRCs of 0.887±0.016 (internal), 0.884±0.012 (external), and 0.881±0.008 / 0.879±0.012 / 0.816±0.015 (prospective across sites) without site-specific retraining, plus a secondary claim of 95% accuracy in identifying the primary self-harm method.
Significance. If the transferability results hold under proper validation of all pipeline stages, the work would offer a practical advance over ICD-code surveillance by enabling higher-sensitivity, method-level monitoring that generalizes across hospitals.
major comments (2)
- [Abstract] Abstract and methods section on the three-stage pipeline: the headline claim of 95% accuracy for primary self-harm method identification is presented without any reported precision, recall, or inter-rater agreement metrics for the LLM evidence-extraction stage against human gold labels. This metric is load-bearing for the granular-surveillance advantage and cannot be assessed from the binary-detection AUPRCs alone.
- [Results] Results on prospective validation: while AUPRCs are given with standard deviations for the binary task, no corresponding breakdown is supplied for method-identification performance at the external sites, leaving the transferability claim for the full pipeline unsupported.
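The metrics requested in the first major comment could be computed along the following lines. This is a sketch under assumptions: the method labels ("poisoning", "cutting") and the toy data are invented, and Cohen's kappa is used as one common agreement measure; the referee does not name a specific statistic.

```python
from collections import Counter


def per_class_precision_recall(gold, pred):
    """Return {label: (precision, recall)}; None where a ratio is undefined."""
    out = {}
    for lab in sorted(set(gold) | set(pred)):
        tp = sum(g == lab and p == lab for g, p in zip(gold, pred))
        fp = sum(g != lab and p == lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        precision = tp / (tp + fp) if tp + fp else None
        recall = tp / (tp + fn) if tp + fn else None
        out[lab] = (precision, recall)
    return out


def cohen_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[l] * count_b[l] for l in set(a) | set(b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both sequences use one identical label; a chosen convention
    return (observed - expected) / (1 - expected)


# Toy gold labels and model outputs, invented for illustration.
gold = ["poisoning", "poisoning", "cutting", "cutting"]
pred = ["poisoning", "cutting", "cutting", "cutting"]
print(per_class_precision_recall(gold, pred))
print(cohen_kappa(gold, pred))
```

The same `cohen_kappa` function applies to two human annotators, which is the inter-rater comparison the referee asks for, and to model output against gold labels.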
minor comments (1)
- [Abstract] The abstract states concrete AUPRC values with SDs but does not specify the number of positive cases or prevalence in each split; adding these would aid interpretation of the reported figures.
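Prevalence matters because the expected AUPRC of an uninformative ranking equals the positive-class prevalence, so the same AUPRC means different things at different base rates. The sketch below shows that comparison; the 5 percent prevalence is a made-up figure, since the summary does not report case counts, and the function names are hypothetical.

```python
def prevalence(y_true):
    """Fraction of positive labels; also the chance-level AUPRC."""
    return sum(y_true) / len(y_true)


def auprc_lift(auprc, y_true):
    """How many times the AUPRC exceeds the chance-level baseline."""
    return auprc / prevalence(y_true)


# Hypothetical split with 1 positive in 20 notes (5 percent prevalence).
labels = [1] + [0] * 19
print(prevalence(labels))
print(auprc_lift(0.887, labels))
```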
Simulated Author's Rebuttal
We appreciate the referee's thorough review and constructive suggestions. We will make revisions to provide the requested metrics and breakdowns to strengthen the claims regarding the LLM stage and full pipeline transferability.
- Referee: [Abstract] Abstract and methods section on the three-stage pipeline: the headline claim of 95% accuracy for primary self-harm method identification is presented without any reported precision, recall, or inter-rater agreement metrics for the LLM evidence-extraction stage against human gold labels. This metric is load-bearing for the granular-surveillance advantage and cannot be assessed from the binary-detection AUPRCs alone.
  Authors: We thank the referee for highlighting this. Upon review, we recognize that additional details on the LLM evidence-extraction stage are necessary to substantiate the 95% accuracy claim. In the revised version, we will report precision, recall, and inter-rater agreement for this stage based on human-annotated gold labels. This will be added to both the abstract and methods sections.
  Revision: yes
- Referee: [Results] Results on prospective validation: while AUPRCs are given with standard deviations for the binary task, no corresponding breakdown is supplied for method-identification performance at the external sites, leaving the transferability claim for the full pipeline unsupported.
  Authors: We agree that demonstrating transferability for the method-identification component is important. We will include the accuracy for primary self-harm method identification at each of the external sites in the prospective validation results section of the revised manuscript.
  Revision: yes
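The per-site breakdown promised in the second response amounts to grouping method-identification results by hospital. A minimal sketch, assuming each evaluated note is available as a `(site, gold_method, predicted_method)` tuple; the site names and labels below are placeholders, not data from the paper.

```python
from collections import defaultdict


def accuracy_by_site(records):
    """records: iterable of (site, gold_method, predicted_method) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for site, gold, pred in records:
        total[site] += 1
        correct[site] += int(gold == pred)
    return {site: correct[site] / total[site] for site in total}


# Placeholder records, invented for illustration.
records = [
    ("development", "poisoning", "poisoning"),
    ("development", "cutting", "cutting"),
    ("external_1", "poisoning", "poisoning"),
    ("external_1", "cutting", "poisoning"),
]
print(accuracy_by_site(records))
```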
Circularity Check
No circularity; performance metrics computed on independent held-out sets
The reported AUPRC values (internal 0.887, external 0.884, prospective 0.881/0.879/0.816) and 95% method-identification accuracy are presented as empirical results from evaluation on held-out internal, external, and prospective data partitions. No equations, fitted parameters, or self-citations are shown to reduce the validation numbers or the method-identification claim to the training inputs by construction. The derivation chain consists of standard train/validate/test splits plus an evidence-augmented pipeline whose outputs are measured against external labels, satisfying the self-contained criterion.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Triage notes contain sufficient explicit information about self-harm intent and method for automated extraction to be reliable.
Reference graph
Works this paper leans on
- [1] Gillies, D. et al. Prevalence and characteristics of self-harm in adolescents: meta-analyses of community-based studies 1990–2015. Journal of the American Academy of Child & Adolescent Psychiatry 57, 733–741 (2018)
- [2] Hawton, K., Saunders, K. E. & O’Connor, R. C. Self-harm and suicide in adolescents. The Lancet 379, 2373–2382 (2012)
- [3] Zubrick, S. R. et al. Self-harm: prevalence estimates from the second Australian child and adolescent survey of mental health and wellbeing. Australian & New Zealand Journal of Psychiatry 50, 911–921 (2016)
- [4] Australian Institute of Health and Welfare. Suicide and self-harm among young people. Suicide & Self-Harm Monitoring (2024). URL https://www.aihw.gov.au/suicide-self-harm-monitoring/population-groups/young-people/suicide-self-harm-young-people. Accessed: 2026-04-23
- [5] Carroll, R., Metcalfe, C. & Gunnell, D. Hospital presenting self-harm and risk of fatal and non-fatal repetition: systematic review and meta-analysis. PLOS ONE 9, e89944 (2014)
- [6] Ross, E., Murphy, S., O’Hagan, D., Maguire, A. & O’Reilly, D. Emergency department presentations with suicide and self-harm ideation: a missed opportunity for intervention? Epidemiology and Psychiatric Sciences 32, e24 (2023)
- [7] Gill, P. J. et al. Emergency department as a first contact for mental health problems in children and youth. Journal of the American Academy of Child & Adolescent Psychiatry 56, 475–482 (2017)
- [8] Silva, A. C. et al. Characteristics of surveillance systems for suicide and self-harm: A scoping review. PLOS Global Public Health 4, e0003292 (2024). URL https://journals.plos.org/globalpublichealth/article?id=10.1371/journal.pgph.0003292
- [9] Kuramoto-Crawford, S. J., Spies, E. L. & Davies-Cole, J. Detecting suicide-related emergency department visits among adults using the district of Columbia syndromic surveillance system. Public Health Reports 132, 88S–94S (2017)
- [10] Williams, S. Establishing a self-harm surveillance register to improve care in a general hospital. British Journal of Mental Health Nursing 4, 20–25 (2015). URL https://www.magonlinelibrary.com/doi/abs/10.12968/bjmh.2015.4.1.20
- [11] Whyte, I. M., Dawson, A. H., Carter, G. L., Levey, C. M. & Buckley, N. A. A model for the management of self-poisoning. Medical Journal of Australia 167, 142–146 (1997). URL https://onlinelibrary.wiley.com/doi/abs/10.5694/j.1326-5377.1997.tb138813.x
- [12] Bandara, P. et al. Surveillance of hospital-presenting intentional self-harm in western sydney, australia, during the implementation of a new self-harm reporting field. Crisis 44, 135–145 (2023). URL https://doi.org/10.1027/0227-5910/a000845. PMID: 35138153
- [13] Sveticic, J., Stapelberg, N. C. & Turner, K. Suicidal and self-harm presentations to emergency departments: The challenges of identification through diagnostic codes and presenting complaints. Health Information Management Journal 49, 38–46 (2020)
- [14] Randall, J. R., Roos, L. L., Lix, L. M., Katz, L. Y. & Bolton, J. M. Emergency department and inpatient coding for self-harm and suicide attempts: validation using clinician assessment data. International Journal of Methods in Psychiatric Research 26, e1559 (2017)
- [15] O’Malley, K. J. et al. Measuring diagnoses: ICD code accuracy. Health Services Research 40, 1620–1639 (2005). URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1475-6773.2005.00444.x
- [16] Rozova, V., Witt, K., Robinson, J., Li, Y. & Verspoor, K. Detection of self-harm and suicidal ideation in emergency department triage notes. Journal of the American Medical Informatics Association 29, 472–480 (2021). URL https://pmc.ncbi.nlm.nih.gov/articles/PMC8800520/
- [17] Obeid, J. S. et al. Identifying and predicting intentional self-harm in Electronic Health Record clinical notes: deep learning approach. JMIR Medical Informatics 8, e17784 (2020)
- [18] Ayre, K. et al. Developing a natural language processing tool to identify perinatal self-harm in Electronic Healthcare Records. PLOS ONE 16, e0253809 (2021)
- [19] Cusick, M. et al. Portability of natural language processing methods to detect suicidality from clinical text in US and UK electronic health records. Journal of Affective Disorders Reports 10, 100430 (2022). URL https://www.sciencedirect.com/science/article/pii/S2666915322001226
- [20] Rozova, V., Witt, K., Conway, M., Robinson, J. & Verspoor, K. Portability of an artificial intelligence model for self-harm detection across hospital settings (2025). URL https://www.medrxiv.org/content/10.1101/2025.07.10.25331160v1. ISSN: 3067-2007 Pages: 2025.07.10.25331160
- [21] Barak-Corren, Y. et al. Validation of an Electronic Health Record-based suicide risk prediction modeling approach across multiple health care systems. JAMA Network Open 3, e201262 (2020)
- [22] Holmes, G. et al. Applications of Large Language Models in the field of suicide prevention: Scoping review. Journal of Medical Internet Research 27, e63126 (2025). URL https://www.jmir.org/2025/1/e63126
- [23] Western Health. Western health annual report 2023–2024. Tech. Rep., Western Health (2024). URL https://www.parliament.vic.gov.au/4962d7/globalassets/tabled-paper-documents/tabled-paper-8807/western-health-annual-report-2023-2024.pdf
- [24] Alsentzer, E. et al. Publicly available clinical BERT embeddings. Proceedings of the 2nd Clinical Natural Language Processing Workshop, 72–78 (Association for Computational Linguistics, Minneapolis, Minnesota, USA, 2019). URL https://aclanthology.org/W19-1909/
- [25] Melbourne Health. Royal Melbourne Hospital annual report 2024–2025. Tech. Rep., The Royal Melbourne Hospital (2025). URL https://www.thermh.org.au/files/documents/Corporate/rmh-annual-report-2024-2025.pdf
- [26] Courty, B. et al. mlco2/codecarbon: v2.4.1 (2024). URL https://doi.org/10.5281/zenodo.11171501
- [27] Latrobe Regional Health. Latrobe Regional Health annual report 2024–
- [28] Tech. Rep., Latrobe Regional Health (2025). URL https://lrh.com.au/wp-content/uploads/2025/11/LRH-Annual-Report 2025-Web-Spreads.pdf
- [29] Agarwal, S. et al. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925 (2025)
- [30] Wang, X. et al. Self-consistency improves chain of thought reasoning in language models (2023). URL https://arxiv.org/abs/2203.11171. arXiv:2203.11171
- [31] Pedregosa, F. et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)
- [32] Ke, G. et al. LightGBM: a highly efficient gradient boosting decision tree. Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, 3149–3157 (Curran Associates, Inc., 2017)
- [33] Chen, T. & Guestrin, C. XGBoost: A scalable tree boosting system. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD ’16, 785–794 (Association for Computing Machinery, New York, NY, USA, 2016). URL https://doi.org/10.1145/2939672.2939785
- [34] Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 4171–4186 (Association for Computational Linguist...
- [35] Beltagy, I., Lo, K. & Cohan, A. SciBERT: A pretrained language model for scientific text. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 3615–3620 (Association for Computational Linguistics, 2019)
- [36] Gu, Y. et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Transactions on Computing for Healthcare 3, 1–23 (2021)
- [37] Liu, Y. et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
- [38] Gururangan, S. et al. Don’t stop pretraining: Adapt language models to domains and tasks. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 8342–8360 (Association for Computational Linguistics, Online, 2020). URL https://aclanthology.org/2020.acl-main.740/
- [39] Yang, X. et al. A Large Language Model for Electronic Health Records. npj Digital Medicine 5, 194 (2022).
Funding: KW is supported by a Dame Kate Campbell Fellowship from the Faculty of Medicine, Dentistry, and Health Sciences at The University of Melbourne and the Victorian Collaborative Centre for Mental Health and Wellbeing. JR is supported by an NHMRC Inve...