
arxiv: 2606.02545 · v1 · pith:YVOIN6MVnew · submitted 2026-06-01 · 💻 cs.CL

Transferable Self-Harm Surveillance from Emergency Department Triage Notes Using an Evidence-Augmented Machine Learning Approach

Pith reviewed 2026-06-28 14:53 UTC · model grok-4.3

classification 💻 cs.CL
keywords self-harm · emergency department · triage notes · machine learning · transferability · surveillance · method identification · large language models

The pith

An evidence-augmented machine learning model detects self-harm in emergency department triage notes, identifies the primary method with 95 percent accuracy, and largely maintains performance across different hospitals without site-specific retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents a three-stage system that combines standard machine learning classifiers with large language model screening and evidence extraction to flag self-harm cases directly from the free-text notes written at initial emergency contact. It reports area under the precision-recall curve (AUPRC) values near 0.88 in internal and external validation, and values of 0.881, 0.879, and 0.816 in prospective use at the development hospital and two external sites. The same pipeline also identifies the primary self-harm method with 95 percent accuracy, which goes beyond simple presence-or-absence detection. This matters for public-health surveillance because diagnostic codes currently miss many self-harm presentations, while triage notes are already recorded at the first point of care.

Core claim

The evidence-augmented machine learning approach identifies self-harm presentations in ED triage notes with internal and external AUPRCs of 0.887 and 0.884, sustains AUPRCs of 0.881, 0.879, and 0.816 at the development and two external sites in prospective use without site-specific retraining, and recovers the primary self-harm method with 95 percent accuracy.
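
For readers less familiar with the headline metric, the following is a minimal sketch of how an AUPRC can be computed as step-wise average precision. The abstract does not say which estimator the authors used, so the function name `average_precision`, the toy labels, and the toy scores are all illustrative assumptions rather than the paper's evaluation code.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Step-wise area under the precision-recall curve (average precision).

    labels: 1 for a self-harm note, 0 otherwise (hypothetical gold labels).
    scores: model scores, higher meaning more likely self-harm.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")
    # Rank notes from highest to lowest score; ties keep their input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_pos = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_pos += 1
            total += true_pos / rank  # precision at this recall step
    return total / n_pos


# Toy example: positives ranked 1st and 3rd out of four notes.
toy_ap = average_precision([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

Because ties are broken by input order here, a threshold-based implementation can give slightly different values when many notes share the same score.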

What carries the argument

A three-stage approach that augments traditional machine learning with large language model-based screening and evidence extraction.

If this is right

  • The method supports surveillance of self-harm that is more sensitive than diagnostic codes alone.
  • It supplies the primary self-harm method for each case, enabling granular rather than binary tracking.
  • The model transfers to new hospital sites without site-specific retraining, with prospective AUPRCs of 0.879 and 0.816 at the external sites against 0.881 at the development site.
  • It operates prospectively on live triage notes at both the original and external locations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hospitals could embed the model in existing electronic record systems to generate automated alerts for mental-health follow-up teams.
  • Aggregated method patterns across sites might reveal regional differences in self-harm behavior that are invisible to code-based statistics.
  • The same evidence-extraction stage could be tested on other conditions where triage notes already record patient intent, such as substance-use presentations.
  • Wider deployment would require safeguards for how sensitive note content is stored and audited after model processing.

Load-bearing premise

Triage notes contain enough explicit information about self-harm intent and method that the evidence-augmented classifier can recover it reliably.

What would settle it

Performance falling below an AUPRC of 0.75 or method-identification accuracy below 80 percent when the same model is applied to triage notes from a fourth hospital whose note style or documentation practices differ from the three tested sites.
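
The test stated above can be written down as a simple check. This is only a sketch: the two thresholds come from the sentence above, while `SiteResult`, `transfer_holds`, and the example figures are hypothetical names and values introduced for illustration.

```python
from dataclasses import dataclass

AUPRC_FLOOR = 0.75            # threshold quoted in the text above
METHOD_ACCURACY_FLOOR = 0.80  # threshold quoted in the text above


@dataclass
class SiteResult:
    site: str               # hypothetical site identifier
    auprc: float            # binary self-harm detection AUPRC at this site
    method_accuracy: float  # primary-method identification accuracy


def transfer_holds(result: SiteResult) -> bool:
    """Return True when neither metric falls below its floor."""
    return (
        result.auprc >= AUPRC_FLOOR
        and result.method_accuracy >= METHOD_ACCURACY_FLOOR
    )


# Made-up figures for an imagined fourth hospital.
fourth_site = SiteResult(site="fourth-hospital", auprc=0.74, method_accuracy=0.95)
claim_survives = transfer_holds(fourth_site)  # False: AUPRC is below 0.75
```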

Original abstract

Self-harm is a major public health concern, but current surveillance relying on hospital presentations is inadequate due to the low sensitivity of diagnostic codes. Emergency Department (ED) triage notes, recorded at the initial point of contact, provide a succinct summary of presentations and an opportunity to identify self-harm. We developed a three-stage approach, augmenting traditional machine learning with large language model-based screening and evidence extraction to detect self-harm in ED triage notes. We assessed model transferability across three Australian hospitals. Our approach showed AUPRCs of 0.887 +/- 0.016 and 0.884 +/- 0.012 during internal and external validation. Prospectively, it achieved AUPRC of 0.881 +/- 0.008 at the development site, and 0.879 +/- 0.012 and 0.816 +/- 0.015 at two external sites without site-specific retraining. A key advantage of the approach is that it enables identification of the primary self-harm method with an accuracy of 95%, supporting more granular surveillance beyond binary classification.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper presents a three-stage evidence-augmented machine learning pipeline that combines traditional classifiers with LLM-based screening and evidence extraction to identify self-harm from ED triage notes. It reports AUPRCs of 0.887±0.016 (internal), 0.884±0.012 (external), and 0.881±0.008 / 0.879±0.012 / 0.816±0.015 (prospective across sites) without site-specific retraining, plus a secondary claim of 95% accuracy in identifying the primary self-harm method.

Significance. If the transferability results hold under proper validation of all pipeline stages, the work would offer a practical advance over ICD-code surveillance by enabling higher-sensitivity, method-level monitoring that generalizes across hospitals.

major comments (2)
  1. [Abstract] Abstract and methods section on the three-stage pipeline: the headline claim of 95% accuracy for primary self-harm method identification is presented without any reported precision, recall, or inter-rater agreement metrics for the LLM evidence-extraction stage against human gold labels. This metric is load-bearing for the granular-surveillance advantage and cannot be assessed from the binary-detection AUPRCs alone.
  2. [Results] Results on prospective validation: while AUPRCs are given with standard deviations for the binary task, no corresponding breakdown is supplied for method-identification performance at the external sites, leaving the transferability claim for the full pipeline unsupported.
minor comments (1)
  1. [Abstract] The abstract states concrete AUPRC values with SDs but does not specify the number of positive cases or prevalence in each split; adding these would aid interpretation of the reported figures.
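
The minor comment matters because the expected AUPRC of a classifier that ranks notes at random is roughly the prevalence of positives, so the same AUPRC means different things at different prevalences. A minimal sketch follows; the case counts are invented for illustration, since the abstract reports none, and `prevalence` and `lift_over_chance` are hypothetical helper names.

```python
def prevalence(n_positive: int, n_total: int) -> float:
    """Fraction of notes that are true self-harm presentations."""
    if n_total <= 0 or not 0 <= n_positive <= n_total:
        raise ValueError("counts must satisfy 0 <= n_positive <= n_total, n_total > 0")
    return n_positive / n_total


def lift_over_chance(auprc: float, prev: float) -> float:
    """Ratio of the observed AUPRC to the approximate random-ranking baseline."""
    if prev <= 0:
        raise ValueError("prevalence must be positive")
    return auprc / prev


# Invented splits: the same AUPRC of 0.887 at two very different prevalences.
rare = lift_over_chance(0.887, prevalence(50, 1000))     # prevalence 0.05
common = lift_over_chance(0.887, prevalence(500, 1000))  # prevalence 0.50
```

Under these assumed counts the lift over chance differs tenfold between the two splits, which is why per-split positive counts help a reader interpret the reported figures.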

Simulated Author's Rebuttal

2 responses · 0 unresolved

We appreciate the referee's thorough review and constructive suggestions. We will make revisions to provide the requested metrics and breakdowns to strengthen the claims regarding the LLM stage and full pipeline transferability.

Point-by-point responses
  1. Referee: [Abstract] Abstract and methods section on the three-stage pipeline: the headline claim of 95% accuracy for primary self-harm method identification is presented without any reported precision, recall, or inter-rater agreement metrics for the LLM evidence-extraction stage against human gold labels. This metric is load-bearing for the granular-surveillance advantage and cannot be assessed from the binary-detection AUPRCs alone.

    Authors: We thank the referee for highlighting this. Upon review, we recognize that additional details on the LLM evidence-extraction stage are necessary to substantiate the 95% accuracy claim. In the revised version, we will report precision, recall, and inter-rater agreement for this stage based on human-annotated gold labels. This will be added to both the abstract and methods sections. revision: yes

  2. Referee: [Results] Results on prospective validation: while AUPRCs are given with standard deviations for the binary task, no corresponding breakdown is supplied for method-identification performance at the external sites, leaving the transferability claim for the full pipeline unsupported.

    Authors: We agree that demonstrating transferability for the method-identification component is important. We will include the accuracy for primary self-harm method identification at each of the external sites in the prospective validation results section of the revised manuscript. revision: yes
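
The metrics the referee asks for, and the authors agree to report, can be computed from paired label lists. The sketch below assumes each case carries a single primary-method label; the category names and toy data are hypothetical and do not reflect the paper's label scheme.

```python
from collections import Counter
from typing import Sequence, Tuple


def precision_recall(gold: Sequence[str], pred: Sequence[str], target: str) -> Tuple[float, float]:
    """One-vs-rest precision and recall for a single method category."""
    pairs = list(zip(gold, pred))
    tp = sum(g == target and p == target for g, p in pairs)
    fp = sum(g != target and p == target for g, p in pairs)
    fn = sum(g == target and p != target for g, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def cohen_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa: agreement between two label lists, corrected for chance."""
    n = len(rater_a)
    if n == 0 or n != len(rater_b):
        raise ValueError("label lists must be non-empty and of equal length")
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Toy labels with hypothetical method categories.
gold = ["poisoning", "cutting", "poisoning", "other"]
pred = ["poisoning", "poisoning", "poisoning", "other"]
poisoning_pr = precision_recall(gold, pred, "poisoning")
kappa = cohen_kappa(gold, pred)
```

The same `cohen_kappa` function applies whether the two lists come from two human annotators or from one annotator and the extraction stage.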

Circularity Check

0 steps flagged

No circularity; performance metrics computed on independent held-out sets

full rationale

The reported AUPRC values (internal 0.887, external 0.884, prospective 0.881/0.879/0.816) and 95% method-identification accuracy are presented as empirical results from evaluation on held-out internal, external, and prospective data partitions. No equations, fitted parameters, or self-citations are shown to reduce the validation numbers or the method-identification claim to the training inputs by construction. The derivation chain consists of standard train/validate/test splits plus an evidence-augmented pipeline whose outputs are measured against external labels, satisfying the self-contained criterion.
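
The independence of the evaluation partitions is taken on trust here, since only the abstract was reviewed. Given note identifiers, it could be verified with a check like the one below; the partition names and identifiers are hypothetical, and `overlapping_partitions` is an illustrative helper, not part of the paper.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple


def overlapping_partitions(partitions: Dict[str, Set[str]]) -> List[Tuple[str, str, Set[str]]]:
    """List every pair of partitions that shares at least one note identifier."""
    leaks = []
    for (name_a, ids_a), (name_b, ids_b) in combinations(partitions.items(), 2):
        shared = ids_a & ids_b
        if shared:
            leaks.append((name_a, name_b, shared))
    return leaks


# Hypothetical identifiers; note "n2" appears in two partitions.
splits = {
    "train": {"n1", "n2"},
    "internal_test": {"n3"},
    "external": {"n2", "n4"},
}
leaks = overlapping_partitions(splits)
```

An empty result is what the "independent held-out sets" statement above would require.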

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review supplies almost no detail on modeling assumptions or data-processing choices; the central claims rest on the unstated premise that triage notes are sufficiently informative.

axioms (1)
  • Domain assumption: Triage notes contain sufficient explicit information about self-harm intent and method for automated extraction to be reliable.
    Required for both the AUPRC and 95% method-identification results to generalize.

