Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters
Pith reviewed 2026-05-15 12:42 UTC · model grok-4.3
The pith
Linguistic patterns in primary care conversations allow automated models to detect depression with useful accuracy from transcripts alone.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Zero-shot GPT-OSS applied to combined dyadic transcripts from primary care encounters achieves the highest performance for detecting PHQ-9-defined depression (AUPRC 0.510, AUROC 0.774), outperforming supervised baselines such as Sentence-BERT plus logistic regression and LIWC plus logistic regression. Usable signal is already present in the first 128 patient tokens (AUPRC 0.356, AUROC 0.675), and provider linguistic mirroring adds signal that neither speaker's words carry in isolation.
What carries the argument
Zero-shot GPT-OSS applied directly to full patient-provider transcripts, with performance evaluated by AUPRC and AUROC against PHQ-9 labels and with explicit measurement of single-speaker versus dyadic input plus provider mirroring as an additive feature.
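The page does not say how the paper quantifies provider mirroring. One common operationalization in the communication literature is linguistic style matching (LSM), which compares how often two speakers use each function-word category. The sketch below assumes that measure purely for illustration; the tiny word lists, the `CATEGORIES` table, and both function names are hypothetical stand-ins, not the paper's features.

```python
# Hypothetical, heavily abbreviated function-word categories (illustration only).
CATEGORIES = {
    "pronoun": {"i", "you", "we", "it", "me", "my"},
    "negation": {"no", "not", "never"},
    "article": {"a", "an", "the"},
}


def category_rates(text):
    """Share of whitespace tokens that fall in each category."""
    words = text.lower().split()
    n = max(len(words), 1)  # avoid division by zero on empty input
    return {c: sum(w in vocab for w in words) / n for c, vocab in CATEGORIES.items()}


def style_matching(patient_text, provider_text):
    """Mean per-category LSM: 1 - |p - q| / (p + q + 0.0001)."""
    p = category_rates(patient_text)
    q = category_rates(provider_text)
    per_category = [
        1 - abs(p[c] - q[c]) / (p[c] + q[c] + 0.0001) for c in CATEGORIES
    ]
    return sum(per_category) / len(per_category)


if __name__ == "__main__":
    same = style_matching("i do not know the answer", "i do not know the answer")
    different = style_matching("i i i", "the the the")
    print(same, different)
```

A score near 1 means the two speakers use the categories at similar rates. A score like this needs both speakers' text to compute, so it cannot be derived from a single-speaker transcript.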
If this is right
- Detection becomes feasible in real time during the visit rather than after it ends.
- Digital scribing systems can supply the input transcripts without requiring patients to complete extra questionnaires.
- Provider mirroring supplies an independent signal that improves accuracy when both sides of the conversation are analyzed together.
- Useful performance appears early enough in the encounter to influence clinical decisions before the visit concludes.
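The early-detection claim rests on scoring only the first 128 patient tokens. A minimal sketch of that truncation step follows, assuming a transcript stored as `(speaker, text)` turns and whitespace tokenization; the paper's actual tokenizer and data layout are not given here, so `first_patient_tokens` and its parameters are hypothetical.

```python
def first_patient_tokens(turns, budget=128, speaker="patient"):
    """Return the first `budget` whitespace tokens spoken by `speaker`.

    `turns` is an ordered list of (speaker, text) pairs. Turns from other
    speakers are skipped, so the budget counts patient speech only.
    """
    if budget <= 0:
        return []
    tokens = []
    for who, text in turns:
        if who != speaker:
            continue
        for tok in text.split():
            tokens.append(tok)
            if len(tokens) == budget:
                return tokens  # stop as soon as the budget is filled
    return tokens  # transcript ended before the budget was reached


if __name__ == "__main__":
    demo = [
        ("provider", "How are you today"),
        ("patient", "I feel tired all the time"),
        ("provider", "ok"),
        ("patient", "and sad"),
    ]
    print(first_patient_tokens(demo, budget=4))
```

Because the function returns as soon as the budget is filled, a streaming system could call a classifier at that moment rather than waiting for the visit to end.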
Where Pith is reading between the lines
- If the linguistic markers prove stable across clinics and populations, audio-based screening could lower underdiagnosis rates without adding patient burden.
- The approach might extend naturally to tracking changes in depression indicators over multiple visits for the same patient.
- Integration with existing electronic health record systems could flag high-likelihood cases for follow-up without requiring new hardware.
Load-bearing premise
PHQ-9 scores serve as an unbiased ground truth for depression whose linguistic correlates are not driven by visit length, topic, or other unmeasured factors in this particular patient group.
What would settle it
Apply the same zero-shot model to a fresh set of audio-recorded primary care visits collected without PHQ-9 knowledge, then compare model predictions against independently collected PHQ-9 scores obtained after the visit to check whether the reported AUPRC holds.
Original abstract
Depression is underdiagnosed in primary care, yet timely identification remains critical. Recorded clinical encounters, increasingly common with digital scribing technologies, present an opportunity to detect depression from naturalistic dialogue. We investigated automated depression detection from 1,108 audio-recorded primary care encounters in the Establishing Focus study, with depression defined by PHQ-9 (n=253 depressed, n=855 non-depressed). We compared three supervised approaches, Sentence-BERT + Logistic Regression (LR), LIWC+LR and ModernBERT, against a zero-shot GPT-OSS. GPT-OSS achieved the strongest performance (AUPRC=0.510, AUROC=0.774), with LIWC+LR competitive among supervised models (AUPRC=0.500, AUROC=0.742). Combined dyadic transcripts outperformed single-speaker configurations, with providers linguistically mirroring patients in depression encounters, an additive signal not captured by either speaker alone. Meaningful detection is achievable from the first 128 patient tokens (AUPRC=0.356, AUROC=0.675), supporting in-the-moment clinical decision support. These findings argue for passively collected clinical audio as a low-burden complement to existing screening workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper reports an empirical evaluation of automated depression detection from 1,108 dyadic primary care encounter transcripts, defining depression via PHQ-9 scores (253 positive cases). It compares Sentence-BERT+LR, LIWC+LR, ModernBERT, and zero-shot GPT-OSS, finding GPT-OSS strongest (AUPRC 0.510, AUROC 0.774 on full dyadic transcripts) with meaningful performance from the first 128 patient tokens (AUPRC 0.356) and evidence of provider mirroring as an additive signal in depression encounters.
Significance. If the central performance claims hold after addressing ground-truth limitations, the work offers a scalable, low-burden approach to augmenting depression screening in routine care using passively recorded audio. The scale of the dataset, use of AUPRC for class imbalance, and demonstration of early-token detection constitute concrete strengths for clinical NLP.
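Since only 253 of the 1,108 encounters are PHQ-9 positive, a random ranker would score an AUPRC near the positive rate, and that is the yardstick for reading the reported 0.510 and 0.356. The sketch below is a minimal plain-Python illustration of both metrics, not the paper's evaluation code; the function names and the toy data are hypothetical.

```python
def auroc(labels, scores):
    """Probability that a random positive outranks a random negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Step-wise area under the precision-recall curve.

    Averages the precision at the rank of each positive. Assumes no tied
    scores; with ties the result depends on sort order.
    """
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    n_pos = sum(labels)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / n_pos


# Chance-level AUPRC is approximately the positive rate in the sample.
prevalence = 253 / 1108

if __name__ == "__main__":
    toy_labels = [1, 0, 1, 0]
    toy_scores = [0.9, 0.8, 0.7, 0.1]
    print(auroc(toy_labels, toy_scores), average_precision(toy_labels, toy_scores))
    print(round(prevalence, 3))
```

With a positive rate of about 0.228, the full-transcript AUPRC of 0.510 is a little over twice chance, while AUROC has a fixed chance level of 0.5 regardless of class balance.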
major comments (2)
- [Abstract] Abstract and methods: Depression is defined solely by PHQ-9 threshold without reported validation against clinician diagnosis, structured interviews, or sensitivity analyses decoupling PHQ-9 from visit-level confounders (topic, length, somatic complaints). This is load-bearing for the claim of 'depression detection at the point of care' because the observed signals (mirroring, first-128-token performance) may track self-report correlates rather than core depressive phenomenology.
- [Results] Results: No details are provided on the validation strategy (e.g., patient-level vs. encounter-level splits), statistical testing for performance differences, or controls for potential confounds such as encounter duration or chief complaint. These omissions prevent assessment of whether the reported AUPRC advantage for GPT-OSS and dyadic transcripts is robust.
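The distinction between patient-level and encounter-level splits matters when one patient contributes several encounters: an encounter-level split can put the same person in both training and test data. The following sketch shows one way to build patient-level stratified folds in plain Python. It is an assumption-laden illustration: `patient_level_folds` is a hypothetical helper, and treating a patient as positive when any of their encounters is positive is a choice made here, not something the paper states.

```python
import random
from collections import defaultdict


def patient_level_folds(patient_ids, labels, k=5, seed=0):
    """Return one fold index per encounter; a patient never spans two folds."""
    # Assumption: a patient is positive if any of their encounters is positive.
    patient_label = defaultdict(int)
    for pid, y in zip(patient_ids, labels):
        patient_label[pid] = max(patient_label[pid], y)

    rng = random.Random(seed)
    fold_of = {}
    for stratum in (0, 1):
        members = sorted(p for p, y in patient_label.items() if y == stratum)
        rng.shuffle(members)
        for i, pid in enumerate(members):
            # Round-robin within each class keeps folds roughly balanced.
            fold_of[pid] = i % k
    return [fold_of[pid] for pid in patient_ids]


if __name__ == "__main__":
    ids = ["a", "a", "b", "c", "c", "d", "e", "f"]
    ys = [1, 1, 0, 0, 0, 1, 0, 0]
    print(patient_level_folds(ids, ys, k=2))
```

Encounters in fold `j` form the test set for round `j` and all other folds form the training set, so every encounter from a given patient lands on the same side of the split.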
minor comments (2)
- [Methods] Clarify the exact definition and implementation of 'GPT-OSS' (model size, prompting strategy, zero-shot setup) to enable replication.
- [Abstract] The abstract states 'meaningful detection' from 128 tokens but does not quantify what threshold of clinical utility this AUPRC represents.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which have helped us strengthen the manuscript. We address each major point below and have revised the paper to incorporate additional methodological details, sensitivity analyses, and expanded discussion of limitations.
- Referee: [Abstract] Abstract and methods: Depression is defined solely by PHQ-9 threshold without reported validation against clinician diagnosis, structured interviews, or sensitivity analyses decoupling PHQ-9 from visit-level confounders (topic, length, somatic complaints). This is load-bearing for the claim of 'depression detection at the point of care' because the observed signals (mirroring, first-128-token performance) may track self-report correlates rather than core depressive phenomenology.
  Authors: We agree that defining depression solely via PHQ-9 threshold is a limitation, as PHQ-9 is a self-report screening tool rather than a clinician diagnosis or structured interview. While PHQ-9 is the standard instrument in primary care and has established validity against DSM criteria in the literature, we acknowledge that observed signals could partly reflect self-report correlates or visit-level factors. In the revised manuscript we have added an expanded limitations section with relevant citations, and we performed new sensitivity analyses controlling for encounter duration, chief complaint category, and topic (via TF-IDF features). These controls reduced AUPRC by at most 0.03 while preserving the relative ordering of models and the early-token signal. We have also clarified in the abstract and discussion that the work targets detection of PHQ-9-positive cases in routine encounters rather than formal diagnosis.
  Revision: yes
- Referee: [Results] Results: No details are provided on the validation strategy (e.g., patient-level vs. encounter-level splits), statistical testing for performance differences, or controls for potential confounds such as encounter duration or chief complaint. These omissions prevent assessment of whether the reported AUPRC advantage for GPT-OSS and dyadic transcripts is robust.
  Authors: We appreciate this observation. The original submission omitted these details to meet length constraints. The revised Methods section now specifies patient-level stratified 5-fold cross-validation (ensuring no patient appears in both train and test folds) and reports 95% confidence intervals obtained via 1,000 bootstrap resamples. We added paired bootstrap tests confirming that GPT-OSS significantly outperforms the next-best model (p<0.01 for AUPRC). We further include linear regression controls for encounter duration and chief-complaint category; the dyadic advantage and GPT-OSS superiority remain statistically significant after these adjustments. These results are now presented in a new supplementary table and referenced in the main Results section.
  Revision: yes
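The rebuttal mentions 1,000 bootstrap resamples and a paired bootstrap test but does not spell out the procedure. The sketch below shows one common form, assuming both models are scored on the same encounters and resampled with the same indices; `paired_bootstrap`, the percentile interval, and the one-sided p-value definition are illustrative choices, not the authors' confirmed method.

```python
import random


def average_precision(labels, scores):
    """Step-wise AUPRC; ties are broken by input order (a simplification)."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    return total / sum(labels)


def paired_bootstrap(labels, scores_a, scores_b, metric, n_boot=1000, seed=0):
    """Bootstrap the metric difference (A minus B) over shared resamples.

    Returns (lower, upper, p), where lower/upper bound an approximate 95%
    percentile interval and p is the share of resamples in which A fails
    to beat B.
    """
    rng = random.Random(seed)
    n = len(labels)
    diffs = []
    while len(diffs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        y = [labels[i] for i in idx]
        if sum(y) == 0 or sum(y) == n:
            continue  # resample lacks one class, so the metric is undefined
        a = metric(y, [scores_a[i] for i in idx])
        b = metric(y, [scores_b[i] for i in idx])
        diffs.append(a - b)
    diffs.sort()
    lower = diffs[(25 * n_boot) // 1000]
    upper = diffs[min(n_boot - 1, (975 * n_boot) // 1000)]
    p = sum(d <= 0 for d in diffs) / n_boot
    return lower, upper, p


if __name__ == "__main__":
    y = [1, 0, 1, 0, 1, 0, 0, 0, 1, 0]
    good = [0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.15, 0.25, 0.6, 0.05]
    bad = [1 - s for s in good]
    print(paired_bootstrap(y, good, bad, average_precision, n_boot=200))
```

Resampling the same indices for both models is what makes the test paired: each difference is computed on identical encounters, so case-to-case variation in difficulty cancels out. With multiple encounters per patient, resampling patients instead of encounters would be the more conservative variant.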
Circularity Check
No circularity in empirical ML evaluation on held-out data
The paper conducts a standard supervised and zero-shot classification study on 1,108 primary care transcripts, using PHQ-9 scores to define binary labels and reporting AUPRC/AUROC for Sentence-BERT+LR, LIWC+LR, ModernBERT, and GPT-OSS. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the reported results. Performance metrics are computed on test splits rather than obtained from the inputs by construction, so the evaluation is self-contained.
Axiom & Free-Parameter Ledger
free parameters (1)
- PHQ-9 depression threshold
axioms (1)
- domain assumption: PHQ-9 score accurately represents depression status in primary care patients
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: GPT-OSS achieved the strongest performance (AUPRC=0.510, AUROC=0.774) on detecting depression defined by PHQ-9 from dyadic transcripts... providers linguistically mirroring patients in depression encounters
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, absolute_floor_iff_bare_distinguishability (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: LIWC+LR competitive among supervised models (AUPRC=0.500, AUROC=0.742)... top features: emo_sad, mental, home, memory, substances
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.