pith. sign in

arxiv: 2604.06193 · v1 · submitted 2026-03-11 · 💻 cs.CL · cs.AI

Depression Detection at the Point of Care: Automated Analysis of Linguistic Signals from Routine Primary Care Encounters

Pith reviewed 2026-05-15 12:42 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords depression detectionprimary careclinical transcriptslinguistic analysisnatural language processingPHQ-9dyadic conversation
0
0 comments X

The pith

Linguistic patterns in primary care conversations allow automated models to detect depression with useful accuracy from transcripts alone.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether depression can be identified from the natural back-and-forth talk in ordinary doctor visits by applying language models to full transcripts. It compares several approaches on 1,108 recorded encounters and finds that a zero-shot large language model reaches the best results when given the combined words of patient and provider. Performance remains meaningful even when limited to the patient's first 128 tokens, and the models pick up an extra signal from the way providers mirror patient language in depression cases. The work positions this as a low-effort addition to existing screening that could run in the background during routine visits.

Core claim

Zero-shot application of GPT-OSS to combined dyadic transcripts from primary care encounters achieves the highest detection performance for depression defined by PHQ-9 (AUPRC 0.510, AUROC 0.774), outperforming supervised baselines such as Sentence-BERT plus logistic regression and LIWC plus logistic regression; the same models extract usable signal from the first 128 patient tokens alone and benefit from provider linguistic mirroring that is not present in either speaker's words in isolation.

What carries the argument

Zero-shot GPT-OSS applied directly to full patient-provider transcripts, with performance evaluated by AUPRC and AUROC against PHQ-9 labels and with explicit measurement of single-speaker versus dyadic input plus provider mirroring as an additive feature.

If this is right

  • Detection becomes feasible in real time during the visit rather than after it ends.
  • Digital scribing systems can supply the input transcripts without requiring patients to complete extra questionnaires.
  • Provider mirroring supplies an independent signal that improves accuracy when both sides of the conversation are analyzed together.
  • Useful performance appears early enough in the encounter to influence clinical decisions before the visit concludes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the linguistic markers prove stable across clinics and populations, audio-based screening could lower underdiagnosis rates without adding patient burden.
  • The approach might extend naturally to tracking changes in depression indicators over multiple visits for the same patient.
  • Integration with existing electronic health record systems could flag high-likelihood cases for follow-up without requiring new hardware.

Load-bearing premise

PHQ-9 scores serve as an unbiased ground truth for depression whose linguistic correlates are not driven by visit length, topic, or other unmeasured factors in this particular patient group.

What would settle it

Apply the same zero-shot model to a fresh set of audio-recorded primary care visits collected without PHQ-9 knowledge, then compare model predictions against independently collected PHQ-9 scores obtained after the visit to check whether the reported AUPRC holds.

read the original abstract

Depression is underdiagnosed in primary care, yet timely identification remains critical. Recorded clinical encounters, increasingly common with digital scribing technologies, present an opportunity to detect depression from naturalistic dialogue. We investigated automated depression detection from 1,108 audio-recorded primary care encounters in the Establishing Focus study, with depression defined by PHQ-9 (n=253 depressed, n=855 non-depressed). We compared three supervised approaches, Sentence-BERT + Logistic Regression (LR), LIWC+LR and ModernBERT, against a zero-shot GPT-OSS. GPT-OSS achieved the strongest performance (AUPRC=0.510, AUROC=0.774), with LIWC+LR competitive among supervised models (AUPRC=0.500, AUROC=0.742). Combined dyadic transcripts outperformed single-speaker configurations, with providers linguistically mirroring patients in depression encounters, an additive signal not captured by either speaker alone. Meaningful detection is achievable from the first 128 patient tokens (AUPRC=0.356, AUROC=0.675), supporting in-the-moment clinical decision support. These findings argue for passively collected clinical audio as a low-burden complement to existing screening workflows.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper reports an empirical evaluation of automated depression detection from 1,108 dyadic primary care encounter transcripts, defining depression via PHQ-9 scores (253 positive cases). It compares Sentence-BERT+LR, LIWC+LR, ModernBERT, and zero-shot GPT-OSS, finding GPT-OSS strongest (AUPRC 0.510, AUROC 0.774 on full dyadic transcripts) with meaningful performance from the first 128 patient tokens (AUPRC 0.356) and evidence of provider mirroring as an additive signal in depression encounters.

Significance. If the central performance claims hold after addressing ground-truth limitations, the work offers a scalable, low-burden approach to augmenting depression screening in routine care using passively recorded audio. The scale of the dataset, use of AUPRC for class imbalance, and demonstration of early-token detection constitute concrete strengths for clinical NLP.

major comments (2)
  1. [Abstract] Abstract and methods: Depression is defined solely by PHQ-9 threshold without reported validation against clinician diagnosis, structured interviews, or sensitivity analyses decoupling PHQ-9 from visit-level confounders (topic, length, somatic complaints). This is load-bearing for the claim of 'depression detection at the point of care' because the observed signals (mirroring, first-128-token performance) may track self-report correlates rather than core depressive phenomenology.
  2. [Results] Results: No details are provided on the validation strategy (e.g., patient-level vs. encounter-level splits), statistical testing for performance differences, or controls for potential confounds such as encounter duration or chief complaint. These omissions prevent assessment of whether the reported AUPRC advantage for GPT-OSS and dyadic transcripts is robust.
minor comments (2)
  1. [Methods] Clarify the exact definition and implementation of 'GPT-OSS' (model size, prompting strategy, zero-shot setup) to enable replication.
  2. [Abstract] The abstract states 'meaningful detection' from 128 tokens but does not quantify what threshold of clinical utility this AUPRC represents.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us strengthen the manuscript. We address each major point below and have revised the paper to incorporate additional methodological details, sensitivity analyses, and expanded discussion of limitations.

read point-by-point responses
  1. Referee: [Abstract] Abstract and methods: Depression is defined solely by PHQ-9 threshold without reported validation against clinician diagnosis, structured interviews, or sensitivity analyses decoupling PHQ-9 from visit-level confounders (topic, length, somatic complaints). This is load-bearing for the claim of 'depression detection at the point of care' because the observed signals (mirroring, first-128-token performance) may track self-report correlates rather than core depressive phenomenology.

    Authors: We agree that defining depression solely via PHQ-9 threshold is a limitation, as PHQ-9 is a self-report screening tool rather than a clinician diagnosis or structured interview. While PHQ-9 is the standard instrument in primary care and has established validity against DSM criteria in the literature, we acknowledge that observed signals could partly reflect self-report correlates or visit-level factors. In the revised manuscript we have added an expanded limitations section with relevant citations, and we performed new sensitivity analyses controlling for encounter duration, chief complaint category, and topic (via TF-IDF features). These controls reduced AUPRC by at most 0.03 while preserving the relative ordering of models and the early-token signal. We have also clarified in the abstract and discussion that the work targets detection of PHQ-9-positive cases in routine encounters rather than formal diagnosis. revision: yes

  2. Referee: [Results] Results: No details are provided on the validation strategy (e.g., patient-level vs. encounter-level splits), statistical testing for performance differences, or controls for potential confounds such as encounter duration or chief complaint. These omissions prevent assessment of whether the reported AUPRC advantage for GPT-OSS and dyadic transcripts is robust.

    Authors: We appreciate this observation. The original submission omitted these details to meet length constraints. The revised Methods section now specifies patient-level stratified 5-fold cross-validation (ensuring no patient appears in both train and test folds) and reports 95% confidence intervals obtained via 1,000 bootstrap resamples. We added paired bootstrap tests confirming that GPT-OSS significantly outperforms the next-best model (p<0.01 for AUPRC). We further include linear regression controls for encounter duration and chief-complaint category; the dyadic advantage and GPT-OSS superiority remain statistically significant after these adjustments. These results are now presented in a new supplementary table and referenced in the main Results section. revision: yes

Circularity Check

0 steps flagged

No circularity in empirical ML evaluation on held-out data

full rationale

The paper conducts a standard supervised and zero-shot classification study on 1,108 held-out primary care transcripts, using PHQ-9 scores as binary labels and reporting AUPRC/AUROC for models including Sentence-BERT+LR, LIWC+LR, ModernBERT, and GPT-OSS. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the reported results. Performance metrics are computed directly on independent test splits without any reduction to inputs by construction, rendering the evaluation self-contained.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the validity of PHQ-9 as ground truth and the assumption that linguistic signals are generalizable rather than cohort-specific.

free parameters (1)
  • PHQ-9 depression threshold
    Cutoff used to define depressed vs non-depressed cases is a clinical standard but not explicitly stated or varied in the abstract.
axioms (1)
  • domain assumption PHQ-9 score accurately represents depression status in primary care patients
    Used directly as ground truth without additional clinical validation in the reported study.

pith-pipeline@v0.9.0 · 5536 in / 1247 out tokens · 90645 ms · 2026-05-15T12:42:07.176778+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

42 extracted references · 42 canonical work pages · 3 internal anchors

  1. [1]

    For each feature, group means are reported for the non-depression (n=855) and depression (n=253) groups, along with the t-statistic from a two-sample t-test

    LIWC-22 features comparisons between depression and non-depression groups by speaker configuration. For each feature, group means are reported for the non-depression (n=855) and depression (n=253) groups, along with the t-statistic from a two-sample t-test. Negative t-statistics indicate higher values in the depression group. All features shown are statis...

  2. [2]

    Risk factors for suicide in individuals with depression: a systematic review

    Hawton K, Casañas I Comabella C, Haw C, Saunders K. Risk factors for suicide in individuals with depression: a systematic review. J Affect Disord . 2013;147(1-3):17-28. doi:10.1016/j.jad.2013.01.004

  3. [3]

    Depression and public health: an overview

    Cassano P, Fava M. Depression and public health: an overview. J Psychosom Res . 2002;53(4):849-857. doi:10.1016/s0022-3999(02)00304-5

  4. [4]

    Depression: the benefits of early and appropriate treatment

    Halfin A. Depression: the benefits of early and appropriate treatment. Am J Manag Care . 2007;13(4 Suppl):S92-97

  5. [5]

    Clinical diagnosis of depression in primary care: a meta-analysis

    Mitchell AJ, Vaze A, Rao S. Clinical diagnosis of depression in primary care: a meta-analysis. The Lancet . 2009;374(9690):609-619. doi:10.1016/S0140-6736(09)60879-5

  6. [6]

    Prevalence of and Factors Associated With Patient Nondisclosure of Medically Relevant Information to Clinicians

    Levy AG, Scherer AM, Zikmund-Fisher BJ, Larkin K, Barnes GD, Fagerlin A. Prevalence of and Factors Associated With Patient Nondisclosure of Medically Relevant Information to Clinicians. JAMA Netw Open . 2018;1(7):e185293. doi:10.1001/jamanetworkopen.2018.5293

  7. [7]

    Screening for Depression in Adults: US Preventive Services Task Force Recommendation Statement

    Siu AL, and the US Preventive Services Task Force (USPSTF). Screening for Depression in Adults: US Preventive Services Task Force Recommendation Statement. JAMA . 2016;315(4):380-387. doi:10.1001/jama.2015.18392

  8. [8]

    Screening Adults for Depression in Primary Care

    Smithson S, Pignone MP. Screening Adults for Depression in Primary Care. Med Clin North Am . 2017;101(4):807-821. doi:10.1016/j.mcna.2017.03.010

  9. [9]

    Improving Depression Screening in Primary Care

    Lindsay M, Decker VB. Improving Depression Screening in Primary Care. J Doct Nurs Pract . 2022;15(2):84-90. doi:10.1891/JDNP-2021-0005

  10. [10]

    Improving the Reporting of Primary Care Research: An International Survey of Researchers

    Phillips WR, Sturgiss E, Hunik L, et al. Improving the Reporting of Primary Care Research: An International Survey of Researchers. J Am Board Fam Med . 2021;34(1):12-21. doi:10.3122/jabfm.2021.01.200266

  11. [11]

    Patient and Health Care Professional Perspectives on Stigma in Integrated Behavioral Health: Barriers and Recommendations

    Phelan SM, Salinas M, Pankey T, et al. Patient and Health Care Professional Perspectives on Stigma in Integrated Behavioral Health: Barriers and Recommendations. Ann Fam Med . 2023;21(Suppl 2):S56-S60. doi:10.1370/afm.2924

  12. [12]

    Optimizing patient check-in process for telehealth visits: a data-driven perspective

    Khashu K. Optimizing patient check-in process for telehealth visits: a data-driven perspective. Front Digit Health . 2025;7:1554762. doi:10.3389/fdgth.2025.1554762

  13. [13]

    Mobile Phone Sensor Correlates of Depressive Symptom Severity in Daily-Life Behavior: An Exploratory Study

    Saeb S, Zhang M, Karr CJ, et al. Mobile Phone Sensor Correlates of Depressive Symptom Severity in Daily-Life Behavior: An Exploratory Study. J Med Internet Res . 2015;17(7):e175. doi:10.2196/jmir.4273

  14. [14]

    Digital Biomarkers for Depression Screening With Wearable Devices: Cross-sectional Study With Machine Learning Modeling

    Rykov Y, Thach TQ, Bojic I, Christopoulos G, Car J. Digital Biomarkers for Depression Screening With Wearable Devices: Cross-sectional Study With Machine Learning Modeling. JMIR Mhealth Uhealth . 2021;9(10):e24872. doi:10.2196/24872

  15. [15]

    Facebook language predicts depression in medical records

    Eichstaedt JC, Smith RJ, Merchant RM, et al. Facebook language predicts depression in medical records. Proc Natl Acad Sci U S A . 2018;115(44):11203-11208. doi:10.1073/pnas.1802331115

  16. [16]

    A Meta-Analysis of Correlations Between Depression and First Person Singular Pronoun Use

    Edwards T, Holtzman N. A Meta-Analysis of Correlations Between Depression and First Person Singular Pronoun Use. Journal of Research in Personality . 2017;68:63-68. doi:https://doi.org/10.1016/j.jrp.2017.02.005

  17. [17]

    The Distress Analysis Interview Corpus of human and computer interviews

    Gratch J, Artstein R, Lucas G, et al. The Distress Analysis Interview Corpus of human and computer interviews. In: Calzolari N, Choukri K, Declerck T, et al., eds. Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14) . European Language Resources Association (ELRA); 2014:3123-3128. Accessed March 7,

  18. [18]

    Large-scale Analysis of Counseling Conversations: An Application of Natural Language Processing to Mental Health

    Althoff T, Clark K, Leskovec J. Large-scale Analysis of Counseling Conversations: An Application of Natural Language Processing to Mental Health. Trans Assoc Comput Linguist . 2016;4:463-476

  19. [19]

    Quantifying the Association Between Psychotherapy Content and Clinical Outcomes Using Deep Learning

    Ewbank MP, Cummins R, Tablan V, et al. Quantifying the Association Between Psychotherapy Content and Clinical Outcomes Using Deep Learning. JAMA Psychiatry . 2020;77(1):35-43. doi:10.1001/jamapsychiatry.2019.2664

  20. [20]

    Estimating depression severity in narrative clinical notes using large language models

    McCoy TH, Castro VM, Perlis RH. Estimating depression severity in narrative clinical notes using large language models. J Affect Disord . 2025;381:270-274. doi:10.1016/j.jad.2025.04.014

  21. [21]

    Natural language processing and machine learning of electronic health records for prediction of first-time suicide attempts

    Tsui FR, Shi L, Ruiz V, et al. Natural language processing and machine learning of electronic health records for prediction of first-time suicide attempts. JAMIA Open . 2021;4(1):ooab011. doi:10.1093/jamiaopen/ooab011

  22. [22]

    SocialLM: Social Signal Processing of Patient-Provider Communication using LLMs and Contextual Aggregation

    Bedmutha MS, Chen F, Hartzler A, Cohen T, Weibel N. Can Language Models Understand Social Behavior in Clinical Conversations? arXiv . Preprint posted online May 7, 2025:arXiv:2505.04152. doi:10.48550/arXiv.2505.04152

  23. [23]

    ConverSense: An Automated Approach to Assess Patient-Provider Interactions using Social Signals

    Bedmutha MS, Tsedenbal A, Tobar K, et al. ConverSense: An Automated Approach to Assess Patient-Provider Interactions using Social Signals. Proc SIGCHI Conf Hum Factor Comput Syst . 2024;2024:448. doi:10.1145/3613904.3641998

  24. [24]

    Depression underdiagnosis: Prevalence and associated factors

    Faisal-Cury A, Ziebold C, Rodrigues DM de O, Matijasevich A. Depression underdiagnosis: Prevalence and associated factors. A population-based study. Journal of Psychiatric Research . 2022;151:157-165. doi:10.1016/j.jpsychires.2022.04.025

  25. [25]

    The underrecognition and undertreatment of depression: what is the breadth and depth of the problem? J Clin Psychiatry

    Davidson JR, Meltzer-Brody SE. The underrecognition and undertreatment of depression: what is the breadth and depth of the problem? J Clin Psychiatry . 1999;60 Suppl 7:4-9; discussion 10-11

  26. [26]

    Depression Screening and Measurement-Based Care in Primary Care

    Siniscalchi KA, Broome ME, Fish J, et al. Depression Screening and Measurement-Based Care in Primary Care. J Prim Care Community Health . 2020;11:2150132720931261. doi:10.1177/2150132720931261

  27. [27]

    Maybe they had a bad day: how LGBTQ and BIPOC patients react to bias in healthcare and struggle to speak out

    Apodaca C, Casanova-Perez R, Bascom E, et al. Maybe they had a bad day: how LGBTQ and BIPOC patients react to bias in healthcare and struggle to speak out. J Am Med Inform Assoc . 2022;29(12):2075-2082. doi:10.1093/jamia/ocac142

  28. [28]

    The Establishing Focus protocol: Training for collaborative agenda setting and time management in the medical interview

    Mauksch LB, Hillenburg L, Robins L. The Establishing Focus protocol: Training for collaborative agenda setting and time management in the medical interview. Families, Systems, & Health . 2001;19(2):147-157. doi:10.1037/h0089539

  29. [29]

    https://www.ahrq.gov/sites/default/files/2024-07/robins-report.pdf

  30. [30]

    General Hospital Psychiatry , author =

    Manea L, Gilbody S, McMillan D. A diagnostic meta-analysis of the Patient Health Questionnaire-9 (PHQ-9) algorithm scoring method as a screen for depression. General Hospital Psychiatry . 2015;37(1):67-75. doi:10.1016/j.genhosppsych.2014.09.009

  31. [31]

    Speaker Role Identification in Clinical Conversations

    Zolensky A, Jang KJ, Sabin J, et al. Speaker Role Identification in Clinical Conversations. Pac Symp Biocomput . 2026;31:144-157. doi:10.1142/9789819824755_0011

  32. [32]

    Sentence- BERT : Sentence Embeddings using S iamese BERT -Networks

    Reimers N, Gurevych I. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In: Inui K, Jiang J, Ng V, Wan X, eds. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP) . Association for Computational Linguistics; 2019:3982-...

  33. [33]

    Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

    Warner B, Chaffin A, Clavié B, et al. Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference. arXiv . Preprint posted online December 19, 2024:arXiv:2412.13663. doi:10.48550/arXiv.2412.13663

  34. [34]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI, Agarwal S, Ahmad L, et al. gpt-oss-120b & gpt-oss-20b Model Card. arXiv . Preprint posted online August 8, 2025:arXiv:2508.10925. doi:10.48550/arXiv.2508.10925

  35. [35]

    Language use of depressed and depression-vulnerable college students

    Rude SS, Gortner EM, Pennebaker JW. Language use of depressed and depression-vulnerable college students. Cognition and Emotion . 2004;18(8):1121-1133. doi:10.1080/02699930441000030

  36. [36]

    Corbin L, Griner E, Seyedi S, et al. A comparison of linguistic patterns between individuals with current major depressive disorder, past major depressive disorder, and controls in a virtual, psychiatric research interview. Journal of Affective Disorders Reports . 2023;14:100645. doi:10.1016/j.jadr.2023.100645

  37. [37]

    Detecting depression in speech using verbal behavior analysis: a cross-cultural study

    Amorese T, Cuciniello M, Greco C, et al. Detecting depression in speech using verbal behavior analysis: a cross-cultural study. Front Psychol . 2025;16:1514918. doi:10.3389/fpsyg.2025.1514918

  38. [38]

    Detecting PTSD in Clinical Interviews: A Comparative Analysis of NLP Methods and Large Language Models

    Chen F, Ben-Zeev D, Sparks G, Kadakia A, Cohen T. Detecting PTSD in Clinical Interviews: A Comparative Analysis of NLP Methods and Large Language Models. In: Biocomputing 2026 . WORLD SCIENTIFIC; 2025:265-279. doi:10.1142/9789819824755_0019

  39. [39]

    Failure to Recognize Depression in Primary Care: Issues and Challenges

    Egede LE. Failure to Recognize Depression in Primary Care: Issues and Challenges. J Gen Intern Med . 2007;22(5):701-703. doi:10.1007/s11606-007-0170-z

  40. [40]

    Validating Computer-Generated Measures of Linguistic Style Matching and Accommodation in Patient-Clinician Communication

    Khaleghzadegan S, Rosen M, Links A, et al. Validating Computer-Generated Measures of Linguistic Style Matching and Accommodation in Patient-Clinician Communication. Patient Educ Couns . 2024;119:108074. doi:10.1016/j.pec.2023.108074

  41. [41]

    Eliciting the Patient’s Agenda- Secondary Analysis of Recorded Clinical Encounters

    Singh Ospina N, Phillips KA, Rodriguez-Gutierrez R, et al. Eliciting the Patient’s Agenda- Secondary Analysis of Recorded Clinical Encounters. J Gen Intern Med. 2019;34(1):36-40. doi:10.1007/s11606-018-4540-5

  42. [42]

    Interrupted opening statements in clinical encounters: A scoping review

    Coyle AC, Yen RW, Elwyn G. Interrupted opening statements in clinical encounters: A scoping review. Patient Education and Counseling. 2022;105(8):2653-2663. doi:10.1016/j.pec.2022.03.026