Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue

Caitlin A. Stamatis; Olivier Tieleman; Samuel J. Bell; Thomas D. Hull; Ting Su; Ziyi Zhu

arxiv: 2606.17973 · v1 · pith:WA4XGJQ3new · submitted 2026-06-16 · 💻 cs.CL

Olivier Tieleman, Ziyi Zhu, Ting Su, Samuel J. Bell, Thomas D. Hull, Caitlin A. Stamatis

Pith reviewed 2026-06-27 00:49 UTC · model grok-4.3

classification 💻 cs.CL

keywords: depression, PHQ-9, LLM fine-tuning, passive estimation, mental health dialogue, pseudolabels, regression

The pith

Transcripts of AI mental health conversations can predict PHQ-9 depression scores, with a reported MAE of 2.6 and Pearson r of 0.80 on held-out users.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a fine-tuned large language model can infer depression severity directly from user-AI conversation text. The method augments limited ground-truth PHQ-9 labels with pseudolabels to train on over six thousand users. A reader would care because it offers a way to monitor symptoms passively in digital mental health tools where self-report completion is low.

Core claim

The authors fine-tune Qwen3.5-27B with a regression head on transcripts from an AI mental health application. They augment 3,111 ground-truth PHQ-9 labels with pseudolabels from Claude Opus and iteratively trained intermediate models, for a combined 6,283 users. On 842 held-out users the model reaches MAE = 2.6, RMSE = 4.0, Pearson r = 0.80, and AUC = 0.91 at the PHQ-9 >= 10 clinical threshold, with AUC above 0.87 at every severity threshold from PHQ-9 >= 3 to PHQ-9 >= 24.
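For readers less familiar with the regression metrics quoted here, the following is a minimal sketch of how MAE, RMSE, and Pearson r are computed from true and predicted PHQ-9 totals. The function names and the five-user toy data are hypothetical illustrations and are not taken from the paper.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error: average size of the miss, in PHQ-9 points."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error: penalizes large misses more than MAE does."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson_r(y_true, y_pred):
    """Pearson correlation between true and predicted scores."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

# Hypothetical toy data: five users, true PHQ-9 totals and model predictions.
true_scores = [2, 7, 12, 18, 24]
pred_scores = [3.0, 6.0, 14.0, 15.0, 22.0]

print(f"MAE  = {mae(true_scores, pred_scores):.2f}")
print(f"RMSE = {rmse(true_scores, pred_scores):.2f}")
print(f"r    = {pearson_r(true_scores, pred_scores):.2f}")
```

RMSE is never smaller than MAE, and the gap between the two (4.0 versus 2.6 in the reported results) indicates that some users are missed by considerably more than the typical error.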

What carries the argument

A regression head on a fine-tuned Qwen3.5-27B model trained on conversation transcripts with pseudolabel augmentation.
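To make "regression head" concrete, here is a minimal sketch under stated assumptions: token vectors from the backbone are mean-pooled, a single linear layer maps the pooled vector to one scalar, and the output is clipped to the 0 to 27 range of a PHQ-9 total. This page does not state the pooling strategy, head architecture, or loss the authors used, so every name and design choice below is hypothetical.

```python
def mean_pool(hidden_states):
    """Average per-token vectors into one vector for the whole transcript."""
    n_tokens = len(hidden_states)
    dim = len(hidden_states[0])
    return [sum(tok[d] for tok in hidden_states) / n_tokens for d in range(dim)]

def regression_head(pooled, weights, bias):
    """Single linear layer: dot product with the pooled vector plus a bias."""
    return sum(w * x for w, x in zip(weights, pooled)) + bias

def clip_to_phq9(score):
    """Keep the prediction inside the valid PHQ-9 total range (9 items, 0-3 each)."""
    return max(0.0, min(27.0, score))

# Hypothetical toy inputs: 2 tokens with 3-dimensional hidden states.
hidden = [[1.0, 0.0, 2.0], [3.0, 2.0, 0.0]]
weights = [2.0, 1.0, 3.0]   # assumed learned parameters
bias = 0.5

pooled = mean_pool(hidden)                    # [2.0, 1.0, 1.0]
raw = regression_head(pooled, weights, bias)  # 2*2 + 1*1 + 3*1 + 0.5 = 8.5
print(clip_to_phq9(raw))
```

In a real fine-tuning run the weights and bias would be learned jointly with (or on top of) the backbone, typically by minimizing a regression loss against the PHQ-9 labels.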

If this is right

Symptom monitoring in AI platforms no longer requires users to complete questionnaires.
Depression severity can be tracked continuously from routine interactions.
Early detection of symptom changes becomes feasible at scale.
The approach covers the entire spectrum of depression severity.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Real-time feedback to users or providers could be enabled if deployed.
Validation on data from different AI chat platforms would test broader applicability.
Ethical guidelines for using conversation data for health inference would need development.

Load-bearing premise

Pseudolabels from Claude Opus and iteratively trained intermediate models serve as reliable stand-ins for actual PHQ-9 scores.

What would settle it

Collecting fresh PHQ-9 responses from the same 842 test users and measuring how well the model's predictions match those new scores.

Figures

Figures reproduced from arXiv: 2606.17973 by Caitlin A. Stamatis, Olivier Tieleman, Samuel J. Bell, Thomas D. Hull, Ting Su, Ziyi Zhu.

Original abstract

Depression is the leading cause of disability worldwide, and early detection of symptom change is essential for timely intervention. Validated instruments such as the Patient Health Questionnaire-9 (PHQ-9) support symptom monitoring at scale, but real-world completion rates are low, introducing response bias and systematic missingness. Passive approaches that infer severity from routinely generated data could close this gap. We address this by predicting PHQ-9 total scores directly from transcripts of conversations between users and an AI mental health application, requiring only conversation text and no additional clinical data. We fine-tune a Qwen3.5-27B backbone with a regression head, augment 3,111 ground-truth labels with pseudolabels generated by a reasoning model (Claude Opus) and iteratively trained intermediate models, for a combined dataset of 6,283 users. On a held-out test set of 842 users, our best model achieves MAE = 2.6, RMSE = 4.0, Pearson r = 0.80, and AUC = 0.91 at the PHQ-9 >= 10 clinical threshold. We also find AUC > 0.87 at every severity threshold from PHQ-9 >= 3 to PHQ-9 >= 24, demonstrating that the model captures depression severity across the full clinical spectrum. This work opens the door to passive, continuous symptom monitoring in AI mental health platforms, without requiring users to complete self-report measures.
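The multi-threshold claim in the abstract can be read as follows: for each cutoff t from 3 to 24, users are labeled positive if their true PHQ-9 total is at least t, and AUC measures how well the continuous prediction ranks positives above negatives. The sketch below illustrates that sweep; the pairwise AUC implementation, the function names, and the seven-user toy data are assumptions for illustration, not the authors' evaluation code.

```python
def auc(labels, scores):
    """Probability that a random positive is scored above a random negative.

    Ties count as half a win. Returns None if one class is empty.
    """
    pos = [s for is_pos, s in zip(labels, scores) if is_pos]
    neg = [s for is_pos, s in zip(labels, scores) if not is_pos]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_by_threshold(y_true, y_pred, thresholds):
    """Binarize the true score at each threshold and compute AUC of the prediction."""
    return {t: auc([y >= t for y in y_true], y_pred) for t in thresholds}

# Hypothetical toy data: true PHQ-9 totals and predicted scores for seven users.
true_scores = [1, 4, 8, 11, 15, 20, 25]
pred_scores = [2.5, 3.0, 10.5, 9.0, 16.0, 18.0, 23.0]

results = auc_by_threshold(true_scores, pred_scores, range(3, 25))
for t in (3, 10, 24):
    print(f"PHQ-9 >= {t}: AUC = {results[t]:.3f}")
```

At extreme thresholds the positive or negative class becomes small, so AUC estimates there are noisier than at the PHQ-9 >= 10 cutoff even when the point estimate is high.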

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The performance numbers look decent on paper but rest on unvalidated pseudolabels from Claude Opus and iteratively trained intermediate models, which weakens the main claim.

The central point is that they fine-tune a 27B Qwen model to regress PHQ-9 scores straight from AI mental health chat transcripts and report MAE 2.6, r 0.80, and AUC 0.91 on held-out users. That pipeline is new as a direct application, and the fact that they get decent AUC across every threshold from mild to severe is a practical plus.

They start with 3,111 users who have real PHQ-9 labels and expand to 6,283 by adding pseudolabels first from Claude Opus then from intermediate models. The held-out test set of 842 is drawn from the ground-truth pool, so the metrics are at least computed on clean labels.
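The data layout described above can be sketched as a user-level split in which the test set contains only ground-truth users and pseudolabeled users enter the training set only. This is a minimal illustration under assumptions: the function name, the random split, the test fraction, and the toy user IDs are hypothetical, and the sketch does not model the iterative relabeling by intermediate models, whose details this page does not give.

```python
import random

def build_splits(ground_truth, pseudolabels, test_fraction, seed=0):
    """Split ground-truth users into train/test, then add pseudolabeled users to train.

    ground_truth, pseudolabels: dict mapping user_id -> PHQ-9 total.
    Returns (train, test); train values are (score, source) pairs.
    """
    overlap = set(ground_truth) & set(pseudolabels)
    if overlap:
        raise ValueError(f"users with both label types: {sorted(overlap)}")

    user_ids = sorted(ground_truth)
    random.Random(seed).shuffle(user_ids)        # reproducible user-level shuffle
    n_test = int(len(user_ids) * test_fraction)
    test_ids = set(user_ids[:n_test])

    # The test set holds real questionnaire scores only.
    test = {u: ground_truth[u] for u in test_ids}
    train = {u: (s, "ground_truth") for u, s in ground_truth.items() if u not in test_ids}
    train.update({u: (s, "pseudolabel") for u, s in pseudolabels.items()})
    return train, test

# Hypothetical toy data: 10 users with real PHQ-9 totals, 4 with pseudolabels.
gt = {f"u{i}": i * 2 for i in range(10)}
pseudo = {f"p{i}": 5 for i in range(4)}

train, test = build_splits(gt, pseudo, test_fraction=0.3)
print(len(train), len(test))
```

Tagging each training example with its label source keeps the option open to down-weight or exclude pseudolabels later, which is what an ablation on label quality would need.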

The obvious soft spot is the lack of any check on how accurate those pseudolabels actually are. No correlation, MAE, or confusion numbers are given against the real PHQ-9 scores on the labeled users. If the pseudolabels carry systematic bias or extra noise, that error gets baked into the regression head during training. The clean test set does not remove that problem.

Demographic balance and error analysis are also missing from what is shown, which matters for a clinical signal. Still, the core supervised setup on user-split data is standard and not circular.

This is for people working on passive monitoring inside digital mental health tools. A reader who wants to see whether dialogue text alone can stand in for self-report measures will find the numbers and the full-spectrum AUC useful, but will need the pseudolabel validation before treating the result as reliable.

It deserves peer review. The application is timely and the reported metrics are concrete enough that referees can ask the right questions about label quality and data characteristics.

Referee Report

2 major / 1 minor

Summary. The paper fine-tunes Qwen3.5-27B with a regression head to predict PHQ-9 total scores directly from AI mental health dialogue transcripts. It augments 3,111 ground-truth labels with pseudolabels from Claude Opus and iteratively trained intermediate models to reach 6,283 users total, then reports MAE=2.6, RMSE=4.0, Pearson r=0.80 and AUC=0.91 (PHQ-9 >=10 threshold) plus AUC>0.87 across all thresholds on a held-out test set of 842 users.

Significance. If the pseudolabels prove reliable, the work would support scalable passive monitoring of depression severity in AI mental health platforms without requiring self-report completion. The held-out user split and multi-threshold AUC reporting are methodologically sound elements that strengthen the contribution if the label quality issue is addressed.

major comments (2)

[Abstract / Methods] Abstract and Methods (pseudolabel generation): No correlation, MAE, RMSE, or confusion-matrix statistics are supplied that quantify how well the Claude Opus and iterative-model pseudolabels recover true PHQ-9 scores on any split of the 3,111 ground-truth users. Because the regression head is trained on the combined 6,283-user set, systematic bias or variance in the ~3,172 pseudolabels directly shapes the learned mapping and undermines interpretation of the reported test metrics.
[Results] Results (held-out evaluation): The test set of 842 users is drawn from the ground-truth pool, yet the model parameters were optimized on the noisier augmented data; without an ablation comparing performance when trained only on the 3,111 ground-truth labels versus the augmented set, it is impossible to isolate the effect of pseudolabel noise on the headline numbers (MAE=2.6, r=0.80, AUC=0.91).

minor comments (1)

[Abstract] The abstract states that pseudolabels are generated 'iteratively' but provides no details on the number of iterations, convergence criterion, or how the intermediate models were selected.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of pseudolabel validation and experimental controls. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our methods and results.

Referee: [Abstract / Methods] Abstract and Methods (pseudolabel generation): No correlation, MAE, RMSE, or confusion-matrix statistics are supplied that quantify how well the Claude Opus and iterative-model pseudolabels recover true PHQ-9 scores on any split of the 3,111 ground-truth users. Because the regression head is trained on the combined 6,283-user set, systematic bias or variance in the ~3,172 pseudolabels directly shapes the learned mapping and undermines interpretation of the reported test metrics.

Authors: We agree that explicit validation metrics for the pseudolabeling process are needed to assess potential bias or noise. Although the primary pseudolabels were generated for users lacking ground truth, the same procedure can be applied to held-out splits within the 3,111 ground-truth users. We will add this analysis, reporting MAE, RMSE, Pearson correlation, and threshold-specific AUCs between pseudolabels and true PHQ-9 scores, and will discuss the implications for the combined training set. revision: yes
Referee: [Results] Results (held-out evaluation): The test set of 842 users is drawn from the ground-truth pool, yet the model parameters were optimized on the noisier augmented data; without an ablation comparing performance when trained only on the 3,111 ground-truth labels versus the augmented set, it is impossible to isolate the effect of pseudolabel noise on the headline numbers (MAE=2.6, r=0.80, AUC=0.91).

Authors: We acknowledge that the requested ablation would clarify the incremental value of the pseudolabels. We will train and evaluate an additional model using only the 3,111 ground-truth labels on the identical held-out test set of 842 users, reporting the full set of metrics (MAE, RMSE, Pearson r, and multi-threshold AUCs) for direct comparison with the augmented model. These results will be added to the Results section. revision: yes
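The pseudolabel validation discussed in the first exchange could look roughly like the sketch below: generate pseudolabels for users who also have a real PHQ-9 score, then report the signed bias, the MAE, and a 2x2 confusion table at the PHQ-9 >= 10 cutoff. The function name, the returned fields, and the six-user toy data are hypothetical; this is an illustration of the kind of check requested, not the analysis the authors will run.

```python
def pseudolabel_agreement(true_scores, pseudo_scores, cutoff=10):
    """Compare pseudolabels against real PHQ-9 totals for the same users."""
    n = len(true_scores)
    errors = [p - t for t, p in zip(true_scores, pseudo_scores)]
    confusion = {"tp": 0, "fp": 0, "fn": 0, "tn": 0}
    for t, p in zip(true_scores, pseudo_scores):
        truly_pos, labeled_pos = t >= cutoff, p >= cutoff
        if truly_pos and labeled_pos:
            confusion["tp"] += 1
        elif labeled_pos:
            confusion["fp"] += 1
        elif truly_pos:
            confusion["fn"] += 1
        else:
            confusion["tn"] += 1
    return {
        "bias": sum(errors) / n,                 # > 0 means pseudolabels run high
        "mae": sum(abs(e) for e in errors) / n,
        "confusion": confusion,
    }

# Hypothetical toy data: six users with both a real score and a pseudolabel.
true_scores = [3, 9, 10, 14, 20, 6]
pseudo_scores = [5, 11, 9, 15, 17, 6]

report = pseudolabel_agreement(true_scores, pseudo_scores)
print(report)
```

A bias near zero with a modest MAE would suggest the pseudolabels add mostly random noise, while a consistently positive or negative bias would indicate systematic error that the regression head could absorb during training.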

Circularity Check

0 steps flagged

No circularity: standard supervised fine-tuning on held-out ground-truth test set

The paper reports MAE, RMSE, Pearson r and AUC on a held-out test set of 842 users after fine-tuning on an augmented training set of 6,283 users (3,111 ground-truth PHQ-9 plus pseudolabels). No equations, derivations or self-citations are present that reduce these metrics to quantities defined by the fitted parameters themselves. The workflow matches conventional supervised regression with external labels; pseudolabel quality is an empirical assumption but does not create a self-definitional or fitted-input-called-prediction loop. No uniqueness theorems, ansatzes smuggled via citation, or renaming of known results appear. The derivation chain is therefore self-contained against the external PHQ-9 benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that conversation text encodes enough signal for PHQ-9 regression and that pseudolabels can be treated as usable training targets.

axioms (1)

domain assumption: Conversation transcripts between users and an AI mental health application contain sufficient information to infer PHQ-9 total scores.
Invoked by the decision to train a regression model directly on text without additional clinical features.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration
cs.CL · 2026-06 · unverdicted · novelty 5.0

SAGE is a multi-agent stochastic prompt optimization method that outperforms simpler search strategies on some benchmarks and improves next-day retention in a mental-health chatbot via continuous optimization.

Reference graph

Works this paper leans on

19 extracted references · 1 canonical work page · cited by 1 Pith paper

[1] Al-Mosaiwi, M., and Johnstone, T. (2018). In an absolute state: Elevated use of absolutist words is a marker specific to anxiety, depression, and suicidal ideation. Clinical Psychological Science, 6(4):529–542.

[2] Althoff, T., Clark, K., and Leskovec, J. (2016). Large-scale analysis of counseling conversations: An application of natural language processing to mental health. Transactions of the Association for Computational Linguistics, 4:463–476.

[3] Lau, C., Zhu, X., and Chan, W.-Y. (2023). Automatic depression severity assessment with deep learning using parameter-efficient tuning. Frontiers in Psychiatry, 14:1160291.

[4] Alves, P., et al. (2025). A machine learning model using clinical notes to estimate PHQ-9 symptom severity scores in depressed patients. Journal of Affective Disorders. De Choudhury, M., Gamon, M., Counts, S., and Horvitz, E. (2013). Predicting depression via social media. In Proceedings of ICWSM.

[5] Ewbank, M. P., Cummins, R., Tablan, V., Bateup, S., Catarino, A., Martin, A. J., and Blackwell, A. D. (2020). Quantifying the association between psychotherapy content and clinical outcomes using deep learning. JAMA Psychiatry, 77(1):35–43.

[6] Lalk, C., Steinbrenner, T., Kania, W., Popko, A., Wester, R., Schaffrath, J., Eberhardt, S., Schwartz, B., Lutz, W., and Rubel, J. (2024). Measuring alliance and symptom severity in psychotherapy transcripts using BERT topic modeling. Administration and Policy in Mental Health and Mental Health Services Research, 51(4):509–524.

[7] Gratch, J., et al. (2014). The distress analysis interview corpus of human and computer interviews. In Proceedings of LREC.

[8] He, J., et al. (2020). Revisiting self-training for neural sequence generation. In Proceedings of ICLR.

[9] Jiang, Z., Levitan, S. I., Zomick, J., and Hirschberg, J. (2020). Detection of mental health from Reddit via deep contextualized representations. In Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis (LOUHI).

[10] Burdisso, S., et al. (2024). DAIC-WOZ: On the validity of using the therapist’s prompts in automatic depression detection from clinical interviews. arXiv preprint arXiv:2404.14463.

[11] Kroenke, K., Spitzer, R. L., and Williams, J. B. W. (2001). The PHQ-9: Validity of a brief depression severity measure. Journal of General Internal Medicine, 16(9):606–613.

[12] Schmidt, F., Ravan, S., and Vlassov, V. (2025). Probabilistic textual time series depression detection. arXiv preprint arXiv:2511.04476.

[13] Resnik, P., et al. (2015). Beyond LDA: Exploring supervised topic modeling for depression-related language in Twitter. In Proceedings of the Workshop on Computational Linguistics and Clinical Psychology.

[14] Ringeval, F., et al. (2019). AVEC 2019 workshop and challenge: State-of-mind, detecting depression with AI, and cross-cultural affect recognition. In Proceedings of the 9th International Audio/Visual Emotion Challenge and Workshop (AVEC ’19).

[15] Hu, E. J., et al. (2022). LoRA: Low-rank adaptation of large language models. In Proceedings of ICLR.

[16] Zhu, Z., Tieleman, O., Stamatis, C. A., Smyth, L., Hull, T. D., Cahn, D. R., Chen, J., and Malgaroli, M. (2025). DIAL: Direct iterative adversarial learning for realistic multi-turn dialogue simulation. arXiv preprint arXiv:2512.20773. Qwen Team (2026), Qwen3.5: Towards Native Multimodal Agents. https://qwen.ai/blog?id=qwen3.5

[17] Shin, D., et al. (2024). Using large language models to detect depression from user-generated diary text data as a novel approach in digital mental health screening: Instrument validation study. Journal of Medical Internet Research, 26:e54617.

[18] Weber, S., Deperrois, N., Heun, R., et al. (2025). Using a fine-tuned large language model for symptom-based depression evaluation. npj Digital Medicine, 8:598. World Health Organization (2023). Depressive disorder (depression). WHO Fact Sheet.

[19] Xie, Q., Luong, M.-T., Hovy, E., and Le, Q. V. (2020). Self-training with noisy student improves ImageNet classification. In Proceedings of CVPR. Friedrich M. Depression Is the Leading Cause of Disability Around the World. JAMA. 2017;317(15):1517. doi:10.1001/jama.2017.3826 Teferra BG, Rueda A, Pang H, Valenzano R, Samavi R, Krishnan S, Bhat V. Screening f...
