Fine-tuning LLMs for Passive Depression Severity Estimation from AI Mental Health Dialogue
Pith reviewed 2026-06-27 00:49 UTC · model grok-4.3
The pith
Transcripts of conversations with an AI mental health application can predict PHQ-9 depression scores to within a mean absolute error of 2.6 points on held-out users.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors fine-tune Qwen3.5-27B with a regression head on transcripts from an AI mental health application. Training uses 3,111 ground-truth PHQ-9 labels augmented with pseudolabels from Claude Opus and iteratively trained intermediate models, for a combined set of 6,283 users. On 842 held-out users the model reaches MAE = 2.6, RMSE = 4.0, Pearson r = 0.80, and AUC = 0.91 at the PHQ-9 >= 10 clinical threshold, with AUC above 0.87 at every severity threshold from PHQ-9 >= 3 to PHQ-9 >= 24.
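To make the headline regression metrics concrete, here is a minimal sketch of how MAE, RMSE, and Pearson r are computed from paired true and predicted PHQ-9 totals. The function names and the toy scores are illustrative assumptions, not the authors' evaluation code or data.

```python
import math

def mae(y_true, y_pred):
    """Mean absolute error between true and predicted scores."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def rmse(y_true, y_pred):
    """Root mean squared error; penalizes large misses more than MAE."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pearson_r(y_true, y_pred):
    """Pearson correlation between true and predicted scores."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov / math.sqrt(var_t * var_p)

# Made-up toy values for illustration only (not from the paper).
true_scores = [2, 7, 12, 18, 23]
predicted = [4.0, 6.0, 10.0, 19.0, 20.0]

print(mae(true_scores, predicted))
print(rmse(true_scores, predicted))
print(pearson_r(true_scores, predicted))
```

RMSE is always at least as large as MAE, so the reported gap between 2.6 and 4.0 suggests that some users are missed by considerably more than the average error.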
What carries the argument
A regression head on a fine-tuned Qwen3.5-27B model trained on conversation transcripts with pseudolabel augmentation.
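As a rough illustration of what a regression head does, the sketch below pools per-token hidden vectors into one vector and maps it to a single score with a linear layer. The page does not say how the authors pool token states or whether they clamp outputs, so mean pooling and clamping to the 0 to 27 PHQ-9 range are assumptions, and all names and numbers are hypothetical.

```python
def mean_pool(hidden_states, attention_mask):
    """Average the token vectors whose mask entry is 1 (assumed pooling)."""
    kept = [h for h, m in zip(hidden_states, attention_mask) if m == 1]
    if not kept:
        raise ValueError("attention mask selects no tokens")
    dim = len(kept[0])
    return [sum(h[d] for h in kept) / len(kept) for d in range(dim)]

def regression_head(pooled, weights, bias):
    """Linear map from the pooled vector to one scalar score."""
    return sum(w * x for w, x in zip(weights, pooled)) + bias

def clamp_phq9(score):
    """Clamp to the valid PHQ-9 total range of 0 to 27 (assumed post-processing)."""
    return max(0.0, min(27.0, score))

# Toy 2-dimensional "hidden states" for three tokens; the last is padding.
hidden_states = [[1.0, 2.0], [3.0, 4.0], [100.0, 100.0]]
attention_mask = [1, 1, 0]
weights, bias = [2.0, 1.5], 0.5  # hypothetical learned parameters

pooled = mean_pool(hidden_states, attention_mask)
score = clamp_phq9(regression_head(pooled, weights, bias))
print(pooled, score)
```

In a real fine-tuning setup the weights and bias would be trained jointly with (or on top of) the backbone; here they are fixed only to show the data flow.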
If this is right
- Symptom monitoring in AI platforms no longer requires users to complete questionnaires.
- Depression severity can be tracked continuously from routine interactions.
- Early detection of symptom changes becomes feasible at scale.
- The approach covers the entire spectrum of depression severity.
Where Pith is reading between the lines
- Real-time feedback to users or providers could be enabled if deployed.
- Validation on data from different AI chat platforms would test broader applicability.
- Ethical guidelines for using conversation data for health inference would need development.
Load-bearing premise
Pseudolabels from Claude Opus and the iteratively trained intermediate models serve as reliable stand-ins for actual PHQ-9 scores.
What would settle it
Collecting fresh PHQ-9 responses from the same 842 test users and measuring how well the model's predictions match those new scores.
Original abstract
Depression is the leading cause of disability worldwide, and early detection of symptom change is essential for timely intervention. Validated instruments such as the Patient Health Questionnaire-9 (PHQ-9) support symptom monitoring at scale, but real-world completion rates are low, introducing response bias and systematic missingness. Passive approaches that infer severity from routinely generated data could close this gap. We address this by predicting PHQ-9 total scores directly from transcripts of conversations between users and an AI mental health application, requiring only conversation text and no additional clinical data. We fine-tune a Qwen3.5-27B backbone with a regression head, augment 3,111 ground-truth labels with pseudolabels generated by a reasoning model (Claude Opus) and iteratively trained intermediate models, for a combined dataset of 6,283 users. On a held-out test set of 842 users, our best model achieves MAE = 2.6, RMSE = 4.0, Pearson r = 0.80, and AUC = 0.91 at the PHQ-9 >= 10 clinical threshold. We also find AUC > 0.87 at every severity threshold from PHQ-9 >= 3 to PHQ-9 >= 24, demonstrating that the model captures depression severity across the full clinical spectrum. This work opens the door to passive, continuous symptom monitoring in AI mental health platforms, without requiring users to complete self-report measures.
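The multi-threshold AUC claim can be read as follows: for each cutoff k, users with a true PHQ-9 total of at least k are treated as positives, and AUC measures how often the model ranks a positive above a negative. The sketch below computes this with a simple pairwise comparison; the function name and toy data are assumptions for illustration, not the authors' code.

```python
def auc_at_threshold(y_true, y_pred, threshold):
    """AUC for the binary task 'true score >= threshold', using the
    continuous prediction as the ranking score. Returns None when one
    class is empty, because AUC is undefined in that case."""
    pos = [p for t, p in zip(y_true, y_pred) if t >= threshold]
    neg = [p for t, p in zip(y_true, y_pred) if t < threshold]
    if not pos or not neg:
        return None
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))

# Made-up toy values for illustration only (not from the paper).
true_scores = [1, 4, 8, 11, 15, 22]
predicted = [2.5, 3.0, 11.0, 9.5, 14.0, 19.0]

# Sweep the same threshold range the abstract reports (3 through 24).
sweep = {k: auc_at_threshold(true_scores, predicted, k) for k in range(3, 25)}
print(sweep[10])
```

The pairwise loop is quadratic in the number of users, which is fine for a test set of a few hundred; a rank-based formula would be preferable at larger scale. At extreme thresholds few users are positive, so those AUC values rest on small samples.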
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper fine-tunes Qwen3.5-27B with a regression head to predict PHQ-9 total scores directly from AI mental health dialogue transcripts. It augments 3,111 ground-truth labels with pseudolabels from Claude Opus and iteratively trained intermediate models to reach 6,283 users total, then reports MAE=2.6, RMSE=4.0, Pearson r=0.80 and AUC=0.91 (PHQ-9 >=10 threshold) plus AUC>0.87 across all thresholds on a held-out test set of 842 users.
Significance. If the pseudolabels prove reliable, the work would support scalable passive monitoring of depression severity in AI mental health platforms without requiring self-report completion. The held-out user split and multi-threshold AUC reporting are methodologically sound elements that strengthen the contribution if the label quality issue is addressed.
major comments (2)
- [Abstract / Methods] Abstract and Methods (pseudolabel generation): No correlation, MAE, RMSE, or confusion-matrix statistics are supplied that quantify how well the Claude Opus and iterative-model pseudolabels recover true PHQ-9 scores on any split of the 3,111 ground-truth users. Because the regression head is trained on the combined 6,283-user set, systematic bias or variance in the ~3,172 pseudolabels directly shapes the learned mapping and undermines interpretation of the reported test metrics.
- [Results] Results (held-out evaluation): The test set of 842 users is drawn from the ground-truth pool, yet the model parameters were optimized on the noisier augmented data; without an ablation comparing performance when trained only on the 3,111 ground-truth labels versus the augmented set, it is impossible to isolate the effect of pseudolabel noise on the headline numbers (MAE=2.6, r=0.80, AUC=0.91).
minor comments (1)
- [Abstract] The abstract states that pseudolabels are generated 'iteratively' but provides no details on the number of iterations, convergence criterion, or how the intermediate models were selected.
Simulated Author's Rebuttal
We thank the referee for their constructive comments, which highlight important aspects of pseudolabel validation and experimental controls. We address each major comment below and will revise the manuscript accordingly to strengthen the presentation of our methods and results.
- Referee: [Abstract / Methods] Abstract and Methods (pseudolabel generation): No correlation, MAE, RMSE, or confusion-matrix statistics are supplied that quantify how well the Claude Opus and iterative-model pseudolabels recover true PHQ-9 scores on any split of the 3,111 ground-truth users. Because the regression head is trained on the combined 6,283-user set, systematic bias or variance in the ~3,172 pseudolabels directly shapes the learned mapping and undermines interpretation of the reported test metrics.
  Authors: We agree that explicit validation metrics for the pseudolabeling process are needed to assess potential bias or noise. Although the primary pseudolabels were generated for users lacking ground truth, the same procedure can be applied to held-out splits within the 3,111 ground-truth users. We will add this analysis, reporting MAE, RMSE, Pearson correlation, and threshold-specific AUCs between pseudolabels and true PHQ-9 scores, and will discuss the implications for the combined training set. Revision: yes.
- Referee: [Results] Results (held-out evaluation): The test set of 842 users is drawn from the ground-truth pool, yet the model parameters were optimized on the noisier augmented data; without an ablation comparing performance when trained only on the 3,111 ground-truth labels versus the augmented set, it is impossible to isolate the effect of pseudolabel noise on the headline numbers (MAE=2.6, r=0.80, AUC=0.91).
  Authors: We acknowledge that the requested ablation would clarify the incremental value of the pseudolabels. We will train and evaluate an additional model using only the 3,111 ground-truth labels on the identical held-out test set of 842 users, reporting the full set of metrics (MAE, RMSE, Pearson r, and multi-threshold AUCs) for direct comparison with the augmented model. These results will be added to the Results section. Revision: yes.
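One way the pseudolabel validation requested in the first major comment could look is sketched below: pseudolabels are generated for users who also have true PHQ-9 scores, then compared by severity band and by mean signed error. The bands are the standard PHQ-9 categories (0-4, 5-9, 10-14, 15-19, 20-27); the function names and toy scores are assumptions, and this is not the authors' analysis.

```python
# Standard PHQ-9 severity bands: (low, high, label).
BANDS = [
    (0, 4, "minimal"),
    (5, 9, "mild"),
    (10, 14, "moderate"),
    (15, 19, "moderately severe"),
    (20, 27, "severe"),
]

def band_index(score):
    """Map a score to its band index after rounding and clamping to 0-27.
    Note that Python's round() sends exact halves to the nearest even integer."""
    s = max(0, min(27, round(score)))
    for i, (lo, hi, _) in enumerate(BANDS):
        if lo <= s <= hi:
            return i
    raise ValueError(f"score out of range: {score}")

def confusion_matrix(true_scores, pseudo_scores):
    """Rows index the true band, columns the pseudolabel band."""
    matrix = [[0] * len(BANDS) for _ in BANDS]
    for t, p in zip(true_scores, pseudo_scores):
        matrix[band_index(t)][band_index(p)] += 1
    return matrix

def mean_signed_error(true_scores, pseudo_scores):
    """Positive values mean the pseudolabels overestimate severity on average."""
    diffs = [p - t for t, p in zip(true_scores, pseudo_scores)]
    return sum(diffs) / len(diffs)

# Made-up toy values for illustration only (not from the paper).
true_scores = [3, 6, 12, 16, 21, 8]
pseudo_scores = [4, 11, 13, 14, 22, 7]

matrix = confusion_matrix(true_scores, pseudo_scores)
band_agreement = sum(matrix[i][i] for i in range(len(BANDS))) / len(true_scores)
bias = mean_signed_error(true_scores, pseudo_scores)
print(matrix, band_agreement, bias)
```

A nonzero mean signed error would indicate the systematic bias the referee is worried about, while off-diagonal mass in the matrix shows which severity bands the pseudolabels confuse.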
Circularity Check
No circularity: standard supervised fine-tuning, evaluated on a held-out ground-truth test set
The paper reports MAE, RMSE, Pearson r and AUC on a held-out test set of 842 users after fine-tuning on an augmented training set of 6,283 users (3,111 ground-truth PHQ-9 plus pseudolabels). No equations, derivations or self-citations are present that reduce these metrics to quantities defined by the fitted parameters themselves. The workflow matches conventional supervised regression with external labels; pseudolabel quality is an empirical assumption but does not create a self-definitional or fitted-input-called-prediction loop. No uniqueness theorems, ansatzes smuggled via citation, or renaming of known results appear. The derivation chain is therefore self-contained against the external PHQ-9 benchmark.
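The workflow described above depends on test users never appearing in training, and the ablation discussed in the rebuttal needs two training sets that share one test set. The sketch below shows one way to assemble them. Whether the 842 test users are counted inside the 3,111 ground-truth labels is not stated on this page, so carving the test set out of a single ground-truth pool is an assumption, as are all names and the toy sizes.

```python
import random

def build_training_sets(ground_truth, pseudolabels, n_test, seed=0):
    """Split users into a held-out test set and two training sets.

    ground_truth, pseudolabels: dicts mapping user_id -> PHQ-9 score.
    Returns (test, gt_only, augmented), all dicts of user_id -> score.
    """
    overlap = set(ground_truth) & set(pseudolabels)
    if overlap:
        raise ValueError(f"users with both label types: {sorted(overlap)}")

    rng = random.Random(seed)        # fixed seed for a reproducible split
    user_ids = sorted(ground_truth)  # sort first so the shuffle is deterministic
    rng.shuffle(user_ids)
    test_ids = set(user_ids[:n_test])

    test = {u: ground_truth[u] for u in test_ids}
    gt_only = {u: s for u, s in ground_truth.items() if u not in test_ids}
    augmented = dict(gt_only)
    augmented.update(pseudolabels)   # add pseudolabeled users to the training data
    return test, gt_only, augmented

# Toy sizes for illustration only; the scores are arbitrary values in 0-27.
ground_truth = {f"gt{i}": i % 28 for i in range(10)}
pseudolabels = {f"pl{i}": (3 * i) % 28 for i in range(6)}

test, gt_only, augmented = build_training_sets(ground_truth, pseudolabels, n_test=3)
print(len(test), len(gt_only), len(augmented))
```

Splitting by user rather than by conversation matters here: if one user's transcripts landed on both sides of the split, the test metrics would overstate generalization to new users.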
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: conversation transcripts between users and an AI mental health application contain sufficient information to infer PHQ-9 total scores.
Forward citations
Cited by 1 Pith paper
- SAGE: Stochastic Prompt Optimization via Agent-Guided Exploration
  SAGE is a multi-agent stochastic prompt optimization method that outperforms simpler search strategies on some benchmarks and improves next-day retention in a mental-health chatbot via continuous optimization.