Engagement Phenotypes for a Sample of 102,684 AI Mental Health Chatbot Users and Dose-Response Associations with Clinical Outcomes
Pith reviewed 2026-05-09 19:37 UTC · model grok-4.3
The pith
Users of an AI mental health chatbot fall into five engagement patterns that link to different levels of depression and anxiety relief.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
K-means clustering on eight behavioral features from 102,684 users of the Ash AI mental health chatbot identified five engagement phenotypes: Early Dropouts (52.2 percent), Power Users (1.6 percent), Intensive Users (4.1 percent), Weekly Users (25.3 percent), and Concentrated Users (16.8 percent). Significant pre-to-post reductions occurred in depression and anxiety scores, with a dose-response pattern for depression improvement that replicated when using model-predicted PHQ-9 values across 23,813 users. Higher working alliance scores predicted greater depression gains and moderated the relationship between engagement and social support increases.
What carries the argument
K-means clustering across eight behavioral usage features to derive distinct engagement phenotypes and their ties to clinical measures.
If this is right
- Different clinical outcomes respond to different dimensions of chatbot engagement, so depression relief follows a usage-intensity gradient while social support gains show separate patterns.
- Total session counts alone fail to capture meaningful variation in user behavior and should not serve as the primary engagement metric.
- Working alliance with the chatbot independently predicts depression improvement and alters how engagement affects social support.
- Model-predicted clinical scores can reliably extend outcome analysis from small survey subsamples to tens of thousands of additional users.
Where Pith is reading between the lines
- Chatbot interfaces could include targeted prompts to encourage concentrated or intensive usage patterns that align with stronger outcomes.
- Similar phenotype clusters may appear in other conversational health tools, pointing toward pattern-based rather than frequency-based personalization strategies.
- Randomized tests could check whether shifting users into higher-benefit phenotypes produces measurable clinical gains beyond natural usage.
Load-bearing premise
The small subset of users who completed clinical questionnaires accurately represents the full user base and that changes in self-reported or model-predicted symptom scores reflect genuine clinical improvement without major selection or reporting biases.
What would settle it
A study that measures actual clinical outcomes through independent assessments in a sample where all users complete follow-ups and finds no difference in improvement across the five engagement phenotypes would falsify the dose-response associations.
Figures
read the original abstract
Background: Conversational AI chatbots are emerging as scalable mental health tools, but little is known about real world engagement or its relationship to clinical outcomes. Objective: To characterize engagement phenotypes among users of Ash, a purpose-built AI mental health chatbot, and examine associations with clinical change and working alliance. Methods: K-means clustering across eight behavioral features identified engagement phenotypes among 102,684 users. Subsamples completed the PHQ-9 (n=298), GAD-7 (n=298), and MSPSS (social support; n=194) baseline and 3 weeks; 11,437 users completed baseline Working Alliance Inventory (WAI). Results: Five engagement phenotypes emerged: Early Dropouts (52.2%), Power Users (1.6%), Intensive Users (4.1%), Weekly Users (25.3%), and a novel Concentrated User pattern (16.8%); across users, 66.9% had at least one overnight session (9pm-5am). Significant pre-post improvements occurred in depression (d = -0.51), anxiety (d = -0.57), and social support (d = 0.22). An observed dose-response gradient in self-reported depression improvement was replicated in a larger sample with model-predicted PHQ-9 (n = 23,813; Power Users d = -0.54; Early Dropouts d = -0.13). Higher working alliance predicted depression improvement and moderated the engagement-social support relationship. Conclusions: Engagement with AI mental health tools is multidimensional, and different clinical outcomes respond to different dimensions of use. Findings caution against treating session counts as a primary engagement metric and offer naturalistic evidence for the clinical value of purpose-built conversational AI.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper applies k-means clustering to eight behavioral features from 102,684 users of the Ash AI mental health chatbot, identifying five engagement phenotypes (Early Dropouts 52.2%, Power Users 1.6%, Intensive Users 4.1%, Weekly Users 25.3%, Concentrated Users 16.8%). It reports pre-post clinical improvements (PHQ-9 d=-0.51, GAD-7 d=-0.57) in a subsample of n=298 and replicates a dose-response gradient in depression improvement via model-predicted PHQ-9 scores in n=23,813 users, while also examining working alliance (WAI) in 11,437 users and concluding that engagement is multidimensional with differential outcome associations.
Significance. If the central associations hold after addressing selection and prediction issues, the work offers large-scale naturalistic evidence on real-world patterns of AI chatbot engagement and their links to mental health outcomes. Strengths include the scale of the clustering analysis and the explicit caution against relying solely on session counts; the dose-response replication attempt and working-alliance moderation findings could inform chatbot design if the model predictions prove independent of the clustering features.
major comments (2)
- [Methods and Results on model-predicted PHQ-9] The replication of the dose-response gradient in depression improvement (Power Users d=-0.54 vs. Early Dropouts d=-0.13) relies on model-predicted PHQ-9 scores for n=23,813 users. The manuscript must specify the training data, features, and validation procedure for this prediction model (Methods section on outcome modeling). If the model was trained using the same eight behavioral engagement features as the k-means clustering or on the n=298 clinical subsample without proper hold-out, the larger-sample gradient is not an independent replication and risks circularity with the phenotype definitions.
- [Results on clinical subsamples and dose-response] The clinical outcome analyses rest on a small subsample (n=298 for PHQ-9/GAD-7) drawn from 102,684 users. The paper should report a direct comparison of baseline engagement metrics, demographics, and phenotype distributions between clinical completers and non-completers (Results section on sample characteristics) to evaluate selection bias. Without this, the assumption that the observed dose-response generalizes is not supported and undermines the claim that different engagement dimensions produce differential clinical responses.
minor comments (2)
- [Methods on k-means clustering] The choice of k=5 clusters is presented without reported justification such as an elbow plot, silhouette analysis, or stability checks across random seeds; add this to the Methods section on clustering to allow readers to assess sensitivity of the phenotype definitions.
- [Results on engagement phenotypes] Table or figure presenting the eight behavioral features and their means per phenotype would improve interpretability of the 'Concentrated User' pattern; currently the abstract and text leave the distinguishing characteristics of this novel phenotype underspecified.
Simulated Author's Rebuttal
We appreciate the referee's careful reading and valuable suggestions. We will revise the manuscript to address the concerns regarding the prediction model details and potential selection bias in the clinical subsample.
read point-by-point responses
-
Referee: The replication of the dose-response gradient in depression improvement (Power Users d=-0.54 vs. Early Dropouts d=-0.13) relies on model-predicted PHQ-9 scores for n=23,813 users. The manuscript must specify the training data, features, and validation procedure for this prediction model (Methods section on outcome modeling). If the model was trained using the same eight behavioral engagement features as the k-means clustering or on the n=298 clinical subsample without proper hold-out, the larger-sample gradient is not an independent replication and risks circularity with the phenotype definitions.
Authors: We will revise the Methods section to provide a complete description of the PHQ-9 prediction model, including the training dataset (separate from both the main clustering sample and the clinical subsample), the input features (which do not include the eight behavioral engagement features used for clustering), and the validation approach (with appropriate hold-out procedures). This will confirm that the dose-response analysis in the larger sample is an independent replication and not subject to circularity. revision: yes
-
Referee: The clinical outcome analyses rest on a small subsample (n=298 for PHQ-9/GAD-7) drawn from 102,684 users. The paper should report a direct comparison of baseline engagement metrics, demographics, and phenotype distributions between clinical completers and non-completers (Results section on sample characteristics) to evaluate selection bias. Without this, the assumption that the observed dose-response generalizes is not supported and undermines the claim that different engagement dimensions produce differential clinical responses.
Authors: We agree that a comparison between clinical completers and non-completers is necessary to assess selection bias. In the revised Results section on sample characteristics, we will include a direct comparison of baseline engagement metrics, demographics, and phenotype distributions for the n=298 users versus the remaining users. This addition will allow readers to better evaluate the generalizability of the findings. revision: yes
Circularity Check
No significant circularity in derivation chain
full rationale
The paper's core analysis uses unsupervised k-means clustering on eight behavioral features to identify five engagement phenotypes in the full 102,684-user cohort, followed by direct pre-post clinical outcome measurements in a small subsample (n=298 for PHQ-9/GAD-7) and a separate supervised model to predict PHQ-9 scores for a larger group (n=23,813). No step reduces by construction to its inputs: the phenotypes are defined independently of the clinical outcomes, the observed dose-response is measured directly in the subsample, and the model-predicted extension applies a fitted mapping to new users without tautologically reproducing the clustering or the small-sample gradient. No self-citations, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The derivation remains self-contained observational analysis.
Axiom & Free-Parameter Ledger
free parameters (1)
- number of clusters k
axioms (2)
- domain assumption K-means clustering produces meaningful, separable groups from the chosen behavioral features
- domain assumption Pre-post changes in PHQ-9, GAD-7, and MSPSS reflect true clinical improvement rather than regression to the mean or reporting bias
invented entities (1)
-
engagement phenotypes
no independent evidence
Reference graph
Works this paper leans on
-
[1]
& Pato, M
Mongelli, F., Georgakopoulos, P. & Pato, M. T. Challenges and Opportunities to Meet the Mental Health Needs of Underserved and Disenfranchised Populations in the United States. Focus 18, 16–24 (2020). 5. Nunes, B. P., Thumé, E., Tomasi, E., Duro, S. M. S. & Facchini, L. A. Socioeconomic inequalities in the access to and quality of health care services. Re...
2020
-
[2]
Generative AI Purpose-built for Social and Mental Health: A Real-World Pilot
Abd-Alrazaq, A. A., Rababeh, A., Alajlani, M., Bewick, B. M. & Househ, M. Effectiveness and Safety of Using Chatbots to Improve Mental Health: Systematic Review and Meta-Analysis. J. Med. Internet Res. 22, e16021 (2020). 13. Casu, M., Triscari, S., Battiato, S., Guarnera, L. & Caponnetto, P. AI Chatbots for Mental Health: A Scoping Review of Effectiveness...
-
[3]
Lipschitz, J. M., Pike, C. K., Hogan, T. P., Murphy, S. A. & Burdick, K. E. The Engagement Problem: a Review of Engagement with Digital Mental Health Interventions and Recommendations for a Path Forward. Curr. Treat. Options Psychiatry 10, 119–135 (2023). 21. Kim, M., Yang, J., Ahn, W.-Y. & Choi, H. J. Machine Learning Analysis to Identify Digital Behavio...
work page internal anchor Pith review arXiv 2023
-
[4]
& Holmqvist, R
Falkenström, F., Granström, F. & Holmqvist, R. Working alliance predicts psychotherapy outcome even while controlling for prior symptom improvement. Psychother. Res. J. Soc. Psychother. Res. 24, (2013). 28. Napiwotzki, I. et al. Comparing Human and AI Therapists in Behavioral Activation for Depression: Cross-Sectional Questionnaire Study. JMIR Form. Res. ...
2013
-
[5]
Ajele, K. W. & Idemudia, E. S. Charting the course of depression care: a meta-analysis of reliability generalization of the patient health questionnaire (PHQ- 9) as the measure. Discov. Ment. Health 5, 50 (2025). 36. Lee, E.-H., Kang, E. H., Kang, H.-J. & Lee, H. Y. Measurement invariance of the patient health questionnaire-9 depression scale in a nationa...
2025
-
[6]
Horvath, A. O. & Greenberg, L. S. Development and validation of the Working Alliance Inventory. J. Couns. Psychol. 36, 223–233 (1989). 44. Paap, D. et al. The Working Alliance Inventory’s Measurement Properties: A Systematic Review. Front. Psychol. 13, 945294 (2022). 45. Yap, L. K., Ennis, E., Mulvenna, M. & Martinez-Carracedo, J. Defining and Measuring E...
-
[7]
Stamatis, C. A., Wolfe, E. C., Malgaroli, M. & Hull, T. D. Talking to a Human as an Attitudinal Barrier: A Mixed Methods Evaluation of Stigma, Access, and the Appeal of AI Mental Health Support. Preprint at https://doi.org/10.48550/arXiv.2604.09575 (2026). 53. Videtta, G. et al. Effects of therapeutic alliance on patients with major depressive disorder: a...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2604.09575 2026
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.