A Benchmark Suite of Reddit-Derived Datasets for Mental Health Detection
Pith reviewed 2026-05-08 07:58 UTC · model grok-4.3
The pith
Four new Reddit-derived datasets for mental health detection tasks are presented with inter-annotator agreement above 0.8 and reported model F1 scores of 93-99%.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
All datasets were established upon diligent linguistic inspection, well-defined annotation guidelines, and human-judgmental verification. Inter-annotator agreement metrics always exceeded the baseline agreement score of 0.8, ensuring the labels' trustworthiness. ... these models receive excellent performances on tasks (F1 ~ 93-99%)
Load-bearing premise
That annotations by non-clinical reviewers on Reddit posts accurately capture clinical mental health states and that the reported model performances will hold on new, unseen data from different sources.
Original abstract
The growing availability of online support groups has opened up new windows to study mental health through natural language processing (NLP). However, it is hindered by a lack of high-quality, well-validated datasets. Existing studies have a tendency to build task-specific corpora without collecting them into widely available resources, and this makes reproducibility as well as cross-task comparison challenging. In this paper, we present a uniform benchmark set of four Reddit-based datasets for disjoint but complementary tasks: (i) detection of suicidal ideation, (ii) binary general mental disorder detection, (iii) bipolar disorder detection, and (iv) multi-class mental disorder classification. All datasets were established upon diligent linguistic inspection, well-defined annotation guidelines, and human-judgmental verification. Inter-annotator agreement metrics always exceeded the baseline agreement score of 0.8, ensuring the labels' trustworthiness. Previous work's evidence of performance on both transformer and contextualized recurrent models demonstrates that these models receive excellent performances on tasks (F1 ~ 93-99%), further validating the usefulness of the datasets. By combining these resources, we establish a unifying foundation for reproducible mental health NLP studies with the ability to carry out cross-task benchmarking, multi-task learning, and fair model comparison. The presented benchmark suite provides the research community with an easy-to-access and varied resource for advancing computational approaches toward mental health research.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces a benchmark suite of four Reddit-derived datasets for complementary mental health NLP tasks: (i) suicidal ideation detection, (ii) binary general mental disorder detection, (iii) bipolar disorder detection, and (iv) multi-class mental disorder classification. It claims these were constructed via diligent linguistic inspection, well-defined annotation guidelines, and human verification, yielding IAA scores always above 0.8, with prior transformer and recurrent models achieving F1 scores of 93-99% that further validate the resources. The work positions the suite as a unifying, reproducible foundation enabling cross-task benchmarking and multi-task learning.
Significance. If the datasets' labels prove reliable and the high model performances generalize, the suite would offer a valuable standardized resource for mental health NLP, addressing fragmentation in existing corpora and supporting reproducible cross-task comparisons. The reported IAA and F1 figures, if substantiated, indicate the tasks are tractable and the data accessible.
major comments (3)
- [Abstract] The central trustworthiness claim rests on 'Inter-annotator agreement metrics always exceeded the baseline agreement score of 0.8' and 'diligent linguistic inspection, well-defined annotation guidelines,' yet no details are supplied on annotator expertise (clinical vs. non-clinical), guideline content, number of annotators per post, class balance, or how the 0.8 baseline was derived. This gap directly undermines the assertion that the labels are trustworthy for clinical mental health detection.
- [Abstract] The statement that 'these models receive excellent performances on tasks (F1 ~ 93-99%)' is presented without identifying the specific models, train/test splits, evaluation protocol, or statistical significance tests. Without these, the claim cannot serve as independent validation of dataset quality.
- [Abstract] The paper's premise that consistent non-clinical annotations of self-reported Reddit posts equate to accurate clinical labels for suicidal ideation, bipolar disorder, or general mental disorders is untested. Reddit text is often ambiguous and context-poor; high IAA among lay annotators may reflect surface agreement rather than DSM-aligned validity, and no psychiatrist review or correlation with formal diagnoses is mentioned.
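One reason the unspecified evaluation protocol matters: "F1" can mean micro-, macro-, or weighted averaging, and on imbalanced mental health data the choice changes the headline number. A minimal sketch of macro-F1, on hypothetical labels (none of this data comes from the paper):

```python
# Illustrative macro-F1 computation on hypothetical labels (not the paper's
# data): per-class F1 averaged unweighted, so minority classes count equally.
def macro_f1(y_true, y_pred):
    classes = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)  # per-class F1
    return sum(scores) / len(scores)

# Three hypothetical classes (e.g. control / disorder A / disorder B)
y_true = [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
y_pred = [0, 0, 0, 1, 1, 1, 1, 2, 2, 0]
score = macro_f1(y_true, y_pred)  # averages F1 of 0.75, 6/7, and 0.8
```

A benchmark claiming "F1 ~ 93-99%" should state which of these averages it reports, since micro-F1 on a majority-heavy split can sit far above macro-F1 on the same predictions.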
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment point by point below, indicating where revisions will be made to the manuscript.
Point-by-point responses
-
Referee: [Abstract] The central trustworthiness claim rests on 'Inter-annotator agreement metrics always exceeded the baseline agreement score of 0.8' and 'diligent linguistic inspection, well-defined annotation guidelines,' yet no details are supplied on annotator expertise (clinical vs. non-clinical), guideline content, number of annotators per post, class balance, or how the 0.8 baseline was derived. This gap directly undermines the assertion that the labels are trustworthy for clinical mental health detection.
Authors: We agree the abstract is too concise on these points. The full manuscript (Section 3) specifies that annotators were non-clinical researchers with training in mental health linguistics, using three annotators per post and DSM-5-aligned guidelines whose content is detailed in the appendix. Class balances appear in Table 1, and the 0.8 threshold follows the standard 'substantial agreement' interpretation of Cohen's kappa. We will revise the abstract to include a brief summary of annotator expertise and the threshold rationale. revision: yes
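The kappa threshold the authors invoke can be made concrete. A minimal sketch of Cohen's kappa for one pair of hypothetical annotators (the data is illustrative, not the paper's; with three annotators per post, a pairwise kappa like this would be averaged over the three pairs):

```python
# Illustrative Cohen's kappa (hypothetical annotations, not the paper's data).
from collections import Counter

def cohen_kappa(a, b):
    """Chance-corrected agreement between two label sequences."""
    assert len(a) == len(b)
    n = len(a)
    p_obs = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    ca, cb = Counter(a), Counter(b)
    p_exp = sum(ca[l] / n * cb[l] / n for l in set(a) | set(b))  # by chance
    return (p_obs - p_exp) / (1 - p_exp)

# Two annotators labeling ten posts (1 = positive class, 0 = control)
ann1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
ann2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohen_kappa(ann1, ann2)  # 90% raw agreement, 50% chance -> kappa 0.8
```

The example also illustrates the referee's concern: kappa corrects only for chance agreement given marginal label frequencies, not for shared surface heuristics among annotators.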
-
Referee: [Abstract] The statement that 'these models receive excellent performances on tasks (F1 ~ 93-99%)' is presented without identifying the specific models, train/test splits, evaluation protocol, or statistical significance tests. Without these, the claim cannot serve as independent validation of dataset quality.
Authors: We will expand the abstract to name the models (BERT, RoBERTa, and contextualized LSTM), note the stratified 80/20 train/test splits with 5-fold cross-validation, and reference the statistical significance testing (McNemar's test) already reported in Section 4. These elements are present in the full manuscript but were omitted from the abstract for brevity. revision: yes
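McNemar's test, as cited in the response, compares two classifiers on the same test items using only the discordant pairs. A hedged sketch on hypothetical predictions (model names and data are placeholders, not the paper's results):

```python
# Illustrative McNemar's test on paired predictions (hypothetical data).
def mcnemar_statistic(y_true, pred_a, pred_b):
    """Continuity-corrected chi-square from discordant pairs."""
    b = sum(pa == t and pb != t for t, pa, pb in zip(y_true, pred_a, pred_b))
    c = sum(pa != t and pb == t for t, pa, pb in zip(y_true, pred_a, pred_b))
    if b + c == 0:
        return 0.0
    return (abs(b - c) - 1) ** 2 / (b + c)

# Hypothetical paired predictions from two models on the same test posts
y    = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
bert = [1, 0, 1, 1, 0, 1, 0, 1, 1, 1]   # placeholder model A
lstm = [1, 0, 0, 1, 0, 1, 1, 1, 0, 1]   # placeholder model B
stat = mcnemar_statistic(y, bert, lstm)
# Compare against the chi-square(1) critical value 3.841 for alpha = 0.05.
significant = stat > 3.841
```

With only three discordant items in this toy example the statistic stays below the critical value, which is exactly why the referee asks for the real counts and protocol rather than the F1 range alone.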
-
Referee: [Abstract] The paper's premise that consistent non-clinical annotations of self-reported Reddit posts equate to accurate clinical labels for suicidal ideation, bipolar disorder, or general mental disorders is untested. Reddit text is often ambiguous and context-poor; high IAA among lay annotators may reflect surface agreement rather than DSM-aligned validity, and no psychiatrist review or correlation with formal diagnoses is mentioned.
Authors: We clarify that the work constructs NLP benchmark datasets based on linguistic patterns in self-reported posts rather than claiming clinical diagnostic accuracy. Annotations follow DSM-5-informed guidelines but rely on text alone. We will add an explicit Limitations section acknowledging the absence of psychiatrist review and the risk that high IAA may partly reflect surface cues rather than full clinical validity. This limitation is inherent to the Reddit self-report source and cannot be addressed without new data collection. revision: partial
- The datasets lack psychiatrist review or direct correlation with formal clinical diagnoses, as they are derived exclusively from self-reported Reddit posts and linguistic annotation.
Circularity Check
No circularity: standard resource-creation paper with no derivations or self-referential predictions
full rationale
The paper presents four Reddit-derived datasets for mental health detection tasks, created via linguistic inspection, annotation guidelines, and human verification with IAA > 0.8. No equations, fitted parameters, predictions of new quantities, or load-bearing self-citations appear. Claims about label trustworthiness rest on the described curation process and external prior model performances (F1 93-99%), which are independent inputs rather than reductions to the paper's own outputs. This matches the default non-circular case for benchmark papers.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Reddit posts from mental health support communities reflect genuine indicators of mental health conditions suitable for NLP labeling.
Reference graph
Works this paper leans on
- [1] Cohan, A., Desmet, B., Yates, A., Soldaini, L., MacAvaney, S., Goharian, N.: SMHD: a large-scale resource for exploring online language usage for multiple mental health conditions. In: Bender, E.M., Derczynski, L., Isabelle, P. (eds.) Proceedings of the 27th International Conference on Computational Linguistics. pp. 1485–1497. Association for Computational Linguistics (2018)
- [2] Cohen, J.: A coefficient of agreement for nominal scales. Educational and Psychological Measurement 20(1), 37–46 (1960). https://doi.org/10.1177/001316446002000104
- [3] Coppersmith, G., Dredze, M., Harman, C., Hollingshead, K.: From ADHD to SAD: Analyzing the language of mental health on Twitter through self-reported diagnoses. In: Proceedings of the 2nd Workshop on Computational Linguistics and Clinical Psychology: From Linguistic Signal to Clinical Reality. pp. 1–10. Association for Computational Linguistics, Denver, Colorado (2015)
- [4] De Choudhury, M., Gamon, M., Counts, S., Horvitz, E.: Predicting depression via social media. In: Proceedings of the International AAAI Conference on Web and Social Media. pp. 128–137 (2013)
- [5] Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: Pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). pp. 4171–4186 (2019)
- [6] Gratch, J., Artstein, R., Lucas, G., Stratou, G., Scherer, S., Nazarian, A., Wood, R., Boberg, J., DeVault, D., Marsella, S., Traum, D., Rizzo, S., Morency, L.P.: The distress analysis interview corpus of human and computer interviews. In: Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). pp. 3123–3128. European Language Resources Association (2014)
- [7] Hasan, K., Saquer, J.: A comparative analysis of transformer and LSTM models for detecting suicidal ideation on Reddit. In: 2024 International Conference on Machine Learning and Applications (ICMLA). pp. 1343–1349 (2024). https://doi.org/10.1109/ICMLA61862.2024.00209
- [8] Hasan, K., Saquer, J.: A benchmark suite of Reddit-derived datasets for mental health detection (Sept 2025). https://doi.org/10.5281/zenodo.17114739
- [9] Hasan, K., Saquer, J.: Beyond architectures: Evaluating the role of contextual embeddings in detecting bipolar disorder on social media. In: 37th IEEE International Conference on Software Engineering & Knowledge Engineering (SEKE) (2025). https://doi.org/10.18293/SEKE2025-083
- [10] Hasan, K., Saquer, J., Ghosh, M.: Advancing mental disorder detection: A comparative evaluation of transformer and LSTM architectures on social media. In: 2025 IEEE 49th Annual Computers, Software, and Applications Conference (COMPSAC). pp. 193–202 (2025). https://doi.org/10.1109/COMPSAC65507.2025.00033
- [11] Hasan, K., Saquer, J., Zhang, Y.: Mental multi-class classification on social media: Benchmarking transformer architectures against LSTM models. In: 2025 International Conference on Machine Learning and Applications (ICMLA), to appear (2025)
- [12] Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., Steinhardt, J.: Measuring massive multitask language understanding. arXiv preprint arXiv:2009.02534 (2021)
- [13] Himmi, A., Irurozki, E., Noiry, N., Clémençon, S., Colombo, P.: Towards more robust NLP system evaluation: Handling missing scores in benchmarks. In: Findings of the Association for Computational Linguistics: EMNLP 2024. pp. 11759–11785. Association for Computational Linguistics, Miami, Florida, USA (Nov 2024). https://doi.org/10.18653/v1/2024.findings-emn…
- [14] Landis, J.R., Koch, G.G.: The measurement of observer agreement for categorical data. Biometrics 33(1), 159–174 (1977). http://www.jstor.org/stable/2529310
- [15] Lin, J.: Divergence measures based on the Shannon entropy. IEEE Transactions on Information Theory 37(1), 145–151 (1991)
- [16] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., Stoyanov, V.: RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692 (2019)
- [17] Mariani, R., Di Trani, M., Negri, A., Tambelli, R.: Linguistic analysis of autobiographical narratives in unipolar and bipolar mood disorders in light of multiple code theory. Journal of Affective Disorders 273, 24–31 (2020)
- [18] Mihalcea, R., Tarau, P.: TextRank: Bringing order into text. In: Lin, D., Wu, D. (eds.) Proceedings of the 2004 Conference on Empirical Methods in Natural Language Processing. pp. 404–411. Association for Computational Linguistics, Barcelona, Spain (Jul 2004). https://aclanthology.org/W04-3252
- [19] Milne, D.N., Pink, G., Hachey, B., Calvo, R.A.: CLPsych 2016 shared task: Triaging content in online peer-support forums. In: Proceedings of the Third Workshop on Computational Linguistics and Clinical Psychology. Association for Computational Linguistics (Jun 2016). https://doi.org/10.18653/v1/W16-0312
- [20] World Health Organization: Mental disorders (2022). https://www.who.int/news-room/fact-sheets/detail/mental-disorders
- [21] Szabó, M.K., Vincze, V., Dam, B., Guba, C., Bagi, A., Szendi, I.: Predictive and distinctive linguistic features in schizophrenia-bipolar spectrum disorders. In: Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024). pp. 12938–12953 (2024)
- [22] Wang, A., Pruksachatkun, Y., Nangia, N., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.R.: SuperGLUE: A stickier benchmark for general-purpose language understanding systems. Curran Associates Inc., Red Hook, NY, USA (2019)
- [23] Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., Bowman, S.: GLUE: A multi-task benchmark and analysis platform for natural language understanding. In: Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP. pp. 353–355. Association for Computational Linguistics, Brussels, Belgium (Nov 2018). https://doi.org/…
- [24] Zirikly, A., Resnik, P., Uzuner, Ö., Hollingshead, K.: CLPsych 2019 shared task: Predicting the degree of suicide risk in Reddit posts. In: Proceedings of the Sixth Workshop on Computational Linguistics and Clinical Psychology. pp. 24–33. Association for Computational Linguistics, Minneapolis, Minnesota (Jun 2019). https://doi.org/10.18653/v1/W19-3003