Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection
Pith reviewed 2026-05-10 11:35 UTC · model grok-4.3
The pith
Speech-based depression detection models learn speaker identity instead of depression markers when training and test speakers overlap.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Using controlled partitions of the DAIC-WOZ dataset, the authors show that speaker overlap between train and test sets produces substantially higher accuracy than size-matched speaker-independent splits. The gap persists across models of varying complexity. Domain-adversarial training narrows but does not close it, indicating that depression-related features remain entangled with speaker identity under current approaches.
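The domain-adversarial component works by reversing the speaker classifier's gradient before it reaches the shared encoder. A minimal NumPy sketch of that building block (the gradient-reversal layer of Ganin et al.; this is an illustration, not the paper's actual implementation):

```python
import numpy as np

class GradientReversal:
    """Identity in the forward pass; flips and scales gradients in the
    backward pass. Minimal sketch of the DANN building block, pushing the
    encoder toward features the speaker classifier cannot exploit."""
    def __init__(self, lam):
        self.lam = lam  # reversal strength (lambda in Ganin et al.)

    def forward(self, x):
        return x  # features pass through unchanged

    def backward(self, grad_from_speaker_head):
        return -self.lam * grad_from_speaker_head  # reversed gradient

def encoder_gradient(g_task, g_speaker, lam):
    """Gradient reaching the shared encoder in one schematic training step:
    the task head's gradient flows through unchanged, the speaker head's
    gradient arrives sign-flipped and scaled."""
    grl = GradientReversal(lam)
    return g_task + grl.backward(g_speaker)
```

The sign flip is the entire mechanism: the encoder is simultaneously rewarded for predicting depression labels and penalized for making speakers identifiable.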
What carries the argument
The controlled data-splitting strategy that keeps training set size constant while enforcing zero speaker overlap between train and test partitions.
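The design can be sketched as follows; the names and the half-and-half partition of held-out speakers' segments are illustrative assumptions, not the paper's exact procedure:

```python
import random

def size_matched_splits(segments, test_speakers, n_train, seed=0):
    """Build two training sets of identical size from (speaker_id, segment_id)
    pairs. Set A shares no speakers with the test set (speaker-independent);
    Set B may reuse test speakers' other segments (speaker-overlapped).
    Illustrative sketch, not the paper's released code."""
    rng = random.Random(seed)
    held = [s for s in segments if s[0] in test_speakers]
    rng.shuffle(held)
    test = held[:len(held) // 2]       # evaluation segments
    leak_pool = held[len(held) // 2:]  # same speakers, different segments
    disjoint = [s for s in segments if s[0] not in test_speakers]
    train_a = rng.sample(disjoint, n_train)              # zero speaker overlap
    train_b = rng.sample(disjoint + leak_pool, n_train)  # overlap allowed
    return train_a, train_b, test
```

Holding `n_train` constant for both sets is what lets any accuracy gap be attributed to speaker overlap rather than to training set size.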
If this is right
- Reported accuracies from conventional splits overestimate how well the models will work on new speakers.
- True clinical utility requires speaker-independent test protocols.
- Current features mix speaker identity with any depression signal.
- Domain-adversarial methods reduce but do not remove speaker leakage.
Where Pith is reading between the lines
- Similar leakage may affect other speech-based detection tasks such as anxiety or fatigue monitoring.
- Re-testing published depression detectors with strict speaker separation could lower their reported numbers.
- Larger datasets with built-in speaker separation would help isolate genuine biomarkers.
- Deployment in clinics would need repeated validation on entirely new patient groups.
Load-bearing premise
The data-splitting strategy removes speaker leakage without creating new confounds in data distribution or label balance.
What would settle it
Model accuracy remaining as high on unseen-speaker tests as on overlapping-speaker tests under the controlled split, or the domain-adversarial network eliminating the performance gap entirely.
read the original abstract
This study investigates whether speech-based depression detection models learn depression-related acoustic biomarkers or instead rely on speaker identity cues. Using the DAIC-WOZ dataset, we propose a data-splitting strategy that controls speaker overlap between training and test sets while keeping the training size constant, and evaluate three models of varying complexity. Results show that speaker overlap significantly boosts performance, whereas accuracy drops sharply on unseen speakers. Even with a Domain-Adversarial Neural Network, a substantial performance gap remains. These findings indicate that depression-related features extracted by current speech models are highly entangled with speaker identity. Conventional evaluation protocols may therefore overestimate generalization and clinical utility, highlighting the need for strictly speaker-independent evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper investigates speaker leakage in speech-based depression detection using the DAIC-WOZ dataset. It proposes a data-splitting strategy that maintains constant training set size while eliminating speaker overlap between train and test sets, then compares performance of three models (including a domain-adversarial neural network) under speaker-overlapping versus speaker-independent conditions. Results indicate substantially higher performance with speaker overlap, with a remaining gap even under adversarial training, leading to the claim that depression-related features are highly entangled with speaker identity and that standard protocols overestimate generalization.
Significance. If the performance gaps can be shown to arise specifically from speaker identity entanglement rather than other distribution shifts, the work would highlight a critical limitation in current evaluation practices for clinical speech models, supporting the need for strictly speaker-independent protocols to improve reliability and utility in mental health applications.
major comments (2)
- [Data Splitting Strategy] Methods section on data splitting: The strategy controls speaker overlap while keeping training size constant, but the manuscript provides no verification (e.g., statistical comparison of depression label prevalence, gender/age balance, or acoustic feature statistics) that the resulting train distributions are matched across the overlapping and non-overlapping conditions; without this, the accuracy drop cannot be isolated to speaker leakage and may reflect confounding shifts in label balance or covariates.
- [Results] Results section: The reported performance gaps are described qualitatively (e.g., 'significantly boosts' and 'drops sharply'), but the manuscript lacks explicit metrics (accuracy, F1, AUC), statistical tests with p-values, error bars or confidence intervals, and details on the number of splits or speaker counts per condition; this limits the strength of evidence for the entanglement claim.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help clarify the evidence needed to support our claims about speaker leakage. We address each major point below and will incorporate revisions to strengthen the manuscript.
read point-by-point responses
-
Referee: [Data Splitting Strategy] Methods section on data splitting: The strategy controls speaker overlap while keeping training size constant, but the manuscript provides no verification (e.g., statistical comparison of depression label prevalence, gender/age balance, or acoustic feature statistics) that the resulting train distributions are matched across the overlapping and non-overlapping conditions; without this, the accuracy drop cannot be isolated to speaker leakage and may reflect confounding shifts in label balance or covariates.
Authors: We agree that explicit verification is necessary to isolate speaker leakage from other potential distribution shifts. Our splitting procedure was designed to maintain identical training set sizes and randomly select non-overlapping speakers while preserving overall label proportions where possible, but the original manuscript did not report comparative statistics. In the revision, we will add a dedicated subsection (or supplementary table) in Methods/Results that compares depression label prevalence, gender and age distributions, and summary acoustic statistics (e.g., mean F0, energy, MFCC means) across the two training conditions using appropriate tests (Kolmogorov-Smirnov or chi-squared). Any observed imbalances will be quantified and discussed as potential confounds. This directly addresses the concern and bolsters attribution to speaker identity entanglement. revision: yes
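As a concrete form of the proposed check, label prevalence in the two training conditions could be compared with a 2×2 chi-squared statistic. The function below is an illustrative sketch of that comparison, not code from the manuscript:

```python
import numpy as np

def label_balance_chi2(pos_a, n_a, pos_b, n_b):
    """Chi-squared statistic for a 2x2 table comparing depressed-label
    prevalence between two training conditions (A: speaker-independent,
    B: speaker-overlapped). A large statistic signals a label-balance
    confound; near zero supports matched conditions."""
    table = np.array([[pos_a, n_a - pos_a],
                      [pos_b, n_b - pos_b]], dtype=float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row * col / table.sum()  # independence model
    return ((table - expected) ** 2 / expected).sum()
```

In practice `scipy.stats.chi2_contingency` would also supply the p-value; the same pattern extends to gender and age distributions.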
-
Referee: [Results] Results section: The reported performance gaps are described qualitatively (e.g., 'significantly boosts' and 'drops sharply'), but the manuscript lacks explicit metrics (accuracy, F1, AUC), statistical tests with p-values, error bars or confidence intervals, and details on the number of splits or speaker counts per condition; this limits the strength of evidence for the entanglement claim.
Authors: We acknowledge that the original presentation relied too heavily on qualitative descriptions. Although the manuscript contains performance figures and mentions multiple splits, it does not tabulate exact values or statistical details. We will revise the Results section to include a comprehensive table reporting accuracy, F1-score, and AUC (with means and standard deviations) for all three models under both speaker-overlapping and speaker-independent conditions. We will specify the exact number of speakers and utterances per split, add error bars to all plots, and report p-values from paired statistical tests (e.g., Wilcoxon signed-rank) across the repeated splits to quantify significance. These additions will provide the quantitative rigor needed to support the entanglement conclusion. revision: yes
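A minimal version of the proposed paired test, computing the Wilcoxon signed-rank statistic over per-split accuracy pairs (ordinal ranks, no tie averaging; in practice `scipy.stats.wilcoxon` would supply the p-value):

```python
import numpy as np

def wilcoxon_w(scores_overlap, scores_independent):
    """Wilcoxon signed-rank statistic (smaller of W+ and W-) for paired
    accuracies from repeated splits. Sketch of the test proposed in the
    revision; zero differences are dropped, ties are not rank-averaged."""
    d = np.asarray(scores_overlap, float) - np.asarray(scores_independent, float)
    d = d[d != 0]
    ranks = np.abs(d).argsort().argsort() + 1.0  # ordinal ranks of |d|
    w_plus = ranks[d > 0].sum()
    w_minus = ranks[d < 0].sum()
    return min(w_plus, w_minus)
```

A statistic near zero with enough splits means the overlap condition wins nearly every paired comparison, which is exactly the pattern the entanglement claim predicts.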
Circularity Check
No circularity: purely empirical evaluation on a public dataset
full rationale
The paper conducts a controlled empirical study on the DAIC-WOZ dataset by comparing model performance across speaker-overlapping vs. speaker-independent splits while holding training size constant. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains underpin the claims. Results are direct train-test accuracy comparisons that are externally replicable and falsifiable. The central finding (performance drop on unseen speakers) follows from the experimental design without reducing to self-definition or imported uniqueness. Minor self-citations, if present, are not load-bearing for the core argument.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The DAIC-WOZ dataset contains sufficient unique speakers and balanced depression labels to support controlled splits that isolate speaker effects while maintaining training size.
Reference graph
Works this paper leans on
-
[1]
Introduction Major Depressive Disorder affects over 332 million people worldwide, with an estimated burden exceeding 56 million disability-adjusted life years [1, 2]. Despite this burden, most affected individuals remain undiagnosed or untreated due to systemic healthcare barriers and limited access to mental health professionals [3–5]. As a result, autom...
-
[2]
control group
The Proposed Data Split 2.1. Dataset and Preprocessing We use the Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ) dataset [8, 9]. Each participant completes a single clinical interview lasting 5–20 minutes. Depression severity is assessed using the PHQ-8 [19], with a score ≥ 10 indicating clinical depression. We adopt the standard subset of 1...
-
[3]
Each model is tested in its original form and with a DANN extension to mitigate speaker-specific information
Methodology We evaluate three model families of increasing architectural complexity under both speaker-independent (Training Set A) and speaker-overlapped (Training Set B) conditions. Each model is tested in its original form and with a DANN extension to mitigate speaker-specific information. 3.1. Domain-Adversarial Neural Network Domain-Adversarial Neura...
-
[4]
To simulate these under controlled conditions, we use the size-matched split described in Section 2.2, yielding two training sets and a shared test set
Experiments We evaluate two clinical scenarios: initial diagnosis and repeated diagnosis. To simulate these under controlled conditions, we use the size-matched split described in Section 2.2, yielding two training sets and a shared test set. In the initial diagnosis scenario (Training Set A), the training set contains 5,117 segments with no speaker over...
-
[5]
Results and Analysis Table 1 illustrates the performance across the three evaluated architectures in multiple settings. Performance under the speaker-overlapped setting. Under the speaker-overlapped setting (Training Set B), most architectures achieve very high depression classification performance. In the Wav2Vec-Linear Probing model, the Original v...
-
[6]
Conclusion We introduced a size-controlled data split and conducted systematic evaluations under speaker-overlapped and speaker-independent conditions to examine identity leakage in speech-based depression detection. Across architectures and training strategies, the high accuracy observed under speaker overlap decreases substantially when evaluated ...
-
[7]
Acknowledgments We thank the Johns Hopkins University Data Science and AI (DSAI) Institute for supporting this research through a faculty startup package
-
[8]
These tools were not used to generate scientific content, results, experimental designs, analyses, or conclusions
Generative AI Use Disclosure Generative AI tools were employed solely for language polishing of text written by the authors. These tools were not used to generate scientific content, results, experimental designs, analyses, or conclusions. All authors are responsible for the full content of this paper and consent to its submission
-
[9]
H. Herrman, V. Patel, C. Kieling, M. Berk, C. Buchweitz, P. Cuijpers, T. A. Furukawa, R. C. Kessler, B. A. Kohrt, M. Maj et al., "Time for united action on depression: a Lancet–World Psychiatric Association Commission," The Lancet, vol. 399, no. 10328, pp. 957–1022, 2022.
-
[10]
J. Rong, X. Wang, P. Cheng, D. Li, and D. Zhao, "Global, regional and national burden of depressive disorders and attributable risk factors, from 1990 to 2021: results from the 2021 Global Burden of Disease study," The British Journal of Psychiatry, pp. 1–10, 2025.
-
[11]
P. Corrigan, "How stigma interferes with mental health care," American Psychologist, vol. 59, no. 7, p. 614, 2004.
-
[12]
M. Moitra, D. Santomauro, P. Y. Collins, T. Vos, H. Whiteford, S. Saxena, and A. J. Ferrari, "The global gap in treatment coverage for major depressive disorder in 84 countries from 2000–2019: A systematic review and Bayesian meta-regression analysis," PLoS Medicine, vol. 19, no. 2, p. e1003901, 2022.
-
[13]
P. S. Wang, P. Berglund, M. Olfson, H. A. Pincus, K. B. Wells, and R. C. Kessler, "Failure and delay in initial treatment contact after first onset of mental disorders in the National Comorbidity Survey Replication," Archives of General Psychiatry, vol. 62, no. 6, pp. 603–613, 2005.
-
[14]
S. S. Leal, S. Ntalampiras, and R. Sassi, "Speech-based depression assessment: A comprehensive survey," IEEE Transactions on Affective Computing, 2024.
-
[15]
L. Liu, L. Liu, H. A. Wafa, F. Tydeman, W. Xie, and Y. Wang, "Diagnostic accuracy of deep learning using speech samples in depression: a systematic review and meta-analysis," Journal of the American Medical Informatics Association, vol. 31, no. 10, pp. 2394–2404, 2024.
-
[16]
M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Torres Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, "AVEC 2016: Depression, mood, and emotion recognition workshop and challenge," in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, 2016, pp. 3–10.
-
[17]
J. Gratch, R. Artstein, G. M. Lucas, G. Stratou, S. Scherer, A. Nazarian, R. Wood, J. Boberg, D. DeVault, S. Marsella et al., "The distress analysis interview corpus of human and computer interviews," in LREC, vol. 14, Reykjavik, 2014, pp. 3123–3128.
-
[18]
K. Rezaee, "Depression detection from speech data using deep learning–based optimized temporal–frequency–channel attention with interpretable acoustic–prosodic mapping," Journal of Affective Disorders, p. 121077, 2026.
-
[19]
A. K. Gupta, A. Dhamaniya, and P. Gupta, "RADIANCE: Reliable and interpretable depression detection from speech using transformer," Computers in Biology and Medicine, vol. 183, p. 109325, 2024.
-
[20]
X. Huang, F. Wang, Y. Gao, Y. Liao, W. Zhang, L. Zhang, and Z. Xu, "Depression recognition using voice-based pre-training model," Scientific Reports, vol. 14, no. 1, p. 12734, 2024.
-
[21]
A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020.
-
[22]
V. Berisha and J. M. Liss, "Responsible development of clinical speech AI: bridging the gap between clinical research and technology," NPJ Digital Medicine, vol. 7, no. 1, p. 208, 2024.
-
[23]
I. Danylenko and O. Unold, "Common Pitfalls and Recommendations for Use of Machine Learning in Depression Severity Estimation: DAIC-WOZ Study," Applied Sciences, vol. 16, no. 1, p. 422, 2025.
-
[24]
V. Ravi, J. Wang, J. Flint, and A. Alwan, "Enhancing accuracy and privacy in speech-based depression detection through speaker disentanglement," Computer Speech & Language, vol. 86, p. 101605, 2024.
-
[25]
S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller, "Unmasking Clever Hans predictors and assessing what machines really learn," Nature Communications, vol. 10, no. 1, p. 1096, 2019.
-
[26]
Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, "Domain-adversarial training of neural networks," Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016.
-
[27]
K. Kroenke, T. W. Strine, R. L. Spitzer, J. B. Williams, J. T. Berry, and A. H. Mokdad, "The PHQ-8 as a measure of current depression in the general population," Journal of Affective Disorders, vol. 114, no. 1–3, pp. 163–173, 2009.
-
[28]
V. Ravi, J. Wang, J. Flint, and A. Alwan, "A step towards preserving speakers' identity while detecting depression via speaker disentanglement," in Interspeech, 2022, p. 3338.
-
[29]
J. Wang, V. Ravi, and A. Alwan, "Non-uniform speaker disentanglement for depression detection from raw speech signals," in Interspeech, 2023, p. 2343.
-
[30]
R. M. French, "Catastrophic forgetting in connectionist networks," Trends in Cognitive Sciences, vol. 3, no. 4, pp. 128–135, 1999.
-
[31]
A. Balagopalan and J. Novikova, "Comparing Acoustic-Based Approaches for Alzheimer's Disease Detection," in Interspeech, 2021, pp. 3800–3804.
-
[32]
A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, "XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale," in Interspeech, 2022, pp. 2278–2282.
-
[33]
F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., "The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing," IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2015.
-
[34]
F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE: the Munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462.
-
[35]
Q. Zhang, S. Wen, and T. Hu, "Audio deepfake detection with self-supervised XLS-R and SLS classifier," in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6765–6773.
-
[36]
M. C. Hinojosa Lee, J. Braet, and J. Springael, "Performance metrics for multilabel emotion classification: comparing micro, macro, and weighted F1-scores," Applied Sciences, vol. 14, no. 21, p. 9863, 2024.
-
[37]
R. Ramasubbu, M. R. Brown, F. Cortese, I. Gaxiola, B. Goodyear, A. J. Greenshaw, S. M. Dursun, and R. Greiner, "Accuracy of automated classification of major depressive disorder as a function of symptom severity," NeuroImage: Clinical, vol. 12, pp. 320–331, 2016.
-
[38]
R. Togneri and D. Pullella, "An overview of speaker identification: Accuracy and robustness issues," IEEE Circuits and Systems Magazine, vol. 11, no. 2, pp. 23–61, 2011.
discussion (0)