pith. machine review for the scientific record.

arxiv: 2604.14354 · v1 · submitted 2026-04-15 · 📡 eess.AS

Recognition: unknown

Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 11:35 UTC · model grok-4.3

classification 📡 eess.AS
keywords speaker leakage · depression detection · speech processing · speaker-independent evaluation · domain adversarial training · acoustic biomarkers · DAIC-WOZ dataset

The pith

Speech depression detection models learn speaker identities instead of depression markers when training and test speakers overlap.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether speech models detect depression through acoustic biomarkers or by picking up speaker-specific cues. It introduces a data split that removes all speaker overlap between training and testing while holding training size fixed, then measures accuracy on three models. Performance rises markedly with speaker overlap and falls when speakers are unseen, with a remaining gap even after domain-adversarial training. This pattern implies that standard evaluation methods credit models with stronger depression detection than they actually achieve on new individuals.

Core claim

Using controlled partitions of the DAIC-WOZ dataset, the authors show that speaker overlap between train and test sets produces substantially higher accuracy than speaker-independent splits of the same size. The drop occurs across models of varying complexity. Domain-adversarial training narrows but does not close the gap, indicating that depression-related features remain entangled with speaker identity under current approaches.

What carries the argument

The controlled data-splitting strategy that keeps training set size constant while enforcing zero speaker overlap between train and test partitions.
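The splitting idea can be sketched in a few lines (a hypothetical illustration; the paper's exact DAIC-WOZ partition sizes and speaker assignments are not reproduced here):

```python
import random

def size_matched_splits(segments, test_speakers, train_size, seed=0):
    """segments: list of (speaker_id, segment_id) pairs.
    Returns two training sets of identical size plus a shared test set:
      A - no speaker overlap with the test set (speaker-independent),
      B - sampled from all speakers, so test speakers may leak in."""
    rng = random.Random(seed)
    test = [s for s in segments if s[0] in test_speakers]
    pool_a = [s for s in segments if s[0] not in test_speakers]
    pool_b = list(segments)  # overlap with test speakers allowed
    train_a = rng.sample(pool_a, train_size)  # size-matched to train_b
    train_b = rng.sample(pool_b, train_size)
    return train_a, train_b, test
```

Because both training sets have the same size and share a test set, any accuracy gap between them is attributable to speaker overlap rather than to the amount of training data.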

If this is right

  • Reported accuracies from conventional splits overestimate how well the models will work on new speakers.
  • True clinical utility requires speaker-independent test protocols.
  • Current features mix speaker identity with any depression signal.
  • Domain-adversarial methods reduce but do not remove speaker leakage.
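The domain-adversarial method named in the last bullet hinges on a gradient reversal layer (Ganin et al., 2016). A minimal numpy sketch of that mechanism, not the paper's implementation: the forward pass is the identity, while the backward pass negates and scales the gradient flowing from the speaker classifier, pushing the encoder away from speaker-identifying features.

```python
import numpy as np

class GradientReversal:
    """Gradient reversal layer (GRL): identity forward, -lambda * grad backward."""
    def __init__(self, lam=1.0):
        self.lam = lam

    def forward(self, x):
        return x                        # pass features through unchanged

    def backward(self, grad_out):
        return -self.lam * grad_out     # reversed gradient reaches the encoder
```

In a full DANN, the encoder output feeds both the depression classifier directly and the speaker classifier through this layer, so the encoder is trained to help the former while hurting the latter.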

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar leakage may affect other speech-based detection tasks such as anxiety or fatigue monitoring.
  • Re-testing published depression detectors with strict speaker separation could lower their reported numbers.
  • Larger datasets with built-in speaker separation would help isolate genuine biomarkers.
  • Deployment in clinics would need repeated validation on entirely new patient groups.

Load-bearing premise

The data-splitting strategy removes speaker leakage without creating new confounds in data distribution or label balance.

What would settle it

Model accuracy remaining as high on unseen-speaker tests as on overlapping-speaker tests under the controlled split, or domain-adversarial training eliminating the performance gap entirely.

Figures

Figures reproduced from arXiv: 2604.14354 by Aurosweta Mahapatra, Berrak Sisman, Emily Mower Provost, Hsiang-Chen Yeh, Luqi Sun, Shreeram Suresh Chandra.

Figure 1
Figure 1. Size-matched data split with controlled subject overlap. (Training Set A: no speaker overlap with test set; Training Set B: speaker overlap with test set.) view at source ↗
Figure 2
Figure 2. Architectures of the three model groups: (a) Wav2Vec-Linear Probing Models; (b) XLSR-eGeMAPS Concatenation Models; (c) Wav2Vec-SLS Models. (GRL: Gradient Reversal Layer; DANN: Domain-Adversarial Neural Network.) view at source ↗
read the original abstract

This study investigates whether speech-based depression detection models learn depression-related acoustic biomarkers or instead rely on speaker identity cues. Using the DAIC-WOZ dataset, we propose a data-splitting strategy that controls speaker overlap between training and test sets while keeping the training size constant, and evaluate three models of varying complexity. Results show that speaker overlap significantly boosts performance, whereas accuracy drops sharply on unseen speakers. Even with a Domain-Adversarial Neural Network, a substantial performance gap remains. These findings indicate that depression-related features extracted by current speech models are highly entangled with speaker identity. Conventional evaluation protocols may therefore overestimate generalization and clinical utility, highlighting the need for strictly speaker-independent evaluation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper investigates speaker leakage in speech-based depression detection using the DAIC-WOZ dataset. It proposes a data-splitting strategy that maintains constant training set size while eliminating speaker overlap between train and test sets, then compares performance of three models (including a domain-adversarial neural network) under speaker-overlapping versus speaker-independent conditions. Results indicate substantially higher performance with speaker overlap, with a remaining gap even under adversarial training, leading to the claim that depression-related features are highly entangled with speaker identity and that standard protocols overestimate generalization.

Significance. If the performance gaps can be shown to arise specifically from speaker identity entanglement rather than other distribution shifts, the work would highlight a critical limitation in current evaluation practices for clinical speech models, supporting the need for strictly speaker-independent protocols to improve reliability and utility in mental health applications.

major comments (2)
  1. [Data Splitting Strategy] Methods section on data splitting: The strategy controls speaker overlap while keeping training size constant, but the manuscript provides no verification (e.g., statistical comparison of depression label prevalence, gender/age balance, or acoustic feature statistics) that the resulting train distributions are matched across the overlapping and non-overlapping conditions; without this, the accuracy drop cannot be isolated to speaker leakage and may reflect confounding shifts in label balance or covariates.
  2. [Results] Results section: The reported performance gaps are described qualitatively (e.g., 'significantly boosts' and 'drops sharply'), but the manuscript lacks explicit metrics (accuracy, F1, AUC), statistical tests with p-values, error bars or confidence intervals, and details on the number of splits or speaker counts per condition; this limits the strength of evidence for the entanglement claim.
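The balance verification asked for in major comment 1 is straightforward to run. A dependency-free sketch with hypothetical inputs: label prevalence compared via a two-proportion z-test, and a continuous covariate (say, mean F0 per speaker) via a Kolmogorov-Smirnov statistic computed directly from the empirical CDFs.

```python
import math

def two_proportion_z(pos_a, n_a, pos_b, n_b):
    """z-statistic for a difference in label prevalence between two splits."""
    p = (pos_a + pos_b) / (n_a + n_b)              # pooled prevalence
    se = math.sqrt(p * (1 - p) * (1 / n_a + 1 / n_b))
    return (pos_a / n_a - pos_b / n_b) / se

def ks_statistic(xs, ys):
    """Maximum gap between the two empirical CDFs (the KS statistic)."""
    xs, ys = sorted(xs), sorted(ys)
    grid = sorted(set(xs) | set(ys))
    cdf = lambda v, s: sum(x <= v for x in s) / len(s)
    return max(abs(cdf(v, xs) - cdf(v, ys)) for v in grid)
```

A |z| below 1.96 and a small KS statistic across key covariates would support the claim that the accuracy drop reflects speaker leakage rather than a distribution shift between Training Sets A and B.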

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the evidence needed to support our claims about speaker leakage. We address each major point below and will incorporate revisions to strengthen the manuscript.

read point-by-point responses
  1. Referee: [Data Splitting Strategy] Methods section on data splitting: The strategy controls speaker overlap while keeping training size constant, but the manuscript provides no verification (e.g., statistical comparison of depression label prevalence, gender/age balance, or acoustic feature statistics) that the resulting train distributions are matched across the overlapping and non-overlapping conditions; without this, the accuracy drop cannot be isolated to speaker leakage and may reflect confounding shifts in label balance or covariates.

    Authors: We agree that explicit verification is necessary to isolate speaker leakage from other potential distribution shifts. Our splitting procedure was designed to maintain identical training set sizes and randomly select non-overlapping speakers while preserving overall label proportions where possible, but the original manuscript did not report comparative statistics. In the revision, we will add a dedicated subsection (or supplementary table) in Methods/Results that compares depression label prevalence, gender and age distributions, and summary acoustic statistics (e.g., mean F0, energy, MFCC means) across the two training conditions using appropriate tests (Kolmogorov-Smirnov or chi-squared). Any observed imbalances will be quantified and discussed as potential confounds. This directly addresses the concern and bolsters attribution to speaker identity entanglement. revision: yes

  2. Referee: [Results] Results section: The reported performance gaps are described qualitatively (e.g., 'significantly boosts' and 'drops sharply'), but the manuscript lacks explicit metrics (accuracy, F1, AUC), statistical tests with p-values, error bars or confidence intervals, and details on the number of splits or speaker counts per condition; this limits the strength of evidence for the entanglement claim.

    Authors: We acknowledge that the original presentation relied too heavily on qualitative descriptions. Although the manuscript contains performance figures and mentions multiple splits, it does not tabulate exact values or statistical details. We will revise the Results section to include a comprehensive table reporting accuracy, F1-score, and AUC (with means and standard deviations) for all three models under both speaker-overlapping and speaker-independent conditions. We will specify the exact number of speakers and utterances per split, add error bars to all plots, and report p-values from paired statistical tests (e.g., Wilcoxon signed-rank) across the repeated splits to quantify significance. These additions will provide the quantitative rigor needed to support the entanglement conclusion. revision: yes
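The paired comparison promised in this response can be sketched with made-up per-split accuracies. The authors propose a Wilcoxon signed-rank test; an exact sign test stands in below to keep the sketch dependency-free, and the accuracy values are hypothetical, not the paper's results.

```python
import math
import statistics as st

def sign_test(diffs):
    """Exact two-sided sign test on paired differences (zeros dropped)."""
    nonzero = [d for d in diffs if d != 0]
    n, k = len(nonzero), sum(d > 0 for d in nonzero)
    tail = sum(math.comb(n, i) for i in range(min(k, n - k) + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Hypothetical per-split accuracies for one model under the two conditions.
acc_overlap     = [0.91, 0.88, 0.93, 0.90, 0.92]
acc_independent = [0.62, 0.65, 0.60, 0.66, 0.63]
gaps = [a - b for a, b in zip(acc_overlap, acc_independent)]
print(round(st.mean(gaps), 3), sign_test(gaps))  # mean gap and p-value
```

With only five splits the exact sign test bottoms out at p = 0.0625, which illustrates why the number of repeated splits matters as much as the size of the gap.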

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation on public dataset

full rationale

The paper conducts a controlled empirical study on the DAIC-WOZ dataset by comparing model performance across speaker-overlapping vs. speaker-independent splits while holding training size constant. No derivations, equations, fitted parameters renamed as predictions, or self-citation chains underpin the claims. Results are direct train-test accuracy comparisons that are externally replicable and falsifiable. The central finding (performance drop on unseen speakers) follows from the experimental design without reducing to self-definition or imported uniqueness. Minor self-citations, if present, are not load-bearing for the core argument.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The study rests on standard machine-learning assumptions about the DAIC-WOZ dataset and model training rather than introducing new free parameters or invented entities.

axioms (1)
  • domain assumption The DAIC-WOZ dataset contains sufficient unique speakers and balanced depression labels to support controlled splits that isolate speaker effects while maintaining training size.
    The data-splitting strategy and performance comparisons depend on this property of the dataset.

pith-pipeline@v0.9.0 · 5439 in / 1163 out tokens · 74498 ms · 2026-05-10T11:35:21.965336+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

38 extracted references · 1 canonical work page · 1 internal anchor

  1. [1]

    Who is Speaking or Who is Depressed? A Controlled Study of Speaker Leakage in Speech-Based Depression Detection

    Introduction Major Depressive Disorder affects over 332 million people worldwide, with an estimated burden exceeding 56 million disability-adjusted life years [1, 2]. Despite this burden, most affected individuals remain undiagnosed or untreated due to systemic healthcare barriers and limited access to mental health professionals [3–5]. As a result, autom...

  2. [2]

    control group

The Proposed Data Split 2.1. Dataset and Preprocessing We use the Distress Analysis Interview Corpus – Wizard of Oz (DAIC-WOZ) dataset [8, 9]. Each participant completes a single clinical interview lasting 5–20 minutes. Depression severity is assessed using the PHQ-8 [19], with a score ≥ 10 indicating clinical depression. We adopt the standard subset of 1...

  3. [3]

    Each model is tested in its original form and with a DANN extension to mitigate speaker-specific information

    Methodology We evaluate three model families of increasing architectural complexity under both speaker-independent (Training Set A) and speaker-overlapped (Training Set B) conditions. Each model is tested in its original form and with a DANN extension to mitigate speaker-specific information. 3.1. Domain-Adversarial Neural Network Domain-Adversarial Neura...

  4. [4]

    To simulate these under controlled conditions, we use the size-matched split described in Section 2.2, yielding two training sets and a shared test set

    Experiments We evaluate two clinical scenarios: initial diagnosis and repeated diagnosis. To simulate these under controlled conditions, we use the size-matched split described in Section 2.2, yielding two training sets and a shared test set. In the initial diagnosis scenario (Training Set A), the training set contains 5,117 segments with no speaker over...

  5. [5]

Results and Analysis Table 1 illustrates the performance across the three evaluated architectures in multiple settings. Performance under the speaker-overlapped setting. Under the speaker-overlapped setting (Training Set B), most architectures achieve very high depression classification performance. In the Wav2Vec-Linear Probing model, the Original v...

  6. [6]

Conclusion We introduced a size-controlled data split and conducted systematic evaluations under speaker-overlapped and speaker-independent conditions to examine identity leakage in speech-based depression detection. Across architectures and training strategies, the high accuracy observed under speaker overlap decreases substantially when evaluated ...

  7. [7]

    Acknowledgments We thank the Johns Hopkins University Data Science and AI (DSAI) Institute for supporting this research through a faculty startup package

  8. [8]

    These tools were not used to generate scientific content, results, experimental designs, analyses, or conclusions

    Generative AI Use Disclosure Generative AI tools were employed solely for language polishing of text written by the authors. These tools were not used to generate scientific content, results, experimental designs, analyses, or conclusions. All authors are responsible for the full content of this paper and consent to its submission

  9. [9]

    Time for united action on depression: a Lancet–World Psychiatric Association Commission

    H. Herrman, V. Patel, C. Kieling, M. Berk, C. Buchweitz, P. Cuijpers, T. A. Furukawa, R. C. Kessler, B. A. Kohrt, M. Maj et al., "Time for united action on depression: a Lancet–World Psychiatric Association Commission," The Lancet, vol. 399, no. 10328, pp. 957–1022, 2022

  10. [10]

    Global, regional and national burden of depressive disorders and attributable risk factors, from 1990 to 2021: results from the 2021 Global Burden of Disease study

    J. Rong, X. Wang, P. Cheng, D. Li, and D. Zhao, "Global, regional and national burden of depressive disorders and attributable risk factors, from 1990 to 2021: results from the 2021 Global Burden of Disease study," The British Journal of Psychiatry, pp. 1–10, 2025

  11. [11]

    How stigma interferes with mental health care

    P. Corrigan, "How stigma interferes with mental health care," American Psychologist, vol. 59, no. 7, p. 614, 2004

  12. [12]

    The global gap in treatment coverage for major depressive disorder in 84 countries from 2000–2019: A systematic review and Bayesian meta-regression analysis

    M. Moitra, D. Santomauro, P. Y. Collins, T. Vos, H. Whiteford, S. Saxena, and A. J. Ferrari, "The global gap in treatment coverage for major depressive disorder in 84 countries from 2000–2019: A systematic review and Bayesian meta-regression analysis," PLoS Medicine, vol. 19, no. 2, p. e1003901, 2022

  13. [13]

    Failure and delay in initial treatment contact after first onset of mental disorders in the National Comorbidity Survey Replication

    P. S. Wang, P. Berglund, M. Olfson, H. A. Pincus, K. B. Wells, and R. C. Kessler, "Failure and delay in initial treatment contact after first onset of mental disorders in the National Comorbidity Survey Replication," Archives of General Psychiatry, vol. 62, no. 6, pp. 603–613, 2005

  14. [14]

    Speech-based depression assessment: A comprehensive survey

    S. S. Leal, S. Ntalampiras, and R. Sassi, "Speech-based depression assessment: A comprehensive survey," IEEE Transactions on Affective Computing, 2024

  15. [15]

    Diagnostic accuracy of deep learning using speech samples in depression: a systematic review and meta-analysis

    L. Liu, L. Liu, H. A. Wafa, F. Tydeman, W. Xie, and Y. Wang, "Diagnostic accuracy of deep learning using speech samples in depression: a systematic review and meta-analysis," Journal of the American Medical Informatics Association, vol. 31, no. 10, pp. 2394–2404, 2024

  16. [16]

    AVEC 2016: Depression, mood, and emotion recognition workshop and challenge

    M. Valstar, J. Gratch, B. Schuller, F. Ringeval, D. Lalanne, M. Torres Torres, S. Scherer, G. Stratou, R. Cowie, and M. Pantic, "AVEC 2016: Depression, mood, and emotion recognition workshop and challenge," in Proceedings of the 6th International Workshop on Audio/Visual Emotion Challenge, 2016, pp. 3–10

  17. [17]

    The distress analysis interview corpus of human and computer interviews

    J. Gratch, R. Artstein, G. M. Lucas, G. Stratou, S. Scherer, A. Nazarian, R. Wood, J. Boberg, D. DeVault, S. Marsella et al., "The distress analysis interview corpus of human and computer interviews," in LREC, vol. 14, Reykjavik, 2014, pp. 3123–3128

  18. [18]

    Depression detection from speech data using deep learning–based optimized temporal–frequency–channel attention with interpretable acoustic–prosodic mapping

    K. Rezaee, "Depression detection from speech data using deep learning–based optimized temporal–frequency–channel attention with interpretable acoustic–prosodic mapping," Journal of Affective Disorders, p. 121077, 2026

  19. [19]

    RADIANCE: Reliable and interpretable depression detection from speech using transformer

    A. K. Gupta, A. Dhamaniya, and P. Gupta, "RADIANCE: Reliable and interpretable depression detection from speech using transformer," Computers in Biology and Medicine, vol. 183, p. 109325, 2024

  20. [20]

    Depression recognition using voice-based pre-training model

    X. Huang, F. Wang, Y. Gao, Y. Liao, W. Zhang, L. Zhang, and Z. Xu, "Depression recognition using voice-based pre-training model," Scientific Reports, vol. 14, no. 1, p. 12734, 2024

  21. [21]

    wav2vec 2.0: A framework for self-supervised learning of speech representations

    A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," Advances in Neural Information Processing Systems, vol. 33, pp. 12449–12460, 2020

  22. [22]

    Responsible development of clinical speech AI: bridging the gap between clinical research and technology

    V. Berisha and J. M. Liss, "Responsible development of clinical speech AI: bridging the gap between clinical research and technology," npj Digital Medicine, vol. 7, no. 1, p. 208, 2024

  23. [23]

    Common Pitfalls and Recommendations for Use of Machine Learning in Depression Severity Estimation: DAIC-WOZ Study

    I. Danylenko and O. Unold, "Common Pitfalls and Recommendations for Use of Machine Learning in Depression Severity Estimation: DAIC-WOZ Study," Applied Sciences, vol. 16, no. 1, p. 422, 2025

  24. [24]

    Enhancing accuracy and privacy in speech-based depression detection through speaker disentanglement

    V. Ravi, J. Wang, J. Flint, and A. Alwan, "Enhancing accuracy and privacy in speech-based depression detection through speaker disentanglement," Computer Speech & Language, vol. 86, p. 101605, 2024

  25. [25]

    Unmasking Clever Hans predictors and assessing what machines really learn

    S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller, "Unmasking Clever Hans predictors and assessing what machines really learn," Nature Communications, vol. 10, no. 1, p. 1096, 2019

  26. [26]

    Domain-adversarial training of neural networks

    Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. March, and V. Lempitsky, "Domain-adversarial training of neural networks," Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016

  27. [27]

    The PHQ-8 as a measure of current depression in the general population

    K. Kroenke, T. W. Strine, R. L. Spitzer, J. B. Williams, J. T. Berry, and A. H. Mokdad, "The PHQ-8 as a measure of current depression in the general population," Journal of Affective Disorders, vol. 114, no. 1–3, pp. 163–173, 2009

  28. [28]

    A step towards preserving speakers' identity while detecting depression via speaker disentanglement

    V. Ravi, J. Wang, J. Flint, and A. Alwan, "A step towards preserving speakers' identity while detecting depression via speaker disentanglement," in Interspeech, 2022, p. 3338

  29. [29]

    Non-uniform speaker disentanglement for depression detection from raw speech signals

    J. Wang, V. Ravi, and A. Alwan, "Non-uniform speaker disentanglement for depression detection from raw speech signals," in Interspeech, 2023, p. 2343

  30. [30]

    Catastrophic forgetting in connectionist networks

    R. M. French, "Catastrophic forgetting in connectionist networks," Trends in Cognitive Sciences, vol. 3, no. 4, pp. 128–135, 1999

  31. [31]

    Comparing Acoustic-Based Approaches for Alzheimer's Disease Detection

    A. Balagopalan and J. Novikova, "Comparing Acoustic-Based Approaches for Alzheimer's Disease Detection," in Interspeech, 2021, pp. 3800–3804

  32. [32]

    XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale

    A. Babu, C. Wang, A. Tjandra, K. Lakhotia, Q. Xu, N. Goyal, K. Singh, P. von Platen, Y. Saraf, J. Pino, A. Baevski, A. Conneau, and M. Auli, "XLS-R: Self-supervised Cross-lingual Speech Representation Learning at Scale," in Interspeech, 2022, pp. 2278–2282

  33. [33]

    The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing

    F. Eyben, K. R. Scherer, B. W. Schuller, J. Sundberg, E. André, C. Busso, L. Y. Devillers, J. Epps, P. Laukka, S. S. Narayanan et al., "The Geneva minimalistic acoustic parameter set (GeMAPS) for voice research and affective computing," IEEE Transactions on Affective Computing, vol. 7, no. 2, pp. 190–202, 2015

  34. [34]

    openSMILE: the Munich versatile and fast open-source audio feature extractor

    F. Eyben, M. Wöllmer, and B. Schuller, "openSMILE: the Munich versatile and fast open-source audio feature extractor," in Proceedings of the 18th ACM International Conference on Multimedia, 2010, pp. 1459–1462

  35. [35]

    Audio deepfake detection with self-supervised XLS-R and SLS classifier

    Q. Zhang, S. Wen, and T. Hu, "Audio deepfake detection with self-supervised XLS-R and SLS classifier," in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6765–6773

  36. [36]

    Performance metrics for multilabel emotion classification: comparing micro, macro, and weighted F1-scores

    M. C. Hinojosa Lee, J. Braet, and J. Springael, "Performance metrics for multilabel emotion classification: comparing micro, macro, and weighted F1-scores," Applied Sciences, vol. 14, no. 21, p. 9863, 2024

  37. [37]

    Accuracy of automated classification of major depressive disorder as a function of symptom severity

    R. Ramasubbu, M. R. Brown, F. Cortese, I. Gaxiola, B. Goodyear, A. J. Greenshaw, S. M. Dursun, and R. Greiner, "Accuracy of automated classification of major depressive disorder as a function of symptom severity," NeuroImage: Clinical, vol. 12, pp. 320–331, 2016

  38. [38]

    An overview of speaker identification: Accuracy and robustness issues

    R. Togneri and D. Pullella, "An overview of speaker identification: Accuracy and robustness issues," IEEE Circuits and Systems Magazine, vol. 11, no. 2, pp. 23–61, 2011