Gender Fairness in Audio Deepfake Detection: Performance and Disparity Analysis
Pith reviewed 2026-05-15 12:51 UTC · model grok-4.3
The pith
Audio deepfake detectors exhibit gender disparities in error patterns even when overall error rates between male and female voices appear similar.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Even when the overall EER difference between genders appears low, fairness-aware evaluation reveals disparities in error distribution that are obscured by aggregate performance measures.
What carries the argument
Five established fairness metrics applied to equal error rate results from a ResNet-18 classifier on ASVspoof 5 to quantify gender-specific error imbalances.
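Neither the scores nor the evaluation code are reproduced on this page, so the following toy sketch (Python; all scores, labels, and gender tags are synthetic and hypothetical) only illustrates the mechanism the claim turns on: two groups can match on EER while splitting false accepts and false rejects very differently at a shared operating threshold.

```python
import numpy as np

def eer(scores, labels):
    """Equal error rate: operating point where false-accept rate (spoof scored
    as bonafide) and false-reject rate (bonafide scored as spoof) are closest.
    labels: 1 = bonafide, 0 = spoof; higher score = more bonafide-like."""
    thr = np.unique(scores)
    far = np.array([(scores[labels == 0] >= t).mean() for t in thr])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thr])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2, thr[i]

# Synthetic detector: mirrored score spreads give both genders the same EER
# but opposite error structure. All numbers are illustrative only.
rng = np.random.default_rng(0)
n = 4000
gender = rng.choice(["F", "M"], size=n)
labels = rng.integers(0, 2, size=n)          # 1 = bonafide, 0 = spoof
loc = np.where(labels == 1, 1.0, -1.0)
scale = np.where(gender == "F",
                 np.where(labels == 1, 1.4, 0.7),
                 np.where(labels == 1, 0.7, 1.4))
scores = rng.normal(loc, scale)

overall, t = eer(scores, labels)
print(f"overall EER ~ {overall:.3f} at threshold {t:.2f}")
for g in ("F", "M"):
    m = gender == g
    g_eer, _ = eer(scores[m], labels[m])
    far = (scores[m & (labels == 0)] >= t).mean()  # spoofs accepted as real
    frr = (scores[m & (labels == 1)] < t).mean()   # real voices rejected
    print(f"{g}: EER ~ {g_eer:.3f}, at shared threshold FAR={far:.3f} FRR={frr:.3f}")
```

In this construction the per-group EERs come out nearly identical by symmetry, yet at the shared threshold one gender absorbs most of the false accepts and the other most of the false rejects, which is the kind of imbalance an aggregate EER comparison cannot surface.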
If this is right
- Reliance on EER alone can mask gender-specific failure modes in deepfake detection.
- Fairness metrics are necessary to ensure equitable performance across male and female voices.
- Audio deepfake systems may require targeted adjustments to reduce hidden demographic errors.
- Evaluation protocols for voice biometrics should routinely include fairness metrics.
Where Pith is reading between the lines
- The same fairness evaluation could be applied to other voice-based AI tasks to check for similar hidden imbalances.
- Training data collection for future detectors may need explicit gender balancing to reduce observed disparities.
- Cross-dataset testing on additional audio corpora could determine whether the disparities are dataset-specific or general.
Load-bearing premise
The ASVspoof 5 dataset has representative gender distributions and the five fairness metrics accurately capture relevant disparities without introducing their own biases.
What would settle it
An experiment on a version of ASVspoof 5 with perfectly balanced gender labels and error rates showing no disparity under the same five fairness metrics.
read the original abstract
Audio deepfake detection aims to detect real human voices from those generated by Artificial Intelligence (AI) and has emerged as a significant problem in the field of voice biometrics systems. With the ever-improving quality of synthetic voice, the probability of such a voice being exploited for illicit practices like identity thest and impersonation increases. Although significant progress has been made in the field of Audio Deepfake Detection in recent times, the issue of gender bias remains underexplored and in its nascent stage. In this paper, we have attempted a thorough analysis of gender dependent performance and fairness in audio deepfake detection models. We have used the ASVspoof 5 dataset and train a ResNet-18 classifier and evaluate detection performance across four different audio features, and compared the performance with baseline AASIST model. Beyond conventional metrics such as Equal Error Rate (EER %), we incorporated five established fairness metrics to quantify gender disparities in the model. Our results show that even when the overall EER difference between genders appears low, fairness-aware evaluation reveals disparities in error distribution that are obscured by aggregate performance measures. These findings demonstrate that reliance on standard metrics is unreliable, whereas fairness metrics provide critical insights into demographic-specific failure modes. This work highlights the importance of fairness-aware evaluation for developing a more equitable, robust, and trustworthy audio deepfake detection system.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that on the ASVspoof 5 dataset, ResNet-18 (trained on four audio features) and AASIST models exhibit low overall EER gender differences in audio deepfake detection, yet five fairness metrics applied to per-gender confusion matrices reveal substantial disparities in error distributions that aggregate EER obscures, demonstrating the need for fairness-aware evaluation to build equitable systems.
Significance. If the empirical results hold after addressing gaps, the work has moderate significance for voice biometrics by showing that standard metrics like EER can mask demographic-specific failure modes in deepfake detection. It contributes to the underexplored area of gender fairness in this domain through direct comparison of models and metrics, but its purely empirical nature without parameter-free derivations or falsifiable predictions limits broader impact.
major comments (3)
- [Abstract / Methods] The five fairness metrics are invoked without explicit formulas, definitions, or references to their exact computation on per-gender confusion matrices. This is load-bearing because, per the skeptic note, metrics such as demographic parity or equal opportunity can yield large values due to base-rate differences in spoof/bonafide prevalence across genders even for well-calibrated classifiers; without the formulas or reported gender-specific base rates in ASVspoof 5, it is impossible to verify that the reported disparities reflect genuine unfairness rather than metric sensitivity. (A numeric illustration follows these comments.)
- [Evaluation / Results] No details are provided on data splits, exact training procedures, hyperparameter choices, or statistical tests (e.g., significance of EER differences or fairness scores). The central claim that fairness metrics expose hidden disparities rests on these results being reproducible and robust; the absence of error bars, a cross-validation description, or a sensitivity analysis to subsampling directly undermines confidence in the disparity findings.
- [Results] The manuscript does not report or control for gender-specific base rates in ASVspoof 5, nor compare the five metrics against base-rate-adjusted alternatives (e.g., equalized odds conditioned on prevalence). This omission is critical for the claim that aggregate EER is unreliable, as the skeptic analysis indicates that unadjusted fairness scores can artifactually indicate disparity when prevalence differs by gender.
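To make the base-rate concern in the first major comment concrete, here is a minimal numeric sketch (all prevalences and rates hypothetical) in which both genders receive identical true- and false-positive rates, so equalized odds holds exactly, yet demographic parity reports a large gap simply because spoof prevalence differs:

```python
# Hypothetical numbers: both genders share one operating point (equalized
# odds holds exactly), yet demographic parity flags a gap because spoof
# prevalence differs between the groups.

def flag_rate(prev_spoof, tpr, fpr):
    """P(utterance flagged as spoof) = prevalence * TPR + (1 - prevalence) * FPR."""
    return prev_spoof * tpr + (1 - prev_spoof) * fpr

TPR, FPR = 0.90, 0.05            # identical per-group detection behaviour
prev_f, prev_m = 0.60, 0.40      # hypothetical spoof prevalence by gender

rate_f, rate_m = flag_rate(prev_f, TPR, FPR), flag_rate(prev_m, TPR, FPR)
print(f"flag rate F={rate_f:.2f}, M={rate_m:.2f}, "
      f"demographic parity gap={abs(rate_f - rate_m):.2f}")
# -> F=0.56, M=0.39, gap=0.17: 'disparity' driven entirely by base rates.
```

A gap of 0.17 appears despite the detector treating both groups identically at every error rate, which is exactly why the referee asks for the metric formulas and the ASVspoof 5 base rates before the disparity findings can be trusted.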
minor comments (2)
- [Abstract] Abstract contains a typo: 'identity thest' should read 'identity theft'.
- [Methods] Notation for the four audio features and the exact ResNet-18 architecture details (e.g., input dimensions, layer counts) should be clarified for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate revisions to the manuscript.
read point-by-point responses
- Referee: [Abstract / Methods] The five fairness metrics are invoked without explicit formulas, definitions, or references to their exact computation on per-gender confusion matrices. This is load-bearing because, per the skeptic note, metrics such as demographic parity or equal opportunity can yield large values due to base-rate differences in spoof/bonafide prevalence across genders even for well-calibrated classifiers; without the formulas or reported gender-specific base rates in ASVspoof 5, it is impossible to verify that the reported disparities reflect genuine unfairness rather than metric sensitivity.
  Authors: We agree that explicit formulas and base-rate reporting are needed for verifiability. In the revised manuscript we will add the mathematical definitions and formulas for each of the five fairness metrics, explain their exact computation from per-gender confusion matrices, include the original references, and report the gender-specific base rates (bonafide/spoof prevalence) observed in ASVspoof 5 (definitions of the kind intended are sketched after these responses). revision: yes
- Referee: [Evaluation / Results] No details are provided on data splits, exact training procedures, hyperparameter choices, or statistical tests (e.g., significance of EER differences or fairness scores). The central claim that fairness metrics expose hidden disparities rests on these results being reproducible and robust; the absence of error bars, a cross-validation description, or a sensitivity analysis to subsampling directly undermines confidence in the disparity findings.
  Authors: We acknowledge the omission of these reproducibility details. The revised version will include full descriptions of the data splits, training procedures, hyperparameter choices, cross-validation approach, error bars on reported metrics, and statistical significance tests for EER and fairness-score differences. We will also add a sensitivity analysis to subsampling (a bootstrap sketch of such a test appears after these responses). revision: yes
- Referee: [Results] The manuscript does not report or control for gender-specific base rates in ASVspoof 5, nor compare the five metrics against base-rate-adjusted alternatives (e.g., equalized odds conditioned on prevalence). This omission is critical for the claim that aggregate EER is unreliable, as the skeptic analysis indicates that unadjusted fairness scores can artifactually indicate disparity when prevalence differs by gender.
  Authors: We will add the gender-specific base rates for ASVspoof 5 to the revised results section. A full comparison against base-rate-adjusted alternatives (such as prevalence-conditioned equalized odds) would require additional experiments; we will include a brief discussion of this limitation, noting that the chosen metrics are standard in the fairness literature while acknowledging that adjusted variants could be explored in future work. revision: partial
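The five metrics are not named anywhere in this review, so the following is illustrative only: gap-style definitions of the kind the first response promises, drawn from the standard fairness literature (Verma and Rubin [12]; Hardt et al. [13]), might read as follows.

```latex
% Illustrative only: the paper's five metrics are not named in this review.
% Y is the true class (1 = spoof), \hat{Y} the prediction, G \in \{f, m\}
% the gender group.
\begin{align*}
\Delta_{\mathrm{DP}}  &= \bigl|\, P(\hat{Y}{=}1 \mid G{=}f) - P(\hat{Y}{=}1 \mid G{=}m) \,\bigr|
  && \text{(demographic parity)} \\
\Delta_{\mathrm{EOp}} &= \bigl|\, P(\hat{Y}{=}1 \mid Y{=}1, G{=}f) - P(\hat{Y}{=}1 \mid Y{=}1, G{=}m) \,\bigr|
  && \text{(equal opportunity)} \\
\Delta_{\mathrm{EOd}} &= \max_{y \in \{0,1\}} \bigl|\, P(\hat{Y}{=}1 \mid Y{=}y, G{=}f) - P(\hat{Y}{=}1 \mid Y{=}y, G{=}m) \,\bigr|
  && \text{(equalized odds)} \\
\Delta_{\mathrm{PP}}  &= \bigl|\, P(Y{=}1 \mid \hat{Y}{=}1, G{=}f) - P(Y{=}1 \mid \hat{Y}{=}1, G{=}m) \,\bigr|
  && \text{(predictive parity)}
\end{align*}
```

Every term here is computable from the per-gender confusion matrices together with the group base rates, which is exactly what the promised revision would need to report.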
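One conventional way to deliver the significance testing promised in the second response, sketched under the assumption that per-utterance scores, labels, and gender tags are available (function and variable names hypothetical), is a nonparametric bootstrap over utterances:

```python
import numpy as np

def bootstrap_eer_gap(scores, labels, gender, eer_fn, n_boot=1000, seed=0):
    """95% bootstrap interval for the male-minus-female EER difference.
    eer_fn(scores, labels) must return a scalar EER; utterances are
    resampled with replacement, so each replicate re-estimates both EERs."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        s, y, g = scores[idx], labels[idx], gender[idx]
        gaps[b] = (eer_fn(s[g == "M"], y[g == "M"])
                   - eer_fn(s[g == "F"], y[g == "F"]))
    lo, hi = np.percentile(gaps, [2.5, 97.5])
    return gaps.mean(), (lo, hi)  # an interval excluding 0 suggests a real gap
```

The same resampling loop applies unchanged to each fairness score, giving the error bars and subsampling sensitivity the referee asks for.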
Circularity Check
No significant circularity in empirical fairness evaluation
full rationale
This is a purely empirical study that trains standard models (ResNet-18 and AASIST) on the public ASVspoof 5 dataset, computes EER, and applies five established fairness metrics to per-gender confusion matrices. No derivations, predictions, or first-principles results are claimed that could reduce to fitted parameters or self-citations by construction. All reported disparities are direct experimental outcomes from the chosen metrics and data splits, rendering the analysis self-contained with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Audio features and the ResNet-18 architecture can capture voice characteristics relevant to distinguishing real from synthetic speech across genders.
Forward citations
Cited by 2 Pith papers
- Towards Trustworthy Audio Deepfake Detection: A Systematic Framework for Diagnosing and Mitigating Gender Bias. A diagnosis-first framework for gender bias in audio deepfake detection identifies acoustic representation differences and feature leakage as sources, with per-gender threshold adjustment reducing unfairness by 54-75%...
- Phoneme-Level Deepfake Detection Across Emotional Conditions Using Self-Supervised Embeddings. Phoneme-level analysis using self-supervised embeddings identifies higher divergence in complex vowels and fricatives for emotional voice conversion deepfakes, enabling more interpretable detection across emotions.
Reference graph
Works this paper leans on
- [1]
- [2] J. Yi, C. Wang, J. Tao, X. Zhang, C. Y. Zhang, and Y. Zhao, "Audio deepfake detection: A survey," arXiv preprint arXiv:2308.14970, 2023.
- [3] L. Verdoliva, "Media forensics and deepfakes: An overview," IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 5, pp. 910–932, Aug. 2020.
- [4] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilci, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, "ASVspoof: The automatic speaker verification spoofing and countermeasures challenge," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–604, Jun. 2017.
- [5] S. Barocas, M. Hardt, and A. Narayanan, Fairness and Machine Learning: Limitations and Opportunities. Cambridge, MA, USA: MIT Press, 2023.
- [6] C. Harris, C. Mgbahurike, N. Kumar, and D. Yang, "Modeling gender and dialect bias in automatic speech recognition," in Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 15166–15184, 2024.
- [8] J. Buolamwini and T. Gebru, "Gender shades: Intersectional accuracy disparities in commercial gender classification," in Proc. Conf. Fairness, Accountability, and Transparency, pp. 77–91, 2018.
- [9] I. D. Raji, A. Smart, R. N. White, M. Mitchell, T. Gebru, B. Hutchinson, J. Smith-Loud, D. Theron, and P. Barnes, "Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing," in Proc. Conf. Fairness, Accountability, and Transparency, pp. 33–44, 2020.
- [10] B. Zhang, H. Cui, V. Nguyen, and M. Whitty, "Audio deepfake detection: What has been achieved and what lies ahead," Sensors, vol. 25, no. 7, Art. no. 1989, 2025.
- [11] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," arXiv preprint arXiv:1904.05441, 2019.
- [12] S. Verma and J. Rubin, "Fairness definitions explained," in Proc. Int. Workshop on Software Fairness, 2018.
- [13] M. Hardt, E. Price, and N. Srebro, "Equality of opportunity in supervised learning," in Advances in Neural Information Processing Systems, vol. 29, 2016.
- [14] M. Pu, M. Y. Kuan, N. T. Lim, C. Y. Chong, and M. K. Lim, "Fairness evaluation in deepfake detection models using metamorphic testing," in Proc. 7th Int. Workshop on Metamorphic Testing (MET), 2022.
- [15] T. Estella, A. Zahra, and W.-K. Fung, "Accessing gender bias in speech processing using machine learning and deep learning with gender balanced audio deepfake dataset," in Proc. 9th Int. Conf. Informatics and Computing (ICIC), 2024.
- [16] X. Wang, H. Delgado, H. Tak, J. W. Jung, H. J. Shim, M. Todisco, J. Yamagishi, et al., "ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale," arXiv preprint arXiv:2408.08739, 2024.
- [17] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, Aug. 1980.
- [18] J. C. Brown, "Calculation of a constant Q spectral transform," J. Acoust. Soc. Am., vol. 89, no. 1, pp. 425–434, Jan. 1991.
- [19] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE J. Sel. Topics Signal Process., vol. 16, no. 6, pp. 1505–1518, Oct. 2022.
- [20] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, pp. 12449–12460, 2020.
- [21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 770–778, 2016.
- [22] V. Nallaguntla, A. Fursule, S. Kshirsagar, and A. R. Avila, "PhonemeDF: A synthetic speech dataset for audio deepfake detection and naturalness evaluation," in Proc. 15th Int. Conf. Language Resources and Evaluation (LREC 2026), 2026, submitted.
- [23] D. E. Temmar, A. Hamadene, V. Nallaguntla, A. R. Fursule, M. S. Allili, S. Kshirsagar, and A. R. Avila, "Phonetic analysis of real and synthetic speech using HuBERT embeddings: Perspectives for deepfake detection," in Proc. IEEE Int. Conf. Systems, Man, and Cybernetics (SMC), 2025, pp. 86–91, doi: 10.1109/SMC58881.2025.11343334.
- [24] Y. Ju, S. Hu, S. Jia, G. H. Chen, and S. Lyu, "Improving fairness in deepfake detection," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), pp. 4655–4665, 2024.
- [25] L. Trinh and Y. Liu, "An examination of fairness of AI models for deepfake detection," arXiv preprint arXiv:2105.00558, 2021.
- [26] N. Alshareef, X. Yuan, K. Roy, and M. Atay, "A study of gender bias in face presentation attack and its mitigation," Future Internet, vol. 13, no. 9, Art. no. 234, Sep. 2021.
- [27] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, "ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge," in Proc. Interspeech, 2015, pp. 2037–2041.
- [28] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, "The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection," in Proc. Interspeech, 2017.
- [29] J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, and H. Delgado, "ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection," arXiv preprint arXiv:2109.00537, 2021.
- [30] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, "A survey on bias and fairness in machine learning," ACM Computing Surveys, vol. 54, no. 6, pp. 1–35, 2021.
- [31] A. V. Nadimpalli and A. Rattani, "GBDF: Gender balanced deepfake dataset towards fair deepfake detection," in Proc. Int. Conf. Pattern Recognition (ICPR), Cham, Switzerland: Springer Nature Switzerland, Aug. 2022, pp. 320–337.
- [32] A. Agarwal and N. Ratha, "Deepfake: Classifiers, fairness, and demographically robust algorithm," in Proc. IEEE 18th Int. Conf. Automatic Face and Gesture Recognition (FG), May 2024, pp. 1–9.
- [33] A. K. S. Yadav, K. Bhagtani, D. Salvi, P. Bestagini, and E. J. Delp, "FairSSD: Understanding bias in synthetic speech detectors," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4418–4428, 2024.
- [34]
- [35]
- [36] J.-W. Jung, H.-S. Heo, H. Tak, H.-J. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, "AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6367–6371.
- [37] A. Panda, T. Ghosh, T. Choudhary, and R. Naskar, "Bias-Free? An empirical study on ethnicity, gender, and age fairness in deepfake detection," ACM Computing Surveys, 2026.