Gender Fairness in Audio Deepfake Detection: Performance and Disparity Analysis
Pith reviewed 2026-05-15 12:51 UTC · model grok-4.3
The pith
Audio deepfake detectors exhibit gender disparities in error patterns even when overall error rates between male and female voices appear similar.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Even when the overall EER difference between genders appears low, fairness-aware evaluation reveals disparities in error distribution that are obscured by aggregate performance measures.
What carries the argument
Five established fairness metrics applied to equal error rate results from a ResNet-18 classifier on ASVspoof 5 to quantify gender-specific error imbalances.
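Neither the scores nor the evaluation code are reproduced on this page, so the following toy sketch (Python; all scores, labels, and gender tags are synthetic and hypothetical) only illustrates the mechanism the claim turns on: two groups can match on EER while splitting false accepts and false rejects very differently at a shared operating threshold.

```python
import numpy as np

def eer(scores, labels):
    """Equal error rate: operating point where false-accept rate (spoof scored
    as bonafide) and false-reject rate (bonafide scored as spoof) are closest.
    labels: 1 = bonafide, 0 = spoof; higher score = more bonafide-like."""
    thr = np.unique(scores)
    far = np.array([(scores[labels == 0] >= t).mean() for t in thr])
    frr = np.array([(scores[labels == 1] < t).mean() for t in thr])
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2, thr[i]

# Synthetic detector: mirrored score spreads give both genders the same EER
# but opposite error structure. All numbers are illustrative only.
rng = np.random.default_rng(0)
n = 4000
gender = rng.choice(["F", "M"], size=n)
labels = rng.integers(0, 2, size=n)          # 1 = bonafide, 0 = spoof
loc = np.where(labels == 1, 1.0, -1.0)
scale = np.where(gender == "F",
                 np.where(labels == 1, 1.4, 0.7),
                 np.where(labels == 1, 0.7, 1.4))
scores = rng.normal(loc, scale)

overall, t = eer(scores, labels)
print(f"overall EER ~ {overall:.3f} at threshold {t:.2f}")
for g in ("F", "M"):
    m = gender == g
    g_eer, _ = eer(scores[m], labels[m])
    far = (scores[m & (labels == 0)] >= t).mean()  # spoofs accepted as real
    frr = (scores[m & (labels == 1)] < t).mean()   # real voices rejected
    print(f"{g}: EER ~ {g_eer:.3f}, at shared threshold FAR={far:.3f} FRR={frr:.3f}")
```

In this construction the per-group EERs come out nearly identical by symmetry, yet at the shared threshold one gender absorbs most of the false accepts and the other most of the false rejects, which is the kind of imbalance an aggregate EER comparison cannot surface.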
If this is right
- Reliance on EER alone can mask gender-specific failure modes in deepfake detection.
- Fairness metrics are necessary to ensure equitable performance across male and female voices.
- Audio deepfake systems may require targeted adjustments to reduce hidden demographic errors.
- Evaluation protocols for voice biometrics should routinely include fairness metrics.
Where Pith is reading between the lines
- The same fairness evaluation could be applied to other voice-based AI tasks to check for similar hidden imbalances.
- Training data collection for future detectors may need explicit gender balancing to reduce observed disparities.
- Cross-dataset testing on additional audio corpora could determine whether the disparities are dataset-specific or general.
Load-bearing premise
The ASVspoof 5 dataset has representative gender distributions and the five fairness metrics accurately capture relevant disparities without introducing their own biases.
What would settle it
An experiment on a version of ASVspoof 5 with perfectly balanced gender labels and error rates showing no disparity under the same five fairness metrics.
read the original abstract
Audio deepfake detection aims to detect real human voices from those generated by Artificial Intelligence (AI) and has emerged as a significant problem in the field of voice biometrics systems. With the ever-improving quality of synthetic voice, the probability of such a voice being exploited for illicit practices like identity thest and impersonation increases. Although significant progress has been made in the field of Audio Deepfake Detection in recent times, the issue of gender bias remains underexplored and in its nascent stage. In this paper, we have attempted a thorough analysis of gender dependent performance and fairness in audio deepfake detection models. We have used the ASVspoof 5 dataset and train a ResNet-18 classifier and evaluate detection performance across four different audio features, and compared the performance with baseline AASIST model. Beyond conventional metrics such as Equal Error Rate (EER %), we incorporated five established fairness metrics to quantify gender disparities in the model. Our results show that even when the overall EER difference between genders appears low, fairness-aware evaluation reveals disparities in error distribution that are obscured by aggregate performance measures. These findings demonstrate that reliance on standard metrics is unreliable, whereas fairness metrics provide critical insights into demographic-specific failure modes. This work highlights the importance of fairness-aware evaluation for developing a more equitable, robust, and trustworthy audio deepfake detection system.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that on the ASVspoof 5 dataset, ResNet-18 (trained on four audio features) and AASIST models exhibit low overall EER gender differences in audio deepfake detection, yet five fairness metrics applied to per-gender confusion matrices reveal substantial disparities in error distributions that aggregate EER obscures, demonstrating the need for fairness-aware evaluation to build equitable systems.
Significance. If the empirical results hold after addressing gaps, the work has moderate significance for voice biometrics by showing that standard metrics like EER can mask demographic-specific failure modes in deepfake detection. It contributes to the underexplored area of gender fairness in this domain through direct comparison of models and metrics, but its purely empirical nature without parameter-free derivations or falsifiable predictions limits broader impact.
major comments (3)
- [Abstract / Methods] The five fairness metrics are invoked without explicit formulas, definitions, or references to their exact computation on per-gender confusion matrices. This is load-bearing because, per the skeptic note, metrics such as demographic parity or equal opportunity can yield large values due to base-rate differences in spoof/bonafide prevalence across genders even for well-calibrated classifiers; without the formulas or reported gender-specific base rates in ASVspoof 5, it is impossible to verify that the reported disparities reflect genuine unfairness rather than metric sensitivity. (A numeric illustration follows these comments.)
- [Evaluation / Results] No details are provided on data splits, exact training procedures, hyperparameter choices, or statistical tests (e.g., significance of EER differences or fairness scores). The central claim that fairness metrics expose hidden disparities rests on these results being reproducible and robust; the absence of error bars, a cross-validation description, or a sensitivity analysis to subsampling directly undermines confidence in the disparity findings.
- [Results] The manuscript does not report or control for gender-specific base rates in ASVspoof 5, nor compare the five metrics against base-rate-adjusted alternatives (e.g., equalized odds conditioned on prevalence). This omission is critical for the claim that aggregate EER is unreliable, as the skeptic analysis indicates that unadjusted fairness scores can artifactually indicate disparity when prevalence differs by gender.
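To make the base-rate concern in the first major comment concrete, here is a minimal numeric sketch (all prevalences and rates hypothetical) in which both genders receive identical true- and false-positive rates, so equalized odds holds exactly, yet demographic parity reports a large gap simply because spoof prevalence differs:

```python
# Hypothetical numbers: both genders share one operating point (equalized
# odds holds exactly), yet demographic parity flags a gap because spoof
# prevalence differs between the groups.

def flag_rate(prev_spoof, tpr, fpr):
    """P(utterance flagged as spoof) = prevalence * TPR + (1 - prevalence) * FPR."""
    return prev_spoof * tpr + (1 - prev_spoof) * fpr

TPR, FPR = 0.90, 0.05            # identical per-group detection behaviour
prev_f, prev_m = 0.60, 0.40      # hypothetical spoof prevalence by gender

rate_f, rate_m = flag_rate(prev_f, TPR, FPR), flag_rate(prev_m, TPR, FPR)
print(f"flag rate F={rate_f:.2f}, M={rate_m:.2f}, "
      f"demographic parity gap={abs(rate_f - rate_m):.2f}")
# -> F=0.56, M=0.39, gap=0.17: 'disparity' driven entirely by base rates.
```

A gap of 0.17 appears despite the detector treating both groups identically at every error rate, which is exactly why the referee asks for the metric formulas and the ASVspoof 5 base rates before the disparity findings can be trusted.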
minor comments (2)
- [Abstract] Abstract contains a typo: 'identity thest' should read 'identity theft'.
- [Methods] Notation for the four audio features and the exact ResNet-18 architecture details (e.g., input dimensions, layer counts) should be clarified for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate revisions to the manuscript.
read point-by-point responses
- Referee: [Abstract / Methods] The five fairness metrics are invoked without explicit formulas, definitions, or references to their exact computation on per-gender confusion matrices. This is load-bearing because, per the skeptic note, metrics such as demographic parity or equal opportunity can yield large values due to base-rate differences in spoof/bonafide prevalence across genders even for well-calibrated classifiers; without the formulas or reported gender-specific base rates in ASVspoof 5, it is impossible to verify that the reported disparities reflect genuine unfairness rather than metric sensitivity.
  Authors: We agree that explicit formulas and base-rate reporting are needed for verifiability. In the revised manuscript we will add the mathematical definitions and formulas for each of the five fairness metrics, explain their exact computation from per-gender confusion matrices, include the original references, and report the gender-specific base rates (bonafide/spoof prevalence) observed in ASVspoof 5 (definitions of the kind intended are sketched after these responses). revision: yes
- Referee: [Evaluation / Results] No details are provided on data splits, exact training procedures, hyperparameter choices, or statistical tests (e.g., significance of EER differences or fairness scores). The central claim that fairness metrics expose hidden disparities rests on these results being reproducible and robust; the absence of error bars, a cross-validation description, or a sensitivity analysis to subsampling directly undermines confidence in the disparity findings.
  Authors: We acknowledge the omission of these reproducibility details. The revised version will include full descriptions of the data splits, training procedures, hyperparameter choices, cross-validation approach, error bars on reported metrics, and statistical significance tests for EER and fairness-score differences. We will also add a sensitivity analysis to subsampling (a bootstrap sketch of such a test appears after these responses). revision: yes
- Referee: [Results] The manuscript does not report or control for gender-specific base rates in ASVspoof 5, nor compare the five metrics against base-rate-adjusted alternatives (e.g., equalized odds conditioned on prevalence). This omission is critical for the claim that aggregate EER is unreliable, as the skeptic analysis indicates that unadjusted fairness scores can artifactually indicate disparity when prevalence differs by gender.
  Authors: We will add the gender-specific base rates for ASVspoof 5 to the revised results section. A full comparison against base-rate-adjusted alternatives (such as prevalence-conditioned equalized odds) would require additional experiments; we will include a brief discussion of this limitation, noting that the chosen metrics are standard in the fairness literature while acknowledging that adjusted variants could be explored in future work. revision: partial
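The five metrics are not named anywhere in this review, so the following is illustrative only: gap-style definitions of the kind the first response promises, drawn from the standard fairness literature (Verma and Rubin [12]; Hardt et al. [13]), might read as follows.

```latex
% Illustrative only: the paper's five metrics are not named in this review.
% Y is the true class (1 = spoof), \hat{Y} the prediction, G \in \{f, m\}
% the gender group.
\begin{align*}
\Delta_{\mathrm{DP}}  &= \bigl|\, P(\hat{Y}{=}1 \mid G{=}f) - P(\hat{Y}{=}1 \mid G{=}m) \,\bigr|
  && \text{(demographic parity)} \\
\Delta_{\mathrm{EOp}} &= \bigl|\, P(\hat{Y}{=}1 \mid Y{=}1, G{=}f) - P(\hat{Y}{=}1 \mid Y{=}1, G{=}m) \,\bigr|
  && \text{(equal opportunity)} \\
\Delta_{\mathrm{EOd}} &= \max_{y \in \{0,1\}} \bigl|\, P(\hat{Y}{=}1 \mid Y{=}y, G{=}f) - P(\hat{Y}{=}1 \mid Y{=}y, G{=}m) \,\bigr|
  && \text{(equalized odds)} \\
\Delta_{\mathrm{PP}}  &= \bigl|\, P(Y{=}1 \mid \hat{Y}{=}1, G{=}f) - P(Y{=}1 \mid \hat{Y}{=}1, G{=}m) \,\bigr|
  && \text{(predictive parity)}
\end{align*}
```

Every term here is computable from the per-gender confusion matrices together with the group base rates, which is exactly what the promised revision would need to report.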
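One conventional way to deliver the significance testing promised in the second response, sketched under the assumption that per-utterance scores, labels, and gender tags are available (function and variable names hypothetical), is a nonparametric bootstrap over utterances:

```python
import numpy as np

def bootstrap_eer_gap(scores, labels, gender, eer_fn, n_boot=1000, seed=0):
    """95% bootstrap interval for the male-minus-female EER difference.
    eer_fn(scores, labels) must return a scalar EER; utterances are
    resampled with replacement, so each replicate re-estimates both EERs."""
    rng = np.random.default_rng(seed)
    n = len(scores)
    gaps = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)
        s, y, g = scores[idx], labels[idx], gender[idx]
        gaps[b] = (eer_fn(s[g == "M"], y[g == "M"])
                   - eer_fn(s[g == "F"], y[g == "F"]))
    lo, hi = np.percentile(gaps, [2.5, 97.5])
    return gaps.mean(), (lo, hi)  # an interval excluding 0 suggests a real gap
```

The same resampling loop applies unchanged to each fairness score, giving the error bars and subsampling sensitivity the referee asks for.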
Circularity Check
No significant circularity in empirical fairness evaluation
full rationale
This is a purely empirical study that trains standard models (ResNet-18 and AASIST) on the public ASVspoof 5 dataset, computes EER, and applies five established fairness metrics to per-gender confusion matrices. No derivations, predictions, or first-principles results are claimed that could reduce to fitted parameters or self-citations by construction. All reported disparities are direct experimental outcomes from the chosen metrics and data splits, rendering the analysis self-contained with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Audio features and the ResNet-18 architecture can capture voice characteristics relevant to distinguishing real from synthetic speech across genders.
Forward citations
Cited by 2 Pith papers
- Towards Trustworthy Audio Deepfake Detection: A Systematic Framework for Diagnosing and Mitigating Gender Bias. A diagnosis-first framework for gender bias in audio deepfake detection identifies acoustic representation differences and feature leakage as sources, with per-gender threshold adjustment reducing unfairness by 54-75%...
- Phoneme-Level Deepfake Detection Across Emotional Conditions Using Self-Supervised Embeddings. Phoneme-level analysis using self-supervised embeddings identifies higher divergence in complex vowels and fricatives for emotional voice conversion deepfakes, enabling more interpretable detection across emotions.
Reference graph
Works this paper leans on
- [1]
- [2] J. Yi, C. Wang, J. Tao, X. Zhang, C. Y. Zhang, and Y. Zhao, "Audio deepfake detection: A survey," arXiv preprint arXiv:2308.14970, 2023.
- [3] L. Verdoliva, "Media forensics and deepfakes: An overview," IEEE Journal of Selected Topics in Signal Processing, vol. 14, no. 5, pp. 910–932, Aug. 2020.
- [4] Z. Wu, J. Yamagishi, T. Kinnunen, C. Hanilci, M. Sahidullah, A. Sizov, N. Evans, M. Todisco, and H. Delgado, "ASVspoof: The automatic speaker verification spoofing and countermeasures challenge," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 4, pp. 588–604, Jun. 2017.
- [5] S. Barocas, M. Hardt, and A. Narayanan, Fairness and Machine Learning: Limitations and Opportunities. Cambridge, MA, USA: MIT Press, 2023.
- [6] C. Harris, C. Mgbahurike, N. Kumar, and D. Yang, "Modeling gender and dialect bias in automatic speech recognition," in Findings of the Association for Computational Linguistics: EMNLP 2024, pp. 15166–15184, 2024.
- [8] J. Buolamwini and T. Gebru, "Gender shades: Intersectional accuracy disparities in commercial gender classification," in Proc. Conf. Fairness, Accountability, and Transparency, pp. 77–91, 2018.
- [9] I. D. Raji, A. Smart, R. N. White, M. Mitchell, T. Gebru, B. Hutchinson, J. Smith-Loud, D. Theron, and P. Barnes, "Closing the AI accountability gap: Defining an end-to-end framework for internal algorithmic auditing," in Proc. Conf. Fairness, Accountability, and Transparency, pp. 33–44, 2020.
- [10] B. Zhang, H. Cui, V. Nguyen, and M. Whitty, "Audio deepfake detection: What has been achieved and what lies ahead," Sensors, vol. 25, no. 7, Art. no. 1989, 2025.
- [11] M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. Kinnunen, and K. A. Lee, "ASVspoof 2019: Future horizons in spoofed and fake audio detection," arXiv preprint arXiv:1904.05441, 2019.
- [12] S. Verma and J. Rubin, "Fairness definitions explained," in Proc. Int. Workshop on Software Fairness, 2018.
- [13] M. Hardt, E. Price, and N. Srebro, "Equality of opportunity in supervised learning," in Advances in Neural Information Processing Systems, vol. 29, 2016.
- [14] M. Pu, M. Y. Kuan, N. T. Lim, C. Y. Chong, and M. K. Lim, "Fairness evaluation in deepfake detection models using metamorphic testing," in Proc. 7th Int. Workshop on Metamorphic Testing (MET), 2022.
- [15] T. Estella, A. Zahra, and W.-K. Fung, "Accessing gender bias in speech processing using machine learning and deep learning with gender balanced audio deepfake dataset," in Proc. 9th Int. Conf. Informatics and Computing (ICIC), 2024.
- [16] X. Wang, H. Delgado, H. Tak, J. W. Jung, H. J. Shim, M. Todisco, J. Yamagishi, et al., "ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale," arXiv preprint arXiv:2408.08739, 2024.
- [17] S. Davis and P. Mermelstein, "Comparison of parametric representations for monosyllabic word recognition in continuously spoken sentences," IEEE Trans. Acoust., Speech, Signal Process., vol. 28, no. 4, pp. 357–366, Aug. 1980.
- [18] J. C. Brown, "Calculation of a constant Q spectral transform," J. Acoust. Soc. Am., vol. 89, no. 1, pp. 425–434, Jan. 1991.
- [19] S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, et al., "WavLM: Large-scale self-supervised pre-training for full stack speech processing," IEEE J. Sel. Topics Signal Process., vol. 16, no. 6, pp. 1505–1518, Oct. 2022.
- [20] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Adv. Neural Inf. Process. Syst. (NeurIPS), vol. 33, pp. 12449–12460, 2020.
- [21] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. IEEE Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 770–778, 2016.
- [22] V. Nallaguntla, A. Fursule, S. Kshirsagar, and A. R. Avila, "PhonemeDF: A synthetic speech dataset for audio deepfake detection and naturalness evaluation," in Proc. 15th Int. Conf. Language Resources and Evaluation (LREC 2026), 2026, submitted.
- [23] D. E. Temmar, A. Hamadene, V. Nallaguntla, A. R. Fursule, M. S. Allili, S. Kshirsagar, and A. R. Avila, "Phonetic analysis of real and synthetic speech using HuBERT embeddings: Perspectives for deepfake detection," in Proc. IEEE Int. Conf. Systems, Man, and Cybernetics (SMC), 2025, pp. 86–91, doi: 10.1109/SMC58881.2025.11343334.
- [24] Y. Ju, S. Hu, S. Jia, G. H. Chen, and S. Lyu, "Improving fairness in deepfake detection," in Proc. IEEE/CVF Winter Conf. Appl. Comput. Vis. (WACV), pp. 4655–4665, 2024.
- [25] L. Trinh and Y. Liu, "An examination of fairness of AI models for deepfake detection," arXiv preprint arXiv:2105.00558, 2021.
- [26] N. Alshareef, X. Yuan, K. Roy, and M. Atay, "A study of gender bias in face presentation attack and its mitigation," Future Internet, vol. 13, no. 9, Art. no. 234, Sep. 2021.
- [27] Z. Wu, T. Kinnunen, N. Evans, J. Yamagishi, C. Hanilçi, M. Sahidullah, and A. Sizov, "ASVspoof 2015: The first automatic speaker verification spoofing and countermeasures challenge," in Proc. Interspeech, 2015, pp. 2037–2041.
- [28] T. Kinnunen, M. Sahidullah, H. Delgado, M. Todisco, N. Evans, J. Yamagishi, and K. A. Lee, "The ASVspoof 2017 challenge: Assessing the limits of replay spoofing attack detection," in Proc. Interspeech, 2017.
- [29] J. Yamagishi, X. Wang, M. Todisco, M. Sahidullah, J. Patino, A. Nautsch, and H. Delgado, "ASVspoof 2021: Accelerating progress in spoofed and deepfake speech detection," arXiv preprint arXiv:2109.00537, 2021.
- [30] N. Mehrabi, F. Morstatter, N. Saxena, K. Lerman, and A. Galstyan, "A survey on bias and fairness in machine learning," ACM Computing Surveys, vol. 54, no. 6, pp. 1–35, 2021.
- [31] A. V. Nadimpalli and A. Rattani, "GBDF: Gender balanced deepfake dataset towards fair deepfake detection," in Proc. Int. Conf. Pattern Recognition (ICPR), Cham, Switzerland: Springer Nature Switzerland, Aug. 2022, pp. 320–337.
- [32] A. Agarwal and N. Ratha, "Deepfake: Classifiers, fairness, and demographically robust algorithm," in Proc. IEEE 18th Int. Conf. Automatic Face and Gesture Recognition (FG), May 2024, pp. 1–9.
- [33] A. K. S. Yadav, K. Bhagtani, D. Salvi, P. Bestagini, and E. J. Delp, "FairSSD: Understanding bias in synthetic speech detectors," in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit. (CVPR), pp. 4418–4428, 2024.
- [34]
- [35]
- [36] J.-W. Jung, H.-S. Heo, H. Tak, H.-J. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, "AASIST: Audio anti-spoofing using integrated spectro-temporal graph attention networks," in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6367–6371.
- [37] A. Panda, T. Ghosh, T. Choudhary, and R. Naskar, "Bias-Free? An empirical study on ethnicity, gender, and age fairness in deepfake detection," ACM Computing Surveys, 2026.