Ethical and Technical Limits of Deepfake Speech Datasets

Anton Firc; Eva Trnovsk\'a; Kamil Malinka; Vojt\v{e}ch Stan\v{e}k

arxiv: 2606.10911 · v1 · pith:WMDRT7Y2new · submitted 2026-06-09 · 💻 cs.SD · cs.AI· cs.CR· cs.LG

Ethical and Technical Limits of Deepfake Speech Datasets

Vojt\v{e}ch Stan\v{e}k , Eva Trnovsk\'a , Kamil Malinka , Anton Firc This is my paper

Pith reviewed 2026-06-27 11:38 UTC · model grok-4.3

classification 💻 cs.SD cs.AIcs.CRcs.LG

keywords deepfake speechdatasetsfairness assessmentdemographic metadatasource overlapevaluationaudit

0 comments

The pith

Most deepfake speech datasets lack demographic metadata, making fairness assessment infeasible, and share overlapping bona fide sources that undermine cross-dataset evaluations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper conducts a dataset-level audit of 39 deepfake speech datasets to evaluate their suitability for training and testing detectors. It establishes that fairness assessment is largely infeasible because most datasets lack demographic metadata, with only a few containing gender or language labels, which prevents meaningful subgroup analysis and leaves other attributes unaddressed. The audit also identifies substantial overlap in underlying bona fide speech source corpora across datasets. Such overlaps can undermine cross-dataset evaluation and lead to overstated generalization claims. A sympathetic reader would care because the credibility of any robustness or fairness claims for deepfake speech detectors rests directly on the quality and independence of these datasets.

Core claim

Claims about the robustness and fairness of deepfake speech detectors are only as credible as the datasets used to train and evaluate those systems. The audit reveals two important takeaways. Firstly, fairness assessment is largely infeasible because most datasets lack demographic metadata, and only a few contain gender or language labels. This prevents any meaningful subgroup analysis and leaves other demographic attributes unaddressed. Secondly, substantial overlap in underlying bona fide source corpora across datasets can undermine cross-dataset evaluation and lead to overstated generalization claims.

What carries the argument

A systematic audit of 39 deepfake speech datasets examining accessibility, documentation, demographic and language coverage, dataset scale, and underlying bona fide speech sources.

If this is right

Fairness of deepfake speech detectors cannot be reliably assessed with current datasets.
Subgroup analysis across demographics such as gender or language is not possible.
Cross-dataset evaluations may produce misleading results due to shared source material.
Reported generalization performance of detectors is likely overstated.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future dataset creators should build in demographic metadata collection from the start to support fairness work.
Evaluation protocols may require explicit checks for source overlap to produce valid comparisons.
The effective diversity of training material across the field is smaller than the count of published datasets suggests.

Load-bearing premise

The 39 datasets compiled for the audit represent the broader deepfake speech dataset landscape and the available documentation accurately reflects the presence or absence of demographic metadata and source overlaps.

What would settle it

Discovery of additional deepfake speech datasets that include complete demographic metadata for all speakers or an experiment demonstrating that detector performance remains consistent when trained and tested on fully non-overlapping source corpora.

Figures

Figures reproduced from arXiv: 2606.10911 by Anton Firc, Eva Trnovsk\'a, Kamil Malinka, Vojt\v{e}ch Stan\v{e}k.

**Figure 1.** Figure 1: Audio datasets containing deepfake speech with the corpus they were derived from – utterances were taken directly from the corpus, or the corpus was used to train synthesis tools if mentioned in the paper. The time axis (Y) shows the year of publication. Best viewed in color. of bona fide speech sources. Notably, LibriVox [55] serves as a source for multiple corpora, including LibriTTS [56], LJSpeech [57],… view at source ↗

read the original abstract

Claims about the robustness and fairness of deepfake speech detectors are only as credible as the datasets used to train and evaluate those systems. We present a dataset-level audit of the deepfake speech landscape. We compile and analyze 39 deepfake speech datasets, examining key attributes including accessibility, documentation, demographic and language coverage, dataset scale, and the underlying bona fide speech sources. Our audit reveals two important takeaways. Firstly, fairness assessment is largely infeasible because most datasets lack demographic metadata, and only a few contain gender or language labels. This prevents any meaningful subgroup analysis and leaves other demographic attributes unaddressed. Secondly, we identify substantial overlap in underlying bona fide source corpora across datasets, which can undermine cross-dataset evaluation and lead to overstated generalization claims.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Audit of 39 deepfake speech datasets flags missing demographics and source overlaps, but the sampling method for those 39 is not described in the abstract.

read the letter

The paper's main points are straightforward: most of the 39 deepfake speech datasets lack demographic metadata, so fairness checks are basically impossible, and the real-speech sources overlap a lot across datasets, which weakens claims about cross-dataset generalization.

What stands out is the scale of the compilation. Pulling together 39 datasets and checking documentation, accessibility, language coverage, and bona fide source reuse is new at this level for speech deepfakes. The work is empirical and stays close to observable properties of the datasets rather than fitting models or deriving equations. That keeps the circularity burden at zero and makes the two takeaways traceable to the audit results.

The soft spot is the selection of the 39 itself. The abstract gives no inclusion criteria, search terms, or completeness argument, so it is not clear whether the list captures the broader landscape or simply the more visible ones. If datasets with richer metadata or independent sources were systematically missed, both headline claims become narrower than they read. The overlap finding would hold up better if the paper shows exactly how sources were matched, but that detail is also outside the abstract.

This is for people who train or evaluate deepfake speech detectors and need to know the practical limits of current data. Dataset builders and reviewers who care about evaluation hygiene will find it directly useful. It deserves a serious referee because the topic touches real stakes in detection research and the evidence is observational rather than speculative. The sampling question is fixable with a clearer methods section.

Recommendation: send it to review.

Referee Report

1 major / 2 minor

Summary. The paper presents a dataset-level audit of 39 deepfake speech datasets, examining attributes including accessibility, documentation, demographic and language coverage, scale, and underlying bona fide speech sources. It concludes that fairness assessment is largely infeasible because most datasets lack demographic metadata (with only a few containing gender or language labels), preventing subgroup analysis, and that substantial overlap exists in the bona fide source corpora across datasets, which can undermine cross-dataset evaluation and lead to overstated generalization claims.

Significance. If the audit's sampling is representative, the findings identify concrete barriers to ethical evaluation and reliable benchmarking in deepfake speech detection. By cataloging metadata gaps and source overlaps, the work supplies a practical reference that could inform dataset curation standards and evaluation protocols in the field.

major comments (1)

[Dataset compilation / audit methodology] The section describing the compilation of the 39 datasets provides no explicit search protocol, inclusion/exclusion criteria, or completeness argument. This is load-bearing for both central claims: without it, the assertion that fairness assessment is 'largely infeasible' for 'most datasets' and that source overlaps are 'substantial' across 'the deepfake speech landscape' cannot be distinguished from possible sampling bias (e.g., under-representation of newer or non-English corpora with richer metadata).

minor comments (2)

[Results / tables] A summary table listing all 39 datasets with columns for accessibility, demographic labels present, and identified source corpora would improve readability and allow readers to verify the overlap and metadata claims directly.
[Abstract] The abstract states that 'only a few contain gender or language labels' but does not quantify 'few' or 'most'; adding counts or percentages in the abstract would strengthen the headline claims without lengthening the text.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive comments and recommendation. The single major comment identifies a genuine gap in our methodology description. We will revise the manuscript to address it directly.

read point-by-point responses

Referee: [Dataset compilation / audit methodology] The section describing the compilation of the 39 datasets provides no explicit search protocol, inclusion/exclusion criteria, or completeness argument. This is load-bearing for both central claims: without it, the assertion that fairness assessment is 'largely infeasible' for 'most datasets' and that source overlaps are 'substantial' across 'the deepfake speech landscape' cannot be distinguished from possible sampling bias (e.g., under-representation of newer or non-English corpora with richer metadata).

Authors: We agree that the absence of an explicit search protocol, inclusion/exclusion criteria, and completeness argument is a limitation that weakens the generalizability claims. In the revised manuscript we will insert a new subsection (likely 3.1) that specifies: (1) the search strategy (keywords, databases including Google Scholar, arXiv, Hugging Face Datasets, Zenodo, and major surveys up to December 2023); (2) inclusion criteria (publicly released deepfake speech datasets containing both bona fide and spoofed utterances, with at least one published paper); (3) exclusion criteria (non-speech audio, private datasets, or those without any accompanying paper); and (4) a completeness argument based on cross-referencing against the most-cited surveys and repositories at the time of collection. We will also note the cutoff date and discuss the possibility of newer datasets with richer metadata. This revision will allow readers to evaluate sampling bias while preserving the audit's core observations on the 39 datasets we examined. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical survey with no derivations or fitted quantities

full rationale

This is a descriptive audit paper that compiles 39 datasets and reports direct observations on metadata presence, source overlaps, and documentation quality. There are no equations, parameters, predictions, uniqueness theorems, or ansatzes. All claims reduce to the authors' manual examination of the listed datasets rather than to any self-referential construction or prior fitted result. The sampling assumption noted by the skeptic is a scope limitation, not a circular reduction of a derivation to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a descriptive empirical audit with no mathematical modeling, fitted parameters, background axioms, or postulated entities.

pith-pipeline@v0.9.1-grok · 5672 in / 1107 out tokens · 26845 ms · 2026-06-27T11:38:29.935756+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

76 extracted references · 4 canonical work pages

[1]

Introduction The misuse of synthetic speech poses a growing challenge for speaker verification, digital forensics, and media authentica- tion [1, 2, 3]. Employing a deepfake speech detector is a core defense mechanism [4, 5], but moving toward real-world us- age shifts focus from accuracy alone to robustness, fairness, and transparency. In high-stakes set...
[2]

Background Current work on deepfake speech datasets focuses on model benchmarking [7, 8] or high-level dataset overviews [11], typ- ically reporting basic properties such as size, year, and some- times a short description, with limited attention to dataset au- ditability and representativeness (e.g., demographic metadata or data sources). In parallel, the...
[3]

Audit of Existing Deepfake Speech Datasets This section audits existing deepfake speech datasets to eval- uate whether available resources are suitable for fairness eval- uation and cross-dataset generalization testing. Our contribu- 1Interactive browser of the dataset table and provenance map: https://security-fit.github.io/deepfake_speech_ datasets_app/...

Pith/arXiv arXiv 2026
[4]

The exam- ined datasets list more than 100 distinct tools and architectures (both open-source and commercial), indicating significant di- versity

Synthesizers:A diverse range of synthesizers is arguably the most critical factor for effectively training detectors to ad- dress real-world scenarios, particularly since utterances gener- ated by novel methods may evade detection [53, 54]. The exam- ined datasets list more than 100 distinct tools and architectures (both open-source and commercial), indic...
[5]

Deepfake datasets draw from a wide range of sources, typically established speech corpora and seldom from social media; purely self-recorded data is rare

(Bona fide) speech sources:The origin of the bona fide speech plays a crucial role in assessing the generalization abil- ity of deepfake speech detectors. Deepfake datasets draw from a wide range of sources, typically established speech corpora and seldom from social media; purely self-recorded data is rare. Importantly, many datasets are derived from ove...

2019
[6]

Without adequate demographic data, ensuring the effectiveness of detection solutions across diverse groups becomes unattain- able

Demographics:Demographic metadata is essential for analyzing and mitigating bias in deepfake speech detection. Without adequate demographic data, ensuring the effectiveness of detection solutions across diverse groups becomes unattain- able. The existing datasets currently do not sufficiently incorpo- rate this essential information. Only some datasets pr...
[7]

Only a handful are bilin- gual [8, 30]

Language:Most deepfake speech datasets are monolin- gual, mostly English and Chinese. Only a handful are bilin- gual [8, 30]. Multilingual datasets have recently emerged, in- cluding MLAAD [9], AI4T [48], SynHate [49], SCDF [10] and SpeechFake [52]. These datasets are also trying to provide bal- anced language subsets. Prior results indicate that detector...
[8]

Small datasets are unlikely to be suitable for DNN-based training; prior work suggests that fewer than 80,000 samples can be insufficient in some settings [4]

Size:Dataset scale affects the feasibility of training modern detectors. Small datasets are unlikely to be suitable for DNN-based training; prior work suggests that fewer than 80,000 samples can be insufficient in some settings [4]. How- ever, the sample count alone is not informative without adequate diversity in speakers and synthesis methods
[9]

Older datasets may under-represent current attacks, leading to evaluations that do not reflect state-of-the-art deepfakes

Publication year:Because speech synthesis quality evolves rapidly, staying up to date with this trend is crucial. Older datasets may under-represent current attacks, leading to evaluations that do not reflect state-of-the-art deepfakes
[10]

Even when acces- sible, restrictive or unclear licenses can still limit reproducibil- ity and practical deployment

Availability:Access and licensing remain practical bar- riers: 15% (6/39) of datasets are restricted, and some are described as public but are not obtainable via the provided links [49] or are only partially released [51]. Even when acces- sible, restrictive or unclear licenses can still limit reproducibil- ity and practical deployment
[11]

Findings & Discussion Our dataset audit uncovered major shortcomings of the cur- rently available deepfake speech datasets. Most issues originate from missing metadata, undisclosed synthesis details, insuffi- cient gender and language coverage, outdated synthesis tools, and licensing or access constraints. Only 19/39 (49%) datasets report speaker counts a...
[12]

Our analysis shows that missing demographic and lan- guage metadata makes fairness and bias evaluation infeasible with the resources currently available

Conclusion This paper presented a dataset-level audit of 39 deepfake speech datasets and mapped overlap in underlying bona fide source cor- pora. Our analysis shows that missing demographic and lan- guage metadata makes fairness and bias evaluation infeasible with the resources currently available. Additionally, we identi- fied substantial overlap between...
[13]

Acknowledgments This work was supported by the Brno University of Technology internal project FIT-S-26-9011, and the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ project (ID: 90254)
[14]

The authors reviewed and edited the output as needed and take full respon- sibility for the publication’s content

Generative AI Use Disclosure During the preparation of this work, the authors used Generative AI Models (specifically Google Gemini, ChatGPT, and Gram- marly) for language editing and text refinement. The authors reviewed and edited the output as needed and take full respon- sibility for the publication’s content
[15]

Evaluation framework for deepfake speech detection: a comparative study of state-of-the-art deepfake speech detectors,

A. Firc, K. Malinka, and P. Han ´aˇcek, “Evaluation framework for deepfake speech detection: a comparative study of state-of-the-art deepfake speech detectors,”Cybersecurity, vol. 8, no. 1, 2025

2025
[16]

Publications Office of the European Union, 2024

Europol Innovation Lab,Facing reality? – Law enforcement and the challenge of deepfakes – An observatory report from the Eu- ropol innovation lab. Publications Office of the European Union, 2024

2024
[17]

Resilience of V oice Assistants to Synthetic Speech,

K. Malinka, A. Firc, P. Kaˇska, T. Lapˇsansk´y, O. ˇSandor, and I. Ho- moliak, “Resilience of V oice Assistants to Synthetic Speech,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformat- ics), vol. 14982 LNCS, p. 66 – 84, 2024

2024
[18]

Deepfake Speech De- tection: A Spectrogram Analysis,

A. Firc, K. Malinka, and P. Han ´aˇcek, “Deepfake Speech De- tection: A Spectrogram Analysis,” inProceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, 2024, Confer- ence paper

2024
[19]

Evolutionary Multi-Objective Fusion of Deepfake Speech Detectors,

V . Stan ˇek, M. Pere ˇs´ıni, L. Sekanina, A. Firc, and K. Malinka, “Evolutionary Multi-Objective Fusion of Deepfake Speech Detectors,” 2026. [Online]. Available: https://arxiv.org/abs/2604. 01330

2026
[20]

Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down har- monised rules on artificial intelligence,

European Union, “Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down har- monised rules on artificial intelligence,” https://eur-lex.europa. eu/eli/reg/2024/1689/oj, 2024, official Journal of the European Union, L 295, 12.7.2024

2024
[21]

ASVspoof 5: crowd- sourced speech data, deepfakes, and adversarial attacks at scale,

X. Wang, H. Delgado, H. Tak, J. weon Jung, H. jin Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. H. Kinnunen, N. Evans, K. A. Lee, and J. Yamagishi, “ASVspoof 5: crowd- sourced speech data, deepfakes, and adversarial attacks at scale,” inThe Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 1–8

2024
[22]

V oiceWukong: benchmarking deepfake voice detection,

Z. Yan, Y . Zhao, and H. Wang, “V oiceWukong: benchmarking deepfake voice detection,” inProceedings of the 34th USENIX Conference on Security Symposium, ser. SEC ’25. USA: USENIX Association, 2025

2025
[23]

MLAAD: The Multi-Language Audio Anti-Spoofing Dataset,

N. M. M ¨uller, P. Kawa, W. H. Choong, E. Casanova, E. G ¨olge, T. M ¨uller, P. Syga, P. Sperl, and K. B ¨ottinger, “MLAAD: The Multi-Language Audio Anti-Spoofing Dataset,”arXiv preprint arXiv:2401.09512, 2024

Pith/arXiv arXiv 2024
[24]

SCDF: A Speaker Characteristics Deepfake Speech Dataset for Bias Analysis,

V . Stanˇek, K. Srna, A. Firc, and K. Malinka, “SCDF: A Speaker Characteristics Deepfake Speech Dataset for Bias Analysis,” in BIOSIG 2025. Gesellschaft f ¨ur Informatik e.V ., 2025

2025
[25]

Deepfakes as a threat to a speaker and facial recognition: An overview of tools and attack vectors,

A. Firc, K. Malinka, and P. Han ´aˇcek, “Deepfakes as a threat to a speaker and facial recognition: An overview of tools and attack vectors,”Heliyon, 2023

2023
[26]

Datasheets for Healthcare AI: A Framework for Transparency and Bias Mitigation,

M. Siddik and H. J. Pandit, “Datasheets for Healthcare AI: A Framework for Transparency and Bias Mitigation,” 2025. [Online]. Available: https://arxiv.org/abs/2501.05617

arXiv 2025
[27]

What’s documented in AI? Systematic Analysis of 32K AI Model Cards,

W. Liang, N. Rajani, X. Yang, E. Ozoani, E. Wu, Y . Chen, D. S. Smith, and J. Zou, “What’s documented in AI? Systematic Analysis of 32K AI Model Cards,” 2024. [Online]. Available: https://arxiv.org/abs/2402.05160

arXiv 2024
[28]

The State of Documentation Practices of Third-Party Machine Learning Models and Datasets,

E. L. Oreamuno, R. F. Khan, A. A. Bangash, C. Stinson, and B. Adams, “The State of Documentation Practices of Third-Party Machine Learning Models and Datasets,”IEEE Software, vol. 41, no. 5, pp. 52–59, 2024

2024
[29]

Navigating Dataset Doc- umentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face,

X. Yang, W. Liang, and J. Zou, “Navigating Dataset Doc- umentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face,” 2024. [Online]. Available: https://arxiv.org/abs/2401.13822

arXiv 2024
[30]

GBDF: Gender Balanced DeepFake Dataset Towards Fair DeepFake Detection,

A. V . Nadimpalli and A. Rattani, “GBDF: Gender Balanced DeepFake Dataset Towards Fair DeepFake Detection,” inPat- tern Recognition, Computer Vision, and Image Processing. ICPR 2022 International Workshops and Challenges, J.-J. Rousseau and B. Kapralos, Eds. Cham: Springer Nature Switzerland, 2023, pp. 320–337

2022
[31]

Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations,

D. Liu, Z. Wang, C. Peng, N. Wang, R. Hu, and X. Gao, “Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations,”arXiv.org, 2024. [Online]. Available: https://arxiv.org/abs/2407.14367

arXiv 2024
[32]

Towards Measuring Fairness in AI: The Casual Conversations Dataset,

C. Hazirbas, J. Bitton, B. Dolhansky, J. Pan, A. Gordo, and C. C. Ferrer, “Towards Measuring Fairness in AI: The Casual Conversations Dataset,”IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 4, no. 3, pp. 324–332, 7
[33]

Available: http://dx.doi.org/10.1109/tbiom.2021

[Online]. Available: http://dx.doi.org/10.1109/tbiom.2021. 3132237

work page doi:10.1109/tbiom.2021 2021
[34]

FairSSD: Understanding Bias in Synthetic Speech Detec- tors,

A. K. Singh Yadav, K. Bhagtani, D. Salvi, P. Bestagini, and E. J. Delp, “FairSSD: Understanding Bias in Synthetic Speech Detec- tors,” in2024 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition Workshops (CVPRW), 2024, pp. 4418–4428

2024
[35]

The V oice Conversion Challenge 2016,

T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, and J. Yamagishi, “The V oice Conversion Challenge 2016,” inIn- terspeech 2016. International Speech Communication Associa- tion, 2016, pp. 1632–1636

2016
[36]

The V oice Conversion Chal- lenge 2018: Promoting Development of Parallel and Nonparallel Methods,

J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavi- cencio, T. Kinnunen, and Z. Ling, “The V oice Conversion Chal- lenge 2018: Promoting Development of Parallel and Nonparallel Methods,” inThe Speaker and Language Recognition Workshop (Odyssey 2018). ISCA, 2018, p. 195

2018
[37]

ASVspoof 2019: A large-scale public database of synthe- sized, converted and replayed speech,

X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V . Vestman, T. Kinnunen, K. A. Lee et al., “ASVspoof 2019: A large-scale public database of synthe- sized, converted and replayed speech,”Computer Speech & Lan- guage, vol. 64, p. 101114, 2020

2019
[38]

FoR: A dataset for synthetic speech detection,

R. Reimao and V . Tzerpos, “FoR: A dataset for synthetic speech detection,” in2019 International Conference on Speech Technol- ogy and Human-Computer Dialogue (SpeD). IEEE, 2019, pp. 1–10

2019
[39]

DeepSonar: Towards Effective and Robust Detection of AI-Synthesized Fake V oices,

R. Wang, F. Juefei-Xu, Y . Huang, Q. Guo, X. Xie, L. Ma, and Y . Liu, “DeepSonar: Towards Effective and Robust Detection of AI-Synthesized Fake V oices,” inProceedings of the 28th ACM International Conference on Multimedia, ser. MM ’20. New York, NY , USA: Association for Computing Machinery, 2020, p. 1207–1216. [Online]. Available: https: //doi.org/10.11...

work page doi:10.1145/3394171.3413716 2020
[40]

SynSpeechDDB: a new synthetic speech detection database,

Z. Zhang, Y . Gu, X. Yi, and X. Zhao, “SynSpeechDDB: a new synthetic speech detection database,” 2020. [Online]. Available: https://dx.doi.org/10.21227/ta8z-mx73

work page doi:10.21227/ta8z-mx73 2020
[41]

V oice Conversion Challenge 2020 — Intra-lingual semi-parallel and cross-lingual voice con- version —,

Z. Yi, W.-C. Huang, X. Tian, J. Yamagishi, R. K. Das, T. Kin- nunen, Z.-H. Ling, and T. Toda, “V oice Conversion Challenge 2020 — Intra-lingual semi-parallel and cross-lingual voice con- version —,” inProc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020, pp. 80–98

2020
[42]

ASVspoof 2021: Towards spoofed and deepfake speech detec- tion in the wild,

X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kin- nunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautschet al., “ASVspoof 2021: Towards spoofed and deepfake speech detec- tion in the wild,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023

2021
[43]

FMFCC-A: A Challeng- ing Mandarin Dataset for Synthetic Speech Detection,

Z. Zhang, Y . Gu, X. Yi, and X. Zhao, “FMFCC-A: A Challeng- ing Mandarin Dataset for Synthetic Speech Detection,” inDig- ital Forensics and Watermarking: 20th International Workshop, IWDW 2021, Beijing, China, November 20–22, 2021, Revised Se- lected Papers, vol. 13180. Springer Nature, 2022, p. 117

2021
[44]

Half-Truth: A Partially Fake Audio Detection Dataset,

J. Yi, Y . Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, and R. Fu, “Half-Truth: A Partially Fake Audio Detection Dataset,” inProc. Interspeech 2021, 2021, pp. 1654–1658

2021
[45]

WaveFake: A Data Set to Facilitate Audio Deepfake Detection,

J. Frank and L. Sch ¨onherr, “WaveFake: A Data Set to Facilitate Audio Deepfake Detection,” inThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021

2021
[46]

System fingerprints detection for deepfake audio: An initial dataset and investigation,

X. Yan, J. Yi, J. Tao, C. Wang, H. Ma, Z. Tian, and R. Fu, “System fingerprints detection for deepfake audio: An initial dataset and investigation,”arXiv preprint arXiv:2208.10489, 2022

arXiv 2022
[47]

Add 2022: the first audio deep synthe- sis detection challenge,

J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y . Bai, C. Fanet al., “Add 2022: the first audio deep synthe- sis detection challenge,” inICASSP 2022-2022 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9216–9220

2022
[48]

CFAD: A Chinese dataset for fake audio detection,

H. Ma, J. Yi, C. Wang, X. Yan, J. Tao, T. Wang, S. Wang, L. Xu, and R. Fu, “CFAD: A Chinese dataset for fake audio detection,” arXiv preprint arXiv:2207.12308, 2022

arXiv 2022
[49]

The Dawn of a Text-Dependent Society: Deepfakes as a Threat to Speech Verification Systems,

A. Firc and K. Malinka, “The Dawn of a Text-Dependent Society: Deepfakes as a Threat to Speech Verification Systems,” inPro- ceedings of the 37th ACM/SIGAPP Symposium on Applied Com- puting, ser. SAC ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 1646–1655

2022
[50]

Does Audio Deepfake Detection Generalize?

N. M ¨uller, P. Czempin, F. Diekmann, A. Froghyar, and K. B ¨ottinger, “Does Audio Deepfake Detection Generalize?” in Proc. Interspeech 2022, 2022, pp. 2783–2787

2022
[51]

ADD 2023: the Second Audio Deepfake Detection Challenge,

J. Yi, J. Tao, R. Fu, X. Yan, C. Wang, T. Wang, C. Y . Zhang, X. Zhang, Y . Zhao, Y . Renet al., “ADD 2023: the Second Audio Deepfake Detection Challenge,”arXiv preprint arXiv:2305.13774, 2023

arXiv 2023
[52]

Transferring Audio Deepfake Detection Capability across Lan- guages,

Z. Ba, Q. Wen, P. Cheng, Y . Wang, F. Lin, L. Lu, and Z. Liu, “Transferring Audio Deepfake Detection Capability across Lan- guages,” inProceedings of the ACM Web Conference 2023, 2023, pp. 2033–2044

2023
[53]

Vulnera- bility of Automatic Identity Recognition to Audio-Visual Deep- fakes,

P. Korshunov, H. Chen, P. N. Garner, and S. Marcel, “Vulnera- bility of Automatic Identity Recognition to Audio-Visual Deep- fakes,” inIEEE International Joint Conference on Biometrics (IJCB), Sep. 2023

2023
[54]

The PartialSpoof Database and Countermeasures for the Detec- tion of Short Fake Speech Segments Embedded in an Utterance,

L. Zhang, X. Wang, E. Cooper, N. Evans, and J. Yamagishi, “The PartialSpoof Database and Countermeasures for the Detec- tion of Short Fake Speech Segments Embedded in an Utterance,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 31, pp. 813–825, 2023

2023
[55]

TIMIT-TTS: A Text-to-Speech Dataset for Multimodal Synthetic Media Detection,

D. Salvi, B. Hosler, P. Bestagini, M. C. Stamm, and S. Tubaro, “TIMIT-TTS: A Text-to-Speech Dataset for Multimodal Synthetic Media Detection,”IEEE Access, vol. 11, pp. 50 851–50 866, 2023

2023
[56]

Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,

X. Wang and J. Yamagishi, “Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023
[57]

CodecFake: Enhancing Anti- Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems,

H. Wu, Y . Tseng, and H. yi Lee, “CodecFake: Enhancing Anti- Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems,” inInterspeech 2024, 2024, pp. 1770– 1774

2024
[58]

Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio,

Y . Lu, Y . Xie, R. Fu, Z. Wen, J. Tao, Z. Wang, X. Liu, Y . Li, Y . Liu, X. Wang, and S. Shi, “Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio,” 06 2024

2024
[59]

SafeEar: Content Privacy-Preserving Audio Deepfake Detection,

X. Li, K. Li, Y . Zheng, C. Yan, X. Ji, and W. Xu, “SafeEar: Content Privacy-Preserving Audio Deepfake Detection,” inPro- ceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2024

2024
[60]

Diffuse or Confuse: A Diffusion Deepfake Speech Dataset,

A. Firc, K. Malinka, and P. Han ´aˇcek, “Diffuse or Confuse: A Diffusion Deepfake Speech Dataset,” in2024 International Con- ference of the Biometrics Special Interest Group (BIOSIG), 2024, pp. 1–7

2024
[61]

MLADDC: Multi-Lingual Audio Deepfake Detection Corpus,

A. J. SHAH, R. M. Purohit, D. H. Vaghera, and H. Patil, “MLADDC: Multi-Lingual Audio Deepfake Detection Corpus,” inAudio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation, 2024. [Online]. Available: https://openreview.net/forum?id=ic3HvoOTeU

2024
[62]

DFADD: The Diffusion and Flow- Matching Based Audio Deepfake Dataset,

J. Du, I.-M. Lin, I.-H. Chiu, X. Chen, H. Wu, W. Ren, Y . Tsao, H.-y. Lee, and J.-S. R. Jang, “DFADD: The Diffusion and Flow- Matching Based Audio Deepfake Dataset,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 921– 928

2024
[63]

Un- masking real-world audio deepfakes: A data-centric approach,

D. Combei, A. Stan, D. Oneata, N. M ¨uller, and H. Cucu, “Un- masking real-world audio deepfakes: A data-centric approach,” in Interspeech 2025, 2025, pp. 5343–5347

2025
[64]

SynHate: De- tecting Hate Speech in Synthetic Deepfake Audio,

R. Ranjan, K. Pipariya, M. Vatsa, and R. Singh, “SynHate: De- tecting Hate Speech in Synthetic Deepfake Audio,” inInterspeech 2025, 2025, pp. 5623–5627

2025
[65]

STOPA: A Dataset of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution,

A. Firc, M. Chhibber, J. Mishra, V . Pratap Singh, T. Kinnunen, and K. Malinka, “STOPA: A Dataset of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution,” in Interspeech 2025, 2025, pp. 1553–1557

2025
[66]

PartialEdit: Identi- fying Partial Deepfakes in the Era of Neural Speech Editing ,

Y . Zhang, B. Tian, L. Zhang, and Z. Duan, “PartialEdit: Identi- fying Partial Deepfakes in the Era of Neural Speech Editing ,” in Interspeech 2025, 2025, pp. 5353–5357

2025
[67]

SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods,

W. Huang, Y . Gu, Z. Wang, H. Zhu, and Y . Qian, “SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Associa...

2025
[68]

Assessing the Human Abil- ity to Recognize Synthetic Speech in Ordinary Conversation,

D. Prudk ´y, A. Firc, and K. Malinka, “Assessing the Human Abil- ity to Recognize Synthetic Speech in Ordinary Conversation,” in 2023 International Conference of the Biometrics Special Interest Group (BIOSIG), 2023, pp. 1–5

2023
[69]

Comprehensive Multiparametric Analysis of Human Deepfake Speech Recognition,

K. Malinka, A. Firc, M. ˇSalko, D. Prudk ´y, K. Rada ˇcovsk´a, and P. Han´aˇcek, “Comprehensive Multiparametric Analysis of Human Deepfake Speech Recognition,”EURASIP Journal on Image and Video Processing, vol. 2024, no. 24, pp. 1–25, 2024

2024
[70]

LibriV ox,

“LibriV ox,” https://librivox.org/, accessed: 08-02-2026 [online]

2026
[71]

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech,

H. Zen, V . Dang, R. Clark, Y . Zhang, R. J. Weiss, Y . Jia, Z. Chen, and Y . Wu, “LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech,” inInterspeech 2019, 2019, pp. 1526–1530

2019
[72]

The LJ Speech Dataset,

K. Ito and L. Johnson, “The LJ Speech Dataset,” https://keithito. com/LJ-Speech-Dataset/, 2017

2017
[73]

Lib- rispeech: An ASR corpus based on public domain audio books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- rispeech: An ASR corpus based on public domain audio books,” in2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210

2015
[74]

MLS: A Large-Scale Multilingual Dataset for Speech Research,

V . Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A Large-Scale Multilingual Dataset for Speech Research,” inInterspeech 2020, 2020, pp. 2757–2761

2020
[75]

AIShell-1: An Open-Source Mandarin Speech Corpus and A Speech Recogni- tion Baseline,

H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AIShell-1: An Open-Source Mandarin Speech Corpus and A Speech Recogni- tion Baseline,” inOriental COCOSDA 2017, 2017

2017
[76]

CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR V oice Cloning Toolkit,

J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR V oice Cloning Toolkit,” University of Edinburgh. The Centre for Speech Technology Research (CSTR), Tech. Rep., 2019. [Online]. Available: https://doi.org/10.7488/ds/2645

work page doi:10.7488/ds/2645 2019

[1] [1]

Introduction The misuse of synthetic speech poses a growing challenge for speaker verification, digital forensics, and media authentica- tion [1, 2, 3]. Employing a deepfake speech detector is a core defense mechanism [4, 5], but moving toward real-world us- age shifts focus from accuracy alone to robustness, fairness, and transparency. In high-stakes set...

[2] [2]

Background Current work on deepfake speech datasets focuses on model benchmarking [7, 8] or high-level dataset overviews [11], typ- ically reporting basic properties such as size, year, and some- times a short description, with limited attention to dataset au- ditability and representativeness (e.g., demographic metadata or data sources). In parallel, the...

[3] [3]

Audit of Existing Deepfake Speech Datasets This section audits existing deepfake speech datasets to eval- uate whether available resources are suitable for fairness eval- uation and cross-dataset generalization testing. Our contribu- 1Interactive browser of the dataset table and provenance map: https://security-fit.github.io/deepfake_speech_ datasets_app/...

Pith/arXiv arXiv 2026

[4] [4]

The exam- ined datasets list more than 100 distinct tools and architectures (both open-source and commercial), indicating significant di- versity

Synthesizers:A diverse range of synthesizers is arguably the most critical factor for effectively training detectors to ad- dress real-world scenarios, particularly since utterances gener- ated by novel methods may evade detection [53, 54]. The exam- ined datasets list more than 100 distinct tools and architectures (both open-source and commercial), indic...

[5] [5]

Deepfake datasets draw from a wide range of sources, typically established speech corpora and seldom from social media; purely self-recorded data is rare

(Bona fide) speech sources:The origin of the bona fide speech plays a crucial role in assessing the generalization abil- ity of deepfake speech detectors. Deepfake datasets draw from a wide range of sources, typically established speech corpora and seldom from social media; purely self-recorded data is rare. Importantly, many datasets are derived from ove...

2019

[6] [6]

Without adequate demographic data, ensuring the effectiveness of detection solutions across diverse groups becomes unattain- able

Demographics:Demographic metadata is essential for analyzing and mitigating bias in deepfake speech detection. Without adequate demographic data, ensuring the effectiveness of detection solutions across diverse groups becomes unattain- able. The existing datasets currently do not sufficiently incorpo- rate this essential information. Only some datasets pr...

[7] [7]

Only a handful are bilin- gual [8, 30]

Language:Most deepfake speech datasets are monolin- gual, mostly English and Chinese. Only a handful are bilin- gual [8, 30]. Multilingual datasets have recently emerged, in- cluding MLAAD [9], AI4T [48], SynHate [49], SCDF [10] and SpeechFake [52]. These datasets are also trying to provide bal- anced language subsets. Prior results indicate that detector...

[8] [8]

Small datasets are unlikely to be suitable for DNN-based training; prior work suggests that fewer than 80,000 samples can be insufficient in some settings [4]

Size:Dataset scale affects the feasibility of training modern detectors. Small datasets are unlikely to be suitable for DNN-based training; prior work suggests that fewer than 80,000 samples can be insufficient in some settings [4]. How- ever, the sample count alone is not informative without adequate diversity in speakers and synthesis methods

[9] [9]

Older datasets may under-represent current attacks, leading to evaluations that do not reflect state-of-the-art deepfakes

Publication year:Because speech synthesis quality evolves rapidly, staying up to date with this trend is crucial. Older datasets may under-represent current attacks, leading to evaluations that do not reflect state-of-the-art deepfakes

[10] [10]

Even when acces- sible, restrictive or unclear licenses can still limit reproducibil- ity and practical deployment

Availability:Access and licensing remain practical bar- riers: 15% (6/39) of datasets are restricted, and some are described as public but are not obtainable via the provided links [49] or are only partially released [51]. Even when acces- sible, restrictive or unclear licenses can still limit reproducibil- ity and practical deployment

[11] [11]

Findings & Discussion Our dataset audit uncovered major shortcomings of the cur- rently available deepfake speech datasets. Most issues originate from missing metadata, undisclosed synthesis details, insuffi- cient gender and language coverage, outdated synthesis tools, and licensing or access constraints. Only 19/39 (49%) datasets report speaker counts a...

[12] [12]

Our analysis shows that missing demographic and lan- guage metadata makes fairness and bias evaluation infeasible with the resources currently available

Conclusion This paper presented a dataset-level audit of 39 deepfake speech datasets and mapped overlap in underlying bona fide source cor- pora. Our analysis shows that missing demographic and lan- guage metadata makes fairness and bias evaluation infeasible with the resources currently available. Additionally, we identi- fied substantial overlap between...

[13] [13]

Acknowledgments This work was supported by the Brno University of Technology internal project FIT-S-26-9011, and the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ project (ID: 90254)

[14] [14]

The authors reviewed and edited the output as needed and take full respon- sibility for the publication’s content

Generative AI Use Disclosure During the preparation of this work, the authors used Generative AI Models (specifically Google Gemini, ChatGPT, and Gram- marly) for language editing and text refinement. The authors reviewed and edited the output as needed and take full respon- sibility for the publication’s content

[15] [15]

Evaluation framework for deepfake speech detection: a comparative study of state-of-the-art deepfake speech detectors,

A. Firc, K. Malinka, and P. Han ´aˇcek, “Evaluation framework for deepfake speech detection: a comparative study of state-of-the-art deepfake speech detectors,”Cybersecurity, vol. 8, no. 1, 2025

2025

[16] [16]

Publications Office of the European Union, 2024

Europol Innovation Lab,Facing reality? – Law enforcement and the challenge of deepfakes – An observatory report from the Eu- ropol innovation lab. Publications Office of the European Union, 2024

2024

[17] [17]

Resilience of V oice Assistants to Synthetic Speech,

K. Malinka, A. Firc, P. Kaˇska, T. Lapˇsansk´y, O. ˇSandor, and I. Ho- moliak, “Resilience of V oice Assistants to Synthetic Speech,” Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformat- ics), vol. 14982 LNCS, p. 66 – 84, 2024

2024

[18] [18]

Deepfake Speech De- tection: A Spectrogram Analysis,

A. Firc, K. Malinka, and P. Han ´aˇcek, “Deepfake Speech De- tection: A Spectrogram Analysis,” inProceedings of the 39th ACM/SIGAPP Symposium on Applied Computing, 2024, Confer- ence paper

2024

[19] [19]

Evolutionary Multi-Objective Fusion of Deepfake Speech Detectors,

V . Stan ˇek, M. Pere ˇs´ıni, L. Sekanina, A. Firc, and K. Malinka, “Evolutionary Multi-Objective Fusion of Deepfake Speech Detectors,” 2026. [Online]. Available: https://arxiv.org/abs/2604. 01330

2026

[20] [20]

Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down har- monised rules on artificial intelligence,

European Union, “Regulation (EU) 2024/1689 of the European Parliament and of the Council of 13 June 2024 laying down har- monised rules on artificial intelligence,” https://eur-lex.europa. eu/eli/reg/2024/1689/oj, 2024, official Journal of the European Union, L 295, 12.7.2024

2024

[21] [21]

ASVspoof 5: crowd- sourced speech data, deepfakes, and adversarial attacks at scale,

X. Wang, H. Delgado, H. Tak, J. weon Jung, H. jin Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. H. Kinnunen, N. Evans, K. A. Lee, and J. Yamagishi, “ASVspoof 5: crowd- sourced speech data, deepfakes, and adversarial attacks at scale,” inThe Automatic Speaker Verification Spoofing Countermeasures Workshop (ASVspoof 2024), 2024, pp. 1–8

2024

[22] [22]

V oiceWukong: benchmarking deepfake voice detection,

Z. Yan, Y . Zhao, and H. Wang, “V oiceWukong: benchmarking deepfake voice detection,” inProceedings of the 34th USENIX Conference on Security Symposium, ser. SEC ’25. USA: USENIX Association, 2025

2025

[23] [23]

MLAAD: The Multi-Language Audio Anti-Spoofing Dataset,

N. M. M ¨uller, P. Kawa, W. H. Choong, E. Casanova, E. G ¨olge, T. M ¨uller, P. Syga, P. Sperl, and K. B ¨ottinger, “MLAAD: The Multi-Language Audio Anti-Spoofing Dataset,”arXiv preprint arXiv:2401.09512, 2024

Pith/arXiv arXiv 2024

[24] [24]

SCDF: A Speaker Characteristics Deepfake Speech Dataset for Bias Analysis,

V . Stanˇek, K. Srna, A. Firc, and K. Malinka, “SCDF: A Speaker Characteristics Deepfake Speech Dataset for Bias Analysis,” in BIOSIG 2025. Gesellschaft f ¨ur Informatik e.V ., 2025

2025

[25] [25]

Deepfakes as a threat to a speaker and facial recognition: An overview of tools and attack vectors,

A. Firc, K. Malinka, and P. Han ´aˇcek, “Deepfakes as a threat to a speaker and facial recognition: An overview of tools and attack vectors,”Heliyon, 2023

2023

[26] [26]

Datasheets for Healthcare AI: A Framework for Transparency and Bias Mitigation,

M. Siddik and H. J. Pandit, “Datasheets for Healthcare AI: A Framework for Transparency and Bias Mitigation,” 2025. [Online]. Available: https://arxiv.org/abs/2501.05617

arXiv 2025

[27] [27]

What’s documented in AI? Systematic Analysis of 32K AI Model Cards,

W. Liang, N. Rajani, X. Yang, E. Ozoani, E. Wu, Y . Chen, D. S. Smith, and J. Zou, “What’s documented in AI? Systematic Analysis of 32K AI Model Cards,” 2024. [Online]. Available: https://arxiv.org/abs/2402.05160

arXiv 2024

[28] [28]

The State of Documentation Practices of Third-Party Machine Learning Models and Datasets,

E. L. Oreamuno, R. F. Khan, A. A. Bangash, C. Stinson, and B. Adams, “The State of Documentation Practices of Third-Party Machine Learning Models and Datasets,”IEEE Software, vol. 41, no. 5, pp. 52–59, 2024

2024

[29] [29]

Navigating Dataset Doc- umentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face,

X. Yang, W. Liang, and J. Zou, “Navigating Dataset Doc- umentations in AI: A Large-Scale Analysis of Dataset Cards on Hugging Face,” 2024. [Online]. Available: https://arxiv.org/abs/2401.13822

arXiv 2024

[30] [30]

GBDF: Gender Balanced DeepFake Dataset Towards Fair DeepFake Detection,

A. V . Nadimpalli and A. Rattani, “GBDF: Gender Balanced DeepFake Dataset Towards Fair DeepFake Detection,” inPat- tern Recognition, Computer Vision, and Image Processing. ICPR 2022 International Workshops and Challenges, J.-J. Rousseau and B. Kapralos, Eds. Cham: Springer Nature Switzerland, 2023, pp. 320–337

2022

[31] [31]

Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations,

D. Liu, Z. Wang, C. Peng, N. Wang, R. Hu, and X. Gao, “Thinking Racial Bias in Fair Forgery Detection: Models, Datasets and Evaluations,”arXiv.org, 2024. [Online]. Available: https://arxiv.org/abs/2407.14367

arXiv 2024

[32] [32]

Towards Measuring Fairness in AI: The Casual Conversations Dataset,

C. Hazirbas, J. Bitton, B. Dolhansky, J. Pan, A. Gordo, and C. C. Ferrer, “Towards Measuring Fairness in AI: The Casual Conversations Dataset,”IEEE Transactions on Biometrics, Behavior, and Identity Science, vol. 4, no. 3, pp. 324–332, 7

[33] [33]

Available: http://dx.doi.org/10.1109/tbiom.2021

[Online]. Available: http://dx.doi.org/10.1109/tbiom.2021. 3132237

work page doi:10.1109/tbiom.2021 2021

[34] [34]

FairSSD: Understanding Bias in Synthetic Speech Detec- tors,

A. K. Singh Yadav, K. Bhagtani, D. Salvi, P. Bestagini, and E. J. Delp, “FairSSD: Understanding Bias in Synthetic Speech Detec- tors,” in2024 IEEE/CVF Conference on Computer Vision and Pat- tern Recognition Workshops (CVPRW), 2024, pp. 4418–4428

2024

[35] [35]

The V oice Conversion Challenge 2016,

T. Toda, L.-H. Chen, D. Saito, F. Villavicencio, M. Wester, Z. Wu, and J. Yamagishi, “The V oice Conversion Challenge 2016,” inIn- terspeech 2016. International Speech Communication Associa- tion, 2016, pp. 1632–1636

2016

[36] [36]

The V oice Conversion Chal- lenge 2018: Promoting Development of Parallel and Nonparallel Methods,

J. Lorenzo-Trueba, J. Yamagishi, T. Toda, D. Saito, F. Villavi- cencio, T. Kinnunen, and Z. Ling, “The V oice Conversion Chal- lenge 2018: Promoting Development of Parallel and Nonparallel Methods,” inThe Speaker and Language Recognition Workshop (Odyssey 2018). ISCA, 2018, p. 195

2018

[37] [37]

ASVspoof 2019: A large-scale public database of synthe- sized, converted and replayed speech,

X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V . Vestman, T. Kinnunen, K. A. Lee et al., “ASVspoof 2019: A large-scale public database of synthe- sized, converted and replayed speech,”Computer Speech & Lan- guage, vol. 64, p. 101114, 2020

2019

[38] [38]

FoR: A dataset for synthetic speech detection,

R. Reimao and V . Tzerpos, “FoR: A dataset for synthetic speech detection,” in2019 International Conference on Speech Technol- ogy and Human-Computer Dialogue (SpeD). IEEE, 2019, pp. 1–10

2019

[39] [39]

DeepSonar: Towards Effective and Robust Detection of AI-Synthesized Fake V oices,

R. Wang, F. Juefei-Xu, Y . Huang, Q. Guo, X. Xie, L. Ma, and Y . Liu, “DeepSonar: Towards Effective and Robust Detection of AI-Synthesized Fake V oices,” inProceedings of the 28th ACM International Conference on Multimedia, ser. MM ’20. New York, NY , USA: Association for Computing Machinery, 2020, p. 1207–1216. [Online]. Available: https: //doi.org/10.11...

work page doi:10.1145/3394171.3413716 2020

[40] [40]

SynSpeechDDB: a new synthetic speech detection database,

Z. Zhang, Y . Gu, X. Yi, and X. Zhao, “SynSpeechDDB: a new synthetic speech detection database,” 2020. [Online]. Available: https://dx.doi.org/10.21227/ta8z-mx73

work page doi:10.21227/ta8z-mx73 2020

[41] [41]

V oice Conversion Challenge 2020 — Intra-lingual semi-parallel and cross-lingual voice con- version —,

Z. Yi, W.-C. Huang, X. Tian, J. Yamagishi, R. K. Das, T. Kin- nunen, Z.-H. Ling, and T. Toda, “V oice Conversion Challenge 2020 — Intra-lingual semi-parallel and cross-lingual voice con- version —,” inProc. Joint Workshop for the Blizzard Challenge and Voice Conversion Challenge 2020, 2020, pp. 80–98

2020

[42] [42]

ASVspoof 2021: Towards spoofed and deepfake speech detec- tion in the wild,

X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kin- nunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautschet al., “ASVspoof 2021: Towards spoofed and deepfake speech detec- tion in the wild,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2023

2021

[43] [43]

FMFCC-A: A Challeng- ing Mandarin Dataset for Synthetic Speech Detection,

Z. Zhang, Y . Gu, X. Yi, and X. Zhao, “FMFCC-A: A Challeng- ing Mandarin Dataset for Synthetic Speech Detection,” inDig- ital Forensics and Watermarking: 20th International Workshop, IWDW 2021, Beijing, China, November 20–22, 2021, Revised Se- lected Papers, vol. 13180. Springer Nature, 2022, p. 117

2021

[44] [44]

Half-Truth: A Partially Fake Audio Detection Dataset,

J. Yi, Y . Bai, J. Tao, H. Ma, Z. Tian, C. Wang, T. Wang, and R. Fu, “Half-Truth: A Partially Fake Audio Detection Dataset,” inProc. Interspeech 2021, 2021, pp. 1654–1658

2021

[45] [45]

WaveFake: A Data Set to Facilitate Audio Deepfake Detection,

J. Frank and L. Sch ¨onherr, “WaveFake: A Data Set to Facilitate Audio Deepfake Detection,” inThirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2021

2021

[46] [46]

System fingerprints detection for deepfake audio: An initial dataset and investigation,

X. Yan, J. Yi, J. Tao, C. Wang, H. Ma, Z. Tian, and R. Fu, “System fingerprints detection for deepfake audio: An initial dataset and investigation,”arXiv preprint arXiv:2208.10489, 2022

arXiv 2022

[47] [47]

Add 2022: the first audio deep synthe- sis detection challenge,

J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y . Bai, C. Fanet al., “Add 2022: the first audio deep synthe- sis detection challenge,” inICASSP 2022-2022 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 9216–9220

2022

[48] [48]

CFAD: A Chinese dataset for fake audio detection,

H. Ma, J. Yi, C. Wang, X. Yan, J. Tao, T. Wang, S. Wang, L. Xu, and R. Fu, “CFAD: A Chinese dataset for fake audio detection,” arXiv preprint arXiv:2207.12308, 2022

arXiv 2022

[49] [49]

The Dawn of a Text-Dependent Society: Deepfakes as a Threat to Speech Verification Systems,

A. Firc and K. Malinka, “The Dawn of a Text-Dependent Society: Deepfakes as a Threat to Speech Verification Systems,” inPro- ceedings of the 37th ACM/SIGAPP Symposium on Applied Com- puting, ser. SAC ’22. New York, NY , USA: Association for Computing Machinery, 2022, p. 1646–1655

2022

[50] [50]

Does Audio Deepfake Detection Generalize?

N. M ¨uller, P. Czempin, F. Diekmann, A. Froghyar, and K. B ¨ottinger, “Does Audio Deepfake Detection Generalize?” in Proc. Interspeech 2022, 2022, pp. 2783–2787

2022

[51] [51]

ADD 2023: the Second Audio Deepfake Detection Challenge,

J. Yi, J. Tao, R. Fu, X. Yan, C. Wang, T. Wang, C. Y . Zhang, X. Zhang, Y . Zhao, Y . Renet al., “ADD 2023: the Second Audio Deepfake Detection Challenge,”arXiv preprint arXiv:2305.13774, 2023

arXiv 2023

[52] [52]

Transferring Audio Deepfake Detection Capability across Lan- guages,

Z. Ba, Q. Wen, P. Cheng, Y . Wang, F. Lin, L. Lu, and Z. Liu, “Transferring Audio Deepfake Detection Capability across Lan- guages,” inProceedings of the ACM Web Conference 2023, 2023, pp. 2033–2044

2023

[53] [53]

Vulnera- bility of Automatic Identity Recognition to Audio-Visual Deep- fakes,

P. Korshunov, H. Chen, P. N. Garner, and S. Marcel, “Vulnera- bility of Automatic Identity Recognition to Audio-Visual Deep- fakes,” inIEEE International Joint Conference on Biometrics (IJCB), Sep. 2023

2023

[54] [54]

The PartialSpoof Database and Countermeasures for the Detec- tion of Short Fake Speech Segments Embedded in an Utterance,

L. Zhang, X. Wang, E. Cooper, N. Evans, and J. Yamagishi, “The PartialSpoof Database and Countermeasures for the Detec- tion of Short Fake Speech Segments Embedded in an Utterance,” IEEE/ACM Transactions on Audio, Speech, and Language Pro- cessing, vol. 31, pp. 813–825, 2023

2023

[55] [55]

TIMIT-TTS: A Text-to-Speech Dataset for Multimodal Synthetic Media Detection,

D. Salvi, B. Hosler, P. Bestagini, M. C. Stamm, and S. Tubaro, “TIMIT-TTS: A Text-to-Speech Dataset for Multimodal Synthetic Media Detection,”IEEE Access, vol. 11, pp. 50 851–50 866, 2023

2023

[56] [56]

Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,

X. Wang and J. Yamagishi, “Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,” inICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

2023

[57] [57]

CodecFake: Enhancing Anti- Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems,

H. Wu, Y . Tseng, and H. yi Lee, “CodecFake: Enhancing Anti- Spoofing Models Against Deepfake Audios from Codec-Based Speech Synthesis Systems,” inInterspeech 2024, 2024, pp. 1770– 1774

2024

[58] [58]

Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio,

Y . Lu, Y . Xie, R. Fu, Z. Wen, J. Tao, Z. Wang, X. Liu, Y . Li, Y . Liu, X. Wang, and S. Shi, “Codecfake: An Initial Dataset for Detecting LLM-based Deepfake Audio,” 06 2024

2024

[59] [59]

SafeEar: Content Privacy-Preserving Audio Deepfake Detection,

X. Li, K. Li, Y . Zheng, C. Yan, X. Ji, and W. Xu, “SafeEar: Content Privacy-Preserving Audio Deepfake Detection,” inPro- ceedings of the 2024 ACM SIGSAC Conference on Computer and Communications Security (CCS), 2024

2024

[60] [60]

Diffuse or Confuse: A Diffusion Deepfake Speech Dataset,

A. Firc, K. Malinka, and P. Han ´aˇcek, “Diffuse or Confuse: A Diffusion Deepfake Speech Dataset,” in2024 International Con- ference of the Biometrics Special Interest Group (BIOSIG), 2024, pp. 1–7

2024

[61] [61]

MLADDC: Multi-Lingual Audio Deepfake Detection Corpus,

A. J. SHAH, R. M. Purohit, D. H. Vaghera, and H. Patil, “MLADDC: Multi-Lingual Audio Deepfake Detection Corpus,” inAudio Imagination: NeurIPS 2024 Workshop AI-Driven Speech, Music, and Sound Generation, 2024. [Online]. Available: https://openreview.net/forum?id=ic3HvoOTeU

2024

[62] [62]

DFADD: The Diffusion and Flow- Matching Based Audio Deepfake Dataset,

J. Du, I.-M. Lin, I.-H. Chiu, X. Chen, H. Wu, W. Ren, Y . Tsao, H.-y. Lee, and J.-S. R. Jang, “DFADD: The Diffusion and Flow- Matching Based Audio Deepfake Dataset,” in2024 IEEE Spoken Language Technology Workshop (SLT). IEEE, 2024, pp. 921– 928

2024

[63] [63]

Un- masking real-world audio deepfakes: A data-centric approach,

D. Combei, A. Stan, D. Oneata, N. M ¨uller, and H. Cucu, “Un- masking real-world audio deepfakes: A data-centric approach,” in Interspeech 2025, 2025, pp. 5343–5347

2025

[64] [64]

SynHate: De- tecting Hate Speech in Synthetic Deepfake Audio,

R. Ranjan, K. Pipariya, M. Vatsa, and R. Singh, “SynHate: De- tecting Hate Speech in Synthetic Deepfake Audio,” inInterspeech 2025, 2025, pp. 5623–5627

2025

[65] [65]

STOPA: A Dataset of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution,

A. Firc, M. Chhibber, J. Mishra, V . Pratap Singh, T. Kinnunen, and K. Malinka, “STOPA: A Dataset of Systematic VariaTion Of DeePfake Audio for Open-Set Source Tracing and Attribution,” in Interspeech 2025, 2025, pp. 1553–1557

2025

[66] [66]

PartialEdit: Identi- fying Partial Deepfakes in the Era of Neural Speech Editing ,

Y . Zhang, B. Tian, L. Zhang, and Z. Duan, “PartialEdit: Identi- fying Partial Deepfakes in the Era of Neural Speech Editing ,” in Interspeech 2025, 2025, pp. 5353–5357

2025

[67] [67]

SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods,

W. Huang, Y . Gu, Z. Wang, H. Zhu, and Y . Qian, “SpeechFake: A Large-Scale Multilingual Speech Deepfake Dataset Incorporating Cutting-Edge Generation Methods,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Associa...

2025

[68] [68]

Assessing the Human Abil- ity to Recognize Synthetic Speech in Ordinary Conversation,

D. Prudk ´y, A. Firc, and K. Malinka, “Assessing the Human Abil- ity to Recognize Synthetic Speech in Ordinary Conversation,” in 2023 International Conference of the Biometrics Special Interest Group (BIOSIG), 2023, pp. 1–5

2023

[69] [69]

Comprehensive Multiparametric Analysis of Human Deepfake Speech Recognition,

K. Malinka, A. Firc, M. ˇSalko, D. Prudk ´y, K. Rada ˇcovsk´a, and P. Han´aˇcek, “Comprehensive Multiparametric Analysis of Human Deepfake Speech Recognition,”EURASIP Journal on Image and Video Processing, vol. 2024, no. 24, pp. 1–25, 2024

2024

[70] [70]

LibriV ox,

“LibriV ox,” https://librivox.org/, accessed: 08-02-2026 [online]

2026

[71] [71]

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech,

H. Zen, V . Dang, R. Clark, Y . Zhang, R. J. Weiss, Y . Jia, Z. Chen, and Y . Wu, “LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech,” inInterspeech 2019, 2019, pp. 1526–1530

2019

[72] [72]

The LJ Speech Dataset,

K. Ito and L. Johnson, “The LJ Speech Dataset,” https://keithito. com/LJ-Speech-Dataset/, 2017

2017

[73] [73]

Lib- rispeech: An ASR corpus based on public domain audio books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- rispeech: An ASR corpus based on public domain audio books,” in2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210

2015

[74] [74]

MLS: A Large-Scale Multilingual Dataset for Speech Research,

V . Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “MLS: A Large-Scale Multilingual Dataset for Speech Research,” inInterspeech 2020, 2020, pp. 2757–2761

2020

[75] [75]

AIShell-1: An Open-Source Mandarin Speech Corpus and A Speech Recogni- tion Baseline,

H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “AIShell-1: An Open-Source Mandarin Speech Corpus and A Speech Recogni- tion Baseline,” inOriental COCOSDA 2017, 2017

2017

[76] [76]

CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR V oice Cloning Toolkit,

J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR V oice Cloning Toolkit,” University of Edinburgh. The Centre for Speech Technology Research (CSTR), Tech. Rep., 2019. [Online]. Available: https://doi.org/10.7488/ds/2645

work page doi:10.7488/ds/2645 2019