arxiv: 2605.00251 · v1 · submitted 2026-04-30 · 💻 cs.SD · cs.CL· eess.AS

Recognition: unknown

Alethia: A Foundational Encoder for Voice Deepfakes

Yi Zhu , Brahmi Dwivedi , Jayaram Raghuram , Surya Koppisetti

Authors on Pith no claims yet

Pith reviewed 2026-05-09 19:25 UTC · model grok-4.3

classification 💻 cs.SD cs.CLeess.AS

keywords voice deepfake detectionaudio encoder pretrainingmasked embedding predictionflow matchingdeepfake localizationspeech foundation modelsgenerative pretrainingzero-shot generalization

0 comments

The pith

Alethia is the first foundational audio encoder for voice deepfake detection and localization, created by a new pretraining recipe of bottleneck masked embedding prediction combined with flow-matching spectrogram reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes shifting from finetuning existing speech models to developing a specialized pretraining approach for audio encoders focused on deepfake artifacts. It combines predicting masked embeddings from bottlenecks with reconstructing spectrograms using flow matching to train Alethia. Evaluations across five tasks and fifty-six datasets show this encoder outperforms current speech foundation models, particularly in robustness to perturbations and generalizing to new domains such as singing voices. This suggests that targeted pretraining can address the diminishing returns from finetuning alone. The results also indicate that using continuous embeddings and generative methods is key to capturing the subtle manipulations in deepfake audio.

Core claim

By pretraining on the combination of bottleneck masked embedding prediction and flow-matching based spectrogram reconstruction, Alethia becomes the first foundational audio encoder that can be applied to various voice deepfake detection and localization tasks, significantly outperforming state-of-the-art speech foundation models on 56 benchmark datasets across 5 tasks, with better robustness to real-world perturbations and zero-shot generalization to unseen domains like singing deepfakes.

What carries the argument

Bottleneck masked embedding prediction combined with flow-matching spectrogram reconstruction as the joint pretraining objective that captures deepfake artifacts in continuous representations.

Load-bearing premise

The particular pretraining combination of bottleneck masked embedding prediction with flow-matching spectrogram reconstruction is what allows better capture of deepfake artifacts than previous approaches.

What would settle it

If Alethia fails to outperform existing models on a new independent test set of voice deepfakes with perturbations, the superiority and generalization claims would not hold.

Figures

Figures reproduced from arXiv: 2605.00251 by Brahmi Dwivedi, Jayaram Raghuram, Surya Koppisetti, Yi Zhu.

**Figure 1.** Figure 1: Pretraining framework of Alethia. Alethia encoder receives masked waveform and projects it into layer-averaged bottleneck embeddings. These embeddings are then fed in parallel to (1) a bottleneck masked embedding prediction branch which predicts different teacher layer embeddings, and (2) a spectrogram reconstruction branch to predict the velocity field calculated between the target spectrogram and the d… view at source ↗

**Figure 2.** Figure 2: UMAP projections of pretrained model embeddings without any task-specific fine-tuning, colored by source labels for the ASVspoof5-ST test split, where S denotes the Silhouette Score. the WavLM-Large baseline. These results suggest that the bottleneck itself serves as a powerful architectural prior for the pretraining objective. 7. Conclusion We propose Alethia, the first foundational encoder that generali… view at source ↗

**Figure 3.** Figure 3: Masked embedding prediction error calculated at masked time steps. The global supervision leads to lower prediction error even when evaluated exclusively on masked frames. Classifier. The 3D embeddings extracted by the frontend are pooled along layer and time axes using average pooling. The output embeddings are then processed by a 2-layer MLP with 1280 and 16 input neurons, with a dropout of 0.5 for both … view at source ↗

**Figure 4.** Figure 4: Left: Spectrogram reconstruction loss measured by comparing groundtruth spectrogram with the spectrogram generated directly with a decoder (i.e., regression). Right: Flow-Matching (FM) velocity field prediction loss where the velocity prediction is conditioned on the decoder output. The regression method tends to overfit reconstruction on unmasked frames, while the FM method leads to similar loss across al… view at source ↗

**Figure 5.** Figure 5: Comparison between Alethia-Large and W2V-1B under the EXPANDED-AUG condition. Positive values suggest performance improvement [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗

**Figure 6.** Figure 6: Comparison between Alethia-Large and W2V-1B under the EXPANDED condition. Positive values suggest performance improvement. accuracy and 3.6% decrease in EER. A similar trend is seen with the EXPANDED condition, where Alethia-Large consistently outperforms W2V-1B on 36 out of 50 datasets, with up to 16.1% accuracy gain and 5.6% decrease in EER [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗

**Figure 7.** Figure 7: For ASVspoof5-ST test split: a) UMAP visualizations of pretrained embeddings b) final layer embeddings from finetuned frozen-backbone + MLP setup for ST. Points are colored by attack label, and S denotes the Silhouette Score [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗

read the original abstract

Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a novel recipe that combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. The outcome, Alethia, is the first foundational audio encoder for various voice deepfake detection and localization tasks. We evaluate on $5$ different tasks with $56$ benchmark datasets, and note Alethia significantly outperforms state-of-the-art SFMs with superior robustness to real-world perturbations and zero-shot generalization to unseen domains (e.g., singing deepfakes). We also demonstrate the limitation of discrete targets in masked token prediction, and show the importance of continuous embedding prediction and generative pretraining for capturing deepfake artifacts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Alethia introduces a pretraining recipe mixing masked embeddings and flow-matching for deepfake audio but the zero-shot claims depend on how the 56 datasets were chosen.

read the letter

Alethia tries to create a dedicated encoder for voice deepfakes through a specific pretraining combination, but the lack of detail on dataset curation makes the generalization results look less secure than claimed. The paper shifts attention to pretraining a new model rather than finetuning speech foundation models, using bottleneck masked embedding prediction paired with flow-matching for spectrogram reconstruction. This is presented as better suited for capturing deepfake artifacts because it works with continuous embeddings instead of discrete tokens. They show this by evaluating on five tasks using 56 datasets and report better performance and robustness, including zero-shot on unseen areas like singing voices. That pretraining idea is the actual new part, and testing the continuous versus discrete angle is a useful check. If the experiments include solid ablations and controls, it could help the field move past diminishing returns on finetuning. The main concern is around the 56 datasets. The abstract does not explain the selection process or confirm that the test sets were fixed before training. This opens the door to the post-hoc selection issue raised in the stress-test, where favorable datasets might have been included after seeing initial results. That would make the outperformance and zero-shot claims less convincing, as the same method might not hold on a wider or pre-registered set of perturbations and synthesis techniques. The paper would be stronger with explicit criteria for dataset inclusion and perhaps a separate validation suite. Overall, the work engages honestly with the limitations of current SFMs and proposes a practical alternative. It is aimed at audio deepfake researchers who need better base representations. A reader focused on detection models would find the recipe and the discrete target limitation discussion worth their time. I would send this to peer review. The core idea is worth referee input, and the large evaluation scale justifies the effort even if revisions are needed to tighten the experimental design.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce Alethia, the first foundational audio encoder designed specifically for voice deepfake detection and localization tasks. By combining bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction in pretraining, Alethia is said to capture deepfake artifacts more effectively than existing speech foundation models (SFMs). The evaluation spans 5 tasks and 56 benchmark datasets, with claims of significant outperformance, superior robustness to perturbations, and zero-shot generalization to unseen domains such as singing deepfakes. The work also argues against discrete targets in masked prediction in favor of continuous embeddings and generative pretraining.

Significance. Should the empirical findings be confirmed with rigorous, unbiased evaluation, this contribution would be significant for the field of audio forensics and deepfake detection. It provides a new direction by focusing on specialized pretraining rather than continued finetuning of general SFMs, potentially leading to more robust systems. The scale of the evaluation across 56 datasets is commendable and, if the selection is transparent and pre-specified, offers compelling evidence for improved generalization capabilities.

major comments (2)

The abstract asserts superior performance and generalization but supplies no experimental details, baselines, statistical tests, or ablation results; claims rest on unevaluated assertions. This is load-bearing as the performance numbers are the sole empirical support for both the 'first foundational encoder' and 'significantly outperforms' assertions.
§4 (Experiments): The evaluation on 56 datasets across 5 tasks does not specify pre-specified inclusion criteria or a protocol for avoiding post-hoc selection. This risks making the zero-shot generalization claim (e.g., to singing deepfakes) circular if datasets were curated after training to emphasize strengths of the bottleneck masked embedding prediction plus flow-matching combination.

minor comments (1)

The acronym 'SFMs' should be defined on first use as 'speech foundation models' for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. The comments highlight important aspects of transparency in our presentation of results and experimental design. We have revised the manuscript to address both major comments by expanding the abstract and adding explicit documentation of our dataset selection protocol. Our responses to each point are provided below.

read point-by-point responses

Referee: The abstract asserts superior performance and generalization but supplies no experimental details, baselines, statistical tests, or ablation results; claims rest on unevaluated assertions. This is load-bearing as the performance numbers are the sole empirical support for both the 'first foundational encoder' and 'significantly outperforms' assertions.

Authors: We agree that the abstract should provide sufficient context to substantiate its claims without requiring the reader to immediately consult the full text. In the revised manuscript we have updated the abstract to briefly reference the primary baselines (WavLM, HuBERT, and other SFMs), the evaluation scope (5 tasks across 56 datasets), and the use of statistical significance testing via paired t-tests over multiple random seeds. All supporting details, including complete baseline comparisons, ablation studies, and robustness analyses, remain in §4 and the appendix. This change makes the abstract more informative while preserving its concise nature. revision: yes
Referee: §4 (Experiments): The evaluation on 56 datasets across 5 tasks does not specify pre-specified inclusion criteria or a protocol for avoiding post-hoc selection. This risks making the zero-shot generalization claim (e.g., to singing deepfakes) circular if datasets were curated after training to emphasize strengths of the bottleneck masked embedding prediction plus flow-matching combination.

Authors: We recognize the validity of this concern regarding potential selection bias. The 56 datasets were assembled according to a protocol established prior to any training: all publicly released voice deepfake detection and localization corpora from major benchmarks (ASVspoof 2019/2021, WaveFake, FakeAVCeleb, and related sources) available at the start of the project, partitioned into the five task categories, with singing-voice deepfake datasets deliberately added to evaluate zero-shot generalization to unseen domains. We have inserted a new subsection in §4.1 that explicitly states this pre-specified inclusion criteria, lists all datasets with sources, and explains the rationale for the singing-voice test set. This documentation removes any ambiguity about post-hoc curation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims with no derivation chain

full rationale

The paper proposes a pretraining recipe (bottleneck masked embedding prediction combined with flow-matching spectrogram reconstruction) and reports empirical results on 5 tasks across 56 datasets, claiming superior performance and zero-shot generalization. No equations, derivations, or mathematical steps are present in the provided text. Claims do not reduce to self-definitions, fitted inputs renamed as predictions, or self-citation chains; the 'first foundational encoder' status is asserted based on experimental outcomes rather than tautological construction. Dataset selection is described without pre-specified criteria, but this is an experimental design issue rather than a circular reduction of any derivation to its inputs. The work is self-contained as an empirical contribution without load-bearing self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, new entities, or non-standard axioms; the work implicitly rests on standard self-supervised learning assumptions that masked prediction and generative reconstruction will surface deepfake artifacts.

axioms (1)

domain assumption Masked embedding prediction and flow-matching reconstruction are effective for learning representations that expose deepfake artifacts.
Central to the proposed pretraining recipe and its claimed superiority over discrete targets.

pith-pipeline@v0.9.0 · 5455 in / 1246 out tokens · 39972 ms · 2026-05-09T19:25:04.046151+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

60 extracted references · 34 canonical work pages · 2 internal anchors

[1]

V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985,

work page internal anchor Pith review arXiv
[2]

Transferring audio deepfake detection capabil- ity across languages

Ba, Z., Wen, Q., Cheng, P., Wang, Y ., Lin, F., Lu, L., and Liu, Z. Transferring audio deepfake detection capabil- ity across languages. InProceedings of the ACM Web Conference 2023, pp. 2033–2044,

2023
[3]

Learning by reconstruction produces uninformative features for perception.arXiv preprint arXiv:2402.11337, 2024

Balestriero, R. and LeCun, Y . Learning by reconstruction produces uninformative features for perception.arXiv preprint arXiv:2402.11337,

work page arXiv
[4]

Bhagtani, K., Yadav, A. K. S., Bestagini, P., and Delp, E. J. DiffSSD: A diffusion-based dataset for speech forensics. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

2025
[5]

Deepfake-Eval-2024: A multi-modal in-the-wild benchmark of deepfakes circulated in 2024.arXiv preprint arXiv:2503.02857, 2025

Chandra, N. A., Murtfeldt, R., Qiu, L., Karmakar, A., Lee, H., Tanumihardja, E., Farhat, K., Caffee, B., Paik, S., Lee, C., et al. Deepfake-eval-2024: A multi-modal in-the- wild benchmark of deepfakes circulated in 2024.arXiv preprint arXiv:2503.02857,

work page arXiv 2024
[6]

CoLLD: Contrastive layer-to-layer distil- lation for compressing multilingual pre-trained speech encoders

Chang, H.-J., Dong, N., Mavlyutov, R., Popuri, S., and Chung, Y .-A. CoLLD: Contrastive layer-to-layer distil- lation for compressing multilingual pre-trained speech encoders. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10801–10805. IEEE,

2024
[7]

Chang, H.-J., Bhati, S., Glass, J., and Liu, A. H. USAD: Universal speech and audio representation via distillation. arXiv preprint arXiv:2506.18843,

work page arXiv
[8]

Demir¨ors, M., Ozbayoglu, A

Accessed: 2026-01-22. Demir¨ors, M., Ozbayoglu, A. M., and Akg ¨un, T. Future- proofing multilingual fake speech detection. InCS & IT Conference Proceedings, volume

2026
[9]

Trident of poseidon: A generalized approach for detecting deepfake voices

Doan, T.-P., Dinh-Xuan, H., Ryu, T., Kim, I., Lee, W., Hong, K., and Jung, S. Trident of poseidon: A generalized approach for detecting deepfake voices. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 2222–2235,

2024
[10]

T., Manrique, R., and Nunes, B

Fl´orez, P. T., Manrique, R., and Nunes, B. P. HABLA: A dataset of latin american spanish accents for voice anti-spoofing. InProc. Interspeech, volume 2023, pp. 1963–1967,

2023
[11]

L., Garc´ıa-Perera, L

9 Alethia: A Foundational Encoder for Voice Deepfakes Garg, A., Cai, Z., Zhang, L., Xinyuan, H. L., Garc´ıa-Perera, L. P., Duh, K., Khudanpur, S., Wiesner, M., and Andrews, N. ShiftySpeech: A large-scale synthetic speech dataset with distribution shifts.arXiv preprint arXiv:2502.05674,

work page arXiv
[12]

Post- training for deepfake speech detection.arXiv preprint arXiv:2506.21090,

Ge, W., Wang, X., Liu, X., and Yamagishi, J. Post- training for deepfake speech detection.arXiv preprint arXiv:2506.21090,

work page arXiv
[13]

ReMASC: Realistic replay attack corpus for voice controlled systems.arXiv preprint arXiv:1904.03365,

Gong, Y ., Yang, J., Huber, J., MacKnight, M., and Poellabauer, C. ReMASC: Realistic replay attack corpus for voice controlled systems.arXiv preprint arXiv:1904.03365,

work page arXiv 1904
[14]

R., Pimentel, A., Avila, A

Guimar˜aes, H. R., Pimentel, A., Avila, A. R., Reza- gholizadeh, M., Chen, B., and Falk, T. H. Robustdis- tiller: Compressing universal speech representations for enhanced environment robustness. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

2023
[15]

R., Pimentel, A., Avila, A

Guimar˜aes, H. R., Pimentel, A., Avila, A. R., Reza- gholizadeh, M., Chen, B., and Falk, T. H. An efficient end-to-end approach to noise invariant speech features via multi-task learning.arXiv preprint arXiv:2403.08654,

work page arXiv
[16]

Wav2df-tsl: Two-stage learning with efficient pre-training and hierarchical experts fu- sion for robust audio deepfake detection.arXiv preprint arXiv:2509.04161,

Hao, Y ., Chen, Y ., Xu, M., Zhan, J., He, L., Fang, L., Fang, S., and Liu, L. Wav2df-tsl: Two-stage learning with efficient pre-training and hierarchical experts fu- sion for robust audio deepfake detection.arXiv preprint arXiv:2509.04161,

work page arXiv
[17]

Manipulated re- gions localization for partially deepfake audio: A survey

He, J., Yi, J., Tao, J., Zeng, S., and Gu, H. Manipulated re- gions localization for partially deepfake audio: A survey. arXiv preprint arXiv:2506.14396,

work page arXiv
[18]

Huzaifah, M., Lin, G., Liu, T., Sailor, H

URLhttps://arxiv.org/abs/2507.21463. Huzaifah, M., Lin, G., Liu, T., Sailor, H. B., Tan, K. M., Vangani, T. K., Wang, Q., Wong, J. H., Wu, J., Chen, N. F., et al. MERaLiON-SpeechEncoder: Towards a speech foundation model for singapore and beyond.arXiv preprint arXiv:2412.11538,

work page arXiv
[19]

UniCodec: Unified audio codec with single domain-adaptive codebook.arXiv preprint arXiv:2502.20067,

Jiang, Y ., Chen, Q., Ji, S., Xi, Y ., Wang, W., Zhang, C., Yue, X., Zhang, S., and Li, H. UniCodec: Unified audio codec with single domain-adaptive codebook.arXiv preprint arXiv:2502.20067,

work page arXiv
[20]

S., et al

Jung, J.-w., Wu, Y ., Wang, X., Kim, J.-H., Maiti, S., Mat- sunaga, Y ., Shim, H.-j., Tian, J., Evans, N., Chung, J. S., et al. SpoofCeleb: Speech deepfake detection and SASV in the wild.IEEE Open Journal of Signal Processing, 2025a. Jung, J.-W., Zhang, W., Maiti, S., Wu, Y ., Wang, X., KIM, J., Matsunaga, Y ., Um, S., Tian, J., Shim, H.-J., et al. The Te...

2025
[21]

Source tracing of audio deepfake systems.arXiv preprint arXiv:2407.08016,

Klein, N., Chen, T., Tak, H., Casal, R., and Khoury, E. Source tracing of audio deepfake systems.arXiv preprint arXiv:2407.08016,

work page arXiv
[22]

IndieFake dataset: A benchmark dataset for audio deepfake detection.arXiv preprint arXiv:2506.19014,

Kumar, A., Verma, K., and More, O. IndieFake dataset: A benchmark dataset for audio deepfake detection.arXiv preprint arXiv:2506.19014,

work page arXiv
[23]

A survey on speech deepfake detection.ACM Computing Surveys, 57 (7):1–38, 2025a

Li, M., Ahmadiadli, Y ., and Zhang, X.-P. A survey on speech deepfake detection.ACM Computing Surveys, 57 (7):1–38, 2025a. Li, X., Li, K., Zheng, Y ., Yan, C., Ji, X., and Xu, W. Safeear: Content privacy-preserving audio deepfake detection. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 3585–3599,

2024
[24]

Measuring the ro- bustness of audio deepfake detectors,

Li, X., Chen, P.-Y ., and Wei, W. Measuring the ro- bustness of audio deepfake detectors.arXiv preprint arXiv:2503.17577, 2025b. Li, Y ., Yuan, R., Zhang, G., Ma, Y ., Chen, X., Yin, H., Xiao, C., Lin, C., Ragni, A., Benetos, E., et al. Mert: Acoustic music understanding model with large-scale self- supervised training.arXiv preprint arXiv:2306.00107,

work page arXiv
[25]

Flow Matching for Generative Modeling

Lipman, Y ., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,

work page internal anchor Pith review Pith/arXiv arXiv
[26]

H., Le, M., Vyas, A., Shi, B., Tjandra, A., and Hsu, W.-N

10 Alethia: A Foundational Encoder for Voice Deepfakes Liu, A. H., Le, M., Vyas, A., Shi, B., Tjandra, A., and Hsu, W.-N. Generative pre-training for speech with flow matching.arXiv preprint arXiv:2310.16338,

work page arXiv
[27]

A., and Chng, E

Luong, H.-T., Li, H., Zhang, L., Lee, K. A., and Chng, E. S. Llamapartialspoof: An llm-driven fake speech dataset simulating disinformation generation. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

2025
[28]

R., Naini, A

Mahapatra, A., Ulgen, I. R., Naini, A. R., Busso, C., and Sisman, B. Can emotion fool anti-spoofing?arXiv preprint arXiv:2505.23962,

work page arXiv
[29]

Discrete audio tokens: More than a survey!arXiv preprint arXiv:2506.10274, 2025

Mousavi, P., Maimon, G., Moumen, A., Petermann, D., Shi, J., Wu, H., Yang, H., Kuznetsova, A., Ploujnikov, A., Marxer, R., et al. Discrete audio tokens: More than a survey!arXiv preprint arXiv:2506.10274,

work page arXiv
[30]

T., Pizzi, K., Wagner, A., and Sperl, P

M¨uller, N., Kawa, P., Choong, W.-H., Stan, A., Bukkapat- nam, A. T., Pizzi, K., Wagner, A., and Sperl, P. Replay attacks against audio deepfake detection.arXiv preprint arXiv:2505.14862,

work page arXiv
[31]

M., Dieckmann, F., Czempin, P., Canals, R., B¨ottinger, K., and Williams, J

M¨uller, N. M., Dieckmann, F., Czempin, P., Canals, R., B¨ottinger, K., and Williams, J. Speech is silver, silence is golden: What do ASVspoof-trained models really learn? arXiv preprint arXiv:2106.12914,

work page arXiv
[32]

H., Vest- man, V ., Todisco, M., Delgado, H., Sahidullah, M., Yam- agishi, J., and Lee, K

Nautsch, A., Wang, X., Evans, N., Kinnunen, T. H., Vest- man, V ., Todisco, M., Delgado, H., Sahidullah, M., Yam- agishi, J., and Lee, K. A. ASVspoof 2019: Spoofing coun- termeasures for the detection of synthesized, converted and replayed speech.IEEE Transactions on Biometrics, Behavior, and Identity Science, 3(2):252–265,

2019
[33]

Accessed: 2026-01-22. Park, K. and Mulc, T. CSS10: A collection of single speaker speech datasets for 10 languages. InProc. Interspeech 2019, pp. 1566–1570,

2026
[34]

V oxCeleb: A Large-Scale Speaker Identification Dataset

doi: 10.21437/Interspeech. 2019-1500. Pasad, A., Chou, J.-C., and Livescu, K. Layer-wise analysis of a self-supervised speech representation model. In2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 914–921. IEEE,

work page doi:10.21437/interspeech 2019
[35]

SRC4VC: Smartphone-recorded corpus for voice conversion bench- mark.arXiv preprint arXiv:2406.07254,

Saito, Y ., Igarashi, T., Seki, K., Takamichi, S., Yamamoto, R., Tachibana, K., and Saruwatari, H. SRC4VC: Smartphone-recorded corpus for voice conversion bench- mark.arXiv preprint arXiv:2406.07254,

work page arXiv
[36]

JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis

Sonobe, R., Takamichi, S., and Saruwatari, H. JSUT corpus: free large-scale japanese speech corpus for end-to-end speech synthesis.arXiv preprint arXiv:1711.00354,

work page Pith review arXiv
[37]

Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing

Tak, H., Kamble, M., Patino, J., Todisco, M., and Evans, N. Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing. InICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6382–6386. IEEE,

2022
[38]

JMD: Japanese multi-dialect corpus for speech synthesis

11 Alethia: A Foundational Encoder for Voice Deepfakes Takamichi, S. JMD: Japanese multi-dialect corpus for speech synthesis. https://sites.google.com/ site/shinnosuketakamichi/publication/ research-topics/jmd_corpus, 2021a. Ac- cessed: 2026-01-22. Takamichi, S. Tri-jek: Japanese-English-Korean tri-lingual speech corpus. https://sites. google.com/site/shi...

2026
[39]

Takamichi, S., Komachi, M., Tanji, N., and Saruwatari, H

Accessed: 2026-01-22. Takamichi, S., Komachi, M., Tanji, N., and Saruwatari, H. JSSS: free japanese speech corpus for summarization and simplification.arXiv preprint arXiv:2010.01793, 2020a. Takamichi, S., Sonobe, R., Mitsui, K., Saito, Y ., Koriyama, T., Tanji, N., and Saruwatari, H. JSUT and JVS: Free japanese voice corpora for accelerating speech synth...

work page arXiv 2026
[40]

ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,

Wang, X., Delgado, H., Tak, H., Jung, J.-w., Shim, H.- j., Todisco, M., Kukanov, I., Liu, X., Sahidullah, M., Kinnunen, T., et al. ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale.arXiv preprint arXiv:2408.08739,

work page arXiv
[41]

Codecfake: Enhanc- ing anti-spoofing models against deepfake audios from codec-based speech synthesis systems.arXiv preprint arXiv:2406.07237,

Wu, H., Tseng, Y ., and Lee, H.-y. Codecfake: Enhanc- ing anti-spoofing models against deepfake audios from codec-based speech synthesis systems.arXiv preprint arXiv:2406.07237,

work page arXiv
[42]

The codecfake dataset and countermeasures for the universally detection of deepfake audio.IEEE Transactions on Audio, Speech and Language Processing, 2025a

Xie, Y ., Lu, Y ., Fu, R., Wen, Z., Wang, Z., Tao, J., Qi, X., Wang, X., Liu, Y ., Cheng, H., et al. The codecfake dataset and countermeasures for the universally detection of deepfake audio.IEEE Transactions on Audio, Speech and Language Processing, 2025a. Xie, Y ., Wang, X., Wang, Z., Fu, R., Wen, Z., Cao, S., Ma, L., Li, C., Cheng, H., and Ye, L. Neura...

work page arXiv
[43]

A., Kinnunen, T., Evans, N., et al

Yamagishi, J., Wang, X., Todisco, M., Sahidullah, M., Patino, J., Nautsch, A., Liu, X., Lee, K. A., Kinnunen, T., Evans, N., et al. ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection.arXiv preprint arXiv:2109.00537,

work page arXiv 2021
[44]

Mscenespeech: A multi-scene speech dataset for expressive speech synthesis.arXiv preprint arXiv:2407.14006,

Yang, Q., Zuo, J., Su, Z., Jiang, Z., Li, M., Zhao, Z., Chen, F., Wang, Z., and Huai, B. Mscenespeech: A multi-scene speech dataset for expressive speech synthesis.arXiv preprint arXiv:2407.14006,

work page arXiv
[45]

J., Lakho- tia, K., Lin, Y

Yang, S.-w., Chi, P.-H., Chuang, Y .-S., Lai, C.-I. J., Lakho- tia, K., Lin, Y . Y ., Liu, A. T., Shi, J., Chang, X., Lin, G.-T., et al. Superb: Speech processing universal performance benchmark.arXiv preprint arXiv:2105.01051,

work page arXiv
[46]

SPEAR: A unified ssl framework for learning speech and audio representations.arXiv preprint arXiv:2510.25955,

Yang, X., Yang, Y ., Jin, Z., Cui, Z., Wu, W., Li, B., Zhang, C., and Woodland, P. SPEAR: A unified ssl framework for learning speech and audio representations.arXiv preprint arXiv:2510.25955,

work page arXiv
[47]

Half-truth: A partially fake audio detection dataset.arXiv preprint arXiv:2104.03617,

Yi, J., Bai, Y ., Tao, J., Ma, H., Tian, Z., Wang, C., Wang, T., and Fu, R. Half-truth: A partially fake audio detection dataset.arXiv preprint arXiv:2104.03617,

work page arXiv
[48]

Singfake: Singing voice deepfake detection

Zang, Y ., Zhang, Y ., Heydari, M., and Duan, Z. Singfake: Singing voice deepfake detection. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 12156–12160. IEEE,

2024
[49]

Audio deepfake detection: What has been achieved and what lies ahead.Sensors (Basel, Switzerland), 25(7):1989,

Zhang, B., Cui, H., Nguyen, V ., and Whitty, M. Audio deepfake detection: What has been achieved and what lies ahead.Sensors (Basel, Switzerland), 25(7):1989,

1989
[50]

SVDD 2024: The inaugural singing voice deepfake detection challenge

Zhang, Y ., Zang, Y ., Shi, J., Yamamoto, R., Toda, T., and Duan, Z. SVDD 2024: The inaugural singing voice deepfake detection challenge. InProc. IEEE Spoken Language Technology Workshop (SLT),

2024
[51]

R., Pimentel, A., and Falk, T

Zhu, Y ., Guimar ˜aes, H. R., Pimentel, A., and Falk, T. AUDDT: Audio unified deepfake detection benchmark toolkit.arXiv preprint arXiv:2509.21597,

work page arXiv
[52]

13 Alethia: A Foundational Encoder for Voice Deepfakes A. POC Experiment Setup We follow a similar setup as the base versions of Wave2vec2 and HuBERT, where the pretraining data are identical to the downstream finetuning data (Baevski et al., 2020; Hsu et al., 2021). We curate 1k hours of real and fake speech, where the real speech data are sourced from C...

2020
[53]

all-step

B. Alethia Architecture and Pretraining Details Table 10 summarizes the architectural hyperparameters of Alethia-Base and Alethia-Large. One thing to note is that while the teacher model WavLM-Large adopts PostLN, the student Alethia-Base uses PreLN, with which we found better convergence. Regarding pretraining setup, we utilize the AdamW optimizer with a...

2020
[54]

Other Tasks PFSL.For dataset configuration and model inference, we adapt the framework provided by Luong et al

E.2. Other Tasks PFSL.For dataset configuration and model inference, we adapt the framework provided by Luong et al. (2025).1. While the original framework supports multi-scale evaluation across six temporal resolutions ( units∈ {0.02,0.04,0.08,0.16,0.32,0.64} seconds), we evaluate exclusively at the 20ms scale (units= 0.02 ). This choice is 1https://gith...

2025
[55]

A VDD.For this task, we adapt code from the FakeA VCeleb repository2

is included in the pretraining of Alethia, which could lead to biased evaluation. A VDD.For this task, we adapt code from the FakeA VCeleb repository2. We segmented the videos into 3 second chunks to handle variable durations, and treat the chunks as independent samples while calculating metrics. The audio stream is extracted from the videos and resampled...

2021
[56]

For all experimental conditions, we performed quality control and class balancing for each training dataset. While this significantly reduces the total training volume, it helps to avoid the effects of confounding factors, such as prolonged silence and class imbalance, which may otherwise lead to spurious correlations (M¨uller et al., 2021). For the other...

2021
[57]

preprocessed

Datasets subjected to our quality control pipeline are denoted with the “preprocessed” suffix, whereas those labeled “raw” remain in their original state. To facilitate a direct comparison with prior literature, we specifically utilize the raw versions of standard benchmarks, including the ASVspoof series (Nautsch et al., 2021; Yamagishi et al., 2021; Wan...

2021
[58]

37,314 Deepfake eval 2024 preprocessed (Chandra et al.,

2024
[59]

4,700 KSS preprocessed (Park, 2019; Park & Mulc,

2019
[60]

Other Experimental Results G.1

98,411 G. Other Experimental Results G.1. SDD Model Comparison Table 16 and 17 provide per-dataset EER and accuracy of Alethia-Large and W1V-1B along with their performance difference under the EXPANDED-AUGand EXPANDEDcondition, respectively. For both conditions, Alethia-large shows better performance with significantly fewer datasets with accuracies belo...

1934