pith. machine review for the scientific record. sign in

arxiv: 2605.00251 · v1 · submitted 2026-04-30 · 💻 cs.SD · cs.CL· eess.AS

Recognition: unknown

Alethia: A Foundational Encoder for Voice Deepfakes

Authors on Pith no claims yet

Pith reviewed 2026-05-09 19:25 UTC · model grok-4.3

classification 💻 cs.SD cs.CLeess.AS
keywords voice deepfake detectionaudio encoder pretrainingmasked embedding predictionflow matchingdeepfake localizationspeech foundation modelsgenerative pretrainingzero-shot generalization
0
0 comments X

The pith

Alethia is the first foundational audio encoder for voice deepfake detection and localization, created by a new pretraining recipe of bottleneck masked embedding prediction combined with flow-matching spectrogram reconstruction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes shifting from finetuning existing speech models to developing a specialized pretraining approach for audio encoders focused on deepfake artifacts. It combines predicting masked embeddings from bottlenecks with reconstructing spectrograms using flow matching to train Alethia. Evaluations across five tasks and fifty-six datasets show this encoder outperforms current speech foundation models, particularly in robustness to perturbations and generalizing to new domains such as singing voices. This suggests that targeted pretraining can address the diminishing returns from finetuning alone. The results also indicate that using continuous embeddings and generative methods is key to capturing the subtle manipulations in deepfake audio.

Core claim

By pretraining on the combination of bottleneck masked embedding prediction and flow-matching based spectrogram reconstruction, Alethia becomes the first foundational audio encoder that can be applied to various voice deepfake detection and localization tasks, significantly outperforming state-of-the-art speech foundation models on 56 benchmark datasets across 5 tasks, with better robustness to real-world perturbations and zero-shot generalization to unseen domains like singing deepfakes.

What carries the argument

Bottleneck masked embedding prediction combined with flow-matching spectrogram reconstruction as the joint pretraining objective that captures deepfake artifacts in continuous representations.

Load-bearing premise

The particular pretraining combination of bottleneck masked embedding prediction with flow-matching spectrogram reconstruction is what allows better capture of deepfake artifacts than previous approaches.

What would settle it

If Alethia fails to outperform existing models on a new independent test set of voice deepfakes with perturbations, the superiority and generalization claims would not hold.

Figures

Figures reproduced from arXiv: 2605.00251 by Brahmi Dwivedi, Jayaram Raghuram, Surya Koppisetti, Yi Zhu.

Figure 1
Figure 1. Figure 1: Pretraining framework of Alethia. Alethia encoder receives masked waveform and projects it into layer-averaged bot￾tleneck embeddings. These embeddings are then fed in parallel to (1) a bottleneck masked embedding prediction branch which predicts different teacher layer embeddings, and (2) a spectrogram reconstruction branch to predict the velocity field calculated be￾tween the target spectrogram and the d… view at source ↗
Figure 2
Figure 2. Figure 2: UMAP projections of pretrained model embeddings without any task-specific fine-tuning, colored by source labels for the ASVspoof5-ST test split, where S denotes the Silhouette Score. the WavLM-Large baseline. These results suggest that the bottleneck itself serves as a powerful architectural prior for the pretraining objective. 7. Conclusion We propose Alethia, the first foundational encoder that gen￾erali… view at source ↗
Figure 3
Figure 3. Figure 3: Masked embedding prediction error calculated at masked time steps. The global supervision leads to lower prediction error even when evaluated exclusively on masked frames. Classifier. The 3D embeddings extracted by the frontend are pooled along layer and time axes using average pooling. The output embeddings are then processed by a 2-layer MLP with 1280 and 16 input neurons, with a dropout of 0.5 for both … view at source ↗
Figure 4
Figure 4. Figure 4: Left: Spectrogram reconstruction loss measured by comparing groundtruth spectrogram with the spectrogram generated directly with a decoder (i.e., regression). Right: Flow-Matching (FM) velocity field prediction loss where the velocity prediction is conditioned on the decoder output. The regression method tends to overfit reconstruction on unmasked frames, while the FM method leads to similar loss across al… view at source ↗
Figure 5
Figure 5. Figure 5: Comparison between Alethia-Large and W2V-1B under the EXPANDED-AUG condition. Positive values suggest performance improvement [PITH_FULL_IMAGE:figures/full_fig_p020_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Comparison between Alethia-Large and W2V-1B under the EXPANDED condition. Positive values suggest performance improvement. accuracy and 3.6% decrease in EER. A similar trend is seen with the EXPANDED condition, where Alethia-Large consistently outperforms W2V-1B on 36 out of 50 datasets, with up to 16.1% accuracy gain and 5.6% decrease in EER [PITH_FULL_IMAGE:figures/full_fig_p020_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: For ASVspoof5-ST test split: a) UMAP visualizations of pretrained embeddings b) final layer embeddings from finetuned frozen-backbone + MLP setup for ST. Points are colored by attack label, and S denotes the Silhouette Score [PITH_FULL_IMAGE:figures/full_fig_p023_7.png] view at source ↗
read the original abstract

Existing voice deepfake detection and localization models rely heavily on representations extracted from speech foundation models (SFMs). However, downstream finetuning has now reached a state of diminishing returns. In this paper, we shift the focus to pretraining and propose a novel recipe that combines bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction. The outcome, Alethia, is the first foundational audio encoder for various voice deepfake detection and localization tasks. We evaluate on $5$ different tasks with $56$ benchmark datasets, and note Alethia significantly outperforms state-of-the-art SFMs with superior robustness to real-world perturbations and zero-shot generalization to unseen domains (e.g., singing deepfakes). We also demonstrate the limitation of discrete targets in masked token prediction, and show the importance of continuous embedding prediction and generative pretraining for capturing deepfake artifacts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce Alethia, the first foundational audio encoder designed specifically for voice deepfake detection and localization tasks. By combining bottleneck masked embedding prediction with flow-matching based spectrogram reconstruction in pretraining, Alethia is said to capture deepfake artifacts more effectively than existing speech foundation models (SFMs). The evaluation spans 5 tasks and 56 benchmark datasets, with claims of significant outperformance, superior robustness to perturbations, and zero-shot generalization to unseen domains such as singing deepfakes. The work also argues against discrete targets in masked prediction in favor of continuous embeddings and generative pretraining.

Significance. Should the empirical findings be confirmed with rigorous, unbiased evaluation, this contribution would be significant for the field of audio forensics and deepfake detection. It provides a new direction by focusing on specialized pretraining rather than continued finetuning of general SFMs, potentially leading to more robust systems. The scale of the evaluation across 56 datasets is commendable and, if the selection is transparent and pre-specified, offers compelling evidence for improved generalization capabilities.

major comments (2)
  1. The abstract asserts superior performance and generalization but supplies no experimental details, baselines, statistical tests, or ablation results; claims rest on unevaluated assertions. This is load-bearing as the performance numbers are the sole empirical support for both the 'first foundational encoder' and 'significantly outperforms' assertions.
  2. §4 (Experiments): The evaluation on 56 datasets across 5 tasks does not specify pre-specified inclusion criteria or a protocol for avoiding post-hoc selection. This risks making the zero-shot generalization claim (e.g., to singing deepfakes) circular if datasets were curated after training to emphasize strengths of the bottleneck masked embedding prediction plus flow-matching combination.
minor comments (1)
  1. The acronym 'SFMs' should be defined on first use as 'speech foundation models' for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed and constructive feedback. The comments highlight important aspects of transparency in our presentation of results and experimental design. We have revised the manuscript to address both major comments by expanding the abstract and adding explicit documentation of our dataset selection protocol. Our responses to each point are provided below.

read point-by-point responses
  1. Referee: The abstract asserts superior performance and generalization but supplies no experimental details, baselines, statistical tests, or ablation results; claims rest on unevaluated assertions. This is load-bearing as the performance numbers are the sole empirical support for both the 'first foundational encoder' and 'significantly outperforms' assertions.

    Authors: We agree that the abstract should provide sufficient context to substantiate its claims without requiring the reader to immediately consult the full text. In the revised manuscript we have updated the abstract to briefly reference the primary baselines (WavLM, HuBERT, and other SFMs), the evaluation scope (5 tasks across 56 datasets), and the use of statistical significance testing via paired t-tests over multiple random seeds. All supporting details, including complete baseline comparisons, ablation studies, and robustness analyses, remain in §4 and the appendix. This change makes the abstract more informative while preserving its concise nature. revision: yes

  2. Referee: §4 (Experiments): The evaluation on 56 datasets across 5 tasks does not specify pre-specified inclusion criteria or a protocol for avoiding post-hoc selection. This risks making the zero-shot generalization claim (e.g., to singing deepfakes) circular if datasets were curated after training to emphasize strengths of the bottleneck masked embedding prediction plus flow-matching combination.

    Authors: We recognize the validity of this concern regarding potential selection bias. The 56 datasets were assembled according to a protocol established prior to any training: all publicly released voice deepfake detection and localization corpora from major benchmarks (ASVspoof 2019/2021, WaveFake, FakeAVCeleb, and related sources) available at the start of the project, partitioned into the five task categories, with singing-voice deepfake datasets deliberately added to evaluate zero-shot generalization to unseen domains. We have inserted a new subsection in §4.1 that explicitly states this pre-specified inclusion criteria, lists all datasets with sources, and explains the rationale for the singing-voice test set. This documentation removes any ambiguity about post-hoc curation. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims with no derivation chain

full rationale

The paper proposes a pretraining recipe (bottleneck masked embedding prediction combined with flow-matching spectrogram reconstruction) and reports empirical results on 5 tasks across 56 datasets, claiming superior performance and zero-shot generalization. No equations, derivations, or mathematical steps are present in the provided text. Claims do not reduce to self-definitions, fitted inputs renamed as predictions, or self-citation chains; the 'first foundational encoder' status is asserted based on experimental outcomes rather than tautological construction. Dataset selection is described without pre-specified criteria, but this is an experimental design issue rather than a circular reduction of any derivation to its inputs. The work is self-contained as an empirical contribution without load-bearing self-referential logic.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, new entities, or non-standard axioms; the work implicitly rests on standard self-supervised learning assumptions that masked prediction and generative reconstruction will surface deepfake artifacts.

axioms (1)
  • domain assumption Masked embedding prediction and flow-matching reconstruction are effective for learning representations that expose deepfake artifacts.
    Central to the proposed pretraining recipe and its claimed superiority over discrete targets.

pith-pipeline@v0.9.0 · 5455 in / 1246 out tokens · 39972 ms · 2026-05-09T19:25:04.046151+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 34 canonical work pages · 2 internal anchors

  1. [1]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    Assran, M., Bardes, A., Fan, D., Garrido, Q., Howes, R., Muckley, M., Rizvi, A., Roberts, C., Sinha, K., Zholus, A., et al. V-jepa 2: Self-supervised video models enable understanding, prediction and planning.arXiv preprint arXiv:2506.09985,

  2. [2]

    Transferring audio deepfake detection capabil- ity across languages

    Ba, Z., Wen, Q., Cheng, P., Wang, Y ., Lin, F., Lu, L., and Liu, Z. Transferring audio deepfake detection capabil- ity across languages. InProceedings of the ACM Web Conference 2023, pp. 2033–2044,

  3. [3]

    Learning by reconstruction produces uninformative features for perception.arXiv preprint arXiv:2402.11337, 2024

    Balestriero, R. and LeCun, Y . Learning by reconstruction produces uninformative features for perception.arXiv preprint arXiv:2402.11337,

  4. [4]

    Bhagtani, K., Yadav, A. K. S., Bestagini, P., and Delp, E. J. DiffSSD: A diffusion-based dataset for speech forensics. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

  5. [5]

    Deepfake-Eval-2024: A multi-modal in-the-wild benchmark of deepfakes circulated in 2024.arXiv preprint arXiv:2503.02857, 2025

    Chandra, N. A., Murtfeldt, R., Qiu, L., Karmakar, A., Lee, H., Tanumihardja, E., Farhat, K., Caffee, B., Paik, S., Lee, C., et al. Deepfake-eval-2024: A multi-modal in-the- wild benchmark of deepfakes circulated in 2024.arXiv preprint arXiv:2503.02857,

  6. [6]

    CoLLD: Contrastive layer-to-layer distil- lation for compressing multilingual pre-trained speech encoders

    Chang, H.-J., Dong, N., Mavlyutov, R., Popuri, S., and Chung, Y .-A. CoLLD: Contrastive layer-to-layer distil- lation for compressing multilingual pre-trained speech encoders. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 10801–10805. IEEE,

  7. [7]

    Chang, H.-J., Bhati, S., Glass, J., and Liu, A. H. USAD: Universal speech and audio representation via distillation. arXiv preprint arXiv:2506.18843,

  8. [8]

    Demir¨ors, M., Ozbayoglu, A

    Accessed: 2026-01-22. Demir¨ors, M., Ozbayoglu, A. M., and Akg ¨un, T. Future- proofing multilingual fake speech detection. InCS & IT Conference Proceedings, volume

  9. [9]

    Trident of poseidon: A generalized approach for detecting deepfake voices

    Doan, T.-P., Dinh-Xuan, H., Ryu, T., Kim, I., Lee, W., Hong, K., and Jung, S. Trident of poseidon: A generalized approach for detecting deepfake voices. InProceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 2222–2235,

  10. [10]

    T., Manrique, R., and Nunes, B

    Fl´orez, P. T., Manrique, R., and Nunes, B. P. HABLA: A dataset of latin american spanish accents for voice anti-spoofing. InProc. Interspeech, volume 2023, pp. 1963–1967,

  11. [11]

    L., Garc´ıa-Perera, L

    9 Alethia: A Foundational Encoder for Voice Deepfakes Garg, A., Cai, Z., Zhang, L., Xinyuan, H. L., Garc´ıa-Perera, L. P., Duh, K., Khudanpur, S., Wiesner, M., and Andrews, N. ShiftySpeech: A large-scale synthetic speech dataset with distribution shifts.arXiv preprint arXiv:2502.05674,

  12. [12]

    Post- training for deepfake speech detection.arXiv preprint arXiv:2506.21090,

    Ge, W., Wang, X., Liu, X., and Yamagishi, J. Post- training for deepfake speech detection.arXiv preprint arXiv:2506.21090,

  13. [13]

    ReMASC: Realistic replay attack corpus for voice controlled systems.arXiv preprint arXiv:1904.03365,

    Gong, Y ., Yang, J., Huber, J., MacKnight, M., and Poellabauer, C. ReMASC: Realistic replay attack corpus for voice controlled systems.arXiv preprint arXiv:1904.03365,

  14. [14]

    R., Pimentel, A., Avila, A

    Guimar˜aes, H. R., Pimentel, A., Avila, A. R., Reza- gholizadeh, M., Chen, B., and Falk, T. H. Robustdis- tiller: Compressing universal speech representations for enhanced environment robustness. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

  15. [15]

    R., Pimentel, A., Avila, A

    Guimar˜aes, H. R., Pimentel, A., Avila, A. R., Reza- gholizadeh, M., Chen, B., and Falk, T. H. An efficient end-to-end approach to noise invariant speech features via multi-task learning.arXiv preprint arXiv:2403.08654,

  16. [16]

    Wav2df-tsl: Two-stage learning with efficient pre-training and hierarchical experts fu- sion for robust audio deepfake detection.arXiv preprint arXiv:2509.04161,

    Hao, Y ., Chen, Y ., Xu, M., Zhan, J., He, L., Fang, L., Fang, S., and Liu, L. Wav2df-tsl: Two-stage learning with efficient pre-training and hierarchical experts fu- sion for robust audio deepfake detection.arXiv preprint arXiv:2509.04161,

  17. [17]

    Manipulated re- gions localization for partially deepfake audio: A survey

    He, J., Yi, J., Tao, J., Zeng, S., and Gu, H. Manipulated re- gions localization for partially deepfake audio: A survey. arXiv preprint arXiv:2506.14396,

  18. [18]

    Huzaifah, M., Lin, G., Liu, T., Sailor, H

    URLhttps://arxiv.org/abs/2507.21463. Huzaifah, M., Lin, G., Liu, T., Sailor, H. B., Tan, K. M., Vangani, T. K., Wang, Q., Wong, J. H., Wu, J., Chen, N. F., et al. MERaLiON-SpeechEncoder: Towards a speech foundation model for singapore and beyond.arXiv preprint arXiv:2412.11538,

  19. [19]

    UniCodec: Unified audio codec with single domain-adaptive codebook.arXiv preprint arXiv:2502.20067,

    Jiang, Y ., Chen, Q., Ji, S., Xi, Y ., Wang, W., Zhang, C., Yue, X., Zhang, S., and Li, H. UniCodec: Unified audio codec with single domain-adaptive codebook.arXiv preprint arXiv:2502.20067,

  20. [20]

    S., et al

    Jung, J.-w., Wu, Y ., Wang, X., Kim, J.-H., Maiti, S., Mat- sunaga, Y ., Shim, H.-j., Tian, J., Evans, N., Chung, J. S., et al. SpoofCeleb: Speech deepfake detection and SASV in the wild.IEEE Open Journal of Signal Processing, 2025a. Jung, J.-W., Zhang, W., Maiti, S., Wu, Y ., Wang, X., KIM, J., Matsunaga, Y ., Um, S., Tian, J., Shim, H.-J., et al. The Te...

  21. [21]

    Source tracing of audio deepfake systems.arXiv preprint arXiv:2407.08016,

    Klein, N., Chen, T., Tak, H., Casal, R., and Khoury, E. Source tracing of audio deepfake systems.arXiv preprint arXiv:2407.08016,

  22. [22]

    IndieFake dataset: A benchmark dataset for audio deepfake detection.arXiv preprint arXiv:2506.19014,

    Kumar, A., Verma, K., and More, O. IndieFake dataset: A benchmark dataset for audio deepfake detection.arXiv preprint arXiv:2506.19014,

  23. [23]

    A survey on speech deepfake detection.ACM Computing Surveys, 57 (7):1–38, 2025a

    Li, M., Ahmadiadli, Y ., and Zhang, X.-P. A survey on speech deepfake detection.ACM Computing Surveys, 57 (7):1–38, 2025a. Li, X., Li, K., Zheng, Y ., Yan, C., Ji, X., and Xu, W. Safeear: Content privacy-preserving audio deepfake detection. In Proceedings of the 2024 on ACM SIGSAC Conference on Computer and Communications Security, pp. 3585–3599,

  24. [24]

    Measuring the ro- bustness of audio deepfake detectors,

    Li, X., Chen, P.-Y ., and Wei, W. Measuring the ro- bustness of audio deepfake detectors.arXiv preprint arXiv:2503.17577, 2025b. Li, Y ., Yuan, R., Zhang, G., Ma, Y ., Chen, X., Yin, H., Xiao, C., Lin, C., Ragni, A., Benetos, E., et al. Mert: Acoustic music understanding model with large-scale self- supervised training.arXiv preprint arXiv:2306.00107,

  25. [25]

    Flow Matching for Generative Modeling

    Lipman, Y ., Chen, R. T., Ben-Hamu, H., Nickel, M., and Le, M. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,

  26. [26]

    H., Le, M., Vyas, A., Shi, B., Tjandra, A., and Hsu, W.-N

    10 Alethia: A Foundational Encoder for Voice Deepfakes Liu, A. H., Le, M., Vyas, A., Shi, B., Tjandra, A., and Hsu, W.-N. Generative pre-training for speech with flow matching.arXiv preprint arXiv:2310.16338,

  27. [27]

    A., and Chng, E

    Luong, H.-T., Li, H., Zhang, L., Lee, K. A., and Chng, E. S. Llamapartialspoof: An llm-driven fake speech dataset simulating disinformation generation. InICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 1–5. IEEE,

  28. [28]

    R., Naini, A

    Mahapatra, A., Ulgen, I. R., Naini, A. R., Busso, C., and Sisman, B. Can emotion fool anti-spoofing?arXiv preprint arXiv:2505.23962,

  29. [29]

    Discrete audio tokens: More than a survey!arXiv preprint arXiv:2506.10274, 2025

    Mousavi, P., Maimon, G., Moumen, A., Petermann, D., Shi, J., Wu, H., Yang, H., Kuznetsova, A., Ploujnikov, A., Marxer, R., et al. Discrete audio tokens: More than a survey!arXiv preprint arXiv:2506.10274,

  30. [30]

    T., Pizzi, K., Wagner, A., and Sperl, P

    M¨uller, N., Kawa, P., Choong, W.-H., Stan, A., Bukkapat- nam, A. T., Pizzi, K., Wagner, A., and Sperl, P. Replay attacks against audio deepfake detection.arXiv preprint arXiv:2505.14862,

  31. [31]

    M., Dieckmann, F., Czempin, P., Canals, R., B¨ottinger, K., and Williams, J

    M¨uller, N. M., Dieckmann, F., Czempin, P., Canals, R., B¨ottinger, K., and Williams, J. Speech is silver, silence is golden: What do ASVspoof-trained models really learn? arXiv preprint arXiv:2106.12914,

  32. [32]

    H., Vest- man, V ., Todisco, M., Delgado, H., Sahidullah, M., Yam- agishi, J., and Lee, K

    Nautsch, A., Wang, X., Evans, N., Kinnunen, T. H., Vest- man, V ., Todisco, M., Delgado, H., Sahidullah, M., Yam- agishi, J., and Lee, K. A. ASVspoof 2019: Spoofing coun- termeasures for the detection of synthesized, converted and replayed speech.IEEE Transactions on Biometrics, Behavior, and Identity Science, 3(2):252–265,

  33. [33]

    Accessed: 2026-01-22. Park, K. and Mulc, T. CSS10: A collection of single speaker speech datasets for 10 languages. InProc. Interspeech 2019, pp. 1566–1570,

  34. [34]

    V oxCeleb: A Large-Scale Speaker Identification Dataset

    doi: 10.21437/Interspeech. 2019-1500. Pasad, A., Chou, J.-C., and Livescu, K. Layer-wise analysis of a self-supervised speech representation model. In2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 914–921. IEEE,

  35. [35]

    SRC4VC: Smartphone-recorded corpus for voice conversion bench- mark.arXiv preprint arXiv:2406.07254,

    Saito, Y ., Igarashi, T., Seki, K., Takamichi, S., Yamamoto, R., Tachibana, K., and Saruwatari, H. SRC4VC: Smartphone-recorded corpus for voice conversion bench- mark.arXiv preprint arXiv:2406.07254,

  36. [36]

    JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis

    Sonobe, R., Takamichi, S., and Saruwatari, H. JSUT corpus: free large-scale japanese speech corpus for end-to-end speech synthesis.arXiv preprint arXiv:1711.00354,

  37. [37]

    Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing

    Tak, H., Kamble, M., Patino, J., Todisco, M., and Evans, N. Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing. InICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6382–6386. IEEE,

  38. [38]

    JMD: Japanese multi-dialect corpus for speech synthesis

    11 Alethia: A Foundational Encoder for Voice Deepfakes Takamichi, S. JMD: Japanese multi-dialect corpus for speech synthesis. https://sites.google.com/ site/shinnosuketakamichi/publication/ research-topics/jmd_corpus, 2021a. Ac- cessed: 2026-01-22. Takamichi, S. Tri-jek: Japanese-English-Korean tri-lingual speech corpus. https://sites. google.com/site/shi...

  39. [39]

    Takamichi, S., Komachi, M., Tanji, N., and Saruwatari, H

    Accessed: 2026-01-22. Takamichi, S., Komachi, M., Tanji, N., and Saruwatari, H. JSSS: free japanese speech corpus for summarization and simplification.arXiv preprint arXiv:2010.01793, 2020a. Takamichi, S., Sonobe, R., Mitsui, K., Saito, Y ., Koriyama, T., Tanji, N., and Saruwatari, H. JSUT and JVS: Free japanese voice corpora for accelerating speech synth...

  40. [40]

    ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,

    Wang, X., Delgado, H., Tak, H., Jung, J.-w., Shim, H.- j., Todisco, M., Kukanov, I., Liu, X., Sahidullah, M., Kinnunen, T., et al. ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale.arXiv preprint arXiv:2408.08739,

  41. [41]

    Codecfake: Enhanc- ing anti-spoofing models against deepfake audios from codec-based speech synthesis systems.arXiv preprint arXiv:2406.07237,

    Wu, H., Tseng, Y ., and Lee, H.-y. Codecfake: Enhanc- ing anti-spoofing models against deepfake audios from codec-based speech synthesis systems.arXiv preprint arXiv:2406.07237,

  42. [42]

    The codecfake dataset and countermeasures for the universally detection of deepfake audio.IEEE Transactions on Audio, Speech and Language Processing, 2025a

    Xie, Y ., Lu, Y ., Fu, R., Wen, Z., Wang, Z., Tao, J., Qi, X., Wang, X., Liu, Y ., Cheng, H., et al. The codecfake dataset and countermeasures for the universally detection of deepfake audio.IEEE Transactions on Audio, Speech and Language Processing, 2025a. Xie, Y ., Wang, X., Wang, Z., Fu, R., Wen, Z., Cao, S., Ma, L., Li, C., Cheng, H., and Ye, L. Neura...

  43. [43]

    A., Kinnunen, T., Evans, N., et al

    Yamagishi, J., Wang, X., Todisco, M., Sahidullah, M., Patino, J., Nautsch, A., Liu, X., Lee, K. A., Kinnunen, T., Evans, N., et al. ASVspoof 2021: accelerating progress in spoofed and deepfake speech detection.arXiv preprint arXiv:2109.00537,

  44. [44]

    Mscenespeech: A multi-scene speech dataset for expressive speech synthesis.arXiv preprint arXiv:2407.14006,

    Yang, Q., Zuo, J., Su, Z., Jiang, Z., Li, M., Zhao, Z., Chen, F., Wang, Z., and Huai, B. Mscenespeech: A multi-scene speech dataset for expressive speech synthesis.arXiv preprint arXiv:2407.14006,

  45. [45]

    J., Lakho- tia, K., Lin, Y

    Yang, S.-w., Chi, P.-H., Chuang, Y .-S., Lai, C.-I. J., Lakho- tia, K., Lin, Y . Y ., Liu, A. T., Shi, J., Chang, X., Lin, G.-T., et al. Superb: Speech processing universal performance benchmark.arXiv preprint arXiv:2105.01051,

  46. [46]

    SPEAR: A unified ssl framework for learning speech and audio representations.arXiv preprint arXiv:2510.25955,

    Yang, X., Yang, Y ., Jin, Z., Cui, Z., Wu, W., Li, B., Zhang, C., and Woodland, P. SPEAR: A unified ssl framework for learning speech and audio representations.arXiv preprint arXiv:2510.25955,

  47. [47]

    Half-truth: A partially fake audio detection dataset.arXiv preprint arXiv:2104.03617,

    Yi, J., Bai, Y ., Tao, J., Ma, H., Tian, Z., Wang, C., Wang, T., and Fu, R. Half-truth: A partially fake audio detection dataset.arXiv preprint arXiv:2104.03617,

  48. [48]

    Singfake: Singing voice deepfake detection

    Zang, Y ., Zhang, Y ., Heydari, M., and Duan, Z. Singfake: Singing voice deepfake detection. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 12156–12160. IEEE,

  49. [49]

    Audio deepfake detection: What has been achieved and what lies ahead.Sensors (Basel, Switzerland), 25(7):1989,

    Zhang, B., Cui, H., Nguyen, V ., and Whitty, M. Audio deepfake detection: What has been achieved and what lies ahead.Sensors (Basel, Switzerland), 25(7):1989,

  50. [50]

    SVDD 2024: The inaugural singing voice deepfake detection challenge

    Zhang, Y ., Zang, Y ., Shi, J., Yamamoto, R., Toda, T., and Duan, Z. SVDD 2024: The inaugural singing voice deepfake detection challenge. InProc. IEEE Spoken Language Technology Workshop (SLT),

  51. [51]

    R., Pimentel, A., and Falk, T

    Zhu, Y ., Guimar ˜aes, H. R., Pimentel, A., and Falk, T. AUDDT: Audio unified deepfake detection benchmark toolkit.arXiv preprint arXiv:2509.21597,

  52. [52]

    13 Alethia: A Foundational Encoder for Voice Deepfakes A. POC Experiment Setup We follow a similar setup as the base versions of Wave2vec2 and HuBERT, where the pretraining data are identical to the downstream finetuning data (Baevski et al., 2020; Hsu et al., 2021). We curate 1k hours of real and fake speech, where the real speech data are sourced from C...

  53. [53]

    all-step

    B. Alethia Architecture and Pretraining Details Table 10 summarizes the architectural hyperparameters of Alethia-Base and Alethia-Large. One thing to note is that while the teacher model WavLM-Large adopts PostLN, the student Alethia-Base uses PreLN, with which we found better convergence. Regarding pretraining setup, we utilize the AdamW optimizer with a...

  54. [54]

    Other Tasks PFSL.For dataset configuration and model inference, we adapt the framework provided by Luong et al

    E.2. Other Tasks PFSL.For dataset configuration and model inference, we adapt the framework provided by Luong et al. (2025).1. While the original framework supports multi-scale evaluation across six temporal resolutions ( units∈ {0.02,0.04,0.08,0.16,0.32,0.64} seconds), we evaluate exclusively at the 20ms scale (units= 0.02 ). This choice is 1https://gith...

  55. [55]

    A VDD.For this task, we adapt code from the FakeA VCeleb repository2

    is included in the pretraining of Alethia, which could lead to biased evaluation. A VDD.For this task, we adapt code from the FakeA VCeleb repository2. We segmented the videos into 3 second chunks to handle variable durations, and treat the chunks as independent samples while calculating metrics. The audio stream is extracted from the videos and resampled...

  56. [56]

    For all experimental conditions, we performed quality control and class balancing for each training dataset. While this significantly reduces the total training volume, it helps to avoid the effects of confounding factors, such as prolonged silence and class imbalance, which may otherwise lead to spurious correlations (M¨uller et al., 2021). For the other...

  57. [57]

    preprocessed

    Datasets subjected to our quality control pipeline are denoted with the “preprocessed” suffix, whereas those labeled “raw” remain in their original state. To facilitate a direct comparison with prior literature, we specifically utilize the raw versions of standard benchmarks, including the ASVspoof series (Nautsch et al., 2021; Yamagishi et al., 2021; Wan...

  58. [58]

    37,314 Deepfake eval 2024 preprocessed (Chandra et al.,

  59. [59]

    4,700 KSS preprocessed (Park, 2019; Park & Mulc,

  60. [60]

    Other Experimental Results G.1

    98,411 G. Other Experimental Results G.1. SDD Model Comparison Table 16 and 17 provide per-dataset EER and accuracy of Alethia-Large and W1V-1B along with their performance difference under the EXPANDED-AUGand EXPANDEDcondition, respectively. For both conditions, Alethia-large shows better performance with significantly fewer datasets with accuracies belo...