Mitigating Proxy-to-Wild Domain Gap in Deepfake Speech

Claire Lin; Haibin Wu; Hung-yi Lee; Jyh-Shing Roger Jang; Wei-Chung Lu; Xuanjun Chen; Yun-Shing Wu

arxiv: 2606.07494 · v1 · pith:XCFDOTQNnew · submitted 2026-06-05 · 💻 cs.SD · eess.AS

Xuanjun Chen, Yun-Shing Wu, Wei-Chung Lu, Claire Lin, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang

Pith reviewed 2026-06-27 20:41 UTC · model grok-4.3

keywords: deepfake detection · speech synthesis · domain adaptation · feature augmentation · codec fake · generalization · self-supervised learning

The pith

Transforming deterministic feature statistics into stochastic distributions narrows the proxy-to-wild domain gap in deepfake speech detection.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to fix the poor generalization of deepfake speech detectors that are trained on codec-resynthesized proxy data when they encounter real-world CodecFake attacks from new models. It introduces Domain-Shift Feature Augmentation to create more realistic training variations by turning fixed feature statistics into random distributions while fine-tuning the model. The authors also release a harder test set called CoSG ExtEval that includes 40 unseen generative models and long-form audio. When this augmentation is combined with a post-trained self-supervised backbone, the approach reaches the highest detection accuracy on both the original and the new extended evaluation sets.

Core claim

Domain-Shift Feature Augmentation narrows the proxy-to-wild domain gap by transforming deterministic feature statistics into stochastic distributions during fine-tuning, and when paired with a post-trained SSL backbone it achieves state-of-the-art performance across diverse CodecFake attacks in both CoSG Eval and CoSG ExtEval.

What carries the argument

Domain-Shift Feature Augmentation (DSFA), a fine-tuning technique that simulates in-the-wild variations by converting deterministic feature statistics to stochastic distributions.
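To make the mechanism concrete, here is a minimal sketch of a statistic-resampling augmentation in plain Python. It is not the authors' implementation: the abstract does not say how the distributions are built, so the sketch assumes a recipe in the spirit of the uncertainty-modeling approach the paper cites (Li et al., 2022), where each utterance's per-channel mean µ and standard deviation σ are perturbed with Gaussian noise scaled by their spread across the batch. The function names, the probability `p`, and the toy feature layout (batch × time × channels) are illustrative assumptions.

```python
import math
import random


def channel_stats(utterance):
    """Per-channel mean and std over time for one utterance (time x channels)."""
    n_frames = len(utterance)
    means, stds = [], []
    for c in range(len(utterance[0])):
        column = [frame[c] for frame in utterance]
        mu = sum(column) / n_frames
        var = sum((x - mu) ** 2 for x in column) / n_frames
        means.append(mu)
        stds.append(math.sqrt(var + 1e-6))  # epsilon avoids division by zero
    return means, stds


def batch_spread(values):
    """Per-channel std, across the batch, of a per-utterance statistic."""
    n = len(values)
    spread = []
    for c in range(len(values[0])):
        col = [v[c] for v in values]
        m = sum(col) / n
        spread.append(math.sqrt(sum((x - m) ** 2 for x in col) / n + 1e-6))
    return spread


def stochastic_stat_augment(batch, rng, p=0.5):
    """Resample each utterance's mean/std from Gaussians centred on the observed values.

    Assumed recipe (hypothetical, not taken from the paper):
    new_mu = mu + eps * spread(mu), new_sigma = sigma + eps * spread(sigma).
    """
    if rng.random() > p:
        return batch  # leave the batch untouched with probability 1 - p
    stats = [channel_stats(u) for u in batch]
    mu_spread = batch_spread([s[0] for s in stats])
    sigma_spread = batch_spread([s[1] for s in stats])
    out = []
    for utterance, (mu, sigma) in zip(batch, stats):
        new_mu = [m + rng.gauss(0.0, 1.0) * s for m, s in zip(mu, mu_spread)]
        new_sigma = [sd + rng.gauss(0.0, 1.0) * s for sd, s in zip(sigma, sigma_spread)]
        # Normalise with the observed statistics, then re-scale with the sampled ones.
        out.append([
            [(x - m) / sd * ns + nm
             for x, m, sd, ns, nm in zip(frame, mu, sigma, new_sigma, new_mu)]
            for frame in utterance
        ])
    return out
```

In a real detector this step would sit on the SSL backbone's hidden features during fine-tuning only and be switched off at test time. Which layer it is applied to is a design choice; the paper's layer-wise ablation addresses it.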

If this is right

Detectors generalize better to unseen generative models and long-form audio in extended evaluations.
The combination with post-trained SSL backbones produces state-of-the-art results on both standard and harder test sets.
The method improves robustness across a wider range of neural audio codec attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same stochastic augmentation idea could be tested on domain gaps in other audio tasks such as speaker verification or music deepfake detection.
Future experiments might vary the type and amount of stochasticity to find the minimal change that still closes the gap.

Load-bearing premise

That converting deterministic feature statistics into stochastic distributions during fine-tuning accurately represents real-world variations without introducing new artifacts that degrade detection.

What would settle it

A controlled test showing that models using DSFA achieve no improvement or lower accuracy than standard fine-tuning on the CoSG ExtEval set with its 40 unseen models would falsify the claim.

Figures

Figures reproduced from arXiv: 2606.07494 by Claire Lin, Haibin Wu, Hung-yi Lee, Jyh-Shing Roger Jang, Wei-Chung Lu, Xuanjun Chen, Yun-Shing Wu.

**Figure 1.** Overview of the Domain-Shift Feature Augmentation (DSFA) method. The proposed method estimates feature statistics µ and σ to construct probabilistic distributions for sampling. For visual clarity, only the mean statistic µ is illustrated in this figure.

**Figure 2.** The Proxy-to-Wild Domain Gap Analysis. […] Transformer model and measure the feature statistics distributions of training and testing domain. While the baseline shows significant feature shifts (Figs. 2a, 2c), DSFA (Figs. 2b, 2d) improves distribution overlap for Mean (42.91% → 43.03%) and STD (65.01% → 67.09%). By narrowing the statistical gap in the latent space, DSFA aligns the data distributions and pro…
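The Figure 2 caption reports overlap percentages, but this page does not say how they are computed. One common definition is histogram intersection: bin both samples on a shared grid, normalise each histogram to sum to one, and add up the bin-wise minima. The sketch below assumes that definition; the function name and the bin count are hypothetical and may not match the paper's procedure.

```python
def histogram_overlap(a, b, n_bins=50):
    """Histogram intersection of two 1-D samples, as a percentage in [0, 100]."""
    lo = min(min(a), min(b))
    hi = max(max(a), max(b))
    width = (hi - lo) / n_bins or 1.0  # fall back to 1.0 when all values are equal

    def normalised_hist(values):
        counts = [0] * n_bins
        for v in values:
            # Clamp so the maximum value lands in the last bin.
            idx = min(int((v - lo) / width), n_bins - 1)
            counts[idx] += 1
        return [c / len(values) for c in counts]

    ha, hb = normalised_hist(a), normalised_hist(b)
    return 100.0 * sum(min(x, y) for x, y in zip(ha, hb))
```

Here `a` would hold one feature statistic (for example the per-utterance mean of a channel) from the training domain and `b` the same statistic from the test domain. Under this definition identical samples score 100 and samples that share no bins score 0.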

Original abstract

Recent neural audio codec-based speech generation (CodecFake) produces highly realistic audio, posing a challenge to existing deepfake countermeasure models. While using codec resynthesized speech (CoRS) as proxy data improves performance, it often suffers from limited generalization. We propose Domain-Shift Feature Augmentation (DSFA), which simulates "in-the-wild" variations by transforming deterministic feature statistics into stochastic distributions during fine-tuning. To evaluate generalization, we further introduce Codec-based Speech Generation Extension Evaluation (CoSG ExtEval) dataset, a more challenging extension of the CoSG Eval (from CodecFake+) dataset, featuring 40 unseen generative models and long-form audio. Experimental results demonstrate that combining a post-trained SSL backbone with DSFA effectively narrows the proxy-to-wild domain gap. This approach achieves state-of-the-art performance across diverse CodecFake attacks in both CoSG Eval and CoSG ExtEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DSFA and the extended CoSG dataset target a real generalization gap in codec deepfake detection, but the abstract supplies no numbers or controls to check whether the stochastic augmentation actually delivers.

The paper's core addition is Domain-Shift Feature Augmentation, which converts fixed SSL feature statistics into random distributions during fine-tuning, paired with a new CoSG ExtEval set that adds 40 unseen generators and long-form clips. The claim is that a post-trained SSL backbone plus DSFA narrows the proxy-to-wild gap and reaches SOTA on both the original and extended sets.

The method and dataset are presented as fresh. The augmentation idea is simple and directly addresses the known mismatch between resynthesized proxy data and real codec outputs. Extending the evaluation to more unseen models and longer audio is a practical step that other groups could reuse.

The main weakness is the complete absence of experimental detail. The abstract states SOTA results without listing baselines, reporting error bars, describing training splits, or showing any distribution matching between the induced stochastic features and actual wild artifacts. Without those, it is impossible to judge whether the gains come from better domain simulation or from generic regularization. The stress-test point about whether the stochastic transform faithfully approximates real codec shifts is still open.

This work is aimed at researchers building audio deepfake detectors that must handle codec-based attacks. A reader already working in that narrow area would find the dataset and the augmentation recipe worth trying. The paper shows clear thinking about the proxy problem and honest engagement with the generalization issue, so it is worth sending out for review even though the current evidence is thin.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Domain-Shift Feature Augmentation (DSFA) to narrow the proxy-to-wild domain gap in deepfake speech detection. DSFA transforms deterministic feature statistics from a post-trained SSL backbone into stochastic distributions during fine-tuning to simulate in-the-wild variations. The authors introduce the CoSG ExtEval dataset (an extension of CoSG Eval with 40 unseen generative models and long-form audio) and claim that combining the post-trained SSL backbone with DSFA achieves state-of-the-art performance across diverse CodecFake attacks on both CoSG Eval and CoSG ExtEval.

Significance. If the results are reproducible and the stochastic augmentation is shown to target domain-specific shifts rather than acting as generic regularization, the work would be significant for improving generalization in audio deepfake countermeasures. The new CoSG ExtEval dataset would also serve as a useful community benchmark for evaluating robustness to unseen codec-based attacks and long-form audio.

major comments (2)

[Abstract] Abstract: the claim of state-of-the-art performance across CoSG Eval and CoSG ExtEval is asserted without any reported baselines, error bars, statistical tests, data-exclusion rules, or quantitative results, making it impossible to verify whether the numbers support the central claim that DSFA narrows the proxy-to-wild gap.
[Abstract] Abstract (DSFA description): the load-bearing assumption that converting deterministic SSL feature statistics to stochastic distributions during fine-tuning faithfully approximates real domain shifts (e.g., long-form audio artifacts or outputs from the 40 unseen models in CoSG ExtEval) is not accompanied by any distribution-matching analysis or comparison to actual wild data statistics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments on the abstract below and will make revisions to improve clarity and support for our claims.

Referee: [Abstract] Abstract: the claim of state-of-the-art performance across CoSG Eval and CoSG ExtEval is asserted without any reported baselines, error bars, statistical tests, data-exclusion rules, or quantitative results, making it impossible to verify whether the numbers support the central claim that DSFA narrows the proxy-to-wild gap.

Authors: We agree that the abstract should include concrete quantitative support for the SOTA claim. In the revised version we will add specific performance metrics (EER/AUC on both datasets), explicit baseline comparisons, and references to the error bars and statistical tests already reported in the experimental sections. Data exclusion criteria (if any) will also be summarized.
Revision: yes

Referee: [Abstract] Abstract (DSFA description): the load-bearing assumption that converting deterministic SSL feature statistics to stochastic distributions during fine-tuning faithfully approximates real domain shifts (e.g., long-form audio artifacts or outputs from the 40 unseen models in CoSG ExtEval) is not accompanied by any distribution-matching analysis or comparison to actual wild data statistics.

Authors: The manuscript currently supports the assumption via downstream generalization gains on CoSG ExtEval. We acknowledge that an explicit distribution-matching analysis would strengthen the justification. We will add a short analysis (e.g., feature-statistic comparisons or divergence measures) either in the main text or appendix to directly compare DSFA-augmented statistics against those observed from the unseen models and long-form audio.
Revision: yes
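The rebuttal promises EER figures. For readers less familiar with the metric, the equal error rate is the operating point where the false-acceptance rate (spoofs accepted as bona fide) equals the false-rejection rate (bona fide rejected). The sketch below approximates it with a simple threshold sweep. It assumes that higher scores mean "more likely bona fide"; the function name and the tie-handling are illustrative choices, not taken from the paper or any particular toolkit.

```python
def equal_error_rate(bona_fide_scores, spoof_scores):
    """Approximate EER (in [0, 1]), assuming higher scores mean 'more bona fide'."""
    thresholds = sorted(set(bona_fide_scores) | set(spoof_scores))
    best_gap, eer = float("inf"), 1.0
    for t in thresholds:
        # False rejection: bona fide trials scored below the threshold.
        frr = sum(s < t for s in bona_fide_scores) / len(bona_fide_scores)
        # False acceptance: spoofed trials scored at or above the threshold.
        far = sum(s >= t for s in spoof_scores) / len(spoof_scores)
        gap = abs(far - frr)
        if gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer
```

Evaluation toolkits usually interpolate between thresholds rather than picking the nearest one, so values from this sketch can differ slightly from reported numbers on small score sets.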

Circularity Check

0 steps flagged

No circularity; empirical method and new evaluation dataset are independent

The provided abstract and description contain no equations, fitted parameters, or self-citations. DSFA is introduced as a descriptive augmentation technique (transforming deterministic statistics to stochastic distributions) and evaluated on the newly introduced CoSG ExtEval dataset with unseen models. No derivation reduces by construction to its inputs, and the central claim rests on empirical results rather than self-referential definitions or predictions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no information on free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

41 extracted references · 2 linked inside Pith

[1]

Introduction. Advances in speech generation technologies have greatly improved the naturalness and controllability of synthetic speech. While these developments enable a wide range of beneficial applications, they also introduce serious security risks when misused for malicious audio deepfake attacks, such as misinformation dissemination, identity impe...
[2]

in-the-wild

The Proxy-to-Wild Domain Gap in Deepfake Speech. Training CMs on proxy data is a cost-effective alternative to collecting diverse TTS/VC speech [9, 10, 13–15], yet an inherent domain gap persists, hindering generalization to “in-the-wild” scenarios. We categorize this gap into three dimensions: (1) Artifact Mismatch: Unseen codecs and generative models i...

Pith/arXiv arXiv 2026
[3]

in-the-wild

Proposed Method. Our framework (Fig. 1) bridges the proxy-to-wild domain gap by: (1) leveraging a deepfake-tailored post-trained SSL backbone to establish a versatile representation space, and (2) employing Domain-Shift Feature Augmentation (DSFA) during fine-tuning to simulate unseen domain variations. 3.1. Post-Training Self-Supervised Learning Backb...
[4]

CoRS contains spoofed samples from 31 neural codecs applied to the VCTK corpus [25]

Experimental Setup. We conduct experiments using the CodecFake+ [10] dataset, where CoRS (speech resynthesized by neural audio codecs) is employed for training and CoSG (speech from codec-based generation models) is used for evaluation. CoRS contains spoofed samples from 31 neural codecs applied to the VCTK corpus [25]. Following previous work [10], we ...
[5]

Beyond the benchmarks (a)–(f) from CodecFake+ [10] on existing sets (ASVspoof19 LA, CoRS, CoSG Eval), we further evaluate performance on our new collected CoSG ExtEval dataset

Main Results. Table 2 presents the cross-scenario results. Beyond the benchmarks (a)–(f) from CodecFake+ [10] on existing sets (ASVspoof19 LA, CoRS, CoSG Eval), we further evaluate performance on our new collected CoSG ExtEval dataset. CoSG ExtEval Baseline Evaluation. Model (a) achieves near-perfect in-domain results but generalizes poorly to CoSG Eval, ...

arXiv
[6]

SSL Layer-wise Analysis. We evaluate DSFA across SSL layers to identify the optimal integration point for robustness and generalizability

Ablation and Quantitative Evaluation. To further dissect the mechanisms behind these improvements and optimize feature-level augmentations, we conduct a detailed ablation study and quantitative analysis in this section. SSL Layer-wise Analysis. We evaluate DSFA across SSL layers to identify the optimal integration point for robustness and generalizability...
[7]

To overcome this, we propose Domain-Shift Feature Augmentation (DSFA), which promotes domain-invariant representations by simulating statistical discrepancies in the latent space

Conclusion. This work addresses the proxy-to-wild domain gap in CodecFake detection, where models trained on resynthesized data (CoRS) exhibit a distributional bias that impairs their performance against unseen generative systems. To overcome this, we propose Domain-Shift Feature Augmentation (DSFA), which promotes domain-invariant representations by s...
[8]

Acknowledgements. This work was supported by the Ministry of Education (MOE) of Taiwan under the project Taiwan Centers of Excellence in Artificial Intelligence, through the NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE). We thank the National Center for High-performance Computing (NCHC) of the National Applied Research Laboratorie...
[9]

Generative AI Use Disclosure. We employed Gemini for grammatical paraphrasing and language polishing to improve the manuscript’s clarity. The AI tool was utilized solely for technical editing purposes and did not contribute to the conceptualization, data analysis, or production of any significant scholarly content in this work
[10]

ASVspoof 2019: future horizons in spoofed and fake audio detection

M. Todisco, X. Wang, V. Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. H. Kinnunen, and K. A. Lee, “ASVspoof 2019: future horizons in spoofed and fake audio detection,” in Proc. Interspeech

2019
[11]

ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild

X. Liu, X. Wang et al., “ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE Transactions on Audio, Speech and Language Processing, vol. 31, 2023

2021
[12]

ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale

X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco et al., “ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,” in Proc. ASVspoof Workshop, 2024

2024
[13]

ADD 2022: the first audio deep synthesis detection challenge

J. Yi, R. Fu et al., “ADD 2022: the first audio deep synthesis detection challenge,” in Proc. ICASSP, 2022

2022
[14]

ADD 2023: Towards audio deepfake detection and analysis in the wild

J. Yi, C. Y. Zhang, J. Tao, C. Wang, X. Yan, Y. Ren, H. Gu, and J. Zhou, “ADD 2023: Towards audio deepfake detection and analysis in the wild,” arXiv preprint arXiv:2408.04967, 2024

arXiv 2023
[15]

Codec-SUPERB: An in-depth analysis of sound codec models

H. Wu, H.-L. Chung, Y.-C. Lin, Y.-K. Wu, X. Chen, Y.-C. Pai et al., “Codec-SUPERB: An in-depth analysis of sound codec models,” in Findings Assoc. Comput. Linguist., 2024

2024
[16]

Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models

H. Wu, X. Chen, Y.-C. Lin, K. Chang, J. Du, K.-H. Lu et al., “Codec-SUPERB @ SLT 2024: A lightweight benchmark for neural audio codec models,” in Proc. IEEE Spoken Lang. Technol. Workshop, 2024

2024
[17]

Towards audio language modeling-an overview

H. Wu, X. Chen, Y.-C. Lin, K.-w. Chang, H.-L. Chung, A. H. Liu, and H.-y. Lee, “Towards audio language modeling-an overview,” arXiv preprint arXiv:2402.13236, 2024

arXiv 2024
[18]

CodecFake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems

H. Wu, Y. Tseng, and H.-y. Lee, “CodecFake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems,” in Proc. Interspeech, 2024

2024
[19]

CodecFake+: A large-scale neural audio codec-based deepfake speech dataset

X. Chen, J. Du, H. Wu, L. Zhang, I. Lin, I. Chiu, W. Ren, Y. Tseng, Y. Tsao, J.-S. R. Jang, and H.-y. Lee, “CodecFake+: A large-scale neural audio codec-based deepfake speech dataset,” arXiv preprint arXiv:2501.08238, 2025

Pith/arXiv arXiv 2025
[20]

Towards generalized source tracing for codec-based deepfake speech

X. Chen, I. Lin, L. Zhang, H. Wu, H.-y. Lee, J.-S. R. Jang et al., “Towards generalized source tracing for codec-based deepfake speech,” arXiv preprint arXiv:2506.07294, 2025

arXiv 2025
[21]

Codec-based deepfake source tracing via neural audio codec taxonomy

X. Chen, I. Lin, L. Zhang, J. Du, H. Wu, H.-y. Lee, J.-S. R. Jang et al., “Codec-based deepfake source tracing via neural audio codec taxonomy,” arXiv preprint arXiv:2505.12994, 2025

arXiv 2025
[22]

Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end?

X. Wang and J. Yamagishi, “Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end?” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10311–10315

2024
[23]

Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders

X. Wang and J. Yamagishi, “Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

2023
[24]

Improving copy-synthesis anti-spoofing training method with rhythm and speaker perturbation

J. Lu, Y. Zhang, Z. Li, Z. Shang, W. Wang, and P. Zhang, “Improving copy-synthesis anti-spoofing training method with rhythm and speaker perturbation,” in Interspeech, vol. 2024, 2024, pp. 512–516

2024
[25]

Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy

X. Chen, I.-M. Lin, L. Zhang, J. Du, H. Wu, H.-y. Lee, and J.-S. R. Jang, “Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy,” in Interspeech 2025, 2025, pp. 1538–1542

2025
[26]

The impact of silence on speech anti-spoofing

Y. Zhang, Z. Li, J. Lu, H. Hua, W. Wang, and P. Zhang, “The impact of silence on speech anti-spoofing,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3374–3389, 2023

2023
[27]

Post-training for deepfake speech detection

W. Ge, X. Wang, X. Liu, and J. Yamagishi, “Post-training for deepfake speech detection,” arXiv preprint arXiv:2506.21090, 2025

arXiv 2025
[28]

wav2vec 2.0: A framework for self-supervised learning of speech representations

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, vol. 33, 2020

2020
[29]

Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation

H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in Proc. Odyssey Speaker Lang. Recognit. Workshop, 2022

2022
[30]

Closed-form factorization of latent semantics in GANs

Y. Shen and B. Zhou, “Closed-form factorization of latent semantics in GANs,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1532–1540

2021
[31]

Implicit semantic data augmentation for deep networks

Y. Wang, X. Pan, S. Song, H. Zhang, G. Huang, and C. Wu, “Implicit semantic data augmentation for deep networks,” Advances in neural information processing systems, vol. 32, 2019

2019
[32]

Arbitrary style transfer in real-time with adaptive instance normalization

X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1501–1510

2017
[33]

Supervised contrastive learning

P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” Advances in neural information processing systems, vol. 33, pp. 18661–18673, 2020

2020
[34]

CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92)

J. Yamagishi, C. Veaux, K. MacDonald et al., “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” Univ. of Edinburgh, The Centre for Speech Technology Research (CSTR), 2019

2019
[35]

Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing

H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in Proc. ICASSP, 2022

2022
[36]

Lens-df: Deepfake detection and temporal localization for long-form noisy speech

X. Liu, W. Ge, X. Wang, and J. Yamagishi, “Lens-df: Deepfake detection and temporal localization for long-form noisy speech,” Osaka, Japan, 2025

2025
[37]

Singfake: Singing voice deepfake detection

Y. Zang, Y. Zhang, M. Heydari, and Z. Duan, “Singfake: Singing voice deepfake detection,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12156–12160

2024
[38]

Singing voice graph modeling for singfake detection

X. Chen, H. Wu, J.-S. R. Jang, and H.-y. Lee, “Singing voice graph modeling for singfake detection,” in Interspeech 2024, 2024

2024
[39]

How does instrumental music help singfake detection?

X. Chen, C.-Y. Hu, I.-M. Lin, Y.-C. Lin, I.-H. Chiu, Y. Zhang, S.-F. Huang, Y.-H. Yang, H. Wu, H.-y. Lee, and J.-S. R. Jang, “How does instrumental music help singfake detection?” 2025

2025
[40]

SpeechFake: A large-scale multilingual speech deepfake dataset incorporating cutting-edge generation methods

W. Huang, Y. Gu, Z. Wang, H. Zhu, and Y. Qian, “SpeechFake: A large-scale multilingual speech deepfake dataset incorporating cutting-edge generation methods,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 9985–9998

2025
[41]

Uncertainty modeling for out-of-distribution generalization

X. Li, Y. Dai, Y. Ge, J. Liu, Y. Shan, and L. Duan, “Uncertainty modeling for out-of-distribution generalization,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=6HN7LHyzGgC

2022

[1] [1]

Introduction Advances in speech generation technologies have greatly im- proved the naturalness and controllability of synthetic speech. While these developments enable a wide range of beneficial applications, they also introduce serious security risks when misused for malicious audio deepfake attacks, such as mis- information dissemination, identity impe...

[2] [2]

in-the- wild

The Proxy-to-Wild Domain Gap in Deepfake Speech Training CMs on proxy data is a cost-effective alternative to collecting diverse TTS/VC speech [9, 10, 13–15], yet an inher- ent domain gap persists, hindering generalization to “in-the- wild” scenarios. We categorize this gap into three dimensions: (1) Artifact Mismatch:Unseen codecs and generative models i...

Pith/arXiv arXiv 2026

[3] [3]

in-the-wild

Proposed Method Our framework (Fig. 1) bridges the proxy-to-wild domain gap by: (1) leveraging a deepfake-tailored post-trained SSL back- bone to establish a versatile representation space, and (2) em- ploying Domain-Shift Feature Augmentation (DSFA) during fine-tuning to simulate unseen domain variations. 3.1. Post-Training Self-Supervised Learning Backb...

[4] [4]

CoRS con- tains spoofed samples from 31 neural codecs applied to the VCTK corpus [25]

Experimental Setup We conduct experiments using the CodecFake+ [10] dataset, where CoRS (speech resynthesized by neural audio codecs) is employed for training and CoSG (speech from codec- based generation models) is used for evaluation. CoRS con- tains spoofed samples from 31 neural codecs applied to the VCTK corpus [25]. Following previous work [10], we ...

[5] [5]

Beyond the benchmarks (a)–(f) from CodecFake+ [10] on existing sets (ASVspoof19 LA, CoRS, CoSG Eval), we further evaluate per- formance on our new collected CoSG ExtEval dataset

Main Results Table 2 presents the cross-scenario results. Beyond the benchmarks (a)–(f) from CodecFake+ [10] on existing sets (ASVspoof19 LA, CoRS, CoSG Eval), we further evaluate per- formance on our new collected CoSG ExtEval dataset. CoSG ExtEval Baseline Evaluation.Model (a) achieves near-perfect in-domain results but generalizes poorly to CoSG Eval, ...

arXiv

[6] [6]

SSL Layer-wise Analysis.We evaluate DSFA across SSL layers to identify the optimal integration point for robustness and generalizability

Ablation and Quantitative Evaluation To further dissect the mechanisms behind these improvements and optimize feature-level augmentations, we conduct a de- tailed ablation study and quantitative analysis in this section. SSL Layer-wise Analysis.We evaluate DSFA across SSL layers to identify the optimal integration point for robustness and generalizability...

[7] [7]

To overcome this, we propose Domain-Shift Feature Augmentation (DSFA), which promotes domain-invariant representations by simulating sta- tistical discrepancies in the latent space

Conclusion This work addresses the proxy-to-wild domain gap in Codec- Fake detection, where models trained on resynthesized data (CoRS) exhibit a distributional bias that impairs their perfor- mance against unseen generative systems. To overcome this, we propose Domain-Shift Feature Augmentation (DSFA), which promotes domain-invariant representations by s...

[8] [8]

Acknowledgements This work was supported by the Ministry of Education (MOE) of Taiwan under the project Taiwan Centers of Excellence in Artificial Intelligence, through the NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE). We thank the National Center for High-performance Computing (NCHC) of the National Applied Research Laboratorie...

[9] [9]

Generative AI Use Disclosure We employed Gemini for grammatical paraphrasing and lan- guage polishing to improve the manuscript’s clarity. The AI tool was utilized solely for technical editing purposes and did not contribute to the conceptualization, data analysis, or pro- duction of any significant scholarly content in this work

[10] [10]

ASVspoof 2019: future horizons in spoofed and fake audio detection,

M. Todisco, X. Wang, V . Vestman, M. Sahidullah, H. Delgado, A. Nautsch, J. Yamagishi, N. Evans, T. H. Kinnunen, and K. A. Lee, “ASVspoof 2019: future horizons in spoofed and fake audio detection,” inProc. Interspeech

2019

[11] [11]

ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild,

X. Liu, X. Wanget al., “ASVspoof 2021: Towards spoofed and deepfake speech detection in the wild,”IEEE Transactions on Au- dio, Speech and Language Processing, vol. 31, 2023

2021

[12] [12]

ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,

X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco et al., “ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,” inProc. ASVspoof Workshop, 2024

2024

[13] [13]

ADD 2022: the first audio deep synthesis detection challenge,

J. Yi, R. Fuet al., “ADD 2022: the first audio deep synthesis detection challenge,” inProc. ICASSP, 2022

2022

[14] [14]

ADD 2023: Towards audio deepfake detection and anal- ysis in the wild,

J. Yi, C. Y . Zhang, J. Tao, C. Wang, X. Yan, Y . Ren, H. Gu, and J. Zhou, “ADD 2023: Towards audio deepfake detection and anal- ysis in the wild,”arXiv preprint arXiv:2408.04967, 2024

arXiv 2023

[15] [15]

Codec-SUPERB: An in-depth analysis of sound codec models,

H. Wu, H.-L. Chung, Y .-C. Lin, Y .-K. Wu, X. Chen, Y .-C. Pai et al., “Codec-SUPERB: An in-depth analysis of sound codec models,” inFindings Assoc. Comput. Linguist., 2024

2024

[16] [16]

Codec-SUPERB@ SLT 2024: A lightweight benchmark for neu- ral audio codec models,

H. Wu, X. Chen, Y .-C. Lin, K. Chang, J. Du, K.-H. Luet al., “Codec-SUPERB@ SLT 2024: A lightweight benchmark for neu- ral audio codec models,” inProc. IEEE Spoken Lang. Technol. Workshop, 2024

2024

[17] [17]

Towards audio language modeling-an overview,

H. Wu, X. Chen, Y .-C. Lin, K.-w. Chang, H.-L. Chung, A. H. Liu, and H.-y. Lee, “Towards audio language modeling-an overview,” arXiv preprint arXiv:2402.13236, 2024

arXiv 2024

[18] [18]

CodecFake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems,

H. Wu, Y . Tseng, and H. yi Lee, “CodecFake: Enhancing anti-spoofing models against deepfake audios from codec-based speech synthesis systems,” inProc. Interspeech, 2024

2024

[19] [19]

CodecFake+: A large- scale neural audio codec-based deepfake speech dataset,

X. Chen, J. Du, H. Wu, L. Zhang, I. Lin, I. Chiu, W. Ren, Y . Tseng, Y . Tsao, J.-S. R. Jang, and H.-y. Lee, “CodecFake+: A large- scale neural audio codec-based deepfake speech dataset,”arXiv preprint arXiv:2501.08238, 2025

Pith/arXiv arXiv 2025

[20] X. Chen, I. Lin, L. Zhang, H. Wu, H.-y. Lee, J.-S. R. Jang et al., “Towards generalized source tracing for codec-based deepfake speech,” arXiv preprint arXiv:2506.07294, 2025

[21] X. Chen, I. Lin, L. Zhang, J. Du, H. Wu, H.-y. Lee, J.-S. R. Jang et al., “Codec-based deepfake source tracing via neural audio codec taxonomy,” arXiv preprint arXiv:2505.12994, 2025

[22] X. Wang and J. Yamagishi, “Can large-scale vocoded spoofed data improve speech spoofing countermeasure with a self-supervised front end?” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 10 311–10 315

[23] X. Wang and J. Yamagishi, “Spoofed training data for speech spoofing countermeasure can be efficiently created using neural vocoders,” in ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

[24] J. Lu, Y. Zhang, Z. Li, Z. Shang, W. Wang, and P. Zhang, “Improving copy-synthesis anti-spoofing training method with rhythm and speaker perturbation,” in Interspeech, vol. 2024, 2024, pp. 512–516

[25] X. Chen, I.-M. Lin, L. Zhang, J. Du, H. Wu, H.-y. Lee, and J.-S. R. Jang, “Codec-Based Deepfake Source Tracing via Neural Audio Codec Taxonomy,” in Interspeech 2025, 2025, pp. 1538–1542

[26] Y. Zhang, Z. Li, J. Lu, H. Hua, W. Wang, and P. Zhang, “The impact of silence on speech anti-spoofing,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 3374–3389, 2023

[27] W. Ge, X. Wang, X. Liu, and J. Yamagishi, “Post-training for deepfake speech detection,” arXiv preprint arXiv:2506.21090, 2025

[28] A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, vol. 33, 2020

[29] H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” in Proc. Odyssey Speaker Lang. Recognit. Workshop, 2022

[30] Y. Shen and B. Zhou, “Closed-form factorization of latent semantics in GANs,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 1532–1540

[31] Y. Wang, X. Pan, S. Song, H. Zhang, G. Huang, and C. Wu, “Implicit semantic data augmentation for deep networks,” Advances in neural information processing systems, vol. 32, 2019

[32] X. Huang and S. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in Proceedings of the IEEE international conference on computer vision, 2017, pp. 1501–1510

[33] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, “Supervised contrastive learning,” Advances in neural information processing systems, vol. 33, pp. 18 661–18 673, 2020

[34] J. Yamagishi, C. Veaux, K. MacDonald et al., “CSTR VCTK corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” Univ. of Edinburgh, The Centre for Speech Technology Research (CSTR), 2019

[35] H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in Proc. ICASSP, 2022

[36] X. Liu, W. Ge, X. Wang, and J. Yamagishi, “Lens-df: Deepfake detection and temporal localization for long-form noisy speech,” Osaka, Japan, 2025

[37] Y. Zang, Y. Zhang, M. Heydari, and Z. Duan, “Singfake: Singing voice deepfake detection,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2024, pp. 12 156–12 160

[38] X. Chen, H. Wu, J.-S. R. Jang, and H.-y. Lee, “Singing voice graph modeling for singfake detection,” in Interspeech 2024, 2024

[39] X. Chen, C.-Y. Hu, I.-M. Lin, Y.-C. Lin, I.-H. Chiu, Y. Zhang, S.-F. Huang, Y.-H. Yang, H. Wu, H.-y. Lee, and J.-S. R. Jang, “How does instrumental music help singfake detection?” 2025

[40] W. Huang, Y. Gu, Z. Wang, H. Zhu, and Y. Qian, “SpeechFake: A large-scale multilingual speech deepfake dataset incorporating cutting-edge generation methods,” in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Vienna, Austria: Association for Computational Linguistics, Jul. 2025, pp. 9985–9998

[41] X. Li, Y. Dai, Y. Ge, J. Liu, Y. Shan, and L. Duan, “Uncertainty modeling for out-of-distribution generalization,” in International Conference on Learning Representations, 2022. [Online]. Available: https://openreview.net/forum?id=6HN7LHyzGgC