Recognition: unknown
ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks
Pith reviewed 2026-05-10 13:26 UTC · model grok-4.3
The pith
Learning prosodic patterns from real speech enables better detection of expressive speech deepfakes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ProSDD is a two-stage framework in which Stage I learns prosodic variability from real speech via supervised masked prediction of speaker-conditioned features based on pitch, voice activity, and energy, while Stage II jointly optimizes the same objective with spoof classification to improve generalization against expressive and emotional attacks.
What carries the argument
Supervised masked prediction of speaker-conditioned prosodic variation, which enriches embeddings with natural speech cues before spoof classification.
If this is right
- Training on ASVspoof 2019 data yields 16.14% EER on ASVspoof 2024 instead of 25.43%.
- Training on ASVspoof 2024 data yields 7.38% EER on ASVspoof 2024 instead of 39.62%.
- Approximately 50% relative EER reductions occur on EmoFake and EmoSpoof-TTS.
- Joint optimization of prosody prediction and spoof classification works without requiring spoof-heavy training data.
Where Pith is reading between the lines
- The same prosody pretraining could be tested on other audio classification tasks that suffer from stylistic variation, such as speaker verification under emotion shifts.
- If the learned representations capture universal natural-speech properties, they might allow effective detection even when the spoof generator is completely unseen.
- Combining this approach with self-supervised audio models pretrained on larger unlabeled corpora could further reduce dependence on labeled real-speech data for Stage I.
Load-bearing premise
The prosodic representations learned from real speech in Stage I transfer as generalizable cues for distinguishing expressive and emotional spoofing attacks rather than capturing training-set specific patterns.
What would settle it
If ablating the Stage I masked prosodic prediction objective removes the reported EER reductions on EmoFake and EmoSpoof-TTS while keeping Stage II intact, the claim that prosody learning drives the gains would be falsified.
Figures
read the original abstract
Speech deepfake detection (SDD) systems perform well on standard benchmarks datasets but often fail to generalize to expressive and emotional spoofing attacks. Many methods rely on spoof-heavy training data, learning dataset-specific artifacts rather than transferable cues of natural speech. In contrast, humans internalize variability in real speech and detect fakes as deviations from it. We introduce ProSDD, a two-stage framework that enriches model embeddings through supervised masked prediction of speaker-conditioned prosodic variation based on pitch, voice activity, and energy. Stage I learns prosodic variability from real speech, and Stage II jointly optimizes this objective with spoof classification. ProSDD consistently outperforms baselines under both ASVspoof 2019 and 2024 training, reducing ASVspoof 2024 EER from 25.43% to 16.14% (2019-trained) and from 39.62% to 7.38% (2024-trained), while achieving 50% relative reductions on EmoFake and EmoSpoof-TTS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ProSDD, a two-stage framework for speech deepfake detection. In Stage I, it learns prosodic representations from real speech using supervised masked prediction of speaker-conditioned pitch, voice activity detection (VAD), and energy. Stage II jointly optimizes this with spoof classification. The method is claimed to outperform baselines on ASVspoof 2019 and 2024 datasets, achieving EER reductions on ASVspoof 2024 from 25.43% to 16.14% when trained on 2019 data and from 39.62% to 7.38% when trained on 2024 data, along with 50% relative EER reductions on EmoFake and EmoSpoof-TTS.
Significance. If the central claims hold after verification, this work has potential significance in addressing the generalization challenges of speech deepfake detection systems to expressive and emotional attacks. By focusing on learning natural prosodic variability from real speech rather than relying on spoof-specific artifacts, it offers a more human-like approach to detection. The reported performance gains suggest it could contribute to more robust SDD models.
major comments (2)
- [Abstract] The abstract presents concrete EER reductions on named benchmarks but provides no details on baseline implementations, statistical significance, data splits, or ablation studies. This lack of information prevents verification that the gains support the central claim of improved generalization via prosodic representations.
- [Stage I description] The transferability of prosodic representations learned via supervised masked prediction on real speech to detecting expressive/emotional spoofs is load-bearing for the claims. Without explicit details confirming that the real speech data used in Stage I is fully disjoint from all evaluation sets (speakers, emotions, recording conditions), it is unclear whether the embeddings capture general cues or training-set specific patterns.
minor comments (1)
- [Abstract] The term 'speaker-conditioned prosodic variation' could be clarified with a brief definition or reference to how conditioning is implemented.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which help strengthen the verifiability of our claims. We address each point below and have revised the manuscript to provide additional clarity on experimental details and data usage.
read point-by-point responses
-
Referee: [Abstract] The abstract presents concrete EER reductions on named benchmarks but provides no details on baseline implementations, statistical significance, data splits, or ablation studies. This lack of information prevents verification that the gains support the central claim of improved generalization via prosodic representations.
Authors: We agree that the abstract's length constraints limit the inclusion of full experimental details. The manuscript body addresses these aspects comprehensively: baseline implementations (including AASIST and other SOTA models) are specified in Section 4.2, statistical significance is evaluated through repeated runs with standard deviations reported in Tables 1–3, data splits adhere to official ASVspoof protocols as described in Section 4.1, and ablation studies isolating the prosodic components appear in Section 5.2. To improve immediate verifiability from the abstract, we have added a concise clause referencing the detailed analyses and ablations provided in the paper. revision: yes
-
Referee: [Stage I description] The transferability of prosodic representations learned via supervised masked prediction on real speech to detecting expressive/emotional spoofs is load-bearing for the claims. Without explicit details confirming that the real speech data used in Stage I is fully disjoint from all evaluation sets (speakers, emotions, recording conditions), it is unclear whether the embeddings capture general cues or training-set specific patterns.
Authors: We confirm that Stage I training uses exclusively bona fide utterances from the training partitions of each dataset, with no overlap in speakers, utterances, emotions, or recording conditions relative to any evaluation sets. For ASVspoof 2019/2024 experiments, only the official training bona fide data is employed; cross-dataset tests on EmoFake and EmoSpoof-TTS draw from entirely separate real-speech corpora. This design ensures the learned representations reflect general prosodic variability. We have added an explicit paragraph and data-disjointness table in the revised Section 3.1 to document these splits. revision: yes
Circularity Check
No significant circularity in ProSDD's empirical two-stage framework
full rationale
The paper describes a standard empirical ML pipeline: Stage I performs supervised masked prediction of prosodic attributes (pitch, VAD, energy) on real speech to learn representations, while Stage II jointly optimizes the same objective with a spoof classification loss. No equations, definitions, or self-citations are present that make the final detection performance equivalent to the training inputs by construction. Reported EER reductions on ASVspoof and EmoFake benchmarks are presented as experimental outcomes rather than derived quantities, rendering the approach self-contained against external evaluation.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Prosodic features (pitch, voice activity, energy) capture transferable cues of natural speech variability useful for distinguishing fakes.
Reference graph
Works this paper leans on
-
[1]
Introduction Speech deepfake detection (SDD) aims to distinguish synthetic speech generated by text-to-speech (TTS) and voice conversion (VC) systems from genuine human speech [1, 2]. As synthe- sis models advance in naturalness, speaker similarity, and emo- tional expressiveness, this task becomes increasingly challeng- ing [3, 4]. Over the past decade, ...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[2]
However, their robustness to emotional and expressive synthetic speech remains limited
Related Work Self-supervised learning (SSL) backbones are widely used for speech deepfake detection due to strong performance [19, 20, 29]. However, their robustness to emotional and expressive synthetic speech remains limited. Studies show that state-of- the-art SDD systems exhibit performance variations depending on emotion that can be exploited through...
-
[3]
Figure 1 illustrates the over- all architecture
Proposed Method We introduceProSDD, a two-stage speech deepfake detection framework that enriches the contextual representations of a pre- trained SSL backbone through supervised modeling of speaker- conditioned prosodic variation. Figure 1 illustrates the over- all architecture. In Stage I, the backbone is fine-tuned using only real speech with a supervi...
-
[4]
Dataset.We group data into training, standard bench- marks, and emotional/expressive benchmarks
Experimental Setup We describe the datasets, baselines, and implementation details used to evaluate ProSDD. Dataset.We group data into training, standard bench- marks, and emotional/expressive benchmarks. Stage I training uses LibriSpeech train-clean-100 and dev (bona fide only) [44]. Stage II training uses ASVspoof 2019 LA train/dev [10] and ASVspoof 202...
2019
-
[5]
w/o MP- SI
Results We evaluate ProSDD on traditional benchmarks and challeng- ing emotional/expressive datasets to assess both in-domain per- formance and cross-domain generalization. Performance on Traditional Benchmarks.Table 1 shows that ProSDD maintains strong performance on ASVspoof 2019 and 2021 under both training settings. When trained on ASVspoof 2019, ProS...
2019
-
[6]
Conclusion We introduced ProSDD, a two-stage speech deepfake detec- tion framework that first learns structured speaker-conditioned prosodic representations from bona fide speech via supervised masked prediction, and then jointly optimizes spoof classifica- tion with prosodic supervision as an auxiliary task. ProSDD substantially improves robustness for e...
2019
-
[7]
• The Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the ARTS Program, Contract #D2023-2308110001
Acknowledgments This work was supported by: • The National Science Foundation (NSF) CAREER Award IIS-2533652. • The Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the ARTS Program, Contract #D2023-2308110001. The views and conclusions contained herein are those of the authors and shoul...
-
[8]
These tools were not used to generate scientific content, results, experimental designs, anal- yses, or conclusions
Generative AI Use Disclosure Generative AI tools were employed solely for language polish- ing of text written by the authors. These tools were not used to generate scientific content, results, experimental designs, anal- yses, or conclusions. All authors are responsible for the full content of this paper and consent to its submission
-
[9]
Battling voice spoofing: a review, comparative analysis, and generalizability evaluation of state-of-the-art voice spoofing counter measures,
A. Khan, K. M. Malik, J. Ryan, and M. Saravanan, “Battling voice spoofing: a review, comparative analysis, and generalizability evaluation of state-of-the-art voice spoofing counter measures,” Artificial Intelligence Review, vol. 56, no. Suppl 1, pp. 513–566, 2023
2023
-
[10]
A survey on speech deep- fake detection,
M. Li, Y . Ahmadiadli, and X.-P. Zhang, “A survey on speech deep- fake detection,”ACM Computing Surveys, vol. 57, no. 7, pp. 1–38, 2025
2025
-
[11]
Can emotion fool anti-spoofing?
A. Mahapatra, I. R. Ulgen, A. Reddy Naini, C. Busso, and B. Sis- man, “Can emotion fool anti-spoofing?” inProc. Interspeech 2025, 2025, pp. 5628–5632
2025
-
[12]
Emofake: An initial dataset for emotion fake audio detection,
Y . Zhao, J. Yi, J. Tao, C. Wang, and Y . Dong, “Emofake: An initial dataset for emotion fake audio detection,” inChina National Con- ference on Chinese Computational Linguistics. Springer, 2024, pp. 419–433
2024
-
[13]
Mlaad: The multi- language audio anti-spoofing dataset,
N. M. M ¨uller, P. Kawa, W. H. Choong, E. Casanova, E. G ¨olge, T. M¨uller, P. Syga, P. Sperl, and K. B¨ottinger, “Mlaad: The multi- language audio anti-spoofing dataset,” in2024 International Joint Conference on Neural Networks (IJCNN). IEEE, 2024, pp. 1–7
2024
-
[14]
Does Audio Deepfake Detection Generalize?
N. M. M ¨uller, P. Czempin, F. Dieckmann, A. Froghyar, and K. B¨ottinger, “Does audio deepfake detection generalize?”arXiv preprint arXiv:2203.16263, 2022
-
[15]
Asvspoof 2015: Automatic speaker verification spoofing and countermea- sures challenge evaluation plan,
Z. Wu, T. Kinnunen, N. Evans, and J. Yamagishi, “Asvspoof 2015: Automatic speaker verification spoofing and countermea- sures challenge evaluation plan,”Training, vol. 10, no. 15, p. 3750, 2014
2015
-
[16]
Add 2022: the first audio deep synthe- sis detection challenge,
J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y . Bai, C. Fan, S. Liang, S. Wang, S. Zhang, X. Yan, L. Xu, Z. Wen, and H. Li, “Add 2022: the first audio deep synthe- sis detection challenge,” inICASSP 2022 - 2022 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 9216–9220
2022
-
[17]
Add 2023: the second audio deepfake detection challenge,
J. Yi, J. Tao, R. Fu, X. Yan, C. Wang, T. Wang, C. Y . Zhang, X. Zhang, Y . Zhao, Y . Renet al., “Add 2023: the second audio deepfake detection challenge,”arXiv preprint arXiv:2305.13774, 2023
-
[18]
Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,
X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V . Vestman, T. Kinnunen, K. A. Lee, L. Juvela, P. Alku, Y .-H. Peng, H.-T. Hwang, Y . Tsao, H.-M. Wang, S. L. Maguer, M. Becker, F. Henderson, R. Clark, Y . Zhang, Q. Wang, Y . Jia, K. Onuma, K. Mushika, T. Kaneda, Y . Jiang, L.-J. Liu, Y .-C. Wu, W.-C. Huang, T. Toda, K....
2019
-
[19]
Asvspoof 2021: Towards spoofed and deepfake speech de- tection in the wild,
X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kin- nunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch, and K. A. Lee, “Asvspoof 2021: Towards spoofed and deepfake speech de- tection in the wild,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023
2021
-
[20]
ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,
X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunenet al., “Asvspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,”arXiv preprint arXiv:2408.08739, 2024
-
[21]
A. Mahapatra, I. R. Ulgen, and B. Sisman, “Hula: Prosody-aware anti-spoofing with multi-task learning for expressive and emo- tional synthetic speech,”arXiv preprint arXiv:2509.21676, 2025
-
[22]
End-to-end anti-spoofing with rawnet2,
H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, “End-to-end anti-spoofing with rawnet2,” inICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6369–6373
2021
-
[23]
Aasist: Audio anti-spoofing using in- tegrated spectro-temporal graph attention networks,
J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “Aasist: Audio anti-spoofing using in- tegrated spectro-temporal graph attention networks,” inICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6367–6371
2022
-
[24]
Z. Wu, R. K. Das, J. Yang, and H. Li, “Light convolutional neu- ral network with feature genuinization for detection of synthetic speech attacks,”arXiv preprint arXiv:2009.09637, 2020
-
[25]
Assert: Anti- spoofing with squeeze-excitation and residual networks,
C.-I. Lai, N. Chen, J. Villalba, and N. Dehak, “Assert: Anti- spoofing with squeeze-excitation and residual networks,”arXiv preprint arXiv:1904.01120, 2019
-
[26]
Spoof detection using voice contribution on lfcc features and resnet-34,
K. Z. Mon, K. Galajit, C. O. Mawalim, J. Karnjana, T. Isshiki, and P. Aimmanee, “Spoof detection using voice contribution on lfcc features and resnet-34,” in2023 18th International Joint Sympo- sium on Artificial Intelligence and Natural Language Processing (iSAI-NLP). IEEE, 2023, pp. 1–6
2023
-
[27]
Audio deepfake detection with self-supervised xls-r and sls classifier,
Q. Zhang, S. Wen, and T. Hu, “Audio deepfake detection with self-supervised xls-r and sls classifier,” inProceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6765– 6773
2024
-
[28]
H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deep- fake detection using wav2vec 2.0 and data augmentation,”arXiv preprint arXiv:2202.12233, 2022
-
[29]
Pitch imperfect: Detecting audio deepfakes through acoustic prosodic analysis,
K. Warren, D. Olszewski, S. Layton, K. Butler, C. Gates, and P. Traynor, “Pitch imperfect: Detecting audio deepfakes through acoustic prosodic analysis,”arXiv preprint arXiv:2502.14726, 2025
-
[30]
Deepfake speech detection through emotion recognition: a semantic approach,
E. Conti, D. Salvi, C. Borrelli, B. Hosler, P. Bestagini, F. An- tonacci, A. Sarti, M. C. Stamm, and S. Tubaro, “Deepfake speech detection through emotion recognition: a semantic approach,” in ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2022, pp. 8962– 8966
2022
-
[31]
Combining automatic speaker verification and prosody analysis for synthetic speech detection,
L. Attorresi, D. Salvi, C. Borrelli, P. Bestagini, and S. Tubaro, “Combining automatic speaker verification and prosody analysis for synthetic speech detection,” inInternational Conference on Pattern Recognition. Springer, 2022, pp. 247–263
2022
-
[32]
Emoanti: au- dio anti-deepfake with refined emotion-guided representations,
X. Li, Y . Gong, D. Zou, X. Cao, and S. Lee, “Emoanti: au- dio anti-deepfake with refined emotion-guided representations,” arXiv preprint arXiv:2509.10781, 2025
-
[33]
Finding the human voice in ai: Insights on the perception of ai-voice clones from naturalness and similarity ratings,
L. Bakkouche, C. McGhee, E. Lau, S. Cooper, X. Luo, M. Rees, K. Alter, B. Post, and J. Schwarz, “Finding the human voice in ai: Insights on the perception of ai-voice clones from naturalness and similarity ratings,” inProc. Interspeech 2025, 2025, pp. 2190– 2194
2025
-
[34]
How do users perceive deepfake personas? investigating the deepfake user perception and its implications for human-computer interaction,
I. Kaate, J. Salminen, S.-G. Jung, H. Almerekhi, and B. J. Jansen, “How do users perceive deepfake personas? investigating the deepfake user perception and its implications for human-computer interaction,” inProceedings of the 15th Biannual Conference of the Italian SIGCHI Chapter, 2023, pp. 1–12
2023
-
[35]
Subjective perception and objective evaluation of speech naturalness for deepfake de- tection,
S. Zhang, L. Peng, L. Xie, and Z. Zhao, “Subjective perception and objective evaluation of speech naturalness for deepfake de- tection,” in2025 International Conference on Culture-Oriented Science & Technology (CoST), 2025, pp. 65–70
2025
-
[36]
Slim: Style- linguistics mismatch model for generalized audio deepfake de- tection,
Y . Zhu, S. Koppisetti, T. Tran, and G. Bharaj, “Slim: Style- linguistics mismatch model for generalized audio deepfake de- tection,”Advances in Neural Information Processing Systems, vol. 37, pp. 67 901–67 928, 2024
2024
-
[37]
Xlsr-mamba: A dual-column bidirec- tional state space model for spoofing attack detection,
Y . Xiao and R. K. Das, “Xlsr-mamba: A dual-column bidirec- tional state space model for spoofing attack detection,”IEEE Sig- nal Processing Letters, vol. 32, pp. 1276–1280, 2025
2025
-
[38]
Emoq-tts: Emo- tion intensity quantization for fine-grained controllable emotional text-to-speech,
C.-B. Im, S.-H. Lee, S.-B. Kim, and S.-W. Lee, “Emoq-tts: Emo- tion intensity quantization for fine-grained controllable emotional text-to-speech,” inICASSP 2022 - 2022 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6317–6321
2022
-
[39]
Glow-tts: A genera- tive flow for text-to-speech via monotonic alignment search,
J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-tts: A genera- tive flow for text-to-speech via monotonic alignment search,”Ad- vances in Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020
2020
-
[40]
F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,
Y . Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, “F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,”arXiv preprint arXiv:2410.06885, 2024
-
[41]
Laugh now cry later: Controlling time-varying emotional states of flow- matching-based zero-shot text-to-speech,
H. Wu, X. Wang, S. E. Eskimez, M. Thakker, D. Tompkins, C.- H. Tsai, C. Li, Z. Xiao, S. Zhao, J. Li, and N. Kanda, “Laugh now cry later: Controlling time-varying emotional states of flow- matching-based zero-shot text-to-speech,” in2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 690–697
2024
-
[42]
Expressive-vc: Highly expressive voice conversion with attention fusion of bottleneck and perturbation features,
Z. Ning, Q. Xie, P. Zhu, Z. Wang, L. Xue, J. Yao, L. Xie, and M. Bi, “Expressive-vc: Highly expressive voice conversion with attention fusion of bottleneck and perturbation features,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5
2023
-
[43]
Pmvc: Data augmentation-based prosody modeling for expres- sive voice conversion,
Y . Deng, H. Tang, X. Zhang, J. Wang, N. Cheng, and J. Xiao, “Pmvc: Data augmentation-based prosody modeling for expres- sive voice conversion,” inProceedings of the 31st ACM Interna- tional Conference on Multimedia, 2023, pp. 184–192
2023
-
[44]
Emotion intensity and its control for emotional voice conversion,
K. Zhou, B. Sisman, R. Rana, B. W. Schuller, and H. Li, “Emotion intensity and its control for emotional voice conversion,”IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 31–48, 2022
2022
-
[45]
Prosody in context: A review,
J. Cole, “Prosody in context: A review,”Language, Cognition and Neuroscience, vol. 30, no. 1-2, pp. 1–31, 2015
2015
-
[46]
Learning second language suprasegmentals: Effect of l2 experience on prosody and fluency characteristics of l2 speech,
P. Trofimovich and W. Baker, “Learning second language suprasegmentals: Effect of l2 experience on prosody and fluency characteristics of l2 speech,”Studies in second language acquisi- tion, vol. 28, no. 1, pp. 1–30, 2006
2006
-
[47]
Variation adds to prosodic typology,
E. Grabe, “Variation adds to prosodic typology,” inProceedings of the speech prosody 2002 conference. Laboratoire Parole et Langage Aix-en-Provence, France, 2002, pp. 127–132
2002
-
[48]
De- tection of cross-dataset fake audio based on prosodic and pronun- ciation features,
C. Wang, J. Yi, J. Tao, C. Y . Zhang, S. Zhang, and X. Chen, “De- tection of cross-dataset fake audio based on prosodic and pronun- ciation features,” inProc. Interspeech 2023, 2023, pp. 3844–3848
2023
-
[49]
Investigating voiced and unvoiced regions of speech for audio deepfake detection,
G. Sivaraman, H. Tak, and E. Khoury, “Investigating voiced and unvoiced regions of speech for audio deepfake detection,” in ICASSP 2025 - 2025 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5
2025
-
[50]
B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa- tdnn: Emphasized channel attention, propagation and ag- gregation in tdnn based speaker verification,”arXiv preprint arXiv:2005.07143, 2020
-
[51]
Prosodic struc- ture beyond lexical content: A study of self-supervised learning,
S. Wallbridge, C. Minixhofer, C. Lai, and P. Bell, “Prosodic struc- ture beyond lexical content: A study of self-supervised learning,” arXiv preprint arXiv:2506.02584, 2025
-
[52]
Lib- rispeech: An asr corpus based on public domain audio books,
V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- rispeech: An asr corpus based on public domain audio books,” in2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210
2015
-
[53]
Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,
J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” inInter- national Conference on Machine Learning. PMLR, 2021
2021
-
[54]
arXiv preprint arXiv:2406.04904 , year=
E. Casanova, K. Davis, E. G ¨olge, G. G ¨oknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemiet al., “Xtts: a massively multilingual zero-shot text-to-speech model,”arXiv preprint arXiv:2406.04904, 2024
-
[55]
Fastpitch: Parallel text-to-speech with pitch pre- diction,
A. Ła ´ncucki, “Fastpitch: Parallel text-to-speech with pitch pre- diction,” inICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6588–6592
2021
-
[56]
Grad-tts: A diffusion probabilistic model for text-to-speech,
V . Popov, I. V ovk, V . Gogoryan, T. Sadekova, and M. Kudinov, “Grad-tts: A diffusion probabilistic model for text-to-speech,” in International conference on machine learning. PMLR, 2021, pp. 8599–8608
2021
-
[57]
Starganv2-vc: A diverse, unsupervised, non-parallel framework for natural-sounding voice conversion,
Y . A. Li, A. Zare, and N. Mesgarani, “Starganv2-vc: A diverse, unsupervised, non-parallel framework for natural-sounding voice conversion,”arXiv preprint arXiv:2107.10394, 2021
-
[58]
Diffusion-based voice conversion with fast maximum likelihood sampling scheme,
V . Popov, I. V ovk, V . Gogoryan, T. Sadekova, M. Kudinov, and J. Wei, “Diffusion-based voice conversion with fast maximum likelihood sampling scheme,”arXiv preprint arXiv:2109.13821, 2021
-
[59]
V oice conversion using speech-to- speech neuro-style transfer
E. A. AlBadawy and S. Lyu, “V oice conversion using speech-to- speech neuro-style transfer.” inInterspeech, 2020, pp. 4726–4730
2020
-
[60]
Raw- boost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,
H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “Raw- boost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” inICASSP 2022- 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6382–6386
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.