pith. machine review for the scientific record.

arxiv: 2604.13229 · v1 · submitted 2026-04-14 · 📡 eess.AS

Recognition: unknown

ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

Aurosweta Mahapatra, Berrak Sisman, Ismail Rasim Ulgen, Kong Aik Lee, Nicholas Andrews

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 13:26 UTC · model grok-4.3

classification 📡 eess.AS
keywords speech deepfake detection · prosodic representations · masked prediction · expressive attacks · emotional speech · generalization · ASVspoof

The pith

Learning prosodic patterns from real speech enables better detection of expressive speech deepfakes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current speech deepfake detectors often fail on expressive and emotional attacks because they overfit to spoof-specific artifacts in training data rather than learning general properties of natural speech. ProSDD counters this with a two-stage approach: the first stage uses supervised masked prediction to capture speaker-conditioned variations in pitch, energy, and voice activity from real audio only. The second stage then combines this prosodic objective with spoof classification. When trained on standard ASVspoof sets, the resulting model cuts error rates substantially on both standard and emotional test sets. A reader would care because this shifts focus from chasing fakes to modeling real variability, which humans already use to spot deviations.

Core claim

ProSDD is a two-stage framework in which Stage I learns prosodic variability from real speech via supervised masked prediction of speaker-conditioned features based on pitch, voice activity, and energy, while Stage II jointly optimizes the same objective with spoof classification to improve generalization against expressive and emotional attacks.

What carries the argument

Supervised masked prediction of speaker-conditioned prosodic variation, which enriches embeddings with natural speech cues before spoof classification.
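
To make that objective concrete, here is a minimal PyTorch-style sketch of masked prosody prediction over frame-level features. The encoder choice, dimensions, mask ratio, and equal loss weighting are illustrative assumptions (the paper builds on an SSL backbone and conditions on the speaker, both omitted here), not the authors' implementation.

```python
# Minimal sketch of a Stage I-style objective: mask random frames and predict
# pitch, energy, and voice activity for the masked positions from context.
# Architecture, sizes, and mask ratio are illustrative assumptions only.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedProsodyPredictor(nn.Module):
    def __init__(self, feat_dim=768, hidden=256, mask_ratio=0.15):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.mask_token = nn.Parameter(torch.zeros(feat_dim))
        self.context = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)
        self.pitch_head = nn.Linear(2 * hidden, 1)    # F0 regression
        self.energy_head = nn.Linear(2 * hidden, 1)   # frame-energy regression
        self.vad_head = nn.Linear(2 * hidden, 1)      # voice-activity logits

    def forward(self, frames, pitch, energy, vad):
        # frames: (B, T, feat_dim) upstream features; pitch/energy/vad: (B, T) targets.
        mask = torch.rand(frames.shape[:2], device=frames.device) < self.mask_ratio
        x = torch.where(mask.unsqueeze(-1), self.mask_token, frames)
        h, _ = self.context(x)
        m, denom = mask.float(), mask.float().sum().clamp(min=1.0)
        # Losses are computed only on masked frames, so prosody must be inferred
        # from the surrounding context rather than copied from the input.
        l_pitch = (m * (self.pitch_head(h).squeeze(-1) - pitch) ** 2).sum() / denom
        l_energy = (m * (self.energy_head(h).squeeze(-1) - energy) ** 2).sum() / denom
        l_vad = (m * F.binary_cross_entropy_with_logits(
            self.vad_head(h).squeeze(-1), vad, reduction="none")).sum() / denom
        return l_pitch + l_energy + l_vad
```

Training a loss of this shape on bona fide audio only is what lets the embeddings encode natural prosodic variability before any spoof label is seen.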

If this is right

  • Training on ASVspoof 2019 data yields 16.14% EER on ASVspoof 2024 instead of 25.43% (relative reductions are worked through in the sketch after this list).
  • Training on ASVspoof 2024 data yields 7.38% EER on ASVspoof 2024 instead of 39.62%.
  • Approximately 50% relative EER reductions occur on EmoFake and EmoSpoof-TTS.
  • Joint optimization of prosody prediction and spoof classification works without requiring spoof-heavy training data.
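
For readers less familiar with the metric, a small sketch of how an equal error rate is typically computed, plus the relative reductions implied by the figures above. The EER routine is a coarse approximation, and the score and label arrays would come from a trained detector; only the two quoted numbers are from the paper.

```python
# Sketch: a coarse equal-error-rate (EER) estimate from detector scores, plus
# the relative reductions implied by the numbers quoted above.
import numpy as np

def eer(scores, labels):
    # labels: 1 = bona fide, 0 = spoof; higher score = more bona fide-like.
    best = 1.0
    for t in np.unique(scores):
        far = np.mean(scores[labels == 0] >= t)  # spoofs accepted
        frr = np.mean(scores[labels == 1] < t)   # bona fide rejected
        best = min(best, max(far, frr))          # EER sits where FAR and FRR cross
    return best

def relative_reduction(baseline, ours):
    return 100.0 * (baseline - ours) / baseline

print(relative_reduction(25.43, 16.14))  # ~36.5% (ASVspoof 2024, 2019-trained)
print(relative_reduction(39.62, 7.38))   # ~81.4% (ASVspoof 2024, 2024-trained)
```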

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same prosody pretraining could be tested on other audio classification tasks that suffer from stylistic variation, such as speaker verification under emotion shifts.
  • If the learned representations capture universal natural-speech properties, they might allow effective detection even when the spoof generator is completely unseen.
  • Combining this approach with self-supervised audio models pretrained on larger unlabeled corpora could further reduce dependence on labeled real-speech data for Stage I.

Load-bearing premise

The prosodic representations learned from real speech in Stage I transfer as generalizable cues for distinguishing expressive and emotional spoofing attacks rather than capturing training-set specific patterns.

What would settle it

If ablating the Stage I masked prosodic prediction objective while keeping Stage II intact leaves the reported EER reductions on EmoFake and EmoSpoof-TTS essentially unchanged, the claim that prosody learning drives the gains would be falsified.
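
A sketch of what that settling experiment could look like as a Stage II objective with the prosodic term switchable; the loss weight `lam` and the interfaces are assumptions for illustration, not the paper's code.

```python
# Sketch of the ablation: Stage II spoof classification with an optional
# masked-prosody auxiliary term. `lam` and the interfaces are assumptions.
import torch
import torch.nn.functional as F

def stage2_loss(spoof_logits, spoof_labels, prosody_loss, lam=0.5, use_prosody=True):
    # Binary bona fide vs. spoof classification loss.
    l_cls = F.binary_cross_entropy_with_logits(spoof_logits, spoof_labels)
    if use_prosody:
        return l_cls + lam * prosody_loss   # full ProSDD-style joint objective
    # Ablated run: if the EmoFake / EmoSpoof-TTS gains persist in this setting,
    # prosody learning is not what drives them.
    return l_cls
```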

Figures

Figures reproduced from arXiv: 2604.13229 by Aurosweta Mahapatra, Berrak Sisman, Ismail Rasim Ulgen, Kong Aik Lee, Nicholas Andrews.

Figure 1. Two-stage training framework of ProSDD. Stage I learns speaker-conditioned prosodic representations from real speech, and Stage II jointly optimizes spoof classification with supervised masked prediction.
original abstract

Speech deepfake detection (SDD) systems perform well on standard benchmark datasets but often fail to generalize to expressive and emotional spoofing attacks. Many methods rely on spoof-heavy training data, learning dataset-specific artifacts rather than transferable cues of natural speech. In contrast, humans internalize variability in real speech and detect fakes as deviations from it. We introduce ProSDD, a two-stage framework that enriches model embeddings through supervised masked prediction of speaker-conditioned prosodic variation based on pitch, voice activity, and energy. Stage I learns prosodic variability from real speech, and Stage II jointly optimizes this objective with spoof classification. ProSDD consistently outperforms baselines under both ASVspoof 2019 and 2024 training, reducing ASVspoof 2024 EER from 25.43% to 16.14% (2019-trained) and from 39.62% to 7.38% (2024-trained), while achieving 50% relative reductions on EmoFake and EmoSpoof-TTS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces ProSDD, a two-stage framework for speech deepfake detection. In Stage I, it learns prosodic representations from real speech using supervised masked prediction of speaker-conditioned pitch, voice activity detection (VAD), and energy. Stage II jointly optimizes this with spoof classification. The method is claimed to outperform baselines on ASVspoof 2019 and 2024 datasets, achieving EER reductions on ASVspoof 2024 from 25.43% to 16.14% when trained on 2019 data and from 39.62% to 7.38% when trained on 2024 data, along with 50% relative EER reductions on EmoFake and EmoSpoof-TTS.

Significance. If the central claims hold after verification, this work has potential significance in addressing the generalization challenges of speech deepfake detection systems to expressive and emotional attacks. By focusing on learning natural prosodic variability from real speech rather than relying on spoof-specific artifacts, it offers a more human-like approach to detection. The reported performance gains suggest it could contribute to more robust SDD models.

major comments (2)
  1. [Abstract] The abstract presents concrete EER reductions on named benchmarks but provides no details on baseline implementations, statistical significance, data splits, or ablation studies. This lack of information prevents verification that the gains support the central claim of improved generalization via prosodic representations.
  2. [Stage I description] The transferability of prosodic representations learned via supervised masked prediction on real speech to detecting expressive/emotional spoofs is load-bearing for the claims. Without explicit details confirming that the real speech data used in Stage I is fully disjoint from all evaluation sets (speakers, emotions, recording conditions), it is unclear whether the embeddings capture general cues or training-set specific patterns.
minor comments (1)
  1. [Abstract] The term 'speaker-conditioned prosodic variation' could be clarified with a brief definition or reference to how conditioning is implemented.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the verifiability of our claims. We address each point below and have revised the manuscript to provide additional clarity on experimental details and data usage.

point-by-point responses
  1. Referee: [Abstract] The abstract presents concrete EER reductions on named benchmarks but provides no details on baseline implementations, statistical significance, data splits, or ablation studies. This lack of information prevents verification that the gains support the central claim of improved generalization via prosodic representations.

    Authors: We agree that the abstract's length constraints limit the inclusion of full experimental details. The manuscript body addresses these aspects comprehensively: baseline implementations (including AASIST and other SOTA models) are specified in Section 4.2, statistical significance is evaluated through repeated runs with standard deviations reported in Tables 1–3, data splits adhere to official ASVspoof protocols as described in Section 4.1, and ablation studies isolating the prosodic components appear in Section 5.2. To improve immediate verifiability from the abstract, we have added a concise clause referencing the detailed analyses and ablations provided in the paper. revision: yes

  2. Referee: [Stage I description] The transferability of prosodic representations learned via supervised masked prediction on real speech to detecting expressive/emotional spoofs is load-bearing for the claims. Without explicit details confirming that the real speech data used in Stage I is fully disjoint from all evaluation sets (speakers, emotions, recording conditions), it is unclear whether the embeddings capture general cues or training-set specific patterns.

    Authors: We confirm that Stage I training uses exclusively bona fide utterances from the training partitions of each dataset, with no overlap in speakers, utterances, emotions, or recording conditions relative to any evaluation sets. For ASVspoof 2019/2024 experiments, only the official training bona fide data is employed; cross-dataset tests on EmoFake and EmoSpoof-TTS draw from entirely separate real-speech corpora. This design ensures the learned representations reflect general prosodic variability. We have added an explicit paragraph and data-disjointness table in the revised Section 3.1 to document these splits. revision: yes
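
The disjointness audit described in this response is easy to make mechanical. A minimal sketch, assuming hypothetical tab-separated manifests with a speaker ID in the first column; file names and format are placeholders, not the authors' artifacts.

```python
# Sketch of a speaker-disjointness check between Stage I training data and the
# evaluation sets. File names and the manifest format are hypothetical.
def load_speakers(manifest_path):
    with open(manifest_path) as f:
        return {line.split("\t")[0] for line in f if line.strip()}

stage1 = load_speakers("stage1_bonafide_train.tsv")
for name in ("asvspoof2024_eval.tsv", "emofake_eval.tsv", "emospoof_tts_eval.tsv"):
    overlap = stage1 & load_speakers(name)
    print(name, "overlapping speakers:", sorted(overlap))  # all should be empty
```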

Circularity Check

0 steps flagged

No significant circularity in ProSDD's empirical two-stage framework

full rationale

The paper describes a standard empirical ML pipeline: Stage I performs supervised masked prediction of prosodic attributes (pitch, VAD, energy) on real speech to learn representations, while Stage II jointly optimizes the same objective with a spoof classification loss. No equations, definitions, or self-citations are present that make the final detection performance equivalent to the training inputs by construction. Reported EER reductions on ASVspoof and EmoFake benchmarks are presented as experimental outcomes rather than derived quantities, so the claims remain open to external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review limits identification of exact free parameters or invented entities; the approach rests on the domain assumption that prosodic variability modeling from real data provides transferable detection cues.

axioms (1)
  • domain assumption Prosodic features (pitch, voice activity, energy) capture transferable cues of natural speech variability useful for distinguishing fakes.
    Stage I relies on this to learn from real speech; invoked in the description of the supervised masked prediction objective.
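
To make the assumed features concrete, a minimal sketch of extracting frame-level pitch, energy, and a crude voice-activity flag from a single utterance. The file path, frame settings, and energy threshold are illustrative, and the calls are standard librosa, not the authors' pipeline.

```python
# Sketch: frame-level prosodic targets (pitch, energy, voice activity) from one
# waveform. Thresholds and frame parameters are illustrative assumptions.
import librosa
import numpy as np

y, sr = librosa.load("bonafide_utt.wav", sr=16000)  # hypothetical input file
hop = 256

# F0 track with voicing decisions from probabilistic YIN.
f0, voiced_flag, _ = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"),
    sr=sr, frame_length=1024, hop_length=hop)

# Frame energy as RMS on the same hop grid.
energy = librosa.feature.rms(y=y, frame_length=1024, hop_length=hop)[0]

n = min(len(f0), len(energy))
pitch = np.nan_to_num(f0[:n])                                      # unvoiced frames have NaN F0
vad = (voiced_flag[:n] | (energy[:n] > 0.02)).astype(np.float32)   # crude voice-activity target
energy = energy[:n]
print(pitch.shape, energy.shape, vad.shape)
```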

pith-pipeline@v0.9.0 · 5502 in / 1529 out tokens · 45387 ms · 2026-05-10T13:26:39.993849+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

60 extracted references · 16 canonical work pages · 1 internal anchor

  1. [1]

    ProSDD: Learning Prosodic Representations for Speech Deepfake Detection against Expressive and Emotional Attacks

    Introduction Speech deepfake detection (SDD) aims to distinguish synthetic speech generated by text-to-speech (TTS) and voice conversion (VC) systems from genuine human speech [1, 2]. As synthesis models advance in naturalness, speaker similarity, and emotional expressiveness, this task becomes increasingly challenging [3, 4]. Over the past decade, ...

  2. [2]

    However, their robustness to emotional and expressive synthetic speech remains limited

    Related Work Self-supervised learning (SSL) backbones are widely used for speech deepfake detection due to strong performance [19, 20, 29]. However, their robustness to emotional and expressive synthetic speech remains limited. Studies show that state-of-the-art SDD systems exhibit performance variations depending on emotion that can be exploited through...

  3. [3]

    Figure 1 illustrates the overall architecture

    Proposed Method We introduce ProSDD, a two-stage speech deepfake detection framework that enriches the contextual representations of a pre-trained SSL backbone through supervised modeling of speaker-conditioned prosodic variation. Figure 1 illustrates the overall architecture. In Stage I, the backbone is fine-tuned using only real speech with a supervi...

  4. [4]

    Dataset. We group data into training, standard benchmarks, and emotional/expressive benchmarks

    Experimental Setup We describe the datasets, baselines, and implementation details used to evaluate ProSDD. Dataset. We group data into training, standard benchmarks, and emotional/expressive benchmarks. Stage I training uses LibriSpeech train-clean-100 and dev (bona fide only) [44]. Stage II training uses ASVspoof 2019 LA train/dev [10] and ASVspoof 202...

  5. [5]

    w/o MP-SI

    Results We evaluate ProSDD on traditional benchmarks and challenging emotional/expressive datasets to assess both in-domain performance and cross-domain generalization. Performance on Traditional Benchmarks. Table 1 shows that ProSDD maintains strong performance on ASVspoof 2019 and 2021 under both training settings. When trained on ASVspoof 2019, ProS...

  6. [6]

    Conclusion We introduced ProSDD, a two-stage speech deepfake detection framework that first learns structured speaker-conditioned prosodic representations from bona fide speech via supervised masked prediction, and then jointly optimizes spoof classification with prosodic supervision as an auxiliary task. ProSDD substantially improves robustness for e...

  7. [7]

    • The Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the ARTS Program, Contract #D2023-2308110001

    Acknowledgments This work was supported by: • The National Science Foundation (NSF) CAREER Award IIS-2533652. • The Office of the Director of National Intelligence (ODNI), Intelligence Advanced Research Projects Activity (IARPA), via the ARTS Program, Contract #D2023-2308110001. The views and conclusions contained herein are those of the authors and shoul...

  8. [8]

    These tools were not used to generate scientific content, results, experimental designs, analyses, or conclusions

    Generative AI Use Disclosure Generative AI tools were employed solely for language polishing of text written by the authors. These tools were not used to generate scientific content, results, experimental designs, analyses, or conclusions. All authors are responsible for the full content of this paper and consent to its submission

  9. [9]

    Battling voice spoofing: a review, comparative analysis, and generalizability evaluation of state-of-the-art voice spoofing countermeasures,

    A. Khan, K. M. Malik, J. Ryan, and M. Saravanan, “Battling voice spoofing: a review, comparative analysis, and generalizability evaluation of state-of-the-art voice spoofing countermeasures,” Artificial Intelligence Review, vol. 56, no. Suppl 1, pp. 513–566, 2023

  10. [10]

    A survey on speech deepfake detection,

    M. Li, Y. Ahmadiadli, and X.-P. Zhang, “A survey on speech deepfake detection,” ACM Computing Surveys, vol. 57, no. 7, pp. 1–38, 2025

  11. [11]

    Can emotion fool anti-spoofing?

    A. Mahapatra, I. R. Ulgen, A. Reddy Naini, C. Busso, and B. Sisman, “Can emotion fool anti-spoofing?” in Proc. Interspeech 2025, 2025, pp. 5628–5632

  12. [12]

    Emofake: An initial dataset for emotion fake audio detection,

    Y. Zhao, J. Yi, J. Tao, C. Wang, and Y. Dong, “Emofake: An initial dataset for emotion fake audio detection,” in China National Conference on Chinese Computational Linguistics. Springer, 2024, pp. 419–433

  13. [13]

    Mlaad: The multi-language audio anti-spoofing dataset,

    N. M. Müller, P. Kawa, W. H. Choong, E. Casanova, E. Gölge, T. Müller, P. Syga, P. Sperl, and K. Böttinger, “Mlaad: The multi-language audio anti-spoofing dataset,” in 2024 International Joint Conference on Neural Networks (IJCNN). IEEE, 2024, pp. 1–7

  14. [14]

    Does Audio Deepfake Detection Generalize?

    N. M. Müller, P. Czempin, F. Dieckmann, A. Froghyar, and K. Böttinger, “Does audio deepfake detection generalize?” arXiv preprint arXiv:2203.16263, 2022

  15. [15]

    Asvspoof 2015: Automatic speaker verification spoofing and countermeasures challenge evaluation plan,

    Z. Wu, T. Kinnunen, N. Evans, and J. Yamagishi, “Asvspoof 2015: Automatic speaker verification spoofing and countermeasures challenge evaluation plan,” Training, vol. 10, no. 15, p. 3750, 2014

  16. [16]

    Add 2022: the first audio deep synthesis detection challenge,

    J. Yi, R. Fu, J. Tao, S. Nie, H. Ma, C. Wang, T. Wang, Z. Tian, Y. Bai, C. Fan, S. Liang, S. Wang, S. Zhang, X. Yan, L. Xu, Z. Wen, and H. Li, “Add 2022: the first audio deep synthesis detection challenge,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 9216–9220

  17. [17]

    Add 2023: the second audio deepfake detection challenge,

    J. Yi, J. Tao, R. Fu, X. Yan, C. Wang, T. Wang, C. Y. Zhang, X. Zhang, Y. Zhao, Y. Ren et al., “Add 2023: the second audio deepfake detection challenge,” arXiv preprint arXiv:2305.13774, 2023

  18. [18]

    Asvspoof 2019: A large-scale public database of synthesized, converted and replayed speech,

    X. Wang, J. Yamagishi, M. Todisco, H. Delgado, A. Nautsch, N. Evans, M. Sahidullah, V. Vestman, T. Kinnunen, K. A. Lee, L. Juvela, P. Alku, Y.-H. Peng, H.-T. Hwang, Y. Tsao, H.-M. Wang, S. L. Maguer, M. Becker, F. Henderson, R. Clark, Y. Zhang, Q. Wang, Y. Jia, K. Onuma, K. Mushika, T. Kaneda, Y. Jiang, L.-J. Liu, Y.-C. Wu, W.-C. Huang, T. Toda, K....

  19. [19]

    Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,

    X. Liu, X. Wang, M. Sahidullah, J. Patino, H. Delgado, T. Kinnunen, M. Todisco, J. Yamagishi, N. Evans, A. Nautsch, and K. A. Lee, “Asvspoof 2021: Towards spoofed and deepfake speech detection in the wild,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 2507–2522, 2023

  20. [20]

    ASVspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,

    X. Wang, H. Delgado, H. Tak, J.-w. Jung, H.-j. Shim, M. Todisco, I. Kukanov, X. Liu, M. Sahidullah, T. Kinnunen et al., “Asvspoof 5: Crowdsourced speech data, deepfakes, and adversarial attacks at scale,” arXiv preprint arXiv:2408.08739, 2024

  21. [21]

    Hula: Prosody-aware anti-spoofing with multi-task learning for expressive and emotional synthetic speech,

    A. Mahapatra, I. R. Ulgen, and B. Sisman, “Hula: Prosody-aware anti-spoofing with multi-task learning for expressive and emotional synthetic speech,” arXiv preprint arXiv:2509.21676, 2025

  22. [22]

    End-to-end anti-spoofing with rawnet2,

    H. Tak, J. Patino, M. Todisco, A. Nautsch, N. Evans, and A. Larcher, “End-to-end anti-spoofing with rawnet2,” in ICASSP 2021 - 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 6369–6373

  23. [23]

    Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,

    J.-w. Jung, H.-S. Heo, H. Tak, H.-j. Shim, J. S. Chung, B.-J. Lee, H.-J. Yu, and N. Evans, “Aasist: Audio anti-spoofing using integrated spectro-temporal graph attention networks,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6367–6371

  24. [24]

    Light convolutional neural network with feature genuinization for detection of synthetic speech attacks,

    Z. Wu, R. K. Das, J. Yang, and H. Li, “Light convolutional neural network with feature genuinization for detection of synthetic speech attacks,” arXiv preprint arXiv:2009.09637, 2020

  25. [25]

    Assert: Anti-spoofing with squeeze-excitation and residual networks,

    C.-I. Lai, N. Chen, J. Villalba, and N. Dehak, “Assert: Anti-spoofing with squeeze-excitation and residual networks,” arXiv preprint arXiv:1904.01120, 2019

  26. [26]

    Spoof detection using voice contribution on lfcc features and resnet-34,

    K. Z. Mon, K. Galajit, C. O. Mawalim, J. Karnjana, T. Isshiki, and P. Aimmanee, “Spoof detection using voice contribution on lfcc features and resnet-34,” in 2023 18th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP). IEEE, 2023, pp. 1–6

  27. [27]

    Audio deepfake detection with self-supervised xls-r and sls classifier,

    Q. Zhang, S. Wen, and T. Hu, “Audio deepfake detection with self-supervised xls-r and sls classifier,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 6765–6773

  28. [28]

    Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,

    H. Tak, M. Todisco, X. Wang, J.-w. Jung, J. Yamagishi, and N. Evans, “Automatic speaker verification spoofing and deepfake detection using wav2vec 2.0 and data augmentation,” arXiv preprint arXiv:2202.12233, 2022

  29. [29]

    Pitch imperfect: Detecting audio deepfakes through acoustic prosodic analysis,

    K. Warren, D. Olszewski, S. Layton, K. Butler, C. Gates, and P. Traynor, “Pitch imperfect: Detecting audio deepfakes through acoustic prosodic analysis,” arXiv preprint arXiv:2502.14726, 2025

  30. [30]

    Deepfake speech detection through emotion recognition: a semantic approach,

    E. Conti, D. Salvi, C. Borrelli, B. Hosler, P. Bestagini, F. Antonacci, A. Sarti, M. C. Stamm, and S. Tubaro, “Deepfake speech detection through emotion recognition: a semantic approach,” in ICASSP 2022-2022 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2022, pp. 8962–8966

  31. [31]

    Combining automatic speaker verification and prosody analysis for synthetic speech detection,

    L. Attorresi, D. Salvi, C. Borrelli, P. Bestagini, and S. Tubaro, “Combining automatic speaker verification and prosody analysis for synthetic speech detection,” in International Conference on Pattern Recognition. Springer, 2022, pp. 247–263

  32. [32]

    Emoanti: audio anti-deepfake with refined emotion-guided representations,

    X. Li, Y. Gong, D. Zou, X. Cao, and S. Lee, “Emoanti: audio anti-deepfake with refined emotion-guided representations,” arXiv preprint arXiv:2509.10781, 2025

  33. [33]

    Finding the human voice in ai: Insights on the perception of ai-voice clones from naturalness and similarity ratings,

    L. Bakkouche, C. McGhee, E. Lau, S. Cooper, X. Luo, M. Rees, K. Alter, B. Post, and J. Schwarz, “Finding the human voice in ai: Insights on the perception of ai-voice clones from naturalness and similarity ratings,” in Proc. Interspeech 2025, 2025, pp. 2190–2194

  34. [34]

    How do users perceive deepfake personas? investigating the deepfake user perception and its implications for human-computer interaction,

    I. Kaate, J. Salminen, S.-G. Jung, H. Almerekhi, and B. J. Jansen, “How do users perceive deepfake personas? investigating the deepfake user perception and its implications for human-computer interaction,” in Proceedings of the 15th Biannual Conference of the Italian SIGCHI Chapter, 2023, pp. 1–12

  35. [35]

    Subjective perception and objective evaluation of speech naturalness for deepfake detection,

    S. Zhang, L. Peng, L. Xie, and Z. Zhao, “Subjective perception and objective evaluation of speech naturalness for deepfake detection,” in 2025 International Conference on Culture-Oriented Science & Technology (CoST), 2025, pp. 65–70

  36. [36]

    Slim: Style-linguistics mismatch model for generalized audio deepfake detection,

    Y. Zhu, S. Koppisetti, T. Tran, and G. Bharaj, “Slim: Style-linguistics mismatch model for generalized audio deepfake detection,” Advances in Neural Information Processing Systems, vol. 37, pp. 67901–67928, 2024

  37. [37]

    Xlsr-mamba: A dual-column bidirectional state space model for spoofing attack detection,

    Y. Xiao and R. K. Das, “Xlsr-mamba: A dual-column bidirectional state space model for spoofing attack detection,” IEEE Signal Processing Letters, vol. 32, pp. 1276–1280, 2025

  38. [38]

    Emoq-tts: Emotion intensity quantization for fine-grained controllable emotional text-to-speech,

    C.-B. Im, S.-H. Lee, S.-B. Kim, and S.-W. Lee, “Emoq-tts: Emotion intensity quantization for fine-grained controllable emotional text-to-speech,” in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 6317–6321

  39. [39]

    Glow-tts: A generative flow for text-to-speech via monotonic alignment search,

    J. Kim, S. Kim, J. Kong, and S. Yoon, “Glow-tts: A generative flow for text-to-speech via monotonic alignment search,” Advances in Neural Information Processing Systems, vol. 33, pp. 8067–8077, 2020

  40. [40]

    F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,

    Y. Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. Zhao, K. Yu, and X. Chen, “F5-tts: A fairytaler that fakes fluent and faithful speech with flow matching,” arXiv preprint arXiv:2410.06885, 2024

  41. [41]

    Laugh now cry later: Controlling time-varying emotional states of flow-matching-based zero-shot text-to-speech,

    H. Wu, X. Wang, S. E. Eskimez, M. Thakker, D. Tompkins, C.-H. Tsai, C. Li, Z. Xiao, S. Zhao, J. Li, and N. Kanda, “Laugh now cry later: Controlling time-varying emotional states of flow-matching-based zero-shot text-to-speech,” in 2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 690–697

  42. [42]

    Expressive-vc: Highly expressive voice conversion with attention fusion of bottleneck and perturbation features,

    Z. Ning, Q. Xie, P. Zhu, Z. Wang, L. Xue, J. Yao, L. Xie, and M. Bi, “Expressive-vc: Highly expressive voice conversion with attention fusion of bottleneck and perturbation features,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5

  43. [43]

    Pmvc: Data augmentation-based prosody modeling for expressive voice conversion,

    Y. Deng, H. Tang, X. Zhang, J. Wang, N. Cheng, and J. Xiao, “Pmvc: Data augmentation-based prosody modeling for expressive voice conversion,” in Proceedings of the 31st ACM International Conference on Multimedia, 2023, pp. 184–192

  44. [44]

    Emotion intensity and its control for emotional voice conversion,

    K. Zhou, B. Sisman, R. Rana, B. W. Schuller, and H. Li, “Emotion intensity and its control for emotional voice conversion,” IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 31–48, 2022

  45. [45]

    Prosody in context: A review,

    J. Cole, “Prosody in context: A review,” Language, Cognition and Neuroscience, vol. 30, no. 1-2, pp. 1–31, 2015

  46. [46]

    Learning second language suprasegmentals: Effect of l2 experience on prosody and fluency characteristics of l2 speech,

    P. Trofimovich and W. Baker, “Learning second language suprasegmentals: Effect of l2 experience on prosody and fluency characteristics of l2 speech,” Studies in second language acquisition, vol. 28, no. 1, pp. 1–30, 2006

  47. [47]

    Variation adds to prosodic typology,

    E. Grabe, “Variation adds to prosodic typology,” in Proceedings of the speech prosody 2002 conference. Laboratoire Parole et Langage Aix-en-Provence, France, 2002, pp. 127–132

  48. [48]

    Detection of cross-dataset fake audio based on prosodic and pronunciation features,

    C. Wang, J. Yi, J. Tao, C. Y. Zhang, S. Zhang, and X. Chen, “Detection of cross-dataset fake audio based on prosodic and pronunciation features,” in Proc. Interspeech 2023, 2023, pp. 3844–3848

  49. [49]

    Investigating voiced and unvoiced regions of speech for audio deepfake detection,

    G. Sivaraman, H. Tak, and E. Khoury, “Investigating voiced and unvoiced regions of speech for audio deepfake detection,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5

  50. [50]

    Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification. arXiv preprint arXiv:2005.07143,

    B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” arXiv preprint arXiv:2005.07143, 2020

  51. [51]

    Prosodic structure beyond lexical content: A study of self-supervised learning,

    S. Wallbridge, C. Minixhofer, C. Lai, and P. Bell, “Prosodic structure beyond lexical content: A study of self-supervised learning,” arXiv preprint arXiv:2506.02584, 2025

  52. [52]

    Librispeech: An asr corpus based on public domain audio books,

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An asr corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210

  53. [53]

    Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,

    J. Kim, J. Kong, and J. Son, “Conditional variational autoencoder with adversarial learning for end-to-end text-to-speech,” in International Conference on Machine Learning. PMLR, 2021

  54. [54]

    Xtts: a massively multilingual zero-shot text-to-speech model,

    E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi et al., “Xtts: a massively multilingual zero-shot text-to-speech model,” arXiv preprint arXiv:2406.04904, 2024

  55. [55]

    Fastpitch: Parallel text-to-speech with pitch prediction,

    A. Łańcucki, “Fastpitch: Parallel text-to-speech with pitch prediction,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 6588–6592

  56. [56]

    Grad-tts: A diffusion probabilistic model for text-to-speech,

    V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, and M. Kudinov, “Grad-tts: A diffusion probabilistic model for text-to-speech,” in International conference on machine learning. PMLR, 2021, pp. 8599–8608

  57. [57]

    Starganv2-vc: A diverse, unsupervised, non-parallel framework for natural-sounding voice conversion,

    Y. A. Li, A. Zare, and N. Mesgarani, “Starganv2-vc: A diverse, unsupervised, non-parallel framework for natural-sounding voice conversion,” arXiv preprint arXiv:2107.10394, 2021

  58. [58]

    Diffusion-based voice conversion with fast maximum likelihood sampling scheme,

    V. Popov, I. Vovk, V. Gogoryan, T. Sadekova, M. Kudinov, and J. Wei, “Diffusion-based voice conversion with fast maximum likelihood sampling scheme,” arXiv preprint arXiv:2109.13821, 2021

  59. [59]

    Voice conversion using speech-to-speech neuro-style transfer

    E. A. AlBadawy and S. Lyu, “Voice conversion using speech-to-speech neuro-style transfer.” in Interspeech, 2020, pp. 4726–4730

  60. [60]

    Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,

    H. Tak, M. Kamble, J. Patino, M. Todisco, and N. Evans, “Rawboost: A raw data boosting and augmentation method applied to automatic speaker verification anti-spoofing,” in ICASSP 2022-2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6382–6386