Recognition: unknown
The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
Pith reviewed 2026-05-07 12:30 UTC · model grok-4.3
The pith
Emotion embedding similarity metrics are unsuitable for zero-shot evaluation of emotional speech generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Although emotion encoders achieve high accuracy on emotion classification tasks, their latent spaces are unsuitable for zero-shot similarity evaluation. Linguistic and speaker interference overshadows emotional features in these representations, degrading the metric's ability to discriminate emotional content. As a result, the metric misaligns with human perception and rewards acoustic mimicry over genuine emotional synthesis.
What carries the argument
The cosine similarity between emotion embeddings extracted from reference and generated audio samples using pre-trained encoders.
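The metric under examination reduces to a single computation. A minimal sketch, assuming frame-level encoder outputs (e.g. from emotion2vec) are already available as (T, D) NumPy arrays; the temporal mean pooling follows the convention described in the paper's experimental setup, and the function name emo_sim is illustrative:

```python
import numpy as np

def emo_sim(ref_frames: np.ndarray, gen_frames: np.ndarray) -> float:
    """Cosine similarity of temporally mean-pooled embeddings.

    ref_frames, gen_frames: (T, D) frame-level outputs from a
    pre-trained emotion encoder (hypothetical inputs here).
    """
    ref = ref_frames.mean(axis=0)
    gen = gen_frames.mean(axis=0)
    return float(ref @ gen / (np.linalg.norm(ref) * np.linalg.norm(gen)))
```

Everything the metric "knows" about emotion must already be linearly separable in those pooled vectors; that implicit assumption is exactly what the paper attacks.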
If this is right
- Evaluation of expressive speech synthesis and voice conversion systems using this approach produces unreliable results.
- The metric cannot reliably distinguish genuine emotional transfer from mere acoustic copying.
- Human perception tests are required to supplement or replace embedding-based similarity for emotional evaluation.
- New metrics must address the entanglement of emotion with linguistic and speaker information.
Where Pith is reading between the lines
- Embedding-based evaluations in other audio domains may share similar vulnerabilities to non-target feature interference.
- Disentanglement techniques could be applied to emotion encoders to improve their suitability for similarity tasks.
- Real-world deployment of speech generation models should prioritize metrics validated against diverse human judgments.
Load-bearing premise
The adversarial tasks and human tests in the study reflect the metric's behavior in standard real-world speech generation evaluation settings.
What would settle it
Observing cases where human listeners rate generated speech as emotionally similar to a reference but the embedding cosine similarity is low, or the reverse, outside of the controlled test conditions.
read the original abstract
Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples. This approach computes cosine similarity of embeddings from encoders like emotion2vec, assuming they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high classification accuracy, these latent spaces are unsuitable for zero-shot similarity evaluation. Representational limitations cause linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This acoustic vulnerability reveals it rewards acoustic mimicry over genuine emotional synthesis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that emotion embeddings (e.g., from emotion2vec) achieve high classification accuracy yet remain unsuitable for zero-shot cosine-similarity evaluation of emotional prosody in speech generation. Controlled adversarial tasks that swap linguistic content or speaker identity while holding emotion labels fixed, together with human-alignment checks, demonstrate that non-emotional features dominate the latent space, causing the metric to misalign with human perception and to reward acoustic mimicry rather than genuine emotional transfer.
Significance. If substantiated, the result would be significant for the speech-generation community because it directly challenges a widely adopted evaluation practice. The use of explicit adversarial constructions and human tests supplies a falsifiable empirical basis rather than purely theoretical critique, which is a strength. However, the absence of dataset details, statistical tests, and quantitative results in the provided abstract leaves the magnitude of the degradation and the robustness of the conclusion difficult to gauge.
major comments (1)
- [Experimental setup and adversarial tasks] The central claim that the embeddings are unsuitable for typical speech-generation evaluation rests on the assumption that the controlled adversarial manipulations (content/speaker swaps) produce interference patterns representative of real end-to-end generator artifacts. If the acoustic distortions introduced by these explicit swaps differ systematically from the prosody-transfer errors that arise in actual synthesis models, the observed degradation and human misalignment may not generalize. This is load-bearing for the unsuitability conclusion.
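The adversarial protocol this comment questions can be framed as a triplet ranking test. A hedged sketch, assuming pooled embeddings are already computed; the triplet construction mirrors the paper's description (positives share the anchor's emotion, negatives do not but may share its speaker or linguistic content), while the helper names are illustrative:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_accuracy(anchors, positives, negatives) -> float:
    """Fraction of triplets ranked correctly by embedding similarity.

    Each argument is a list of pooled embeddings. A triplet counts as
    correct when the same-emotion clip (positive) scores higher than
    the different-emotion clip (negative), even when the negative
    shares speaker or content with the anchor (the adversarial case).
    """
    correct = sum(
        cosine(a, p) > cosine(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return correct / len(anchors)
```

If non-emotional features dominate the latent space, this accuracy collapses precisely in the conditions where the negative shares acoustics with the anchor, which is the degradation pattern the paper reports.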
minor comments (1)
- [Abstract] The abstract states the main conclusion but supplies no information on the datasets, exact construction of the adversarial examples, number of trials, or statistical significance tests used to support the claim of degraded discriminative ability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for acknowledging the potential significance of challenging a widely used evaluation practice. We address the single major comment below and will incorporate clarifications to strengthen the manuscript.
read point-by-point responses
-
Referee: The central claim that the embeddings are unsuitable for typical speech-generation evaluation rests on the assumption that the controlled adversarial manipulations (content/speaker swaps) produce interference patterns representative of real end-to-end generator artifacts. If the acoustic distortions introduced by these explicit swaps differ systematically from the prosody-transfer errors that arise in actual synthesis models, the observed degradation and human misalignment may not generalize. This is load-bearing for the unsuitability conclusion.
Authors: We appreciate this point on generalizability. Our adversarial swaps are not intended to replicate every acoustic artifact of end-to-end generators but to isolate the embedding space's sensitivity to non-emotional factors (linguistic content and speaker identity) under conditions where emotion labels are held fixed. These factors are precisely the confounds that arise in real prosody transfer, where models frequently fail to fully disentangle them. The core finding is that the latent space is dominated by such features regardless of how the interference is introduced, as shown by the contrast between high clean-data classification accuracy and degraded performance under swaps, plus the human misalignment results. This property of the embedding itself supports unsuitability for zero-shot cosine-similarity evaluation. To address the concern directly, we will add a dedicated paragraph in the discussion section comparing our controlled interference patterns to documented prosody-transfer errors in recent emotional TTS and voice conversion literature, thereby clarifying the link to practical generator artifacts.
revision: partial
Circularity Check
Empirical evaluation with no derivation or self-referential reduction
full rationale
The paper advances its central claim—that emotion embedding spaces like those from emotion2vec are unsuitable for zero-shot similarity evaluation—solely through controlled adversarial tasks (content/speaker swaps) and human alignment tests. These are direct experimental measurements of classification accuracy, cosine similarity degradation, and perceptual misalignment. No equations, fitted parameters, or predictive derivations appear; the results are reported outcomes rather than quantities constructed from the inputs by definition. Self-citations, if present, support background on embedding models but are not load-bearing for the unsuitability conclusion, which rests on the new empirical evidence.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: High accuracy of emotion encoders on classification tasks implies the embeddings encode emotion independently of linguistic and speaker content
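This axiom can fail even in a toy model: a space in which a linear probe classifies emotion perfectly while cosine similarity is governed almost entirely by speaker identity. A minimal illustration with hypothetical axis-aligned embeddings and hand-picked weights (0.1 for the emotion axis, 1.0 for the speaker axis); none of these values come from the paper:

```python
import numpy as np

N_EMO, N_SPK = 4, 10

def embed(emotion: int, speaker: int, emo_w=0.1, spk_w=1.0) -> np.ndarray:
    """Toy embedding: a weak emotion axis plus a strong speaker axis."""
    v = np.zeros(N_EMO + N_SPK)
    v[emotion] = emo_w
    v[N_EMO + speaker] = spk_w
    return v

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A linear probe reading the first N_EMO dimensions classifies
# emotion perfectly, for every speaker:
assert all(
    embed(e, s)[:N_EMO].argmax() == e
    for e in range(N_EMO) for s in range(N_SPK)
)

# Yet zero-shot cosine similarity ranks a different-emotion,
# same-speaker clip far above a same-emotion, different-speaker clip:
anchor   = embed(0, 0)
positive = embed(0, 1)   # same emotion, different speaker
negative = embed(1, 0)   # different emotion, same speaker
print(cos(anchor, positive), cos(anchor, negative))  # ≈0.01 vs ≈0.99
```

Classification only needs some direction in the space to separate classes; zero-shot similarity is dominated by whichever directions carry the most energy, and those need not be emotional.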
Reference graph
Works this paper leans on
-
[1]
Consequently, evaluating how well generated audio captures the desired emotional style is essential
Introduction and Background As generative speech models rapidly evolve, generating emotionally expressive speech has become a key objective across tasks such as expressive text-to-speech (TTS) and emotional voice conversion (EVC) [1, 2]. Consequently, evaluating how well generated audio captures the desired emotional style is essential. While subjecti...
-
[2]
Methodology To rigorously evaluate EMO-SIM against the three criteria outlined in Section 1, we design a systematic pipeline. We first calibrate the anisotropic latent space to prevent similarity dis... Table 1: Categorical Emotion Evaluation. Triplet accuracy (%) under controlled adversarial settings. Values are Me...
-
[3]
positive
Experimental Setup 3.1. Representation Extractors We evaluate the base emotion2vec alongside its fine-tuned emotion2vec+ variants (seed, base, and large) [6]. We also include SSL models: HuBERT [38], Wav2vec 2.0 [39], and TERA [40]. As in VERSA [30], we extract frame-level representations from the last hidden layer, applying temporal mean pooling and mean centerin...
-
[4]
Categorical Emotion Robustness Table 1 reveals EMO-SIM’s failure to capture genuine emotion
Results and Analyses 4.1. Categorical Emotion Robustness Table 1 reveals EMO-SIM’s failure to capture genuine emotion. Even in the ideal speaker-linguistic match scenario, emotion2vec and emotion2vec+ variants barely reach 60-70% accuracy. This weak inherent affective representation severely degrades under adversarial acoustic variations. With a linguistic di...
-
[5]
We hypothesize emotion2vec directly inherits acoustic representations from its foundational model, data2vec [55, 56]
Discussion and Suggestion Our findings demonstrate that high SER accuracy does not inherently translate to a perceptually meaningful latent space. We hypothesize emotion2vec directly inherits acoustic representations from its foundational model, data2vec [55, 56]. While supervised SER utilizes linear classification layers as filters to suppress these non-...
-
[6]
However, this paper demonstrates that it is fundamentally unreliable
Conclusion EMO-SIM has become the de facto objective metric for zero-shot expressive speech evaluation. However, this paper demonstrates that it is fundamentally unreliable. Our empirical analyses reveal two critical flaws: current emotion embeddings are structurally biased by acoustic distractors, and their similarity scores severely misalign with human ...
-
[7]
Acknowledgments This work was supported by the Ministry of Education (MOE) of Taiwan under the project Taiwan Centers of Excellence in Artificial Intelligence, through the NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)
-
[8]
Generative AI Use Disclosure We employed generative AI solely to improve the writing quality of this manuscript, without using it to generate any core content
-
[9]
Towards controllable speech synthesis in the era of large language models: A systematic survey,
T. Xie, Y. Rong, P. Zhang, W. Wang, and L. Liu, “Towards controllable speech synthesis in the era of large language models: A systematic survey,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguist...
2025
-
[10]
C.-J. Hsu, Y .-C. Lin, C.-C. Lin, W.-C. Chen, H. L. Chung, C.-A. Li, Y .-C. Chen, C.-Y . Yu, M.-J. Lee, C.-C. Chen, R.-H. Huang, H. yi Lee, and D.-S. Shiu, “Breezyvoice: Adapting tts for taiwanese mandarin with enhanced polyphone disambiguation – challenges and insights,” 2025. [Online]. Available: https://arxiv.org/abs/2501.17790
-
[11]
Wavlm: Large-scale self-supervised pre-training for full stack speech processing,
S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022
2022
-
[12]
Speechtokenizer: Unified speech tokenizer for speech language models,
X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu, “Speechtokenizer: Unified speech tokenizer for speech language models,” in The Twelfth International Conference on Learning Representations,
-
[13]
Available: https://openreview.net/forum?id=AF9Q8Vip84
[Online]. Available: https://openreview.net/forum?id=AF9Q8Vip84
-
[14]
Seamless: Multilingual expressive and streaming speech translation,
L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, M. Duppenthaler, P.-A. Duquenne, B. Ellis, H. Elsahar, J. Haaheim et al., “Seamless: Multilingual expressive and streaming speech translation,” arXiv preprint arXiv:2312.05187, 2023
-
[15]
emotion2vec: Self-supervised pre-training for speech emotion representation,
Z. Ma, Z. Zheng, J. Ye, J. Li, Z. Gao, S. Zhang, and X. Chen, “emotion2vec: Self-supervised pre-training for speech emotion representation,” in Findings of the Association for Computational Linguistics: ACL 2024, L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 15 747–15 760. [Online]...
2024
-
[16]
MaskGCT: Zero-shot text-to-speech with masked generative codec transformer,
Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “MaskGCT: Zero-shot text-to-speech with masked generative codec transformer,” in The Thirteenth International Conference on Learning Representations,
-
[17]
Available: https://openreview.net/forum?id=ExuBFYtCQU
[Online]. Available: https://openreview.net/forum?id=ExuBFYtCQU
-
[18]
S. Zhou, Y . Zhou, Y . He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration- controlled auto-regressive zero-shot text-to-speech,”arXiv preprint arXiv:2506.21619, 2025
-
[19]
Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS
H. Wang, C. Qiang, T. Wang, C. Gong, Q. Liu, Y . Jiang, X. Wang, C. Wang, and C. Zhang, “Emopro: A prompt selection strategy for emotional expression in lm-based speech synthesis,”arXiv preprint arXiv:2409.18512, 2024
-
[20]
Emo- sphere++: Emotion-controllable zero-shot text-to-speech via emotion-adaptive spherical vector,
D.-H. Cho, H.-S. Oh, S.-B. Kim, and S.-W. Lee, “Emo- sphere++: Emotion-controllable zero-shot text-to-speech via emotion-adaptive spherical vector,”IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 2365–2380, 2025
2025
-
[21]
Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,
X. Gao, C. Zhang, Y . Chen, H. Zhang, and N. F. Chen, “Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,” inICASSP 2025 - 2025 IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5
2025
-
[22]
Laugh now cry later: Controlling time-varying emotional states of flow-matching- based zero-shot text-to-speech,
H. Wu, X. Wang, S. E. Eskimez, M. Thakker, D. Tompkins, C.-H. Tsai, C. Li, Z. Xiao, S. Zhao, J. Li, and N. Kanda, “Laugh now cry later: Controlling time-varying emotional states of flow-matching- based zero-shot text-to-speech,” in2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 690–697
2024
-
[23]
Tts-ctrlnet: Time varying emotion aligned text-to-speech generation with controlnet,
J. Jeong, Y . Lee, M. Kwon, and Y . Uh, “Tts-ctrlnet: Time varying emotion aligned text-to-speech generation with controlnet,”arXiv preprint arXiv:2507.04349, 2025
-
[24]
DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distil- lation for Cross-Speaker Emotion Transfer in Text-to-Speech,
D.-H. Cho, H.-S. Oh, S.-B. Kim, and S.-W. Lee, “DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distil- lation for Cross-Speaker Emotion Transfer in Text-to-Speech,” in Interspeech 2025, 2025, pp. 4373–4377
2025
-
[25]
Emovoice: Llm-based emotional text-to-speech model with freestyle text prompting,
G. Yang, C. Yang, Q. Chen, Z. Ma, W. Chen, W. Wang, T. Wang, Y. Yang, Z. Niu, W. Liu et al., “Emovoice: Llm-based emotional text-to-speech model with freestyle text prompting,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 10 748–10 757
2025
-
[26]
Computational Narrative Understanding for Expressive Text-to-Speech
G. Michel, E. V . Epure, and C. Cerisara, “Libriquote: A speech dataset of fictional character utterances for expressive zero-shot speech synthesis,”arXiv preprint arXiv:2509.04072, 2025
-
[27]
An efficient emotional speech synthesis ap- proach via multi-scale feature generation,
J. Luo and J. Yang, “An efficient emotional speech synthesis ap- proach via multi-scale feature generation,” inInternational Con- ference on Intelligent Computing. Springer, 2025, pp. 3–16
2025
-
[28]
Word-level emotional expression control in zero-shot text-to-speech synthesis,
tianrui wang, H. Wang, M. Ge, C. Gong, C. Qiang, Z. Ma, Z. Huang, G. Yang, X. Wang, E. Chng, X. Chen, L. Wang, and J. Dang, “Word-level emotional expression control in zero-shot text-to-speech synthesis,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=SYcggdxX6W
2025
-
[29]
De- batts: Zero-shot debating text-to-speech synthesis,
Y . Huang, Y . Wang, J. Li, H. Guo, H. He, S. Zhang, and Z. Wu, “De- batts: Zero-shot debating text-to-speech synthesis,”arXiv preprint arXiv:2411.06540, 2024
-
[30]
Hd-ppt: Hierarchical decoding of content-and prompt-preference tokens for instruction- based tts,
S. Nie, X. Xing, J. Xing, B. Liu, and X. Xu, “Hd-ppt: Hierarchical decoding of content-and prompt-preference tokens for instruction- based tts,”arXiv preprint arXiv:2509.19001, 2025
-
[31]
Emotion-aligned generation in diffusion text to speech models via preference-guided optimization,
J. Shi, H. Du, Y . He, Y . A. Hong, and Y . Gao, “Emotion-aligned generation in diffusion text to speech models via preference-guided optimization,”arXiv preprint arXiv:2509.25416, 2025
-
[32]
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech,
M. Borisov, E. Spirin, and D. Diatlova, “NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech,” in 13th edition of the Speech Synthesis Workshop, 2025, pp. 104–109
2025
-
[33]
Durflex-evc: Duration-flexible emotional voice conversion leveraging discrete representations without text alignment,
H.-S. Oh, S.-H. Lee, D.-H. Cho, and S.-W. Lee, “Durflex-evc: Duration-flexible emotional voice conversion leveraging discrete representations without text alignment,”IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 1660–1674, 2025
2025
-
[34]
Maestro-evc: Control- lable emotional voice conversion guided by references and explicit prosody,
J. Yoon, W. Jeong, J. Gim, and Y .-J. Suh, “Maestro-evc: Control- lable emotional voice conversion guided by references and explicit prosody,”arXiv preprint arXiv:2508.06890, 2025
-
[35]
Naturalvoices: A large-scale, sponta- neous and emotional podcast dataset for voice conversion,
Z. Du, S. S. Chandra, I. R. Ulgen, A. Mahapatra, A. N. Salman, C. Busso, and B. Sisman, “Naturalvoices: A large-scale, sponta- neous and emotional podcast dataset for voice conversion,”arXiv preprint arXiv:2511.00256, 2025
-
[36]
Audio-to-audio emotion conversion with pitch and duration style transfer,
S. Dutta, A. Jain, and S. Ganapathy, “Audio-to-audio emotion conversion with pitch and duration style transfer,”arXiv preprint arXiv:2505.17655, 2025
-
[37]
S. Wang, A. Chen, and T. Zhao, “Beyond global emotion: Fine- grained emotional speech synthesis with dynamic word-level mod- ulation,”arXiv preprint arXiv:2509.20378, 2025
-
[38]
ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech,
Y. Pan, Y. Hu, Y. Yang, J. Yao, J. Ye, H. Zhou, L. Ma, and J. Zhao, “ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech,” in Interspeech 2025, 2025, pp. 4583–4587
2025
-
[39]
C. Gong, C. Qiang, T. Wang, Y . Jiang, Y . Lu, R. Jing, X. Miao, X. Zhang, L. Wang, and J. Dang, “Perturbation self-supervised representations for cross-lingual emotion tts: Stage-wise modeling of emotion and speaker,”arXiv preprint arXiv:2510.11124, 2025
-
[40]
VERSA: A versatile evaluation toolkit for speech, audio, and music,
J. Shi, H.-j. Shim, J. Tian, S. Arora, H. Wu, D. Petermann, J. Q. Yip, Y . Zhang, Y . Tang, W. Zhang, D. S. Alharthi, Y . Huang, K. Saito, J. Han, Y . Zhao, C. Donahue, and S. Watanabe, “VERSA: A versatile evaluation toolkit for speech, audio, and music,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for C...
2025
-
[41]
Emo- bias: A Large Scale Evaluation of Social Bias on Speech Emotion Recognition,
Y .-C. Lin, H. Wu, H.-C. Chou, C.-C. Lee, and H. yi Lee, “Emo- bias: A Large Scale Evaluation of Social Bias on Speech Emotion Recognition,” inInterspeech 2024, 2024, pp. 4633–4637
2024
-
[42]
On the social bias of speech self-supervised models,
Y .-C. Lin, T.-Q. Lin, H.-C. Lin, A. T. Liu, and H. yi Lee, “On the social bias of speech self-supervised models,” inInterspeech 2024, 2024, pp. 4638–4642
2024
-
[43]
EmergentTTS- eval: Evaluating TTS models on complex prosodic, expressiveness, and linguistic challenges using model-as-a-judge,
R. R. Manku, Y . Tang, X. Shi, M. Li, and A. Smola, “EmergentTTS- eval: Evaluating TTS models on complex prosodic, expressiveness, and linguistic challenges using model-as-a-judge,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. [Online]. Available: https://openreview.net/forum?id=P3JBBnh10z
2025
-
[44]
Self-supervised rewiring of pre-trained speech encoders:towards faster fine- tuning with less labels in speech processing,
H. Yang, J. Zhao, G. Haffari, and E. Shareghi, “Self-supervised rewiring of pre-trained speech encoders:towards faster fine- tuning with less labels in speech processing,” inFindings of the Association for Computational Linguistics: EMNLP 2022, Y . Goldberg, Z. Kozareva, and Y . Zhang, Eds. Abu Dhabi, United Arab Emirates: Association for Computational Li...
2022
-
[45]
G. Wisniewski, S. Guillaume, and C. R. Fern ´andez, “Assessing the impact of anisotropy in neural representations of speech: A case study on keyword spotting,”arXiv preprint arXiv:2506.11096, 2025
-
[46]
All-but-the-top: Simple and effective postprocessing for word representations,
J. Mu and P. Viswanath, “All-but-the-top: Simple and effective postprocessing for word representations,” inInternational Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=HkuGJ3kCb
2018
-
[47]
Measuring nominal scale agreement among many raters
J. L. Fleiss, “Measuring nominal scale agreement among many raters.”Psychological bulletin, vol. 76, no. 5, p. 378, 1971
1971
-
[48]
Hubert: Self-supervised speech representation learning by masked prediction of hidden units,
W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,”IEEE/ACM Trans- actions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021
2021
-
[49]
wav2vec 2.0: A framework for self-supervised learning of speech representations,
A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 12 449–12 460. [Online]. Available: https://proceedings.neur...
2020
-
[50]
Tera: Self-supervised learning of transformer encoder representation for speech,
A. T. Liu, S.-W. Li, and H.-y. Lee, “Tera: Self-supervised learning of transformer encoder representation for speech,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2351–2366, 2021
2021
-
[51]
Crema-d: Crowd-sourced emotional multimodal actors dataset,
H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,”IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014
2014
-
[52]
Msp-improv: An acted cor- pus of dyadic interactions to study emotion perception,
C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab, N. Sadoughi, and E. M. Provost, “Msp-improv: An acted cor- pus of dyadic interactions to study emotion perception,”IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 67–80, 2017
2017
-
[53]
Building naturalistic emotionally bal- anced speech corpus by retrieving emotional speech from existing podcast recordings,
R. Lotfian and C. Busso, “Building naturalistic emotionally bal- anced speech corpus by retrieving emotional speech from existing podcast recordings,”IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 471–483, 2019
2019
-
[54]
The MSP-Podcast Corpus,
C. Busso, R. Lotfian, K. Sridhar, A. N. Salman, W.-C. Lin, L. Goncalves, S. Parthasarathy, A. R. Naini, S.-G. Leem, L. Martinez-Lucas, H.-C. Chou, and P. Mote, “The MSP-Podcast Corpus,” 2025. [Online]. Available: https: //arxiv.org/abs/2509.09791
-
[55]
An intelligent infras- tructure toward large scale naturalistic affective speech corpora collection,
S. G. Upadhyay, W.-S. Chien, B.-H. Su, L. Goncalves, Y .-T. Wu, A. N. Salman, C. Busso, and C.-C. Lee, “An intelligent infras- tructure toward large scale naturalistic affective speech corpora collection,” in2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII), 2023, pp. 1–8
2023
-
[56]
NNIME: The NTHU-NTUA Chinese interactive multimodal emotion corpus,
H.-C. Chou, W.-C. Lin, L.-C. Chang, C.-C. Li, H.-P. Ma, and C.-C. Lee, “NNIME: The NTHU-NTUA Chinese interactive multimodal emotion corpus,” in2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), 2017, pp. 292–298
2017
-
[57]
Large raw emotional dataset with aggregation mechanism,
V . Kondratenko, A. Sokolov, N. Karpov, O. Kutuzov, N. Savushkin, and F. Minkin, “Large raw emotional dataset with aggregation mechanism,”arXiv preprint arXiv:2212.12266, 2022
-
[58]
Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., “Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,” arXiv preprint arXiv:2407.05407, 2024
-
[59]
Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,
X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Feng et al., “Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,” arXiv preprint arXiv:2503.01710, 2025
-
[60]
F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,
Y . Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association...
2025
-
[61]
E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts,
S. E. Eskimez, X. Wang, M. Thakker, C. Li, C.-H. Tsai, Z. Xiao, H. Yang, Z. Zhu, M. Tang, X. Tan, Y . Liu, S. Zhao, and N. Kanda, “E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts,” in2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 682–689
2024
-
[62]
Qwen3-tts technical report,
H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo et al., “Qwen3-tts technical report,” arXiv preprint arXiv:2601.15621, 2026
-
[63]
Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation,
H.-Y. Choi, S.-H. Lee, and S.-W. Lee, “Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation,” in Interspeech 2023, 2023, pp. 2283–2287
2023
-
[64]
Freevc: Towards high-quality text-free one-shot voice conversion,
J. Li, W. Tu, and L. Xiao, “Freevc: Towards high-quality text-free one-shot voice conversion,” inICASSP 2023 - 2023 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5
2023
-
[65]
data2vec: A general framework for self-supervised learning in speech, vision and language,
A. Baevski et al., “data2vec: A general framework for self-supervised learning in speech, vision and language,” in ICML, 2022
2022
-
[66]
What do speech foundation models not learn about speech?
A. Waheed, H. Atwany, B. Raj, and R. Singh, “What do speech foundation models not learn about speech?” arXiv preprint arXiv:2410.12948, 2024