Recognition: unknown
The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
Pith reviewed 2026-05-07 12:30 UTC · model grok-4.3
The pith
Emotion embedding similarity metrics are unsuitable for zero-shot evaluation of emotional speech generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Although emotion encoders achieve high accuracy on emotion classification tasks, their latent spaces are unsuitable for zero-shot similarity evaluation. Linguistic and speaker interference overshadows emotional features in these representations, degrading the metric's ability to discriminate emotional content. As a result, the metric misaligns with human perception and rewards acoustic mimicry over genuine emotional synthesis.
What carries the argument
The cosine similarity between emotion embeddings extracted from reference and generated audio samples using pre-trained encoders.
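The metric under examination reduces to a single computation. A minimal sketch, assuming frame-level encoder outputs (e.g. from emotion2vec) are already available as (T, D) NumPy arrays; the temporal mean pooling follows the convention described in the paper's experimental setup, and the function name emo_sim is illustrative:

```python
import numpy as np

def emo_sim(ref_frames: np.ndarray, gen_frames: np.ndarray) -> float:
    """Cosine similarity of temporally mean-pooled embeddings.

    ref_frames, gen_frames: (T, D) frame-level outputs from a
    pre-trained emotion encoder (hypothetical inputs here).
    """
    ref = ref_frames.mean(axis=0)
    gen = gen_frames.mean(axis=0)
    return float(ref @ gen / (np.linalg.norm(ref) * np.linalg.norm(gen)))
```

Everything the metric "knows" about emotion must already be linearly separable in those pooled vectors; that implicit assumption is exactly what the paper attacks.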
If this is right
- Evaluation of expressive speech synthesis and voice conversion systems using this approach produces unreliable results.
- The metric cannot reliably distinguish genuine emotional transfer from mere acoustic copying.
- Human perception tests are required to supplement or replace embedding-based similarity for emotional evaluation.
- New metrics must address the entanglement of emotion with linguistic and speaker information.
Where Pith is reading between the lines
- Embedding-based evaluations in other audio domains may share similar vulnerabilities to non-target feature interference.
- Disentanglement techniques could be applied to emotion encoders to improve their suitability for similarity tasks.
- Real-world deployment of speech generation models should prioritize metrics validated against diverse human judgments.
Load-bearing premise
The adversarial tasks and human tests in the study reflect the metric's behavior in standard real-world speech generation evaluation settings.
What would settle it
Observing cases where human listeners rate generated speech as emotionally similar to a reference but the embedding cosine similarity is low, or the reverse, outside of the controlled test conditions.
read the original abstract
Objective metrics for emotional expressiveness are vital for speech generation, particularly in expressive synthesis and voice conversion requiring emotional prosody transfer. To quantify this, the field widely relies on emotion similarity between reference and generated samples. This approach computes cosine similarity of embeddings from encoders like emotion2vec, assuming they capture affective cues despite linguistic and speaker variations. We challenge this assumption through controlled adversarial tasks and human alignment tests. Despite high classification accuracy, these latent spaces are unsuitable for zero-shot similarity evaluation. Representational limitations cause linguistic and speaker interference to overshadow emotional features, degrading discriminative ability. Consequently, the metric misaligns with human perception. This acoustic vulnerability reveals it rewards acoustic mimicry over genuine emotional synthesis.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that emotion embeddings (e.g., from emotion2vec) achieve high classification accuracy yet remain unsuitable for zero-shot cosine-similarity evaluation of emotional prosody in speech generation. Controlled adversarial tasks that swap linguistic content or speaker identity while holding emotion labels fixed, together with human-alignment checks, demonstrate that non-emotional features dominate the latent space, causing the metric to misalign with human perception and to reward acoustic mimicry rather than genuine emotional transfer.
Significance. If substantiated, the result would be significant for the speech-generation community because it directly challenges a widely adopted evaluation practice. The use of explicit adversarial constructions and human tests supplies a falsifiable empirical basis rather than purely theoretical critique, which is a strength. However, the absence of dataset details, statistical tests, and quantitative results in the provided abstract leaves the magnitude of the degradation and the robustness of the conclusion difficult to gauge.
major comments (1)
- [Experimental setup and adversarial tasks] The central claim that the embeddings are unsuitable for typical speech-generation evaluation rests on the assumption that the controlled adversarial manipulations (content/speaker swaps) produce interference patterns representative of real end-to-end generator artifacts. If the acoustic distortions introduced by these explicit swaps differ systematically from the prosody-transfer errors that arise in actual synthesis models, the observed degradation and human misalignment may not generalize. This is load-bearing for the unsuitability conclusion.
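The adversarial protocol this comment questions can be framed as a triplet ranking test. A hedged sketch, assuming pooled embeddings are already computed; the triplet construction mirrors the paper's description (positives share the anchor's emotion, negatives do not but may share its speaker or linguistic content), while the helper names are illustrative:

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_accuracy(anchors, positives, negatives) -> float:
    """Fraction of triplets ranked correctly by embedding similarity.

    Each argument is a list of pooled embeddings. A triplet counts as
    correct when the same-emotion clip (positive) scores higher than
    the different-emotion clip (negative), even when the negative
    shares speaker or content with the anchor (the adversarial case).
    """
    correct = sum(
        cosine(a, p) > cosine(a, n)
        for a, p, n in zip(anchors, positives, negatives)
    )
    return correct / len(anchors)
```

If non-emotional features dominate the latent space, this accuracy collapses precisely in the conditions where the negative shares acoustics with the anchor, which is the degradation pattern the paper reports.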
minor comments (1)
- [Abstract] The abstract states the main conclusion but supplies no information on the datasets, exact construction of the adversarial examples, number of trials, or statistical significance tests used to support the claim of degraded discriminative ability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for acknowledging the potential significance of challenging a widely used evaluation practice. We address the single major comment below and will incorporate clarifications to strengthen the manuscript.
read point-by-point responses
-
Referee: The central claim that the embeddings are unsuitable for typical speech-generation evaluation rests on the assumption that the controlled adversarial manipulations (content/speaker swaps) produce interference patterns representative of real end-to-end generator artifacts. If the acoustic distortions introduced by these explicit swaps differ systematically from the prosody-transfer errors that arise in actual synthesis models, the observed degradation and human misalignment may not generalize. This is load-bearing for the unsuitability conclusion.
Authors: We appreciate this point on generalizability. Our adversarial swaps are not intended to replicate every acoustic artifact of end-to-end generators but to isolate the embedding space's sensitivity to non-emotional factors (linguistic content and speaker identity) under conditions where emotion labels are held fixed. These factors are precisely the confounds that arise in real prosody transfer, where models frequently fail to fully disentangle them. The core finding is that the latent space is dominated by such features regardless of how the interference is introduced, as shown by the contrast between high clean-data classification accuracy and degraded performance under swaps, plus the human misalignment results. This property of the embedding itself supports unsuitability for zero-shot cosine-similarity evaluation. To address the concern directly, we will add a dedicated paragraph in the discussion section comparing our controlled interference patterns to documented prosody-transfer errors in recent emotional TTS and voice conversion literature, thereby clarifying the link to practical generator artifacts.
revision: partial
Circularity Check
Empirical evaluation with no derivation or self-referential reduction
full rationale
The paper advances its central claim—that emotion embedding spaces like those from emotion2vec are unsuitable for zero-shot similarity evaluation—solely through controlled adversarial tasks (content/speaker swaps) and human alignment tests. These are direct experimental measurements of classification accuracy, cosine similarity degradation, and perceptual misalignment. No equations, fitted parameters, or predictive derivations appear; the results are reported outcomes rather than quantities constructed from the inputs by definition. Self-citations, if present, support background on embedding models but are not load-bearing for the unsuitability conclusion, which rests on the new empirical evidence.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: High accuracy of emotion encoders on classification tasks implies the embeddings encode emotion independently of linguistic and speaker content
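This axiom can fail even in a toy model: a space in which a linear probe classifies emotion perfectly while cosine similarity is governed almost entirely by speaker identity. A minimal illustration with hypothetical axis-aligned embeddings and hand-picked weights (0.1 for the emotion axis, 1.0 for the speaker axis); none of these values come from the paper:

```python
import numpy as np

N_EMO, N_SPK = 4, 10

def embed(emotion: int, speaker: int, emo_w=0.1, spk_w=1.0) -> np.ndarray:
    """Toy embedding: a weak emotion axis plus a strong speaker axis."""
    v = np.zeros(N_EMO + N_SPK)
    v[emotion] = emo_w
    v[N_EMO + speaker] = spk_w
    return v

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# A linear probe reading the first N_EMO dimensions classifies
# emotion perfectly, for every speaker:
assert all(
    embed(e, s)[:N_EMO].argmax() == e
    for e in range(N_EMO) for s in range(N_SPK)
)

# Yet zero-shot cosine similarity ranks a different-emotion,
# same-speaker clip far above a same-emotion, different-speaker clip:
anchor   = embed(0, 0)
positive = embed(0, 1)   # same emotion, different speaker
negative = embed(1, 0)   # different emotion, same speaker
print(cos(anchor, positive), cos(anchor, negative))  # ≈0.01 vs ≈0.99
```

Classification only needs some direction in the space to separate classes; zero-shot similarity is dominated by whichever directions carry the most energy, and those need not be emotional.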
Reference graph
Works this paper leans on
-
[1]
Consequently, evaluating how well generated audio captures the desired emotional style is essential
Introduction and Background As generative speech models rapidly evolve, generating emotionally expressive speech has become a key objective across tasks such as expressive text-to-speech (TTS) and emotional voice conversion (EVC) [1, 2]. Consequently, evaluating how well generated audio captures the desired emotional style is essential. While subjecti...
-
[2]
Methodology To rigorously evaluate EMO-SIM against the three criteria outlined in Section 1, we design a systematic pipeline. We first calibrate the anisotropic latent space to prevent similarity dis... Table 1: Categorical Emotion Evaluation. Triplet accuracy (%) under controlled adversarial settings. Values are Me...
-
[3]
positive
Experimental Setup 3.1. Representation Extractors We evaluate the base emotion2vec alongside its fine-tuned emotion2vec+ variants (seed, base, and large) [6]. We also include SSL models: HuBERT [38], Wav2vec 2.0 [39], and TERA [40]. As in VERSA [30], we extract frame-level representations from the last hidden layer, applying temporal mean pooling and mean centerin...
-
[4]
Categorical Emotion Robustness Table 1 reveals EMO-SIM’s failure to capture genuine emotion
Results and Analyses 4.1. Categorical Emotion Robustness Table 1 reveals EMO-SIM’s failure to capture genuine emotion. Even in the ideal speaker-linguistic match scenario, emotion2vec and emotion2vec+ variants barely reach 60-70% accuracy. This weak inherent affective representation severely degrades under adversarial acoustic variations. With a linguistic di...
-
[5]
We hypothesize emotion2vec directly inherits acoustic representations from its foundational model, data2vec [55, 56]
Discussion and Suggestion Our findings demonstrate that high SER accuracy does not inherently translate to a perceptually meaningful latent space. We hypothesize emotion2vec directly inherits acoustic representations from its foundational model, data2vec [55, 56]. While supervised SER utilizes linear classification layers as filters to suppress these non-...
-
[6]
However, this paper demonstrates that it is fundamentally unreliable
Conclusion EMO-SIM has become the de facto objective metric for zero-shot expressive speech evaluation. However, this paper demonstrates that it is fundamentally unreliable. Our empirical analyses reveal two critical flaws: current emotion embeddings are structurally biased by acoustic distractors, and their similarity scores severely misalign with human ...
-
[7]
Acknowledgments This work was supported by the Ministry of Education (MOE) of Taiwan under the project Taiwan Centers of Excellence in Artificial Intelligence, through the NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)
-
[8]
Generative AI Use Disclosure We employed generative AI solely to improve the writing quality of this manuscript, without using it to generate any core content
-
[9]
Towards controllable speech synthesis in the era of large language models: A systematic survey,
T. Xie, Y. Rong, P. Zhang, W. Wang, and L. Liu, “Towards controllable speech synthesis in the era of large language models: A systematic survey,” in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, Eds. Suzhou, China: Association for Computational Linguist...
2025
-
[10]
C.-J. Hsu, Y .-C. Lin, C.-C. Lin, W.-C. Chen, H. L. Chung, C.-A. Li, Y .-C. Chen, C.-Y . Yu, M.-J. Lee, C.-C. Chen, R.-H. Huang, H. yi Lee, and D.-S. Shiu, “Breezyvoice: Adapting tts for taiwanese mandarin with enhanced polyphone disambiguation – challenges and insights,” 2025. [Online]. Available: https://arxiv.org/abs/2501.17790
-
[11]
Wavlm: Large-scale self-supervised pre-training for full stack speech processing,
S. Chen, C. Wang, Z. Chen, Y. Wu, S. Liu, Z. Chen, J. Li, N. Kanda, T. Yoshioka, X. Xiao, J. Wu, L. Zhou, S. Ren, Y. Qian, Y. Qian, J. Wu, M. Zeng, X. Yu, and F. Wei, “Wavlm: Large-scale self-supervised pre-training for full stack speech processing,” IEEE Journal of Selected Topics in Signal Processing, vol. 16, no. 6, pp. 1505–1518, 2022
2022
-
[12]
Speechtokenizer: Unified speech tokenizer for speech language models,
X. Zhang, D. Zhang, S. Li, Y. Zhou, and X. Qiu, “Speechtokenizer: Unified speech tokenizer for speech language models,” in The Twelfth International Conference on Learning Representations,
-
[13]
Available: https://openreview.net/forum?id=AF9Q8Vip84
[Online]. Available: https://openreview.net/forum?id=AF9Q8Vip84
-
[14]
Seamless: Multilingual expressive and streaming speech translation,
L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, M. Duppenthaler, P.-A. Duquenne, B. Ellis, H. Elsahar, J. Haaheim et al., “Seamless: Multilingual expressive and streaming speech translation,” arXiv preprint arXiv:2312.05187, 2023
-
[15]
emotion2vec: Self-supervised pre-training for speech emotion representation,
Z. Ma, Z. Zheng, J. Ye, J. Li, Z. Gao, S. Zhang, and X. Chen, “emotion2vec: Self-supervised pre-training for speech emotion representation,” in Findings of the Association for Computational Linguistics: ACL 2024, L.-W. Ku, A. Martins, and V. Srikumar, Eds. Bangkok, Thailand: Association for Computational Linguistics, Aug. 2024, pp. 15 747–15 760. [Online]...
2024
-
[16]
MaskGCT: Zero-shot text-to-speech with masked generative codec transformer,
Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “MaskGCT: Zero-shot text-to-speech with masked generative codec transformer,” in The Thirteenth International Conference on Learning Representations,
-
[17]
Available: https://openreview.net/forum?id=ExuBFYtCQU
[Online]. Available: https://openreview.net/forum?id=ExuBFYtCQU
-
[18]
S. Zhou, Y . Zhou, Y . He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration- controlled auto-regressive zero-shot text-to-speech,”arXiv preprint arXiv:2506.21619, 2025
-
[19]
Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS
H. Wang, C. Qiang, T. Wang, C. Gong, Q. Liu, Y . Jiang, X. Wang, C. Wang, and C. Zhang, “Emopro: A prompt selection strategy for emotional expression in lm-based speech synthesis,”arXiv preprint arXiv:2409.18512, 2024
-
[20]
Emo- sphere++: Emotion-controllable zero-shot text-to-speech via emotion-adaptive spherical vector,
D.-H. Cho, H.-S. Oh, S.-B. Kim, and S.-W. Lee, “Emo- sphere++: Emotion-controllable zero-shot text-to-speech via emotion-adaptive spherical vector,”IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 2365–2380, 2025
2025
-
[21]
Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,
X. Gao, C. Zhang, Y . Chen, H. Zhang, and N. F. Chen, “Emo-dpo: Controllable emotional speech synthesis through direct preference optimization,” inICASSP 2025 - 2025 IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP), 2025, pp. 1–5
2025
-
[22]
Laugh now cry later: Controlling time-varying emotional states of flow-matching- based zero-shot text-to-speech,
H. Wu, X. Wang, S. E. Eskimez, M. Thakker, D. Tompkins, C.-H. Tsai, C. Li, Z. Xiao, S. Zhao, J. Li, and N. Kanda, “Laugh now cry later: Controlling time-varying emotional states of flow-matching- based zero-shot text-to-speech,” in2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 690–697
2024
-
[23]
Tts-ctrlnet: Time varying emotion aligned text-to-speech generation with controlnet,
J. Jeong, Y . Lee, M. Kwon, and Y . Uh, “Tts-ctrlnet: Time varying emotion aligned text-to-speech generation with controlnet,”arXiv preprint arXiv:2507.04349, 2025
-
[24]
DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distil- lation for Cross-Speaker Emotion Transfer in Text-to-Speech,
D.-H. Cho, H.-S. Oh, S.-B. Kim, and S.-W. Lee, “DiEmo-TTS: Disentangled Emotion Representations via Self-Supervised Distil- lation for Cross-Speaker Emotion Transfer in Text-to-Speech,” in Interspeech 2025, 2025, pp. 4373–4377
2025
-
[25]
Emovoice: Llm-based emotional text-to-speech model with freestyle text prompting,
G. Yang, C. Yang, Q. Chen, Z. Ma, W. Chen, W. Wang, T. Wang, Y. Yang, Z. Niu, W. Liu et al., “Emovoice: Llm-based emotional text-to-speech model with freestyle text prompting,” in Proceedings of the 33rd ACM International Conference on Multimedia, 2025, pp. 10 748–10 757
2025
-
[26]
Computational Narrative Understanding for Expressive Text-to-Speech
G. Michel, E. V . Epure, and C. Cerisara, “Libriquote: A speech dataset of fictional character utterances for expressive zero-shot speech synthesis,”arXiv preprint arXiv:2509.04072, 2025
-
[27]
An efficient emotional speech synthesis ap- proach via multi-scale feature generation,
J. Luo and J. Yang, “An efficient emotional speech synthesis ap- proach via multi-scale feature generation,” inInternational Con- ference on Intelligent Computing. Springer, 2025, pp. 3–16
2025
-
[28]
Word-level emotional expression control in zero-shot text-to-speech synthesis,
tianrui wang, H. Wang, M. Ge, C. Gong, C. Qiang, Z. Ma, Z. Huang, G. Yang, X. Wang, E. Chng, X. Chen, L. Wang, and J. Dang, “Word-level emotional expression control in zero-shot text-to-speech synthesis,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025. [Online]. Available: https://openreview.net/forum?id=SYcggdxX6W
2025
-
[29]
De- batts: Zero-shot debating text-to-speech synthesis,
Y . Huang, Y . Wang, J. Li, H. Guo, H. He, S. Zhang, and Z. Wu, “De- batts: Zero-shot debating text-to-speech synthesis,”arXiv preprint arXiv:2411.06540, 2024
-
[30]
Hd-ppt: Hierarchical decoding of content-and prompt-preference tokens for instruction- based tts,
S. Nie, X. Xing, J. Xing, B. Liu, and X. Xu, “Hd-ppt: Hierarchical decoding of content-and prompt-preference tokens for instruction- based tts,”arXiv preprint arXiv:2509.19001, 2025
-
[31]
Emotion-aligned generation in diffusion text to speech models via preference-guided optimization,
J. Shi, H. Du, Y . He, Y . A. Hong, and Y . Gao, “Emotion-aligned generation in diffusion text to speech models via preference-guided optimization,”arXiv preprint arXiv:2509.25416, 2025
-
[32]
NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech,
M. Borisov, E. Spirin, and D. Diatlova, “NonverbalTTS: A Public English Corpus of Text-Aligned Nonverbal Vocalizations with Emotion Annotations for Text-to-Speech,” in 13th edition of the Speech Synthesis Workshop, 2025, pp. 104–109
2025
-
[33]
Durflex-evc: Duration-flexible emotional voice conversion leveraging discrete representations without text alignment,
H.-S. Oh, S.-H. Lee, D.-H. Cho, and S.-W. Lee, “Durflex-evc: Duration-flexible emotional voice conversion leveraging discrete representations without text alignment,”IEEE Transactions on Affective Computing, vol. 16, no. 3, pp. 1660–1674, 2025
2025
-
[34]
Maestro-evc: Control- lable emotional voice conversion guided by references and explicit prosody,
J. Yoon, W. Jeong, J. Gim, and Y .-J. Suh, “Maestro-evc: Control- lable emotional voice conversion guided by references and explicit prosody,”arXiv preprint arXiv:2508.06890, 2025
-
[35]
Naturalvoices: A large-scale, sponta- neous and emotional podcast dataset for voice conversion,
Z. Du, S. S. Chandra, I. R. Ulgen, A. Mahapatra, A. N. Salman, C. Busso, and B. Sisman, “Naturalvoices: A large-scale, sponta- neous and emotional podcast dataset for voice conversion,”arXiv preprint arXiv:2511.00256, 2025
-
[36]
Audio-to-audio emotion conversion with pitch and duration style transfer,
S. Dutta, A. Jain, and S. Ganapathy, “Audio-to-audio emotion conversion with pitch and duration style transfer,”arXiv preprint arXiv:2505.17655, 2025
-
[37]
S. Wang, A. Chen, and T. Zhao, “Beyond global emotion: Fine- grained emotional speech synthesis with dynamic word-level mod- ulation,”arXiv preprint arXiv:2509.20378, 2025
-
[38]
ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech,
Y. Pan, Y. Hu, Y. Yang, J. Yao, J. Ye, H. Zhou, L. Ma, and J. Zhao, “ClapFM-EVC: High-Fidelity and Flexible Emotional Voice Conversion with Dual Control from Natural Language and Speech,” in Interspeech 2025, 2025, pp. 4583–4587
2025
-
[39]
C. Gong, C. Qiang, T. Wang, Y . Jiang, Y . Lu, R. Jing, X. Miao, X. Zhang, L. Wang, and J. Dang, “Perturbation self-supervised representations for cross-lingual emotion tts: Stage-wise modeling of emotion and speaker,”arXiv preprint arXiv:2510.11124, 2025
-
[40]
VERSA: A versatile evaluation toolkit for speech, audio, and music,
J. Shi, H.-j. Shim, J. Tian, S. Arora, H. Wu, D. Petermann, J. Q. Yip, Y . Zhang, Y . Tang, W. Zhang, D. S. Alharthi, Y . Huang, K. Saito, J. Han, Y . Zhao, C. Donahue, and S. Watanabe, “VERSA: A versatile evaluation toolkit for speech, audio, and music,” in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for C...
2025
-
[41]
Emo- bias: A Large Scale Evaluation of Social Bias on Speech Emotion Recognition,
Y .-C. Lin, H. Wu, H.-C. Chou, C.-C. Lee, and H. yi Lee, “Emo- bias: A Large Scale Evaluation of Social Bias on Speech Emotion Recognition,” inInterspeech 2024, 2024, pp. 4633–4637
2024
-
[42]
On the social bias of speech self-supervised models,
Y .-C. Lin, T.-Q. Lin, H.-C. Lin, A. T. Liu, and H. yi Lee, “On the social bias of speech self-supervised models,” inInterspeech 2024, 2024, pp. 4638–4642
2024
-
[43]
EmergentTTS- eval: Evaluating TTS models on complex prosodic, expressiveness, and linguistic challenges using model-as-a-judge,
R. R. Manku, Y . Tang, X. Shi, M. Li, and A. Smola, “EmergentTTS- eval: Evaluating TTS models on complex prosodic, expressiveness, and linguistic challenges using model-as-a-judge,” inThe Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2025. [Online]. Available: https://openreview.net/forum?id=P3JBBnh10z
2025
-
[44]
Self-supervised rewiring of pre-trained speech encoders:towards faster fine- tuning with less labels in speech processing,
H. Yang, J. Zhao, G. Haffari, and E. Shareghi, “Self-supervised rewiring of pre-trained speech encoders:towards faster fine- tuning with less labels in speech processing,” inFindings of the Association for Computational Linguistics: EMNLP 2022, Y . Goldberg, Z. Kozareva, and Y . Zhang, Eds. Abu Dhabi, United Arab Emirates: Association for Computational Li...
2022
-
[45]
G. Wisniewski, S. Guillaume, and C. R. Fern ´andez, “Assessing the impact of anisotropy in neural representations of speech: A case study on keyword spotting,”arXiv preprint arXiv:2506.11096, 2025
-
[46]
All-but-the-top: Simple and effective postprocessing for word representations,
J. Mu and P. Viswanath, “All-but-the-top: Simple and effective postprocessing for word representations,” inInternational Conference on Learning Representations, 2018. [Online]. Available: https://openreview.net/forum?id=HkuGJ3kCb
2018
-
[47]
Measuring nominal scale agreement among many raters
J. L. Fleiss, “Measuring nominal scale agreement among many raters.”Psychological bulletin, vol. 76, no. 5, p. 378, 1971
1971
-
[48]
Hubert: Self-supervised speech representation learning by masked prediction of hidden units,
W.-N. Hsu, B. Bolte, Y .-H. H. Tsai, K. Lakhotia, R. Salakhutdinov, and A. Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,”IEEE/ACM Trans- actions on Audio, Speech, and Language Processing, vol. 29, pp. 3451–3460, 2021
2021
-
[49]
wav2vec 2.0: A framework for self-supervised learning of speech representations,
A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M. Balcan, and H. Lin, Eds., vol. 33. Curran Associates, Inc., 2020, pp. 12 449–12 460. [Online]. Available: https://proceedings.neur...
2020
-
[50]
Tera: Self-supervised learning of transformer encoder representation for speech,
A. T. Liu, S.-W. Li, and H.-y. Lee, “Tera: Self-supervised learning of transformer encoder representation for speech,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 2351–2366, 2021
2021
-
[51]
Crema-d: Crowd-sourced emotional multimodal actors dataset,
H. Cao, D. G. Cooper, M. K. Keutmann, R. C. Gur, A. Nenkova, and R. Verma, “Crema-d: Crowd-sourced emotional multimodal actors dataset,”IEEE Transactions on Affective Computing, vol. 5, no. 4, pp. 377–390, 2014
2014
-
[52]
Msp-improv: An acted cor- pus of dyadic interactions to study emotion perception,
C. Busso, S. Parthasarathy, A. Burmania, M. AbdelWahab, N. Sadoughi, and E. M. Provost, “Msp-improv: An acted cor- pus of dyadic interactions to study emotion perception,”IEEE Transactions on Affective Computing, vol. 8, no. 1, pp. 67–80, 2017
2017
-
[53]
Building naturalistic emotionally bal- anced speech corpus by retrieving emotional speech from existing podcast recordings,
R. Lotfian and C. Busso, “Building naturalistic emotionally bal- anced speech corpus by retrieving emotional speech from existing podcast recordings,”IEEE Transactions on Affective Computing, vol. 10, no. 4, pp. 471–483, 2019
2019
-
[54]
The MSP-Podcast Corpus,
C. Busso, R. Lotfian, K. Sridhar, A. N. Salman, W.-C. Lin, L. Goncalves, S. Parthasarathy, A. R. Naini, S.-G. Leem, L. Martinez-Lucas, H.-C. Chou, and P. Mote, “The MSP-Podcast Corpus,” 2025. [Online]. Available: https: //arxiv.org/abs/2509.09791
-
[55]
An intelligent infras- tructure toward large scale naturalistic affective speech corpora collection,
S. G. Upadhyay, W.-S. Chien, B.-H. Su, L. Goncalves, Y .-T. Wu, A. N. Salman, C. Busso, and C.-C. Lee, “An intelligent infras- tructure toward large scale naturalistic affective speech corpora collection,” in2023 11th International Conference on Affective Computing and Intelligent Interaction (ACII), 2023, pp. 1–8
2023
-
[56]
NNIME: The NTHU-NTUA Chinese interactive multimodal emotion corpus,
H.-C. Chou, W.-C. Lin, L.-C. Chang, C.-C. Li, H.-P. Ma, and C.-C. Lee, “NNIME: The NTHU-NTUA Chinese interactive multimodal emotion corpus,” in2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), 2017, pp. 292–298
2017
-
[57]
Large raw emotional dataset with aggregation mechanism,
V . Kondratenko, A. Sokolov, N. Karpov, O. Kutuzov, N. Savushkin, and F. Minkin, “Large raw emotional dataset with aggregation mechanism,”arXiv preprint arXiv:2212.12266, 2022
-
[58]
Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., “Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,” arXiv preprint arXiv:2407.05407, 2024
-
[59]
Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,
X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Feng et al., “Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,” arXiv preprint arXiv:2503.01710, 2025
-
[60]
F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,
Y . Chen, Z. Niu, Z. Ma, K. Deng, C. Wang, J. JianZhao, K. Yu, and X. Chen, “F5-TTS: A fairytaler that fakes fluent and faithful speech with flow matching,” inProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), W. Che, J. Nabende, E. Shutova, and M. T. Pilehvar, Eds. Vienna, Austria: Association...
2025
-
[61]
E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts,
S. E. Eskimez, X. Wang, M. Thakker, C. Li, C.-H. Tsai, Z. Xiao, H. Yang, Z. Zhu, M. Tang, X. Tan, Y . Liu, S. Zhao, and N. Kanda, “E2 tts: Embarrassingly easy fully non-autoregressive zero-shot tts,” in2024 IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 682–689
2024
-
[62]
Qwen3-tts technical report,
H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo et al., “Qwen3-tts technical report,” arXiv preprint arXiv:2601.15621, 2026
-
[63]
Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation,
H.-Y. Choi, S.-H. Lee, and S.-W. Lee, “Diff-HierVC: Diffusion-based Hierarchical Voice Conversion with Robust Pitch Generation and Masked Prior for Zero-shot Speaker Adaptation,” in Interspeech 2023, 2023, pp. 2283–2287
2023
-
[64]
Freevc: Towards high-quality text-free one-shot voice conversion,
J. Li, W. Tu, and L. Xiao, “Freevc: Towards high-quality text-free one-shot voice conversion,” inICASSP 2023 - 2023 IEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5
2023
-
[65]
data2vec: A general framework for self-supervised learning in speech, vision and language,
A. Baevski et al., “data2vec: A general framework for self-supervised learning in speech, vision and language,” in ICML, 2022
2022
-
[66]
What do speech foundation models not learn about speech?
A. Waheed, H. Atwany, B. Raj, and R. Singh, “What do speech foundation models not learn about speech?” arXiv preprint arXiv:2410.12948, 2024