Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS

Cheng Gong; Chen Zhang; Chunyu Qiang; Haoyu Wang; Jianwu Dang; Longbiao Wang; Tianrui Wang; Yuheng Lu; Yu Jiang

arXiv: 2409.18512 · v2 · submitted 2024-09-27 · cs.SD · cs.AI · cs.CL · eess.AS


Pith reviewed 2026-05-23 20:29 UTC · model grok-4.3

classification: cs.SD · cs.AI · cs.CL · eess.AS

keywords: zero-shot TTS · prompt selection · expressive speech synthesis · emotion intensity · speaker consistency · prosodic features · LLM-based TTS


The pith

A two-stage prompt selection strategy improves emotion intensity and speaker consistency in zero-shot TTS systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method for choosing better prompts for zero-shot text-to-speech models, where the prompt controls both emotion and speaker identity. Existing prompt selection methods often pick prompts that lack stable speaker cues or an appropriate level of emotional intensity. The approach has a static stage that scores prompts on prosodic features, audio quality, and LLM-assessed text-emotion match, plus model-specific checks on character error rate, speaker similarity, and emotional similarity. A dynamic stage then picks the prompt whose text is most similar to the input text. If effective, this yields synthesized speech that sounds more emotionally intense while keeping the speaker's identity clear.

Core claim

The authors propose and test a two-stage prompt selection strategy for expressive zero-shot TTS. In the static stage, prompt candidates are evaluated using pitch-based prosodic features, perceptual audio quality, and LLM text-emotion coherence, as well as character error rate, speaker similarity, and emotional similarity when synthesized. In the dynamic stage, a textual similarity model selects the best aligned prompt for the input text. Experimental results show this selects prompts that produce speech with high-intensity emotional expression and robust speaker identity.
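The static stage described above can be pictured as a filter-then-rank pass over the candidate pool. The sketch below is illustrative only: the `PromptCandidate` fields, the thresholds, and the equal-weight ranking sum are assumptions made for this example, not values or rules taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class PromptCandidate:
    # All field names and scales are hypothetical, chosen for illustration.
    name: str
    pitch_score: float   # prosodic expressiveness, higher is better
    quality: float       # perceptual quality estimate on a MOS-like 1..5 scale
    coherence: float     # LLM-judged text-emotion coherence in 0..1
    cer: float           # character error rate of speech synthesized from this prompt
    speaker_sim: float   # synthesized-vs-prompt speaker similarity
    emotion_sim: float   # synthesized-vs-prompt emotional similarity


def static_select(cands, min_quality=3.5, min_coherence=0.5, max_cer=0.05, top_k=2):
    """Filter candidates, then rank the survivors (assumed scheme)."""
    # Model-independent checks: audio quality and text-emotion coherence.
    kept = [c for c in cands if c.quality >= min_quality and c.coherence >= min_coherence]
    # Model-specific check: drop prompts that make the TTS model mispronounce.
    kept = [c for c in kept if c.cer <= max_cer]
    # Rank by an unweighted sum of the remaining scores (an assumption).
    kept.sort(key=lambda c: c.pitch_score + c.speaker_sim + c.emotion_sim, reverse=True)
    return kept[:top_k]


candidates = [
    PromptCandidate("a", 0.9, 4.1, 0.8, 0.02, 0.85, 0.80),
    PromptCandidate("b", 0.7, 3.0, 0.9, 0.01, 0.90, 0.90),  # fails the quality floor
    PromptCandidate("c", 0.6, 4.3, 0.7, 0.10, 0.88, 0.75),  # fails the CER ceiling
    PromptCandidate("d", 0.5, 3.9, 0.6, 0.03, 0.80, 0.70),
]
selected = [c.name for c in static_select(candidates)]
print(selected)
```

The candidate scores are made up. In practice each field would come from a separate model or signal-processing step, and the thresholds would need tuning on held-out data.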

What carries the argument

The two-stage prompt selection strategy, consisting of static evaluation with prosodic, quality, and similarity metrics followed by dynamic textual similarity selection.
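The dynamic half of the strategy reduces to a nearest-neighbour lookup over prompt transcripts. As a minimal sketch, the code below swaps the paper's textual similarity model for a plain bag-of-words cosine similarity so that it runs with the standard library alone; `pick_prompt` and the example transcripts are hypothetical.

```python
import math
from collections import Counter


def bow(text):
    """Lower-cased bag-of-words vector (a stand-in for a sentence embedding)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pick_prompt(input_text, prompt_texts):
    """Return the key of the prompt whose transcript is closest to input_text."""
    query = bow(input_text)
    return max(prompt_texts, key=lambda name: cosine(query, bow(prompt_texts[name])))


prompts = {
    "p1": "i cannot believe you did this to me",
    "p2": "what a wonderful surprise this is",
}
best = pick_prompt("you did this on purpose", prompts)
print(best)
```

A learned sentence encoder would capture paraphrases that word overlap misses. The selection logic around it would stay the same.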

If this is right

Zero-shot TTS outputs gain higher emotional intensity without losing speaker consistency.
Prompt design becomes more reliable for controlling both emotion and identity in LLM-based synthesizers.
Automatic metrics can guide prompt choice before and during synthesis to stabilize performance.
Expressive speech synthesis improves in stability across different inputs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar selection logic could apply to other controllable generation tasks like image or music synthesis.
Integrating human feedback into the selection stages might further align with perception.
The method might reduce the need for fine-tuning by better leveraging existing prompts.
Testing on a wider range of TTS models would show if the gains generalize.

Load-bearing premise

The chosen automatic metrics reliably reflect the human-perceived emotion intensity and speaker consistency that the method aims to improve.
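One way to probe this premise is to compute a rank correlation between an automatic metric and human ratings of the same utterances. The sketch below implements Spearman's rho with the standard library. The two score lists are invented for illustration and are not data from the paper.

```python
def ranks(values):
    """1-based ranks, with ties given the average of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-utterance scores: automatic emotion similarity vs. human rating.
metric_scores = [0.61, 0.72, 0.55, 0.80, 0.68]
human_ratings = [3.1, 3.9, 2.8, 4.4, 3.0]
rho = spearman(metric_scores, human_ratings)
print(round(rho, 3))
```

A high rho on a real listening study would support the premise. A low one would suggest the proxy metrics and human perception diverge.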

What would settle it

A blind listening test in which human raters assign lower scores for emotion intensity or speaker similarity to the method's outputs than to a simple random prompt baseline.
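A listening test of this kind produces paired ratings, one pair per utterance, which a sign test can summarize without assuming any rating distribution. The sketch below is one possible analysis, not one the paper reports, and the ratings are made up.

```python
from math import comb


def sign_test(method_scores, baseline_scores):
    """Exact two-sided sign test on paired ratings; ties are discarded."""
    pairs = list(zip(method_scores, baseline_scores))
    wins = sum(m > b for m, b in pairs)
    losses = sum(m < b for m, b in pairs)
    n = wins + losses
    if n == 0:
        return wins, losses, 1.0
    k = min(wins, losses)
    # Probability of k or fewer successes under a fair coin, doubled for two sides.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return wins, losses, min(1.0, 2 * tail)


# Hypothetical 1..5 ratings of emotion intensity for the same eight sentences.
method = [4, 5, 4, 3, 5, 4, 4, 5]
baseline = [3, 4, 4, 2, 4, 3, 5, 4]
wins, losses, p_value = sign_test(method, baseline)
print(wins, losses, p_value)
```

More losses than wins with a small p-value would be the outcome that counts against the method. Eight pairs is far too few for a real study and is used here only to keep the arithmetic visible.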

Figures

Figures reproduced from arXiv: 2409.18512 by Cheng Gong, Chen Zhang, Chunyu Qiang, Haoyu Wang, Jianwu Dang, Longbiao Wang, Tianrui Wang, Yuheng Lu, Yu Jiang.

**Figure 1.** The overview of EmoPro. It consists of two stages: a static selection stage and a dynamic selection stage. The static selection stage evaluates the …

**Figure 2.** Mean and variance of emotional speech pitch: red indicates anger, …
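The pitch statistics named in the Figure 2 caption can be computed from a frame-level F0 track. This is a minimal sketch, assuming the track marks unvoiced frames with 0 Hz (a common convention among pitch extractors but not confirmed for this paper). `pitch_stats` is a hypothetical helper.

```python
from statistics import fmean, pvariance


def pitch_stats(f0_hz):
    """Mean and population variance of F0 over voiced frames only."""
    voiced = [f for f in f0_hz if f > 0]  # assumption: 0 Hz marks an unvoiced frame
    if not voiced:
        return None
    return fmean(voiced), pvariance(voiced)


# Made-up F0 track in Hz with two unvoiced frames.
track = [0.0, 200.0, 220.0, 0.0, 180.0]
mean_f0, var_f0 = pitch_stats(track)
print(mean_f0, var_f0)
```

Including the unvoiced zeros would drag the mean down and inflate the variance, which is why they are filtered out before the statistics are taken.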

Original abstract

Recent advancements in speech synthesis have enabled large language model (LLM)-based systems to perform zero-shot generation with controllable content, timbre, speaker identity, and emotion through input prompts. As a result, these models heavily rely on prompt design to guide the generation process. However, existing prompt selection methods often fail to ensure that prompts contain sufficiently stable speaker identity cues and appropriate emotional intensity indicators, which are crucial for expressive speech synthesis. To address this challenge, we propose a two-stage prompt selection strategy specifically designed for expressive speech synthesis. In the static stage (before synthesis), we first evaluate prompt candidates using pitch-based prosodic features, perceptual audio quality, and text-emotion coherence scores evaluated by an LLM. We further assess the candidates under a specific TTS model by measuring character error rate, speaker similarity, and emotional similarity between the synthesized and prompt speech. In the dynamic stage (during synthesis), we use a textual similarity model to select the prompt that is most aligned with the current input text. Experimental results demonstrate that our strategy effectively selects prompt to synthesize speech with both high-intensity emotional expression and robust speaker identity, leading to more expressive and stable zero-shot TTS performance. Audio samples and codes will be available at https://whyrrrrun.github.io/ExpPro.github.io/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes a two-stage prompt selection method for zero-shot TTS to improve emotion intensity and speaker consistency. The static stage scores prompt candidates on pitch-based prosody, perceptual quality, LLM text-emotion coherence, and TTS-model metrics (CER, speaker similarity, emotional similarity). The dynamic stage selects the prompt with highest textual similarity to the target input. The abstract states that experiments demonstrate the strategy yields speech with higher-intensity emotion and more stable speaker identity.
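Of the model-specific metrics in this summary, character error rate (CER) is the simplest to state exactly: the character-level edit distance between a reference transcript and the recognized text, divided by the reference length. The sketch below shows that standard definition. How the paper obtains the recognized text, and whether it normalizes the strings first, is not specified here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match at zero cost
            ))
        prev = cur
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate relative to the reference length."""
    if not reference:
        raise ValueError("reference transcript must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


score = cer("hello world", "hello word")
print(round(score, 4))
```

Because insertions count as errors, CER can exceed 1.0 when the recognized text is much longer than the reference.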

Significance. If the automatic metrics are shown to track human judgments, the method offers a practical, model-agnostic way to improve prompt quality for expressive zero-shot TTS without retraining. The two-stage design and use of both pre-synthesis and in-synthesis selection are straightforward engineering contributions that could be adopted by existing LLM-based TTS pipelines.

major comments (2)

[Abstract, Experimental Results] Abstract and Experimental Results: The central claim that the method produces 'high-intensity emotional expression and robust speaker identity' rests entirely on automatic proxies (pitch features, CER, speaker/emotional similarity, LLM coherence). No human listening tests, preference studies, or correlation analysis between these metrics and human perception of emotion intensity or speaker consistency are reported. Without such validation the experimental demonstration does not support the headline performance claim.
[Experimental Results] Experimental Results: The abstract asserts effectiveness yet supplies no information on the number of test utterances, choice of baselines, statistical significance tests, or whether prompt-selection hyperparameters were tuned on the evaluation set. These omissions prevent verification that the reported gains are robust rather than post-hoc.

minor comments (1)

[Abstract] Abstract: 'selects prompt to synthesize' should read 'selects prompts to synthesize'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will make to strengthen the manuscript.

Point-by-point responses

Referee: [Abstract, Experimental Results] Abstract and Experimental Results: The central claim that the method produces 'high-intensity emotional expression and robust speaker identity' rests entirely on automatic proxies (pitch features, CER, speaker/emotional similarity, LLM coherence). No human listening tests, preference studies, or correlation analysis between these metrics and human perception of emotion intensity or speaker consistency are reported. Without such validation the experimental demonstration does not support the headline performance claim.

Authors: We acknowledge that the absence of human listening tests and correlation analysis with perceptual judgments is a limitation. Our experiments rely on established automatic metrics standard in TTS research for measuring the targeted aspects. In the revised version we will add an explicit limitations subsection discussing reliance on proxies, the lack of human validation, and the need for future perceptual studies. We will also report any feasible post-hoc correlation analysis between the metrics and available human data if it can be computed without new experiments.

Revision: partial

Referee: [Experimental Results] Experimental Results: The abstract asserts effectiveness yet supplies no information on the number of test utterances, choice of baselines, statistical significance tests, or whether prompt-selection hyperparameters were tuned on the evaluation set. These omissions prevent verification that the reported gains are robust rather than post-hoc.

Authors: We will expand the Experimental Results section to explicitly state the number of test utterances, the baselines used, the statistical significance tests performed (with p-values), and confirmation that hyperparameters were selected on a separate validation set. These details exist in our experimental logs and will be added to allow readers to assess robustness.

Revision: yes

Circularity Check

0 steps flagged

Empirical method with external metrics; no derivation reduces to self-inputs

Full rationale

The paper describes a two-stage prompt selection procedure evaluated via external automatic metrics (pitch features, CER, similarities, LLM coherence, textual similarity). No equations, fitted parameters, or self-citations are presented as load-bearing for the central claim. The experimental demonstration relies on independent model outputs rather than any quantity defined inside the paper itself, so the result is not equivalent to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper introduces no new mathematical axioms, free parameters, or invented physical entities; it relies on standard evaluation metrics and off-the-shelf models whose assumptions are inherited from prior literature.


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
eess.AS · 2026-04 · unverdicted · novelty 6.0

Emotion embedding similarities are unsuitable for zero-shot evaluation of emotional expressiveness in speech generation due to confounding by non-emotional acoustic features.

