Robust and fine-grained prosody control of end-to-end speech synthesis

Taesu Kim; Younggun Lee

arxiv: 1811.02122 · v2 · pith:D6B2KNRJnew · submitted 2018-11-06 · 💻 cs.CL · cs.LG · cs.SD · eess.AS

classification 💻 cs.CL · cs.LG · cs.SD · eess.AS

keywords prosody, speech, networks, control, embedding, synthesis, temporal, embeddings

We propose prosody embeddings for emotional and expressive speech synthesis networks. The proposed methods introduce temporal structures in the embedding networks, thus enabling fine-grained control of the speaking style of the synthesized speech. The temporal structures can be designed either on the speech side or the text side, leading to different control resolutions in time. The prosody embedding networks are plugged into end-to-end speech synthesis networks and trained without any other supervision except for the target speech for synthesizing. It is demonstrated that the prosody embedding networks learned to extract prosodic features. By adjusting the learned prosody features, we could change the pitch and amplitude of the synthesized speech both at the frame level and the phoneme level. We also introduce the temporal normalization of prosody embeddings, which shows better robustness against speaker perturbations during prosody transfer tasks.
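
The abstract gives no formulas, so the following is only a minimal sketch of how temporal normalization and fine-grained adjustment of a prosody embedding sequence might look. It assumes the embedding is an array of shape (T, D), with one D-dimensional vector per speech frame or per phoneme, and it assumes that normalization means subtracting the mean over time. The function names `temporal_normalize` and `adjust_dimension` are hypothetical and are not taken from the paper.

```python
import numpy as np


def temporal_normalize(prosody):
    """Zero-center a (T, D) prosody embedding sequence along the time axis.

    Assumed form of normalization: subtract the per-dimension mean over time.
    """
    prosody = np.asarray(prosody, dtype=np.float64)
    mean_over_time = prosody.mean(axis=0, keepdims=True)  # shape (1, D)
    return prosody - mean_over_time


def adjust_dimension(prosody, dim, offset, start=0, stop=None):
    """Return a copy with `offset` added to one dimension over steps [start, stop).

    Each step is a frame or a phoneme, depending on the control resolution.
    """
    adjusted = np.array(prosody, dtype=np.float64, copy=True)
    adjusted[start:stop, dim] += offset
    return adjusted


if __name__ == "__main__":
    reference = np.array([[1.0, 2.0], [3.0, 6.0], [5.0, 10.0]])  # T=3, D=2
    normalized = temporal_normalize(reference)
    # Raise dimension 0 on the middle step only.
    print(adjust_dimension(normalized, dim=0, offset=2.0, start=1, stop=2))
```

Under this assumption, a constant per-dimension shift of the reference embedding (for example, a speaker-dependent offset) cancels out after normalization, which is one way such a step could make prosody transfer less sensitive to the reference speaker.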

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Fine-grained robust prosody transfer for single-speaker neural text-to-speech
eess.AS · 2019-07 · unverdicted · novelty 6.0

Decouples prosody alignment via pre-computed phoneme timestamps and adds a VAE to achieve robust, fine-grained prosody transfer from unseen speakers in single-speaker neural TTS.