arxiv: 1809.08895 · v3 · pith:5IEJFG6Knew · submitted 2018-09-19 · 💻 cs.CL

Neural Speech Synthesis with Transformer Network

classification: cs.CL
keywords: network, efficiency, neural, performance, tacotron2, transformer, mechanism, training

Although end-to-end neural text-to-speech (TTS) methods such as Tacotron2 achieve state-of-the-art performance, they still suffer from two problems: 1) low efficiency during training and inference; 2) difficulty modeling long-range dependencies with current recurrent neural networks (RNNs). Inspired by the success of the Transformer network in neural machine translation (NMT), in this paper we introduce and adapt the multi-head attention mechanism to replace both the RNN structures and the original attention mechanism in Tacotron2. With multi-head self-attention, the hidden states in the encoder and decoder are constructed in parallel, which improves training efficiency. Meanwhile, any two inputs at different times are connected directly by the self-attention mechanism, which solves the long-range dependency problem effectively. Using phoneme sequences as input, our Transformer TTS network generates mel spectrograms, followed by a WaveNet vocoder that outputs the final audio. Experiments are conducted to test the efficiency and performance of the new network. For efficiency, our Transformer TTS network trains about 4.25 times faster than Tacotron2. For performance, rigorous human tests show that our proposed model achieves state-of-the-art performance (outperforming Tacotron2 by a gap of 0.048) and is very close to human quality (4.39 vs 4.44 in MOS).
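
The abstract's two claims about self-attention (all positions are computed without a recurrent loop, and any two time steps are connected directly) can be illustrated with a minimal sketch. This is not the paper's implementation: it is plain scaled dot-product self-attention split across heads, and it assumes, purely for brevity, that queries, keys, and values are the raw inputs (no learned projections, no positional encoding, no decoder masking). The function names `self_attention` and `multi_head_self_attention` are hypothetical.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(value - peak) for value in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention with Q = K = V = x.

    Assumption: no learned projection matrices, for brevity.
    x is a list of T vectors of length d. Returns (outputs, weights),
    where weights[i][j] is how much position i attends to position j.
    """
    d = len(x[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for query in x:
        # Each query is scored against every key, so any two time steps
        # are one step apart regardless of their distance in the sequence.
        scores = [scale * sum(a * b for a, b in zip(query, key)) for key in x]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * value[j] for w, value in zip(w_row, x)) for j in range(d)]
        for w_row in weights
    ]
    return outputs, weights


def multi_head_self_attention(x, num_heads):
    """Split the feature dimension into heads, attend per head, concatenate."""
    d = len(x[0])
    assert d % num_heads == 0, "feature size must be divisible by num_heads"
    head_dim = d // num_heads
    head_outputs = []
    for h in range(num_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        out, _ = self_attention([vec[lo:hi] for vec in x])
        head_outputs.append(out)
    # Concatenate the head outputs position by position.
    return [
        [value for head in head_outputs for value in head[t]]
        for t in range(len(x))
    ]


# Toy input: 3 time steps, 4 features, 2 heads.
frames = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0], [1.0, 1.0, 0.0, 0.0]]
attended = multi_head_self_attention(frames, num_heads=2)
```

The loop over queries has no dependence between iterations, which is what lets a real implementation compute all positions at once as a matrix product, in contrast to an RNN that must process time steps in order.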



Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Forward-Backward Decoding for Regularizing End-to-End TTS

    eess.AS · 2019-07 · unverdicted · novelty 6.0

    Forward-backward decoding with divergence regularization and bidirectional decoder improves end-to-end TTS robustness and naturalness by addressing exposure bias via joint L2R/R2L training.

  2. Fine-grained robust prosody transfer for single-speaker neural text-to-speech

    eess.AS · 2019-07 · unverdicted · novelty 6.0

    Decouples prosody alignment via pre-computed phoneme timestamps and adds a VAE to achieve robust fine-grained prosody transfer from unseen speakers in single-speaker neural TTS.