arxiv: 1712.05884 · v2 · pith:ZYVHMQVUnew · submitted 2017-12-16 · 💻 cs.CL

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

classification 💻 cs.CL
keywords wavenet, spectrograms, architecture, model, network, speech, synthesis, system

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize time-domain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of $4.53$, comparable to a MOS of $4.58$ for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and $F_0$ features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.
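To make the term "mel-scale" in the abstract concrete, the following is a minimal sketch of the frequency warping behind a mel spectrogram. It assumes the common HTK-style formula $m = 2595 \log_{10}(1 + f/700)$; the abstract does not say which mel variant the authors use, and the function names, band count, and frequency range here are illustrative assumptions rather than the paper's settings.

```python
import math


def hz_to_mel(f_hz: float) -> float:
    """Convert frequency in Hz to mels (HTK-style formula, an assumption)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(m: float) -> float:
    """Inverse of hz_to_mel: convert mels back to Hz."""
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)


def mel_center_frequencies(n_bands: int, f_min: float, f_max: float) -> list[float]:
    """Return band centre frequencies (Hz) spaced evenly on the mel scale.

    Hypothetical helper: the centres are equally spaced in mels, so they
    are dense at low frequencies and sparse at high frequencies in Hz.
    """
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_bands + 1)
    return [mel_to_hz(m_min + step * (i + 1)) for i in range(n_bands)]


if __name__ == "__main__":
    # Illustrative values only: 8 bands between 0 Hz and 8 kHz.
    for centre in mel_center_frequencies(8, 0.0, 8000.0):
        print(f"{centre:8.1f} Hz")
```

Because the band centres are evenly spaced in mels rather than in Hz, low frequencies get finer resolution than high ones, which is what makes the representation compact relative to a linear-frequency spectrogram.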

This paper has not been read by Pith yet.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. RUSLAN: Russian Spoken Language Corpus for Speech Synthesis

    eess.AS 2019-06 unverdicted novelty 7.0

    RUSLAN is a 31-hour single-speaker Russian speech corpus for TTS containing 22200 annotated samples, with a baseline end-to-end model scoring 4.05 naturalness and 3.78 intelligibility on MOS tests.

  2. Mechanisms of Misgeneralization in Physical Sequence Modeling

    cs.LG 2026-05 unverdicted novelty 6.0

    Generative sequence models for physical tasks exhibit physical misgeneralization where local prediction errors propagate through physical measurements to distort aggregate distributions over quantities like distance o...

  3. Hierarchical Sequence to Sequence Voice Conversion with Limited Data

    eess.AS 2019-07 unverdicted novelty 4.0

    A hierarchical seq2seq model for parallel voice conversion, pretrained as an autoencoder on single-speaker data and then adapted to limited multispeaker data, using mel spectrograms converted via a WaveNet vocoder.