Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

· 2017 · cs.CL · arXiv 1712.05884

3 Pith papers cite this work. Polarity classification is still indexing.

3 Pith papers citing it

open full Pith review browse 3 citing papers arXiv PDF

abstract

This paper describes Tacotron 2, a neural network architecture for speech synthesis directly from text. The system is composed of a recurrent sequence-to-sequence feature prediction network that maps character embeddings to mel-scale spectrograms, followed by a modified WaveNet model acting as a vocoder to synthesize timedomain waveforms from those spectrograms. Our model achieves a mean opinion score (MOS) of $4.53$ comparable to a MOS of $4.58$ for professionally recorded speech. To validate our design choices, we present ablation studies of key components of our system and evaluate the impact of using mel spectrograms as the input to WaveNet instead of linguistic, duration, and $F_0$ features. We further demonstrate that using a compact acoustic intermediate representation enables significant simplification of the WaveNet architecture.

representative citing papers

RUSLAN: Russian Spoken Language Corpus for Speech Synthesis

eess.AS · 2019-06-26 · unverdicted · novelty 7.0

RUSLAN is a 31-hour single-speaker Russian speech corpus for TTS containing 22200 annotated samples, with a baseline end-to-end model scoring 4.05 naturalness and 3.78 intelligibility on MOS tests.

Mechanisms of Misgeneralization in Physical Sequence Modeling

cs.LG · 2026-05-19 · unverdicted · novelty 6.0

Generative sequence models for physical tasks exhibit physical misgeneralization where local prediction errors propagate through physical measurements to distort aggregate distributions over quantities like distance or energy; a data deviation kernel explains and predicts the shifts and supports a内核

Hierarchical Sequence to Sequence Voice Conversion with Limited Data

eess.AS · 2019-07-15 · unverdicted · novelty 4.0

Hierarchical seq2seq model for parallel voice conversion pretrained as autoencoder on single-speaker data then adapted to limited multispeaker data, using mel spectrograms converted via wavenet vocoder.

citing papers explorer

Showing 3 of 3 citing papers.

RUSLAN: Russian Spoken Language Corpus for Speech Synthesis eess.AS · 2019-06-26 · unverdicted · none · ref 15 · internal anchor
RUSLAN is a 31-hour single-speaker Russian speech corpus for TTS containing 22200 annotated samples, with a baseline end-to-end model scoring 4.05 naturalness and 3.78 intelligibility on MOS tests.
Mechanisms of Misgeneralization in Physical Sequence Modeling cs.LG · 2026-05-19 · unverdicted · none · ref 135 · internal anchor
Generative sequence models for physical tasks exhibit physical misgeneralization where local prediction errors propagate through physical measurements to distort aggregate distributions over quantities like distance or energy; a data deviation kernel explains and predicts the shifts and supports a内核
Hierarchical Sequence to Sequence Voice Conversion with Limited Data eess.AS · 2019-07-15 · unverdicted · none · ref 11 · internal anchor
Hierarchical seq2seq model for parallel voice conversion pretrained as autoencoder on single-speaker data then adapted to limited multispeaker data, using mel spectrograms converted via wavenet vocoder.

Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions

fields

years

verdicts

representative citing papers

citing papers explorer