Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

Ajay Kannan; Andrew Gibiansky; John Miller; Jonathan Raiman; Kainan Peng; Sercan O. Arik; Sharan Narang; Wei Ping

arxiv: 1710.07654 · v3 · pith:GU3Y4K4Vnew · submitted 2017-10-20 · 💻 cs.SD · cs.AI · cs.CL · cs.LG · eess.AS

Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller

classification 💻 cs.SD · cs.AI · cs.CL · cs.LG · eess.AS

keywords deep · voice · synthesis · attention-based · neural · scale · speech · text-to-speech

We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.
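As a rough reading of the inference claim, the daily query figure can be converted into a sustained rate. The sketch below assumes load is spread uniformly over 24 hours, which the abstract does not state; the function names are hypothetical and used only for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def sustained_qps(queries_per_day: float) -> float:
    """Average queries per second, assuming uniform load over a day."""
    return queries_per_day / SECONDS_PER_DAY


def per_query_budget_ms(queries_per_day: float, concurrency: int = 1) -> float:
    """Average time available per query in milliseconds.

    `concurrency` is an assumed number of queries served in parallel;
    the abstract does not specify how requests are batched.
    """
    return 1000.0 * concurrency / sustained_qps(queries_per_day)


if __name__ == "__main__":
    daily = 10_000_000  # "ten million queries per day" from the abstract
    print(f"{sustained_qps(daily):.1f} queries/s")               # about 115.7
    print(f"{per_query_budget_ms(daily):.2f} ms per query")      # about 8.64
```

Under that uniform-load assumption, ten million queries per day corresponds to roughly 116 queries per second on average, or under 9 ms per query if requests were handled strictly one at a time.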

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
cs.SD 2025-05 unverdicted novelty 6.0

CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and i...

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
cs.SD 2024-12 unverdicted novelty 5.0

CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilin...

Hierarchical Sequence to Sequence Voice Conversion with Limited Data
eess.AS 2019-07 unverdicted novelty 4.0

A hierarchical seq2seq model for parallel voice conversion is pretrained as an autoencoder on single-speaker data and then adapted to limited multispeaker data, with mel spectrograms converted to waveforms by a WaveNet vocoder.

Polyphone Disambiguation for Mandarin Chinese Using Conditional Neural Network with Multi-level Embedding Features
cs.CL 2019-07 unverdicted novelty 4.0

A conditional neural network using bidirectional RNN sentence encoding and multi-level word/sentence embeddings reaches 94.69% accuracy on a public Mandarin polyphone dataset.