Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

Wei Ping, Kainan Peng, Andrew Gibiansky, Sercan O. Arik, Ajay Kannan, Sharan Narang, Jonathan Raiman, John Miller

classification cs.SD cs.AI cs.CL cs.LG eess.AS

keywords deep voice synthesis attention-based neural scale speech text-to-speech

We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
cs.SD 2025-05 unverdicted novelty 6.0

CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and i...

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
cs.SD 2024-12 unverdicted novelty 5.0

CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilin...