pith. machine review for the scientific record.

arxiv: 1710.07654 · v3 · submitted 2017-10-20 · cs.SD · cs.AI · cs.CL · cs.LG · eess.AS

Recognition: unknown

Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning

keywords: deep · voice · synthesis · attention-based · neural · scale · speech · text-to-speech

We present Deep Voice 3, a fully-convolutional attention-based neural text-to-speech (TTS) system. Deep Voice 3 matches state-of-the-art neural speech synthesis systems in naturalness while training ten times faster. We scale Deep Voice 3 to data set sizes unprecedented for TTS, training on more than eight hundred hours of audio from over two thousand speakers. In addition, we identify common error modes of attention-based speech synthesis networks, demonstrate how to mitigate them, and compare several different waveform synthesis methods. We also describe how to scale inference to ten million queries per day on one single-GPU server.
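As a rough aid for reading the last claim, the daily query figure can be converted to an average per-second rate. The sketch below assumes queries are spread evenly over 24 hours, which the abstract does not state; the function name `average_queries_per_second` is hypothetical and chosen only for illustration.

```python
# Convert the abstract's "ten million queries per day" into an average rate.
# Assumption (not from the abstract): load is uniform across the whole day.
SECONDS_PER_DAY = 24 * 60 * 60


def average_queries_per_second(queries_per_day: float) -> float:
    """Return the mean number of queries per second for a given daily total."""
    return queries_per_day / SECONDS_PER_DAY


rate = average_queries_per_second(10_000_000)
print(f"{rate:.1f} queries per second on average")
```

Under that uniform-load assumption the server would need to sustain a little over one hundred queries per second; real traffic is usually burstier, so peak demand could be higher.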

This paper has not been read by Pith yet.


Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training

    cs.SD · 2025-05 · unverdicted · novelty 6.0

    CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and i...

  2. CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    cs.SD · 2024-12 · unverdicted · novelty 5.0

    CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilin...