BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data
We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
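The abstract mentions compressing speechcodes with byte-pair encoding but gives no details. As a rough illustration of the general idea only, the sketch below applies textbook byte-pair merges to a toy sequence of integer codes. It is not the paper's tokenizer: the function names (`most_frequent_pair`, `merge_pair`, `bpe_compress`), the toy code values, and the choice of 100 as the first merged-token id are all assumptions made for this example.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return (pair, count) for the most common adjacent pair, or (None, 0)."""
    pairs = Counter(zip(seq, seq[1:]))
    if not pairs:
        return None, 0
    pair, count = pairs.most_common(1)[0]
    return pair, count


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(seq, num_merges, first_new_id):
    """Apply up to `num_merges` greedy merges; stop when no pair repeats.

    Returns the shortened sequence and the learned merge table.
    """
    merges = {}
    next_id = first_new_id
    for _ in range(num_merges):
        pair, count = most_frequent_pair(seq)
        if pair is None or count < 2:
            break  # nothing left that is worth merging
        merges[pair] = next_id
        seq = merge_pair(seq, pair, next_id)
        next_id += 1
    return seq, merges


# Hypothetical toy speechcodes; real codes would come from a speech tokenizer.
codes = [3, 7, 3, 7, 5, 3, 7]
compressed, merges = bpe_compress(codes, num_merges=5, first_new_id=100)
print(compressed, merges)
```

In this toy run the pair (3, 7) occurs three times, so it is merged into a single new token and the sequence shrinks from seven codes to four. A shorter sequence means fewer autoregressive steps for a model that predicts the codes, which is the usual motivation for this kind of compression.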
Forward citations
Cited by 10 Pith papers
- WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling
  WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
- FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model
  FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates l...
- NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech
  NüshuVoice releases the first sentence-level Nüshu TTS dataset and shows that an F0-conditioned VITS model using five-level pitch notation outperforms baselines on spectral fidelity, pitch accuracy, and intelligibility.
- Is Natural Always Appropriate? Investigating Naturalness and Appropriateness Across Different Domains for TTS Evaluation
  Appropriateness of TTS varies independently across domains while naturalness scores penalize stylized speech and reward spontaneity.
- SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis
  SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
- X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning
  X-Voice achieves zero-shot cross-lingual voice cloning across 30 languages by using IPA as a unified phonetic representation and a two-stage training process that first generates its own audio prompts then fine-tunes ...
- A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech
  A framework detects speaker drift in TTS outputs by computing cosine similarities across speech segments and using LLMs for binary classification, supported by a human-validated synthetic benchmark.
- Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
  Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.
- X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning
  X-Voice achieves zero-shot cross-lingual voice cloning across 30 languages via IPA-based training on 420K hours of data and a two-stage paradigm that synthesizes its own audio prompts for text-masked fine-tuning.
- CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
  CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilin...