BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

Adam Michalski; Alexis Moinet; Álvaro Martín-Cortinas; Ammar Abbas; Arent van Korlaar; Arnaud Joly; Bartosz Putrycz; Elena Sokolova; Ewa Muszyńska; Fan Yang

arxiv: 2402.08093 · v2 · pith:WIUPLPKWnew · submitted 2024-02-12 · cs.LG · cs.CL · eess.AS

Mateusz Łajszczak, Guillermo Cámbara, Yang Li, Fatih Beyhan, Arent van Korlaar, Fan Yang, Arnaud Joly, Álvaro Martín-Cortinas, Ammar Abbas, Adam Michalski, Alexis Moinet, Sri Karlapati, Ewa Muszyńska, Haohan Guo, Bartosz Putrycz, Soledad López Gambino, Kayeon Yoo, Elena Sokolova, Thomas Drugman

classification cs.LG cs.CL eess.AS

keywords base, model, text-to-speech, abilities, data, hours, speech

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
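
The abstract says the speechcodes are compressed with byte-pair encoding but gives no further detail here. As a rough illustration only, the sketch below applies generic byte-pair encoding to toy sequences of integer codes: it repeatedly finds the most frequent adjacent pair and replaces it with a new code ID. The function names, the toy data, and the starting ID of 100 are all assumptions made for this example and are not taken from the paper.

```python
from collections import Counter


def most_frequent_pair(seqs):
    """Count adjacent code pairs across all sequences; return the most common one."""
    counts = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0] if counts else None


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seqs, num_merges, first_new_id):
    """Greedily learn up to `num_merges` merges; return (merges, compressed sequences)."""
    merges = []
    for step in range(num_merges):
        pair = most_frequent_pair(seqs)
        if pair is None:  # nothing left to merge
            break
        new_id = first_new_id + step
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
        merges.append((pair, new_id))
    return merges, seqs


# Hypothetical toy "speechcode" sequences (small integers stand in for codebook indices).
toy_codes = [[3, 7, 7, 2, 3, 7], [3, 7, 2, 2]]
merges, compressed = learn_bpe(toy_codes, num_merges=1, first_new_id=100)
print(merges)      # [((3, 7), 100)]
print(compressed)  # [[100, 7, 2, 100], [100, 2, 2]]
```

In this toy run the pair (3, 7) occurs three times, so it is merged first and the ten original codes shrink to seven. Shorter sequences are the general motivation for applying this kind of compression before an autoregressive model; how BASE TTS configures it is described in the paper itself, not here.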

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling
eess.AS 2026-06 unverdicted novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model
cs.SD 2026-06 unverdicted novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates l...
NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech
cs.CL 2026-06 unverdicted novelty 7.0

NüshuVoice releases the first sentence-level Nüshu TTS dataset and shows that an F0-conditioned VITS model using five-level pitch notation outperforms baselines on spectral fidelity, pitch accuracy, and intelligibility.
Is Natural Always Appropriate? Investigating Naturalness and Appropriateness Across Different Domains for TTS Evaluation
eess.AS 2026-06 unverdicted novelty 6.0

Appropriateness of TTS varies independently across domains while naturalness scores penalize stylized speech and reward spontaneity.
SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis
eess.AS 2026-05 unverdicted novelty 6.0

SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.
X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning
cs.SD 2026-05 unverdicted novelty 6.0

X-Voice achieves zero-shot cross-lingual voice cloning across 30 languages by using IPA as a unified phonetic representation and a two-stage training process that first generates its own audio prompts then fine-tunes ...
A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech
cs.SD 2026-04 unverdicted novelty 6.0

A framework detects speaker drift in TTS outputs by computing cosine similarities across speech segments and using LLMs for binary classification, supported by a human-validated synthetic benchmark.
Seed-TTS: A Family of High-Quality Versatile Speech Generation Models
eess.AS 2024-06 unverdicted novelty 6.0

Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.
X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning
cs.SD 2026-05 unverdicted novelty 5.0

X-Voice achieves zero-shot cross-lingual voice cloning across 30 languages via IPA-based training on 420K hours of data and a two-stage paradigm that synthesizes its own audio prompts for text-masked fine-tuning.
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
cs.SD 2024-12 unverdicted novelty 5.0

CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilin...