arxiv: 2402.08093 · v2 · pith:WIUPLPKWnew · submitted 2024-02-12 · 💻 cs.LG · cs.CL · eess.AS

BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data

classification 💻 cs.LG · cs.CL · eess.AS
keywords base · model · text-to-speech · abilities · data · hours · speech

We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw text into discrete codes ("speechcodes"), followed by a convolution-based decoder that converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of large language models when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.
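
The abstract mentions compressing speechcodes with byte-pair encoding. As a rough illustration of that general idea only, the sketch below applies textbook byte-pair encoding to a sequence of integer codes. It is not the authors' tokenizer: the function names, the example codes, and the choice of 100 as the first merged ID are all hypothetical, and nothing here reflects the paper's actual vocabulary or merge schedule.

```python
from collections import Counter


def most_frequent_pair(seq):
    """Return the most common adjacent pair, or None if no pair occurs twice."""
    counts = Counter(zip(seq, seq[1:]))
    if not counts:
        return None
    pair, freq = counts.most_common(1)[0]
    return pair if freq > 1 else None


def merge_pair(seq, pair, new_id):
    """Replace each non-overlapping occurrence of `pair` with `new_id`."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def bpe_compress(codes, num_merges, first_new_id):
    """Greedily merge frequent adjacent pairs; return (sequence, merge table)."""
    seq, merges = list(codes), {}
    for step in range(num_merges):
        pair = most_frequent_pair(seq)
        if pair is None:
            break
        new_id = first_new_id + step
        merges[pair] = new_id
        seq = merge_pair(seq, pair, new_id)
    return seq, merges


def bpe_expand(seq, merges):
    """Invert bpe_compress by expanding merged IDs back into base codes."""
    inverse = {new_id: pair for pair, new_id in merges.items()}
    out, stack = [], list(reversed(seq))
    while stack:
        tok = stack.pop()
        if tok in inverse:
            left, right = inverse[tok]
            stack.append(right)  # pushed first so `left` is popped first
            stack.append(left)
        else:
            out.append(tok)
    return out


# Hypothetical discrete codes standing in for speechcodes.
codes = [5, 5, 7, 5, 5, 7, 5, 5, 9]
compressed, merges = bpe_compress(codes, num_merges=2, first_new_id=100)
restored = bpe_expand(compressed, merges)
```

In this toy run the compressed sequence is shorter than the input and expands back to it exactly, which is the property that makes such merging attractive for shortening the sequences an autoregressive model has to generate.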

Forward citations

Cited by 10 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

    eess.AS 2026-06 unverdicted novelty 8.0

    WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

  2. FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

    cs.SD 2026-06 unverdicted novelty 7.0

    FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates l...

  3. NüshuVoice: Reviving the Voice of Endangered Nüshu with Pitch-Aware Text-to-Speech

    cs.CL 2026-06 unverdicted novelty 7.0

    NüshuVoice releases the first sentence-level Nüshu TTS dataset and shows that an F0-conditioned VITS model using five-level pitch notation outperforms baselines on spectral fidelity, pitch accuracy, and intelligibility.

  4. Is Natural Always Appropriate? Investigating Naturalness and Appropriateness Across Different Domains for TTS Evaluation

    eess.AS 2026-06 unverdicted novelty 6.0

    Appropriateness of TTS varies independently across domains while naturalness scores penalize stylized speech and reward spontaneity.

  5. SemaVoice: Semantic-Aware Continuous Autoregressive Speech Synthesis

    eess.AS 2026-05 unverdicted novelty 6.0

    SemaVoice adds SFM-guided alignment to refine continuous speech representations in autoregressive TTS, reporting 1.71% English WER on Seed-TTS and competitiveness with open-source SOTA.

  6. X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

    cs.SD 2026-05 unverdicted novelty 6.0

    X-Voice achieves zero-shot cross-lingual voice cloning across 30 languages by using IPA as a unified phonetic representation and a two-stage training process that first generates its own audio prompts then fine-tunes ...

  7. A Novel Automatic Framework for Speaker Drift Detection in Synthesized Speech

    cs.SD 2026-04 unverdicted novelty 6.0

    A framework detects speaker drift in TTS outputs by computing cosine similarities across speech segments and using LLMs for binary classification, supported by a human-validated synthetic benchmark.

  8. Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    eess.AS 2024-06 unverdicted novelty 6.0

    Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.

  9. X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

    cs.SD 2026-05 unverdicted novelty 5.0

    X-Voice achieves zero-shot cross-lingual voice cloning across 30 languages via IPA-based training on 420K hours of data and a two-stage paradigm that synthesizes its own audio prompts for text-masked fine-tuning.

  10. CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    cs.SD 2024-12 unverdicted novelty 5.0

    CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilin...