LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

Heiga Zen; Rob Clark; Ron J. Weiss; Viet Dang; Ye Jia; Yonghui Wu; Yu Zhang; Zhifeng Chen

arxiv: 1904.02882 · v1 · pith:MKRYBDAFnew · submitted 2019-04-05 · 💻 cs.SD · eess.AS



keywords: corpus · librispeech · libritts · speech · text-to-speech · derived · speakers · above


This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits the desirable properties of the LibriSpeech corpus while addressing a number of issues that make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at a 24 kHz sampling rate from 2,456 speakers, together with the corresponding texts. Experimental results show that neural end-to-end TTS models trained on the LibriTTS corpus achieved mean opinion scores above 4.0 in naturalness for five of the six evaluation speakers. The corpus is freely available for download from http://www.openslr.org/60/.
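To give a rough sense of scale, the figures quoted in the abstract can be turned into a total sample count and an average amount of speech per speaker. This is a back-of-the-envelope sketch only: it assumes single-channel audio and treats the rounded 585-hour figure as exact, and the function name `corpus_stats` is hypothetical, not part of any LibriTTS tooling.

```python
# Figures quoted in the abstract (the hour count is a rounded value).
TOTAL_HOURS = 585
SAMPLE_RATE_HZ = 24_000
NUM_SPEAKERS = 2_456


def corpus_stats(total_hours, sample_rate_hz, num_speakers):
    """Derive rough size figures; assumes mono audio (an assumption)."""
    total_seconds = total_hours * 3600
    return {
        # Number of audio samples across the whole corpus.
        "total_samples": total_seconds * sample_rate_hz,
        # Mean speech per speaker; the real distribution may be uneven.
        "avg_minutes_per_speaker": total_hours * 60 / num_speakers,
    }


stats = corpus_stats(TOTAL_HOURS, SAMPLE_RATE_HZ, NUM_SPEAKERS)
print(stats["total_samples"])                      # 50544000000
print(round(stats["avg_minutes_per_speaker"], 1))  # 14.3
```

The per-speaker value is only a mean; the abstract does not say how the 585 hours are distributed across the 2,456 speakers.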


Forward citations

Cited by 17 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
cs.SD 2026-05 unverdicted novelty 7.0

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
cs.CL 2026-05 unverdicted novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...

Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
cs.CL 2026-05 unverdicted novelty 7.0

TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
eess.AS 2026-03 unverdicted novelty 7.0

FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions
eess.AS 2026-05 unverdicted novelty 6.0

JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation
eess.AS 2026-04 unverdicted novelty 6.0

Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

Two-Dimensional Quantization for Geometry-Aware Audio Coding
cs.SD 2025-12 unverdicted novelty 6.0

Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
cs.CL 2025-09 unverdicted novelty 6.0

StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

MLS: A Large-Scale Multilingual Dataset for Speech Research
eess.AS 2020-12 accept novelty 6.0

MLS is a new large-scale multilingual speech corpus derived from LibriVox with 44.5k hours of English and 6k hours across seven other languages, plus baseline ASR and LM models.

Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
cs.CL 2026-05 unverdicted novelty 5.0

TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.

Enhancing Speech Large Language Models through Reinforced Behavior Alignment
cs.CL 2025-08 unverdicted novelty 5.0

Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken ...

Kimi-Audio Technical Report
eess.AS 2025-04 unverdicted novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
eess.AS 2024-10 unverdicted novelty 5.0

F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.

Speech bandwidth extension with WaveNet
eess.AS 2019-07 conditional novelty 5.0

WaveNet conditioned on log-mel spectrograms upsamples 8 kHz GSM-FR speech to 24 kHz and reaches perceptual quality close to 16 kHz AMR-WB in MUSHRA listening tests.

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
eess.AS 2026-03 unverdicted novelty 4.0

FLAIR enables simultaneous latent reasoning during speech input in full-duplex dialogue models via recursive latent embeddings and an ELBO-based training objective without added latency.

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach
eess.AS 2019-07 unverdicted novelty 3.0

A methodology is proposed for emotional text-to-speech using emotional data collection, transfer-learning-based annotation of expressiveness features, and fine-tuning of a neutral TTS model.

A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models
eess.AS 2026-05 unverdicted novelty 2.0

A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.