hub Tool reference

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

Libritts: A corpus derived from librispeech for text-to-speech , author= · 2019 · cs.SD · arXiv 1904.02882

Tool reference. 80% of classified Pith citations use this work as a method, library, or software dependency, not as a substantive claim.

23 Pith papers citing it

Method reference 80% of classified citations

open full Pith review browse 23 citing papers arXiv PDF

abstract

This paper introduces a new speech corpus called "LibriTTS" designed for text-to-speech use. It is derived from the original audio and text materials of the LibriSpeech corpus, which has been used for training and evaluating automatic speech recognition systems. The new corpus inherits desired properties of the LibriSpeech corpus while addressing a number of issues which make LibriSpeech less than ideal for text-to-speech work. The released corpus consists of 585 hours of speech data at 24kHz sampling rate from 2,456 speakers and the corresponding texts. Experimental results show that neural end-to-end TTS models trained from the LibriTTS corpus achieved above 4.0 in mean opinion scores in naturalness in five out of six evaluation speakers. The corpus is freely available for download from http://www.openslr.org/60/.

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 3 background 1 method 1

citation-polarity summary

use dataset 3 background 1 use method 1

representative citing papers

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

eess.AS · 2026-06-01 · unverdicted · novelty 7.0

SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show no model excels across dimensions and compositional editing is especially difficult

AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

cs.SD · 2026-05-11 · unverdicted · novelty 7.0

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

UniVocal: Unified Speech-Singing Code-Switching Synthesis

cs.SD · 2026-06-01 · unverdicted · novelty 6.0

UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.

Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction

eess.AS · 2026-05-25 · unverdicted · novelty 6.0

FMelCodec is a three-stage mel-spectrogram codec using 640x VQ compression, conditional flow matching refinement, and HiFi-GAN reconstruction that reports higher quality than prior methods at 250 bps for 16 kHz speech.

Two-Dimensional Quantization for Geometry-Aware Audio Coding

cs.SD · 2025-12-01 · unverdicted · novelty 6.0

Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.

StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs

cs.CL · 2025-09-26 · unverdicted · novelty 6.0

StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.

MLS: A Large-Scale Multilingual Dataset for Speech Research

eess.AS · 2020-12-07 · accept · novelty 6.0

MLS is a new large-scale multilingual speech corpus derived from LibriVox with 44.5k hours of English and 6k hours across seven other languages, plus baseline ASR and LM models.

JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions

eess.AS · 2026-05-06 · unverdicted · novelty 6.0

JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.

Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation

eess.AS · 2026-04-21 · unverdicted · novelty 6.0

Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

eess.AS · 2026-05-29 · unverdicted · novelty 5.0

ImmersiveTTS proposes an environment-aware TTS system that integrates speech with environmental audio via multimodal diffusion transformer, joint attention, and domain-specific representation alignment, claiming superior naturalness and fidelity.

Enhancing Speech Large Language Models through Reinforced Behavior Alignment

cs.CL · 2025-08-25 · unverdicted · novelty 5.0

Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text translation benchmarks.

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

eess.AS · 2024-10-09 · unverdicted · novelty 5.0

F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.

Speech bandwidth extension with WaveNet

eess.AS · 2019-07-05 · conditional · novelty 5.0

WaveNet conditioned on log-mel spectrograms upsamples 8 kHz GSM-FR speech to 24 kHz and reaches perceptual quality close to 16 kHz AMR-WB in MUSHRA listening tests.

Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM

cs.CL · 2026-05-07 · unverdicted · novelty 5.0 · 2 refs

TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.

Kimi-Audio Technical Report

eess.AS · 2025-04-25 · unverdicted · novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.

Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach

cs.CV · 2026-05-27 · unverdicted · novelty 4.0

Proposes cross-attention audio-video fusion and VE-MD latent-space models for group emotion recognition that avoid individual cues and report competitive performance via ablation studies on synthetic and real data.

I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors

eess.AS · 2026-05-27 · unverdicted · novelty 4.0

An experiment shows humans detect fully synthetic speech below chance but exhibit implicit discrimination via quality ratings in a localization task with trust cue manipulations.

Toward Native Multimodal Modeling: A Roadmap

cs.CV · 2026-05-25 · unverdicted · novelty 3.0

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach

eess.AS · 2019-07-05 · unverdicted · novelty 3.0

A methodology is proposed for emotional text-to-speech using emotional data collection, transfer-learning-based annotation of expressiveness features, and fine-tuning of a neutral TTS model.

A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models

eess.AS · 2026-05-15 · unverdicted · novelty 2.0

A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning

eess.AS · 2026-03-18 · 2 refs

citing papers explorer

Showing 23 of 23 citing papers.

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling eess.AS · 2026-06-02 · unverdicted · none · ref 99 · internal anchor
WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing eess.AS · 2026-06-01 · unverdicted · none · ref 35 · internal anchor
SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show no model excels across dimensions and compositional editing is especially difficult
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling cs.SD · 2026-05-11 · unverdicted · none · ref 3
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing cs.CL · 2026-05-07 · unverdicted · none · ref 117
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
UniVocal: Unified Speech-Singing Code-Switching Synthesis cs.SD · 2026-06-01 · unverdicted · none · ref 80 · internal anchor
UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.
Ultra-Low-Bitrate Mel-Spectrogram-based Neural Speech Coding with Flow-Matching-based Refinement and Vocoding-driven Reconstruction eess.AS · 2026-05-25 · unverdicted · none · ref 47 · internal anchor
FMelCodec is a three-stage mel-spectrogram codec using 640x VQ compression, conditional flow matching refinement, and HiFi-GAN reconstruction that reports higher quality than prior methods at 250 bps for 16 kHz speech.
Two-Dimensional Quantization for Geometry-Aware Audio Coding cs.SD · 2025-12-01 · unverdicted · none · ref 79 · internal anchor
Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs cs.CL · 2025-09-26 · unverdicted · none · ref 83 · internal anchor
StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
MLS: A Large-Scale Multilingual Dataset for Speech Research eess.AS · 2020-12-07 · accept · none · ref 25 · internal anchor
MLS is a new large-scale multilingual speech corpus derived from LibriVox with 44.5k hours of English and 6k hours across seven other languages, plus baseline ASR and LM models.
JASTIN: Aligning LLMs for Zero-Shot Audio and Speech Evaluation via Natural Language Instructions eess.AS · 2026-05-06 · unverdicted · none · ref 40
JASTIN is an instruction-driven audio evaluation system that achieves state-of-the-art correlation with human ratings on speech, sound, music, and out-of-domain tasks without task-specific retraining.
Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation eess.AS · 2026-04-21 · unverdicted · none · ref 44
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment eess.AS · 2026-05-29 · unverdicted · none · ref 71 · internal anchor
ImmersiveTTS proposes an environment-aware TTS system that integrates speech with environmental audio via multimodal diffusion transformer, joint attention, and domain-specific representation alignment, claiming superior naturalness and fidelity.
Enhancing Speech Large Language Models through Reinforced Behavior Alignment cs.CL · 2025-08-25 · unverdicted · none · ref 56 · internal anchor
Reinforced Behavior Alignment (RBA) uses self-synthesized data from a teacher LLM and reinforcement learning to close the instruction-following gap in SpeechLMs, outperforming distillation and reaching SOTA on spoken QA and speech-to-text translation benchmarks.
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching eess.AS · 2024-10-09 · unverdicted · none · ref 153 · internal anchor
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.
Speech bandwidth extension with WaveNet eess.AS · 2019-07-05 · conditional · none · ref 21 · internal anchor
WaveNet conditioned on log-mel spectrograms upsamples 8 kHz GSM-FR speech to 24 kHz and reaches perceptual quality close to 16 kHz AMR-WB in MUSHRA listening tests.
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM cs.CL · 2026-05-07 · unverdicted · none · ref 57 · 2 links
TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
Kimi-Audio Technical Report eess.AS · 2025-04-25 · unverdicted · none · ref 83
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
Multimodal Group Emotion Recognition In-the-Wild Towards a Privacy-Safe Non-Individual Approach cs.CV · 2026-05-27 · unverdicted · none · ref 252 · internal anchor
Proposes cross-attention audio-video fusion and VE-MD latent-space models for group emotion recognition that avoid individual cues and report competitive performance via ablation studies on synthetic and real data.
I Hear, Therefore I Trust: A Socio-Technical Investigation of Humans as Synthetic Speech Detectors eess.AS · 2026-05-27 · unverdicted · none · ref 47 · internal anchor
An experiment shows humans detect fully synthetic speech below chance but exhibit implicit discrimination via quality ratings in a localization task with trust cue manipulations.
Toward Native Multimodal Modeling: A Roadmap cs.CV · 2026-05-25 · unverdicted · none · ref 116 · internal anchor
A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.
A Methodology for Controlling the Emotional Expressiveness in Synthetic Speech -- a Deep Learning approach eess.AS · 2019-07-05 · unverdicted · none · ref 18 · internal anchor
A methodology is proposed for emotional text-to-speech using emotional data collection, transfer-learning-based annotation of expressiveness features, and fine-tuning of a neutral TTS model.
A Survey of Advancing Audio Super-Resolution and Bandwidth Extension from Discriminative to Generative Models eess.AS · 2026-05-15 · unverdicted · none · ref 68 · internal anchor
A structured survey of audio bandwidth extension that organizes the transition from deterministic discriminative DNNs to generative approaches including GANs, diffusion models, and flow-based methods.
The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning eess.AS · 2026-03-18 · unreviewed · ref 39 · 2 links · internal anchor

LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer