Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model benchmark

· 2024 · arXiv 2406.05763

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

read on arXiv browse 4 citing papers

citation-role summary

dataset 1

citation-polarity summary

use dataset 1

representative citing papers

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation

eess.AS · 2026-06-08 · unverdicted · novelty 5.0

FlashTTS delivers a streaming TTS system using multi-track input processing and X-pred mean flow matching to reach 325 ms latency in two function evaluations while retaining zero-shot voice cloning.

Kimi-Audio Technical Report

eess.AS · 2025-04-25 · unverdicted · novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.

F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

eess.AS · 2024-10-09 · unverdicted · novelty 5.0

F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.

citing papers explorer

Showing 4 of 4 citing papers.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing cs.CL · 2026-05-07 · unverdicted · none · ref 116
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation eess.AS · 2026-06-08 · unverdicted · none · ref 36
FlashTTS delivers a streaming TTS system using multi-track input processing and X-pred mean flow matching to reach 325 ms latency in two function evaluations while retaining zero-shot voice cloning.
Kimi-Audio Technical Report eess.AS · 2025-04-25 · unverdicted · none · ref 50
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching eess.AS · 2024-10-09 · unverdicted · none · ref 123
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.

Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model benchmark

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer