WenetSpeech4TTS: A 12,800-hour Mandarin TTS Corpus for Large Speech Generation Model Benchmark

4 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary: dataset 1

citation-polarity summary

verdicts: UNVERDICTED 4
roles: dataset 1
polarities: use dataset 1

representative citing papers

Kimi-Audio Technical Report

eess.AS · 2025-04-25 · unverdicted · novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.

citing papers explorer

  • VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing · cs.CL · 2026-05-07 · unverdicted · none · ref 116

    VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

  • FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation · eess.AS · 2026-06-08 · unverdicted · none · ref 36

    FlashTTS delivers a streaming TTS system using multi-track input processing and X-pred mean flow matching to reach 325 ms latency in two function evaluations while retaining zero-shot voice cloning.

  • Kimi-Audio Technical Report · eess.AS · 2025-04-25 · unverdicted · none · ref 50

    Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.

  • F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching · eess.AS · 2024-10-09 · unverdicted · none · ref 123

    F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours of multilingual data.