RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis. CoRR, abs/2404.03204
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background (1)
citation-polarity summary: background (1)
verdicts: UNVERDICTED (3)

representative citing papers
-
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward model.
-
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.
-
F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours of multilingual data.