AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
MaskGCT: Zero-shot text-to-speech with masked generative codec transformer
15 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
HiCoDiT generates speech from video by conditioning low-level RVQ tokens on speaker identity and high-level tokens on facial expressions via a dual-scale normalized diffusion transformer.
TokenChain demonstrates that a discrete semantic-token interface can sustain effective chain learning between ASR and TTS, yielding faster convergence and lower error rates on LibriSpeech and TED-LIUM.
The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.
UniSonate unifies text-to-speech, text-to-music, and text-to-audio in a flow-matching framework with dynamic token injection and curriculum learning, reporting SOTA TTS and TTM results plus positive cross-task transfer.
Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.
MimicLM achieves better naturalness in zero-shot voice imitation by autoregressively modeling pseudo-parallel data with synthetic sources and real targets, plus interleaved text-audio guidance and preference alignment.
Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.
Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.
CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward model.
A singing voice conversion system with boundary-aware information bottleneck and high-frequency augmentation achieves the best naturalness in SVCC2025 subjective tests while using less extra data than competitors.
CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.
F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours of multilingual data.