CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang · 2024 · cs.SD · arXiv 2407.05407

Mixed citation behavior. Most common role is background (50%).

70 Pith papers citing it

Background 50% of classified citations

abstract

Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models.

citation-role summary

background: 6 · method: 4 · dataset: 1 · extension: 1

citation-polarity summary

background: 6 · use method: 4 · extend: 1 · use dataset: 1

representative citing papers

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine

eess.AS · 2026-06-17 · accept · novelty 7.0

A survey proposing an L0-L3 architectural hierarchy, T×I×R interaction ontology, and IDLE/LISTEN/SPEAK/WAIT/DUAL decision state machine for full-duplex spoken dialogue systems, documenting a realization gap between architectural potential and observed behavior due to training data limits.

HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis

eess.AS · 2026-06-08 · unverdicted · novelty 7.0

HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.

Native Audio-Visual Alignment for Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

NAVA proposes native audio-visual alignment via Align-then-Fuse MMDiT and Timbre-in-Context Conditioning for joint audio-video generation with improved synchronization and timbre control.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

cs.SD · 2026-05-04 · unverdicted · novelty 7.0

A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

eess.AS · 2026-04-20 · unverdicted · novelty 7.0

MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

cs.SD · 2026-04-17 · unverdicted · novelty 7.0

AST enables seamless speech editing through latent recomposition on pre-trained TTS models and adaptive weak fact guidance. It also introduces a new dataset and a WDTW metric, and claims a 70% WER reduction and better temporal consistency without training.

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

cs.SD · 2026-04-09 · unverdicted · novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection

cs.SD · 2026-03-05 · unverdicted · novelty 7.0

MSpoof-TTS improves zero-shot discrete speech synthesis by integrating multi-resolution token-based spoof detection into a hierarchical decoding process that prunes low-quality candidates.

JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

cs.GR · 2026-01-29 · unverdicted · novelty 7.0

JUST-DUB-IT adapts a joint audio-visual diffusion model via LoRA to generate high-quality dubbed videos with translated audio and lip-synced facial motion.

TED-TTS: Training-Free Intra-Utterance Emotion and Duration Control for Text-to-Speech Synthesis

cs.SD · 2026-01-06 · unverdicted · novelty 7.0

A training-free framework for intra-utterance emotion and duration control in pretrained zero-shot TTS via segment-aware conditioning and steering strategies.

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

cs.CV · 2025-12-16 · unverdicted · novelty 7.0

ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.

VoiceBench: Benchmarking LLM-Based Voice Assistants

cs.CL · 2024-10-22 · unverdicted · novelty 7.0

VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.

Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech

cs.SD · 2026-06-11 · unverdicted · novelty 6.0

Emo-LiPO applies listwise preference optimization to model global emotion intensity ordering in LLM TTS, yielding better accuracy and controllability than supervised or DPO baselines on a new multi-speaker dataset.

Benchmarking Neural Speech Compression from a Rate-Distortion Perspective

eess.AS · 2026-06-10 · unverdicted · novelty 6.0

ECC integrates hyperprior side information, channel-wise context, latent residual prediction, temporal modeling, and entropy skip into a learned entropy model, yielding 39.9% and 76.3% average BD-rate reductions on ViSQOL and PESQ over baselines.

EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

EmoInstruct-TTS uses Emotion2embed and an Instruction-Conditioned Emotion Flow Model (ICE-Flow) to generate acoustically grounded emotion representations from free-form instructions and integrate them into an LLM-based TTS pipeline.

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

cs.SD · 2026-06-08 · unverdicted · novelty 6.0

TLDR groups codec tokens into patches for patch-level autoregressive modeling in pretrained TTS systems, yielding 1.8x speedup and 75% KV-cache reduction at patch size 4.

UniVocal: Unified Speech-Singing Code-Switching Synthesis

cs.SD · 2026-06-01 · unverdicted · novelty 6.0

UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.

StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

StreamChar decouples LLM-based orchestration from DiT denoising to achieve real-time long-horizon streaming character audio-video generation with reduced drift and misalignment.

Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception

cs.SD · 2026-05-21 · unverdicted · novelty 6.0

Large-scale listening study of 35,532 judgments finds human accuracy on real audio fell from 72.7% to 64.1% since 2021 while fake detection remained stable, indicating a skepticism shift toward genuine speech.

citing papers explorer

Showing 31 of 31 citing papers after filters.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model cs.SD · 2026-06-30 · unverdicted · none · ref 174 · internal anchor
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization cs.SD · 2026-05-04 · unverdicted · none · ref 2 · internal anchor
A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.
AST: Adaptive, Seamless, and Training-Free Precise Speech Editing cs.SD · 2026-04-17 · unverdicted · none · ref 17 · internal anchor
AST enables seamless speech editing through latent recomposition on pre-trained TTS models and adaptive weak fact guidance. It also introduces a new dataset and a WDTW metric, and claims a 70% WER reduction and better temporal consistency without training.
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation cs.SD · 2026-04-09 · unverdicted · none · ref 8 · internal anchor
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection cs.SD · 2026-03-05 · unverdicted · none · ref 8 · internal anchor
MSpoof-TTS improves zero-shot discrete speech synthesis by integrating multi-resolution token-based spoof detection into a hierarchical decoding process that prunes low-quality candidates.
TED-TTS: Training-Free Intra-Utterance Emotion and Duration Control for Text-to-Speech Synthesis cs.SD · 2026-01-06 · unverdicted · none · ref 1 · internal anchor
A training-free framework for intra-utterance emotion and duration control in pretrained zero-shot TTS via segment-aware conditioning and steering strategies.
Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech cs.SD · 2026-06-11 · unverdicted · none · ref 3 · internal anchor
Emo-LiPO applies listwise preference optimization to model global emotion intensity ordering in LLM TTS, yielding better accuracy and controllability than supervised or DPO baselines on a new multi-speaker dataset.
TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech cs.SD · 2026-06-08 · unverdicted · none · ref 9 · internal anchor
TLDR groups codec tokens into patches for patch-level autoregressive modeling in pretrained TTS systems, yielding 1.8x speedup and 75% KV-cache reduction at patch size 4.
UniVocal: Unified Speech-Singing Code-Switching Synthesis cs.SD · 2026-06-01 · unverdicted · none · ref 6 · internal anchor
UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.
Eroding Trust in Real Speech: A Large-Scale Study of Human Audio Deepfake Perception cs.SD · 2026-05-21 · unverdicted · none · ref 8 · internal anchor
Large-scale listening study of 35,532 judgments finds human accuracy on real audio fell from 72.7% to 64.1% since 2021 while fake detection remained stable, indicating a skepticism shift toward genuine speech.
RTCFake: Speech Deepfake Detection in Real-Time Communication cs.SD · 2026-04-26 · unverdicted · none · ref 6 · internal anchor
RTCFake is the first large-scale dataset of real-time communication speech deepfakes paired with offline versions, accompanied by a phoneme-guided consistency learning method that improves cross-platform and noise-robust detection.
FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts cs.SD · 2026-03-20 · unverdicted · none · ref 6 · internal anchor
FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.
Qwen3-TTS Technical Report cs.SD · 2026-01-22 · unverdicted · none · ref 5 · internal anchor
Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.
CoMelSinger: Discrete Token-Based Zero-Shot Singing Synthesis With Structured Melody Control and Guidance cs.SD · 2025-09-24 · unverdicted · none · ref 16 · internal anchor
CoMelSinger introduces a discrete token-based zero-shot SVS framework on MaskGCT with coarse-to-fine contrastive learning and an SVT module to improve melody control and reduce prosody leakage.
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training cs.SD · 2025-05-23 · unverdicted · none · ref 27 · internal anchor
CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward model.
Streaming T5-based Text-to-Speech Synthesis with Limited Lookahead cs.SD · 2026-06-20 · unverdicted · none · ref 55 · internal anchor
S5-TTS introduces a streaming T5-TTS variant with lookahead-causal masking and interleaved multi-source distillation that achieves comparable quality to full-context models while cutting end-to-end latency.
Zero-VC: Zero-Lookahead Streaming Voice Conversion via Speaker Anonymization cs.SD · 2026-06-18 · unverdicted · none · ref 34 · internal anchor
Zero-VC applies speaker anonymization as a perturbation to achieve strictly causal zero-lookahead streaming voice conversion by balancing timbre leakage against prosodic utility.
MagpieTTS-LF: Inference-Time Long-Form Speech Generation Without Training on Long-Form data cs.SD · 2026-06-16 · unverdicted · none · ref 17 · internal anchor
MagpieTTS-LF enables coherent long-form TTS via three inference-time innovations without any retraining on long-form data.
End-to-End Training for Discrete Token LLM based TTS System cs.SD · 2026-06-08 · unverdicted · none · ref 5 · internal anchor
An end-to-end optimization framework jointly trains the speech tokenizer, LLM, FM model, and reward model for discrete-token TTS, reporting new SOTA WER of 0.78% and 1.56% on Seed-TTS-Eval with 0.6B LLM and 0.5B FM.
VoxCPM2 Technical Report cs.SD · 2026-06-05 · unverdicted · none · ref 9 · internal anchor
VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.
UniVoice: A Unified Model for Speech and Singing Voice Generation cs.SD · 2026-06-04 · unverdicted · none · ref 3 · internal anchor
UniVoice is a conditional flow matching model with a Diffusion Transformer backbone that unifies TTS and SVS via modality-specific encoders and a null melody token for speech, achieving 5.26% speech PER and 16.22% singing PER.
ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing cs.SD · 2026-04-13 · unverdicted · none · ref 20 · internal anchor
ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models cs.SD · 2024-12-13 · unverdicted · none · ref 34 · internal anchor
CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.
Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis cs.SD · 2026-07-01 · unverdicted · none · ref 10 · internal anchor
Unified guidance framework for Flow Matching speech synthesis achieves nearly 3x faster inference and improved speaker similarity by combining heterogeneous data augmentation with intrinsic model guidance to eliminate CFG overhead.
SARA: A Dual-Stream VAE for High-Fidelity Speech Generation via Integrating Semantic and Acoustic Representations cs.SD · 2026-06-10 · unverdicted · none · ref 8 · internal anchor
SARA is a dual-stream VAE that integrates semantic and acoustic streams to achieve high-fidelity reconstruction and natural zero-shot TTS without complex regularizers.
EntangleCodec: A Unified Discrete Audio Tokenizer via Semantic-Acoustic Entanglement cs.SD · 2026-06-01 · unverdicted · none · ref 7 · internal anchor
EntangleCodec unifies semantic and acoustic audio tokenization via caption alignment and flow-matching decoding, reporting competitive reconstruction, +7.4% gains on MMAR understanding, and 0.6B-parameter ALMs surpassing 13B-parameter continuous baselines.
PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis cs.SD · 2026-05-26 · unverdicted · none · ref 4 · internal anchor
PilotTTS achieves the lowest WER (1.50%, en) and CER (0.87%, zh) as well as the highest speaker similarity on Seed-TTS Eval, using a Q-Former-conditioned autoregressive architecture and a released multi-stage open data pipeline.
ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis cs.SD · 2026-04-21 · unverdicted · none · ref 10 · internal anchor
ATRIE disentangles timbre and prosody in a Persona-Prosody Dual-Track model distilled from a large LLM to achieve strong identity preservation (EER 0.04) and emotional speech synthesis with SOTA results on an extended AnimeTTS-Bench.
AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan cs.SD · 2026-04-09 · unverdicted · none · ref 10 · internal anchor
AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.
Expressive Prompting: Improving Emotion Intensity and Speaker Consistency in Zero-Shot TTS cs.SD · 2024-09-27 · unverdicted · none · ref 7 · internal anchor
A two-stage static-then-dynamic prompt selection strategy using prosodic features, LLM coherence scores, and similarity metrics improves emotion intensity and speaker consistency in zero-shot TTS.
When Spoof Detectors Travel: Evaluation Across 66 Languages in the Low-Resource Language Spoofing Corpus cs.SD · 2026-03-02 · unreviewed · ref 50 · internal anchor
