
CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang · 2024 · cs.SD · arXiv 2407.05407

Mixed citation behavior. The most common role is background, at 50% of classified citations.

73 Pith papers citing it


abstract

Recent years have witnessed a trend that large language model (LLM) based text-to-speech (TTS) emerges into the mainstream due to their high naturalness and zero-shot capacity. In this paradigm, speech signals are discretized into token sequences, which are modeled by an LLM with text as prompts and reconstructed by a token-based vocoder to waveforms. Obviously, speech tokens play a critical role in LLM-based TTS models. Current speech tokens are learned in an unsupervised manner, which lacks explicit semantic information and alignment to the text. In this paper, we propose to represent speech with supervised semantic tokens, which are derived from a multilingual speech recognition model by inserting vector quantization into the encoder. Based on the tokens, we further propose a scalable zero-shot TTS synthesizer, CosyVoice, which consists of an LLM for text-to-token generation and a conditional flow matching model for token-to-speech synthesis. Experimental results show that supervised semantic tokens significantly outperform existing unsupervised tokens in terms of content consistency and speaker similarity for zero-shot voice cloning. Moreover, we find that utilizing large-scale data further improves the synthesis performance, indicating the scalable capacity of CosyVoice. To the best of our knowledge, this is the first attempt to involve supervised speech tokens into TTS models.


citation-role summary

background 6 · method 4 · dataset 1 · extension 1
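The 50% background share quoted near the top of the page can be reproduced from these counts. The snippet below is only an illustrative check, assuming the four counts listed here are the complete set of classified citations; the variable names are hypothetical and not part of any Pith tooling.

```python
# Role counts copied from the citation-role summary above.
role_counts = {"background": 6, "method": 4, "dataset": 1, "extension": 1}

# Assumption: these four roles cover every classified citation.
total = sum(role_counts.values())

# Share of each role among the classified citations.
shares = {role: count / total for role, count in role_counts.items()}
top_role = max(shares, key=shares.get)

print(f"{top_role}: {shares[top_role]:.0%} of {total} classified citations")
```

Under that assumption the classified set has 12 citations, so the percentage describes those 12 rather than all 73 citing papers.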

citation-polarity summary

background 6 · use method 4 · extend 1 · use dataset 1

representative citing papers

The Watermark Shortcut: How Provenance Marking Sabotages Audio Deepfake Detection

cs.SD · 2026-06-22 · unverdicted · novelty 8.0

Watermarking only synthetic audio leads deepfake detectors to use the watermark as a spurious shortcut, causing generalization failure, evasion by removing watermarks, and false positives on watermarked real audio.

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine

eess.AS · 2026-06-17 · accept · novelty 7.0

A survey proposing an L0-L3 architectural hierarchy, T×I×R interaction ontology, and IDLE/LISTEN/SPEAK/WAIT/DUAL decision state machine for full-duplex spoken dialogue systems, documenting a realization gap between architectural potential and observed behavior due to training data limits.

HoliDubber: Holistic Video Dubbing for Complex Acoustic Scenes via Text-Guided Audio Synthesis

eess.AS · 2026-06-08 · unverdicted · novelty 7.0

HoliDubber introduces a patch-based autoregressive diffusion transformer for joint text-guided synthesis of speech and ambient audio in video dubbing, with a new benchmark showing outperformance over prior speech-only methods.

Native Audio-Visual Alignment for Generation

cs.CV · 2026-05-28 · unverdicted · novelty 7.0

NAVA proposes native audio-visual alignment via Align-then-Fuse MMDiT and Timbre-in-Context Conditioning for joint audio-video generation with improved synchronization and timbre control.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization

cs.SD · 2026-05-04 · unverdicted · novelty 7.0

A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

eess.AS · 2026-04-20 · unverdicted · novelty 7.0

MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

cs.SD · 2026-04-17 · unverdicted · novelty 7.0

AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency without training.

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

cs.SD · 2026-04-09 · unverdicted · novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

Hierarchical Decoding for Discrete Speech Synthesis with Multi-Resolution Spoof Detection

cs.SD · 2026-03-05 · unverdicted · novelty 7.0

MSpoof-TTS improves zero-shot discrete speech synthesis by integrating multi-resolution token-based spoof detection into a hierarchical decoding process that prunes low-quality candidates.

JUST-DUB-IT: Video Dubbing via Joint Audio-Visual Diffusion

cs.GR · 2026-01-29 · unverdicted · novelty 7.0

JUST-DUB-IT adapts a joint audio-visual diffusion model via LoRA to generate high-quality dubbed videos with translated audio and lip-synced facial motion.

TED-TTS: Training-Free Intra-Utterance Emotion and Duration Control for Text-to-Speech Synthesis

cs.SD · 2026-01-06 · unverdicted · novelty 7.0

A training-free framework for intra-utterance emotion and duration control in pretrained zero-shot TTS via segment-aware conditioning and steering strategies.

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

cs.CV · 2025-12-16 · unverdicted · novelty 7.0

ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.

VoiceBench: Benchmarking LLM-Based Voice Assistants

cs.CL · 2024-10-22 · unverdicted · novelty 7.0

VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.

Emo-LiPO: Listwise Preference Optimization for Fine-Grained Emotion Intensity Control in LLM-based Text-to-Speech

cs.SD · 2026-06-11 · unverdicted · novelty 6.0

Emo-LiPO applies listwise preference optimization to model global emotion intensity ordering in LLM TTS, yielding better accuracy and controllability than supervised or DPO baselines on a new multi-speaker dataset.

Benchmarking Neural Speech Compression from a Rate-Distortion Perspective

eess.AS · 2026-06-10 · unverdicted · novelty 6.0

ECC integrates hyperprior side information, channel-wise context, latent residual prediction, temporal modeling, and entropy skip into a learned entropy model, yielding 39.9% and 76.3% average BD-rate reductions on ViSQOL and PESQ over baselines.

EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

EmoInstruct-TTS uses Emotion2embed and an Instruction-Conditioned Emotion Flow Model (ICE-Flow) to generate acoustically grounded emotion representations from free-form instructions and integrate them into an LLM-based TTS pipeline.

TLDR: Compressing Audio Tokens for Efficient Autoregressive Text-to-Speech

cs.SD · 2026-06-08 · unverdicted · novelty 6.0

TLDR groups codec tokens into patches for patch-level autoregressive modeling in pretrained TTS systems, yielding 1.8x speedup and 75% KV-cache reduction at patch size 4.

UniVocal: Unified Speech-Singing Code-Switching Synthesis

cs.SD · 2026-06-01 · unverdicted · novelty 6.0

UniVocal presents a text-context-only framework for speech-singing code-switching synthesis via two-stage curriculum learning and a synthetic data pipeline, claiming SOTA on a new benchmark.

StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

cs.CV · 2026-05-25 · unverdicted · novelty 6.0

StreamChar decouples LLM-based orchestration from DiT denoising to achieve real-time long-horizon streaming character audio-video generation with reduced drift and misalignment.

citing papers explorer

Showing 9 of 9 citing papers after filters.

Native Audio-Visual Alignment for Generation

cs.CV · 2026-05-28 · unverdicted · none · ref 15 · internal anchor

NAVA proposes native audio-visual alignment via Align-then-Fuse MMDiT and Timbre-in-Context Conditioning for joint audio-video generation with improved synchronization and timbre control.

ViBES: A Conversational Agent with Behaviorally-Intelligent 3D Virtual Body

cs.CV · 2025-12-16 · unverdicted · none · ref 22 · internal anchor

ViBES introduces a speech-language-behavior model using modality-specific transformer experts that jointly generates dialogue and 3D body actions, showing gains over separate co-speech and text-to-motion baselines on multi-turn metrics.

StreamChar: Long-Horizon Streaming Character Audio-Video Generation with Decoupled Orchestration

cs.CV · 2026-05-25 · unverdicted · none · ref 7 · internal anchor

StreamChar decouples LLM-based orchestration from DiT denoising to achieve real-time long-horizon streaming character audio-video generation with reduced drift and misalignment.

JAM-Flow: Joint Audio-Motion Synthesis with Flow Matching

cs.CV · 2025-06-30 · unverdicted · none · ref 46 · internal anchor

JAM-Flow introduces a unified flow-matching model with a Multi-Modal Diffusion Transformer that jointly synthesizes facial motion and speech from text, audio, or motion inputs.

AgentSteerTTS: A Multi-Agent Closed-Loop Framework for Composite-Instruction Text-to-Speech

cs.CV · 2026-05-14 · unverdicted · none · ref 88 · internal anchor

AgentSteerTTS proposes a multi-agent framework with adversarial disentanglement, dual-stream anchoring via acoustic prototypes, and fast-slow feedback to achieve intent-faithful expressive TTS for composite instructions.

Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection

cs.CV · 2026-05-02 · unverdicted · none · ref 27 · internal anchor

Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.

MAVIN: Multi-Shot Audio-Visual Generation with Narrative Control

cs.CV · 2026-06-28 · unverdicted · none · ref 13 · internal anchor

MAVIN proposes boundary-aware attention, ID-aware propagation, a multi-agent scripting pipeline, and the MAVINSet dataset as the first framework for multi-shot audio-visual generation with narrative control, claiming SOTA results.

Toward Native Multimodal Modeling: A Roadmap

cs.CV · 2026-05-25 · unverdicted · none · ref 57 · internal anchor

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-based native multimodal modeling.

Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey

cs.CV · 2026-04-13 · unreviewed · ref 68 · internal anchor
