citation dossier

Cosyvoice: A scalable multi- lingual zero-shot text-to-speech synthesizer based on supervised semantic tokens

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, Zhifu Gao, and Zhijie Yan · 2024 · arXiv 2407.05407

17Pith papers citing it

17reference links

cs.SDtop field · 8 papers

UNVERDICTEDtop verdict bucket · 17 papers

This arXiv-backed work is queued for full Pith review when it crosses the high-inbound sweep. That review runs reader · skeptic · desk-editor · referee · rebuttal · circularity · lean confirmation · RS check · pith extraction.

read on arXiv PDF

why this work matters in Pith

Pith has found this work in 17 reviewed papers. Its strongest current cluster is cs.SD (8 papers). The largest review-status bucket among citing papers is UNVERDICTED (17 papers). For highly cited works, this page shows a dossier first and a bounded explorer second; it never tries to render every citing paper at once.

representative citing papers

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Toward Fine-Grained Speech Inpainting Forensics:A Dataset, Method, and Metric for Multi-Region Tampering Localization

cs.SD · 2026-05-04 · unverdicted · novelty 7.0

A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

eess.AS · 2026-04-20 · unverdicted · novelty 7.0

MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

AST: Adaptive, Seamless, and Training-Free Precise Speech Editing

cs.SD · 2026-04-17 · unverdicted · novelty 7.0

AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency without training.

Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning

cs.LG · 2026-04-15 · unverdicted · novelty 7.0

RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.

CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

cs.SD · 2026-04-09 · unverdicted · novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

eess.AS · 2026-04-29 · unverdicted · novelty 6.0

Emotion embedding similarities are unsuitable for zero-shot evaluation of emotional expressiveness in speech generation due to confounding by non-emotional acoustic features.

RTCFake: Speech Deepfake Detection in Real-Time Communication

cs.SD · 2026-04-26 · unverdicted · novelty 6.0

RTCFake is the first large-scale dataset of real-time communication speech deepfakes paired with offline versions, paired with a phoneme-guided consistency learning method that improves cross-platform and noise-robust detection.

ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models

cs.CL · 2026-04-11 · unverdicted · novelty 6.0

ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.

RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations

eess.AS · 2026-05-10 · unverdicted · novelty 5.0

The RADAR Challenge 2026 provides a multilingual benchmark for audio deepfake detection under media transformations and finds that robust performance remains an open problem.

Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection

cs.CV · 2026-05-02 · unverdicted · novelty 5.0

Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.

ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

cs.SD · 2026-04-13 · unverdicted · novelty 5.0

ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.

CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

cs.SD · 2024-12-13 · unverdicted · novelty 5.0

CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.

ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis

cs.SD · 2026-04-21 · unverdicted · novelty 4.0

ATRIE disentangles timbre and prosody in a Persona-Prosody Dual-Track model distilled from a large LLM to achieve strong identity preservation (EER 0.04) and emotional speech synthesis with SOTA results on an extended AnimeTTS-Bench.

Empowering Video Translation using Multimodal Large Language Models

cs.CV · 2026-04-13 · unverdicted · novelty 4.0

The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.

AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan

cs.SD · 2026-04-09 · unverdicted · novelty 3.0

AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.

citing papers explorer

Showing 17 of 17 citing papers.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech eess.AS · 2026-05-10 · unverdicted · none · ref 18
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing cs.CL · 2026-05-07 · unverdicted · none · ref 41
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
Toward Fine-Grained Speech Inpainting Forensics:A Dataset, Method, and Metric for Multi-Region Tampering Localization cs.SD · 2026-05-04 · unverdicted · none · ref 2
A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.
MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech eess.AS · 2026-04-20 · unverdicted · none · ref 6
MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.
AST: Adaptive, Seamless, and Training-Free Precise Speech Editing cs.SD · 2026-04-17 · unverdicted · none · ref 17
AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency without training.
Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning cs.LG · 2026-04-15 · unverdicted · none · ref 10
RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation cs.SD · 2026-04-09 · unverdicted · none · ref 8
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation eess.AS · 2026-04-29 · unverdicted · none · ref 58
Emotion embedding similarities are unsuitable for zero-shot evaluation of emotional expressiveness in speech generation due to confounding by non-emotional acoustic features.
RTCFake: Speech Deepfake Detection in Real-Time Communication cs.SD · 2026-04-26 · unverdicted · none · ref 6
RTCFake is the first large-scale dataset of real-time communication speech deepfakes paired with offline versions, paired with a phoneme-guided consistency learning method that improves cross-platform and noise-robust detection.
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models cs.CL · 2026-04-11 · unverdicted · none · ref 24
ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations eess.AS · 2026-05-10 · unverdicted · none · ref 15
The RADAR Challenge 2026 provides a multilingual benchmark for audio deepfake detection under media transformations and finds that robust performance remains an open problem.
Omni-Fake: Benchmarking Unified Multimodal Social Media Deepfake Detection cs.CV · 2026-05-02 · unverdicted · none · ref 27
Omni-Fake delivers a unified multimodal deepfake benchmark dataset and RL-driven detector that reports gains in accuracy, cross-modal generalization, and explainability over prior baselines.
ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing cs.SD · 2026-04-13 · unverdicted · none · ref 20
ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models cs.SD · 2024-12-13 · unverdicted · none · ref 34
CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.
ATRIE: Adaptive Tuning for Robust Inference and Emotion in Persona-Driven Speech Synthesis cs.SD · 2026-04-21 · unverdicted · none · ref 10
ATRIE disentangles timbre and prosody in a Persona-Prosody Dual-Track model distilled from a large LLM to achieve strong identity preservation (EER 0.04) and emotional speech synthesis with SOTA results on an extended AnimeTTS-Bench.
Empowering Video Translation using Multimodal Large Language Models cs.CV · 2026-04-13 · unverdicted · none · ref 68
The paper offers the first focused review of MLLM-based video translation organized by a three-role taxonomy of Semantic Reasoner, Expressive Performer, and Visual Synthesizer, plus open challenges.
AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan cs.SD · 2026-04-09 · unverdicted · none · ref 10
AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.

Cosyvoice: A scalable multi- lingual zero-shot text-to-speech synthesizer based on supervised semantic tokens

why this work matters in Pith

fields

years

verdicts

representative citing papers

citing papers explorer