Recognition: 3 Lean theorem links
CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training
Pith reviewed 2026-05-16 05:22 UTC · model grok-4.3
The pith
CosyVoice 3 improves zero-shot multilingual speech synthesis by scaling training data to one million hours and model size to 1.5 billion parameters, and by adding a supervised multi-task speech tokenizer and a differentiable reward model for post-training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CosyVoice 3 surpasses its predecessor in content consistency, speaker similarity, and prosody naturalness for zero-shot multilingual speech synthesis in the wild. The claimed gains come from dataset scaling to one million hours, model scaling to 1.5 billion parameters, a supervised multi-task speech tokenizer covering automatic speech recognition, emotion recognition, language identification, audio event detection, and speaker analysis, and a new differentiable reward model for post-training.
What carries the argument
The supervised multi-task speech tokenizer, trained jointly on automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis, supplies richer conditioning signals for prosody and consistency during generation. A minimal sketch of what such a joint objective could look like follows.
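The review names the five supervision tasks but not the tokenizer's architecture or objective. Below is one plausible joint-training setup, assuming a shared encoder with per-task heads and a weighted sum of losses; every dimension, head size, and weight is a hypothetical illustration, not the paper's design.

```python
# Hypothetical multi-task tokenizer encoder: one shared representation,
# five supervision heads (ASR via CTC; SER/LID/AED/speaker via pooled
# classification). All sizes are placeholders, not CosyVoice 3's values.
import torch
import torch.nn as nn

class MultiTaskTokenizerEncoder(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab=5000,
                 n_emotions=8, n_langs=9, n_events=50, n_speakers=1000):
        super().__init__()
        self.proj = nn.Linear(n_mels, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=4)
        self.asr_head = nn.Linear(d_model, vocab)       # frame-level logits for CTC
        self.cls_heads = nn.ModuleDict({
            "ser": nn.Linear(d_model, n_emotions),      # speech emotion recognition
            "lid": nn.Linear(d_model, n_langs),         # language identification
            "aed": nn.Linear(d_model, n_events),        # audio event detection
            "spk": nn.Linear(d_model, n_speakers),      # speaker analysis
        })

    def forward(self, mels):                            # mels: (B, T, n_mels)
        h = self.encoder(self.proj(mels))               # shared frames (B, T, d)
        pooled = h.mean(dim=1)                          # utterance-level pooling
        out = {"asr": self.asr_head(h)}
        out.update({k: head(pooled) for k, head in self.cls_heads.items()})
        return out

def joint_loss(out, tgt, w):
    """Weighted sum of per-task losses; the weights w are free choices here."""
    ctc, ce = nn.CTCLoss(zero_infinity=True), nn.CrossEntropyLoss()
    log_probs = out["asr"].log_softmax(-1).transpose(0, 1)  # (T, B, vocab)
    loss = w["asr"] * ctc(log_probs, tgt["text"], tgt["in_len"], tgt["out_len"])
    for task in ("ser", "lid", "aed", "spk"):
        loss = loss + w[task] * ce(out[task], tgt[task])
    return loss
```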
If this is right
- The differentiable reward model can be reused to post-train other LLM-based speech synthesis systems (a sketch of one possible mechanism follows this list).
- Larger data and model scales support synthesis across more domains and text formats while maintaining low-latency streaming.
- The multi-task tokenizer enables better handling of prosody variation in zero-shot scenarios without explicit style references.
- Performance gains appear on benchmarks covering nine languages and eighteen dialects under diverse acoustic conditions.
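The abstract calls the reward model differentiable and applicable to other LLM-based synthesizers, but gives no mechanism. The sketch below shows one way a differentiable reward could back-propagate into a token-level synthesizer; the Gumbel-softmax relaxation and the toy reward network are our assumptions, not the paper's method.

```python
# Hedged sketch of differentiable-reward post-training. A relaxed (soft)
# token sample keeps the computation graph connected, so reward gradients
# reach the synthesizer's logits directly, without policy-gradient
# estimators. All shapes and the reward architecture are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardModel(nn.Module):
    """Scores a (soft) speech-token sequence; higher = better."""
    def __init__(self, vocab=4096, d=128):
        super().__init__()
        self.embed = nn.Linear(vocab, d)   # accepts soft one-hot rows
        self.score = nn.Linear(d, 1)

    def forward(self, soft_tokens):        # soft_tokens: (B, T, vocab)
        h = self.embed(soft_tokens).mean(dim=1)
        return self.score(h).squeeze(-1)   # (B,)

def post_training_loss(tts_logits, reward_model, tau=1.0):
    # tts_logits: (B, T, vocab) from the LLM-based synthesizer.
    soft_tokens = F.gumbel_softmax(tts_logits, tau=tau, hard=False)
    return -reward_model(soft_tokens).mean()  # minimize negative reward
```

Because the relaxed sample stays differentiable, the same reward network could in principle post-train any synthesizer that emits token logits, which is what would make it reusable across LLM-based systems.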
Where Pith is reading between the lines
- The post-training reward approach may transfer to other audio generation tasks such as music or sound effects.
- Further scaling beyond one million hours could, if compute allows, continue to improve robustness without introducing new failure modes.
- The tokenizer's multi-task design suggests similar joint training could benefit related tasks like speech enhancement or diarization.
Load-bearing premise
That scaling data volume and model size, together with the new tokenizer and reward model, will deliver the reported gains in consistency and naturalness without overfitting or reduced generalization across unseen real-world conditions.
What would settle it
A controlled evaluation on a held-out set of multilingual wild recordings showing no statistically significant gains or outright drops in content consistency, speaker similarity, or prosody naturalness scores compared to CosyVoice 2.
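One way to make "statistically significant" operational, sketched under our own assumptions (paired per-utterance scores, a Wilcoxon signed-rank test, and a conventional threshold, none of which the review specifies):

```python
# Hypothetical settling test: paired per-utterance scores for CosyVoice 2
# vs. CosyVoice 3 on held-out in-the-wild recordings. The choice of test
# and alpha are assumptions for illustration.
import numpy as np
from scipy.stats import wilcoxon

def v3_significantly_better(scores_v2, scores_v3, alpha=0.05):
    """For metrics where higher is better (MOS, speaker similarity);
    negate both arrays first for error rates such as WER."""
    _, p = wilcoxon(np.asarray(scores_v3), np.asarray(scores_v2),
                    alternative="greater")
    return p < alpha
```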
read the original abstract
In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness. Key features of CosyVoice 3 include: 1) A novel speech tokenizer to improve prosody naturalness, developed via supervised multi-task training, including automatic speech recognition, speech emotion recognition, language identification, audio event detection, and speaker analysis. 2) A new differentiable reward model for post-training applicable not only to CosyVoice 3 but also to other LLM-based speech synthesis models. 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours, encompassing 9 languages and 18 Chinese dialects across various domains and text formats. 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion, resulting in enhanced performance on our multilingual benchmark due to the larger model capacity. These advancements contribute significantly to the progress of speech synthesis in the wild. We encourage readers to listen to the demo at https://funaudiollm.github.io/cosyvoice3.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents CosyVoice 3 as an advancement over CosyVoice 2 for zero-shot multilingual speech synthesis in the wild. It claims superior content consistency, speaker similarity, and prosody naturalness through four main contributions: a novel multi-task speech tokenizer trained on ASR, SER, LID, AED, and speaker analysis tasks; a differentiable reward model for post-training; scaling training data from 10k to 1M hours across 9 languages and 18 Chinese dialects; and scaling model parameters from 0.5B to 1.5B. The work emphasizes applicability to in-the-wild conditions and provides a demo link for subjective evaluation.
Significance. If the claimed gains are confirmed through rigorous, quantitative benchmarks with ablations, this work would meaningfully advance scalable LLM-based speech synthesis by demonstrating the benefits of combined data/model scaling and targeted post-training components. The introduction of a reusable differentiable reward model and the multi-task tokenizer represent potentially reusable contributions for the field.
major comments (2)
- [Abstract] The central claim that CosyVoice 3 surpasses CosyVoice 2 in content consistency, speaker similarity, and prosody naturalness is stated without any quantitative metrics, baseline comparisons, error bars, or ablation results. This absence makes it impossible to assess whether the reported improvements are load-bearing or statistically meaningful, directly undermining evaluation of the scaling and post-training contributions.
- [Methods and Experiments] The manuscript describes the multi-task tokenizer and differentiable reward model as key innovations but provides no details on how these components were validated against overfitting risks when scaling to 1M hours and 1.5B parameters. Specific ablation tables isolating the contribution of each (e.g., tokenizer vs. reward model vs. scale) are required to support the weakest assumption that the combination yields generalization gains in diverse real-world conditions.
minor comments (2)
- [Abstract] The abstract and introduction would benefit from explicit cross-references to the specific benchmark datasets and evaluation protocols used for the multilingual results.
- [Demo and Figures] Figure captions and demo descriptions should clarify which audio samples correspond to zero-shot vs. few-shot conditions to aid reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point-by-point below. Where the comments identify gaps in quantitative support and validation details, we have prepared revisions to strengthen the manuscript.
read point-by-point responses
- Referee: [Abstract] The central claim that CosyVoice 3 surpasses CosyVoice 2 in content consistency, speaker similarity, and prosody naturalness is stated without any quantitative metrics, baseline comparisons, error bars, or ablation results. This absence makes it impossible to assess whether the reported improvements are load-bearing or statistically meaningful, directly undermining evaluation of the scaling and post-training contributions.
  Authors: We agree that the abstract would be strengthened by including key quantitative results. In the revised manuscript we will update the abstract to report specific metrics, including WER reductions for content consistency, speaker embedding cosine similarity scores, and MOS improvements for prosody naturalness, with direct comparisons to CosyVoice 2. We will also reference the corresponding tables and note any available error bars from repeated evaluations. (Revision: yes; a sketch of the speaker-similarity metric follows these responses.)
- Referee: [Methods and Experiments] The manuscript describes the multi-task tokenizer and differentiable reward model as key innovations but provides no details on how these components were validated against overfitting risks when scaling to 1M hours and 1.5B parameters. Specific ablation tables isolating the contribution of each (e.g., tokenizer vs. reward model vs. scale) are required to support the weakest assumption that the combination yields generalization gains in diverse real-world conditions.
  Authors: We acknowledge the need for explicit validation details and finer-grained ablations. The current manuscript contains initial ablation results in the Experiments section, but to address overfitting concerns at scale we will add a new paragraph in Methods describing our procedures: use of a large held-out in-the-wild validation set, monitoring of training versus validation loss curves, and regularization techniques applied during the 1M-hour training. We will also expand the ablation tables to isolate the individual contributions of the multi-task tokenizer, differentiable reward model, data scaling, and model scaling, reporting results across the nine languages and eighteen dialects under diverse real-world conditions. (Revision: yes)
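For concreteness, the speaker-similarity metric named in the first response is conventionally computed as the cosine similarity between speaker embeddings of the reference and the synthesized audio. A minimal sketch; the 192-dimensional embeddings and the verification model that would produce them are assumptions, since the responses do not specify them:

```python
# Hypothetical illustration of a speaker-similarity score: cosine similarity
# between speaker embeddings of reference and synthesized utterances.
# The embedding dimensionality (192) and its source model are assumed here.
import torch
import torch.nn.functional as F

def speaker_similarity(emb_ref: torch.Tensor, emb_syn: torch.Tensor) -> float:
    """Returns a value in [-1, 1]; higher means closer timbre."""
    return F.cosine_similarity(emb_ref, emb_syn, dim=-1).item()

# Usage with placeholder embeddings (random stand-ins for a real verifier's output):
ref, syn = torch.randn(192), torch.randn(192)
print(f"speaker similarity: {speaker_similarity(ref, syn):.3f}")
```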
Circularity Check
Minor self-citation to CosyVoice 2 provides context but does not reduce central claims to inputs.
specific steps
- Self-citation load-bearing? [Abstract]
  "In our prior works, we introduced a scalable streaming speech synthesis model, CosyVoice 2, which integrates a large language model (LLM) and a chunk-aware flow matching (FM) model, and achieves low-latency bi-streaming speech synthesis and human-parity quality. Despite these advancements, CosyVoice 2 exhibits limitations in language coverage, domain diversity, data volume, text formats, and post-training techniques. In this paper, we present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency..."
  The central positioning of CosyVoice 3 as surpassing CosyVoice 2 relies on a self-citation to prior work by overlapping authors, but the paper then introduces independent new elements (multi-task tokenizer, differentiable reward model, explicit scaling) whose performance gains are not derived from the cited model by definition or fit. The citation supplies only historical context rather than a load-bearing uniqueness theorem or ansatz that collapses the new results.
full rationale
The paper's claims rest on explicit new components (multi-task tokenizer via supervised training on ASR/SER/LID/AED/speaker tasks, differentiable reward model for post-training, data scaling to 1M hours across 9 languages/18 dialects, model scaling to 1.5B parameters) whose effects are described as independent increments over the prior CosyVoice 2 architecture. No equation, prediction, or uniqueness result is shown to reduce by construction to a fitted parameter or to the self-cited predecessor. The self-citation is limited to background and does not serve as the sole justification for the reported gains in consistency, similarity, or naturalness; external benchmarks and listening demos are referenced as validation. This yields a low but non-zero circularity score for the normal incremental self-reference.
Axiom & Free-Parameter Ledger
free parameters (2)
- model parameter count = 1.5 billion
- training data volume = one million hours
axioms (2)
- domain assumption: Scaling laws observed in language models also apply to speech synthesis models (a toy power-law fit illustrating this assumption follows the ledger).
- ad hoc to paper: Multi-task supervised training on ASR, SER, LID, AED, and speaker analysis produces a tokenizer that improves prosody naturalness.
invented entities (2)
- differentiable reward model (no independent evidence)
- novel speech tokenizer (no independent evidence)
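The domain assumption above treats quality as a smooth function of scale. A toy power-law fit illustrating that axiom; the data points, functional form, and initial guesses below are entirely hypothetical, not measurements from the paper:

```python
# Hypothetical scaling-law illustration: fit loss(N) = a * N**(-b) + c to
# made-up (training-hours, validation-loss) pairs. Nothing here is measured
# data from CosyVoice 3; it only shows the shape of the assumed relationship.
import numpy as np
from scipy.optimize import curve_fit

hours = np.array([1e4, 5e4, 2e5, 1e6])      # hypothetical training-data scales
loss = np.array([0.82, 0.64, 0.51, 0.42])   # hypothetical validation losses

def power_law(n, a, b, c):
    return a * n ** (-b) + c

(a, b, c), _ = curve_fit(power_law, hours, loss, p0=(5.0, 0.3, 0.2))
print(f"fitted exponent b = {b:.2f}; "
      f"extrapolated loss at 2M hours = {power_law(2e6, a, b, c):.3f}")
```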
Lean theorems connected to this paper
- Cost.FunctionalEquation washburn_uniqueness_aczel (tagged unclear: relation between the paper passage and the cited Recognition theorem)
  "Key features of CosyVoice 3 include: 1) A novel speech tokenizer... 2) A new differentiable reward model... 3) Dataset Size Scaling: Training data is expanded from ten thousand hours to one million hours... 4) Model Size Scaling: Model parameters are increased from 0.5 billion to 1.5 billion"
- PhiForcing phi_equation (tagged unclear: relation between the paper passage and the cited Recognition theorem)
  "We present CosyVoice 3, an improved model designed for zero-shot multilingual speech synthesis in the wild, surpassing its predecessor in content consistency, speaker similarity, and prosody naturalness"
- DimensionForcing alexander_duality_circle_linking (tagged unclear: relation between the paper passage and the cited Recognition theorem)
  "Through optimizing semantic token utilization, initializing with text-based LLMs, designing a bidirectional streaming scheme..."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 18 Pith papers
- VoxSafeBench: Not Just What Is Said, but Who, How, and Where
  VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
- Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech
  GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
- VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
  VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
- Toward Fine-Grained Speech Inpainting Forensics: A Dataset, Method, and Metric for Multi-Region Tampering Localization
  A new dataset, iterative coarse-to-fine localization framework, and segment-level IoU F1 metric tackle the open problem of detecting multiple unknown word-level inpainted regions in speech.
- SPG-Codec: Exploring the Role and Boundaries of Semantic Priors in Ultra-Low-Bitrate Neural Speech Coding
  Semantic priors from HuBERT and Whisper improve speech codec intelligibility up to 6 kbps but show diminishing returns beyond that, with a bitrate-aware regulation strategy balancing semantic consistency and naturalness.
- AST: Adaptive, Seamless, and Training-Free Precise Speech Editing
  AST enables seamless speech editing by latent recomposition on pre-trained TTS models plus adaptive weak fact guidance, plus a new dataset and WDTW metric, claiming 70% WER reduction and better temporal consistency wi...
- From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
  ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
- Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark
  CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.
- SQuTR: A Robustness Benchmark for Spoken Query to Text Retrieval under Acoustic Noise
  SQuTR aggregates 37k queries from six text retrieval datasets, synthesizes speech from 200 speakers, adds 17 noise categories at varying SNR, and shows that even large retrieval models degrade sharply under extreme ac...
- TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis
  TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment...
- UAF: A Unified Audio Front-end LLM for Full-Duplex Speech Interaction
  UAF is the first unified audio front-end LLM that turns multiple front-end tasks into one sequence prediction model processing streaming audio chunks and reference prompts to output semantic and control tokens for ful...
- MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
  MoVE uses specialized LoRA expert adapters and a soft router to translate non-verbal vocalizations in S2ST, reproducing them in 76% of cases versus at most 14% for baselines while scoring highest on naturalness and em...
- Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
  Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
- ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
  ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
- OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models
  OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on mul...
- FoleyDirector: Fine-Grained Temporal Steering for Video-to-Audio Generation via Structured Scripts
  FoleyDirector introduces structured temporal scripts and a fusion module to enable precise timing control in DiT-based video-to-audio generation while preserving audio fidelity.
- Qwen3-Omni Technical Report
  Qwen3-Omni is a unified multimodal model that achieves open-source SOTA on 32 of 36 audio and audio-visual benchmarks and overall SOTA on 22 without degrading performance on text, image, or video relative to single-mo...
- RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations
  The RADAR Challenge 2026 provides a multilingual benchmark for audio deepfake detection under media transformations and finds that robust performance remains an open problem.