{"total":15,"items":[{"citing_arxiv_id":"2605.17085","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Taming Audio VAEs via Target-KL Regularization","primary_cat":"cs.SD","submitted_at":"2026-05-16T17:01:47+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The paper introduces target-KL regularization to train audio VAEs at specific bitrates, enabling rate-distortion curves and comparison to discrete audio codecs for improved text-to-sound generation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.11098","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling","primary_cat":"cs.SD","submitted_at":"2026-05-11T18:04:33+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22209","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions","primary_cat":"eess.AS","submitted_at":"2026-04-24T04:26:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"UniSonate unifies text-to-speech, text-to-music, and text-to-audio in a flow-matching framework with dynamic token injection and curriculum learning, reporting SOTA TTS and TTM results plus positive cross-task 
transfer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19330","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Text-To-Speech with Chain-of-Details: modeling temporal dynamics in speech generation","primary_cat":"eess.AS","submitted_at":"2026-04-21T10:58:48+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Chain-of-Details (CoD) is a cascaded TTS method that explicitly models temporal coarse-to-fine dynamics with a shared decoder, achieving competitive performance using significantly fewer parameters.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15923","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hierarchical Codec Diffusion for Video-to-Speech Generation","primary_cat":"cs.SD","submitted_at":"2026-04-17T10:28:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HiCoDiT generates speech from video by conditioning low-level RVQ tokens on speaker identity and high-level tokens on facial expressions via a dual-scale normalized diffusion transformer.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11552","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MimicLM: Zero-Shot Voice Imitation through Autoregressive Modeling of Pseudo-Parallel Speech Corpora","primary_cat":"cs.SD","submitted_at":"2026-04-13T14:40:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MimicLM achieves better naturalness in zero-shot voice imitation by autoregressively modeling pseudo-parallel data with synthetic sources and real targets, plus 
interleaved text-audio guidance and preference alignment.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.11283","ref_index":110,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multimodal Large Language Model-Enabled Video Translation: A Role-Oriented Survey","primary_cat":"cs.CV","submitted_at":"2026-04-13T10:42:31+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"FireREDTTS-2 [88], CoFi-Speech [89], VALL-E R [90], GenerSpeech [91], ControlSpeech [92], NaturalSpeech [93], NaturalSpeech2 [94], NaturalSpeech3 [95], GradStyleSpeech [96], SC VALL-E [97], CLaM-TTS [98], AST-LDM [99], Voicebox [100], DiTTO-TTS [101], DiffGAN-TTS [102], SpeechFlow, SPEAR-TTS [103], ARDiT [104], MELLE [105], DEX-TTS [106], NanoVoice [107], VoiceCraft [108], ELLA-V [109], Mask-GCT [110], SimpleSpeech [111], SimpleSpeech 2 [112], FlashSpeech [113], Llasa [114], VoiceGuider [115], AI-STA [116] The VisualSynthesizer UNet-based X-Dancer [117], VideoFusion [118], SpA2V [119], MakeVideo [120], MagicVideo [121], AlignLatent [122], Dysen-VDM [123], VideoCrafter [124], Latent-VDM [125], Latent-Shift [126], LVDM [127], Tune-A-Video [128],"},{"citing_arxiv_id":"2604.05526","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Controllable Singing Style Conversion with Boundary-Aware Information Bottleneck","primary_cat":"cs.SD","submitted_at":"2026-04-07T07:25:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A singing voice conversion system with boundary-aware information bottleneck and high-frequency augmentation achieves the best naturalness in SVCC2025 subjective tests while using less extra data than 
competitors.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2601.15621","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Qwen3-TTS Technical Report","primary_cat":"cs.SD","submitted_at":"2026-01-22T03:51:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Qwen3-TTS delivers state-of-the-art multilingual TTS performance with 3-second voice cloning, description control, and ultra-low-latency streaming via dual tokenizers and a dual-track LM architecture trained on over 5 million hours of data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2510.06201","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TokenChain: A Discrete Speech Chain via Semantic Token Modeling","primary_cat":"eess.AS","submitted_at":"2025-10-07T17:54:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TokenChain demonstrates that a discrete semantic-token interface can sustain effective chain learning between ASR and TTS, yielding faster convergence and lower error rates on LibriSpeech and TED-LIUM.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.16632","ref_index":68,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Step-Audio 2 Technical Report","primary_cat":"cs.CL","submitted_at":"2025-07-22T14:23:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and 
conversational benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.09318","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ZipVoice-Dialog: Non-Autoregressive Spoken Dialogue Generation with Flow Matching","primary_cat":"eess.AS","submitted_at":"2025-07-12T15:18:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ZipVoice-Dialog is a flow-matching non-autoregressive model for zero-shot spoken dialogue generation that uses curriculum learning and speaker-turn embeddings, paired with a new 6.8k-hour OpenDialog dataset, and reports better speed and quality than autoregressive baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.17589","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training","primary_cat":"cs.SD","submitted_at":"2025-05-23T07:55:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.10117","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language 
Models","primary_cat":"cs.SD","submitted_at":"2024-12-13T12:59:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"CosyVoice 2 delivers human-parity naturalness and near-lossless streaming speech synthesis by combining finite-scalar quantization, a streamlined pre-trained LLM, and chunk-aware causal flow matching on large multilingual data.","context_count":1,"top_context_role":"baseline","top_context_polarity":"baseline","context_text":"shot TTS models benefit from large-scale training data, achieving synthesis quality and naturalness nearly indistinguishable from human speech. Recent zero-shot TTS models can be broadly divided into three categories: codec language models, feature diffusion models and their hybrid systems. Codec language models utilize a speech codec model to extract discrete speech representation [9-11] and employ an autoregressive [8, 12-17] or masked [18] language model to predict the speech tokens, which are then synthesized to waveforms via codec vocoders [19,20]. Continuous speech representations are also explored in [21]. Language model-based TTS can generate varied and prosody-consistent speech via autoregressive sampling. 
∗The code and pre-trained models are released at: https://github.com/FunAudioLLM/CosyVoice"},{"citing_arxiv_id":"2410.06885","ref_index":147,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching","primary_cat":"eess.AS","submitted_at":"2024-10-09T13:46:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}