{"total":13,"items":[{"citing_arxiv_id":"2607.01563","ref_index":5,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Words: Towards Effective Modeling of Non-Verbal Vocalizations in ASR","primary_cat":"eess.AS","submitted_at":"2026-07-02T00:43:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"UNKNOWN","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Three data-centric strategies are studied to improve rare non-verbal vocalization recognition in ASR while preserving lexical accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2607.00363","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis","primary_cat":"cs.SD","submitted_at":"2026-07-01T03:02:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Unified guidance framework for Flow Matching speech synthesis achieves nearly 3x faster inference and improved speaker similarity by combining heterogeneous data augmentation with intrinsic model guidance to eliminate CFG overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.09050","ref_index":22,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion","primary_cat":"eess.AS","submitted_at":"2026-06-08T05:39:23+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"MeanVC 2 introduces future-receptive chunking and a universal timbre token encoder to achieve lower-latency and more robust streaming zero-shot voice conversion than the original 
MeanVC.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08843","ref_index":21,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data","primary_cat":"cs.SD","submitted_at":"2026-06-07T21:25:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"KNN retrieval over WavLM representations creates synthetic source-target pairs from non-parallel data for supervised voice conversion training with a speaker loss, achieving strong results on multilingual test sets despite English-only training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.01804","ref_index":45,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing","primary_cat":"eess.AS","submitted_at":"2026-06-01T07:21:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show no model excels across dimensions and compositional editing is especially difficult","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.12310","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic 
Modeling","primary_cat":"cs.SD","submitted_at":"2026-05-12T15:57:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Poly-SVC converts singing voices from polyphonic recordings while keeping melody, lyrics, and harmonies by combining CQT-based pitch extraction with a conditional flow matching diffusion decoder.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"ment paradigm, which decomposes the singing voice into three primary components: timbre, content, and pitch [1]-[4]. To support this framework, recent studies have leveraged powerful pre-trained models for each aspect. For content modeling, self-supervised learning (SSL) models such as wav2vec [5] and HuBERT [6], as well as automatic speech recognition (ASR) models such as Whisper [7], are widely adopted to provide robust, high-resolution linguistic representations, greatly improving intelligibility in the synthesized output. At the same time, F0-based pitch estimation methods, including RMVPE [8] and Crepe [9], are commonly used for capturing the melodic contour. 
For timbre representation, both speaker embeddings and prompt-based timbre encodings are widely adopted."},{"citing_arxiv_id":"2604.23742","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"RTCFake: Speech Deepfake Detection in Real-Time Communication","primary_cat":"cs.SD","submitted_at":"2026-04-26T14:42:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"RTCFake is the first large-scale dataset of real-time communication speech deepfakes paired with offline versions, paired with a phoneme-guided consistency learning method that improves cross-platform and noise-robust detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.19193","ref_index":44,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"How Far Are Video Models from True Multimodal Reasoning?","primary_cat":"cs.CV","submitted_at":"2026-04-21T08:04:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"UniVBench [75] contains a diverse set of reference images that are free from copyright concerns. We manually curated and filtered out samples that might cause style drift or introduce ambiguous references. The remaining images were retained and used as metadata images. For the metadata audio, we curate and collect audio data from the Seed-VC open-source repository [44], filtering out segments with noisy backgrounds and low quality. 
Multi-turn Interaction. While some prior work has explored the physical simulation and logical reasoning capabilities of video models [47,54,67,78], few studies focus on the multi-turn interactive generation of videos. This task requires video models to not only understand but also iteratively predict video scripts"},{"citing_arxiv_id":"2604.12456","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"X-VC: Zero-shot Streaming Voice Conversion in Codec Space","primary_cat":"eess.AS","submitted_at":"2026-04-14T08:42:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[22] Guobin Ma, Jixun Yao, Ziqian Ning, Yuepeng Jiang, Lingxin Xiong, Lei Xie, and Pengcheng Zhu. 2025. MeanVC: Lightweight and Streaming Zero-Shot Voice Conversion via Mean Flows. arXiv:2510.08392 [eess.AS] https://arxiv.org/abs/2510.08392 [23] Seyed Hamidreza Mohammadi and Alexander Kain. 2017. An overview of voice conversion systems. Speech Communication 88 (2017), 65-82. doi:10.1016/j.specom.2017.01.008 [24] Ziqian Ning, Yuepeng Jiang, Pengcheng Zhu, Shuai Wang, Jixun Yao, Lei Xie, and Mengxiao Bi. 2024. Dualvc 2: Dynamic Masked Convolution for Unified Streaming and Non-Streaming Voice Conversion. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 11106-11110. 
doi:10.1109/ICASSP48485.2024.10446229 [25] Ziqian Ning, Yuepeng Jiang, Pengcheng Zhu, Jixun Yao, Shuai Wang, Lei Xie,"},{"citing_arxiv_id":"2604.08184","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan","primary_cat":"cs.SD","submitted_at":"2026-04-09T12:38:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"is an emerging yet under-established research frontier. In the following, we review prior work by audio type-speech, sound, singing voice, and music-and then discuss cross-type ADD, which motivates the need for a unified and realistic all-type benchmark. Speech. Speech deepfake detection has been extensively studied, largely driven by the ASVspoof challenges [38, 40, 57] and ADD challenges [66, 67]. Representative CMs include AASIST [20] and SSL-based pipelines that combine XLSR with AASIST [56]. Subsequent studies have investigated different SSL representations [22, 45], layer utilization of SSL features [43, 59, 70], and robustness [21, 63, 75]. 
However, a substantial gap remains between existing public benchmarks and real-world conditions (e."},{"citing_arxiv_id":"2604.13067","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Seeing it to Experiencing it: Interactive Evaluation of Intersectional Voice Bias in Human-AI Speech Interaction","primary_cat":"cs.HC","submitted_at":"2026-03-19T21:12:49+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Voice conversion in interactive studies boosts user trust in SpeechLLM responses while automated metrics detect accent-by-gender disparities in alignment and verbosity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2511.20657","ref_index":95,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Intelligent Agents with Emotional Intelligence: Current Trends, Challenges, and Future Prospects","primary_cat":"cs.HC","submitted_at":"2025-10-11T07:40:36+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A holistic survey of affective computing for intelligent agents covering emotion understanding via multimodal data, affective cognition, emotional expression synthesis, key challenges, and future directions emphasizing generative technologies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.18425","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kimi-Audio Technical Report","primary_cat":"eess.AS","submitted_at":"2025-04-25T15:31:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, 
question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"Since it is difficult for the voice actor to record speech in any styles, emotions, and accents, we develop a voice conversion (VC) system, called Kimi-VC, to convert diverse and in-the-wild speech in different speakers/timbres into the timbre of Kimi-Audio speaker while preserving the styles, emotions, and accents. Built on the Seed-VC framework [46], Kimi-VC incorporates source timbre perturbation via a timbre-shifting model during training, which mitigates information leakage and ensures alignment between training and inference phases. To ensure high quality of voice conversion, we fine-tune the Kimi-VC model using speech data recorded by the Kimi-Audio speaker, a voice actor selected by Kimi."}],"limit":50,"offset":0}