{"total":21,"items":[{"citing_arxiv_id":"2606.31247","ref_index":144,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model","primary_cat":"cs.SD","submitted_at":"2026-06-30T07:24:10+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.30107","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Dial HEALTHDIAL for Advice: A Multilingual and Multi-Parallel Spoken Dialogue Dataset for Knowledge-Grounded Information Seeking","primary_cat":"cs.CL","submitted_at":"2026-05-28T15:47:09+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"HEALTHDIAL is a multilingual multi-parallel spoken dialogue dataset containing 1,500 dialogues per language grounded in WHO content, with recorded speech and speaker metadata across four languages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.15984","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Content: A Comprehensive Speech Toxicity Dataset and Detection Framework Incorporating Paralinguistic Cues","primary_cat":"cs.SD","submitted_at":"2026-05-15T14:17:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"ToxiAlert-Bench dataset and dual-head neural network detect toxic speech by distinguishing textual versus paralinguistic sources, reporting 
21.1% Macro-F1 and 13% accuracy gains over baselines.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.01224","ref_index":115,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Lost in the Tower of Babel: The Adverse Effects of Incidental Multilingualism in LLMs","primary_cat":"cs.CL","submitted_at":"2026-05-02T03:39:12+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Incidental multilingualism from uneven web training makes LLMs unequal, brittle, and opaque across languages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02928","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Keyword spotting using convolutional neural network for speech recognition in Hindi","primary_cat":"cs.SD","submitted_at":"2026-04-26T21:11:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"CNNs using MFCC features achieve 91.79% accuracy for keyword spotting in Hindi speech on a 40,000-sample dataset.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.24770","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Elderly-Contextual Data Augmentation via Speech Synthesis for Elderly ASR","primary_cat":"cs.CL","submitted_at":"2026-04-15T12:56:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Combining LLM-based elderly-contextual paraphrasing with TTS synthesis using elderly speakers reduces word error rates in elderly ASR by up to 58% over standard Whisper baselines on English and Korean 
datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.22817","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"In-Sync: Adaptation of Speech Aware Large Language Models for ASR with Word Level Timestamp Predictions","primary_cat":"eess.AS","submitted_at":"2026-04-14T20:56:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Lightweight training strategies allow speech-aware LLMs to output accurate word timestamps alongside ASR transcripts while also improving recognition quality across datasets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10736","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BlasBench: An Open Benchmark for Irish Speech Recognition","primary_cat":"cs.CL","submitted_at":"2026-04-12T17:17:54+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"BlasBench supplies an Irish-aware normalizer and scoring harness that enables reproducible ASR comparisons and exposes a 33-43 point generalization gap for fine-tuned models versus 7-10 points for massively multilingual ones.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14186","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HARNESS: Lightweight Distilled Arabic Speech Foundation Models","primary_cat":"eess.AS","submitted_at":"2026-03-31T16:56:33+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":5.5,"formal_verification":"none","one_line_summary":"HARNESS introduces Arabic-centric speech foundation models that achieve high efficiency and performance through iterative self-distillation and PCA-based signal 
compression.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.29087","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"IQRA 2026: Interspeech Challenge on Automatic Pronunciation Assessment for Modern Standard Arabic (MSA)","primary_cat":"cs.SD","submitted_at":"2026-03-31T00:05:07+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"The IQRA 2026 challenge on Arabic mispronunciation detection reports a 0.28 F1-score gain from new authentic human error data and diverse modeling approaches including self-supervised and audio-language models.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"SQZ ww employs a U2++ Conformer encoder [30] with a bidirectional Transformer decoder optimized via joint CTC/attention encoder decoder loss with label smoothing. Additional synthetic mispronounced speech is generated using a VITS TTS [31] model trained on challenge data, with phoneme interpolation to simulate pronunciation errors. SpecAugment [32] augmentation is applied during fine-tuning. A two-pass decoding strategy performs first-pass CTC beam search followed by bidirectional Transformer rescoring. 
Najva (5th, F1=0.6894).Najva fine-tunes the NVIDIA FastConformer Hybrid Large model (stt ar fastconformer hybrid large pcd)2, pretrained on Arabic speech corpora including FLEURS [33], Tarteel EveryAyah3, and Common V oice [34]."},{"citing_arxiv_id":"2603.14222","ref_index":51,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Membership Inference for Contrastive Pre-training Models with Text-only PII Queries","primary_cat":"cs.CR","submitted_at":"2026-03-15T04:53:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"UMID infers membership in contrastive pre-training data using only text queries by performing latent inversion and comparing similarity and variability signals to synthetic gibberish references via unsupervised anomaly detection.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.01537","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Two-Dimensional Quantization for Geometry-Aware Audio Coding","primary_cat":"cs.SD","submitted_at":"2025-12-01T11:06:38+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Q2D2 uses 2D geometric grid projections to quantize feature pairs in neural audio codecs, yielding implicit codebooks that improve efficiency and utilization over RVQ, VQ, and FSQ while maintaining reconstruction quality.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.22220","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient 
SpeechLLMs","primary_cat":"cs.CL","submitted_at":"2025-09-26T11:32:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2508.07285","ref_index":129,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Non-Intrusive Automatic Speech Recognition Refinement: A Survey","primary_cat":"eess.AS","submitted_at":"2025-08-10T10:46:14+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that classifies non-intrusive ASR refinement methods into five categories, reviews domain adaptation and evaluation datasets, proposes standardized metrics, and identifies future research directions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"HyPoradise (HP) and ASR-EC. Chen et al. [121] developed the HP dataset by generating N-best hypotheses (top-5 from a beam search with a beam size of 60) from two state-of-the-art ASR models, WavLM [123] and Whisper [124]. This process has been applied to numerous popular ASR datasets, including LibriSpeech [125], CHIME-4 [126], WSJ [127], SwitchBoard [128], Common V oice [129], Tedlium-3 [130], LRS2 [131], ATIS [132], and CORAAL [133], yielding over 334,000 pairs. In a similar vein, the ASR-EC benchmark [134], specifically designed for Chinese ASR error correction, was constructed by collecting erroneous transcriptions from industry-grade ASR systems and pairing them with manually verified ground-truth transcripts. 
ASR-EC utilizes data from diverse Chinese speech corpora, such"},{"citing_arxiv_id":"2506.17185","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Common Pool of Privacy Problems: Legal and Technical Lessons from a Large-Scale Web-Scraped Machine Learning Dataset","primary_cat":"cs.CR","submitted_at":"2025-06-20T17:40:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"An empirical audit of one web-scraped ML training dataset reveals persistent PII after sanitization, which the authors combine with legal analysis to highlight privacy risks and advocate redefining 'publicly available' data for AI training.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"public availability of this search tool may raise awareness about the content of web-crawled data, the tool can also be used by adversaries to gather personal information; we expand on these concerns in our discussion on profiling Section 7.3.7. Currently the CLIP retrieval website is no longer accessible, but code is available to run the tool locally [11]). Moreover, the LAION-5B authors place the responsibility on the individuals to find and remove their personal information, yet these opt-out policies again are not very meaningful (Section 4.3.1). 
For instance, when someone found their medical records leaked on LAION-5B and wished to take them down, a LAION author responded that the hosting website was responsible since the dataset"},{"citing_arxiv_id":"2505.24437","ref_index":49,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SwitchCodec: A High-Fidelity Nerual Audio Codec With Sparse Quantization","primary_cat":"cs.SD","submitted_at":"2025-05-30T10:20:04+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SwitchCodec introduces Residual Experts Vector Quantization and a multi-tiered STFT discriminator to achieve PESQ 2.87 and ViSQOL 4.27 at 2.67 kbps while halving training time via post-training.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.17589","ref_index":52,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CosyVoice 3: Towards In-the-wild Speech Generation via Scaling-up and Post-training","primary_cat":"cs.SD","submitted_at":"2025-05-23T07:55:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CosyVoice 3 achieves better content consistency, speaker similarity, and prosody naturalness in zero-shot multilingual speech synthesis by scaling data to one million hours, model size to 1.5 billion parameters, and introducing a supervised multi-task speech tokenizer plus a differentiable reward model.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.18425","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kimi-Audio Technical Report","primary_cat":"eess.AS","submitted_at":"2025-04-25T15:31:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kimi-Audio is an 
open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Table 1: List of datasets used for audio understanding and their training epoch in SFT stage. Dataset Audio Length (#hours) Task Type SFT Epochs WenetSpeech [85] 10, 518 ASR 2.0 WenetSpeech4TTS [50] 12, 085 ASR 2.0 AISHELL-1 [4] 155 ASR 2.0 AISHELL-2 [17] 1, 036 ASR 2.0 AISHELL-3 [62] 65 ASR 2.0 Emilla [25] 98, 305 ASR 2.0 Fleurs [12] 17 ASR 2.0 CommonV oice [1] 43 ASR 2.0 KeSpeech [64] 1, 428 ASR 2.0 Magicdata [79] 747 ASR 2.0 zhvoice1 901 ASR 2.0 Libriheavy [33] 51, 448 ASR 2.0 MLS [57] 45, 042 ASR 2.0 Gigaspeech [5] 10, 288 ASR 2.0 LibriSpeech [54] 960 ASR 2.0 CommonV oice [1] 1, 854 ASR 2.0 V oxpopuli [69] 529 ASR 2.0 LibriTTS [83] 568 ASR 2.0 CompA-R [22] 159 AQA 2.0 ClothoAQA [43] 7.4 AQA 4.0 AudioCaps [34] 137 AAC 2."},{"citing_arxiv_id":"2410.06885","ref_index":81,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching","primary_cat":"eess.AS","submitted_at":"2024-10-09T13:46:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"F5-TTS generates natural speech from text via flow matching on DiT with simple text padding, ConvNeXt refinement, and sway sampling, trained on 100K hours multilingual data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2406.02430","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Seed-TTS: A Family of High-Quality Versatile Speech Generation 
Models","primary_cat":"eess.AS","submitted_at":"2024-06-04T15:48:29+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Seed-TTS models produce speech matching human naturalness and speaker similarity, with added controllability via self-distillation and reinforcement learning.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2210.13438","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"High Fidelity Neural Audio Compression","primary_cat":"eess.AS","submitted_at":"2022-10-24T17:52:02+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"EnCodec is an end-to-end trained streaming neural audio codec that uses a single multiscale spectrogram discriminator and a gradient-normalizing loss balancer to achieve higher fidelity than prior methods at the same bitrates for 24 kHz mono and 48 kHz stereo audio.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"The adversarial loss for the generator is constructed as follows,ℓg(ˆx) = 1 K ∑ k max(0, 1−Dk(ˆx))), whereK is the number of discriminators. Similarly to previous work on neural vocoders (Kumar et al., 2019; Kong et al., 2020; You et al., 2021), we additionally include a relative feature matching loss for the generator. Formally, ℓfeat(x, ˆx) = 1 KL K∑ k=1 L∑ l=1 ∥Dl k(x)−Dl k(ˆx)∥1 mean ( ∥Dl k(x)∥1 ), (2) where themean is computed over all dimensions,(Dk) are the discriminators, andL is the number of layers in discriminators. The discriminators are trained to minimize the following hinge-loss adversarial loss function: Ld(x, ˆx) = 1 K ∑K k=1 max(0, 1−Dk(x)) +max(0, 1 +Dk(ˆx)), whereK is the number of discriminators. 
Given that the discriminator tend to overpower easily the decoder, we update its weight with a probability of 2/3 at"}],"limit":50,"offset":0}