{"total":17,"items":[{"citing_arxiv_id":"2607.01108","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NPUsper: Eliminating Redundant Computation for Real-Time Whisper on Mobile NPUs","primary_cat":"cs.SD","submitted_at":"2026-07-01T16:00:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"NPUsper reduces per-word latency, TTFT, and power for Whisper on mobile NPUs via online hallucination detection and K-step chunk graphs while preserving accuracy.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.31112","ref_index":25,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"What Counts as an Error? Dual-Reference Benchmarking for Atypical ASR","primary_cat":"cs.CL","submitted_at":"2026-06-30T04:15:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Dual-reference benchmarking on atypical stuttered speech reveals disparities in ASR model performance and rankings between verbatim and intended transcriptions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.29534","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Preference-ASR: A Preference-Aware Test Set for Benchmarking ASR in the Era of Speech LLMs","primary_cat":"cs.CL","submitted_at":"2026-06-28T17:57:41+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"PreferenceASR is a preference-aware ASR test set built from seven corpora that shows model rankings change when user output-style instructions are 
considered.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.10675","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multilingual Word-Level Forced Alignment with Self-Supervised Representations and Learned Dynamic Programming","primary_cat":"cs.CL","submitted_at":"2026-06-09T10:27:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A fused self-supervised encoder and learned DP decoder for word alignment outperforms MFA on English datasets and generalizes to unseen languages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08486","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TRADE: Transducer-Augmented Decoder for Speech LLM","primary_cat":"cs.CL","submitted_at":"2026-06-07T07:15:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TRADE augments multimodal Speech LLMs with a transducer branch for streaming ASR, reporting 6.71% WER offline and 8.40% streaming on the Open ASR Leaderboard from one checkpoint.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.08194","ref_index":76,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GlobeAudio: A Multilingual Multicultural Benchmark for Naturalistic Evaluation of Large Audio-Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-06T14:24:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"GlobeAudio is a new multilingual multicultural benchmark for naturalistic evaluation of large audio-language models, showing performance gaps especially for open-source models and low-resource 
languages.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05121","ref_index":24,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Audio Interaction Model","primary_cat":"cs.SD","submitted_at":"2026-06-03T17:26:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive responses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.04418","ref_index":29,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding","primary_cat":"cs.SD","submitted_at":"2026-06-03T03:56:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and intelligibility.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.03948","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Pocket Offline Model for Simultaneous Speech Translation as CUNI Submission to IWSLT 2026","primary_cat":"cs.CL","submitted_at":"2026-06-02T17:37:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A 1B-parameter multilingual offline model is adapted with AlignAtt policy for simultaneous speech translation and submitted to IWSLT 2026 for 
three language pairs.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.19695","ref_index":62,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Cross-Talk Speech Reduction, by Separation, for Separation","primary_cat":"eess.AS","submitted_at":"2026-05-19T11:29:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Proposes cross-talk reduction task with CTRnet and pseudo-label far-field separation (PuLSS) to train on real close-talk/far-field pairs, achieving SOTA ASR on CHiME-6 and outperforming guided source separation.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.16545","ref_index":16,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Symphony for Speech-to-Text: Supporting Real-Time Medical Voice Interfaces","primary_cat":"cs.LG","submitted_at":"2026-05-15T18:39:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Symphony is a medical-grade speech recognition system that decomposes transcription into specialized components and outperforms existing systems in clinical settings while matching them in general domains.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.03776","ref_index":46,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Assessing the Impact of Noise and Speech Enhancement on the Intelligibility of Speech Codecs","primary_cat":"eess.AS","submitted_at":"2026-05-05T14:06:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Classical codecs prove more robust to noise than neural codecs, speech enhancement significantly helps noise-affected codecs, and 
listening effort plus ASR-based metrics add useful nuance beyond basic intelligibility scores.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27543","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"AppTek Call-Center Dialogues: A Multi-Accent Long-Form Benchmark for English ASR","primary_cat":"cs.CL","submitted_at":"2026-04-30T07:48:27+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"A new multi-accent long-form call-center dialogue dataset for English ASR evaluation shows substantial performance variation across accents and segmentation methods.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27436","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"BUT System Description for CHiME-9 MCoRec Challenge","primary_cat":"eess.AS","submitted_at":"2026-04-30T05:21:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"BUT's CHiME-9 MCoRec system conditions Parakeet-v2 ASR on AV-HuBERT visuals for 33.7% WER and uses Qwen3.5 LLM for hierarchical clustering to reach 0.97 F1, beating the baseline by 16.2% WER and 0.15 F1 on the development set.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10065","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-11T07:07:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ASPIRin decouples speaking timing from token content via 
binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., \"The llama 3 herd of models,\" arXiv preprint arXiv:2407.21783, 2024. [15] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., \"Deepseek-v3 technical report,\" arXiv preprint arXiv:2412.19437, 2024. [16] DeepSeek-AI, \"Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,\" 2025. [Online]. Available: https://arxiv.org/abs/2501.12948 [17] Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., \"Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens,\" arXiv preprint arXiv:2407."},{"citing_arxiv_id":"2604.07354","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Contextual Earnings-22: A Speech Recognition Benchmark with Custom Vocabulary in the Wild","primary_cat":"cs.CL","submitted_at":"2026-03-28T05:09:16+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Contextual Earnings-22 is a new benchmark dataset showing that scaled keyword prompting and boosting both deliver significantly better accuracy on custom vocabularies than standard academic tests.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.16378","ref_index":87,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hearing to Translate: The Effectiveness of Speech Modality Integration into 
LLMs","primary_cat":"cs.CL","submitted_at":"2025-12-18T10:21:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Cascaded systems remain the most reliable for speech translation overall, but recent SpeechLLMs match or outperform them in many conditions while standalone speech models lag.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}