{"total":10,"items":[{"citing_arxiv_id":"2605.23463","ref_index":32,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"StepAudio 2.5 Technical Report","primary_cat":"eess.AS","submitted_at":"2026-05-22T10:24:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"StepAudio 2.5 is a unified audio-language foundation model that reaches state-of-the-art results on ASR, TTS, and realtime interaction by using task-tailored RLHF on a shared backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.06765","ref_index":128,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing","primary_cat":"cs.CL","submitted_at":"2026-05-07T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.27393","ref_index":56,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction","primary_cat":"cs.CL","submitted_at":"2026-04-30T04:05:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"LongVideoBench [53], and MotionBench [54], covering both varying video lengths. Speech Understanding and Generation.Speech evaluation covers automatic speech recognition, speech translation, audio understanding, speech question answering, and speech generation. For speech understanding, we evaluate on standard ASR benchmarks, including AISHELL-1 [ 55], AISHELL-2 [56], WenetSpeech [57], LibriSpeech [58], GigaSpeech [59], and V oxPopuli [60]; speech translation on CoV oST 2 [61]; multi-task audio understanding on MMAU and MELD [ 62]; and 9 Table 2: Vision-language results (instruct mode). Benchmark Gemini 2.5 Flash InternVL3.5 Qwen3-VL Qwen3-OmniMiniCPM-o 4.5 Size - 8B 8B 30B-A3B 9B STEM & General OpenCompass78."},{"citing_arxiv_id":"2604.18105","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR","primary_cat":"eess.AS","submitted_at":"2026-04-20T11:21:06+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"unclear","context_text":"j=1) +ϵ .(3) where ϵ is a small constant for numerical stability. Denote θold as the policy parameters at the beginning of each optimization step, ε as the clipping range, and β as the KL penalty coefficient. The GRPO objective is defined as JGRPO = 1 K KX i=1 1 |τi| |τi|X t=1 min \u0010 ri,t(θ) ˆAi,t,clip ri,t(θ),1−ε,1 +ε \u0001 ˆAi,t \u0011 −β D KL(πθ∥πref ), (4) where ri,t(θ) = πθ(τi,t |q, τ i,<t) πθold (τi,t |q, τ i,<t) .(5) RL training framework.We implement an RL training pipeline tailored for LLM-based ASR. For each training batch, the policy model encodes input utterances into speech embeddings, which are reused across both rollout generation and policy model log-probability computation to avoid redundant computation."},{"citing_arxiv_id":"2604.08003","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs","primary_cat":"eess.AS","submitted_at":"2026-04-09T09:07:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Let q= [u ⊤, P ⊤, C⊤]⊤ and let ˆΣ = Cov(q) denote the empirical covariance estimated over the evaluation set. We use a ridge-regularized covariance ˜Σ = ˆΣ +λI , where λ >0 ensures numerical stability. All covariance blocks below are taken as principal submatrices of ˜Σ. Under a joint Gaussian approximation on(u, P, C), we define PAI(E′) = \" 1 2 log 2log det ˜Σuu det ˜ΣP P det ˜Σ[u,P] # + ,(8) where [·]+ = max(·,0) clips residual negative values caused by numerical error. This quantity serves as a regular- ized Gaussian accessible-information proxy for the mutual information betweenuandP. Similarly, letting ˜Σ·|P denote the corresponding regularized conditional covariance matrices given P , computed from the same joint covariance ˜Σ, we define"},{"citing_arxiv_id":"2604.01897","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection","primary_cat":"cs.SD","submitted_at":"2026-04-02T11:00:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Since thewaitstate is rare in natural conversations, we sup- plement the set with 1,000 samples generated using DeepSeek V3 [21] for text and IndexTTS2 [22] for audio synthesis. 3. Experiments 3.1. Datasets ASR Task. We use large-scale open-source corpora and in- ternal datasets, including AISHELL-1 [23], AISHELL-2 [24], WenetSpeech [25], LibriSpeech [26], GigaSpeech [27], and MLS [28], totaling over 30,000 hours of Chinese and English speech to support robust feature learning. Turn Detection Task.We use the Easy Turn training set, augmented with internal conversational data and synthetic cor- pora. Dialogue texts are generated by Qwen3-32B [20] and DeepSeek-v3 [21], then synthesized into speech using In- dextts2 [22]. To generatecompleteandincompletestates, we"},{"citing_arxiv_id":"2508.07285","ref_index":159,"ref_count":1,"confidence":0.98,"is_internal_anchor":true,"paper_title":"Non-Intrusive Automatic Speech Recognition Refinement: A Survey","primary_cat":"eess.AS","submitted_at":"2025-08-10T10:46:14+00:00","verdict":"ACCEPT","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"A survey that classifies non-intrusive ASR refinement methods into five categories, reviews domain adaptation and evaluation datasets, proposes standardized metrics, and identifies future research directions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"5 hours primary-care telemedicine domain, manual transcript[153]HyProdise (HP)2023 English Text Refinement334,000 samplesgeneral (mixed) domain [154]SIGHAN132013 Chinese Text Refinement 1700 samples language learning domain (TOCFL), L2 learners[155]SIGHAN142014 Chinese Text Refinement 4497 samples language learning domain (TOCFL), L2 learners[156]SIGHAN152015 Chinese Text Refinement 3439 samples language learning domain (TOCFL), L2 learners[157]AISHELL-12017 Chinese Speech Recognition178 hours read speech, several domains (e.g., finance, technology, sports, news)[158]AISHELL-22018 Chinese Speech Recognition1,000 hours read speech, several domains (e.g., finance, technology, sports, news)[159]Wang271K2018 Chinese Text Refinement271,299 samplesgeneral domain (news-inclusive), symbolic (OCR) and phonetic (ASR) errors[160]Aidatatang 2019 Chinese Speech Recognition200 hours read speech, mostly mobile records, manual transcript[161]AISHELL-32020 Chinese Speech Recognition85 hours general domain, multi-speaker [162]AISHELL-42021 Chinese Speech Recognition120 hours conference/meeting speech, multi-channel [163]LEMON 2023 Chinese Text Refinement22,000 samplesfrom human writings, manual transcript, multi-domain (medical-inclusive)[164]ChFT 2024 Chinese Text Refinement562,048 samplesnews domain [165]ASR-EC 2024 Chinese Text Refinement1,000,000 samplesgeneral (mixed) domain [24]CSCD-NS 2024"},{"citing_arxiv_id":"2504.18425","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kimi-Audio Technical Report","primary_cat":"eess.AS","submitted_at":"2025-04-25T15:31:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"3https://github.com/modelscope/FunASR 1https://github.com/fighting41love/zhvoice 7 Kimi-Audio Technical Report Table 1: List of datasets used for audio understanding and their training epoch in SFT stage. Dataset Audio Length (#hours) Task Type SFT Epochs WenetSpeech [85] 10, 518 ASR 2.0 WenetSpeech4TTS [50] 12, 085 ASR 2.0 AISHELL-1 [4] 155 ASR 2.0 AISHELL-2 [17] 1, 036 ASR 2.0 AISHELL-3 [62] 65 ASR 2.0 Emilla [25] 98, 305 ASR 2.0 Fleurs [12] 17 ASR 2.0 CommonV oice [1] 43 ASR 2.0 KeSpeech [64] 1, 428 ASR 2.0 Magicdata [79] 747 ASR 2.0 zhvoice1 901 ASR 2.0 Libriheavy [33] 51, 448 ASR 2.0 MLS [57] 45, 042 ASR 2.0 Gigaspeech [5] 10, 288 ASR 2.0 LibriSpeech [54] 960 ASR 2.0 CommonV oice [1] 1, 854 ASR 2.0 V oxpopuli [69] 529 ASR 2."},{"citing_arxiv_id":"2407.10759","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Qwen2-Audio Technical Report","primary_cat":"eess.AS","submitted_at":"2024-07-15T14:38:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2311.07919","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models","primary_cat":"eess.AS","submitted_at":"2023-11-14T05:34:50+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Qwen-Audio trains a unified model on diverse audio and tasks with hierarchical tags to enable strong zero-shot performance on audio understanding benchmarks and multi-turn audio chat.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}