{"total":11,"items":[{"citing_arxiv_id":"2605.21008","ref_index":124,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Audio Reasoning in Multimodal Foundation Models","primary_cat":"eess.AS","submitted_at":"2026-05-20T10:44:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"Benchmark Domain Audio Type Eval Type Size #Tasks Year MMAU [51] General Audio Reasoning Speech, Sound, Music Closed 10K 27 2024 MMAU-Pro [123] General Audio Reasoning Speech, Sound, Music Closed 0.5K 49 2025 MMAR [120] General Audio Reasoning Speech, Sound, Music Closed 1K 16 2025 MMSU [119] Speech-focused Reasoning Speech Closed 5K 47 2025 VoxEval [124] Speech-focused Reasoning Speech Closed 14K 56 2025 CMDAR [125] Chinese Audio Reasoning Speech, Sound Closed, Open 3K 5 2025 Spoken-MQA [126] Spoken Mathematical Reasoning Speech Closed, Open 3K 5 2025 WildSpeech-Bench [127] End-to-end Interaction Speech Open 1.1K 5 2025 URO-Bench [128] End-to-end Interaction Speech Closed, Open 5K 20 2025 VoiceAgentBench [129] Multi-turn Tool Use & Reasoning Speech Closed 6K 6 2025"},{"citing_arxiv_id":"2605.20266","ref_index":195,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook","primary_cat":"cs.SD","submitted_at":"2026-05-18T20:21:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a 
Defense-in-Depth roadmap for audio intelligence.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"col to separate ASR errors from event localization failures. Beyond temporal precision, WoW-Bench [194] and MUSE [198] reveal a \"Semantic Shortcut\" phenomenon: models may infer plausible answers from linguistic or structural priors rather than genuine acoustic perception, and CoT prompting can even degrade performance. For complex acoustic scenes, MMAU [179] and MMAU-Pro [195] emphasize Disentanglement Efficiency under overlapping events. RSA-Bench [205] further exposes a Perception-Cognition Gap, where low-level recognition remains relatively robust but higher-order reasoning collapses under real-world degradation; its Denoising Paradox suggests that standard enhancement may worsen downstream reasoning. AudioBench [178] identifies a Modality Fusion"},{"citing_arxiv_id":"2605.04556","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Benchmarking LLMs on the Massive Sound Embedding Benchmark (MSEB)","primary_cat":"cs.SD","submitted_at":"2026-05-06T06:58:45+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":3.0,"formal_verification":"none","one_line_summary":"LLMs exhibit a persistent modality gap versus specialized audio encoders on MSEB tasks, with no conclusive evidence favoring audio-native over cascaded architectures.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.00969","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MedMosaic: A Challenging Large Scale Benchmark of Diverse Medical 
Audio","primary_cat":"cs.SD","submitted_at":"2026-05-01T16:06:27+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Do not use the identifiers 'above', 'previous', 'prior' etc. iii. OBFUSCATION RULE: The referent of 'this' or 'here' should require correct prior answers to disambiguate. -------------------------------------------------- D. OUTPUT FORMAT (STRICT JSON) -------------------------------------------------- { \"audio_context\": \"single medical audio input\", \"turns\": [ { \"turn\": 1, \"question\": \"...\", \"answer\": \"The answer is (X).\", \"question_type\": \"\", \"options\": \"(a) ...\\n(b) ...\\n(c) ...\\n(d) ...\\n(e) ...\\n(f) ...\\n(g) ...\\n(h) ...\\n(i) ...\\n(j) ...\" }, { \"turn\": 2, \"question\": \"...\", \"answer\": \"The answer is (X).\", \"question_type\": \"\", \"options\": \"(a) ...\\n(b) ...\\n(c) ...\\n(d) ...\\n(e) ...\\n(f) ...\\n(g) ...\\n(h) ...\\n(i) ...\\n(j) ...\" }, { \"turn\": 3, \"question\": \"...\", \"answer\": \"The answer is (X).\", \"question_type\": \"\", \"options\": \"(a) ...\\n(b) ...\\n(c) ...\\n(d) ...\\n(e) ...\\n(f) ...\\n(g) ...\\n(h) ...\\n(i) ...\\n(j) ...\" } ] } -------------------------------------------------- E. ANTI-SHORTCUT MEASURES -------------------------------------------------- 1. Prevent Surface Pattern Exploitation i. The correct answer must NOT be identifiable by length alone (vary option lengths randomly) ii. The correct answer must NOT be the most \"hedged\" or qualified option iii. 
The correct answer must NOT be identifiable by unique terminology"},{"citing_arxiv_id":"2604.25591","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Walking Through Uncertainty: An Empirical Study of Uncertainty Estimation for Audio-Aware Large Language Models","primary_cat":"eess.AS","submitted_at":"2026-04-28T12:56:22+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Semantic-level and verification-based uncertainty methods outperform token-level baselines for audio reasoning in ALLMs, but their relative performance on hallucination and unanswerable-question benchmarks is model- and task-dependent.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"has introduced increasingly diverse benchmarks to evaluate their capabilities. Existing benchmarks cover general audio understanding and reasoning [26]-[30], [33], [34], [39]-[43], [45]-[51] across speech, environmental sounds, and music, with recent efforts expanding evaluation toward more challenging reasoning settings such as expert-level audio understanding and multi-hop reasoning [27], [28], [30], [48]. These developments have substantially improved the empirical coverage of audio-aware LLM evaluation. At the same time, recent work has begun to highlight trustworthiness issues [31], [32], [35], [40], [44], [51], [53]-[55] in audio-aware LLMs. 
Prior studies [31], [32] show that these models can hallucinate sound events that are not present in the input audio, revealing a"},{"citing_arxiv_id":"2604.24401","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation","primary_cat":"cs.SD","submitted_at":"2026-04-27T12:25:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.23717","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HeadRouter: Dynamic Head-Weight Routing for Task-Adaptive Audio Token Pruning in Large Audio Language Models","primary_cat":"cs.SD","submitted_at":"2026-04-26T14:00:31+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"HeadRouter prunes audio tokens more effectively by dynamically routing based on per-head importance for semantic versus acoustic tasks, exceeding baseline performance at 70% token retention on Qwen2.5-Omni models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14548","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VoxSafeBench: Not Just What Is Said, but Who, How, and Where","primary_cat":"cs.SD","submitted_at":"2026-04-16T02:24:59+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":8.0,"formal_verification":"none","one_line_summary":"VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic 
cues like speaker or scene determine the appropriate response.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"All synthesized audio is filtered with Whisper-large-v3 [30], discarding samples with WER > 5%. Full construction details, including the prompt-audio pool composition and quality control pipeline, are provided in Appendix D. Evaluation Models and Judging. We curate a model set with demonstrated strong audio understanding. Guided by MMSU [1] and MMAU-Pro [2], two benchmarks that emphasize paralinguistic and sound-mixture reasoning, we select open models (Qwen3-Omni [31], Mimo-Audio [32], Kimi-Audio [33]) and closed models (Gemini-3-Pro, Gemini-3-Flash, GPT-4o-Audio [34]). For Qwen3-Omni and Mimo-Audio, we additionally evaluate their thinking variants, as reasoning modes can exhibit noticeably different alignment behaviors."},{"citing_arxiv_id":"2604.12527","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models","primary_cat":"eess.AS","submitted_at":"2026-04-14T10:00:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"AudioCaps [28], and Clotho [29], typically provide brief labels or captions that are insufficient to cultivate deep audio reasoning. While a handful of audio reasoning datasets exist [23, 30], they predominantly focus on shallow reasoning tasks. Furthermore, constructing datasets with complex reasoning traces relies heavily on closed-source models like Gemini 2.5 Pro [31]. 
This reliance not only leads to substantial annotation costs and hinders reproducibility but also introduces incompatible inference formats across architectures, further constraining the practical applicability of existing resources. To address these challenges, this study proposes Audio-Cogito, a fully open-source solution that elicits deep audio"},{"citing_arxiv_id":"2604.08209","ref_index":13,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"OmniJigsaw: Enhancing Omni-Modal Reasoning via Modality-Orchestrated Reordering","primary_cat":"cs.CV","submitted_at":"2026-04-09T13:09:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"OmniJigsaw is a self-supervised proxy task that reconstructs shuffled audio-visual clips via joint integration, sample-level selection, and clip-level masking strategies, yielding gains on 15 video, audio, and reasoning benchmarks.","context_count":1,"top_context_role":"dataset","top_context_polarity":"use_dataset","context_text":"75.80 ↑1.40 70.48 ↑0.32 69.50 ↑1.00 OmniJigsaw (CMM) 58.59 ↑1.98 76.30 ↑1.90 70.70 ↑0.54 71.00 ↑2.50 Audio Reasoning. To evaluate audio understanding improvements facilitated by our OmniJigsaw, we employ four representative benchmarks: MMSU [36] for fine-grained perception, MMAU-test-mini [28] and MMAR [21] for hierarchical reasoning, and MMAU-Pro [13] for versatile auditory comprehension. As shown in Table 2, OmniJigsaw yields consistent improvements; notably, CMM outperforms AudioJigsaw despite the latter's exclusive audio attention, validating its efficacy in excavating mutually beneficial audio-visual synergy. 
Significant gains on MMAR (+2.50) and robust performance on MMAU-Pro (+1."},{"citing_arxiv_id":"2604.06138","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Generating Synthetic Doctor-Patient Conversations for Long-form Audio Summarization","primary_cat":"cs.SD","submitted_at":"2026-04-07T17:45:07+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A three-stage synthetic data pipeline generates 8800 doctor-patient conversations totaling 1.3k hours of audio and LLM-produced SOAP notes, with evaluation showing cascaded transcription-then-summarization models outperform end-to-end audio models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}