{"total":13,"items":[{"citing_arxiv_id":"2606.31338","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Beyond Binary Instrument QA: Probing Instrument Grounding in Music Audio-Language Models","primary_cat":"cs.SD","submitted_at":"2026-06-30T08:39:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Introduces an OpenMIC-derived multi-axis benchmark sequence showing that high binary instrument QA accuracy fails to predict robust grounding, with models showing position bias, confusable errors, and temporal bias.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.18273","ref_index":31,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Continuous Audio Thinking for Large Audio Language Models","primary_cat":"cs.CL","submitted_at":"2026-06-05T11:38:30+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"CoAT adds a continuous latent thinking space to LALMs via expert distillation to retain acoustic information, yielding gains on audio reasoning, understanding, music, emotion, and transcription benchmarks across three models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2606.05121","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Audio Interaction Model","primary_cat":"cs.SD","submitted_at":"2026-06-03T17:26:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Audio-Interaction unifies offline and online audio tasks into one streaming model via the SoundFlow framework and a new 2.6M-item streaming corpus, enabling real-time instruction following and proactive 
responses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.28480","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Audio-Mind: An Auditable Agentic Framework for Audio Understanding","primary_cat":"eess.AS","submitted_at":"2026-05-27T13:39:14+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Audio-Mind introduces a conditional, auditable agentic framework for audio understanding that preserves frontend judgment and acquires bounded external evidence only when needed, reporting 80.4% on MMAR and 82.8% on MSU-Bench.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20266","ref_index":108,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook","primary_cat":"cs.SD","submitted_at":"2026-05-18T20:21:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Oct 2023 Qwen-1.8B 2B EN, CN Contin. -✗✓ ✓ Qwen-Audio [15] Nov 2023 Qwen-7B 7B Multi. Contin. 130K+ Hrs audio✗✓ ✓ ParalinGPT [105] Dec 2023 DialoGPT 345M EN Contin. 140 Hrs audio✗✓ ✓ E-chat [106] Dec 2023 Baichuan2-7B-Chat 7B CN Contin. 10K Hrs ASR data✗✓ ✓ Year 2024 SpeechGPT-Gen [107] Jan 2024 LLaMA-2-7B-Chat 7B EN Discrete -✗✓ ✓ Audio Flamingo [108] Feb 2024 OPT-IML-1.3B 1.3B EN Contin. 21K Hrs audio✗✓ ✓ Spoken-LLM [109] Feb 2024 Llama-2-7B-Chat 7B EN Contin. 
16,472 current-response speech pairs✗✓ ✓ Spirit LM [110] Feb 2024 Llama-2-7B 7B EN Discrete 35.2B tokens✗✓ ✓ USDM [111] Feb 2024 Mistral-7B 7B EN Discrete 87K Hrs audio✗✓ ✓ WavLLM [112] Mar 2024 LLaMA-2-7B-Chat 7B EN Contin. -✗✓ ✓ SpeechVerse [113]"},{"citing_arxiv_id":"2604.24401","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"All That Glitters Is Not Audio: Rethinking Text Priors and Audio Reliance in Audio-Language Evaluation","primary_cat":"cs.SD","submitted_at":"2026-04-27T12:25:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Audio-language models retain 60-72% of benchmark scores without audio, and most audio-dependent items can be solved from short fragments rather than full clips.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.13804","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Character Beyond Speech: Leveraging Role-Playing Evaluation in Audio Large Language Models via Reinforcement Learning","primary_cat":"cs.LG","submitted_at":"2026-04-15T12:39:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"RoleJudge is a multidimensional evaluation framework for speech-character alignment in audio LLMs, backed by the RoleChat dataset and multi-stage RL training with standard alignment to reduce reward issues.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"In recent years, the development of multimodal technologies has enabled the alignment of audio modalities with large model inputs, thereby facilitating extensive audio understanding by large language models. 
Some studies encode speech into discrete tokens and incorporate them into LLMs, allowing the models to accept audio input, as seen in works such as SpeechGPT [40] and AudioPaLM [17]. Models like SALMONN [30] and Qwen-Audio [7, 8] are trained on large-scale, multi-task datasets, equipping them to perform a variety of downstream tasks including speech recognition, speech translation, and audio event detection. A subset of research applies large audio models to spoken dialogue, enabling more intelligent interactions, for example, by mining paralinguistic factors such as"},{"citing_arxiv_id":"2604.13023","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding","primary_cat":"cs.SD","submitted_at":"2026-04-14T17:57:01+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.12527","ref_index":50,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models","primary_cat":"eess.AS","submitted_at":"2026-04-14T10:00:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning 
benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2512.02231","ref_index":28,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"See, Hear, and Understand: Benchmarking Audiovisual Human Speech Understanding in Multimodal Large Language Models","primary_cat":"cs.CV","submitted_at":"2025-12-01T21:57:26+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AV-SpeakerBench is a new speaker-centered benchmark showing that top multimodal models still struggle with fine-grained audiovisual speech understanding, with Gemini 2.5 Pro leading but open models lagging on fusion.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.16632","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Step-Audio 2 Technical Report","primary_cat":"cs.CL","submitted_at":"2025-07-22T14:23:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.18425","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kimi-Audio Technical Report","primary_cat":"eess.AS","submitted_at":"2025-04-25T15:31:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and 
conversation after pre-training on more than 13 million hours of speech, sound, and music data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2024, pp. 10991-10995. [34] Chris Dongjoo Kim et al. \"Audiocaps: Generating captions for audios in the wild\". In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, pp. 119-132. [35] Zhifeng Kong et al. \"Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities\". In: arXiv preprint arXiv:2402.01831 (2024). [36] Nathan Lambert et al. \"Tülu 3: Pushing frontiers in open language model post-training\". In: arXiv preprint arXiv:2411.15124 (2024). [37] Matthew Le et al. \"Voicebox: Text-guided multilingual universal speech generation at scale\"."},{"citing_arxiv_id":"2407.10759","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Qwen2-Audio Technical Report","primary_cat":"eess.AS","submitted_at":"2024-07-15T14:38:09+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Qwen2-Audio is an open-source audio-language model that outperforms prior systems such as Gemini-1.5-pro on audio-centric instruction-following benchmarks after simplified prompt-based pre-training and expanded data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}