{"total":19,"items":[{"citing_arxiv_id":"2605.21008","ref_index":71,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Audio Reasoning in Multimodal Foundation Models","primary_cat":"eess.AS","submitted_at":"2026-05-20T10:44:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"A survey that provides a unified formulation of audio reasoning and reviews advances across Audio-to-Text, Audio-to-Speech, Audio-Visual, and Agentic paradigms while discussing challenges and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"and paralinguistic acoustic cues, making the input modality broader than speech alone. Regarding the output modality, we strictly focus on \"Speech-out\" systems designed for spoken conversational interaction. A. Sequential Audio-to-Speech Reasoning A broad range of prominent end-to-end Audio-to-Speech models, such as Qwen2.5-Omni [69], Kimi-Audio [70], GLM-4-Voice [36], Llama-omni [16], Mini-Omni [71], Mini-Omni 2 [72], and SALM-Omni [73], exhibit reasoning capabilities inherited from their text-based LLM backbones. These models can apply reasoning to audio inputs, but their performance is often limited by the modality gap [74] between text and audio and by the lack of audio-specific reasoning supervision. 
To optimize reasoning for the audio modality, recent frame-"},{"citing_arxiv_id":"2605.20755","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action","primary_cat":"eess.AS","submitted_at":"2026-05-20T05:54:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20266","ref_index":116,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook","primary_cat":"cs.SD","submitted_at":"2026-05-18T20:21:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Feb 2024 Mistral-7B 7B EN Discrete 87K Hrs audio✗✓ ✓ WavLLM [112] Mar 2024 LLaMA-2-7B-Chat 7B EN Contin. -✗✓ ✓ SpeechVerse [113] May 2024 Flan-T5-XL 3B EN Contin. -✗✓ ✓ GAMA [114] Jun 2024 LLaMA2-7B 7B EN Contin. 2.2M audio-caption pairs✗✓ ✓ Qwen2-Audio [18] Jul 2024 Qwen-7B 7B Multi. Contin. 520K Hrs audio✗✓ ✓ FunAudioLLM [115] Jul 2024 - - Multi. - -✗✓ ✓ Mini-Omni [116] Aug 2024 Qwen2-0.5B 0.5B - Discrete 8K Hrs speech + 2M text examples✓ ✓ ✓ Moshi [117] Sep 2024 Helium 7B EN Discrete 7M Hrs audio + 2.1T text tokens✓ ✓ ✓ LLaMA-Omni [118] Sep 2024 Llama-3.1-8B-Instruct 8B EN Contin. 
-✗✓ ✓ Parrot [119] Sep 2024 Llama 3.1-8B 8B EN Discrete 74,554 Hrs audio✓✗✓ OmniFlatten [120] Oct 2024 Qwen2-0.5B 0.5B EN, CN Discrete -✓ ✓ ✓"},{"citing_arxiv_id":"2605.06765","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing","primary_cat":"cs.CL","submitted_at":"2026-05-07T17:59:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.05927","ref_index":17,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM","primary_cat":"cs.CL","submitted_at":"2026-05-07T09:32:05+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757-15773, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.findings-emnlp.1055. URL https://aclanthology.org/2023.findings-emnlp.1055. 
[17] Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725, 2024. [18] Zuwei Long, Yunhang Shen, Chaoyou Fu, Heting Gao, Peixian Chen, Mengdan Zhang, Hang Shao, Jian Li, Jinlong Peng, Haoyu Cao, et al. Vita-audio: Fast interleaved audio-text token generation for efficient large speech-language model."},{"citing_arxiv_id":"2605.03937","ref_index":12,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MiniMind-O Technical Report: An Open Small-Scale Speech-Native Omni Model","primary_cat":"cs.SD","submitted_at":"2026-05-05T16:27:33+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MiniMind-O delivers a working 0.1B-scale open omni model with speech-native output, Thinker-Talker split, frozen encoders, and full release of code, checkpoints, and training data.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.14604","ref_index":35,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection","primary_cat":"cs.CR","submitted_at":"2026-04-16T04:22:11+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Fang, Y. Zhou, S. Guo, S. Zhang, and Y. Feng, \"Llama-Omni2: LLM-Based Real-Time Spoken Chatbot with Autoregressive Streaming Speech Synthesis,\" in Proceedings of ACL, Vienna, Austria, 2025, pp. 18 617-18 629. [34] Z. Xie and C. 
Wu, \"Mini-Omni: Language Models Can Hear, Talk While Thinking in Streaming,\" arXiv preprint, vol. arXiv:2408.16725, 2024. [35] Y. Shu, S. Dong, G. Chen, W. Huang, R. Zhang, D. Shi, Q. Xiang, and Y. Shi, \"LLaSM: Large Language and Speech Model,\" arXiv preprint, vol. arXiv:2308.15930, 2023. [36] Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, \"Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models,\" arXiv preprint,"},{"citing_arxiv_id":"2604.09222","ref_index":36,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GRM: Utility-Aware Jailbreak Attacks on Audio LLMs via Gradient-Ratio Masking","primary_cat":"cs.SD","submitted_at":"2026-04-10T11:27:25+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GRM ranks Mel bands by attack contribution versus utility sensitivity, perturbs a subset, and learns a universal perturbation to reach 88.46% average jailbreak success rate with improved attack-utility trade-off on four audio LLMs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"CoRR abs/2408.16725 (2024). arXiv:2408.16725 doi:10.48550/ARXIV.2408.16725 [35] Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. 2025. Qwen2.5-Omni Technical Report. CoRR abs/2503.20215 (2025). arXiv:2503.20215 doi:10.48550/ARXIV.2503.20215 [36] Hao Yang, Lizhen Qu, Ehsan Shareghi, and Gholamreza Haffari. 2024. Jigsaw Puzzles: Splitting Harmful Questions to Jailbreak Large Language Models. CoRR abs/2410.11459 (2024). arXiv:2410.11459 doi:10.48550/ARXIV.2410.11459 [37] Hao Yang, Lizhen Qu, Ehsan Shareghi, and Gholamreza Haffari. 2025. 
Audio Is the Achilles' Heel: Red Teaming Audio Large Multimodal Models."},{"citing_arxiv_id":"2604.08000","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PASK: Toward Intent-Aware Proactive Agents with Long-Term Memory","primary_cat":"cs.AI","submitted_at":"2026-04-09T09:06:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"PASK introduces the DD-MM-PAS paradigm for streaming proactive agents with intent-aware detection, hybrid memory modeling, and a new real-world benchmark where the IntentFlow model matches top LLMs on latency while finding deeper intents.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01897","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection","primary_cat":"cs.SD","submitted_at":"2026-04-02T11:00:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"especially under noisy and overlapped observations. Although large language models excel at reasoning and generation with complete textual inputs, integrating them into low-latency full-duplex dialogue systems remains challenging. *These authors contributed equally. **indicates the corresponding author. In practical deployments, existing full-duplex spoken dialogue systems [9, 10] often rely on turn detection to provide a controllable interface between speech processing and response generation. 
Current turn detection approaches can be broadly categorized into two groups. The first group relies on voice activity detection (VAD) and infers interruption timing from acoustic energy or activity patterns [11, 12, 13]. These methods"},{"citing_arxiv_id":"2509.22220","ref_index":77,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs","primary_cat":"cs.CL","submitted_at":"2025-09-26T11:32:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.16632","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Step-Audio 2 Technical Report","primary_cat":"cs.CL","submitted_at":"2025-07-22T14:23:55+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2507.08128","ref_index":115,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language 
Models","primary_cat":"cs.SD","submitted_at":"2025-07-10T19:40:21+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"To enable voice-to-voice interaction, we employ a TTS module for streaming speech generation, supporting streaming inputs and outputs. Our TTS module employs a decoder-only transformer architecture: it predicts the subsequent audio token conditioned on incoming subword text tokens from the LLM and the history of previously generated audio tokens. Similar streaming TTS techniques have been explored with LLMs [115] (for voice-out on LLM outputs), but not in the context of LALMs (which we define as models designed to perceive and reason over diverse audio inputs). Since not a core novelty of our work, we provide more details, including training and architecture, in Appendix I. 
4 Audio Flamingo 3 Training Data We present detailed statistics for all datasets used to train AF3 in Table 11."},{"citing_arxiv_id":"2504.18425","ref_index":72,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Kimi-Audio Technical Report","primary_cat":"eess.AS","submitted_at":"2025-04-25T15:31:46+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"While these models can produce high-fidelity audio, they typically focus on generation only and lack understanding and conversation capabilities or instruction-following speech interaction. Speech Conversation and Real-Time Dialogue Recent models have moved toward enabling real-time, end-to-end speech interaction. Moshi [14], GLM-4-Voice [84], and Mini-Omni [72] adopt interleaved or parallel decoding to support simultaneous generation of text and audio tokens, facilitating low-latency dialogue systems. OmniFlatten [86] introduces a progressive training pipeline to adapt a frozen LLM for full-duplex conversation. 
LLaMA-Omni [18] and Freeze-Omni [71] further refine duplex speech interaction through streaming decoders or multi-task alignment strategies."},{"citing_arxiv_id":"2503.20215","ref_index":38,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Qwen2.5-Omni Technical Report","primary_cat":"cs.CL","submitted_at":"2025-03-26T04:17:55+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text performance on reasoning benchmarks.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2503.12605","ref_index":214,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey","primary_cat":"cs.CV","submitted_at":"2025-03-16T18:39:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":2.0,"formal_verification":"none","one_line_summary":"The paper provides the first comprehensive survey of multimodal chain-of-thought reasoning, including foundational concepts, a taxonomy of methodologies, application analyses, challenges, and future directions.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Figure 4: Common architectures for comprehension-only and comprehension-generation MLLMs. Prior works, such as NExT-GPT [14] advances this objective for the first time by integrating multimodal adapters with various diffusion models. AnyGPT [213] utilizes multimodal discrete tokens to facilitate the generation of diverse multimodal content. 
Subsequently, Mini-Omni2 [214, 215] introduces a command-based interruption mechanism, enhancing user interaction and aligning further with GPT-4o's capabilities. Compared to MLLMs that only support comprehension, as shown in Figure 4, MLLMs that integrate both comprehension and generation either utilize an autoregressive approach to generate multimodal tokens [213], or connect decoders of varying modalities to decode"},{"citing_arxiv_id":"2502.11946","ref_index":59,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Step-Audio: Unified Understanding and Generation in Intelligent Speech Interaction","primary_cat":"cs.CL","submitted_at":"2025-02-17T15:58:56+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"Step-Audio introduces a 130B-parameter unified speech-text model with open-sourced components for understanding, generation, affordable voice cloning, and dynamic control, claiming SOTA human evaluation results on a new benchmark.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2412.02612","ref_index":43,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot","primary_cat":"cs.CL","submitted_at":"2024-12-03T17:41:24+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"GLM-4-Voice builds an end-to-end spoken chatbot by deriving a 175bps single-codebook tokenizer from ASR, synthesizing interleaved speech-text data, and continuing pre-training of GLM-4-9B on up to 1 trillion tokens before fine-tuning on conversational 
speech.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2410.17196","ref_index":101,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VoiceBench: Benchmarking LLM-Based Voice Assistants","primary_cat":"cs.CL","submitted_at":"2024-10-22T17:15:20+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}