{"total":14,"items":[{"citing_arxiv_id":"2605.20755","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action","primary_cat":"eess.AS","submitted_at":"2026-05-20T05:54:08+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20356","ref_index":20,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Synchronization and Turn-Taking in Full-Duplex Speech Dialogue Models","primary_cat":"cs.CL","submitted_at":"2026-05-19T18:11:03+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"Full-duplex SDMs show strong representational synchronization that peaks near zero lag and degrades with noise, with internal states encoding anticipatory turn-taking cues detectable ahead of time.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20266","ref_index":96,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey of Large Audio Language Models: Generalization, Trustworthiness, and Outlook","primary_cat":"cs.SD","submitted_at":"2026-05-18T20:21:32+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A survey of Large Audio Language Models that establishes a taxonomy of trustworthiness vulnerabilities and proposes a Defense-in-Depth roadmap for audio intelligence.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Pareto frontier necessitates semantic-aware token compression [51] and factorized tokenization [50] to maintain performance across long-form contexts [56]. 
Third, integrating agentic frameworks with full-duplex intelligence marks the next stage of synchronous interaction [69], requiring robust handling of disfluency and tool-use in real-time conversations [72], [96]. Fourth, cross-modal knowledge distillation and multi-sensory alignment will empower models to \"listen between frames\" by transferring spatial reasoning from vision to audio [63], [64]. As these architectural advancements expand the multi-modal attack surface, the next-generation framework must pioneer intrinsic representation engineering, ensuring that"},{"citing_arxiv_id":"2605.17360","ref_index":47,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Omni-DuplexEval: Evaluating Real-time Duplex Omni-modal Interaction","primary_cat":"cs.CV","submitted_at":"2026-05-17T09:57:01+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Omni-DuplexEval creates a new benchmark and LLM-as-a-Judge framework for real-time duplex omni-modal interaction, revealing that current models score below 40% overall and struggle especially with proactive responses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.13841","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents","primary_cat":"cs.SD","submitted_at":"2026-05-13T17:58:52+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"Effective voice agent evaluation requires a simulation methodology that faithfully replicates the dynamic, real-time nature of spoken interaction, where the agent must navigate complete, task-oriented multi-turn conversations with live users whose requests and 
clarifications may shift throughout the call. Several benchmarks fall short on this requirement in distinct ways. FullDuplex-Bench-v1 (FDB) and FDB-v1.5 [18, 17] assess conversation dynamics in a heavily-scripted manner without task completion or tool use, rendering themselves unsuitable for voice agent evaluations. VoiceAgentBench [14] evaluates multi-tool 2 Table 1 Feature comparison of contemporary voice agent evaluation frameworks. ∼ denotes partial support, and - for the simulator validation means it doesn't apply because of missing simulator."},{"citing_arxiv_id":"2605.11484","ref_index":32,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Engagement Process: Rethinking the Temporal Interface of Action and Observation","primary_cat":"cs.AI","submitted_at":"2026-05-12T04:02:03+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"URL https://proceedings.neurips.cc/paper_files/paper/2024/file/180d4373aca26bd86bf45fc50d1a709f-Paper-Conference.pdf. [31] Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, and Xie Chen. Language model can listen while speaking. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 24831-24839, 2025. [32] Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H. Liu, and Hung yi Lee. Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities. arXiv preprint arXiv:2503.04721, 2025. 
[33] Gengyuan Zhang, Tanveer Hannan, Hermine Kleiner, Beste Aydemir, Xinyu Xie, Jian Lan, Thomas"},{"citing_arxiv_id":"2604.21406","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge","primary_cat":"eess.AS","submitted_at":"2026-04-23T08:21:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"A new HumDial-FDBench benchmark and real human-recorded dual-channel dataset are released to assess full-duplex dialogue systems on interruptions and conversational flow.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.10065","ref_index":64,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models","primary_cat":"cs.CL","submitted_at":"2026-04-11T07:07:08+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.01897","ref_index":21,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection","primary_cat":"cs.SD","submitted_at":"2026-04-02T11:00:37+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"FastTurn unifies acoustic features and streaming CTC decoding for low-latency, robust 
turn detection in full-duplex dialogue systems and releases a realistic human-dialogue test set.","context_count":1,"top_context_role":"other","top_context_polarity":"unclear","context_text":"Backchannel real-world 3080 0.42 Wait Synthesized 1000 0.71 To evaluate turn-state prediction, we construct an evaluation set consisting of segments from real-world data and 1,000 synthetically generated wait state samples, as shown in Table 1. Since the wait state is rare in natural conversations, we supplement the set with 1,000 samples generated using DeepSeek V3 [21] for text and IndexTTS2 [22] for audio synthesis. 3. Experiments 3.1. Datasets ASR Task. We use large-scale open-source corpora and internal datasets, including AISHELL-1 [23], AISHELL-2 [24], WenetSpeech [25], LibriSpeech [26], GigaSpeech [27], and MLS [28], totaling over 30,000 hours of Chinese and English speech to support robust feature learning."},{"citing_arxiv_id":"2603.22267","ref_index":27,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TiCo: Time-Controllable Spoken Dialogue Model","primary_cat":"cs.CL","submitted_at":"2026-03-23T17:51:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.17837","ref_index":21,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent 
Reasoning","primary_cat":"eess.AS","submitted_at":"2026-03-18T15:30:29+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2603.09643","ref_index":8,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings","primary_cat":"cs.ET","submitted_at":"2026-03-10T13:18:02+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"MM-tau-p² is a new benchmark with 12 metrics that measures how well multi-modal agents adapt to user personas and maintain robustness in dual-control interactions.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.26388","ref_index":19,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Game-Time: Evaluating Temporal Dynamics in Spoken Language Models","primary_cat":"eess.AS","submitted_at":"2025-09-30T15:23:39+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Game-Time Benchmark shows spoken language models handle basic tasks but degrade sharply under temporal constraints like tempo adherence and synchronized responses.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2505.15957","ref_index":9,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Towards Holistic Evaluation of Large Audio-Language Models: A Comprehensive Survey","primary_cat":"eess.AS","submitted_at":"2025-05-21T19:17:29+00:00","verdict":"ACCEPT","verdict_confidence":"UNKNOWN","novelty_score":6.0,"formal_verification":"none","one_line_summary":"The survey introduces a 
four-category taxonomy for LALM evaluations and reviews benchmarks across general auditory processing, knowledge reasoning, dialogue, and fairness-safety.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}