LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

Dong Yu; Hao Zhang; Meng Yu; Rilin Chen; Vinay Kothapally; Weiwei Li

arXiv: 2502.14145 · v3 · pith:OTWV6PFHnew · submitted 2025-02-19 · cs.CL · eess.AS

Hao Zhang, Weiwei Li, Rilin Chen, Vinay Kothapally, Meng Yu, Dong Yu

keywords: dialogue, full-duplex, semantic, real-time, spoken, systems, while, accuracy

Achieving full-duplex communication in spoken dialogue systems (SDS) requires real-time coordination between listening, speaking, and thinking. This paper proposes a semantic voice activity detection (VAD) module as a dialogue manager (DM) to efficiently manage turn-taking in full-duplex SDS. Implemented as a lightweight (0.5B) LLM fine-tuned on full-duplex conversation data, the semantic VAD predicts four control tokens to regulate turn-switching and turn-keeping, distinguishing between intentional and unintentional barge-ins while detecting query completion for handling user pauses and hesitations. By processing input speech in short intervals, the semantic VAD enables real-time decision-making, while the core dialogue engine (CDE) is only activated for response generation, reducing computational overhead. This design allows independent DM optimization without retraining the CDE, balancing interaction accuracy and inference efficiency for scalable, next-generation full-duplex SDS.
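
The control flow described in the abstract can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract states only that there are four control tokens, so the token names, the `DialogueManager` class, the text chunks standing in for short speech intervals, and the keyword-based `toy_predict` function (a placeholder for the fine-tuned 0.5B LLM) are all hypothetical. The sketch shows only the gating idea: the manager is consulted on every chunk, while the core dialogue engine is called only when the turn switches to the system.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, List


class ControlToken(Enum):
    # Hypothetical names: the abstract does not name the four tokens.
    CONTINUE_LISTENING = "continue_listening"  # query incomplete (pause, hesitation)
    START_SPEAKING = "start_speaking"          # query complete, switch turn to system
    CONTINUE_SPEAKING = "continue_speaking"    # unintentional barge-in, keep the turn
    STOP_SPEAKING = "stop_speaking"            # intentional barge-in, yield the turn


@dataclass
class DialogueManager:
    predict: Callable[[str, bool], ControlToken]  # stand-in for the semantic VAD
    generate: Callable[[str], str]                # stand-in for the CDE
    speaking: bool = False
    buffer: List[str] = field(default_factory=list)
    responses: List[str] = field(default_factory=list)
    cde_calls: int = 0

    def step(self, chunk: str) -> ControlToken:
        """Process one short input interval and update the turn state."""
        self.buffer.append(chunk)
        token = self.predict(" ".join(self.buffer), self.speaking)
        if token is ControlToken.START_SPEAKING:
            # Only here is the (expensive) core dialogue engine invoked.
            self.cde_calls += 1
            self.responses.append(self.generate(" ".join(self.buffer)))
            self.buffer.clear()
            self.speaking = True
        elif token is ControlToken.STOP_SPEAKING:
            # Keep the buffer: the interruption starts the next query.
            self.speaking = False
        elif token is ControlToken.CONTINUE_SPEAKING:
            # Discard input that was not meant to take the turn.
            self.buffer.clear()
        return token


def toy_predict(context: str, speaking: bool) -> ControlToken:
    """Keyword heuristic used purely as a placeholder for the LLM."""
    if speaking:
        if "stop" in context.lower():
            return ControlToken.STOP_SPEAKING
        return ControlToken.CONTINUE_SPEAKING
    if context.rstrip().endswith("?"):
        return ControlToken.START_SPEAKING
    return ControlToken.CONTINUE_LISTENING


dm = DialogueManager(predict=toy_predict, generate=lambda q: f"answer to: {q}")
chunks = ["what is the", "weather today?", "uh-huh", "wait, stop"]
tokens = [dm.step(c) for c in chunks]
print([t.value for t in tokens], "CDE calls:", dm.cde_calls)
```

In this toy run the manager is queried four times but the response generator is called once, which mirrors the abstract's claim that the CDE is activated only for response generation. Because `predict` and `generate` are separate callables, either can be replaced without touching the other, in the spirit of the independent DM optimization the abstract describes.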

Forward citations

Cited by 6 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
eess.AS 2026-03 unverdicted novelty 7.0

FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.

Next-Turn: Duration-Aware Streaming Endpoint Detection via Time-to-Next-Speech-Onset Prediction
cs.SD 2026-06 unverdicted novelty 6.0

Next-Turn introduces time-to-next-speech-onset prediction for duration-aware streaming endpoint detection, reporting a 25.9% improvement in accuracy within 320 ms.

LMPAN: A Lightweight Multi-Path Alignment Network for Joint Full-Duplex Acoustic Echo Cancellation and Noise Suppression
eess.AS 2026-07 unverdicted novelty 5.0

LMPAN is a 480K-parameter network using multi-path alignment, attention integration, and dynamic post-filtering that matches larger models on joint AEC and NS while supporting real-time inference.

IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems
cs.SD 2026-06 unverdicted novelty 4.0

IRAF introduces an adaptive fusion module that uses a predicted scalar reliability gate to reduce the impact of interfering speakers on user audio representations in end-to-end full-duplex spoken dialogue systems, wit...

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
eess.AS 2026-03 unverdicted novelty 4.0

FLAIR enables simultaneous latent reasoning during speech input in full-duplex dialogue models via recursive latent embeddings and an ELBO-based training objective without added latency.

Toward Native Multimodal Modeling: A Roadmap
cs.CV 2026-05 unverdicted novelty 3.0

A roadmap that defines architectural nativity for multimodal models and categorizes them into Multi-to-Text, Multi-to-Target, and Multi-to-Multi types while outlining an industrial pipeline toward unified transformer-...