arxiv: 2604.21406 · v2 · submitted 2026-04-23 · 📡 eess.AS

Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

Chengyou Wang, Hongfei Xue, Guojian Li, Zhixian Zhao, Shuiyuan Wang, Shuai Wang, Xin Xu, Hui Bu, Lei Xie

Pith reviewed 2026-05-08 13:15 UTC · model grok-4.3

classification 📡 eess.AS

keywords full-duplex interaction · spoken dialogue systems · benchmark · dataset · interruptions · overlapping speech · evaluation framework · human-like dialogue

The pith

This paper establishes a benchmark and releases a dual-channel dataset to evaluate full-duplex spoken dialogue systems capable of handling real-time interruptions and overlaps.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Spoken dialogue systems have traditionally used strict turn-taking, making them unable to respond naturally when people interrupt or speak over each other. The work addresses this by creating a benchmark based on real human conversations recorded in dual channels. These recordings include natural interruptions, overlapping speech, and feedback. The benchmark measures how well systems can manage these dynamics while keeping the conversation flowing. A public leaderboard is also provided to compare different models' performances in this area.

Core claim

The paper claims that providing a high-quality dual-channel dataset of real human-recorded conversations, which captures interruptions, overlapping speech, and feedback mechanisms, along with the HumDial-FDBench framework, allows for the assessment of full-duplex spoken dialogue systems' ability to handle interruptions while maintaining conversational flow, thereby supporting the development of more human-like systems.

What carries the argument

The HumDial-FDBench benchmark that uses the dual-channel dataset to evaluate interruption handling and dynamic turn negotiation in spoken dialogues.

If this is right

Systems can be objectively evaluated on their interruption handling capabilities.
Open-source and proprietary models can be compared transparently on a public leaderboard.
The resources enable development of more responsive and adaptive dialogue systems.
The dataset provides a foundation for testing feedback mechanisms in conversations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Improved full-duplex systems could make voice assistants more suitable for natural group discussions or collaborative settings.
This evaluation approach might reveal specific weaknesses in current AI models' real-time speech processing.
Future work could extend the benchmark to include visual cues or multi-speaker scenarios for richer assessment.

Load-bearing premise

The dual-channel dataset and HumDial-FDBench framework sufficiently capture and measure the full range of real-time interruption handling and dynamic turn negotiation in natural human conversations.

What would settle it

A test showing that high-performing systems on this benchmark still produce unnatural responses or fail to handle interruptions effectively in live, spontaneous human interactions would challenge the claim.

Figures

Figures reproduced from arXiv: 2604.21406 by Chengyou Wang, Guojian Li, Hongfei Xue, Hui Bu, Lei Xie, Shuai Wang, Shuiyuan Wang, Xin Xu, Zhixian Zhao.

**Figure 1.** Framework for latency evaluation in interruption scenarios. In addition to evaluating behavioral correctness, the paper measures real-time responsiveness in interruption scenarios.
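To make the idea of latency evaluation in interruption scenarios concrete, here is a minimal sketch that derives two delays from speech segments on the two channels: how long the system keeps talking after the user starts interrupting, and how long it takes to begin speaking again once the user has finished. The segment format, the function name `measure_latency`, and both latency definitions are assumptions made for illustration; the paper's actual protocol may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Hypothetical representation: one speech segment as (start_s, end_s),
# e.g. produced by running a voice activity detector on each channel.
Segment = Tuple[float, float]


@dataclass
class InterruptionLatency:
    stop_latency_s: Optional[float]      # user onset -> system falls silent
    response_latency_s: Optional[float]  # user offset -> system speaks again


def measure_latency(user_interruption: Segment,
                    system_segments: List[Segment]) -> InterruptionLatency:
    """Compute illustrative stop and response latencies for one interruption."""
    onset, offset = user_interruption
    stop_latency = None
    response_latency = None
    for start, end in sorted(system_segments):
        # System segment still active at the moment the user starts speaking.
        if stop_latency is None and start <= onset < end:
            stop_latency = end - onset
        # First system segment that begins after the user has finished.
        if response_latency is None and start >= offset:
            response_latency = start - offset
    return InterruptionLatency(stop_latency, response_latency)


# Example: the user interrupts from 5.0 s to 6.5 s; the system was speaking
# from 2.0 s to 5.4 s and resumes at 7.1 s.
result = measure_latency((5.0, 6.5), [(2.0, 5.4), (7.1, 9.0)])
print(round(result.stop_latency_s, 3), round(result.response_latency_s, 3))  # 0.4 0.6
```

Either value is `None` when the corresponding event never occurs, for example when the system was not speaking at the interruption onset.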

Original abstract

Full-duplex interaction, where speakers and listeners converse simultaneously, is a key element of human communication often missing from traditional spoken dialogue systems. These systems, based on rigid turn-taking paradigms, struggle to respond naturally in dynamic conversations. The Full-Duplex Interaction Track of ICASSP 2026 Human-like Spoken Dialogue Systems Challenge (HumDial Challenge) aims to advance the evaluation of full-duplex systems by offering a framework for handling real-time interruptions, speech overlap, and dynamic turn negotiation. We introduce a comprehensive benchmark for full-duplex spoken dialogue systems, built from the HumDial Challenge. We release a high-quality dual-channel dataset of real human-recorded conversations, capturing interruptions, overlapping speech, and feedback mechanisms. This dataset forms the basis for the HumDial-FDBench benchmark, which assesses a system's ability to handle interruptions while maintaining conversational flow. Additionally, we create a public leaderboard to compare the performance of open-source and proprietary models, promoting transparent, reproducible evaluation. These resources support the development of more responsive, adaptive, and human-like dialogue systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This is a resource paper releasing a dual-channel dataset and HumDial-FDBench for full-duplex dialogue evaluation, with no new results or baselines.

This paper is a challenge description that releases a dual-channel dataset of human conversations and introduces the HumDial-FDBench benchmark for full-duplex spoken dialogue systems. The main point is to provide a standardized way to evaluate systems on handling interruptions, overlaps, and dynamic turns. It does a good job of highlighting why full-duplex matters for natural interaction, and by opening a leaderboard it supports comparison between different models. The use of real recorded conversations rather than synthetic data is a strength for ecological validity.

The soft spots are that the paper doesn't include any baseline performances or detailed validation of the dataset, such as how quality was ensured or what the scale is. Without that, it's hard to know whether this will become a standard or whether there are biases in the collected data. The claims about capturing feedback mechanisms are stated but not demonstrated with examples or stats. Since it presents resources rather than new algorithms or results, the soundness depends on the data release being well documented in the full text.

This is targeted at the spoken dialogue systems community, especially those participating in challenges or needing benchmarks for full-duplex capabilities. Someone working on real-time dialogue models could use this to test their systems. It has enough to go to peer review, as these kinds of papers help build the infrastructure for the field. My recommendation is to accept it for review rather than desk reject, as the resources could be valuable if the details check out.

Referee Report

1 major / 0 minor

Summary. The manuscript introduces the Full-Duplex Interaction Track of the ICASSP 2026 HumDial Challenge. It presents the HumDial-FDBench benchmark for evaluating full-duplex spoken dialogue systems, releases a dual-channel dataset of real human-recorded conversations capturing interruptions, overlapping speech, and feedback mechanisms, and establishes a public leaderboard to compare open-source and proprietary models.

Significance. If the dataset collection procedures ensure high quality and the benchmark tasks meaningfully quantify interruption handling and conversational flow, this resource-release paper could advance spoken dialogue research by filling the evaluation gap left by half-duplex, strict turn-taking systems and by enabling reproducible comparisons via the leaderboard. The explicit support for both open-source and proprietary models is a constructive element that encourages broader adoption.

major comments (1)

[Benchmark description] The description of the HumDial-FDBench benchmark (abstract and associated sections) provides no concrete evaluation metrics, scoring functions, or task definitions for measuring interruption handling or maintenance of conversational flow. Without these, it is impossible to verify that the released resources actually support the central claim of assessing real-time full-duplex capabilities.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and have made revisions to strengthen the benchmark description.

Referee: [Benchmark description] The description of the HumDial-FDBench benchmark (abstract and associated sections) provides no concrete evaluation metrics, scoring functions, or task definitions for measuring interruption handling or maintenance of conversational flow. Without these, it is impossible to verify that the released resources actually support the central claim of assessing real-time full-duplex capabilities.

Authors: We agree that the original manuscript's description of HumDial-FDBench was insufficiently concrete on metrics and task definitions. In the revised version, we have added a dedicated subsection (Section 3.2) that explicitly defines the two core tasks: (1) Interruption Handling, evaluated via interruption detection F1-score, response latency (in ms) to overlap onset, and overlap resolution success rate; (2) Conversational Flow Maintenance, measured by turn-negotiation accuracy, dialogue coherence (via embedding similarity to human references), and feedback appropriateness. Scoring functions are now provided, including a composite full-duplex score that weights latency and coherence with dataset-derived human baselines. These additions make the evaluation criteria verifiable and directly tied to the dual-channel recordings and public leaderboard.

Revision: yes
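The rebuttal names an interruption detection F1-score but gives no formula, so the following is only one plausible reading, sketched under stated assumptions: each reference interruption and each detected interruption is represented by its onset time in seconds, and a detection counts as correct when it falls within a tolerance window of an unmatched reference onset. The function name `interruption_f1`, the 0.5 s default tolerance, and the greedy time-ordered matching are all hypothetical choices, not taken from the paper.

```python
from typing import List, Tuple


def interruption_f1(reference_onsets: List[float],
                    predicted_onsets: List[float],
                    tolerance_s: float = 0.5) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using greedy one-to-one matching in time order."""
    refs = sorted(reference_onsets)
    preds = sorted(predicted_onsets)
    i = j = true_positives = 0
    while i < len(refs) and j < len(preds):
        if abs(refs[i] - preds[j]) <= tolerance_s:
            # Detection close enough to the reference onset: count a match.
            true_positives += 1
            i += 1
            j += 1
        elif preds[j] < refs[i]:
            j += 1  # spurious detection with no nearby reference
        else:
            i += 1  # reference interruption that was missed
    precision = true_positives / len(preds) if preds else 0.0
    recall = true_positives / len(refs) if refs else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Example: three annotated interruptions, four detections (two within tolerance).
p, r, f1 = interruption_f1([1.0, 5.0, 9.0], [1.2, 6.0, 9.1, 12.0])
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

A tolerance-based match is used here because onset timestamps from human annotation and from a model rarely coincide exactly; a stricter or looser window changes the score, which is one reason the referee's request for explicit definitions matters.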

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper is a challenge description and resource-release document whose central claims consist of defining HumDial-FDBench tasks, releasing a dual-channel human-recorded dataset, and opening a public leaderboard. These claims are satisfied directly by the acts of specification and data provision; no equations, fitted parameters, predictions, or derivation steps appear that could reduce to inputs by construction. No self-citation load-bearing arguments, uniqueness theorems, or ansatzes are invoked. The work is self-contained against external benchmarks as a factual release of resources and evaluation framework.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a benchmark and dataset release paper with no mathematical models, free parameters, axioms, or invented entities described.

Reference graph

Works this paper leans on

30 extracted references · 19 canonical work pages · 7 internal anchors

[1]

Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge

Introduction. In natural communication, human dialogue rarely follows strict turn alternation. Instead, speakers and listeners interact through a continuous full-duplex process in which listening and speaking may occur simultaneously. Participants regulate conversational flow through cues such as intonation, pauses, prosody, and semantic context. These...
[2]

Cascaded systems typically chain modules such as ASR, LLM, and TTS to complete the interaction, as exemplified by Google-Duplex [1] and Firered-Chat [8]

Related Work. Full-Duplex Spoken Interaction. Mainstream spoken dialogue system [7, 4] architectures can be broadly categorized into cascaded pipelines and end-to-end frameworks. Cascaded systems typically chain modules such as ASR, LLM, and TTS to complete the interaction, as exemplified by Google-Duplex [1] and Firered-Chat [8]. However, this design o...
[3]

“uh-huh,” “yeah” (footnotes: https://github.com/TEN-framework/ten-turn-detection, https://github.com/pipecat-ai/Smart-Turn)

Released Dataset. 3.1. Scenarios. As shown in Table 1, the released dataset includes two major categories: Interruption and Rejection, with a total of eight scenarios. Interruption: The Interruption scenario assesses the capability of a model to adapt an ongoing response when the user intervenes. It includes five sub-scenarios. Follow-up Questions cove...
[4]

HumDial-FDBench. The HumDial-FDBench evaluation protocol is built upon Full-Duplex-Bench v1.5 [13] and introduces several extensions to support more complex interaction scenarios and a more comprehensive assessment of full-duplex dialogue systems. 4.1. Behavior evaluation. To analyze the semantic response strategy of a model following speech overlap, w...
[5]

Leaderboard. 5.1. Model Evaluation. We evaluated a diverse set of dialogue management models, divided into two categories: Open-Source Models, such as Freeze-Omni [2], Moshi [4], and Osum-EChat, which provide transparency and reproducibility, and Closed-Source Models, like Gemini 2.5 [16]. To ensure consistent evaluation across systems, we use a baseline th...
[6]

The goal is to highlight the major design axes and the practical trade-offs that shaped system performance under full-duplex interaction

Analysis. This section summarizes the representative design choices of Track II submissions from three complementary perspectives: architecture paradigm, turn-taking strategy, and training strategy. The goal is to highlight the major design axes and the practical trade-offs that shaped system performance under full-duplex interaction. 6.1. System Archit...
[7]

We release a dual-channel dataset of real human-recorded conversations to advance research on real-time, full-duplex systems

Conclusion. We conduct a comprehensive study based on the Full-Duplex Interaction Track of the ICASSP 2026 Human-like Spoken Dialogue Systems Challenge. We release a dual-channel dataset of real human-recorded conversations to advance research on real-time, full-duplex systems. To facilitate systematic evaluation, we introduce the HumDial-FDBench bench...
[8]

These tools were not involved in the development of the methodology, execution of experiments, generation of results, or formulation of conclusions

Generative AI Use Disclosure. In this study, generative AI tools were solely used for language refinement and editorial support, aimed at improving the clarity, readability, and overall flow of the manuscript. These tools were not involved in the development of the methodology, execution of experiments, generation of results, or formulation of conclusion...
[9]

Google duplex: An ai system for accomplishing real-world tasks over the phone

Y. Leviathan and Y. Matias, “Google duplex: An ai system for accomplishing real-world tasks over the phone,” Google AI blog, vol. 8, 2018
[10]

Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm

X. Wang, Y. Li, C. Fu, Y. Shen, L. Xie, K. Li, X. Sun, and L. Ma, “Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm,” arXiv preprint arXiv:2411.00774, 2024
[11]

Vita-1.5: Towards gpt-4o level real-time vision and speech interaction

C. Fu, H. Lin, X. Wang, Y.-F. Zhang, Y. Shen, X. Liu, H. Cao, Z. Long, H. Gao, K. Li et al., “Vita-1.5: Towards gpt-4o level real-time vision and speech interaction,” arXiv preprint arXiv:2501.01957, 2025
[12]

Moshi: a speech-text foundation model for real-time dialogue

A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024
[13]

Osum-echat: Enhancing end-to-end empathetic spoken chatbot via understanding-driven spoken dialogue

X. Geng, Q. Shao, H. Xue, S. Wang, H. Xie, Z. Guo, Y. Zhao, G. Li, W. Tian, C. Wang et al., “Osum-echat: Enhancing end-to-end empathetic spoken chatbot via understanding-driven spoken dialogue,” arXiv preprint arXiv:2508.09600, 2025
[14]

The ICASSP 2026 humdial challenge: Benchmarking human-like spoken dialogue systems in the LLM era

Z. Zhao, S. Wang, G. Li, H. Xue, C. Wang, S. Wang, L. Xiao, Z. Zhang, H. Bu, X. Xu, X. Wang, H. Liu, E. S. Chng, H. Lee, H. Li, and L. Xie, “The ICASSP 2026 humdial challenge: Benchmarking human-like spoken dialogue systems in the LLM era,” CoRR, vol. abs/2601.05564, 2026
[15]

Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation

W. Yu, S. Wang, X. Yang, X. Chen, X. Tian, J. Zhang, G. Sun, L. Lu, Y. Wang, and C. Zhang, “Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation,” arXiv preprint arXiv:2411.18138, 2024
[16]

Fireredchat: A pluggable, full-duplex voice interaction system with cascaded and semi-cascaded implementations

J. Chen, Y. Hu, J. Li, K. Li, K. Liu, W. Li, X. Li, Z. Li, F. Shen, X. Tang et al., “Fireredchat: A pluggable, full-duplex voice interaction system with cascaded and semi-cascaded implementations,” arXiv preprint arXiv:2509.06502, 2025
[17]

Mtalk-bench: Evaluating speech-to-speech models in multi-turn dialogues via arena-style and rubrics protocols

Y. Du, Q. Huang, G. Zhu, Z. Dai, S. Chen, Q. Zhu, L. Pan, M. Chen, Y. Zhang, L. Zhou et al., “Mtalk-bench: Evaluating speech-to-speech models in multi-turn dialogues via arena-style and rubrics protocols,” arXiv preprint arXiv:2508.18240, 2025
[18]

Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities

G.-T. Lin, J. Lian, T. Li, Q. Wang, G. Anumanchipalli, A. H. Liu, and H.-y. Lee, “Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,” arXiv preprint arXiv:2503.04721, 2025
[19]

Fd-bench: A full-duplex benchmarking pipeline designed for full duplex spoken dialogue systems

Y. Peng, Y.-W. Chao, D. Ng, Y. Ma, C. Ni, B. Ma, and E. S. Chng, “Fd-bench: A full-duplex benchmarking pipeline designed for full duplex spoken dialogue systems,” arXiv preprint arXiv:2507.19040, 2025
[20]

DeepSeek-V3 Technical Report

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024
[21]

Full-Duplex-Bench v1.5: Evaluating Overlap Handling for Full-Duplex Speech Models

G.-T. Lin, S.-Y. S. Kuan, Q. Wang, J. Lian, T. Li, S. Watanabe, and H.-y. Lee, “Full-duplex-bench v1.5: Evaluating overlap handling for full-duplex speech models,” arXiv preprint arXiv:2507.23159, 2025
[22]

Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition

Z. Gao, S. Zhang, I. McLoughlin, and Z. Yan, “Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition,” arXiv preprint arXiv:2206.08317, 2022
[23]

Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier

S. Team, “Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier,” https://github.com/snakers4/silero-vad, 2024
[24]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025
[25]

Unit-based agent for semi-cascaded full-duplex dialogue systems

H. Yu, Y. Chen, and M. Cai, “Unit-based agent for semi-cascaded full-duplex dialogue systems,” CoRR, vol. abs/2601.20230, 2026
[26]

Qwen3-Omni Technical Report

J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu et al., “Qwen3-omni technical report,” arXiv preprint arXiv:2509.17765, 2025
[27]

Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot

A. Zeng, Z. Du, M. Liu, K. Wang, S. Jiang, L. Zhao, Y. Dong, and J. Tang, “Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot,” arXiv preprint arXiv:2412.02612, 2024
[28]

Robust speech recognition via large-scale weak supervision

A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518
[29]

Tinybert: Distilling bert for natural language understanding

X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “Tinybert: Distilling bert for natural language understanding,” in Findings of the association for computational linguistics: EMNLP 2020, 2020, pp. 4163–4174
[30]

Qwen2.5 Technical Report

A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu, “Qwen2...