Full-Duplex Interaction in Spoken Dialogue Systems: A Comprehensive Study from the ICASSP 2026 HumDial Challenge
Pith reviewed 2026-05-08 13:15 UTC · model grok-4.3
The pith
This paper establishes a benchmark and releases a dual-channel dataset to evaluate full-duplex spoken dialogue systems capable of handling real-time interruptions and overlaps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a high-quality dual-channel dataset of real human-recorded conversations, capturing interruptions, overlapping speech, and feedback mechanisms, together with the HumDial-FDBench framework, makes it possible to assess how well full-duplex spoken dialogue systems handle interruptions while maintaining conversational flow. This in turn supports the development of more human-like systems.
What carries the argument
The HumDial-FDBench benchmark, which uses the dual-channel dataset to evaluate interruption handling and dynamic turn negotiation in spoken dialogue.
If this is right
- Systems can be objectively evaluated on their interruption handling capabilities.
- Open-source and proprietary models can be compared transparently on a public leaderboard.
- The resources enable development of more responsive and adaptive dialogue systems.
- The dataset provides a foundation for testing feedback mechanisms in conversations.
Where Pith is reading between the lines
- Improved full-duplex systems could make voice assistants more suitable for natural group discussions or collaborative settings.
- This evaluation approach might reveal specific weaknesses in current AI models' real-time speech processing.
- Future work could extend the benchmark to include visual cues or multi-speaker scenarios for richer assessment.
Load-bearing premise
The dual-channel dataset and HumDial-FDBench framework sufficiently capture and measure the full range of real-time interruption handling and dynamic turn negotiation in natural human conversations.
What would settle it
A test showing that high-performing systems on this benchmark still produce unnatural responses or fail to handle interruptions effectively in live, spontaneous human interactions would challenge the claim.
Original abstract
Full-duplex interaction, where speakers and listeners converse simultaneously, is a key element of human communication often missing from traditional spoken dialogue systems. These systems, based on rigid turn-taking paradigms, struggle to respond naturally in dynamic conversations. The Full-Duplex Interaction Track of ICASSP 2026 Human-like Spoken Dialogue Systems Challenge (HumDial Challenge) aims to advance the evaluation of full-duplex systems by offering a framework for handling real-time interruptions, speech overlap, and dynamic turn negotiation. We introduce a comprehensive benchmark for full-duplex spoken dialogue systems, built from the HumDial Challenge. We release a high-quality dual-channel dataset of real human-recorded conversations, capturing interruptions, overlapping speech, and feedback mechanisms. This dataset forms the basis for the HumDial-FDBench benchmark, which assesses a system's ability to handle interruptions while maintaining conversational flow. Additionally, we create a public leaderboard to compare the performance of open-source and proprietary models, promoting transparent, reproducible evaluation. These resources support the development of more responsive, adaptive, and human-like dialogue systems.
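To make the role of the dual-channel recordings concrete, here is a minimal Python sketch of how overlapping speech could be located once each channel has been segmented into speech intervals. The `(start, end)` segment format, the function name `overlap_intervals`, and the example timings are assumptions made for illustration; the abstract does not specify the dataset's annotation format or how overlaps are computed.

```python
from typing import List, Tuple

# Hypothetical annotation format: (start_sec, end_sec) speech segments per channel.
Segment = Tuple[float, float]


def overlap_intervals(user: List[Segment], system: List[Segment]) -> List[Segment]:
    """Return the time intervals in which both channels contain speech.

    Assumes the segments within each channel do not overlap one another.
    """
    user = sorted(user)
    system = sorted(system)
    i = j = 0
    overlaps: List[Segment] = []
    while i < len(user) and j < len(system):
        # Intersection of the two current segments, if any.
        start = max(user[i][0], system[j][0])
        end = min(user[i][1], system[j][1])
        if start < end:
            overlaps.append((start, end))
        # Advance whichever segment finishes first.
        if user[i][1] <= system[j][1]:
            i += 1
        else:
            j += 1
    return overlaps


# Illustrative timings only (not taken from the dataset).
user_speech = [(0.0, 2.0), (5.0, 6.0)]
system_speech = [(1.5, 5.5)]
result = overlap_intervals(user_speech, system_speech)
print(result)
```

In this sketch, overlap reduces to an interval intersection because each speaker is assumed to sit on a separate channel, so no source separation step is needed before comparing the two sides.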
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Full-Duplex Interaction Track of the ICASSP 2026 HumDial Challenge. It presents the HumDial-FDBench benchmark for evaluating full-duplex spoken dialogue systems, releases a dual-channel dataset of real human-recorded conversations capturing interruptions, overlapping speech, and feedback mechanisms, and establishes a public leaderboard to compare open-source and proprietary models.
Significance. If the dataset collection procedures ensure high quality and the benchmark tasks meaningfully quantify interruption handling and conversational flow, this resource-release paper could advance spoken dialogue research by filling a gap left by half-duplex systems and enabling reproducible comparisons via the leaderboard. The explicit support for both open-source and proprietary models is a constructive element that encourages broader adoption.
major comments (1)
- [Benchmark description] The description of the HumDial-FDBench benchmark (abstract and associated sections) provides no concrete evaluation metrics, scoring functions, or task definitions for measuring interruption handling or maintenance of conversational flow. Without these, it is impossible to verify that the released resources actually support the central claim of assessing real-time full-duplex capabilities.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address the single major comment below and have made revisions to strengthen the benchmark description.
- Referee: [Benchmark description] The description of the HumDial-FDBench benchmark (abstract and associated sections) provides no concrete evaluation metrics, scoring functions, or task definitions for measuring interruption handling or maintenance of conversational flow. Without these, it is impossible to verify that the released resources actually support the central claim of assessing real-time full-duplex capabilities.
Authors: We agree that the original manuscript's description of HumDial-FDBench was insufficiently concrete on metrics and task definitions. In the revised version, we have added a dedicated subsection (Section 3.2) that explicitly defines the two core tasks: (1) Interruption Handling, evaluated via interruption detection F1-score, response latency (in ms) to overlap onset, and overlap resolution success rate; (2) Conversational Flow Maintenance, measured by turn-negotiation accuracy, dialogue coherence (via embedding similarity to human references), and feedback appropriateness. Scoring functions are now provided, including a composite full-duplex score that weights latency and coherence against dataset-derived human baselines. These additions make the evaluation criteria verifiable and directly tied to the dual-channel recordings and public leaderboard.
Revision: yes
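The first two Interruption Handling metrics named in this simulated response can be illustrated with a short Python sketch. Everything here is an assumption for illustration: the function names, the boolean per-event labels, and the timestamp inputs are hypothetical, and the paper's actual scoring functions, including the composite full-duplex score, are not given in this text and are not reproduced.

```python
from typing import Sequence


def interruption_f1(reference: Sequence[bool], predicted: Sequence[bool]) -> float:
    """F1-score for interruption detection over aligned candidate events.

    reference[i] is True if event i is a real interruption (hypothetical label);
    predicted[i] is True if the system treated event i as an interruption.
    """
    if len(reference) != len(predicted):
        raise ValueError("reference and predicted must have the same length")
    tp = sum(1 for r, p in zip(reference, predicted) if r and p)
    fp = sum(1 for r, p in zip(reference, predicted) if not r and p)
    fn = sum(1 for r, p in zip(reference, predicted) if r and not p)
    if tp == 0:
        return 0.0  # Also avoids division by zero when nothing is detected.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def mean_response_latency_ms(
    overlap_onsets_s: Sequence[float], response_times_s: Sequence[float]
) -> float:
    """Mean delay, in milliseconds, between overlap onset and system reaction."""
    if not overlap_onsets_s or len(overlap_onsets_s) != len(response_times_s):
        raise ValueError("inputs must be non-empty and of equal length")
    latencies = [
        (response - onset) * 1000.0
        for onset, response in zip(overlap_onsets_s, response_times_s)
    ]
    return sum(latencies) / len(latencies)


# Illustrative values only (not results from the paper or the leaderboard).
f1 = interruption_f1([True, True, False, False], [True, False, True, False])
latency = mean_response_latency_ms([1.0, 2.0], [1.25, 2.5])
print(f1, latency)
```

Keeping detection quality and latency as separate functions reflects that a system can score well on one and poorly on the other; how the two would be combined into a single score is left open here because the weighting is not stated.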
Circularity Check
No significant circularity in derivation chain
The paper is a challenge description and resource-release document whose central claims consist of defining HumDial-FDBench tasks, releasing a dual-channel human-recorded dataset, and opening a public leaderboard. These claims are satisfied directly by the acts of specification and data provision; no equations, fitted parameters, predictions, or derivation steps appear that could reduce to inputs by construction. No self-citation load-bearing arguments, uniqueness theorems, or ansatzes are invoked. The work is self-contained against external benchmarks as a factual release of resources and evaluation framework.
Reference graph
Works this paper leans on
-
[1]
Introduction. In natural communication, human dialogue rarely follows strict turn alternation. Instead, speakers and listeners interact through a continuous full-duplex process in which listening and speaking may occur simultaneously. Participants regulate conversational flow through cues such as intonation, pauses, prosody, and semantic context. These...
-
[2]
Related Work. Full-Duplex Spoken Interaction. Mainstream spoken dialogue system [7, 4] architectures can be broadly categorized into cascaded pipelines and end-to-end frameworks. Cascaded systems typically chain modules such as ASR, LLM, and TTS to complete the interaction, as exemplified by Google-Duplex [1] and Firered-Chat [8]. However, this design o...
-
[3]
“uh-huh,” “yeah” [footnote 2: https://github.com/TEN-framework/ten-turn-detection; footnote 3: https://github.com/pipecat-ai/Smart-Turn]
Released Dataset. 3.1. Scenarios. As shown in Table 1, the released dataset includes two major categories: Interruption and Rejection, with a total of eight scenarios. Interruption: The Interruption scenario assesses the capability of a model to adapt an ongoing response when the user intervenes. It includes five sub-scenarios. Follow-up Questions cove...
-
[4]
HumDial-FDBench. The HumDial-FDBench evaluation protocol is built upon Full-Duplex-Bench v1.5 [13] and introduces several extensions to support more complex interaction scenarios and a more comprehensive assessment of full-duplex dialogue systems. 4.1. Behavior evaluation. To analyze the semantic response strategy of a model following speech overlap, w...
-
[5]
Leaderboard. 5.1. Model Evaluation. We evaluated a diverse set of dialogue management models, divided into two categories: Open-Source Models, such as Freeze-Omni [2], Moshi [4], and Osum-EChat, which provide transparency and reproducibility, and Closed-Source Models, like Gemini 2.5 [16]. To ensure consistent evaluation across systems, we use a baseline th...
-
[6]
Analysis. This section summarizes the representative design choices of Track II submissions from three complementary perspectives: architecture paradigm, turn-taking strategy, and training strategy. The goal is to highlight the major design axes and the practical trade-offs that shaped system performance under full-duplex interaction. 6.1. System Archit...
-
[7]
Conclusion. We conduct a comprehensive study based on the Full-Duplex Interaction Track of the ICASSP 2026 Human-like Spoken Dialogue Systems Challenge. We release a dual-channel dataset of real human-recorded conversations to advance research on real-time, full-duplex systems. To facilitate systematic evaluation, we introduce the HumDial-FDBench bench...
-
[8]
Generative AI Use Disclosure. In this study, generative AI tools were solely used for language refinement and editorial support, aimed at improving the clarity, readability, and overall flow of the manuscript. These tools were not involved in the development of the methodology, execution of experiments, generation of results, or formulation of conclusion...
-
[9]
Y. Leviathan and Y. Matias, “Google duplex: An ai system for accomplishing real-world tasks over the phone,” Google AI blog, vol. 8, 2018
-
[10]
X. Wang, Y. Li, C. Fu, Y. Shen, L. Xie, K. Li, X. Sun, and L. Ma, “Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm,” arXiv preprint arXiv:2411.00774, 2024
-
[11]
C. Fu, H. Lin, X. Wang, Y.-F. Zhang, Y. Shen, X. Liu, H. Cao, Z. Long, H. Gao, K. Li et al., “Vita-1.5: Towards gpt-4o level real-time vision and speech interaction,” arXiv preprint arXiv:2501.01957, 2025
-
[12]
A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024
-
[13]
X. Geng, Q. Shao, H. Xue, S. Wang, H. Xie, Z. Guo, Y. Zhao, G. Li, W. Tian, C. Wang et al., “Osum-echat: Enhancing end-to-end empathetic spoken chatbot via understanding-driven spoken dialogue,” arXiv preprint arXiv:2508.09600, 2025
-
[14]
Z. Zhao, S. Wang, G. Li, H. Xue, C. Wang, S. Wang, L. Xiao, Z. Zhang, H. Bu, X. Xu, X. Wang, H. Liu, E. S. Chng, H. Lee, H. Li, and L. Xie, “The ICASSP 2026 humdial challenge: Benchmarking human-like spoken dialogue systems in the LLM era,” CoRR, vol. abs/2601.05564, 2026
-
[15]
W. Yu, S. Wang, X. Yang, X. Chen, X. Tian, J. Zhang, G. Sun, L. Lu, Y. Wang, and C. Zhang, “Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation,” arXiv preprint arXiv:2411.18138, 2024
-
[16]
J. Chen, Y. Hu, J. Li, K. Li, K. Liu, W. Li, X. Li, Z. Li, F. Shen, X. Tang et al., “Fireredchat: A pluggable, full-duplex voice interaction system with cascaded and semi-cascaded implementations,” arXiv preprint arXiv:2509.06502, 2025
-
[17]
Y. Du, Q. Huang, G. Zhu, Z. Dai, S. Chen, Q. Zhu, L. Pan, M. Chen, Y. Zhang, L. Zhou et al., “Mtalk-bench: Evaluating speech-to-speech models in multi-turn dialogues via arena-style and rubrics protocols,” arXiv preprint arXiv:2508.18240, 2025
-
[18]
G.-T. Lin, J. Lian, T. Li, Q. Wang, G. Anumanchipalli, A. H. Liu, and H.-y. Lee, “Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities,” arXiv preprint arXiv:2503.04721, 2025
-
[19]
Y. Peng, Y.-W. Chao, D. Ng, Y. Ma, C. Ni, B. Ma, and E. S. Chng, “Fd-bench: A full-duplex benchmarking pipeline designed for full duplex spoken dialogue systems,” arXiv preprint arXiv:2507.19040, 2025
-
[20]
A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., “Deepseek-v3 technical report,” arXiv preprint arXiv:2412.19437, 2024
-
[21]
G.-T. Lin, S.-Y. S. Kuan, Q. Wang, J. Lian, T. Li, S. Watanabe, and H.-y. Lee, “Full-duplex-bench v1.5: Evaluating overlap handling for full-duplex speech models,” arXiv preprint arXiv:2507.23159, 2025
-
[22]
Z. Gao, S. Zhang, I. McLoughlin, and Z. Yan, “Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition,” arXiv preprint arXiv:2206.08317, 2022
-
[23]
S. Team, “Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier,” https://github.com/snakers4/silero-vad, 2024
-
[24]
G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025
-
[25]
H. Yu, Y. Chen, and M. Cai, “Unit-based agent for semi-cascaded full-duplex dialogue systems,” CoRR, vol. abs/2601.20230, 2026
-
[26]
J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu et al., “Qwen3-omni technical report,” arXiv preprint arXiv:2509.17765, 2025
-
[27]
A. Zeng, Z. Du, M. Liu, K. Wang, S. Jiang, L. Zhao, Y. Dong, and J. Tang, “Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot,” arXiv preprint arXiv:2412.02612, 2024
-
[28]
A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in International conference on machine learning. PMLR, 2023, pp. 28492–28518
-
[29]
X. Jiao, Y. Yin, L. Shang, X. Jiang, X. Chen, L. Li, F. Wang, and Q. Liu, “Tinybert: Distilling bert for natural language understanding,” in Findings of the association for computational linguistics: EMNLP 2020, 2020, pp. 4163–4174
-
[30]
A. Yang, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Li, D. Liu, F. Huang, H. Wei, H. Lin, J. Yang, J. Tu, J. Zhang, J. Yang, J. Yang, J. Zhou, J. Lin, K. Dang, K. Lu, K. Bao, K. Yang, L. Yu, M. Li, M. Xue, P. Zhang, Q. Zhu, R. Men, R. Lin, T. Li, T. Xia, X. Ren, X. Ren, Y. Fan, Y. Su, Y. Zhang, Y. Wan, Y. Liu, Z. Cui, Z. Zhang, and Z. Qiu, “Qwen2...