Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner
Pith reviewed 2026-05-18 09:10 UTC · model grok-4.3
The pith
Full-duplex dialogue systems often struggle with simultaneous speech, corrections, and entity tracking in multi-turn settings.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce Full-Duplex-Bench-v2 (FDB-v2), a streaming framework that integrates with an automated examiner that enforces staged goals under two pacing setups (Fast vs. Slow). FDB-v2 covers four task families: daily, correction, entity tracking, and safety. We report turn-taking fluency, multi-turn instruction following, and task-specific competence. The framework is extensible, supporting both commercial APIs and open source models. When we test full-duplex systems with FDB-v2, they often get confused when people talk at the same time, struggle to handle corrections smoothly, and sometimes lose track of who or what is being talked about.
What carries the argument
Full-Duplex-Bench-v2 (FDB-v2), a multi-turn evaluation framework that pairs a streaming protocol with an automated examiner enforcing staged goals under fast versus slow pacing across four task families.
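A minimal sketch of what an examiner that enforces staged goals under a pacing setup might look like. This is an illustration under stated assumptions, not the FDB-v2 implementation: the names Stage, Examiner, StubbornModel, and PACING_DELAY_S, the delay values, and the text-only interface are all hypothetical, and the real framework operates on streaming audio.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical pacing values (seconds the examiner would wait between turns).
# The paper only names the two setups; these numbers are assumptions.
PACING_DELAY_S = {"fast": 0.2, "slow": 1.5}


@dataclass
class Stage:
    """One staged goal: what the examiner says and how success is checked."""
    prompt: str
    is_satisfied: Callable[[str], bool]


@dataclass
class Examiner:
    stages: List[Stage]
    pacing: str = "fast"
    log: List[dict] = field(default_factory=list)

    def run(self, respond: Callable[[str], str]) -> float:
        """Walk through the stages in order; return the fraction satisfied."""
        if not self.stages:
            raise ValueError("at least one stage is required")
        delay = PACING_DELAY_S[self.pacing]
        passed = 0
        for index, stage in enumerate(self.stages):
            reply = respond(stage.prompt)
            ok = stage.is_satisfied(reply)
            passed += ok
            # The delay is recorded rather than slept, to keep the sketch fast.
            self.log.append({"stage": index, "delay_s": delay, "ok": ok})
        return passed / len(self.stages)


class StubbornModel:
    """Toy stand-in for a system under test: it never applies a correction."""

    def __init__(self) -> None:
        self.time = None

    def __call__(self, prompt: str) -> str:
        if self.time is None:
            self.time = prompt.rsplit(" at ", 1)[-1]
        return f"Booked for {self.time}."


# A two-stage task in the spirit of the "correction" family.
correction_task = [
    Stage("a table for two at 7 pm", lambda reply: "7 pm" in reply),
    Stage("sorry, make that 8 pm", lambda reply: "8 pm" in reply),
]
examiner = Examiner(correction_task, pacing="slow")
score = examiner.run(StubbornModel())
```

Here the toy model satisfies the first stage but ignores the correction, so the per-stage log isolates exactly where the multi-turn failure occurs.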
If this is right
- Full-duplex systems exhibit confusion during overlapping speech, when people talk at the same time.
- The framework enables consistent measurement of multi-turn instruction following and task competence.
- Performance on corrections and entity tracking can be isolated and compared across models.
- The open streaming protocol makes it straightforward to add new task families or model types.
Where Pith is reading between the lines
- Widespread use of the benchmark could steer development toward stronger mechanisms for handling interruptions and maintaining context across turns.
- Differences in results between fast and slow pacing setups may help determine suitable response timing for particular applications.
- The inclusion of safety tasks points to a route for systematically surfacing risks before deployment in open-ended conversations.
Load-bearing premise
The automated examiner and the four task families sufficiently capture the range of real multi-turn human interactions that matter for full-duplex performance.
What would settle it
If human users conducting unscripted multi-turn conversations with the same full-duplex systems produce failure patterns or success rates that differ substantially from those recorded by the automated examiner in FDB-v2, the framework's coverage of relevant interactions would be challenged.
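One way to make "differ substantially" concrete is to compare success rates with confidence intervals. The sketch below uses the Wilson score interval and treats non-overlapping intervals as a conservative sign of a real difference. The helper names and the counts in the example are placeholders chosen for illustration; they are not results from the paper.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives about 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z ** 2 / trials
    centre = (p + z ** 2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z ** 2 / (4 * trials ** 2)) / denom
    return centre - half, centre + half


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Placeholder counts, NOT taken from the paper: successes out of 100 dialogues.
examiner_ci = wilson_interval(30, 100)  # automated examiner sessions
human_ci = wilson_interval(60, 100)     # unscripted human sessions
rates_differ = not intervals_overlap(examiner_ci, human_ci)
```

Non-overlap is a stricter criterion than a formal two-proportion test, so this check errs on the side of not declaring a difference.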
Original abstract
While full-duplex speech agents enable natural, low-latency interaction by speaking and listening simultaneously, their consistency and task performance in multi-turn settings remain underexplored. We introduce Full-Duplex-Bench-v2 (FDB-v2), a streaming framework that integrates with an automated examiner that enforces staged goals under two pacing setups (Fast vs. Slow). FDB-v2 covers four task families: daily, correction, entity tracking, and safety. We report turn-taking fluency, multi-turn instruction following, and task-specific competence. The framework is extensible, supporting both commercial APIs and open source models. When we test full-duplex systems with FDB-v2, they often get confused when people talk at the same time, struggle to handle corrections smoothly, and sometimes lose track of who or what is being talked about. Through an open-sourced, standardized streaming protocol and a task set, FDB-v2 makes it easy to extend to new task families, allowing the community to tailor and accelerate evaluation of multi-turn full-duplex systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Full-Duplex-Bench-v2 (FDB-v2), a streaming multi-turn evaluation framework for full-duplex dialogue systems. It integrates an automated examiner enforcing staged goals across four task families (daily, correction, entity tracking, safety) under Fast and Slow pacing. The work reports on turn-taking fluency, multi-turn instruction following, and task competence, with qualitative observations that tested systems often become confused during simultaneous speech, struggle with corrections, and lose track of entities. An open-sourced standardized streaming protocol is provided to support extensibility to new tasks and models.
Significance. If the automated examiner is shown to align with human judgments and the task families prove representative, FDB-v2 could provide a useful standardized benchmark for full-duplex systems, helping identify and mitigate weaknesses in simultaneous turn-taking and multi-turn coherence. The open-sourcing of the protocol and support for both commercial APIs and open models are strengths that could facilitate reproducible community evaluations.
major comments (2)
- [Abstract] The central claims that full-duplex systems 'often get confused when people talk at the same time, struggle to handle corrections smoothly, and sometimes lose track of who or what is being talked about' rest on qualitative observations without any reported quantitative metrics (e.g., success rates, error rates, or comparisons between Fast/Slow pacing), error bars, or statistical details on how the examiner detects and scores these failures.
- [Abstract] The manuscript provides no validation of the automated examiner against human judgments, no inter-rater agreement metrics, and no description of the specific rules or staged goals used to enforce task families and score turn-taking fluency or instruction following. This leaves open whether observed weaknesses are intrinsic or artifacts of examiner design.
minor comments (1)
- Clarify the exact number of systems tested and the selection criteria for commercial APIs versus open-source models to allow readers to assess generalizability.
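The inter-rater agreement requested in the second major comment could be reported with a chance-corrected statistic such as Cohen's kappa. The sketch below assumes the examiner and a human annotator each assign one categorical label to the same set of dialogues; the function name and the example labels are hypothetical and do not come from the paper.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohen_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both raters labelled independently at their own rates.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only: examiner verdicts vs. one human annotator.
examiner_labels = ["pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "fail", "fail"]
kappa = cohen_kappa(examiner_labels, human_labels)
```

A kappa near 1 would indicate that examiner verdicts track human judgments, while a value near 0 would mean agreement no better than chance.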
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address the major comments point by point below. Where the comments identify areas for improved clarity or additional detail, we will revise the manuscript accordingly while preserving the core contributions of the FDB-v2 framework and open protocol.
- Referee: [Abstract] The central claims that full-duplex systems 'often get confused when people talk at the same time, struggle to handle corrections smoothly, and sometimes lose track of who or what is being talked about' rest on qualitative observations without any reported quantitative metrics (e.g., success rates, error rates, or comparisons between Fast/Slow pacing), error bars, or statistical details on how the examiner detects and scores these failures.
  Authors: We acknowledge that the abstract presents a concise qualitative summary. The full manuscript contains quantitative results on turn-taking fluency, multi-turn instruction following, and task competence, with explicit comparisons between Fast and Slow pacing conditions. We will revise the abstract to include representative quantitative metrics (e.g., success rates and error rates) and statistical details. We will also expand the Methods section to provide a precise description of the examiner's failure detection and scoring rules, including any error bars or statistical procedures employed.
  revision: yes
- Referee: [Abstract] The manuscript provides no validation of the automated examiner against human judgments, no inter-rater agreement metrics, and no description of the specific rules or staged goals used to enforce task families and score turn-taking fluency or instruction following. This leaves open whether observed weaknesses are intrinsic or artifacts of examiner design.
  Authors: We agree that explicit description of the examiner's rules strengthens the work. The manuscript already outlines the staged goals for the four task families and the Fast/Slow pacing in the Methods section; we will expand this with more granular scoring rules for turn-taking fluency and instruction following. On validation against human judgments and inter-rater agreement, the current version does not include a formal study. We will add a discussion of internal consistency checks performed during examiner development and will explicitly note the absence of large-scale human validation as a limitation, with suggestions for future work.
  revision: partial
- A full-scale human validation study with inter-rater agreement metrics would require new experiments and participant recruitment that cannot be completed within the scope and timeline of a major revision.
Circularity Check
No circularity: benchmark definition and empirical observations are self-contained
The paper introduces FDB-v2 as a new streaming evaluation framework with an automated examiner and four task families, then reports empirical observations on full-duplex system behaviors under Fast/Slow pacing. No mathematical derivations, first-principles predictions, fitted parameters, or equations appear in the provided text. The central claims about system confusion during simultaneous speech, correction handling, and entity tracking are direct outputs from applying the defined benchmark rather than reductions to self-citations, ansatzes, or renamed known results. The contribution is the benchmark definition itself, which stands independently without load-bearing self-citation chains or uniqueness theorems imported from prior author work.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The four task families (daily, correction, entity tracking, safety) and two pacing setups adequately sample the space of multi-turn full-duplex interactions.
Forward citations
Cited by 3 Pith papers
- EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
  EVA-Bench introduces a simulation-plus-scoring framework for voice agents that reveals no tested system exceeds 0.5 on both accuracy and experience metrics at pass@1.
- TiCo: Time-Controllable Spoken Dialogue Model
  TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
- ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
  ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.