Full-Duplex-Bench-v2: A Multi-Turn Evaluation Framework for Duplex Dialogue Systems with an Automated Examiner

Guan-Ting Lin; Hung-yi Lee; Jiatong Shi; Kai-Wei Chang; Shih-Yun Shan Kuan; Shinji Watanabe; Siddhant Arora

arxiv: 2510.07838 · v2 · submitted 2025-10-09 · 📡 eess.AS

Guan-Ting Lin, Shih-Yun Shan Kuan, Jiatong Shi, Kai-Wei Chang, Siddhant Arora, Shinji Watanabe, Hung-yi Lee

Pith reviewed 2026-05-18 09:10 UTC · model grok-4.3

keywords: full-duplex dialogue · evaluation framework · multi-turn interaction · automated examiner · turn-taking fluency · entity tracking · speech agents · streaming evaluation

The pith

Full-duplex dialogue systems often struggle with simultaneous speech, corrections, and entity tracking in multi-turn settings.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Full-Duplex-Bench-v2, a streaming evaluation framework for full-duplex speech agents that uses an automated examiner to enforce staged goals under fast and slow pacing. The framework tests four task families covering daily conversations, corrections, entity tracking, and safety while measuring turn-taking fluency and multi-turn instruction following. It supports both commercial APIs and open-source models through a standardized streaming protocol. When applied to existing systems, the tests show frequent confusion during overlapping talk, difficulty incorporating corrections, and loss of referential context. A reader would care because these systems promise natural low-latency interaction yet lack proven reliability over extended exchanges that real users require.

Core claim

We introduce Full-Duplex-Bench-v2 (FDB-v2), a streaming framework that integrates with an automated examiner that enforces staged goals under two pacing setups (Fast vs. Slow). FDB-v2 covers four task families: daily, correction, entity tracking, and safety. We report turn-taking fluency, multi-turn instruction following, and task-specific competence. The framework is extensible, supporting both commercial APIs and open source models. When we test full-duplex systems with FDB-v2, they often get confused when people talk at the same time, struggle to handle corrections smoothly, and sometimes lose track of who or what is being talked about.

What carries the argument

Full-Duplex-Bench-v2 (FDB-v2), a multi-turn evaluation framework that pairs a streaming protocol with an automated examiner enforcing staged goals under fast versus slow pacing across four task families.
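
To make the idea of an examiner that enforces staged goals under a pacing setup concrete, here is a minimal sketch. It is not the FDB-v2 implementation: the class names, the keyword-based goal check, the pacing gap values, and the toy system are all assumptions made for illustration, and a real harness would stream audio rather than pass strings.

```python
from dataclasses import dataclass, field

# Hypothetical pacing presets: seconds the examiner would wait before its
# next turn. The values are illustrative, not taken from the paper.
PACING_GAP_S = {"fast": 0.2, "slow": 1.5}


@dataclass
class Stage:
    goal: str       # label for what this stage is meant to achieve
    utterance: str  # what the examiner says at this stage
    expected: str   # keyword whose presence marks the goal as met (toy check)


@dataclass
class Examiner:
    stages: list
    pacing: str = "fast"
    index: int = 0
    log: list = field(default_factory=list)

    def next_turn(self):
        """Return (wait_seconds, utterance) for the current stage, or None when done."""
        if self.index >= len(self.stages):
            return None
        return PACING_GAP_S[self.pacing], self.stages[self.index].utterance

    def observe(self, system_reply):
        """Record whether the reply met the current stage goal, then advance."""
        stage = self.stages[self.index]
        met = stage.expected.lower() in system_reply.lower()
        self.log.append((stage.goal, met))
        self.index += 1
        return met


def run(examiner, system):
    """Drive the dialogue stage by stage; no real waiting or audio in this sketch."""
    while (turn := examiner.next_turn()) is not None:
        _wait_s, utterance = turn
        examiner.observe(system(utterance))
    return examiner.log


def stubborn_system(utterance):
    # Toy stand-in for a model that ignores corrections.
    return "Booked a table for two."


stages = [
    Stage("initial request", "Book a table for two.", "two"),
    Stage("correction", "Actually, make it four.", "four"),
]
examiner = Examiner(stages, pacing="slow")
log = run(examiner, stubborn_system)
print(log)
```

The per-stage log is what lets a failure such as a missed correction be attributed to a specific point in the exchange rather than to the dialogue as a whole.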

If this is right

Full-duplex systems exhibit confusion when users speak at the same time.
The framework enables consistent measurement of multi-turn instruction following and task competence.
Performance on corrections and entity tracking can be isolated and compared across models.
The open streaming protocol makes it straightforward to add new task families or model types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Widespread use of the benchmark could steer development toward stronger mechanisms for handling interruptions and maintaining context across turns.
Differences in results between fast and slow pacing setups may help determine suitable response timing for particular applications.
The inclusion of safety tasks points to a route for systematically surfacing risks before deployment in open-ended conversations.

Load-bearing premise

The automated examiner and the four task families sufficiently capture the range of real multi-turn human interactions that matter for full-duplex performance.

What would settle it

If human users conducting unscripted multi-turn conversations with the same full-duplex systems produce failure patterns or success rates that differ substantially from those recorded by the automated examiner in FDB-v2, the framework's coverage of relevant interactions would be challenged.

Original abstract

While full-duplex speech agents enable natural, low-latency interaction by speaking and listening simultaneously, their consistency and task performance in multi-turn settings remain underexplored. We introduce Full-Duplex-Bench-v2 (FDB-v2), a streaming framework that integrates with an automated examiner that enforces staged goals under two pacing setups (Fast vs. Slow). FDB-v2 covers four task families: daily, correction, entity tracking, and safety. We report turn-taking fluency, multi-turn instruction following, and task-specific competence. The framework is extensible, supporting both commercial APIs and open source models. When we test full-duplex systems with FDB-v2, they often get confused when people talk at the same time, struggle to handle corrections smoothly, and sometimes lose track of who or what is being talked about. Through an open-sourced, standardized streaming protocol and a task set, FDB-v2 makes it easy to extend to new task families, allowing the community to tailor and accelerate evaluation of multi-turn full-duplex systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces Full-Duplex-Bench-v2 (FDB-v2), a streaming multi-turn evaluation framework for full-duplex dialogue systems. It integrates an automated examiner enforcing staged goals across four task families (daily, correction, entity tracking, safety) under Fast and Slow pacing. The work reports on turn-taking fluency, multi-turn instruction following, and task competence, with qualitative observations that tested systems often become confused during simultaneous speech, struggle with corrections, and lose track of entities. An open-sourced standardized streaming protocol is provided to support extensibility to new tasks and models.

Significance. If the automated examiner is shown to align with human judgments and the task families prove representative, FDB-v2 could provide a useful standardized benchmark for full-duplex systems, helping identify and mitigate weaknesses in simultaneous turn-taking and multi-turn coherence. The open-sourcing of the protocol and support for both commercial APIs and open models are strengths that could facilitate reproducible community evaluations.

major comments (2)

[Abstract] The central claims that full-duplex systems 'often get confused when people talk at the same time, struggle to handle corrections smoothly, and sometimes lose track of who or what is being talked about' rest on qualitative observations without any reported quantitative metrics (e.g., success rates, error rates, or comparisons between Fast/Slow pacing), error bars, or statistical details on how the examiner detects and scores these failures.
[Abstract] The manuscript provides no validation of the automated examiner against human judgments, no inter-rater agreement metrics, and no description of the specific rules or staged goals used to enforce task families and score turn-taking fluency or instruction following. This leaves open whether observed weaknesses are intrinsic or artifacts of examiner design.
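
As an illustration of the kind of error bar the first major comment asks for, the sketch below computes a Wilson score interval for a per-task success rate. The function name and the counts in the usage lines are made up for the example; nothing here reflects numbers or procedures from the paper.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    denom = 1 + z * z / trials
    centre = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return centre - half, centre + half


# Hypothetical counts: 37 of 50 correction dialogues judged successful.
low, high = wilson_interval(successes=37, trials=50)
print(f"success rate 0.74, approx. 95% interval [{low:.3f}, {high:.3f}]")
```

Reporting such an interval separately for the Fast and Slow conditions would show whether an apparent pacing effect exceeds sampling noise.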

minor comments (1)

Clarify the exact number of systems tested and the selection criteria for commercial APIs versus open-source models to allow readers to assess generalizability.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comments point by point below. Where the comments identify areas for improved clarity or additional detail, we will revise the manuscript accordingly while preserving the core contributions of the FDB-v2 framework and open protocol.

Referee: [Abstract] The central claims that full-duplex systems 'often get confused when people talk at the same time, struggle to handle corrections smoothly, and sometimes lose track of who or what is being talked about' rest on qualitative observations without any reported quantitative metrics (e.g., success rates, error rates, or comparisons between Fast/Slow pacing), error bars, or statistical details on how the examiner detects and scores these failures.

Authors: We acknowledge that the abstract presents a concise qualitative summary. The full manuscript contains quantitative results on turn-taking fluency, multi-turn instruction following, and task competence, with explicit comparisons between Fast and Slow pacing conditions. We will revise the abstract to include representative quantitative metrics (e.g., success rates and error rates) and statistical details. We will also expand the Methods section to provide a precise description of the examiner's failure detection and scoring rules, including any error bars or statistical procedures employed.
Revision: yes

Referee: [Abstract] The manuscript provides no validation of the automated examiner against human judgments, no inter-rater agreement metrics, and no description of the specific rules or staged goals used to enforce task families and score turn-taking fluency or instruction following. This leaves open whether observed weaknesses are intrinsic or artifacts of examiner design.

Authors: We agree that explicit description of the examiner's rules strengthens the work. The manuscript already outlines the staged goals for the four task families and the Fast/Slow pacing in the Methods section; we will expand this with more granular scoring rules for turn-taking fluency and instruction following. On validation against human judgments and inter-rater agreement, the current version does not include a formal study. We will add a discussion of internal consistency checks performed during examiner development and will explicitly note the absence of large-scale human validation as a limitation, with suggestions for future work.
Revision: partial

standing simulated objections not resolved

A full-scale human validation study with inter-rater agreement metrics would require new experiments and participant recruitment that cannot be completed within the scope and timeline of a major revision.
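
For reference, the inter-rater agreement metric at issue can be computed with very little machinery once paired labels exist. The sketch below implements Cohen's kappa between an automated examiner's verdicts and a human annotator's verdicts; the label values and the example lists are hypothetical and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts on six dialogues.
examiner_labels = ["pass", "fail", "pass", "pass", "fail", "fail"]
human_labels = ["pass", "fail", "pass", "fail", "fail", "fail"]
print(round(cohens_kappa(examiner_labels, human_labels), 3))
```

Even a small annotated sample scored this way would indicate whether the examiner's verdicts track human judgment more closely than chance.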

Circularity Check

0 steps flagged

No circularity: benchmark definition and empirical observations are self-contained

The paper introduces FDB-v2 as a new streaming evaluation framework with an automated examiner and four task families, then reports empirical observations on full-duplex system behaviors under Fast/Slow pacing. No mathematical derivations, first-principles predictions, fitted parameters, or equations appear in the provided text. The central claims about system confusion during simultaneous speech, correction handling, and entity tracking are direct outputs from applying the defined benchmark rather than reductions to self-citations, ansatzes, or renamed known results. The contribution is the benchmark definition itself, which stands independently without load-bearing self-citation chains or uniqueness theorems imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that the chosen task families and pacing modes are representative; no free parameters, mathematical axioms, or invented physical entities are introduced.

axioms (1)

Domain assumption: The four task families (daily, correction, entity tracking, safety) and two pacing setups adequately sample the space of multi-turn full-duplex interactions. Invoked when the authors claim the benchmark covers relevant failure modes.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
cs.SD · 2026-05 · accept · novelty 8.0
EVA-Bench introduces a simulation-plus-scoring framework for voice agents that reveals no tested system exceeds 0.5 on both accuracy and experience metrics at pass@1.

TiCo: Time-Controllable Spoken Dialogue Model
cs.CL · 2026-03 · unverdicted · novelty 7.0
TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
cs.CL · 2026-04 · unverdicted · novelty 6.0
ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.