pith. machine review for the scientific record.

arxiv: 2605.13841 · v1 · submitted 2026-05-13 · 💻 cs.SD · cs.AI · cs.CL · cs.LG


EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Anil Madamala, Fanny Riols, Gabrielle Gauthier Melançon, Hari Subramani, Hoang H. Nguyen, Joseph Marinier, Katrina Stankiewicz, Lindsay Devon Brin, Oluwanifemi Bamgbose, Raghav Mehndiratta, Sridhar Krishna Nemala, Srinivas Sunkara, Tara Bogavelli

Pith reviewed 2026-05-14 17:26 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.CL · cs.LG
keywords voice agents · evaluation benchmark · conversational AI · speech fidelity · robustness testing · multi-turn dialogue · accuracy metrics · experience metrics

The pith

None of the twelve evaluated voice agents exceeds 0.5 on both the accuracy and experience metrics simultaneously.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EVA-Bench to evaluate voice agents by running bot-to-bot audio conversations over multi-turn tasks and scoring them with two composite metrics. EVA-A combines task completion, faithfulness to instructions, and audio speech quality. EVA-X combines smooth conversation flow, spoken conciseness, and natural turn-taking timing. Testing twelve systems across three architectures and 213 enterprise scenarios shows that none clears 0.5 on both metrics at pass@1, that large gaps appear between best-case and consistent runs, and that accent or noise changes expose clear robustness shortfalls. Readers care because voice agents are already used in customer service and enterprise workflows, and a shared yardstick lets developers see exactly where each architecture falls short.
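A minimal sketch of how the two composites might be assembled, assuming equal-weight averaging of the named sub-scores (the paper's actual weighting, judge prompts, and score ranges are not given in this summary, so everything below is an illustrative assumption):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class ConversationScores:
    # Hypothetical per-conversation sub-scores in [0, 1]; the real EVA-Bench
    # components, ranges, and weighting may differ.
    task_completion: float
    faithfulness: float
    speech_fidelity: float
    progression: float
    conciseness: float
    turn_taking: float

def eva_a(s: ConversationScores) -> float:
    """Accuracy composite: assumed here to be an unweighted mean of its parts."""
    return mean([s.task_completion, s.faithfulness, s.speech_fidelity])

def eva_x(s: ConversationScores) -> float:
    """Experience composite: assumed here to be an unweighted mean of its parts."""
    return mean([s.progression, s.conciseness, s.turn_taking])

# Example: a run that completes the task but rambles and mistimes its turns.
run = ConversationScores(0.9, 0.8, 0.7, 0.6, 0.3, 0.5)
print(f"EVA-A={eva_a(run):.2f}  EVA-X={eva_x(run):.2f}")  # 0.80 vs 0.47
```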

Core claim

EVA-Bench generates realistic multi-turn dialogues through bot-to-bot audio interaction with automatic error detection and regeneration, then scores agents on EVA-A for accuracy and fidelity and EVA-X for experience and timing. Across 213 scenarios and controlled accent and noise perturbations, no system exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; the median pass@k minus pass^k gap reaches 0.44 on EVA-A; and perturbations produce mean drops of up to 0.314 that differ by architecture and metric.

What carries the argument

The EVA-Bench end-to-end framework, which runs validated bot-to-bot audio dialogues and applies the paired EVA-A and EVA-X composite metrics together with pass@1, pass@k, and pass^k statistics.
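The pass statistics do the work of separating peak from reliable capability. One plausible reading, assuming each scenario yields a boolean pass per run and simple averaging across scenarios (the paper's exact definitions and estimator are not reproduced here), is sketched below: pass@k credits a system if any of k attempts succeeds, pass^k only if all of them do, and their difference is the peak-versus-consistency gap the 0.44 median refers to.

```python
def pass_stats(runs_by_scenario: dict[str, list[bool]], k: int) -> dict[str, float]:
    """Hypothetical pass@1 / pass@k / pass^k over repeated runs per scenario.

    runs_by_scenario maps a scenario id to at least k boolean outcomes
    ("did this run pass?"). The paper's exact estimator may differ, e.g.
    an unbiased combinatorial estimate rather than raw empirical rates.
    """
    def frac(flags: list[bool]) -> float:
        return sum(flags) / len(flags)

    scenarios = list(runs_by_scenario.values())
    pass_at_1 = frac([runs[0] for runs in scenarios])        # single attempt
    pass_at_k = frac([any(runs[:k]) for runs in scenarios])  # best of k (peak)
    pass_pow_k = frac([all(runs[:k]) for runs in scenarios]) # all k (reliable)
    return {"pass@1": pass_at_1, "pass@k": pass_at_k,
            "pass^k": pass_pow_k, "gap": pass_at_k - pass_pow_k}

# Toy data: one rock-solid scenario, one flaky one.
runs = {"book_flight": [True, True, True, True],
        "it_helpdesk": [False, True, False, True]}
print(pass_stats(runs, k=4))
# {'pass@1': 0.5, 'pass@k': 1.0, 'pass^k': 0.5, 'gap': 0.5}
```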

If this is right

  • Different voice-agent architectures can be ranked on identical accuracy and experience scales for the first time.
  • Reliability engineering must close the 0.44 median gap between peak and consistent performance on accuracy tasks.
  • Accent and noise robustness must be treated as first-class requirements that vary by architecture.
  • The 213-scenario suite across three enterprise domains supplies a reusable test bed for targeted fixes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures may face an inherent trade-off that future work could resolve by combining strengths of the three current families.
  • Adding more open-ended or multi-party scenarios would test whether the current metrics still separate systems cleanly.
  • If EVA-X turns out to drive user retention more than EVA-A, teams might deliberately accept lower accuracy for better flow.
  • The automatic validation step could be reused as a training signal to reduce simulator errors in other dialogue systems.

Load-bearing premise

That bot-to-bot simulated conversations with automatic validation match the distribution of real human voice interactions, and that the EVA-A and EVA-X scores track downstream user satisfaction or task success.

What would settle it

A head-to-head study that runs the same twelve agents with real human users, records satisfaction and task-success rates, and checks whether the ordering or absolute levels match the EVA-Bench rankings.
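A minimal sketch of how such a head-to-head comparison could be scored, assuming per-system human outcomes are collected and a rank correlation is the test statistic (the choice of Spearman's rho and every number below are illustrative assumptions, not results from the paper):

```python
from scipy.stats import spearmanr

# Hypothetical per-system values: EVA-Bench score vs. real-user outcome.
eva_a_pass1   = [0.48, 0.41, 0.37, 0.52, 0.29]   # benchmark ranking (illustrative)
human_success = [0.71, 0.65, 0.58, 0.69, 0.44]   # human-study task success (illustrative)

rho, p_value = spearmanr(eva_a_pass1, human_success)
print(f"Spearman rho={rho:.2f}, p={p_value:.3f}")
# High rho: EVA-Bench rankings track real-user outcomes.
# Low rho: the benchmark-specificity concern stands.
```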

read the original abstract

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces EVA-Bench, an end-to-end framework for evaluating voice agents via bot-to-bot audio conversations with automatic simulation validation and regeneration. It defines two composite metrics—EVA-A (capturing task completion, faithfulness, and speech fidelity) and EVA-X (capturing conversation progression, conciseness, and turn-taking timing)—and applies them to 213 scenarios across three enterprise domains. The work evaluates 12 systems spanning three architectures under pass@1/pass@k/pass^k protocols and a controlled accent/noise perturbation suite, reporting that no system exceeds 0.5 on both EVA-A and EVA-X pass@1, a median 0.44 gap between peak and reliable performance on EVA-A, and architecture-varying robustness drops up to 0.314.

Significance. If the simulation and metrics prove representative, EVA-Bench fills a gap by enabling direct cross-architecture comparison of voice-specific failure modes and by releasing the full framework, evaluation suite, and data under open license. The empirical distinctions between peak/reliable capability and the quantified robustness gaps under perturbation provide concrete, falsifiable baselines that future systems can target.

major comments (1)
  1. [Abstract and results section] The claims that the observed accuracy-experience trade-off and robustness gaps (e.g., mean drops of up to 0.314) are indicative of production voice-agent behavior rest on the unvalidated assumption that bot-to-bot dialogues plus automatic error detection faithfully reproduce human turn-taking, intent distributions, and downstream task success. No human A/B correlation studies, user-satisfaction ratings, or comparisons against real completion rates are reported, which is load-bearing for interpreting the numerical findings as generalizable rather than benchmark-specific.
minor comments (2)
  1. [Framework description] The description of how automatic validation detects simulator errors and triggers regeneration could be expanded with a concrete example or pseudocode to improve reproducibility (see the sketch after these comments).
  2. [Results tables/figures] Table or figure captions for the 12-system results should explicitly note the number of runs per condition to clarify the statistical basis of the reported medians and means.
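Taking up the referee's first minor comment, here is a minimal, hedged sketch of what such a validation-and-regeneration loop could look like. The example checks echo simulator failure modes the framework is described as catching, but the control flow, function names, and retry budget are assumptions, not the authors' implementation:

```python
from typing import Callable, Optional

Conversation = dict  # hypothetical: turns, tool calls, final database state
Validator = Callable[[Conversation], Optional[str]]  # returns an error message or None

def validate(conv: Conversation, checks: list[Validator]) -> list[str]:
    """Run every user-simulator error check and collect the failures."""
    return [err for check in checks if (err := check(conv)) is not None]

def run_scenario(simulate: Callable[[], Conversation],
                 score: Callable[[Conversation], dict],
                 checks: list[Validator],
                 max_attempts: int = 3) -> dict:
    """Regenerate until the simulator produces a valid dialogue, then score it.

    Only simulator errors (e.g. premature hang-up, duplicated requests,
    missing information) trigger regeneration; agent failures are exactly
    what the benchmark measures, so they flow through to scoring untouched.
    """
    errors: list[str] = []
    for _ in range(max_attempts):
        conv = simulate()
        errors = validate(conv, checks)
        if not errors:
            return score(conv)
    # Assumed behavior: after the retry budget, flag the run instead of scoring it.
    return {"status": "invalid_simulation", "errors": errors}
```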

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract and results section] The claims that the observed accuracy-experience trade-off and robustness gaps (e.g., mean drops of up to 0.314) are indicative of production voice-agent behavior rest on the unvalidated assumption that bot-to-bot dialogues plus automatic error detection faithfully reproduce human turn-taking, intent distributions, and downstream task success. No human A/B correlation studies, user-satisfaction ratings, or comparisons against real completion rates are reported, which is load-bearing for interpreting the numerical findings as generalizable rather than benchmark-specific.

    Authors: We agree that the absence of human-A/B correlation studies, user-satisfaction ratings, or direct comparisons to real-world completion rates means the generalizability of the reported accuracy-experience trade-off and robustness gaps to production voice agents rests on an assumption that remains unvalidated in the current manuscript. The EVA-Bench simulation is constructed to approximate human-like multi-turn interactions via bot-to-bot audio with automatic error detection and regeneration, but this does not substitute for empirical human validation. In the revised manuscript we will add an explicit Limitations subsection (in the Discussion) that acknowledges this gap, clarifies that the numerical findings are benchmark-specific, and outlines planned future work on human correlation studies. We will also insert a brief qualifying clause in the abstract and results section to avoid overclaiming generalizability while preserving the core empirical observations. revision: partial

Circularity Check

0 steps flagged

Empirical benchmark release with direct metric computation; no derivations or predictions reduce to their own inputs

full rationale

The paper defines EVA-A and EVA-X as composite metrics from explicit criteria (task completion, faithfulness, conciseness, timing) and applies them to bot-to-bot audio dialogues with automatic validation. All reported results (pass@1 thresholds, gaps of 0.44, perturbation drops) are computed directly from these definitions on generated data. No equations, fitted parameters, self-citations, or ansatzes are used to derive the central claims; the framework is self-contained against its own stated benchmarks and scenarios. This matches the default expectation of no significant circularity for an empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions that simulated conversations can stand in for real user behavior and that the chosen composite metrics capture the relevant quality dimensions; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption Bot-to-bot audio conversations with automatic validation can generate realistic multi-turn dialogues that reflect real-world voice agent usage
    This underpins the entire simulation side of EVA-Bench and is required for the benchmark scores to generalize.
  • domain assumption The composite definitions of EVA-A and EVA-X adequately measure task completion, faithfulness, speech fidelity, conversation progression, conciseness, and turn-taking
    These metrics are the core of the measurement side; their validity is assumed without reported human correlation data in the abstract.

pith-pipeline@v0.9.0 · 5667 in / 1513 out tokens · 71341 ms · 2026-05-14T17:26:52.036391+00:00 · methodology



Reference graph

Works this paper leans on

134 extracted references · 134 canonical work pages · 4 internal anchors
