pith. machine review for the scientific record.

arxiv: 2605.13841 · v1 · submitted 2026-05-13 · 💻 cs.SD · cs.AI · cs.CL · cs.LG


EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Anil Madamala, Fanny Riols, Gabrielle Gauthier Melançon, Hari Subramani, Hoang H. Nguyen, Joseph Marinier, Katrina Stankiewicz, Lindsay Devon Brin, Oluwanifemi Bamgbose, Raghav Mehndiratta, Sridhar Krishna Nemala, Srinivas Sunkara, Tara Bogavelli

Pith reviewed 2026-05-14 17:26 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.CL · cs.LG
keywords voice agents · evaluation benchmark · conversational AI · speech fidelity · robustness testing · multi-turn dialogue · accuracy metrics · experience metrics

The pith

None of the twelve evaluated voice agents exceeds 0.5 on both the accuracy and experience metrics simultaneously.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EVA-Bench to evaluate voice agents by running bot-to-bot audio conversations over multi-turn tasks and scoring them with two composite metrics. EVA-A combines task completion, faithfulness to instructions, and audio speech quality. EVA-X combines smooth conversation flow, spoken conciseness, and natural turn-taking timing. Testing twelve systems across three architectures and 213 enterprise scenarios shows that none clears 0.5 on both metrics at pass@1, that large gaps appear between best-case and consistent runs, and that accent or noise changes expose clear robustness shortfalls. Readers care because voice agents are already used in customer service and enterprise workflows, and a shared yardstick lets developers see exactly where each architecture falls short.
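A minimal sketch of how the two composites might be assembled, assuming equal-weight averaging of the named sub-scores (the paper's actual weighting, judge prompts, and score ranges are not given in this summary, so everything below is an illustrative assumption):

```python
from dataclasses import dataclass
from statistics import mean

@dataclass
class ConversationScores:
    # Hypothetical per-conversation sub-scores in [0, 1]; the real EVA-Bench
    # components, ranges, and weighting may differ.
    task_completion: float
    faithfulness: float
    speech_fidelity: float
    progression: float
    conciseness: float
    turn_taking: float

def eva_a(s: ConversationScores) -> float:
    """Accuracy composite: assumed here to be an unweighted mean of its parts."""
    return mean([s.task_completion, s.faithfulness, s.speech_fidelity])

def eva_x(s: ConversationScores) -> float:
    """Experience composite: assumed here to be an unweighted mean of its parts."""
    return mean([s.progression, s.conciseness, s.turn_taking])

# Example: a run that completes the task but rambles and mistimes its turns.
run = ConversationScores(0.9, 0.8, 0.7, 0.6, 0.3, 0.5)
print(f"EVA-A={eva_a(run):.2f}  EVA-X={eva_x(run):.2f}")  # 0.80 vs 0.47
```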

Core claim

EVA-Bench generates realistic multi-turn dialogues through bot-to-bot audio interaction with automatic error detection and regeneration, then scores agents on EVA-A for accuracy and fidelity and EVA-X for experience and timing. Across 213 scenarios and controlled accent and noise perturbations, no system exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; the median pass@k minus pass^k gap reaches 0.44 on EVA-A; and perturbations produce mean drops of up to 0.314 that differ by architecture and metric.

What carries the argument

The EVA-Bench end-to-end framework, which runs validated bot-to-bot audio dialogues and applies the paired EVA-A and EVA-X composite metrics together with pass@1, pass@k, and pass^k statistics.
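The pass statistics do the work of separating peak from reliable capability. One plausible reading, assuming each scenario yields a boolean pass per run and simple averaging across scenarios (the paper's exact definitions and estimator are not reproduced here), is sketched below: pass@k credits a system if any of k attempts succeeds, pass^k only if all of them do, and their difference is the peak-versus-consistency gap the 0.44 median refers to.

```python
def pass_stats(runs_by_scenario: dict[str, list[bool]], k: int) -> dict[str, float]:
    """Hypothetical pass@1 / pass@k / pass^k over repeated runs per scenario.

    runs_by_scenario maps a scenario id to at least k boolean outcomes
    ("did this run pass?"). The paper's exact estimator may differ, e.g.
    an unbiased combinatorial estimate rather than raw empirical rates.
    """
    def frac(flags: list[bool]) -> float:
        return sum(flags) / len(flags)

    scenarios = list(runs_by_scenario.values())
    pass_at_1 = frac([runs[0] for runs in scenarios])        # single attempt
    pass_at_k = frac([any(runs[:k]) for runs in scenarios])  # best of k (peak)
    pass_pow_k = frac([all(runs[:k]) for runs in scenarios]) # all k (reliable)
    return {"pass@1": pass_at_1, "pass@k": pass_at_k,
            "pass^k": pass_pow_k, "gap": pass_at_k - pass_pow_k}

# Toy data: one rock-solid scenario, one flaky one.
runs = {"book_flight": [True, True, True, True],
        "it_helpdesk": [False, True, False, True]}
print(pass_stats(runs, k=4))
# {'pass@1': 0.5, 'pass@k': 1.0, 'pass^k': 0.5, 'gap': 0.5}
```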

If this is right

  • Different voice-agent architectures can be ranked on identical accuracy and experience scales for the first time.
  • Reliability engineering must close the 0.44 median gap between peak and consistent performance on accuracy tasks.
  • Accent and noise robustness must be treated as first-class requirements that vary by architecture.
  • The 213-scenario suite across three enterprise domains supplies a reusable test bed for targeted fixes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures may face an inherent trade-off that future work could resolve by combining strengths of the three current families.
  • Adding more open-ended or multi-party scenarios would test whether the current metrics still separate systems cleanly.
  • If EVA-X turns out to drive user retention more than EVA-A, teams might deliberately accept lower accuracy for better flow.
  • The automatic validation step could be reused as a training signal to reduce simulator errors in other dialogue systems.

Load-bearing premise

That bot-to-bot simulated conversations with automatic validation match the distribution of real human voice interactions, and that the EVA-A and EVA-X scores track downstream user satisfaction or task success.

What would settle it

A head-to-head study that runs the same twelve agents with real human users, records satisfaction and task-success rates, and checks whether the ordering or absolute levels match the EVA-Bench rankings.
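A minimal sketch of how such a head-to-head comparison could be scored, assuming per-system human outcomes are collected and a rank correlation is the test statistic (the choice of Spearman's rho and every number below are illustrative assumptions, not results from the paper):

```python
from scipy.stats import spearmanr

# Hypothetical per-system values: EVA-Bench score vs. real-user outcome.
eva_a_pass1   = [0.48, 0.41, 0.37, 0.52, 0.29]   # benchmark ranking (illustrative)
human_success = [0.71, 0.65, 0.58, 0.69, 0.44]   # human-study task success (illustrative)

rho, p_value = spearmanr(eva_a_pass1, human_success)
print(f"Spearman rho={rho:.2f}, p={p_value:.3f}")
# High rho: EVA-Bench rankings track real-user outcomes.
# Low rho: the benchmark-specificity concern stands.
```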

read the original abstract

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces EVA-Bench, an end-to-end framework for evaluating voice agents via bot-to-bot audio conversations with automatic simulation validation and regeneration. It defines two composite metrics—EVA-A (capturing task completion, faithfulness, and speech fidelity) and EVA-X (capturing conversation progression, conciseness, and turn-taking timing)—and applies them to 213 scenarios across three enterprise domains. The work evaluates 12 systems spanning three architectures under pass@1/pass@k/pass^k protocols and a controlled accent/noise perturbation suite, reporting that no system exceeds 0.5 on both EVA-A and EVA-X pass@1, a median 0.44 gap between peak and reliable performance on EVA-A, and architecture-varying robustness drops up to 0.314.

Significance. If the simulation and metrics prove representative, EVA-Bench fills a gap by enabling direct cross-architecture comparison of voice-specific failure modes and by releasing the full framework, evaluation suite, and data under open license. The empirical distinctions between peak/reliable capability and the quantified robustness gaps under perturbation provide concrete, falsifiable baselines that future systems can target.

major comments (1)
  1. [Abstract and results section] The claims that the observed accuracy-experience trade-off and robustness gaps (e.g., mean drops of up to 0.314) are indicative of production voice-agent behavior rest on the unvalidated assumption that bot-to-bot dialogues plus automatic error detection faithfully reproduce human turn-taking, intent distributions, and downstream task success. No human A/B correlation studies, user-satisfaction ratings, or comparisons against real completion rates are reported, which is load-bearing for interpreting the numerical findings as generalizable rather than benchmark-specific.
minor comments (2)
  1. [Framework description] The description of how automatic validation detects simulator errors and triggers regeneration could be expanded with a concrete example or pseudocode to improve reproducibility (see the sketch after these comments).
  2. [Results tables/figures] Table or figure captions for the 12-system results should explicitly note the number of runs per condition to clarify the statistical basis of the reported medians and means.
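Taking up the referee's first minor comment, here is a minimal, hedged sketch of what such a validation-and-regeneration loop could look like. The example checks echo simulator failure modes the framework is described as catching, but the control flow, function names, and retry budget are assumptions, not the authors' implementation:

```python
from typing import Callable, Optional

Conversation = dict  # hypothetical: turns, tool calls, final database state
Validator = Callable[[Conversation], Optional[str]]  # returns an error message or None

def validate(conv: Conversation, checks: list[Validator]) -> list[str]:
    """Run every user-simulator error check and collect the failures."""
    return [err for check in checks if (err := check(conv)) is not None]

def run_scenario(simulate: Callable[[], Conversation],
                 score: Callable[[Conversation], dict],
                 checks: list[Validator],
                 max_attempts: int = 3) -> dict:
    """Regenerate until the simulator produces a valid dialogue, then score it.

    Only simulator errors (e.g. premature hang-up, duplicated requests,
    missing information) trigger regeneration; agent failures are exactly
    what the benchmark measures, so they flow through to scoring untouched.
    """
    errors: list[str] = []
    for _ in range(max_attempts):
        conv = simulate()
        errors = validate(conv, checks)
        if not errors:
            return score(conv)
    # Assumed behavior: after the retry budget, flag the run instead of scoring it.
    return {"status": "invalid_simulation", "errors": errors}
```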

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address the major comment below.

read point-by-point responses
  1. Referee: [Abstract and results section] The claims that the observed accuracy-experience trade-off and robustness gaps (e.g., mean drops of up to 0.314) are indicative of production voice-agent behavior rest on the unvalidated assumption that bot-to-bot dialogues plus automatic error detection faithfully reproduce human turn-taking, intent distributions, and downstream task success. No human A/B correlation studies, user-satisfaction ratings, or comparisons against real completion rates are reported, which is load-bearing for interpreting the numerical findings as generalizable rather than benchmark-specific.

    Authors: We agree that the absence of human-A/B correlation studies, user-satisfaction ratings, or direct comparisons to real-world completion rates means the generalizability of the reported accuracy-experience trade-off and robustness gaps to production voice agents rests on an assumption that remains unvalidated in the current manuscript. The EVA-Bench simulation is constructed to approximate human-like multi-turn interactions via bot-to-bot audio with automatic error detection and regeneration, but this does not substitute for empirical human validation. In the revised manuscript we will add an explicit Limitations subsection (in the Discussion) that acknowledges this gap, clarifies that the numerical findings are benchmark-specific, and outlines planned future work on human correlation studies. We will also insert a brief qualifying clause in the abstract and results section to avoid overclaiming generalizability while preserving the core empirical observations. revision: partial

Circularity Check

0 steps flagged

Empirical benchmark release with direct metric computation; no derivations or predictions reduce to their own inputs

full rationale

The paper defines EVA-A and EVA-X as composite metrics from explicit criteria (task completion, faithfulness, conciseness, timing) and applies them to bot-to-bot audio dialogues with automatic validation. All reported results (pass@1 thresholds, gaps of 0.44, perturbation drops) are computed directly from these definitions on generated data. No equations, fitted parameters, self-citations, or ansatzes are used to derive the central claims; the framework is self-contained against its own stated benchmarks and scenarios. This matches the default expectation of no significant circularity for an empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions that simulated conversations can stand in for real user behavior and that the chosen composite metrics capture the relevant quality dimensions; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption Bot-to-bot audio conversations with automatic validation can generate realistic multi-turn dialogues that reflect real-world voice agent usage
    This underpins the entire simulation side of EVA-Bench and is required for the benchmark scores to generalize.
  • domain assumption The composite definitions of EVA-A and EVA-X adequately measure task completion, faithfulness, speech fidelity, conversation progression, conciseness, and turn-taking
    These metrics are the core of the measurement side; their validity is assumed without reported human correlation data in the abstract.

pith-pipeline@v0.9.0 · 5667 in / 1513 out tokens · 71341 ms · 2026-05-14T17:26:52.036391+00:00 · methodology



Reference graph

Works this paper leans on

134 extracted references · 134 canonical work pages · 4 internal anchors
