EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
Pith reviewed 2026-05-14 17:26 UTC · model grok-4.3
The pith
No voice agent exceeds 0.5 on both accuracy and experience metrics simultaneously.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
EVA-Bench generates realistic multi-turn dialogues through bot-to-bot audio interaction with automatic error detection and regeneration, then scores agents on EVA-A for accuracy and fidelity plus EVA-X for experience and timing. Across 213 scenarios and controlled accent-noise perturbations, no system exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; the median pass@k minus pass^k gap reaches 0.44 on EVA-A; and perturbations produce mean drops up to 0.314 that differ by architecture and metric.
What carries the argument
The EVA-Bench end-to-end framework itself: it runs validated bot-to-bot audio dialogues and scores them with the paired EVA-A and EVA-X composite metrics, together with pass@1, pass@k, and pass^k statistics.
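Under the usual combinatorial definitions (an assumption here — the paper may parameterize these differently), pass@k and pass^k for a scenario with n recorded runs and c passes can be estimated as below; the difference between them is the peak-versus-reliable divergence the review highlights:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k sampled runs passes,
    given c passing runs out of n total (unbiased estimator)."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def pass_hat_k(n: int, c: int, k: int) -> float:
    """Probability that all k sampled runs pass (reliability)."""
    if c < k:
        return 0.0
    return comb(c, k) / comb(n, k)

# Example: 6 of 10 runs pass a scenario, k = 4.
n, c, k = 10, 6, 4
peak = pass_at_k(n, c, k)       # ~0.995: some attempt almost always succeeds
reliable = pass_hat_k(n, c, k)  # ~0.071: all four attempts rarely succeed
gap = peak - reliable           # peak-vs-reliable divergence
```

A large gap, as in this toy case, means a system that can solve a scenario is far from one that solves it dependably.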
If this is right
- Different voice-agent architectures can be ranked on identical accuracy and experience scales for the first time.
- Reliability engineering must close the 0.44 median gap between peak and consistent performance on accuracy tasks.
- Accent and noise robustness must be treated as first-class requirements that vary by architecture.
- The 213-scenario suite across three enterprise domains supplies a reusable test bed for targeted fixes.
Where Pith is reading between the lines
- Architectures may face an inherent trade-off that future work could resolve by combining strengths of the three current families.
- Adding more open-ended or multi-party scenarios would test whether the current metrics still separate systems cleanly.
- If EVA-X turns out to drive user retention more than EVA-A, teams might deliberately accept lower accuracy for better flow.
- The automatic validation step could be reused as a training signal to reduce simulator errors in other dialogue systems.
Load-bearing premise
Bot-to-bot simulated conversations with automatic validation match the distribution of real human voice interactions, and the EVA-A and EVA-X scores track downstream user satisfaction or task success.
What would settle it
A head-to-head study that runs the same twelve agents with real human users, records satisfaction and task-success rates, and checks whether the ordering or absolute levels match the EVA-Bench rankings.
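If such a head-to-head study were run, the ordering check reduces to a rank correlation between benchmark scores and human-study outcomes. A minimal self-contained sketch of Spearman's rho (every score below is invented for illustration; a real study would cover all twelve agents):

```python
def ranks(scores: list[float]) -> list[float]:
    """Rank scores descending (rank 1 = best), averaging ranks for ties."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    r = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank over the tied block
        for idx in order[i:j + 1]:
            r[idx] = avg
        i = j + 1
    return r

def spearman(x: list[float], y: list[float]) -> float:
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = ranks(x), ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx) ** 0.5
    vy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (vx * vy)

bench = [0.48, 0.41, 0.39, 0.35, 0.30]  # hypothetical EVA-A pass@1 scores
human = [0.82, 0.75, 0.77, 0.60, 0.55]  # hypothetical human task-success rates
rho = spearman(bench, human)  # near 1.0 would mean the orderings agree
```

A high rho would support treating EVA-Bench rankings as a proxy for human outcomes; matching absolute levels is a separate, stricter check.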
Original abstract
Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to different agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EVA-Bench, an end-to-end framework for evaluating voice agents via bot-to-bot audio conversations with automatic simulation validation and regeneration. It defines two composite metrics—EVA-A (capturing task completion, faithfulness, and speech fidelity) and EVA-X (capturing conversation progression, conciseness, and turn-taking timing)—and applies them to 213 scenarios across three enterprise domains. The work evaluates 12 systems spanning three architectures under pass@1/pass@k/pass^k protocols and a controlled accent/noise perturbation suite, reporting that no system exceeds 0.5 on both EVA-A and EVA-X pass@1, a median 0.44 gap between peak and reliable performance on EVA-A, and architecture-varying robustness drops up to 0.314.
Significance. If the simulation and metrics prove representative, EVA-Bench fills a gap by enabling direct cross-architecture comparison of voice-specific failure modes and by releasing the full framework, evaluation suite, and data under open license. The empirical distinctions between peak/reliable capability and the quantified robustness gaps under perturbation provide concrete, falsifiable baselines that future systems can target.
Major comments (1)
- [Abstract and results section] The claims that the observed accuracy-experience trade-off and robustness gaps (e.g., mean drops up to 0.314) are indicative of production voice-agent behavior rest on the unvalidated assumption that bot-to-bot dialogues plus automatic error detection faithfully reproduce human turn-taking, intent distributions, and downstream task success. No human-A/B correlation studies, user-satisfaction ratings, or comparisons against real completion rates are reported, which is load-bearing for interpreting the numerical findings as generalizable rather than benchmark-specific.
Minor comments (2)
- [Framework description] The description of how automatic validation detects simulator errors and triggers regeneration could be expanded with a concrete example or pseudocode to improve reproducibility.
- [Results tables/figures] Table or figure captions for the 12-system results should explicitly note the number of runs per condition to clarify the statistical basis of the reported medians and means.
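As the first minor comment suggests, the detect-and-regenerate step is easy to pin down in pseudocode. A hypothetical sketch — the function names, validator protocol, and retry budget are illustrative, not EVA-Bench's actual API:

```python
from typing import Callable, Optional

Dialogue = list[dict]  # one entry per turn, e.g. {"role": ..., "text": ...}
# A validator inspects a transcript and returns an error label, or None if clean.
Validator = Callable[[Dialogue], Optional[str]]

def generate_valid_dialogue(simulate: Callable[[], Dialogue],
                            validators: list[Validator],
                            max_retries: int = 3) -> Dialogue:
    """Run the bot-to-bot simulator, validate the transcript, and
    regenerate the whole conversation if any simulator error is found."""
    for attempt in range(max_retries + 1):
        dialogue = simulate()
        errors = [label for v in validators if (label := v(dialogue))]
        if not errors:
            return dialogue  # clean transcript -> forward to scoring
        # Simulator (not agent) failure: discard the transcript and regenerate.
    raise RuntimeError(
        f"simulator failed validation after {max_retries + 1} attempts: {errors}")
```

The validators would encode the simulator failure modes the paper describes, such as premature hang-ups, missing required information, or duplicated requests.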
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation for minor revision. We address the major comment below.
Point-by-point responses
Referee: [Abstract and results section] The claims that the observed accuracy-experience trade-off and robustness gaps (e.g., mean drops up to 0.314) are indicative of production voice-agent behavior rest on the unvalidated assumption that bot-to-bot dialogues plus automatic error detection faithfully reproduce human turn-taking, intent distributions, and downstream task success. No human-A/B correlation studies, user-satisfaction ratings, or comparisons against real completion rates are reported, which is load-bearing for interpreting the numerical findings as generalizable rather than benchmark-specific.
Authors: We agree that the absence of human-A/B correlation studies, user-satisfaction ratings, or direct comparisons to real-world completion rates means the generalizability of the reported accuracy-experience trade-off and robustness gaps to production voice agents rests on an assumption that remains unvalidated in the current manuscript. The EVA-Bench simulation is constructed to approximate human-like multi-turn interactions via bot-to-bot audio with automatic error detection and regeneration, but this does not substitute for empirical human validation. In the revised manuscript we will add an explicit Limitations subsection (in the Discussion) that acknowledges this gap, clarifies that the numerical findings are benchmark-specific, and outlines planned future work on human correlation studies. We will also insert a brief qualifying clause in the abstract and results section to avoid overclaiming generalizability while preserving the core empirical observations.
Revision: partial
Circularity Check
Empirical benchmark release with direct metric computation; no derivations or predictions reduce to their own inputs.
Full rationale
The paper defines EVA-A and EVA-X as composite metrics from explicit criteria (task completion, faithfulness, conciseness, timing) and applies them to bot-to-bot audio dialogues with automatic validation. All reported results (pass@1 thresholds, gaps of 0.44, perturbation drops) are computed directly from these definitions on generated data. No equations, fitted parameters, self-citations, or ansatzes are used to derive the central claims; the framework is self-contained against its own stated benchmarks and scenarios. This matches the default expectation of no significant circularity for an empirical benchmark paper.
Axiom & Free-Parameter Ledger
Axioms (2)
- Domain assumption: bot-to-bot audio conversations with automatic validation can generate realistic multi-turn dialogues that reflect real-world voice-agent usage.
- Domain assumption: the composite definitions of EVA-A and EVA-X adequately measure task completion, faithfulness, speech fidelity, conversation progression, conciseness, and turn-taking.