EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Anil Madamala; Fanny Riols; Gabrielle Gauthier Melançon; Hari Subramani; Hoang H. Nguyen; Joseph Marinier; Katrina Stankiewicz; Lindsay Devon Brin; Oluwanifemi Bamgbose; Raghav Mehndiratta

arxiv: 2605.13841 · v2 · pith:GXZWOMWXnew · submitted 2026-05-13 · cs.SD · cs.AI · cs.CL · cs.LG


Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara


Pith reviewed 2026-05-14 17:26 UTC · model grok-4.3

classification: cs.SD, cs.AI, cs.CL, cs.LG

keywords: voice agents, evaluation benchmark, conversational AI, speech fidelity, robustness testing, multi-turn dialogue, accuracy metrics, experience metrics


The pith

No voice agent exceeds 0.5 on both accuracy and experience metrics simultaneously.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EVA-Bench to evaluate voice agents by running bot-to-bot audio conversations over multi-turn tasks and scoring them with two composite metrics. EVA-A combines task completion, faithfulness to instructions, and audio speech quality. EVA-X combines smooth conversation flow, spoken conciseness, and natural turn-taking timing. Testing twelve systems across three architectures and 213 enterprise scenarios shows three things: no system clears 0.5 on both metrics at pass@1, large gaps separate best-case from consistent runs, and accent or noise changes expose clear robustness shortfalls. Readers care because voice agents are already used in customer service and enterprise workflows, and a shared yardstick lets developers see where each architecture falls short.

Core claim

EVA-Bench generates realistic multi-turn dialogues through bot-to-bot audio interaction with automatic error detection and regeneration, then scores agents on EVA-A for accuracy and fidelity plus EVA-X for experience and timing. Across 213 scenarios and controlled accent-noise perturbations, no system exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; the median pass@k minus pass^k gap reaches 0.44 on EVA-A; and perturbations produce mean drops up to 0.314 that differ by architecture and metric.

What carries the argument

The EVA-Bench end-to-end framework, which runs validated bot-to-bot audio dialogues and scores them with the paired EVA-A and EVA-X composite metrics, reported as pass@1, pass@k, and pass^k statistics.
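
To make the peak-versus-reliable distinction concrete, here is a minimal sketch of how pass@k and pass^k are commonly estimated from repeated trials. It assumes each trial yields a binary pass/fail outcome and uses the standard combinatorial estimators; the paper's exact definitions and thresholds are not given on this page, so the function names and the sample counts below are illustrative assumptions, not the authors' implementation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k trials passes, when k trials are
    drawn without replacement from n recorded trials with c passes."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Chance that all k drawn trials pass (the 'reliable' statistic)."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return comb(c, k) / comb(n, k)


def peak_reliable_gap(n: int, c: int, k: int) -> float:
    """Difference between peak (pass@k) and reliable (pass^k) capability."""
    return pass_at_k(n, c, k) - pass_hat_k(n, c, k)


if __name__ == "__main__":
    # Hypothetical scenario: 5 recorded trials, 3 of them passed, k = 3.
    n, c, k = 5, 3, 3
    print("pass@k :", pass_at_k(n, c, k))
    print("pass^k :", pass_hat_k(n, c, k))
    print("gap    :", peak_reliable_gap(n, c, k))
```

With k = 1 both estimators reduce to the plain success rate c / n, which is why pass@1 is a single number while the gap only opens up for k > 1.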

If this is right

Different voice-agent architectures can be ranked on identical accuracy and experience scales for the first time.
Reliability engineering must close the 0.44 median gap between peak and consistent performance on accuracy tasks.
Accent and noise robustness must be treated as first-class requirements that vary by architecture.
The 213-scenario suite across three enterprise domains supplies a reusable test bed for targeted fixes.
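
The robustness point above can be illustrated with a small sketch that computes a mean score drop per group under perturbation. The group labels, the row layout, and the numbers are hypothetical placeholders; the page does not specify how the paper aggregates its mean Δ values, so this only shows one plausible reading (clean score minus perturbed score, averaged within each group).

```python
from collections import defaultdict
from statistics import mean
from typing import Dict, Iterable, Tuple


def mean_drop_by_group(
    rows: Iterable[Tuple[str, float, float]]
) -> Dict[str, float]:
    """rows holds (group, clean_score, perturbed_score) triples.
    Returns the mean of (clean - perturbed) for each group."""
    drops = defaultdict(list)
    for group, clean, perturbed in rows:
        drops[group].append(clean - perturbed)
    return {group: mean(values) for group, values in drops.items()}


if __name__ == "__main__":
    # Hypothetical scores; "arch_A" and "arch_B" are placeholder labels.
    rows = [
        ("arch_A", 0.50, 0.25),
        ("arch_A", 0.75, 0.50),
        ("arch_B", 0.50, 0.50),
    ]
    print(mean_drop_by_group(rows))
```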

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Architectures may face an inherent trade-off that future work could resolve by combining strengths of the three current families.
Adding more open-ended or multi-party scenarios would test whether the current metrics still separate systems cleanly.
If EVA-X turns out to drive user retention more than EVA-A, teams might deliberately accept lower accuracy for better flow.
The automatic validation step could be reused as a training signal to reduce simulator errors in other dialogue systems.

Load-bearing premise

Bot-to-bot simulated conversations with automatic validation match the distribution of real human voice interactions and the EVA-A and EVA-X scores track downstream user satisfaction or task success.

What would settle it

A head-to-head study that runs the same twelve agents with real human users, records satisfaction and task-success rates, and checks whether the ordering or absolute levels match the EVA-Bench rankings.
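
One way such a study could check whether the orderings match is a rank-correlation statistic. The sketch below computes Kendall's tau-a between benchmark scores and human-study scores for the same agents. The choice of Kendall's tau and all numbers are assumptions made for illustration; the page does not say which agreement measure a validation study would use.

```python
from itertools import combinations
from typing import Sequence


def kendall_tau_a(x: Sequence[float], y: Sequence[float]) -> float:
    """Kendall's tau-a: (concordant - discordant) / total pairs.
    Tied pairs count as neither concordant nor discordant."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length sequences with >= 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / total_pairs


if __name__ == "__main__":
    # Hypothetical scores for four agents, in the same agent order.
    benchmark_scores = [0.42, 0.31, 0.55, 0.18]
    human_satisfaction = [3.9, 3.1, 4.0, 2.5]
    print(kendall_tau_a(benchmark_scores, human_satisfaction))
```

A value near 1 would mean the benchmark ranks agents the way human users do; a value near 0 would suggest the rankings are benchmark-specific.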

Original abstract

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to all major agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k--pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean $\Delta$ up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

EVA-Bench gives a usable open framework for voice-agent eval with bot-to-bot simulation and two new composite scores, but the reported accuracy-experience trade-offs rest on unvalidated metrics.


The main thing here is a practical benchmark release that wires together dynamic bot-to-bot audio dialogues, automatic error detection for regeneration, and two composite metrics (EVA-A for task/faithfulness/audio fidelity, EVA-X for progression/conciseness/timing). That combination is new enough to be worth looking at, and the paper ships the full suite plus data under an open license, which is the right move for this kind of work. They run it on 12 systems across architectures, show that nothing clears 0.5 on both pass@1 scores, document a 0.44 median gap between peak and reliable performance, and quantify how accent and noise perturbations hit different systems differently. Those are concrete observations that people building enterprise voice agents can use to compare approaches today.

The soft spot is exactly what the stress-test flags: no human correlation data or A/B checks against real user satisfaction or completion rates. The bot-to-bot setup plus automatic validation is clever for scale, but without evidence that the simulated turn-taking and intent distributions match production traffic, the architecture-varying robustness numbers and the accuracy-experience split could be artifacts of the simulator rather than stable signals. The perturbation results are still useful as a controlled stress test, but they inherit the same limitation.

This is the kind of paper a reading group should see if the group works on spoken dialogue systems or evaluation methodology; it gives people something concrete to try and extend. It is not a load-bearing theoretical claim, so the missing validation is a clear but fixable gap rather than a fatal one. A serious editor should send it to referees who can push on the human-alignment question and check the metric definitions against existing voice eval literature. It is worth the review time.

Referee Report

1 major / 2 minor

Summary. The paper introduces EVA-Bench, an end-to-end framework for evaluating voice agents via bot-to-bot audio conversations with automatic simulation validation and regeneration. It defines two composite metrics—EVA-A (capturing task completion, faithfulness, and speech fidelity) and EVA-X (capturing conversation progression, conciseness, and turn-taking timing)—and applies them to 213 scenarios across three enterprise domains. The work evaluates 12 systems spanning three architectures under pass@1/pass@k/pass^k protocols and a controlled accent/noise perturbation suite, reporting that no system exceeds 0.5 on both EVA-A and EVA-X pass@1, a median 0.44 gap between peak and reliable performance on EVA-A, and architecture-varying robustness drops up to 0.314.

Significance. If the simulation and metrics prove representative, EVA-Bench fills a gap by enabling direct cross-architecture comparison of voice-specific failure modes and by releasing the full framework, evaluation suite, and data under an open license. The empirical distinctions between peak and reliable capability, together with the quantified robustness gaps under perturbation, provide concrete, falsifiable baselines that future systems can target.

major comments (1)

[Abstract and results section] The claims that the observed accuracy-experience trade-off and robustness gaps (e.g., mean drops up to 0.314) are indicative of production voice-agent behavior rest on the unvalidated assumption that bot-to-bot dialogues plus automatic error detection faithfully reproduce human turn-taking, intent distributions, and downstream task success. No human-A/B correlation studies, user-satisfaction ratings, or comparisons against real completion rates are reported. That evidence is load-bearing for interpreting the numerical findings as generalizable rather than benchmark-specific.

minor comments (2)

[Framework description] The description of how automatic validation detects simulator errors and triggers regeneration could be expanded with a concrete example or pseudocode to improve reproducibility.
[Results tables/figures] Table or figure captions for the 12-system results should explicitly note the number of runs per condition to clarify the statistical basis of the reported medians and means.
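
As an illustration of the kind of pseudocode the first minor comment asks for, the following is a hypothetical sketch of a validate-and-regenerate loop. The page does not describe how EVA-Bench detects simulator errors or how many retries it allows, so `run_conversation`, `is_valid`, and `max_attempts` are invented names, and the control flow is an assumption rather than the authors' method.

```python
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


def simulate_with_validation(
    run_conversation: Callable[[int], T],
    is_valid: Callable[[T], bool],
    max_attempts: int = 3,
) -> Optional[T]:
    """Run a simulated conversation, regenerating it when the validator
    reports a user-simulator error. Returns the first valid conversation,
    or None if every attempt fails validation (so nothing is scored)."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        conversation = run_conversation(attempt)
        if is_valid(conversation):
            return conversation
    return None


if __name__ == "__main__":
    # Stub simulator: the first attempt goes off-script, the second is fine.
    def fake_run(attempt: int) -> dict:
        return {"attempt": attempt, "simulator_error": attempt == 0}

    def fake_validator(conversation: dict) -> bool:
        return not conversation["simulator_error"]

    print(simulate_with_validation(fake_run, fake_validator))
```

Returning None instead of a flawed conversation keeps simulator errors from being scored as agent failures, which is the stated purpose of validating before scoring.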

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address the major comment below.


Referee: [Abstract and results section] The claims that the observed accuracy-experience trade-off and robustness gaps (e.g., mean drops up to 0.314) are indicative of production voice-agent behavior rest on the unvalidated assumption that bot-to-bot dialogues plus automatic error detection faithfully reproduce human turn-taking, intent distributions, and downstream task success. No human-A/B correlation studies, user-satisfaction ratings, or comparison against real completion rates are reported, which is load-bearing for interpreting the numerical findings as generalizable rather than benchmark-specific.

Authors: We agree that the absence of human-A/B correlation studies, user-satisfaction ratings, or direct comparisons to real-world completion rates means the generalizability of the reported accuracy-experience trade-off and robustness gaps to production voice agents rests on an assumption that remains unvalidated in the current manuscript. The EVA-Bench simulation is constructed to approximate human-like multi-turn interactions via bot-to-bot audio with automatic error detection and regeneration, but this does not substitute for empirical human validation. In the revised manuscript we will add an explicit Limitations subsection (in the Discussion) that acknowledges this gap, clarifies that the numerical findings are benchmark-specific, and outlines planned future work on human correlation studies. We will also insert a brief qualifying clause in the abstract and results section to avoid overclaiming generalizability while preserving the core empirical observations.

revision: partial

Circularity Check

0 steps flagged

Empirical benchmark release with direct metric computation; no derivations or predictions reduce to inputs

full rationale

The paper defines EVA-A and EVA-X as composite metrics from explicit criteria (task completion, faithfulness, conciseness, timing) and applies them to bot-to-bot audio dialogues with automatic validation. All reported results (pass@1 thresholds, gaps of 0.44, perturbation drops) are computed directly from these definitions on generated data. No equations, fitted parameters, self-citations, or ansatzes are used to derive the central claims; the framework is self-contained against its own stated benchmarks and scenarios. This matches the default expectation of no significant circularity for an empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions that simulated conversations can stand in for real user behavior and that the chosen composite metrics capture the relevant quality dimensions; no free parameters or invented entities are described in the abstract.

axioms (2)

domain assumption: Bot-to-bot audio conversations with automatic validation can generate realistic multi-turn dialogues that reflect real-world voice agent usage.
This underpins the entire simulation side of EVA-Bench and is required for the benchmark scores to generalize.
domain assumption: The composite definitions of EVA-A and EVA-X adequately measure task completion, faithfulness, speech fidelity, conversation progression, conciseness, and turn-taking.
These metrics are the core of the measurement side; their validity is assumed without reported human correlation data in the abstract.

pith-pipeline@v0.9.0 · 5667 in / 1513 out tokens · 71341 ms · 2026-05-14T17:26:52.036391+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation · washburn_uniqueness_aczel · unclear
Paper passage: "EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing."

IndisputableMonolith/Foundation/RealityFromDistinction · reality_from_one_distinction · unclear
Paper passage: "Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A)"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.