
arXiv: 2605.13841 · v2 · pith:GXZWOMWXnew · submitted 2026-05-13 · 💻 cs.SD · cs.AI · cs.CL · cs.LG

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

Pith reviewed 2026-05-14 17:26 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · cs.CL · cs.LG
keywords voice agents, evaluation benchmark, conversational AI, speech fidelity, robustness testing, multi-turn dialogue, accuracy metrics, experience metrics

The pith

None of the twelve voice agents tested exceeds 0.5 on both the accuracy and experience metrics at pass@1.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EVA-Bench to evaluate voice agents by running bot-to-bot audio conversations over multi-turn tasks and scoring them with two composite metrics. EVA-A combines task completion, faithfulness to instructions, and audio speech quality. EVA-X combines smooth conversation flow, spoken conciseness, and natural turn-taking timing. Testing twelve systems across three architectures and 213 enterprise scenarios shows none clear 0.5 on both metrics at pass@1, large gaps appear between best-case and consistent runs, and accent or noise changes expose clear robustness shortfalls. Readers care because voice agents are already used in customer service and enterprise workflows, and a shared yardstick lets developers see exactly where each architecture falls short.

Core claim

EVA-Bench generates realistic multi-turn dialogues through bot-to-bot audio interaction with automatic error detection and regeneration, then scores agents on EVA-A for accuracy and fidelity plus EVA-X for experience and timing. Across 213 scenarios and controlled accent-noise perturbations, no system exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; the median pass@k minus pass^k gap reaches 0.44 on EVA-A; and perturbations produce mean drops up to 0.314 that differ by architecture and metric.
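The validate-then-regenerate step described above can be sketched as a bounded retry loop. This is a minimal illustration, not the paper's implementation: the function name `simulate_until_valid`, the callbacks `simulate` and `has_simulator_error`, the `max_attempts` limit, and the toy stubs are all hypothetical, and the source does not say how EVA-Bench detects simulator errors or how many regenerations it allows.

```python
from typing import Callable, List, Optional, Tuple

# A conversation is modelled here as a plain list of transcript turns (assumption).
Conversation = List[str]


def simulate_until_valid(
    simulate: Callable[[int], Conversation],
    has_simulator_error: Callable[[Conversation], bool],
    max_attempts: int = 3,
) -> Tuple[Optional[Conversation], int]:
    """Regenerate a conversation until the validator accepts it.

    Returns (conversation, attempts_used); the conversation is None when
    every attempt was rejected, so the caller can skip scoring it.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        conversation = simulate(attempt)
        # Only conversations free of user-simulator errors reach scoring.
        if not has_simulator_error(conversation):
            return conversation, attempt
    return None, max_attempts


# Toy stand-ins (hypothetical): the first attempt goes off-script.
def toy_simulate(attempt: int) -> Conversation:
    if attempt == 1:
        return ["user: <off-script>"]
    return ["user: I need to change my booking", "agent: Sure, which date?"]


def toy_has_error(conversation: Conversation) -> bool:
    return any("<off-script>" in turn for turn in conversation)


conversation, attempts = simulate_until_valid(toy_simulate, toy_has_error)
print(attempts, conversation)
```

Returning `None` after the last rejected attempt keeps failed simulations out of the scores, which fits the abstract's statement that conversations are regenerated before scoring; what actually happens when regeneration keeps failing is not stated in the source.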

What carries the argument

EVA-Bench end-to-end framework that runs validated bot-to-bot audio dialogues and applies the paired EVA-A and EVA-X composite metrics together with pass@1, pass@k, and pass^k statistics.
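To make the peak-versus-reliable distinction concrete, the sketch below computes pass@k and pass^k for a single scenario from n recorded trials. It assumes the commonly used definitions (pass@k: at least one of k sampled trials succeeds; pass^k: all k sampled trials succeed) and the standard combinatorial estimators; this page does not give the paper's exact formulas, so the function names and the example counts are illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k trials succeeds) from c successes in n trials."""
    _check(n, c, k)
    # comb(n - c, k) counts the k-subsets that contain no successful trial.
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n: int, c: int, k: int) -> float:
    """Estimate P(all k trials succeed) from c successes in n trials."""
    _check(n, c, k)
    # comb(c, k) counts the k-subsets made up only of successful trials.
    return comb(c, k) / comb(n, k)


def _check(n: int, c: int, k: int) -> None:
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")


# Illustrative numbers, not taken from the paper: 3 successes in 5 trials.
n, c, k = 5, 3, 3
gap = pass_at_k(n, c, k) - pass_hat_k(n, c, k)
print(pass_at_k(n, c, k), pass_hat_k(n, c, k), gap)
```

At k = 1 the two estimators coincide (both equal c / n), so any gap between them only appears for k > 1; a large gap means a system can succeed on a scenario but does not do so consistently.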

If this is right

  • Different voice-agent architectures can be ranked on identical accuracy and experience scales for the first time.
  • Reliability engineering must close the 0.44 median gap between peak and consistent performance on accuracy tasks.
  • Accent and noise robustness must be treated as first-class requirements that vary by architecture.
  • The 213-scenario suite across three enterprise domains supplies a reusable test bed for targeted fixes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Architectures may face an inherent trade-off that future work could resolve by combining strengths of the three current families.
  • Adding more open-ended or multi-party scenarios would test whether the current metrics still separate systems cleanly.
  • If EVA-X turns out to drive user retention more than EVA-A, teams might deliberately accept lower accuracy for better flow.
  • The automatic validation step could be reused as a training signal to reduce simulator errors in other dialogue systems.

Load-bearing premise

Bot-to-bot simulated conversations with automatic validation match the distribution of real human voice interactions and the EVA-A and EVA-X scores track downstream user satisfaction or task success.

What would settle it

A head-to-head study that runs the same twelve agents with real human users, records satisfaction and task-success rates, and checks whether the ordering or absolute levels match the EVA-Bench rankings.

Original abstract

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user simulator error and appropriately regenerates conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to all major agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k--pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean $\Delta$ up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces EVA-Bench, an end-to-end framework for evaluating voice agents via bot-to-bot audio conversations with automatic simulation validation and regeneration. It defines two composite metrics—EVA-A (capturing task completion, faithfulness, and speech fidelity) and EVA-X (capturing conversation progression, conciseness, and turn-taking timing)—and applies them to 213 scenarios across three enterprise domains. The work evaluates 12 systems spanning three architectures under pass@1/pass@k/pass^k protocols and a controlled accent/noise perturbation suite, reporting that no system exceeds 0.5 on both EVA-A and EVA-X pass@1, a median 0.44 gap between peak and reliable performance on EVA-A, and architecture-varying robustness drops up to 0.314.

Significance. If the simulation and metrics prove representative, EVA-Bench fills a gap by enabling direct cross-architecture comparison of voice-specific failure modes and by releasing the full framework, evaluation suite, and data under open license. The empirical distinctions between peak/reliable capability and the quantified robustness gaps under perturbation provide concrete, falsifiable baselines that future systems can target.

major comments (1)
  1. [Abstract and results section] The claims that the observed accuracy-experience trade-off and robustness gaps (e.g., mean drops up to 0.314) are indicative of production voice-agent behavior rest on the unvalidated assumption that bot-to-bot dialogues plus automatic error detection faithfully reproduce human turn-taking, intent distributions, and downstream task success. No human A/B correlation studies, user-satisfaction ratings, or comparisons against real completion rates are reported. That assumption is load-bearing: without it, the numerical findings can only be read as benchmark-specific, not generalizable.
minor comments (2)
  1. [Framework description] The description of how automatic validation detects simulator errors and triggers regeneration could be expanded with a concrete example or pseudocode to improve reproducibility.
  2. [Results tables/figures] Table or figure captions for the 12-system results should explicitly note the number of runs per condition to clarify the statistical basis of the reported medians and means.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address the major comment below.

Point-by-point responses
  1. Referee: [Abstract and results section] The claims that the observed accuracy-experience trade-off and robustness gaps (e.g., mean drops up to 0.314) are indicative of production voice-agent behavior rest on the unvalidated assumption that bot-to-bot dialogues plus automatic error detection faithfully reproduce human turn-taking, intent distributions, and downstream task success. No human A/B correlation studies, user-satisfaction ratings, or comparisons against real completion rates are reported. That assumption is load-bearing: without it, the numerical findings can only be read as benchmark-specific, not generalizable.

    Authors: We agree that the absence of human-A/B correlation studies, user-satisfaction ratings, or direct comparisons to real-world completion rates means the generalizability of the reported accuracy-experience trade-off and robustness gaps to production voice agents rests on an assumption that remains unvalidated in the current manuscript. The EVA-Bench simulation is constructed to approximate human-like multi-turn interactions via bot-to-bot audio with automatic error detection and regeneration, but this does not substitute for empirical human validation. In the revised manuscript we will add an explicit Limitations subsection (in the Discussion) that acknowledges this gap, clarifies that the numerical findings are benchmark-specific, and outlines planned future work on human correlation studies. We will also insert a brief qualifying clause in the abstract and results section to avoid overclaiming generalizability while preserving the core empirical observations.

    Revision: partial

Circularity Check

0 steps flagged

Empirical benchmark release with direct metric computation; no derivations or predictions reduce to inputs

Full rationale

The paper defines EVA-A and EVA-X as composite metrics from explicit criteria (task completion, faithfulness, conciseness, timing) and applies them to bot-to-bot audio dialogues with automatic validation. All reported results (pass@1 thresholds, gaps of 0.44, perturbation drops) are computed directly from these definitions on generated data. No equations, fitted parameters, self-citations, or ansatzes are used to derive the central claims; the framework is self-contained against its own stated benchmarks and scenarios. This matches the default expectation of no significant circularity for an empirical benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on domain assumptions that simulated conversations can stand in for real user behavior and that the chosen composite metrics capture the relevant quality dimensions; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption Bot-to-bot audio conversations with automatic validation can generate realistic multi-turn dialogues that reflect real-world voice agent usage
    This underpins the entire simulation side of EVA-Bench and is required for the benchmark scores to generalize.
  • domain assumption The composite definitions of EVA-A and EVA-X adequately measure task completion, faithfulness, speech fidelity, conversation progression, conciseness, and turn-taking
    These metrics are the core of the measurement side; their validity is assumed without reported human correlation data in the abstract.

pith-pipeline@v0.9.0 · 5667 in / 1513 out tokens · 71341 ms · 2026-05-14T17:26:52.036391+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing.

  • IndisputableMonolith/Foundation/RealityFromDistinction reality_from_one_distinction unclear

    Relation between the paper passage and the cited Recognition theorem.

    Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k - pass^k gap of 0.44 on EVA-A)

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.