pith. machine review for the scientific record.

arxiv: 2605.15104 · v1 · submitted 2026-05-14 · 💻 cs.CL

Recognition: no theorem link

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 03:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords framework · tool · confetti · agents · benchmarks · calling · demonstrates · evaluation

The pith

A dataset-agnostic framework converts text tool-calling benchmarks into paired audio versions via TTS, speaker variation, and noise; evaluation of seven omni-modal models on Confetti and When2Call shows model- and task-dependent performance, with text-to-voice gaps of only 1.8-4.8 points on Confetti.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Voice assistants are getting better at understanding spoken requests, but they often need to use tools for tasks like checking calendars or searching for information. Most current tests of these tool-calling skills use only written text, which does not match real spoken conversations with different voices and background sounds. The authors created a method to turn those text tests into spoken versions without rewriting the questions or the correct answers. They generate audio from the text using speech synthesis, vary the speaker, and mix in environmental noise to simulate real conditions. This produces matching text and audio pairs that keep the same tool schemas and gold-standard answers. They tested seven models that can process both text and audio on two converted datasets. Performance varied by model and task, with some models handling one dataset better than the other. The drop in scores from text to audio was modest, usually just a few points. Most errors came from models mishearing specific details such as names or numbers in the spoken input. They also tested using other AI models as judges of the answers instead of humans and found that some open-source judge models matched proprietary ones over 80 percent of the time. The approach gives a quick, repeatable first check of how well voice agents handle tools before investing in a full custom audio dataset.
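To make the conversion step concrete, here is a minimal sketch of how one text instance could be rendered as a paired audio instance, assuming a generic TTS callable and simple SNR-based noise mixing; the voice pool, SNR values, and field names are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the text-to-audio conversion described above.
# The TTS call is a placeholder (any engine returning a waveform works);
# voice IDs, SNR levels, and field names are assumptions for illustration.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then mix."""
    reps = int(np.ceil(len(speech) / len(noise)))   # loop noise to speech length
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

def convert_instance(instance: dict, tts, voices: list[str], noise: np.ndarray,
                     snr_db: float, rng: np.random.Generator) -> dict:
    """Turn one text benchmark instance into a paired text/audio instance.

    The tool schema and gold tool calls are copied through unchanged;
    only the user utterance is rendered as audio.
    """
    voice = voices[rng.integers(len(voices))]            # speaker variation
    speech = tts(instance["user_text"], voice=voice)     # placeholder TTS call
    audio = mix_at_snr(speech, noise, snr_db)            # environmental noise
    return {
        "user_text": instance["user_text"],   # text half of the pair
        "audio": audio,                       # audio half of the pair
        "voice": voice,
        "snr_db": snr_db,
        "tools": instance["tools"],           # preserved tool schema
        "gold_calls": instance["gold_calls"], # preserved gold labels
    }
```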

Core claim

Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations.

Load-bearing premise

That adding TTS, speaker variation, and environmental noise does not introduce new biases or artifacts that change how models interpret tool arguments or intent in ways that the preserved gold labels fail to capture.
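One low-cost way to probe this premise without re-annotation would be an ASR round trip: transcribe each synthesized audio instance and check whether the gold argument values are still recoverable. A minimal sketch, assuming a generic `transcribe` callable and crude substring matching, neither of which is part of the paper's protocol:

```python
# Sketch of an automated check on the premise above: transcribe the synthesized
# audio and test whether every gold argument value survives in the transcript.
# `transcribe` stands in for any ASR system; the field names and exact-substring
# matching are deliberately simple assumptions.
def argument_values_survive(instance: dict, transcribe) -> bool:
    transcript = transcribe(instance["audio"]).lower()
    values = [
        str(v).lower()
        for call in instance["gold_calls"]
        for v in call.get("arguments", {}).values()
    ]
    return all(v in transcript for v in values)

def survival_rate(audio_set: list[dict], transcribe) -> float:
    """Fraction of converted instances whose gold argument values remain audible."""
    return sum(argument_values_survive(x, transcribe) for x in audio_set) / len(audio_set)
```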

Original abstract

Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a dataset-agnostic framework that converts text-based tool-calling benchmarks (Confetti and When2Call) into paired audio versions via TTS, speaker variation, and environmental noise while preserving original annotations and tool schemas. It evaluates seven omni-modal models, reports model- and task-dependent performance (e.g., Gemini-3.1-Flash-Live at 70.4 on Confetti audio, GPT-Realtime-1.5 at 71.9 on When2Call), quantifies text-to-voice gaps (1.8–4.8 points), analyzes failures as primarily argument-value misunderstandings, and includes text-only baselines, an ambiguity stress test, and a reference-free LLM-as-judge protocol validated against human preferences (with open-source Qwen3 judges ≥8B reaching >80% agreement).

Significance. If the conversion fidelity holds, the work supplies a reproducible, low-cost diagnostic for voice-based tool calling that complements purpose-built audio corpora, quantifies modality gaps in current omni models, and demonstrates viable open-source LLM judges for privacy-preserving evaluation. The framework's emphasis on preserving gold labels without re-annotation is a practical strength for rapid iteration on existing benchmarks.
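The judge-validation figure cited here reduces to percentage agreement between an open-source judge and a proprietary reference judge over shared verdicts; a minimal sketch, assuming binary verdicts and hypothetical judge callables rather than the paper's full reference-free protocol:

```python
# Sketch of the judge-agreement measurement referenced above.
# `open_judge` and `reference_judge` are hypothetical callables returning a
# verdict per instance (e.g., "accept"/"reject"); the paper additionally
# validates judges against human preferences.
def judge_agreement(instances, open_judge, reference_judge) -> float:
    """Fraction of instances on which the two judges give the same verdict."""
    matches = sum(open_judge(x) == reference_judge(x) for x in instances)
    return matches / len(instances)

# An agreement above 0.80 would correspond to the >80% threshold reported
# for open-source Qwen3 judges with at least 8B parameters.
```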

major comments (2)
  1. [Abstract / Framework] The central claim that performance gaps (1.8–4.8 points) can be attributed to audio understanding rather than label drift rests on the untested assumption that TTS + speaker variation + noise preserves argument values and intent exactly as in the original text. No human re-annotation or verification study is reported to confirm that annotators would assign identical tool calls to the synthesized audio instances; without this control, the reported degradations and model rankings remain potentially confounded by conversion artifacts.
  2. [Failure analysis] The statement that degradations 'most often reflect misunderstandings of argument values in the speech' is presented without a quantitative breakdown (e.g., percentage of errors by category) or inter-annotator agreement on the audio-specific error taxonomy, making it difficult to assess whether the observed gaps are driven by audio-specific issues or by residual label mismatches.
minor comments (2)
  1. [Methods] The paper should explicitly state the TTS engine, sampling parameters, noise SNR levels, and speaker pool size used for each dataset to ensure full reproducibility.
  2. [Evaluation] Statistical details (number of audio instances per dataset, confidence intervals on the reported scores, and exact data splits) are referenced in the abstract but should be tabulated for clarity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that TTS plus noise preserves task semantics and gold labels. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Text-to-speech synthesis combined with speaker variation and environmental noise preserves the original tool schema, argument values, and gold labels without introducing unaccounted biases.
    Invoked to justify creating audio instances without re-annotation.

pith-pipeline@v0.9.0 · 5602 in / 1392 out tokens · 82732 ms · 2026-05-15T03:18:59.044985+00:00 · methodology

discussion (0)
