pith. machine review for the scientific record.

arxiv: 2605.15104 · v1 · submitted 2026-05-14 · 💻 cs.CL

Recognition: no theorem link

From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 03:18 UTC · model grok-4.3

classification 💻 cs.CL
keywords framework · tool · confetti · agents · benchmarks · calling · demonstrates · evaluation

The pith

A dataset-agnostic framework converts text tool-calling benchmarks into paired audio versions via TTS, speaker variation, and noise; evaluation of seven omni-modal models on Confetti and When2Call shows model- and task-dependent performance, with text-to-voice gaps of only 1.8-4.8 points on Confetti.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Voice assistants are getting better at understanding spoken requests, but they often need to use tools for tasks like checking calendars or searching for information. Most current tests of these tool-calling skills use only written text, which does not match real spoken conversations with different voices and background sounds. The authors created a method to turn those text tests into spoken versions without rewriting the questions or the correct answers. They generate audio from the text using speech synthesis, vary the speaker, and mix in environmental noise to simulate real conditions. This produces matching text and audio pairs that keep the same tool schemas and gold-standard answers. They tested seven models that can process both text and audio on two converted datasets. Performance varied by model and task, with some models handling one dataset better than the other. The drop in scores from text to audio was modest, usually just a few points. Most errors came from models mishearing specific details such as names or numbers in the spoken input. They also tested using other AI models as judges of the answers instead of humans and found that some open-source judge models matched proprietary ones over 80 percent of the time. The approach gives a quick, repeatable first check of how well voice agents handle tools before investing in a full custom audio dataset.
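To make the conversion step concrete, here is a minimal sketch of how one text instance could be rendered as a paired audio instance, assuming a generic TTS callable and simple SNR-based noise mixing; the voice pool, SNR values, and field names are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch of the text-to-audio conversion described above.
# The TTS call is a placeholder (any engine returning a waveform works);
# voice IDs, SNR levels, and field names are assumptions for illustration.
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise ratio equals `snr_db`, then mix."""
    reps = int(np.ceil(len(speech) / len(noise)))   # loop noise to speech length
    noise = np.tile(noise, reps)[: len(speech)]
    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    target_p_noise = p_speech / (10 ** (snr_db / 10))
    return speech + noise * np.sqrt(target_p_noise / p_noise)

def convert_instance(instance: dict, tts, voices: list[str], noise: np.ndarray,
                     snr_db: float, rng: np.random.Generator) -> dict:
    """Turn one text benchmark instance into a paired text/audio instance.

    The tool schema and gold tool calls are copied through unchanged;
    only the user utterance is rendered as audio.
    """
    voice = voices[rng.integers(len(voices))]            # speaker variation
    speech = tts(instance["user_text"], voice=voice)     # placeholder TTS call
    audio = mix_at_snr(speech, noise, snr_db)            # environmental noise
    return {
        "user_text": instance["user_text"],   # text half of the pair
        "audio": audio,                       # audio half of the pair
        "voice": voice,
        "snr_db": snr_db,
        "tools": instance["tools"],           # preserved tool schema
        "gold_calls": instance["gold_calls"], # preserved gold labels
    }
```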

Core claim

Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations.

Load-bearing premise

That adding TTS, speaker variation, and environmental noise does not introduce new biases or artifacts that change how models interpret tool arguments or intent in ways that the preserved gold labels fail to capture.
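One low-cost way to probe this premise without re-annotation would be an ASR round trip: transcribe each synthesized audio instance and check whether the gold argument values are still recoverable. A minimal sketch, assuming a generic `transcribe` callable and crude substring matching, neither of which is part of the paper's protocol:

```python
# Sketch of an automated check on the premise above: transcribe the synthesized
# audio and test whether every gold argument value survives in the transcript.
# `transcribe` stands in for any ASR system; the field names and exact-substring
# matching are deliberately simple assumptions.
def argument_values_survive(instance: dict, transcribe) -> bool:
    transcript = transcribe(instance["audio"]).lower()
    values = [
        str(v).lower()
        for call in instance["gold_calls"]
        for v in call.get("arguments", {}).values()
    ]
    return all(v in transcript for v in values)

def survival_rate(audio_set: list[dict], transcribe) -> float:
    """Fraction of converted instances whose gold argument values remain audible."""
    return sum(argument_values_survive(x, transcribe) for x in audio_set) / len(audio_set)
```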

Original abstract

Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces a dataset-agnostic framework that converts text-based tool-calling benchmarks (Confetti and When2Call) into paired audio versions via TTS, speaker variation, and environmental noise while preserving original annotations and tool schemas. It evaluates seven omni-modal models, reports model- and task-dependent performance (e.g., Gemini-3.1-Flash-Live at 70.4 on Confetti audio, GPT-Realtime-1.5 at 71.9 on When2Call), quantifies text-to-voice gaps (1.8–4.8 points), analyzes failures as primarily argument-value misunderstandings, and includes text-only baselines, an ambiguity stress test, and a reference-free LLM-as-judge protocol validated against human preferences (with open-source Qwen3 judges ≥8B reaching >80% agreement).

Significance. If the conversion fidelity holds, the work supplies a reproducible, low-cost diagnostic for voice-based tool calling that complements purpose-built audio corpora, quantifies modality gaps in current omni models, and demonstrates viable open-source LLM judges for privacy-preserving evaluation. The framework's emphasis on preserving gold labels without re-annotation is a practical strength for rapid iteration on existing benchmarks.
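The judge-validation figure cited here reduces to percentage agreement between an open-source judge and a proprietary reference judge over shared verdicts; a minimal sketch, assuming binary verdicts and hypothetical judge callables rather than the paper's full reference-free protocol:

```python
# Sketch of the judge-agreement measurement referenced above.
# `open_judge` and `reference_judge` are hypothetical callables returning a
# verdict per instance (e.g., "accept"/"reject"); the paper additionally
# validates judges against human preferences.
def judge_agreement(instances, open_judge, reference_judge) -> float:
    """Fraction of instances on which the two judges give the same verdict."""
    matches = sum(open_judge(x) == reference_judge(x) for x in instances)
    return matches / len(instances)

# An agreement above 0.80 would correspond to the >80% threshold reported
# for open-source Qwen3 judges with at least 8B parameters.
```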

major comments (2)
  1. [Abstract / Framework] The central claim that performance gaps (1.8–4.8 points) can be attributed to audio understanding rather than label drift rests on the untested assumption that TTS + speaker variation + noise preserves argument values and intent exactly as in the original text. No human re-annotation or verification study is reported to confirm that annotators would assign identical tool calls to the synthesized audio instances; without this control, the reported degradations and model rankings remain potentially confounded by conversion artifacts.
  2. [Failure analysis] The statement that degradations 'most often reflect misunderstandings of argument values in the speech' is presented without a quantitative breakdown (e.g., percentage of errors by category) or inter-annotator agreement on the audio-specific error taxonomy, making it difficult to assess whether the observed gaps are driven by audio-specific issues or by residual label mismatches.
minor comments (2)
  1. [Methods] The paper should explicitly state the TTS engine, sampling parameters, noise SNR levels, and speaker pool size used for each dataset to ensure full reproducibility.
  2. [Evaluation] Statistical details (number of audio instances per dataset, confidence intervals on the reported scores, and exact data splits) are referenced in the abstract but should be tabulated for clarity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that TTS plus noise preserves task semantics and gold labels. No free parameters or invented entities are described in the abstract.

axioms (1)
  • domain assumption Text-to-speech synthesis combined with speaker variation and environmental noise preserves the original tool schema, argument values, and gold labels without introducing unaccounted biases.
    Invoked to justify creating audio instances without re-annotation.

pith-pipeline@v0.9.0 · 5602 in / 1392 out tokens · 82732 ms · 2026-05-15T03:18:59.044985+00:00 · methodology

discussion (0)
