From Text to Voice: A Reproducible and Verifiable Framework for Evaluating Tool Calling LLM Agents
Pith reviewed 2026-05-15 03:18 UTC · model grok-4.3
The pith
A dataset-agnostic framework converts text tool-calling benchmarks (Confetti and When2Call) into paired audio versions via TTS, speaker variation, and environmental noise, showing strongly model- and task-dependent performance, with text-to-voice gaps of 1.8–4.8 points on Confetti.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations.
Load-bearing premise
That adding TTS, speaker variation, and environmental noise does not introduce new biases or artifacts that change how models interpret tool arguments or intent in ways that the preserved gold labels fail to capture.
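For concreteness, below is a minimal sketch of the kind of conversion step the claim and premise refer to: synthesize the user turn with a sampled speaker, mix in environmental noise at a target SNR, and leave the tool schema and gold label untouched. The paper does not publish its pipeline here, so the function names, SNR range, and the `tts` callable are all hypothetical placeholders.

```python
import numpy as np

def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Add environmental noise to synthesized speech at a target SNR (dB)."""
    # Loop or trim the noise clip to the speech length.
    if len(noise) < len(speech):
        noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / scaled_noise_power) == snr_db.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise

def convert_instance(example: dict, tts, speakers, noise_clips, rng) -> dict:
    """Pair a text benchmark instance with an audio rendering of its user turn,
    keeping the original tool schema and gold tool call unchanged."""
    speaker = rng.choice(speakers)                    # speaker variation
    noise = noise_clips[rng.integers(len(noise_clips))]
    audio = tts(example["user_text"], speaker)        # hypothetical TTS call
    audio = mix_at_snr(audio, noise, snr_db=rng.uniform(5, 20))  # assumed SNR range
    return {**example, "audio": audio, "speaker": speaker}
```

The load-bearing premise is precisely that nothing in this conversion (TTS pronunciation, speaker choice, or noise level) changes which tool call the preserved gold label should map to.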
Original abstract
Voice agents increasingly require reliable tool use from speech, whereas prominent tool-calling benchmarks remain text-based. We study whether verified text benchmarks can be converted into controlled audio-based tool calling evaluations without re-annotating the tool schema and gold labels. Our dataset-agnostic framework uses text-to-speech, speaker variation, and environmental noise to create paired text-audio instances while preserving the original dataset annotations. Based on extensive evaluation of 7 omni-modal models on audio-converted versions of Confetti and When2Call, our framework demonstrates that the performance is strongly model- and task-dependent: Gemini-3.1-Flash-Live obtains the highest Confetti score (70.4), whereas GPT-Realtime-1.5 performs best on When2Call (71.9). On Confetti, the text-to-voice gap ranges from 1.8 points for Qwen3-Omni to 4.8 points for GPT-Realtime-1.5. A targeted analysis of failure cases demonstrates that degradations most often reflect misunderstandings of argument values in the speech. Considering real-world deployment scenarios, we further report text-only results, an ambiguity-based reformulation stress test, and a reference-free LLM-as-judge protocol validated against human preferences. Notably, we find that open-source Qwen3 judges with at least 8B parameters exceed 80% agreement with proprietary judges, supporting privacy-preserving evaluation. Overall, our framework provides a verifiable and reproducible first-stage diagnostic that complements purpose-built audio corpora.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces a dataset-agnostic framework that converts text-based tool-calling benchmarks (Confetti and When2Call) into paired audio versions via TTS, speaker variation, and environmental noise while preserving original annotations and tool schemas. It evaluates seven omni-modal models, reports model- and task-dependent performance (e.g., Gemini-3.1-Flash-Live at 70.4 on Confetti audio, GPT-Realtime-1.5 at 71.9 on When2Call), quantifies text-to-voice gaps (1.8–4.8 points), analyzes failures as primarily argument-value misunderstandings, and includes text-only baselines, an ambiguity stress test, and a reference-free LLM-as-judge protocol validated against human preferences (with open-source Qwen3 judges ≥8B reaching >80% agreement).
Significance. If the conversion fidelity holds, the work supplies a reproducible, low-cost diagnostic for voice-based tool calling that complements purpose-built audio corpora, quantifies modality gaps in current omni models, and demonstrates viable open-source LLM judges for privacy-preserving evaluation. The framework's emphasis on preserving gold labels without re-annotation is a practical strength for rapid iteration on existing benchmarks.
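The abstract does not spell out how the >80% judge agreement is computed; a minimal percent-agreement sketch, assuming per-instance verdicts from an open-source and a proprietary judge, would look like the following. The verdict labels and counts are illustrative only.

```python
from typing import Sequence

def judge_agreement(open_judge: Sequence[str], proprietary_judge: Sequence[str]) -> float:
    """Fraction of instances on which two judges issue the same verdict
    (e.g. 'correct' / 'incorrect' for a model's tool call)."""
    assert len(open_judge) == len(proprietary_judge)
    matches = sum(a == b for a, b in zip(open_judge, proprietary_judge))
    return matches / len(open_judge)

# Illustrative only: an open-source Qwen3-8B judge matching a proprietary judge
# on 83 of 100 verdicts would clear the >80% bar reported in the abstract.
open_votes = ["correct"] * 83 + ["incorrect"] * 17
prop_votes = ["correct"] * 100
print(judge_agreement(open_votes, prop_votes))  # 0.83
```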
major comments (2)
- [Abstract / Framework] The central claim that the performance gaps (1.8–4.8 points) can be attributed to audio understanding rather than label drift rests on the untested assumption that TTS + speaker variation + noise preserves argument values and intent exactly as in the original text. No human re-annotation or verification study is reported to confirm that annotators would assign identical tool calls to the synthesized audio instances; without this control, the reported degradations and model rankings remain potentially confounded by conversion artifacts.
- [Failure analysis] The claim that degradations 'most often reflect misunderstandings of argument values in the speech' is presented without a quantitative breakdown (e.g., the percentage of errors per category) or inter-annotator agreement on the audio-specific error taxonomy, making it difficult to assess whether the observed gaps are driven by audio-specific issues or by residual label mismatches.
minor comments (2)
- [Methods] The paper should explicitly state the TTS engine, sampling parameters, noise SNR levels, and speaker pool size used for each dataset to ensure full reproducibility; see the configuration sketch after this list.
- [Evaluation] Statistical details behind the scores reported in the abstract (number of audio instances per dataset, confidence intervals on the reported scores, and exact data splits) should be tabulated for clarity.
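As an illustration of what such a reproducibility record could look like, a minimal per-dataset configuration is sketched below; every value is a placeholder, not a figure from the paper.

```python
# Hypothetical example of the conversion parameters that should be reported
# for each dataset; none of these values come from the paper.
CONVERSION_CONFIG = {
    "confetti": {
        "tts_engine": "<name and version of the TTS system>",
        "sample_rate_hz": 16_000,
        "speaker_pool_size": 8,
        "noise_sources": ["cafe", "street", "office"],
        "snr_db_range": (5, 20),
        "random_seed": 0,
        "num_audio_instances": None,  # to be filled in per split
    },
    # "when2call": { ... same fields ... },
}
```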
Axiom & Free-Parameter Ledger
axioms (1)
- [domain assumption] Text-to-speech synthesis combined with speaker variation and environmental noise preserves the original tool schema, argument values, and gold labels without introducing unaccounted biases.
discussion (0)