Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
Pith reviewed 2026-05-10 07:27 UTC · model grok-4.3
The pith
A new benchmark dataset shows that speech language models handle simple spoken commands reliably but degrade on complex multi-step tool tasks and noisy audio.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that existing benchmarks for speech tool use lack sufficient domain coverage, acoustic variety, and compositional depth, so the authors introduce Audio2Tool to show that current models remain reliable only on straightforward commands and lose capability under realistic multi-intent and noisy conditions.
What carries the argument
The Audio2Tool dataset itself, built around a three-tier complexity ladder (simple commands, multi-intent, needle-in-a-haystack) and generated via zero-shot voice cloning plus diverse noise overlays to create in-the-wild audio.
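The noise-overlay step is described only at a high level; as a rough illustration of what such a step typically involves, here is a minimal sketch that mixes a cloned-voice utterance with a background recording at a fixed signal-to-noise ratio. It assumes mono clips at a shared sample rate; the file names and the 5 dB setting are invented for illustration and are not taken from the dataset pipeline.

```python
# Illustrative only: overlay a noise profile on a clean TTS utterance at a
# target SNR, the kind of "in-the-wild" simulation the benchmark describes.
import numpy as np
import soundfile as sf

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add."""
    # Loop or trim the noise to match the utterance length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    speech_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noisy = clean + noise * np.sqrt(target_noise_power / noise_power)

    # Guard against clipping when the mix is written back to disk.
    peak = np.max(np.abs(noisy))
    return noisy / peak if peak > 1.0 else noisy

# Hypothetical usage: one cloned-voice query, one car-cabin noise profile, 5 dB SNR.
speech, sr = sf.read("cloned_query.wav")
noise, _ = sf.read("car_cabin_noise.wav")  # assumed to share the sample rate
sf.write("noisy_query.wav", mix_at_snr(speech, noise, snr_db=5.0), sr)
```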
If this is right
- Any new SpeechLM intended for tool use must be evaluated on compositional and acoustically varied inputs rather than isolated commands.
- Performance gaps in the benchmark point to separate failure modes that can be targeted in model training or architecture design.
- The public dataset allows consistent comparison across future models and pipelines.
- Domains such as smart-car and wearables now have a shared testbed for spoken tool execution.
Where Pith is reading between the lines
- The observed degradation suggests that purely end-to-end speech models may benefit from explicit intermediate planning steps before tool selection.
- Similar multi-tier benchmarks could be extended to other high-stakes domains such as medical or legal voice interfaces.
- The results imply that acoustic robustness and compositional reasoning should be treated as distinct training objectives rather than assumed to improve together.
Load-bearing premise
Synthetic speech produced by zero-shot voice cloning together with the selected noise profiles is representative enough of actual spoken interactions to evaluate tool-calling reliability.
What would settle it
Record the same query set with real speakers in uncontrolled environments and measure whether model error rates match the synthetic benchmark results.
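One way to run that check, sketched below under assumptions the paper does not make: score each query once on the synthetic audio and once on a real-speaker re-recording of the same text, then compare the paired correct/incorrect outcomes with McNemar's exact test. The harness, variable names, and toy outcomes are hypothetical.

```python
# Hypothetical validation harness: paired comparison of tool-call success on
# synthetic vs. real recordings of the same queries (McNemar's exact test).
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant-pair counts.

    b = queries correct on synthetic audio but wrong on the real recording
    c = queries wrong on synthetic audio but correct on the real recording
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Per-query outcomes (1 = correct tool call); toy values for illustration.
synthetic_ok = [1, 1, 0, 1, 0, 1, 1, 0]
real_ok      = [1, 0, 0, 1, 0, 1, 0, 0]

b = sum(1 for s, r in zip(synthetic_ok, real_ok) if s and not r)
c = sum(1 for s, r in zip(synthetic_ok, real_ok) if r and not s)
print(f"discordant pairs: b={b}, c={c}, p={mcnemar_exact(b, c):.3f}")
```

A small p-value on a matched subset would indicate that the synthetic audio systematically over- or under-states difficulty relative to real speech; agreement would support the load-bearing premise above.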
Original abstract
Voice assistants increasingly rely on Speech Language Models (SpeechLMs) to interpret spoken queries and execute complex tasks, yet existing benchmarks lack domain breadth, acoustic diversity, and compositional reasoning complexity to evaluate tool-calling performance. We introduce Audio2Tool, a large-scale dataset comprising approximately 30,000 queries designed to assess tool-calling capabilities of SpeechLMs across three primary domains: Smart Car, Smart Home, and Wearables. Our benchmark features a multi-tier complexity hierarchy, ranging from simple direct commands to complex multi-intent and needle-in-a-haystack extraction to isolate distinct failure modes. To ensure realism, we employ zero-shot voice cloning text-to-speech synthesis and diverse noise profiles to simulate in-the-wild conditions. Evaluations of state-of-the-art SpeechLMs and ASR-LLM pipelines show strong performance on simple commands but significant degradation under compositional and acoustic challenges. Code and dataset are publicly available on the project page: https://audio2tool.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Audio2Tool, a dataset of approximately 30,000 spoken queries for benchmarking tool-calling capabilities of SpeechLMs across Smart Car, Smart Home, and Wearables domains. It defines a multi-tier complexity hierarchy (simple direct commands to multi-intent and needle-in-a-haystack extraction) and generates test audio via zero-shot voice cloning TTS plus diverse noise profiles to simulate in-the-wild conditions. Evaluations of state-of-the-art SpeechLMs and ASR-LLM pipelines are reported to demonstrate strong performance on simple commands but significant degradation under compositional and acoustic challenges. The dataset and code are released publicly.
Significance. If the dataset curation is sound and the evaluations include rigorous quantitative metrics with proper controls, Audio2Tool could fill an important gap in benchmarks for speech-based tool use, a growing area for voice assistants. The public release of data and code is a clear strength that would enable community follow-up work.
major comments (2)
- [Evaluation section] The abstract states that evaluations 'show strong performance on simple commands but significant degradation under compositional and acoustic challenges,' yet supplies no concrete metrics (e.g., accuracy or F1 per tier), per-tier sample counts, statistical tests, or baseline comparisons. This absence makes the central performance claim unverifiable from the provided text and is load-bearing for the paper's empirical contribution.
- [Dataset Generation / Acoustic Simulation] The acoustic challenges are created exclusively via zero-shot voice cloning TTS followed by noise overlay, with no control experiment comparing the same textual queries rendered by TTS versus human recordings under matched noise profiles. This leaves open whether reported acoustic degradation reflects genuine in-the-wild variability or TTS-specific artifacts (e.g., prosody or spectral distortions), directly affecting the claim that the setup 'ensures realism' and 'simulates in-the-wild conditions.'
minor comments (1)
- [Abstract] The exact total query count, domain-wise breakdown, and tier-wise distribution are not stated; only the approximate figure of 30,000 is given.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the evaluation presentation and acoustic simulation approach. We address each major comment below, indicating the revisions we will incorporate.
Point-by-point responses
- Referee: [Evaluation section] The abstract states that evaluations 'show strong performance on simple commands but significant degradation under compositional and acoustic challenges,' yet supplies no concrete metrics (e.g., accuracy or F1 per tier), per-tier sample counts, statistical tests, or baseline comparisons. This absence makes the central performance claim unverifiable from the provided text and is load-bearing for the paper's empirical contribution.
Authors: We agree that the abstract would benefit from concrete metrics to make the central claims immediately verifiable. The full manuscript's Evaluation section (Section 4) already contains the requested details: per-tier accuracy and F1 scores, exact sample counts per complexity tier, statistical significance tests, and direct comparisons against ASR-LLM baselines (see the scoring sketch after these responses). We will revise the abstract to include a concise quantitative summary of the key results (e.g., performance on simple commands versus compositional and noisy conditions) while preserving the detailed reporting in the body. Revision: yes.
- Referee: [Dataset Generation / Acoustic Simulation] The acoustic challenges are created exclusively via zero-shot voice cloning TTS followed by noise overlay, with no control experiment comparing the same textual queries rendered by TTS versus human recordings under matched noise profiles. This leaves open whether reported acoustic degradation reflects genuine in-the-wild variability or TTS-specific artifacts (e.g., prosody or spectral distortions), directly affecting the claim that the setup 'ensures realism' and 'simulates in-the-wild conditions.'
Authors: We acknowledge this as a substantive limitation of the current acoustic simulation. Zero-shot voice cloning was selected to enable scalable, reproducible generation of roughly 30,000 queries with a degree of voice diversity that would be prohibitively expensive to obtain via human recordings, and we incorporated a broad set of real-world noise profiles to approximate in-the-wild conditions. We will add an explicit paragraph to the Limitations section noting the absence of a TTS-versus-human control and the possibility of synthesis artifacts, while clarifying that the benchmark remains a controlled, publicly reproducible testbed for measuring degradation. We cannot retroactively collect matched human recordings for the full dataset at this stage. Revision: partial.
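For context on the per-tier metrics the rebuttal points to, below is a minimal sketch of one plausible scoring scheme: exact match on tool name and arguments, aggregated by complexity tier. The match criterion, field names, and tier labels are assumptions for illustration; the paper's Section 4 defines its own protocol.

```python
# Schematic per-tier tool-call accuracy; not the paper's evaluation code.
from collections import defaultdict

def call_matches(pred: dict, gold: dict) -> bool:
    """Exact match on tool name and key-value arguments (order-insensitive)."""
    return (
        pred.get("name") == gold.get("name")
        and pred.get("arguments", {}) == gold.get("arguments", {})
    )

def per_tier_accuracy(examples: list) -> dict:
    """examples: [{"tier": ..., "pred": {...}, "gold": {...}}, ...]"""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        totals[ex["tier"]] += 1
        hits[ex["tier"]] += call_matches(ex["pred"], ex["gold"])
    return {tier: hits[tier] / totals[tier] for tier in totals}

# Toy illustration with two of the benchmark's complexity tiers.
examples = [
    {"tier": "simple",
     "pred": {"name": "open_trunk", "arguments": {}},
     "gold": {"name": "open_trunk", "arguments": {}}},
    {"tier": "multi_intent",
     "pred": {"name": "set_temperature", "arguments": {"celsius": 22}},
     "gold": {"name": "set_temperature", "arguments": {"celsius": 21}}},
]
print(per_tier_accuracy(examples))  # {'simple': 1.0, 'multi_intent': 0.0}
```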
Circularity Check
Empirical dataset release with no derivation chain or fitted predictions
Full rationale
The paper introduces Audio2Tool as a new benchmark dataset of ~30k queries generated via zero-shot TTS plus noise overlays. All reported results are direct empirical evaluations on this held-out test set; no equations, parameters, or quantities are fitted to the evaluation data and then re-presented as predictions. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. The work is therefore self-contained against external benchmarks and contains no load-bearing step that reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] "in-the-wild" (2026)
  Introduction: Recent advances in Speech Large Language Models (SpeechLMs) are transforming voice assistants from simple intent recognition systems into end-to-end, audio-native agents capable of directly invoking tools from raw speech. In this setting, the model performs audio-native function calling, mapping raw acoustic signals directly to executable AP...
- [2] Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use (arXiv, 2026)
  Background: With the growing interest in developing systems capable of end-to-end interaction, Spoken Language Understanding (SLU) has expanded to also encompass the ability to invoke executable function calls directly from speech. Going beyond traditional cascaded pipelines, recent SpeechLMs enable audio-native reasoning, allowing models to directly se...
- [3] "While these benchmarks are challenging, they do not directly evaluate executable tool invocation from speech"
  ...covered multi-domain, compositional, and multi-intent settings. While these benchmarks are challenging, they do not directly evaluate executable tool invocation from speech. From SLU semantics to executable tool calling: Traditional SLU systems produce intermediate semantic representations such as intents, slots, or structured parses. Tool calling exte...
- [4] "Finally, we generate speech with controlled perturbations to assess robustness and failure modes that are specific to speech-driven tool invocation"
  ...and audio/multi-modal tool-use benchmarks [10, 11], it organizes tools and APIs into taxonomies across three application domains and introduces an eight-tier query curriculum that progressively assesses tool selection, argument grounding, and multi-step composition. Finally, we generate speech with controlled perturbations to assess robustness and fa...
- [5] "Open the trunk"
  Benchmark: Agentic tool-oriented interactions require capabilities such as audio-native function calling and mapping speech directly to executable API calls. However, evaluating such capabilities requires benchmarks to capture the interplay of speech prosody, acoustic variability, contextual cues, and multi-domain API constraints. To address this need, w...
- [6] "To increase realism, we use state-of-the-art zero-shot voice-cloning TTS models: Qwen3TTS [19] and CosyVoice-3 [20]"
  TTS Voice Generation and Processing: The benchmark is designed to reflect realistic usage scenarios, with the taxonomy derived from tools representing real-world functionality. To increase realism, we use state-of-the-art zero-shot voice-cloning TTS models: Qwen3TTS [19] and CosyVoice-3 [20]. To capture accent diversity, we curate speakers from mul...
- [7] "Models: We evaluate two categories of speech-based systems: (i) end-to-end SpeechLMs; and (ii) modular ASR-LLM pipelines"
  Experiments, 5.1 Models: We evaluate two categories of speech-based systems: (i) end-to-end SpeechLMs; and (ii) modular ASR-LLM pipelines. For this study, we focus exclusively on open-source models. SpeechLMs. We evaluate several state-of-the-art speech language models, including Step-Audio-2 [21], AudioFlamingo-3 [22], Kimi Audio [23], Qwen-3-Omni [24], an...
- [8]
  Conclusion: In this paper, we introduced Audio2Tool, a large-scale benchmark for evaluating speech-based tool use under realistic and compositional conditions. Audio2Tool evaluates SpeechLMs across diverse application domains, acoustic variability, and increasing task complexity, ranging from direct commands to multi-intent and multi-turn interactions, an...
- [9] "After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication"
  Generative AI Use Disclosure: During the preparation of this work, the authors used Claude Sonnet 4.5, ChatGPT, and Gemini 3, in order to generate structured figures for the overview of the pipeline shown in the paper, and for polishing the grammar of the manuscript. After using these tools, the authors reviewed and edited the content as needed and tak...
- [10] Y. Chen, X. Yue, C. Zhang, X. Gao, R. T. Tan, and H. Li, "VoiceBench: Benchmarking LLM-based voice assistants," arXiv preprint arXiv:2410.17196, 2024.
- [11] L. Zhong, Z. Du, X. Zhang, H. Hu, and J. Tang, "ComplexFuncBench: Exploring multi-step and constrained function calling under long-context scenario," arXiv preprint arXiv:2501.10132, 2025.
- [12] S. G. Patil, H. Mao, F. Yan, C. C.-J. Ji, V. Suresh, I. Stoica, and J. E. Gonzalez, "The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models," in Forty-second International Conference on Machine Learning, 2025.
- [13] B. Wang, X. Zou, G. Lin, S. Sun, Z. Liu, W. Zhang, Z. Liu, A. Aw, and N. Chen, "AudioBench: A universal benchmark for audio large language models," in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 4297–4316.
- [14] D. Jain, H. Shukla, G. Rajeev, A. Kulkarni, C. Khatri, and S. Agarwal, "VoiceAgentBench: Are voice assistants ready for agentic tasks?" arXiv preprint arXiv:2510.07978, 2025.
- [15] C. T. Hemphill, J. J. Godfrey, and G. R. Doddington, "The ATIS spoken language systems pilot corpus," in Proceedings of the Workshop on Speech and Natural Language, 1990, pp. 96–101. [Online]. Available: https://doi.org/10.3115/116580.116613
- [16] E. Bastianelli et al., "SLURP: A spoken language understanding resource package," in EMNLP, 2020.
- [17] P. Tomasello, A. Shrivastava, D. Lazar, P.-C. Hsu, D. Le, A. Sagar, A. Elkahky, J. Copet, W.-N. Hsu, Y. Adi, R. Algayres, T. A. Nguyen, E. Dupoux, L. Zettlemoyer, and A. Mohamed, "STOP: A dataset for spoken task oriented semantic parsing." [Online]. Available: https://arxiv.org/abs/2207.10643
- [19] Y. Peng, C. Cai, Z. Liu, S. Fan, S. Jiang, H. Xu, Y. Liu, Q. Chen, K. Xu, Y. Li et al., "MAC-SLU: Multi-intent automotive cabin spoken language understanding benchmark," arXiv preprint arXiv:2512.01603, 2025.
- [20] H. Mao, A. A. Ginart, and J. R. Emmons, "BFCL Audio: A benchmark for audio-native function calling," Salesforce Blog, published Aug. 22, 2025. Accessed: 2026-02-23. [Online]. Available: https://www.salesforce.com/blog/bfcl-audio-benchmark/
- [21] H. Mao, S. G. Patil, and J. E. Gonzalez, "MFCL: A multi-modal function calling evaluation for large language models," OpenReview, 2025, paper ID: 8yWECy22Zi. Accessed: 2026-02-23. [Online]. Available: https://openreview.net/forum?id=8yWECy22Zi
- [22] A. Coucke, A. Saade, A. Ball et al., "Snips voice platform: An embedded spoken language understanding system for private-by-design voice interfaces," in EMNLP, 2018.
- [23] A. T. Nguyen, T. T. U. Hoang, M. P. Tu, and X. B. Ngo, "Joint multiple intent detection and slot filling with supervised contrastive learning and self-distillation," arXiv preprint arXiv:2308.14654, 2023. [Online]. Available: https://arxiv.org/abs/2308.14654
- [24] L. Qin, X. Xu, W. Che, and T. Liu, "AGIF: An adaptive graph-interactive framework for joint multiple intent detection and slot filling," in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1807–1816.
- [25] R. Grossman, T. Park, K. Dhawan, A. Titus, S. Zhi, Y. Shchadilova, W. Wang, J. Balam, and B. Ginsburg, "SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription," arXiv preprint arXiv:2508.05554, 2025.
- [26] F. Shao, "YodasSpeakerPool: A richly-annotated multi-speaker dataset for voice cloning," https://github.com/fangningshao/YodasSpeakerPool, 2026; built from Emilia-YODAS, annotated with Gemini 2.5 Flash.
- [27] S. Zheng, L. Cheng, Y. Chen, H. Wang, and Q. Chen, "3D-Speaker: A large-scale multi-device, multi-distance, and multi-dialect corpus for speech representation disentanglement," arXiv preprint arXiv:2306.15354, 2023.
- [28] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, "VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Co..., 2021.
- [29] H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo et al., "Qwen3-TTS technical report," arXiv preprint arXiv:2601.15621, 2026.
- [30] Z. Du, C. Gao, Y. Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shi et al., "CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training," arXiv preprint arXiv:2505.17589, 2025.
- [31] B. Wu, C. Yan, C. Hu, C. Yi, C. Feng, F. Tian, F. Shen, G. Yu, H. Zhang, J. Li et al., "Step-Audio 2 technical report," arXiv preprint arXiv:2507.16632, 2025.
- [32] A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., "Audio Flamingo 3: Advancing audio intelligence with fully open large audio language models," arXiv preprint arXiv:2507.08128, 2025.
- [33] D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang et al., "Kimi-Audio technical report," arXiv preprint arXiv:2504.18425, 2025.
- [34] J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu et al., "Qwen3-Omni technical report," arXiv preprint arXiv:2509.17765, 2025.
- [35] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
- [36] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025.
- [37] A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, L. Rouillard et al., "Gemma 3 technical report," arXiv preprint arXiv:2503.19786, vol. 4, 2025.
- [38] C. K. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan, and J. Gehrke, "A scalable noisy speech dataset and online subjective test framework," arXiv preprint arXiv:1909.08050, 2019.