Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
Pith reviewed 2026-05-10 07:27 UTC · model grok-4.3
The pith
A new benchmark dataset shows that speech language models handle simple spoken commands reliably but degrade on complex multi-step tool tasks and noisy audio.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that existing benchmarks for speech tool use lack sufficient domain coverage, acoustic variety, and compositional depth, so the authors introduce Audio2Tool to show that current models remain reliable only on straightforward commands and lose capability under realistic multi-intent and noisy conditions.
What carries the argument
The Audio2Tool dataset itself, built around a three-tier complexity ladder (simple commands, multi-intent, needle-in-a-haystack) and generated via zero-shot voice cloning plus diverse noise overlays to create in-the-wild audio.
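The noise-overlay step is described only at a high level; as a rough illustration of what such a step typically involves, here is a minimal sketch that mixes a cloned-voice utterance with a background recording at a fixed signal-to-noise ratio. It assumes mono clips at a shared sample rate; the file names and the 5 dB setting are invented for illustration and are not taken from the dataset pipeline.

```python
# Illustrative only: overlay a noise profile on a clean TTS utterance at a
# target SNR, the kind of "in-the-wild" simulation the benchmark describes.
import numpy as np
import soundfile as sf

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the speech-to-noise power ratio equals `snr_db`, then add."""
    # Loop or trim the noise to match the utterance length.
    if len(noise) < len(clean):
        noise = np.tile(noise, int(np.ceil(len(clean) / len(noise))))
    noise = noise[: len(clean)]

    speech_power = np.mean(clean ** 2)
    noise_power = np.mean(noise ** 2) + 1e-12
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    noisy = clean + noise * np.sqrt(target_noise_power / noise_power)

    # Guard against clipping when the mix is written back to disk.
    peak = np.max(np.abs(noisy))
    return noisy / peak if peak > 1.0 else noisy

# Hypothetical usage: one cloned-voice query, one car-cabin noise profile, 5 dB SNR.
speech, sr = sf.read("cloned_query.wav")
noise, _ = sf.read("car_cabin_noise.wav")  # assumed to share the sample rate
sf.write("noisy_query.wav", mix_at_snr(speech, noise, snr_db=5.0), sr)
```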
If this is right
- Any new SpeechLM intended for tool use must be evaluated on compositional and acoustically varied inputs rather than isolated commands.
- Performance gaps in the benchmark point to separate failure modes that can be targeted in model training or architecture design.
- The public dataset allows consistent comparison across future models and pipelines.
- Domains such as smart-car and wearables now have a shared testbed for spoken tool execution.
Where Pith is reading between the lines
- The observed degradation suggests that purely end-to-end speech models may benefit from explicit intermediate planning steps before tool selection.
- Similar multi-tier benchmarks could be extended to other high-stakes domains such as medical or legal voice interfaces.
- The results imply that acoustic robustness and compositional reasoning should be treated as distinct training objectives rather than assumed to improve together.
Load-bearing premise
Synthetic speech produced by zero-shot voice cloning together with the selected noise profiles is representative enough of actual spoken interactions to evaluate tool-calling reliability.
What would settle it
Record the same query set with real speakers in uncontrolled environments and measure whether model error rates match the synthetic benchmark results.
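One way to run that check, sketched below under assumptions the paper does not make: score each query once on the synthetic audio and once on a real-speaker re-recording of the same text, then compare the paired correct/incorrect outcomes with McNemar's exact test. The harness, variable names, and toy outcomes are hypothetical.

```python
# Hypothetical validation harness: paired comparison of tool-call success on
# synthetic vs. real recordings of the same queries (McNemar's exact test).
from math import comb

def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant-pair counts.

    b = queries correct on synthetic audio but wrong on the real recording
    c = queries wrong on synthetic audio but correct on the real recording
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)

# Per-query outcomes (1 = correct tool call); toy values for illustration.
synthetic_ok = [1, 1, 0, 1, 0, 1, 1, 0]
real_ok      = [1, 0, 0, 1, 0, 1, 0, 0]

b = sum(1 for s, r in zip(synthetic_ok, real_ok) if s and not r)
c = sum(1 for s, r in zip(synthetic_ok, real_ok) if r and not s)
print(f"discordant pairs: b={b}, c={c}, p={mcnemar_exact(b, c):.3f}")
```

A small p-value on a matched subset would indicate that the synthetic audio systematically over- or under-states difficulty relative to real speech; agreement would support the load-bearing premise above.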
Original abstract
Voice assistants increasingly rely on Speech Language Models (SpeechLMs) to interpret spoken queries and execute complex tasks, yet existing benchmarks lack domain breadth, acoustic diversity, and compositional reasoning complexity to evaluate tool-calling performance. We introduce Audio2Tool, a large-scale dataset comprising approximately 30,000 queries designed to assess tool-calling capabilities of SpeechLMs across three primary domains: Smart Car, Smart Home, and Wearables. Our benchmark features a multi-tier complexity hierarchy, ranging from simple direct commands to complex multi-intent and needle-in-a-haystack extraction to isolate distinct failure modes. To ensure realism, we employ zero-shot voice cloning text-to-speech synthesis and diverse noise profiles to simulate in-the-wild conditions. Evaluations of state-of-the-art SpeechLMs and ASR-LLM pipelines show strong performance on simple commands but significant degradation under compositional and acoustic challenges. Code and dataset are publicly available on the project page: https://audio2tool.github.io/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Audio2Tool, a dataset of approximately 30,000 spoken queries for benchmarking tool-calling capabilities of SpeechLMs across Smart Car, Smart Home, and Wearables domains. It defines a multi-tier complexity hierarchy (simple direct commands to multi-intent and needle-in-a-haystack extraction) and generates test audio via zero-shot voice cloning TTS plus diverse noise profiles to simulate in-the-wild conditions. Evaluations of state-of-the-art SpeechLMs and ASR-LLM pipelines are reported to demonstrate strong performance on simple commands but significant degradation under compositional and acoustic challenges. The dataset and code are released publicly.
Significance. If the dataset curation is sound and the evaluations include rigorous quantitative metrics with proper controls, Audio2Tool could fill an important gap in benchmarks for speech-based tool use, a growing area for voice assistants. The public release of data and code is a clear strength that would enable community follow-up work.
major comments (2)
- [Evaluation section] The abstract states that evaluations 'show strong performance on simple commands but significant degradation under compositional and acoustic challenges,' yet supplies no concrete metrics (e.g., accuracy or F1 per tier), per-tier sample counts, statistical tests, or baseline comparisons. This absence makes the central performance claim unverifiable from the provided text and is load-bearing for the paper's empirical contribution.
- [Dataset Generation / Acoustic Simulation] The acoustic challenges are created exclusively via zero-shot voice cloning TTS followed by noise overlay, with no control experiment comparing the same textual queries rendered by TTS versus human recordings under matched noise profiles. This leaves open whether reported acoustic degradation reflects genuine in-the-wild variability or TTS-specific artifacts (e.g., prosody or spectral distortions), directly affecting the claim that the setup 'ensures realism' and 'simulates in-the-wild conditions.'
minor comments (1)
- [Abstract] The exact total query count, domain-wise breakdown, and tier-wise distribution are not stated; only the approximate figure of 30,000 is given.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on the evaluation presentation and acoustic simulation approach. We address each major comment below, indicating the revisions we will incorporate.
Point-by-point responses
- Referee: [Evaluation section] The abstract states that evaluations 'show strong performance on simple commands but significant degradation under compositional and acoustic challenges,' yet supplies no concrete metrics (e.g., accuracy or F1 per tier), per-tier sample counts, statistical tests, or baseline comparisons. This absence makes the central performance claim unverifiable from the provided text and is load-bearing for the paper's empirical contribution.
Authors: We agree that the abstract would benefit from concrete metrics to make the central claims immediately verifiable. The full manuscript's Evaluation section (Section 4) already contains the requested details: per-tier accuracy and F1 scores, exact sample counts per complexity tier, statistical significance tests, and direct comparisons against ASR-LLM baselines (see the scoring sketch after these responses). We will revise the abstract to include a concise quantitative summary of the key results (e.g., performance on simple commands versus compositional and noisy conditions) while preserving the detailed reporting in the body. Revision: yes.
- Referee: [Dataset Generation / Acoustic Simulation] The acoustic challenges are created exclusively via zero-shot voice cloning TTS followed by noise overlay, with no control experiment comparing the same textual queries rendered by TTS versus human recordings under matched noise profiles. This leaves open whether reported acoustic degradation reflects genuine in-the-wild variability or TTS-specific artifacts (e.g., prosody or spectral distortions), directly affecting the claim that the setup 'ensures realism' and 'simulates in-the-wild conditions.'
Authors: We acknowledge this as a substantive limitation of the current acoustic simulation. Zero-shot voice cloning was selected to enable scalable, reproducible generation of roughly 30,000 queries with a degree of voice diversity that would be prohibitively expensive to obtain via human recordings, and we incorporated a broad set of real-world noise profiles to approximate in-the-wild conditions. We will add an explicit paragraph to the Limitations section noting the absence of a TTS-versus-human control and the possibility of synthesis artifacts, while clarifying that the benchmark remains a controlled, publicly reproducible testbed for measuring degradation. We cannot retroactively collect matched human recordings for the full dataset at this stage. Revision: partial.
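For context on the per-tier metrics the rebuttal points to, below is a minimal sketch of one plausible scoring scheme: exact match on tool name and arguments, aggregated by complexity tier. The match criterion, field names, and tier labels are assumptions for illustration; the paper's Section 4 defines its own protocol.

```python
# Schematic per-tier tool-call accuracy; not the paper's evaluation code.
from collections import defaultdict

def call_matches(pred: dict, gold: dict) -> bool:
    """Exact match on tool name and key-value arguments (order-insensitive)."""
    return (
        pred.get("name") == gold.get("name")
        and pred.get("arguments", {}) == gold.get("arguments", {})
    )

def per_tier_accuracy(examples: list) -> dict:
    """examples: [{"tier": ..., "pred": {...}, "gold": {...}}, ...]"""
    hits, totals = defaultdict(int), defaultdict(int)
    for ex in examples:
        totals[ex["tier"]] += 1
        hits[ex["tier"]] += call_matches(ex["pred"], ex["gold"])
    return {tier: hits[tier] / totals[tier] for tier in totals}

# Toy illustration with two of the benchmark's complexity tiers.
examples = [
    {"tier": "simple",
     "pred": {"name": "open_trunk", "arguments": {}},
     "gold": {"name": "open_trunk", "arguments": {}}},
    {"tier": "multi_intent",
     "pred": {"name": "set_temperature", "arguments": {"celsius": 22}},
     "gold": {"name": "set_temperature", "arguments": {"celsius": 21}}},
]
print(per_tier_accuracy(examples))  # {'simple': 1.0, 'multi_intent': 0.0}
```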
Circularity Check
Empirical dataset release with no derivation chain or fitted predictions
Full rationale
The paper introduces Audio2Tool as a new benchmark dataset of ~30k queries generated via zero-shot TTS plus noise overlays. All reported results are direct empirical evaluations on this held-out test set; no equations, parameters, or quantities are fitted to the evaluation data and then re-presented as predictions. No self-citation chains, uniqueness theorems, or ansatzes are invoked to justify core claims. The work is therefore self-contained against external benchmarks and contains no load-bearing step that reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] "in-the-wild" (2026)
  Introduction: Recent advances in Speech Large Language Models (SpeechLMs) are transforming voice assistants from simple intent recognition systems into end-to-end, audio-native agents capable of directly invoking tools from raw speech. In this setting, the model performs audio-native function calling, mapping raw acoustic signals directly to executable AP...
- [2] Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use (arXiv, 2026)
  Background: With the growing interest in developing systems capable of end-to-end interaction, Spoken Language Understanding (SLU) has expanded to also encompass the ability to invoke executable function calls directly from speech. Going beyond traditional cascaded pipelines, recent SpeechLMs enable audio-native reasoning, allowing models to directly se...
- [3] "While these benchmarks are challenging, they do not directly evaluate executable tool invocation from speech"
  ...covered multi-domain, compositional, and multi-intent settings. While these benchmarks are challenging, they do not directly evaluate executable tool invocation from speech. From SLU semantics to executable tool calling: Traditional SLU systems produce intermediate semantic representations such as intents, slots, or structured parses. Tool calling exte...
- [4] "Finally, we generate speech with controlled perturbations to assess robustness and failure modes that are specific to speech-driven tool invocation"
  ...and audio/multi-modal tool-use benchmarks [10, 11], it organizes tools and APIs into taxonomies across three application domains and introduces an eight-tier query curriculum that progressively assesses tool selection, argument grounding, and multi-step composition. Finally, we generate speech with controlled perturbations to assess robustness and fa...
- [5] "Open the trunk"
  Benchmark: Agentic tool-oriented interactions require capabilities such as audio-native function calling and mapping speech directly to executable API calls. However, evaluating such capabilities requires benchmarks to capture the interplay of speech prosody, acoustic variability, contextual cues, and multi-domain API constraints. To address this need, w...
- [6] "To increase realism, we use state-of-the-art zero-shot voice-cloning TTS models: Qwen3TTS [19] and CosyVoice-3 [20]"
  TTS Voice Generation and Processing: The benchmark is designed to reflect realistic usage scenarios, with the taxonomy derived from tools representing real-world functionality. To increase realism, we use state-of-the-art zero-shot voice-cloning TTS models: Qwen3TTS [19] and CosyVoice-3 [20]. To capture accent diversity, we curate speakers from mul...
- [7] "Models: We evaluate two categories of speech-based systems: (i) end-to-end SpeechLMs; and (ii) modular ASR-LLM pipelines"
  Experiments, 5.1 Models: We evaluate two categories of speech-based systems: (i) end-to-end SpeechLMs; and (ii) modular ASR-LLM pipelines. For this study, we focus exclusively on open-source models. SpeechLMs. We evaluate several state-of-the-art speech language models, including Step-Audio-2 [21], AudioFlamingo-3 [22], Kimi Audio [23], Qwen-3-Omni [24], an...
- [8]
  Conclusion: In this paper, we introduced Audio2Tool, a large-scale benchmark for evaluating speech-based tool use under realistic and compositional conditions. Audio2Tool evaluates SpeechLMs across diverse application domains, acoustic variability, and increasing task complexity, ranging from direct commands to multi-intent and multi-turn interactions, an...
- [9] "After using these tools, the authors reviewed and edited the content as needed and take full responsibility for the content of the publication"
  Generative AI Use Disclosure: During the preparation of this work, the authors used Claude Sonnet 4.5, ChatGPT, and Gemini 3, in order to generate structured figures for the overview of the pipeline shown in the paper, and for polishing the grammar of the manuscript. After using these tools, the authors reviewed and edited the content as needed and tak...
- [10] Y. Chen, X. Yue, C. Zhang, X. Gao, R. T. Tan, and H. Li, "VoiceBench: Benchmarking LLM-based voice assistants," arXiv preprint arXiv:2410.17196, 2024.
- [11] L. Zhong, Z. Du, X. Zhang, H. Hu, and J. Tang, "ComplexFuncBench: Exploring multi-step and constrained function calling under long-context scenario," arXiv preprint arXiv:2501.10132, 2025.
- [12] S. G. Patil, H. Mao, F. Yan, C. C.-J. Ji, V. Suresh, I. Stoica, and J. E. Gonzalez, "The Berkeley Function Calling Leaderboard (BFCL): From tool use to agentic evaluation of large language models," in Forty-second International Conference on Machine Learning, 2025.
- [13] B. Wang, X. Zou, G. Lin, S. Sun, Z. Liu, W. Zhang, Z. Liu, A. Aw, and N. Chen, "AudioBench: A universal benchmark for audio large language models," in Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 4297–4316.
- [14] D. Jain, H. Shukla, G. Rajeev, A. Kulkarni, C. Khatri, and S. Agarwal, "VoiceAgentBench: Are voice assistants ready for agentic tasks?" arXiv preprint arXiv:2510.07978, 2025.
- [15] C. T. Hemphill, J. J. Godfrey, and G. R. Doddington, "The ATIS spoken language systems pilot corpus," in Proceedings of the Workshop on Speech and Natural Language, 1990, pp. 96–101. [Online]. Available: https://doi.org/10.3115/116580.116613
- [16] E. Bastianelli et al., "SLURP: A spoken language understanding resource package," in EMNLP, 2020.
- [17] P. Tomasello, A. Shrivastava, D. Lazar, P.-C. Hsu, D. Le, A. Sagar, A. Elkahky, J. Copet, W.-N. Hsu, Y. Adi, R. Algayres, T. A. Nguyen, E. Dupoux, L. Zettlemoyer, and A. Mohamed, "STOP: A dataset for spoken task oriented semantic parsing." [Online]. Available: https://arxiv.org/abs/2207.10643
- [19] Y. Peng, C. Cai, Z. Liu, S. Fan, S. Jiang, H. Xu, Y. Liu, Q. Chen, K. Xu, Y. Li et al., "MAC-SLU: Multi-intent automotive cabin spoken language understanding benchmark," arXiv preprint arXiv:2512.01603, 2025.
- [20] H. Mao, A. A. Ginart, and J. R. Emmons, "BFCL Audio: A benchmark for audio-native function calling," Salesforce Blog, published Aug. 22, 2025. Accessed: 2026-02-23. [Online]. Available: https://www.salesforce.com/blog/bfcl-audio-benchmark/
- [21] H. Mao, S. G. Patil, and J. E. Gonzalez, "MFCL: A multi-modal function calling evaluation for large language models," OpenReview, 2025, paper ID: 8yWECy22Zi. Accessed: 2026-02-23. [Online]. Available: https://openreview.net/forum?id=8yWECy22Zi
- [22] A. Coucke, A. Saade, A. Ball et al., "Snips voice platform: An embedded spoken language understanding system for private-by-design voice interfaces," in EMNLP, 2018.
- [23] A. T. Nguyen, T. T. U. Hoang, M. P. Tu, and X. B. Ngo, "Joint multiple intent detection and slot filling with supervised contrastive learning and self-distillation," arXiv preprint arXiv:2308.14654, 2023. [Online]. Available: https://arxiv.org/abs/2308.14654
- [24] L. Qin, X. Xu, W. Che, and T. Liu, "AGIF: An adaptive graph-interactive framework for joint multiple intent detection and slot filling," in Findings of the Association for Computational Linguistics: EMNLP 2020, 2020, pp. 1807–1816.
- [25] R. Grossman, T. Park, K. Dhawan, A. Titus, S. Zhi, Y. Shchadilova, W. Wang, J. Balam, and B. Ginsburg, "SPGISpeech 2.0: Transcribed multi-speaker financial audio for speaker-tagged transcription," arXiv preprint arXiv:2508.05554, 2025.
- [26] F. Shao, "YodasSpeakerPool: A richly-annotated multi-speaker dataset for voice cloning," https://github.com/fangningshao/YodasSpeakerPool, 2026; built from Emilia-YODAS, annotated with Gemini 2.5 Flash.
- [27] S. Zheng, L. Cheng, Y. Chen, H. Wang, and Q. Chen, "3D-Speaker: A large-scale multi-device, multi-distance, and multi-dialect corpus for speech representation disentanglement," arXiv preprint arXiv:2306.15354, 2023.
- [28] C. Wang, M. Riviere, A. Lee, A. Wu, C. Talnikar, D. Haziza, M. Williamson, J. Pino, and E. Dupoux, "VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation," in Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Co..., 2021.
- [29] H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo et al., "Qwen3-TTS technical report," arXiv preprint arXiv:2601.15621, 2026.
- [30] Z. Du, C. Gao, Y. Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shi et al., "CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training," arXiv preprint arXiv:2505.17589, 2025.
- [31] B. Wu, C. Yan, C. Hu, C. Yi, C. Feng, F. Tian, F. Shen, G. Yu, H. Zhang, J. Li et al., "Step-Audio 2 technical report," arXiv preprint arXiv:2507.16632, 2025.
- [32] A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., "Audio Flamingo 3: Advancing audio intelligence with fully open large audio language models," arXiv preprint arXiv:2507.08128, 2025.
- [33] D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang et al., "Kimi-Audio technical report," arXiv preprint arXiv:2504.18425, 2025.
- [34] J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu et al., "Qwen3-Omni technical report," arXiv preprint arXiv:2509.17765, 2025.
- [35] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
- [36] A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025.
- [37] A. Kamath, J. Ferret, S. Pathak, N. Vieillard, R. Merhej, S. Perrin, T. Matejovicova, A. Ramé, M. Rivière, L. Rouillard et al., "Gemma 3 technical report," arXiv preprint arXiv:2503.19786, vol. 4, 2025.
- [38] C. K. Reddy, E. Beyrami, J. Pool, R. Cutler, S. Srinivasan, and J. Gehrke, "A scalable noisy speech dataset and online subjective test framework," arXiv preprint arXiv:1909.08050, 2019.