pith. machine review for the scientific record.

arxiv: 2603.05094 · v3 · submitted 2026-03-05 · 💻 cs.SD

TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling

Pith reviewed 2026-05-15 15:36 UTC · model grok-4.3

classification 💻 cs.SD
keywords Taiwanese audio-text dataset · Verify-Generate-Critique · large audio-language models · localized speech · Dual-ASR validation · instruction tuning · regional corpora · TAU benchmark

The pith

A verification-curated Taiwanese audio-text dataset and dynamic arbitration strategy lifts audio-language model accuracy on localized speech from 42.6 to 49.1 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large audio-language models struggle with regional dialects and prosody because of scarce specialized training data. This paper builds TW-Sound580K, a collection of 580,000 Taiwanese audio-text instruction pairs, by applying a Verify-Generate-Critique pipeline that uses dual automatic speech recognition to filter raw clips and expand them into high-fidelity examples. The resulting Tai-LALM model, fine-tuned from an existing backbone and equipped with dynamic dual-ASR arbitration to pick the best transcription at inference time, is tested on the TAU benchmark. The work shows that adding rigorously curated regional corpora produces measurable gains in handling dialectal speech.

Core claim

We present TW-Sound580K, a 580K-pair Taiwanese audio-text instruction dataset obtained by filtering 522K raw clips with Dual-ASR validation and expanding them through a Verify-Generate-Critique protocol. Fine-tuning a DeSTA 2.5-Audio backbone on this data and applying dynamic Dual-ASR Arbitration at inference produces Tai-LALM, which reaches 49.1 percent accuracy on the TAU Benchmark compared with the 42.6 percent zero-shot baseline that uses ASR text conditioning.

What carries the argument

The Verify-Generate-Critique protocol that filters and expands raw audio clips into high-fidelity instruction pairs using Dual-ASR validation, combined with the dynamic Dual-ASR Arbitration strategy that selects the best transcription during inference.

If this is right

  • Regional audio-text corpora can close performance gaps that standard LALMs show on dialectal prosody tasks.
  • Dynamic arbitration between multiple ASR outputs improves transcription quality at inference time for audio-language models.
  • A verification-guided curation pipeline can turn raw regional recordings into usable instruction-tuning data at scale.
  • Fine-tuning on localized high-fidelity pairs yields concrete benchmark lifts beyond what zero-shot ASR conditioning achieves.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same curation approach could be applied to other under-resourced languages or dialects to test whether similar gains appear.
  • Dynamic arbitration might reduce downstream errors in any pipeline that depends on accurate speech-to-text conversion.
  • Controlled ablations that isolate the curation step from model size or training length would clarify how much of the gain is data-driven.

Load-bearing premise

That the accuracy gains on the TAU benchmark are caused by the higher-fidelity instruction pairs and arbitration mechanism rather than other differences in training data volume or evaluation setup.

What would settle it

Train the same backbone on the original unfiltered 522K raw clips without the Verify-Generate-Critique step and check whether the TAU benchmark accuracy remains near the 42.6 percent baseline instead of rising to 49.1 percent.

Figures

Figures reproduced from arXiv: 2603.05094 by Hao-Hui Xie, Ho-Lam Chung, Hung-yi Lee, Ke-Han Lu, Wenze Ren, Xie Chen, Yi-Cheng Lin.

Figure 1
Figure 1: The proposed framework for TW-Sound580K dataset construction and Tai-LALM fine-tuning, illustrating the DeSTA 2.5-Audio-based localization pipeline.
Figure 2
Figure 2: Label occurrence distribution in the TW-Sound580K dataset. Categories: Conversation, Entertainment, Education, Music, Others, Announcement, Media, Emergency, Cultural; plotted values (same order): 46.4, 17.1, 16.5, 12.4, 2.7, 2, 1.4, 0.8, 0.7.
Adjacent body text: "…tic representation $Q(z_A)$ and the text generated by the built-in ASR ($h_{gt}$): $\mathcal{L}_{\mathrm{SFT}} = -\sum_{t=1}^{T} \log P(y_t \mid y_{<t}, h_{gt}, Q(z_A); \phi)$"
Figure 3
Figure 3: Scaling law analysis demonstrating the efficacy of our localized data pipeline on the TW-Sound580K dataset.
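The SFT objective quoted in the Figure 2 text is a token-level negative log-likelihood over the target sequence. As a minimal sketch (not the authors' training code), the function below evaluates that sum for a list of per-token probabilities. The name `sft_loss` and the idea of passing probabilities directly are assumptions made for illustration; in practice the probabilities would come from the model conditioned on $h_{gt}$ and $Q(z_A)$.

```python
import math
from typing import Sequence


def sft_loss(token_probs: Sequence[float]) -> float:
    """Return -sum_t log p_t, where p_t stands for
    P(y_t | y_<t, h_gt, Q(z_A); phi) at target position t.

    `token_probs` is a hypothetical input: one probability per target token,
    each in (0, 1].
    """
    return -sum(math.log(p) for p in token_probs)


# Toy example with made-up probabilities for a two-token target.
loss = sft_loss([0.5, 0.25])
```

Lower values mean the model assigns higher probability to the reference tokens; a sequence predicted with probability 1 at every step gives a loss of zero.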
Original abstract

Large Audio-Language Models (LALMs) typically struggle with localized dialectal prosody due to the scarcity of specialized corpora. We present TW-Sound580K, a Taiwanese audio-text instruction dataset developed through a Verify-Generate-Critique (VGC) protocol. This pipeline leverages Dual-ASR validation to filter 522K raw clips, subsequently expanding them into 580,000 high-fidelity instruction pairs using a teacher model. The dataset's utility is demonstrated through Tai-LALM, which fine-tunes a DeSTA 2.5-Audio-initialized backbone and incorporates a dynamic Dual-ASR Arbitration strategy to optimize transcription selection during inference. On the TAU Benchmark, Tai-LALM reaches 49.1% accuracy, marking a 6.5% absolute improvement over the zero-shot baseline (42.6% with ASR text conditioning). This confirms that integrating regional corpora with rigorous curation and dynamic arbitration significantly enhances LALM performance on localized speech.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TW-Sound580K, a Taiwanese audio-text instruction dataset of 580K pairs derived from 522K raw clips via a Verify-Generate-Critique (VGC) protocol that employs Dual-ASR validation. The authors present Tai-LALM, obtained by fine-tuning a DeSTA 2.5-Audio backbone and augmenting it with a dynamic Dual-ASR Arbitration module at inference time. On the TAU benchmark, Tai-LALM achieves 49.1% accuracy, a 6.5-percentage-point absolute improvement over a zero-shot baseline that uses ASR text conditioning (42.6%).

Significance. If the reported gain can be causally attributed to the VGC curation rather than fine-tuning or the arbitration module alone, the dataset and protocol would constitute a useful contribution to regional audio-language modeling, addressing the scarcity of dialect-specific corpora. The work highlights a practical pipeline for expanding instruction data while maintaining fidelity.

major comments (2)
  1. The central claim that the VGC protocol and Dual-ASR validation produce higher-fidelity pairs that drive the 6.5% TAU improvement is not supported by an ablation that isolates data curation. The only quantitative comparison is between the full fine-tuned Tai-LALM (with arbitration) and a zero-shot baseline; no experiment holds architecture, training compute, and inference fixed while varying only raw 522K clips versus VGC-expanded 580K pairs.
  2. No details are supplied on training hyperparameters, data splits, optimization settings, total compute, or statistical significance testing for the 49.1% result. Without these controls, it is impossible to rule out confounding factors as the source of the observed delta.
minor comments (2)
  1. The acronym LALM is used in the abstract without prior expansion; spell out 'Large Audio-Language Models' on first use.
  2. The description of the dynamic Dual-ASR Arbitration strategy would benefit from a short pseudocode snippet or flowchart to clarify how transcription selection occurs at inference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and will revise the manuscript accordingly to strengthen the presentation of our results.

Point-by-point responses
  1. Referee: The central claim that the VGC protocol and Dual-ASR validation produce higher-fidelity pairs that drive the 6.5% TAU improvement is not supported by an ablation that isolates data curation. The only quantitative comparison is between the full fine-tuned Tai-LALM (with arbitration) and a zero-shot baseline; no experiment holds architecture, training compute, and inference fixed while varying only raw 522K clips versus VGC-expanded 580K pairs.

    Authors: We agree that an ablation isolating the VGC curation effect is required to causally attribute the gain. In the revised manuscript we will add an experiment that trains the identical DeSTA 2.5-Audio backbone on the raw 522K clips versus the VGC-curated 580K pairs while holding architecture, training compute, optimizer, and inference (including arbitration) fixed, thereby quantifying the contribution of the curation protocol. revision: yes

  2. Referee: No details are supplied on training hyperparameters, data splits, optimization settings, total compute, or statistical significance testing for the 49.1% result. Without these controls, it is impossible to rule out confounding factors as the source of the observed delta.

    Authors: We acknowledge the omission. The revised manuscript will include a dedicated experimental-setup section reporting all hyperparameters (learning rate, batch size, epochs, etc.), data splits, optimization settings, total compute, and statistical significance tests (bootstrap confidence intervals and paired tests) for the 49.1% accuracy to ensure reproducibility and rule out confounds. revision: yes
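The bootstrap confidence intervals mentioned in the response can be sketched as below. This is an illustrative percentile bootstrap under stated assumptions, not the authors' evaluation code: `bootstrap_mean_ci` is a hypothetical helper, the per-item correctness list is synthetic, and the resample count and seed are arbitrary. Passing 0/1 correctness flags gives an interval for accuracy; passing per-item differences between two systems on the same items gives a paired interval for the accuracy gap.

```python
import random
from typing import Sequence, Tuple


def bootstrap_mean_ci(
    values: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Percentile bootstrap interval for the mean of `values`.

    Returns (point_estimate, lower, upper) for a (1 - alpha) interval.
    """
    rng = random.Random(seed)
    n = len(values)
    means = []
    for _ in range(n_resamples):
        # Resample n items with replacement and record the resample mean.
        resample = [values[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int(alpha / 2 * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(values) / n, lower, upper


# Synthetic example: 200 made-up benchmark items, 98 answered correctly.
toy_correct = [1] * 98 + [0] * 102
point, lower, upper = bootstrap_mean_ci(toy_correct)
```

For a paired comparison, an interval on the per-item differences that excludes zero would suggest the gap is unlikely to be resampling noise.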

Circularity Check

0 steps flagged

No circularity: empirical benchmark result stands independently of curation protocol

full rationale

The paper reports an empirical accuracy of 49.1% for Tai-LALM on the TAU benchmark versus a 42.6% zero-shot baseline after fine-tuning on the newly collected TW-Sound580K dataset. No equations, predictions, or first-principles derivations are present that reduce to fitted parameters or self-citations by construction. The VGC protocol and Dual-ASR validation are described as data-processing steps whose output is measured directly on held-out evaluation; the observed delta is not forced by redefinition or internal fitting. The result is self-contained as an experimental outcome rather than a tautological renaming or self-referential claim.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that Dual-ASR agreement reliably indicates high-quality audio-text pairs and that the observed benchmark lift is attributable to the new data rather than training artifacts. No free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: Dual-ASR validation filters raw clips into high-fidelity instruction pairs
    Invoked in the description of the Verify-Generate-Critique pipeline as the mechanism that produces the 580K dataset.

pith-pipeline@v0.9.0 · 5496 in / 1234 out tokens · 46763 ms · 2026-05-15T15:36:28.639447+00:00 · methodology

Reference graph

Works this paper leans on

44 extracted references · 44 canonical work pages · 6 internal anchors

  1. [1]

    TW-Sound580K: A Regional Audio-Text Dataset with Verification-Guided Curation for Localized Audio-Language Modeling

    Introduction. Recent advancements in Large Audio-Language Models (LALMs) [1, 2] have improved multimodal reasoning across various speech and environmental contexts [3, 4, 5]. Despite this progress, models often underperform in culturally specific regions due to a localization gap [6]. In linguistically diverse areas like Taiwan, audio comprehension relie...

  2. [2]

    Background Foundational corpora like AudioSet [11] and LibriSpeech

  3. [3]

    acoustic long-tail

    predominantly feature standard acoustic environments and dominant accents. Similarly, large-scale Mandarin datasets such as WenetSpeech [13] favor Standard Mandarin, effectively marginalizing regional prosody and dialectal variations. While modern instruction-tuning sets [14, 4] facilitate multimodal reasoning, they focus on cross-cultural semantics [15...

  4. [4]

    Methodology. To bridge the localization gap, our data-centric pipeline is structured into four key stages: (I) Dataset Construction, (II) Training Data Generation, (III) the multimodal Training Process, and (IV) Inference featuring a dynamic Dual-ASR Arbitration mechanism. 3.1. TW-Sound580K: Socio-Functional Data Engineering for Taiwan. To mitigate repr...

  5. [5]

    To preserve speech-free soundmarks, clips where both ASRs yield empty outputs bypass the text check

    Verify (Conditional Routing): We procure transcriptions from two heterogeneous ASR engines to compute a semantic consistency score S (based on text similarity). To preserve speech-free soundmarks, clips where both ASRs yield empty outputs bypass the text check. Conversely, speech samples with S below a predefined empirical threshold τ are explicitly pruned to...

  6. [6]

    Generate (Acoustic-Constrained Distillation): A powerful native-audio Large Language Model acts as our Teacher Model. By processing raw continuous audio without referring to validated ASR transcriptions, restrictive zero-shot prompting constrains outputs to verifiable paralinguistic and environmental features, preventing cross-modal hallucinations

  7. [7]

    Hello everyone

    Critique (Self-Reflective Audit): The teacher model conducts a secondary review to prune any ungrounded descriptors from the captions. This process ensures that the Taiwan-centric instruction data is strictly anchored to actual acoustic cues while preserving the full original audio collection. 3.3. Inference-Time Perceptual Arbitration. To mitigate er...

  8. [8]

    Experimental Setup. Implementation Details: The proposed model, Tai-LALM, is developed as a localized adaptation of the DeSTA 2.5-Audio framework

    Experiments. 4.1. Experimental Setup. Implementation Details: The proposed model, Tai-LALM, is developed as a localized adaptation of the DeSTA 2.5-Audio framework. It inherits its architectural configuration and pre-trained weights directly from the DeSTA 2.5-Audio finetune stage, utilizing the Llama-3-8B-Instruct backbone. Modality alignment is facilitate...

  9. [9]

    Architectural scaling alone is insufficient for robust sound-to-meaning grounding without localized acoustic semantics

    Discussion and Limitations. Our results indicate that aligning LALMs to regional acoustics is primarily a data-centric challenge. Architectural scaling alone is insufficient for robust sound-to-meaning grounding without localized acoustic semantics. TW-Sound580K addresses this by providing region-specific pairs, enabling models to internalize loc...

  10. [10]

    Beyond the Taiwanese context, this pipeline offers a method for regional adaptation

    in identifying failure modes that are less apparent in globally aligned corpora. Beyond the Taiwanese context, this pipeline offers a method for regional adaptation. Constructing a localized dataset and applying VGC curation provides a computationally viable alternative to continual pre-training. However, transferring this pipeline to other languages ...

  11. [11]

    The performance gains underscore the necessity of the VGC pipeline for robust training-time curation and Dual-ASR arbitration for stabilizing inference

    Conclusion. This work presents TW-Sound580K and Tai-LALM, which achieves a peak accuracy of 49.1% on the TAU benchmark, outperforming the Qwen2.5-Omni baseline by 2.8%. The performance gains underscore the necessity of the VGC pipeline for robust training-time curation and Dual-ASR arbitration for stabilizing inference. By prioritizing high-fidelity ...

  12. [12]

    AudioPaLM: A Large Language Model That Can Speak and Listen

    P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, A. Bapna, Z. Borsos, F. de Chaumont Quitry, P. Chen, D. E. Badawy et al., “AudioPaLM: A large language model that can speak and listen,” arXiv preprint arXiv:2306.12925, 2023

  13. [13]

    Dynamic-SUPERB Phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks

    C.-y. Huang, W.-C. Chen, S.-w. Yang, A. T. Liu, C.-A. Li, Y.-X. Lin, W.-C. Tseng, A. Diwan et al., “Dynamic-SUPERB Phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks,” in International Conference on Learning Representations (ICLR), 2025

  14. [14]

    Listen, think, and understand

    Y. Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. Glass, “Listen, think, and understand,” in International Conference on Learning Representations (ICLR), 2024

  15. [15]

    SALMONN: Towards generic hearing abilities for large language models

    C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma et al., “SALMONN: Towards generic hearing abilities for large language models,” in International Conference on Learning Representations (ICLR), 2024

  16. [16]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Y. Chu, J. Xu, X. Zhou, Q. Yang, S. Zhang, Z. Yan, C. Zhou, and J. Zhou, “Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023

  17. [17]

    CultureLLM: Incorporating cultural differences into large language models

    C. Li, M. Chen, J. Wang, S. Sitaram, and X. Xie, “CultureLLM: Incorporating cultural differences into large language models,” in Advances in Neural Information Processing Systems (NeurIPS), 2024

  18. [18]

    Universal paralinguistic speech representations using self-supervised conformers

    J. Shor, A. Jansen, W. Han, D. Park, and Y. Zhang, “Universal paralinguistic speech representations using self-supervised conformers,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

  19. [19]

    Building a Taiwanese Mandarin spoken language model: A first attempt

    C.-K. Yang, Y.-K. Fu, C.-A. Li, Y.-C. Lin, Y.-X. Lin, W.-C. Chen, H. L. Chung, C.-Y. Kuan et al., “Building a Taiwanese Mandarin spoken language model: A first attempt,” arXiv preprint arXiv:2411.07111, 2024

  20. [20]

    Understanding sounds, missing the questions: The challenge of object hallucination in large audio-language models

    C.-Y. Kuan, W.-P. Huang, and H.-y. Lee, “Understanding sounds, missing the questions: The challenge of object hallucination in large audio-language models,” in Interspeech, 2024

  21. [21]

    Mitigating subgroup disparities in multi-label speech emotion recognition: A pseudo-labeling and unsupervised learning approach

    Y.-C. Lin, H.-C. Chou, and H.-y. Lee, “Mitigating subgroup disparities in multi-label speech emotion recognition: A pseudo-labeling and unsupervised learning approach,” in Interspeech, 2025

  22. [22]

    AudioSet: An ontology and human-labeled dataset for audio events

    J. F. Gemmeke, D. P. W. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “AudioSet: An ontology and human-labeled dataset for audio events,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017

  23. [23]

    LibriSpeech: An ASR corpus based on public domain audio books

    V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015

  24. [24]

    WenetSpeech: A 10000+ hours multi-domain mandarin corpus for speech recognition

    B. Zhang, H. Lv, P. Guo, Q. Shao, C. Yang, L. Xie, X. Xu, H. Bu, X. Chen, C. Zeng, D. Wu, and Z. Peng, “WenetSpeech: A 10000+ hours multi-domain mandarin corpus for speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022

  25. [25]

    AudioGen: Textually guided audio generation

    F. Kreuk, G. Synnaeve, A. Polyak, U. Singer, A. Défossez, J. Copet, D. Parikh, Y. Taigman, and Y. Adi, “AudioGen: Textually guided audio generation,” in International Conference on Learning Representations (ICLR), 2023

  26. [26]

    When audio and text disagree: Revealing text bias in large audio-language models

    C. Wang, G. Deng, X. Yang, H. Qiu, and T. Zhang, “When audio and text disagree: Revealing text bias in large audio-language models,” arXiv preprint arXiv:2508.15407, 2025

  27. [27]

    WoW-Bench: Evaluating fine-grained acoustic perception in audio-language models via marine mammal vocalizations

    J. Kim, H. Yun, S. H. Woo, C.-H. H. Yang, and G. Kim, “WoW-Bench: Evaluating fine-grained acoustic perception in audio-language models via marine mammal vocalizations,” arXiv preprint arXiv:2508.20976, 2025

  28. [28]

    KeSpeech: An open source speech dataset of Mandarin and its eight subdialects

    Z. Tang, D. Wang, Y. Xu, J. Sun, X. Lei, S. Zhao, C. Wen, X. Tan, C. Xie, S. Zhou, R. Yan, C. Lv, Y. Han, W. Zou, and X. Li, “KeSpeech: An open source speech dataset of Mandarin and its eight subdialects,” in Advances in Neural Information Processing Systems (NeurIPS) Datasets and Benchmarks Track, 2021

  29. [29]

    LESS: Large language model enhanced semi-supervised learning for speech foundational models using in-the-wild data

    W. Ding and F. Qian, “LESS: Large language model enhanced semi-supervised learning for speech foundational models using in-the-wild data,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026

  30. [30]

    Training language models to follow instructions with human feedback

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. L. Wainwright, P. Mishkin, C. Zhang, S. Agarwal et al., “Training language models to follow instructions with human feedback,” in Advances in Neural Information Processing Systems (NeurIPS), 2022

  31. [31]

    Data-centric lessons to improve speech-language pretraining

    V. Udandarao, Z. Lu, X. Chang, Y. Wang, V. Z. Yao, A. M. Jose, F. Faghri, J. Gardner et al., “Data-centric lessons to improve speech-language pretraining,” arXiv preprint arXiv:2510.20860, 2025

  32. [32]

    Reducing object hallucination in large audio-language models via audio-aware decoding

    T.-w. Hsu, K.-H. Lu, C.-H. Chiang, and H.-y. Lee, “Reducing object hallucination in large audio-language models via audio-aware decoding,” in IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2025

  33. [33]

    Beyond transcription: Mechanistic interpretability in ASR

    N. Glazer, Y. Segal-Feldman, H. Segev, A. Shamsian, A. Buchnick, G. Hetz, E. Fetaya, J. Keshet et al., “Beyond transcription: Mechanistic interpretability in ASR,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2026

  34. [34]

    TAU: A benchmark for cultural sound understanding beyond semantics

    Y.-C. Lin, Y.-H. Chen, J.-K. Dong, Y.-H. Huang, S.-C. Chen, Y.-C. Chen, C.-Y. Chen, Y.-J. Lin, Y.-L. Chen, Z.-Y. Chen, I.-N. Tsai, H.-H. Wang, H.-L. Chung, K.-H. Lu, and H.-y. Lee, “TAU: A benchmark for cultural sound understanding beyond semantics,” in Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (...

  35. [35]

    DeSTA2.5-Audio: Toward general-purpose large audio language model with self-generated cross-modal alignment

    K.-H. Lu, Z. Chen, S.-W. Fu, C.-H. H. Yang, S.-F. Huang, C.-K. Yang, C.-E. Yu et al., “DeSTA2.5-Audio: Toward general-purpose large audio language model with self-generated cross-modal alignment,” IEEE Transactions on Audio, Speech and Language Processing, 2026

  36. [36]

    Attention-passing models for robust and data-efficient end-to-end speech translation

    M. Sperber, G. Neubig, J. Niehues, and A. Waibel, “Attention-passing models for robust and data-efficient end-to-end speech translation,” Transactions of the Association for Computational Linguistics (TACL), 2019

  37. [37]

    Lost in transcription, found in distribution shift: Demystifying hallucination in speech foundation models

    H. Atwany, A. Waheed, R. Singh, M. Choudhury, and B. Raj, “Lost in transcription, found in distribution shift: Demystifying hallucination in speech foundation models,” in Findings of the Association for Computational Linguistics: ACL, 2025

  38. [38]

    Teaching audio-aware large language models what does not hear: Mitigating hallucinations through synthesized negative samples

    C.-Y. Kuan and H.-y. Lee, “Teaching audio-aware large language models what does not hear: Mitigating hallucinations through synthesized negative samples,” in Interspeech, 2025

  39. [39]

    Hallucinations in neural automatic speech recognition: Identifying errors and hallucinatory models

    R. Frieske and B. E. Shi, “Hallucinations in neural automatic speech recognition: Identifying errors and hallucinatory models,” arXiv preprint arXiv:2401.01572, 2024

  40. [40]

    Robust speech recognition via large-scale weak supervision

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proceedings of the 40th International Conference on Machine Learning (ICML), 2023

  41. [41]

    FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs

    K. An, Q. Chen, C. Deng, Z. Du, C. Gao, Z. Gao, Y. Gu, T. He et al., “FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs,” arXiv preprint arXiv:2407.04051, 2024

  42. [42]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  43. [43]

    Qwen2.5-Omni Technical Report

    J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang et al., “Qwen2.5-Omni technical report,” arXiv preprint arXiv:2503.20215, 2025

  44. [44]

    Qwen2-Audio Technical Report

    Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin, C. Zhou, and J. Zhou, “Qwen2-Audio technical report,” arXiv preprint arXiv:2407.10759, 2024
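
The Verify step quoted in entry 5 of the reference graph (conditional routing on a Dual-ASR consistency score S and a threshold τ) can be sketched roughly as follows. This is a sketch under stated assumptions, not the paper's implementation: the excerpt says only that S is "based on text similarity" and that τ is a predefined empirical threshold, so the character-level `difflib` ratio, the value `tau=0.6`, and the names `consistency_score` and `verify_clip` are all hypothetical stand-ins.

```python
from difflib import SequenceMatcher


def consistency_score(text_a: str, text_b: str) -> float:
    """Character-level similarity in [0, 1]; an assumed stand-in for score S."""
    return SequenceMatcher(None, text_a, text_b).ratio()


def verify_clip(asr_a: str, asr_b: str, tau: float = 0.6) -> str:
    """Route one clip given the transcripts of two ASR engines.

    tau is a hypothetical value; the source does not state the threshold.
    """
    a, b = asr_a.strip(), asr_b.strip()
    if not a and not b:
        # Both engines heard no speech: bypass the text check so that
        # speech-free soundmarks are preserved.
        return "keep_non_speech"
    if consistency_score(a, b) < tau:
        # The engines disagree (including one empty, one non-empty): prune.
        return "prune"
    return "keep_speech"


# Toy transcripts, made up for illustration.
decisions = [
    verify_clip("", ""),
    verify_clip("大家好", "大家好"),
    verify_clip("abc", "xyz"),
]
```

A character-level ratio is a convenient default for Chinese text because it needs no tokenizer; an embedding-based similarity would be closer to a "semantic" score but requires an external model.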