pith. sign in

arxiv: 2507.23511 · v3 · submitted 2025-07-31 · 📡 eess.AS · cs.AI· cs.CL· cs.SD

MECAT: A Multi-Experts Constructed Benchmark for Fine-Grained Audio Understanding Tasks

Pith reviewed 2026-05-19 02:35 UTC · model grok-4.3

classification 📡 eess.AS cs.AIcs.CLcs.SD
keywords audio understandingbenchmark datasetfine-grained evaluationaudio-language modelsdiscriminative metricmulti-expert annotation
0
0 comments X

The pith

MECAT introduces a benchmark for fine-grained audio understanding using multi-expert construction and a new DATE evaluation metric.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents MECAT as a new benchmark designed to address limitations in current audio understanding evaluations that cannot distinguish generic from detailed model responses. It is built through a pipeline that uses specialized expert models combined with chain-of-thought reasoning in large language models to produce multi-perspective fine-grained captions and open-set question-answering pairs. The accompanying DATE metric evaluates outputs by integrating semantic similarity for a single sample with discriminability across samples to favor detailed and unique descriptions. Comprehensive tests on state-of-the-art audio models highlight their current strengths and weaknesses in handling nuanced audio content. This setup aims to push models toward more human-like comprehension of audio details.

Core claim

MECAT is constructed via a pipeline integrating analysis from specialized expert models with Chain-of-Thought large language model reasoning, providing multi-perspective, fine-grained captions and open-set question-answering pairs for audio understanding tasks, evaluated with the DATE metric that penalizes generic terms and rewards detailed descriptions through single-sample semantic similarity combined with cross-sample discriminability.

What carries the argument

The multi-expert construction pipeline that merges outputs from specialized audio expert models with chain-of-thought reasoning from large language models to generate accurate fine-grained annotations.

If this is right

  • State-of-the-art audio models can be more precisely evaluated for their ability to produce detailed rather than generic descriptions.
  • New insights into model limitations in capturing nuanced audio elements are provided.
  • Future audio-language model development can target improvements measured by the DATE metric.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar multi-expert pipelines could be applied to create benchmarks in other sensory domains like video or multimodal data.
  • Adoption of DATE-like metrics might improve evaluation standards across language model benchmarks in general.
  • The open-set QA pairs could support training models directly for better fine-grained understanding.

Load-bearing premise

The pipeline integrating specialized expert models with chain-of-thought large language model reasoning reliably produces accurate and unbiased fine-grained annotations that match true human distinctions in audio content.

What would settle it

Human evaluation where listeners compare the generated fine-grained captions and questions against the actual audio to check if they capture details that generic annotations miss, or if they introduce inaccuracies.

Figures

Figures reproduced from arXiv: 2507.23511 by Gang Li, Heinrich Dinkel, Jiahao Zhou, Jian Luan, Jizhong Liu, Junbo Zhang, Tianzi Wang, Xingwei Sun, Xunying Liu, Yadong Niu.

Figure 1
Figure 1. Figure 1: Overview of the MECAT Benchmark. descriptions. This design enables DATE to robustly distin￾guish between superficial and context-rich model outputs. Related works Audio Captioning Benchmark Audio captioning benchmarks have been pivotal in ad￾vancing audio understanding works (Wu, Dinkel, and Yu 2019; Kim et al. 2019; Drossos, Lipping, and Virtanen 2020; Yuan et al. 2025; Manco et al. 2023; Liu et al. 2024a… view at source ↗
Figure 2
Figure 2. Figure 2: Distribution of audio samples across extended [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Domain Experts for Speech, Music, and Acoustic Properties. [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: t-SNE plots of MECAT audio embeddings compared to other benchmarks I), further clustered by domain II). Caption [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗
Figure 6
Figure 6. Figure 6: Cumulative Distribution Functions (CDF) of [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗
Figure 5
Figure 5. Figure 5: Example predictions from different models [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗
read the original abstract

While large audio-language models have advanced open-ended audio understanding, they still fall short of nuanced human-level comprehension. This gap persists largely because current benchmarks, limited by data annotations and evaluation metrics, fail to reliably distinguish between generic and highly detailed model outputs. To this end, this work introduces MECAT, a Multi-Expert Constructed Benchmark for Fine-Grained Audio Understanding Tasks. Generated via a pipeline that integrates analysis from specialized expert models with Chain-of-Thought large language model reasoning, MECAT provides multi-perspective, fine-grained captions and open-set question-answering pairs. The benchmark is complemented by a novel metric: DATE (Discriminative-Enhanced Audio Text Evaluation). This metric penalizes generic terms and rewards detailed descriptions by combining single-sample semantic similarity with cross-sample discriminability. A comprehensive evaluation of state-of-the-art audio models is also presented, providing new insights into their current capabilities and limitations. The data and code are available at https://github.com/xiaomi-research/mecat

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MECAT, a benchmark for fine-grained audio understanding tasks generated via a pipeline that integrates specialized expert models with Chain-of-Thought LLM reasoning to produce multi-perspective captions and open-set QA pairs. It also proposes the DATE metric, which evaluates outputs by combining single-sample semantic similarity with cross-sample discriminability to penalize generic terms and reward detailed descriptions, and reports evaluations of state-of-the-art audio models.

Significance. If the generated annotations prove reliable, MECAT and DATE could meaningfully advance audio-language model evaluation by addressing the inability of existing benchmarks to distinguish nuanced from generic outputs. The public release of data and code at the provided GitHub link is a clear strength that supports reproducibility.

major comments (2)
  1. [Abstract] Abstract and benchmark construction description: the central claim that the multi-expert + CoT pipeline produces accurate fine-grained annotations reflecting human-level distinctions lacks any reported quantitative human agreement study, inter-annotator comparison, or systematic error analysis on the outputs of the expert models and LLM.
  2. [Evaluation] Evaluation section: no detailed results, ablations, or comparisons are provided to demonstrate that DATE offers a measurable advantage over standard metrics in distinguishing model capabilities on fine-grained tasks.
minor comments (2)
  1. [Abstract] The abstract would benefit from including basic statistics such as the number of audio clips or total annotations to give readers immediate context on benchmark scale.
  2. [Metric Definition] Ensure consistent definition of the DATE acronym and its components on first use in the main text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, acknowledging areas where additional evidence would strengthen the work, and describe the revisions we plan to incorporate.

read point-by-point responses
  1. Referee: [Abstract] Abstract and benchmark construction description: the central claim that the multi-expert + CoT pipeline produces accurate fine-grained annotations reflecting human-level distinctions lacks any reported quantitative human agreement study, inter-annotator comparison, or systematic error analysis on the outputs of the expert models and LLM.

    Authors: We agree that a quantitative human validation study would provide stronger support for the reliability of the generated annotations. The manuscript currently emphasizes the design of the multi-expert pipeline combined with Chain-of-Thought reasoning to capture fine-grained distinctions, without including formal inter-annotator agreement metrics or a systematic error analysis. In the revised manuscript, we will add a dedicated human evaluation subsection. This will include agreement statistics (e.g., Cohen's kappa or Krippendorff's alpha) computed on a sampled subset of captions and QA pairs, along with qualitative error categorization comparing expert-model outputs against human judgments. revision: yes

  2. Referee: [Evaluation] Evaluation section: no detailed results, ablations, or comparisons are provided to demonstrate that DATE offers a measurable advantage over standard metrics in distinguishing model capabilities on fine-grained tasks.

    Authors: We acknowledge that the current evaluation could more explicitly demonstrate the advantages of DATE. The manuscript reports model rankings using DATE in conjunction with other metrics, but does not present ablations isolating the discriminability term or head-to-head comparisons showing superior correlation with human preference for detailed outputs. In the revision, we will expand the evaluation section with (i) an ablation study removing the cross-sample discriminability component, (ii) quantitative comparisons of DATE versus BLEU, CIDEr, and SPICE on the same model outputs, and (iii) a small-scale human preference study measuring which metric better ranks detailed versus generic responses. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmark pipeline and DATE metric rely on external models and explicit definitions

full rationale

The paper constructs MECAT via an external pipeline of specialized expert models plus Chain-of-Thought LLM reasoning and defines the DATE metric as an explicit combination of single-sample semantic similarity with cross-sample discriminability. No derivation step reduces by the paper's own equations or self-citations to quantities fitted inside the work; the central claims rest on the described external components and the novel but non-self-referential metric definition, rendering the chain self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that expert models deliver trustworthy complementary audio analyses and that LLM chain-of-thought synthesis converts them into reliable fine-grained labels without introducing systematic errors.

axioms (2)
  • domain assumption Specialized expert models provide accurate and complementary analysis of audio content.
    The generation pipeline depends on the outputs of these models being sufficiently reliable to serve as the foundation for fine-grained annotations.
  • domain assumption Chain-of-Thought reasoning in large language models can synthesize multi-perspective expert analyses into coherent, accurate fine-grained captions and QA pairs.
    Invoked in the abstract as the mechanism that turns expert outputs into the benchmark data.

pith-pipeline@v0.9.0 · 5743 in / 1382 out tokens · 50291 ms · 2026-05-19T02:35:16.410224+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Omni-Embed-Audio: Leveraging Multimodal LLMs for Robust Audio-Text Retrieval

    cs.SD 2026-04 unverdicted novelty 6.0

    Omni-Embed-Audio uses multimodal LLMs to match CLAP on standard audio retrieval while improving text-to-text retrieval by 22% relative and hard negative discrimination by 4.3 points HNSR@10 on user-intent queries.

  2. AU-Harness: An Open-Source Toolkit for Holistic Evaluation of Audio LLMs

    cs.SD 2025-09 unverdicted novelty 6.0

    AU-Harness introduces an efficient unified evaluation framework for audio LLMs featuring batch optimizations, multi-turn dialogue support, and standardized protocols for fair comparisons.

Reference graph

Works this paper leans on

63 extracted references · 63 canonical work pages · cited by 2 Pith papers · 5 internal anchors

  1. [1]

    , " * write output.state after.block = add.period write newline

    ENTRY address archivePrefix author booktitle chapter edition editor eid eprint howpublished institution isbn journal key month note number organization pages publisher school series title type volume year label extra.label sort.label short.list INTEGERS output.state before.all mid.sentence after.sentence after.block FUNCTION init.state.consts #0 'before.a...

  2. [2]

    write newline

    " write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION word.in bbl.in capitalize " " * FUNCT...

  3. [3]

    Bredin, H. 2023. pyannote. audio 2.1 speaker diarization pipeline: principle, benchmark, and recipe. In Proceedings of the 24th Interspeech Conference (interspeech), 1983--1987. ISCA

  4. [4]

    Burkhardt, F.; Wagner, J.; Wierstorf, H.; Eyben, F.; and Schuller, B. 2023. Speech-based age and gender prediction with transformers. In Speech Communication; 15th ITG Conference, 46--50. VDE

  5. [5]

    Dinkel, H.; Wang, Y.; Yan, Z.; Zhang, J.; and Wang, Y. 2024. CED: Consistent ensemble distillation for audio tagging. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 291--295. IEEE

  6. [6]

    Ghosh, S.; Kong, Z.; Kumar, S.; Sakshi, S.; Kim, J.; Ping, W.; Valle, R.; Manocha, D.; and Catanzaro, B. 2025. Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities. In Proceedings of the 40th International Conference on Machine Learning (ICML), 1--48

  7. [7]

    Hawley, S. H. 2023. SHAART: Speech and Hearing in Audio and Real Time. https://github.com/drscotthawley/SHAART

  8. [8]

    Kang, J.; and Herremans, D. 2025. Towards Unified Music Emotion Recognition across Dimensional and Categorical Models. arXiv:2502.03979

  9. [9]

    Kim, T.; and Nam, J. 2023. All-In-One Metrical And Functional Structure Analysis With Neighborhood Attentions on Demixed Audio. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)

  10. [10]

    Kong, Q.; Cao, Y.; Liu, H.; Choi, K.; and Wang, Y. 2021. Decoupling Magnitude and Phase Estimation with Deep ResUNet for Music Source Separation. In Proceedings of the 22nd International Society for Music Information Retrieval Conference (ISMIR), 342--349. Citeseer

  11. [11]

    Ma, Z.; Zheng, Z.; Ye, J.; Li, J.; Gao, Z.; Zhang, S.; and Chen, X. 2023. Emotion2vec: Self-Supervised Pre-Training for Speech Emotion Representation. arXiv preprint arXiv:2312.15185

  12. [12]

    Mittag, G.; Naderi, B.; Chehadi, A.; and M \"o ller, S. 2021. NISQA: A Deep CNN-Self-Attention Model for Multidimensional Speech Quality Prediction with Crowdsourced Datasets. In Proceedings of the 22nd Interspeech Conference (interspeech), 2127--2131

  13. [13]

    W.; Xu, T.; Brockman, G.; McLeavey, C.; and Sutskever, I

    Radford, A.; Kim, J. W.; Xu, T.; Brockman, G.; McLeavey, C.; and Sutskever, I. 2023. Robust speech recognition via large-scale weak supervision. In Proceedings of the 40th International Conference on Machine Learning (ICML), 28492--28518

  14. [14]

    SpeechBrain: A general- purpose speech toolkit

    Ravanelli, M.; Parcollet, T.; Plantinga, P.; Rouhe, A.; Cornell, S.; Lugosch, L.; Subakan, C.; Dawalatabad, N.; Heba, A.; Zhong, J.; Chou, J.-C.; Yeh, S.-L.; Fu, S.-W.; Liao, C.-F.; Rastorgueva, E.; Grondin, F.; Aris, W.; Na, H.; Gao, Y.; Mori, R. D.; and Bengio, Y. 2021. SpeechBrain : A General-Purpose Speech Toolkit. ArXiv:2106.04624, arXiv:2106.04624

  15. [15]

    Reddy, C. K. e. a. 2021. DNSMOS: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6493--6497. IEEE

  16. [16]

    Reddy, C. K. e. a. 2022. DNSMOS P. 835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 886--890. IEEE

  17. [17]

    Wagner, J.; Triantafyllopoulos, A.; Wierstorf, H.; Schmitt, M.; Burkhardt, F.; Eyben, F.; and Schuller, B. W. 2023. Dawn of the transformer era in speech emotion recognition: closing the valence gap. IEEE Transactions on Pattern Analysis and Machine Intelligence, 45(9): 10745--10759

  18. [18]

    Yizhi, L.; Yuan, R.; Zhang, G.; Ma, Y.; Chen, X.; Yin, H.; Xiao, C.; Lin, C.; Ragni, A.; Benetos, E.; et al. 2023. MERT: Acoustic music understanding model with large-scale self-supervised training. In Proceedings of the 12th International Conference on Learning Representations (ICLR), 1--24

  19. [19]

    Zuluaga-Gomez, J.; Ahmed, S.; Visockas, D.; and Subakan, C. 2023. CommonAccent: Exploring Large Acoustic Pretrained Models for Accent Classification Based on Common Voice. In Proceedings of the 24th Interspeech Conference (interspeech), 5291--5295. ISCA

  20. [20]

    Anderson, P.; Fernando, B.; Johnson, M.; and Gould, S. 2016. Spice: Semantic propositional image caption evaluation. In Proceedings of the European conference on computer vision, 382--398. Springer

  21. [21]

    Bara \'n ski, M.; Jasi \'n ski, J.; Bartolewska, J.; Kacprzak, S.; Witkowski, M.; and Kowalczyk, K. 2025. Investigation of whisper asr hallucinations induced by non-speech audio. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1--5. IEEE

  22. [22]

    Chen, F.; Han, M.; Zhao, H.; Zhang, Q.; Shi, J.; Xu, S.; and Xu, B. 2023. X-LLM : Bootstrapping Advanced Large Language Models by Treating Multi-Modalities as Foreign Languages. arXiv preprint arXiv:2305.04160

  23. [23]

    Chu, Y.; Xu, J.; Zhou, X.; Yang, Q.; Zhang, S.; Yan, Z.; Zhou, C.; and Zhou, J. 2023. Qwen-Audio : Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models. arXiv preprint arXiv:2311.07919

  24. [24]

    Deshmukh, S.; Elizalde, B.; Singh, R.; and Wang, H. 2023. Pengi: An Audio Language Model for Audio Tasks. In Oh, A.; Naumann, T.; Globerson, A.; Saenko, K.; Hardt, M.; and Levine, S., eds., Advances in Neural Information Processing Systems, volume 36, 18090--18108. Curran Associates, Inc

  25. [25]

    Dinkel, H.; Wang, Y.; Yan, Z.; Zhang, J.; and Wang, Y. 2024 a . CED: Consistent ensemble distillation for audio tagging. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 291--295. IEEE

  26. [26]

    Dinkel, H.; Yan, Z.; Wang, T.; Wang, Y.; Sun, X.; Niu, Y.; Liu, J.; Li, G.; Zhang, J.; and Luan, J. 2025. GLAP: General contrastive audio-text pretraining across domains and languages. arXiv:arXiv preprint arXiv:2506.11350

  27. [27]

    Dinkel, H.; Yan, Z.; Wang, Y.; Zhang, J.; Wang, Y.; and Wang, B. 2024 b . Scaling up masked audio encoder learning for general audio classification. In Proceedings of the 25th Interspeech Conference (interspeech), 547--551

  28. [28]

    Doh, S.; and Nam, J. 2023. LP-MusicCaps: LLM-Based Pseudo Music Captioning. In Proceedings of the 24th International Society for Music Information Retrieval Conference. International Society for Music Information Retrieval Conference

  29. [29]

    Drossos, K.; Lipping, S.; and Virtanen, T. 2020. Clotho: an Audio Captioning Dataset. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 736--740. IEEE

  30. [30]

    Du, Z.; Wang, J.; Chen, Q.; Chu, Y.; Gao, Z.; Li, Z.; Hu, K.; Zhou, X.; Xu, J.; Ma, Z.; et al. 2023. LauraGPT : Listen, Attend, Understand, and Regenerate Audio with GPT . arXiv preprint arXiv:2310.04673

  31. [31]

    F.; Ellis, D

    Gemmeke, J. F.; Ellis, D. P.; Freedman, D.; Jansen, A.; Lawrence, W.; Moore, R. C.; Plakal, M.; and Ritter, M. 2017. Audio Set: An Ontology and Human-labeled Dataset for Audio Events. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 776--780. IEEE

  32. [32]

    Guo, D.; Yang, D.; Zhang, H.; Song, J.; Zhang, R.; Xu, R.; Zhu, Q.; Ma, S.; Wang, P.; Bi, X.; et al. 2025. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. arXiv preprint arXiv:2501.12948

  33. [33]

    Hu, S.; Zhou, L.; Liu, S.; Chen, S.; Meng, L.; Hao, H.; Pan, J.; Liu, X.; Li, J.; Sivasankaran, S.; et al. 2024. WavLLM: Towards Robust and Adaptive Speech Large Language Model. In Proceedings of the Findings of the Association for Computational Linguistics (EMNLP), 4552--4572

  34. [34]

    Huang, R.; Li, M.; Yang, D.; Shi, J.; Chang, X.; Ye, Z.; Wu, Y.; Hong, Z.; Huang, J.; Liu, J.; et al. 2024. Audiogpt: Understanding and generating speech, music, sound, and talking head. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, 23802--23804

  35. [35]

    D.; Kim, B.; Lee, H.; and Kim, G

    Kim, C. D.; Kim, B.; Lee, H.; and Kim, G. 2019. Audiocaps: Generating captions for audios in the wild. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 119--132

  36. [36]

    Kim, J.; Jung, J.; Lee, J.; and Woo, S. H. 2024. EnCLAP: Combining Neural Audio Codec and Audio-Text Joint Embedding for Automated Audio Captioning. In ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 6735--6739

  37. [37]

    KimiTeam; Ding, D.; Ju, Z.; Leng, Y.; Liu, S.; Liu, T.; Shang, Z.; Shen, K.; Song, W.; Tan, X.; Tang, H.; Wang, Z.; Wei, C.; Xin, Y.; Xu, X.; Yu, J.; Zhang, Y.; Zhou, X.; Charles, Y.; Chen, J.; Chen, Y.; Du, Y.; He, W.; Hu, Z.; Lai, G.; Li, Q.; Liu, Y.; Sun, W.; Wang, J.; Wang, Y.; Wu, Y.; Wu, Y.; Yang, D.; Yang, H.; Yang, Y.; Yang, Z.; Yin, A.; Yuan, R.;...

  38. [38]

    Lee, S.; Chung, J.; Yu, Y.; Kim, G.; Breuel, T.; Chechik, G.; and Song, Y. 2021. Acav100m: Automatic curation of large-scale datasets for audio-visual video representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 10274--10284

  39. [39]

    Lee, Y.; Park, I.; and Kang, M. 2024. FLEUR: An Explainable Reference-Free Evaluation Metric for Image Captioning Using a Large Multimodal Model. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 3732--3746

  40. [40]

    Li, G.; Wei, Y.; Tian, Y.; Xu, C.; Wen, J.-R.; and Hu, D. 2022. Learning to answer questions in dynamic audio-visual scenarios. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 19108--19118

  41. [41]

    Lipping, S.; Sudarsanam, P.; Drossos, K.; and Virtanen, T. 2022. Clotho-aqa: A crowdsourced dataset for audio question answering. In Proceedings of the 30th European Signal Processing Conference (EUSIPCO), 1140--1144. IEEE

  42. [42]

    Liu, J.; Li, G.; Zhang, J.; Dinkel, H.; Wang, Y.; Yan, Z.; Wang, Y.; and Wang, B. 2024 a . Enhancing Automated Audio Captioning via Large Language Models with Optimized Audio Encoding. In Proceedings of the 25th Interspeech Conference (interspeech), 1135--1139

  43. [43]

    Liu, J.; Li, G.; Zhang, J.; Liu, C.; Dinkel, H.; Wang, Y.; Yan, Z.; Wang, Y.; and Wang, B. 2024 b . Leveraging ced encoder and large language models for automated audio captioning. Proceedings of the DCASE Challenge, 1--4

  44. [44]

    Lyon, R. F. 2017. Human and machine hearing. Cambridge University Press

  45. [45]

    Ma, Z.; Ma, Y.; Zhu, Y.; Yang, C.; Chao, Y.-W.; Xu, R.; Chen, W.; Chen, Y.; Chen, Z.; Cong, J.; et al. 2025. MMAR: A Challenging Benchmark for Deep Reasoning in Speech, Audio, Music, and Their Mix. arXiv preprint arXiv:2505.13032

  46. [46]

    Manco, I.; Weck, B.; Doh, S.; Won, M.; Zhang, Y.; Bogdanov, D.; Wu, Y.; Chen, K.; Tovstogan, P.; Benetos, E.; et al. 2023. The Song Describer Dataset: a Corpus of Audio Captions for Music-and-Language Evaluation. arXiv preprint arXiv:2311.10057

  47. [47]

    InNeuRIPS Workshop on Self- Supervised Learning for Speech and Audio Process- ing

    Pandey, P.; Swaminathan, R. V.; Girish, K.; Sen, A.; Xie, J.; Strimel, G. P.; and Schwarz, A. 2025. SIFT-50m: A large-scale multilingual dataset for speech instruction fine-tuning. arXiv preprint arXiv:2504.09081

  48. [48]

    Papineni, K.; Roukos, S.; Ward, T.; and Zhu, W.-J. 2002. Bleu: a method for automatic evaluation of machine translation. In Proceedings of the 40th annual meeting of the Association for Computational Linguistics, 311--318

  49. [49]

    Plack, C. J. 2023. The sense of hearing. Routledge

  50. [50]

    Reimers, N.; and Gurevych, I. 2019. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics

  51. [51]

    AudioPaLM: A Large Language Model That Can Speak and Listen

    Rubenstein, P. K.; Asawaroengchai, C.; Nguyen, D. D.; Bapna, A.; Borsos, Z.; Quitry, F. d. C.; Chen, P.; Badawy, D. E.; Han, W.; Kharitonov, E.; et al. 2023. AudioPaLM : A Large Language Model That Can Speak and Listen. arXiv preprint arXiv:2306.12925

  52. [52]

    Sakshi, S.; Tyagi, U.; Kumar, S.; Seth, A.; Selvakumar, R.; Nieto, O.; Duraiswami, R.; Ghosh, S.; and Manocha, D. 2025. MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark. In Proceedings of the 13th International Conference on Learning Representations (ICLR), 1--36

  53. [53]

    Shu, Y.; Dong, S.; Chen, G.; Huang, W.; Zhang, R.; Shi, D.; Xiang, Q.; and Shi, Y. 2023. LLASM : Large Language and Speech Model. arXiv preprint arXiv:2308.15930

  54. [54]

    Sun, L.; Xu, X.; Wu, M.; and Xie, W. 2024. Auto-ACD: A large-scale dataset for audio-language representation learning. In Proceedings of the 32nd ACM International Conference on Multimedia, 5025--5034

  55. [55]

    Tang, C.; Yu, W.; Sun, G.; Chen, X.; Tan, T.; Li, W.; Lu, L.; MA, Z.; and Zhang, C. 2024. SALMONN: Towards Generic Hearing Abilities for Large Language Models. In Proceedings of the 20th International Conference on Learning Representations (ICLR), 1--23

  56. [56]

    Vedantam, R.; Lawrence Zitnick, C.; and Parikh, D. 2015. Cider: Consensus-based image description evaluation. In Proceedings of the IEEE conference on computer vision and pattern recognition, 4566--4575

  57. [57]

    Wang, B.; Zou, X.; Lin, G.; Sun, S.; Liu, Z.; Zhang, W.; Liu, Z.; Aw, A.; and Chen, N. 2025. AudioBench: A Universal Benchmark for Audio Large Language Models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 4297--4316

  58. [58]

    Wang, C.; Liao, M.; Huang, Z.; Lu, J.; Wu, J.; Liu, Y.; Zong, C.; and Zhang, J. 2023. BLSP : Bootstrapping Language-Speech Pre-Training via Behavior Alignment of Continuation Writing. arXiv preprint arXiv:2309.00916

  59. [59]

    Wu, M.; Dinkel, H.; and Yu, K. 2019. Audio caption: Listen and tell. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 830--834. IEEE

  60. [60]

    Xu, J.; Guo, Z.; He, J.; Hu, H.; He, T.; Bai, S.; Chen, K.; Wang, J.; Fan, Y.; Dang, K.; et al. 2025. Qwen2. 5-omni technical report. arXiv preprint arXiv:2503.20215

  61. [61]

    D.; et al

    Yuan, Y.; Jia, D.; Zhuang, X.; Chen, Y.; Chen, Z.; Wang, Y.; Wang, Y.; Liu, X.; Kang, X.; Plumbley, M. D.; et al. 2025. Sound-VECaps: Improving Audio Generation with Visually Enhanced Captions. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 1--5. IEEE

  62. [62]

    Zhang, T.; Yu, Y.; Mao, X.; Lu, Y.; Li, Z.; and Wang, H. 2022. FENSE: A feature-based ensemble modeling approach to cross-project just-in-time defect prediction. Empirical Software Engineering, 27(7): 162

  63. [63]

    Zheng, L.; Chiang, W.-L.; Sheng, Y.; Zhuang, S.; Wu, Z.; Zhuang, Y.; Lin, Z.; Li, Z.; Li, D.; Xing, E.; et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in neural information processing systems, 36: 46595--46623