pith. machine review for the scientific record.

arxiv: 2604.12527 · v2 · submitted 2026-04-14 · 📡 eess.AS

Recognition: unknown

Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

Hongjie Chen, Jian Kang, Jie Li, Lei Xie, Longhao Li, Qihan Hu, Yongxiang Li, Zehan Li

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 14:06 UTC · model grok-4.3

classification 📡 eess.AS
keywords audio reasoning · large audio language models · chain-of-thought · self-distillation · data curation · MMAR benchmark · open-source audio models

The pith

Audio-Cogito builds chain-of-thought reasoning into large audio language models by curating 545k samples and applying self-distillation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the limited reasoning ability of current large audio language models, compared with text and multimodal systems, by building a dedicated pipeline that produces explicit step-by-step reasoning chains from audio inputs. It generates a dataset of 545k such samples, transfers the capability to the model through self-distillation, and then demonstrates the result on the MMAR benchmark, which specifically evaluates the reasoning process. A reader would care because reliable audio reasoning could support more accurate applications in sound understanding, speech analysis, and audio question answering, where current models often produce inconsistent or shallow answers.

Core claim

Audio-Cogito is a fully open-source large audio language model that uses Cogito-pipe to curate 545k high-quality audio reasoning samples and applies self-distillation during fine-tuning to instill chain-of-thought capabilities. The resulting model achieves the strongest results among open-source models on the MMAR benchmark, matches or exceeds certain closed-source models on specific metrics, and places among the top systems in the Interspeech 2026 Audio Reasoning Challenge.

What carries the argument

Cogito-pipe, a pipeline that generates high-quality audio reasoning samples containing explicit chains, paired with a self-distillation training strategy that transfers those chains into the model's responses.
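
Read concretely, that recipe is a generate-filter-retrain loop. The sketch below is one plausible minimal form of such a self-distillation round, not the paper's released code; the interface (generate_chain, final_answer) and the exact-match filter are illustrative assumptions.

  # Hedged sketch of one self-distillation round over curated audio QA pairs.
  # The model emits an explicit reasoning chain; chains whose final answer
  # matches the curated reference are kept as fine-tuning targets.
  from dataclasses import dataclass

  @dataclass
  class Sample:
      audio_path: str   # input audio clip
      question: str     # curated question about the audio
      answer: str       # reference answer from the curation pipeline

  def answer_matches(predicted: str, reference: str) -> bool:
      # Simple normalized match; any real filter is presumably stricter.
      return predicted.strip().lower() == reference.strip().lower()

  def self_distill_round(model, samples, max_tries=4):
      """Collect (sample, chain) pairs whose final answer is correct."""
      sft_targets = []
      for s in samples:
          for _ in range(max_tries):
              chain = model.generate_chain(s.audio_path, s.question)
              if answer_matches(chain.final_answer, s.answer):
                  sft_targets.append((s, chain.text))   # keep the CoT trace
                  break
      return sft_targets   # surviving traces become ordinary SFT data

Fine-tuning on the surviving traces is the step that would transfer explicit chains into the model's own responses.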

If this is right

  • Explicit chain-of-thought outputs become more consistent and accurate for complex audio understanding tasks.
  • The released 545k reasoning samples can be reused to train or evaluate other audio models.
  • Open-source models can reach parity with some closed-source systems on audio reasoning metrics.
  • The same curation-plus-distillation recipe applies to additional audio benchmarks and challenges.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The curation approach could be extended to create reasoning datasets for video or multimodal inputs that combine audio with other signals.
  • Wider release of the data pipeline would allow independent labs to replicate and scale audio reasoning without proprietary resources.
  • Performance on real-world recordings with background noise or accents would test whether the learned reasoning chains remain robust outside clean benchmarks.

Load-bearing premise

The 545k curated samples contain high-quality, generalizable audio reasoning chains that self-distillation can effectively transfer to new audio tasks.

What would settle it

Running the fine-tuned model on a fresh audio reasoning benchmark whose tasks were never seen during data curation or distillation, and finding no gain over a baseline trained on uncurated audio data, would falsify the central claim.
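
In code, the comparison that test implies is small. A minimal hedged sketch, assuming per-item scores for both systems on the held-out benchmark and using scipy's paired t-test; the score arrays are hypothetical:

  # Hedged sketch of the falsification test: paired comparison of per-item
  # scores on a benchmark unseen during curation or distillation.
  import numpy as np
  from scipy import stats

  def gain_is_significant(scores_distilled, scores_baseline, alpha=0.05):
      """True if the distilled model shows a reliable per-item gain."""
      d = np.asarray(scores_distilled, dtype=float)
      b = np.asarray(scores_baseline, dtype=float)
      t, p = stats.ttest_rel(d, b)   # paired test: same items, two systems
      return (d.mean() > b.mean()) and (p < alpha)

A null result from this comparison, on tasks genuinely outside the curated distribution, is what would falsify the claim; a positive result merely fails to falsify it.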

Figures

Figures reproduced from arXiv: 2604.12527 by Hongjie Chen, Jian Kang, Jie Li, Lei Xie, Longhao Li, Qihan Hu, Yongxiang Li, Zehan Li.

Figure 1. Overview of Cogito-Pipe.
Original abstract

Recent advances in reasoning models have driven significant progress in text and multimodal domains, yet audio reasoning remains relatively limited. Only a few Large Audio Language Models (LALMs) incorporate explicit Chain-of-Thought (CoT) reasoning, and their capabilities are often inconsistent and insufficient for complex tasks. To bridge this gap, we introduce Audio-Cogito, a fully open-source solution for deep audio reasoning. We develop Cogito-pipe for high-quality audio reasoning data curation, producing 545k reasoning samples that will be released after review. Based on this dataset, we adopt a self-distillation strategy for model fine-tuning. Experiments on the MMAR benchmark, the only audio benchmark evaluating the CoT process, show that our model achieves the best performance among open-source models and matches or surpasses certain closed-source models in specific metrics. Our approach also ranks among the top-tier systems in the Interspeech 2026 Audio Reasoning Challenge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Audio-Cogito, an open-source Large Audio Language Model (LALM) for explicit Chain-of-Thought (CoT) audio reasoning. It describes Cogito-pipe, a curation pipeline that generates 545k audio reasoning samples, which are then used in a self-distillation fine-tuning procedure. On the MMAR benchmark (the only audio benchmark that evaluates the CoT process), the resulting model is reported to achieve the highest scores among open-source LALMs and to match or exceed selected closed-source models on specific metrics; it also places among the top entries in the Interspeech 2026 Audio Reasoning Challenge.

Significance. If the performance claims are confirmed with complete experimental controls, the work would constitute a useful contribution by supplying both an open dataset of audio reasoning chains and a practical self-distillation recipe for improving CoT capabilities in the audio domain. The public release of the 545k samples is a concrete community resource that parallels the role of CoT datasets in text LLMs.

major comments (3)
  1. [§4] §4 (Experiments): The central claim that Audio-Cogito attains the best performance among open-source models on MMAR is only partially supported. The section provides no exhaustive list of baselines, their parameter counts, training regimes, or prompting formats, nor any statistical significance tests or confidence intervals on the reported scores. These omissions prevent assessment of whether the gains are robust or attributable to the proposed pipeline.
  2. [§3] §3 (Cogito-pipe): No analysis is given of possible overlap or leakage between the 545k curated samples and the MMAR test set. Because the pipeline draws from existing audio corpora, contamination remains a plausible alternative explanation for the observed benchmark gains and must be ruled out.
  3. [§4.3] §4.3 (Ablations): The manuscript contains no ablation studies that isolate the contribution of the self-distillation stage from the quality of the curated data alone, nor comparisons against standard supervised fine-tuning or other reasoning-enhancement techniques. Without these controls the attribution of improvements to the proposed method remains unverified.
minor comments (2)
  1. [Abstract] Abstract: The phrase 'matches or surpasses certain closed-source models in specific metrics' is too vague; naming the models and metrics would improve clarity.
  2. Throughout: Some citations to prior LALM and multimodal reasoning papers could be expanded to better situate the novelty of explicit audio CoT.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects for strengthening the experimental rigor and methodological validation of our work. We address each point below and will incorporate the necessary revisions.

Point-by-point responses
  1. Referee: [§4] §4 (Experiments): The central claim that Audio-Cogito attains the best performance among open-source models on MMAR is only partially supported. The section provides no exhaustive list of baselines, their parameter counts, training regimes, or prompting formats, nor any statistical significance tests or confidence intervals on the reported scores. These omissions prevent assessment of whether the gains are robust or attributable to the proposed pipeline.

    Authors: We agree that the current presentation of results is incomplete. In the revised manuscript we will expand §4 with a new comprehensive table that enumerates every open-source and closed-source baseline, including exact parameter counts, training data regimes, and the precise prompting formats employed. We will also rerun the MMAR evaluations across multiple random seeds to report means and standard deviations, and include paired statistical significance tests (e.g., t-tests with p-values) to substantiate the robustness of the reported gains. revision: yes

  2. Referee: [§3] §3 (Cogito-pipe): No analysis is given of possible overlap or leakage between the 545k curated samples and the MMAR test set. Because the pipeline draws from existing audio corpora, contamination remains a plausible alternative explanation for the observed benchmark gains and must be ruled out.

    Authors: We recognize this as a critical issue. We will add a new subsection to §3 that quantifies potential leakage using both embedding cosine similarity (with a threshold) and n-gram overlap detection between the 545k Cogito-pipe samples and the MMAR test set. Any detected overlap will be reported with exact percentages, and contaminated samples will be removed prior to final training. The revised paper will include these results and the mitigation steps taken; a sketch of such a screen follows these responses. revision: yes

  3. Referee: [§4.3] §4.3 (Ablations): The manuscript contains no ablation studies that isolate the contribution of the self-distillation stage from the quality of the curated data alone, nor comparisons against standard supervised fine-tuning or other reasoning-enhancement techniques. Without these controls the attribution of improvements to the proposed method remains unverified.

    Authors: We concur that additional controls are required. The revised §4.3 will present new ablation experiments comparing (1) the base model, (2) supervised fine-tuning on the curated data without self-distillation, (3) the full self-distillation pipeline, and (4) alternative techniques such as standard CoT prompting and direct preference optimization. These results will isolate the incremental benefit of each stage and allow direct attribution of gains to the proposed method; a sketch of the ablation grid also follows these responses. revision: yes
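
On the leakage screen promised in response 2, a minimal sketch of what such a check could look like, assuming precomputed text embeddings for the training and test items; the thresholds and helper names are illustrative, not the authors' protocol:

  # Hedged sketch of a contamination screen: flag curated training samples
  # whose text is suspiciously close to an MMAR test item.
  import numpy as np

  def ngram_set(text, n=8):
      toks = text.lower().split()
      return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 0))}

  def flag_contaminated(train_texts, test_texts, train_emb, test_emb,
                        cos_thresh=0.95, ngram_n=8):
      """Return indices of training samples that look like test-set leakage."""
      # Screen 1: cosine similarity between L2-normalized embeddings.
      tr = train_emb / np.linalg.norm(train_emb, axis=1, keepdims=True)
      te = test_emb / np.linalg.norm(test_emb, axis=1, keepdims=True)
      sim = tr @ te.T                                # shape (n_train, n_test)
      flagged = set(np.where(sim.max(axis=1) >= cos_thresh)[0].tolist())
      # Screen 2: exact n-gram overlap with any test item.
      test_ngrams = [ngram_set(t, ngram_n) for t in test_texts]
      for i, text in enumerate(train_texts):
          grams = ngram_set(text, ngram_n)
          if grams and any(grams & tg for tg in test_ngrams):
              flagged.add(i)
      return sorted(flagged)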
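
And on the ablation arms promised in response 3, the grid is simple enough to sketch; build_model and evaluate_mmar are hypothetical helpers standing in for whatever training and evaluation harness the revision uses:

  # Hedged sketch of the ablation arms; keys and helpers are placeholders,
  # not the paper's training code.
  ABLATIONS = {
      "base": dict(finetune=None),                               # no fine-tuning
      "sft_only": dict(finetune="sft", data="cogito_545k"),      # curated data, no distillation
      "self_distill": dict(finetune="sft", data="cogito_545k", distill=True),
      "cot_prompt": dict(finetune=None, prompt="step_by_step"),  # inference-time CoT only
      "dpo": dict(finetune="dpo", data="cogito_545k_prefs"),     # preference-pair variant
  }

  def run_ablations(build_model, evaluate_mmar):
      """Train/evaluate each arm and print its benchmark score."""
      for name, cfg in ABLATIONS.items():
          model = build_model(**cfg)      # hypothetical training entry point
          score = evaluate_mmar(model)    # hypothetical MMAR evaluation
          print(f"{name:>12}: {score:.3f}")

Isolating the sft_only arm from the self_distill arm is the comparison that attributes any gain to self-distillation rather than to the curated data alone.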

Circularity Check

0 steps flagged

No significant circularity in derivation chain

Full rationale

The paper describes an empirical pipeline: curation of 545k audio reasoning samples via Cogito-pipe followed by self-distillation fine-tuning, with performance measured on the external MMAR benchmark. No equations, fitted parameters, or derivations are present that could reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim (improved benchmark scores) rests on comparison to independent external models and data, making the argument self-contained without internal circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The performance claims rest on the unproven transferability of self-distillation and the quality of synthetically curated audio reasoning data, both of which are domain assumptions rather than derived results.

axioms (1)
  • domain assumption · Self-distillation from improved outputs reliably enhances reasoning capabilities in audio language models
    Invoked when adopting self-distillation for fine-tuning without new justification specific to audio.
invented entities (1)
  • Cogito-pipe · no independent evidence
    purpose: High-quality audio reasoning data curation
    Newly proposed pipeline whose effectiveness is asserted but not independently validated outside the paper's own results.

pith-pipeline@v0.9.0 · 5477 in / 1277 out tokens · 57638 ms · 2026-05-10T14:06:01.035842+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

51 extracted references · 29 canonical work pages · 8 internal anchors

  1. [1]

    Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

    Introduction Recent advancements in Large Language Models (LLMs) have significantly boosted their capabilities, particularly through techniques like inference scaling and Chain-of-Thought (CoT). It has been widely demonstrated that CoT enhances reasoning effectively by decomposing complex queries into intermediate reasoning steps. This paradigm has succes...

  2. [2]

    Cogito-Pipe In this section, we introduce our automated pipeline, Cogito-Pipe, to generate audio reasoning SFT data

    Audio-Cogito 2.1. Cogito-Pipe In this section, we introduce our automated pipeline, Cogito-Pipe, to generate audio reasoning SFT data. As shown in Figure 1, the Cogito-Pipe consists of four stages: (1) Data Collection from multi-domain audio sources spanning sound, speech, and music; (2) QA Construction to synthesize diverse and challenging QA pair...

  3. [3]

    Experimental Setup 3.1.1

    Experiments 3.1. Experimental Setup 3.1.1. Training Details Our model, Audio-Cogito, is built upon the Qwen3-Omni-Thinking with 30 billion parameters. We utilize the ms-swift 2 framework to conduct supervised fine-tuning using Low-Rank Adaptation (LoRA). The model is fine-tuned for one epoch on the dataset constructed via Cogito-Pipe, with a maximum learnin...

  4. [4]

    Leveraging Cogito-Pipe for high-quality data curation, we construct and release a 545k-sample open-source audio reasoning dataset

    Conclusion In this work, we introduce Audio-Cogito, an open-source solution for deep audio reasoning in LALMs. Leveraging Cogito-Pipe for high-quality data curation, we construct and release a 545k-sample open-source audio reasoning dataset. We further employ a self-distillation strategy that substantially enhances complex reasoning capabilities. E...

  5. [5]

    These tools were not used to develop the methodology, conduct the experiments, generate the results, or draw the conclusions of this work

    Generative AI Use Disclosure Generative AI tools were employed exclusively for linguistic refinement and editorial assistance. These tools were not used to develop the methodology, conduct the experiments, generate the results, or draw the conclusions of this work. The authors retain full responsibility and accountability for all aspects of the manuscript

  6. [6]

    Improve vision language model chain-of-thought reasoning,

    R. Zhang, B. Zhang, Y. Li, H. Zhang, Z. Sun, Z. Gan, Y. Yang, R. Pang, and Y. Yang, “Improve vision language model chain-of-thought reasoning,” in Proc. ACL, 2025

  7. [7]

    Salmonn: Towards generic hearing abilities for large language models

    C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, “Salmonn: Towards generic hearing abilities for large language models,” arXiv preprint arXiv:2310.13289, 2023

  8. [8]

    Qwen2-Audio Technical Report

    Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin, C. Zhou, and J. Zhou, “Qwen2-audio technical report,” CoRR, vol. abs/2407.10759, 2024

  9. [9]

    Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,

    S. Ghosh, Z. Kong, S. Kumar, S. Sakshi, J. Kim, W. Ping, R. Valle, D. Manocha, and B. Catanzaro, “Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities,” in Proc. ICML, 2025

  10. [10]

    Listen, think, and understand,

    Y. Gong, H. Luo, A. H. Liu, L. Karlinsky, and J. Glass, “Listen, think, and understand,” arXiv preprint arXiv:2305.10790, 2023

  11. [11]

    Musilingo: Bridging music and text with pre-trained language models for music captioning and query response,

    Z. Deng, Y. Ma, Y. Liu, R. Guo, G. Zhang, W. Chen, W. Huang, and E. Benetos, “Musilingo: Bridging music and text with pre-trained language models for music captioning and query response,” in Proc. Findings of ACL, 2024

  12. [12]

    Music understanding llama: Advancing text-to-music generation with question answering and captioning,

    S. Liu, A. S. Hussain, C. Sun, and Y. Shan, “Music understanding llama: Advancing text-to-music generation with question answering and captioning,” in Proc. ICASSP, 2024

  13. [13]

    Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,

    S. Ghosh, S. Kumar, A. Seth, C. K. R. Evuru, U. Tyagi, S. Sakshi, O. Nieto, R. Duraiswami, and D. Manocha, “Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities,” in Proc. EMNLP, 2024

  14. [14]

    OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

    X. Geng, K. Wei, Q. Shao, S. Liu, Z. Lin, Z. Zhao, G. Li, W. Tian, P. Chen, Y. Li et al., “Osum: Advancing open speech understanding models with limited resources in academia,” arXiv preprint arXiv:2501.13306, 2025

  15. [15]

    Mimo-audio: Audio language models are few-shot learners

    D. Zhang, G. Wang, J. Xue, K. Fang, L. Zhao, R. Ma, S. Ren, S. Liu, T. Guo, W. Zhuang et al., “Mimo-audio: Audio language models are few-shot learners,” arXiv preprint arXiv:2512.23808, 2025

  16. [16]

    Step-audio 2 technical report

    B. Wu, C. Yan, C. Hu, C. Yi, C. Feng, F. Tian, F. Shen, G. Yu, H. Zhang, J. Li et al., “Step-audio 2 technical report,” arXiv preprint arXiv:2507.16632, 2025

  17. [17]

    Kimi-Audio Technical Report

    D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang et al., “Kimi-audio technical report,” arXiv preprint arXiv:2504.18425, 2025

  18. [18]

    GPT-4o System Card

    A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., “Gpt-4o system card,” arXiv preprint arXiv:2410.21276, 2024

  19. [19]

    Anygpt: Unified multimodal llm with discrete sequence modeling,

    J. Zhan, J. Dai, J. Ye, Y. Zhou, D. Zhang, Z. Liu, X. Zhang, R. Yuan, G. Zhang, L. Li et al., “Anygpt: Unified multimodal llm with discrete sequence modeling,” in Proc. ACL, 2024

  20. [20]

    Openomni: Advancing open-source omnimodal large language models with progressive multimodal alignment and real-time self-aware emotional speech synthesis

    R. Luo, T.-E. Lin, H. Zhang, Y. Wu, X. Liu, M. Yang, Y. Li, L. Chen, J. Li, L. Zhang et al., “Openomni: Advancing open-source omnimodal large language models with progressive multimodal alignment and real-time self-aware emotional speech synthesis,” arXiv preprint arXiv:2501.04561, 2025

  21. [21]

    Baichuan-omni-1.5 technical report

    Y. Li, J. Liu, T. Zhang, S. Chen, T. Li, Z. Li, L. Liu, L. Ming, G. Dong, D. Pan et al., “Baichuan-omni-1.5 technical report,” arXiv preprint arXiv:2501.15368, 2025

  22. [22]

    Qwen2.5-Omni Technical Report

    J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, “Qwen2.5-omni technical report,” arXiv preprint arXiv:2503.20215, 2025

  23. [23]

    Qwen3-Omni Technical Report

    J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu et al., “Qwen3-omni technical report,” arXiv preprint arXiv:2509.17765, 2025

  24. [24]

    Ming-omni: A unified multimodal model for perception and generation

    I. AI, B. Gong, C. Zou, C. Zheng, C. Zhou, C. Yan, C. Jin, C. Shen, D. Zheng, F. Wang et al., “Ming-omni: A unified multimodal model for perception and generation,” arXiv preprint arXiv:2506.09344, 2025

  25. [25]

    Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration,

    H. Zhong, M. Zhu, Z. Du, Z. Huang, C. Zhao, M. Liu, W. Wang, H. Chen, and C. Shen, “Omni-r1: Reinforcement learning for omnimodal reasoning via two-system collaboration,” arXiv preprint arXiv:2505.20256, 2025

  26. [26]

    Gemini 2.0 flash,

    Google, “Gemini 2.0 flash,” https://cloud.google.com/vertex-ai/generative-ai/docs/models/gemini/2-0-flash, 2025

  27. [27]

    Audio-cot: Exploring chain-of-thought reasoning in large audio language model,

    Z. Ma, Z. Chen, Y. Wang, E. S. Chng, and X. Chen, “Audio-cot: Exploring chain-of-thought reasoning in large audio language model,” arXiv preprint arXiv:2501.07246, 2025

  28. [28]

    Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,

    A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., “Audio flamingo 3: Advancing audio intelligence with fully open large audio language models,” arXiv preprint arXiv:2507.08128, 2025

  29. [29]

    Step-audio-r1 technical report,

    F. Tian, X. T. Zhang, Y. Zhang, H. Zhang, Y. Li, D. Liu, Y. Deng, D. Wu, J. Chen, L. Zhao et al., “Step-audio-r1 technical report,” arXiv preprint arXiv:2511.15848, 2025

  30. [30]

    Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,

    Z. Ma, Y. Ma, Y. Zhu, C. Yang, Y.-W. Chao, R. Xu, W. Chen, Y. Chen, Z. Chen, J. Cong et al., “Mmar: A challenging benchmark for deep reasoning in speech, audio, music, and their mix,” arXiv preprint arXiv:2505.13032, 2025

  31. [31]

    Mmau-pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence,

    S. Kumar, Š. Sedláček, V. Lokegaonkar, F. López, W. Yu, N. Anand, H. Ryu, L. Chen, M. Plička, M. Hlaváček et al., “Mmau-pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence,” arXiv preprint arXiv:2508.13992, 2025

  32. [32]

    Audio set: An ontology and human-labeled dataset for audio events,

    J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, “Audio set: An ontology and human-labeled dataset for audio events,” in Proc. ICASSP, 2017

  33. [33]

    Audiocaps: Generating captions for audios in the wild,

    C. D. Kim, B. Kim, H. Lee, and G. Kim, “Audiocaps: Generating captions for audios in the wild,” in Proc. NAACL-HLT, 2019

  34. [34]

    Clotho: An audio captioning dataset,

    K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An audio captioning dataset,” in Proc. ICASSP, 2020

  35. [35]

    Audio-reasoner: Improving reasoning capability in large audio language models,

    X. Zhifei, M. Lin, Z. Liu, P. Wu, S. Yan, and C. Miao, “Audio-reasoner: Improving reasoning capability in large audio language models,” in Proc. EMNLP, 2025

  36. [36]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., “Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities,” arXiv preprint arXiv:2507.06261, 2025

  37. [37]

    The interspeech 2026 audio reasoning challenge: Evaluating reasoning process quality for audio reasoning models and agents,

    Z. Ma, R. Xu, Y. Ma, C.-H. H. Yang, B. Li, J. Kim, J. Xu, J. Li, C. Busso, K. Yu, E. S. Chng, and X. Chen, “The interspeech 2026 audio reasoning challenge: Evaluating reasoning process quality for audio reasoning models and agents,” 2026. [Online]. Available: https://arxiv.org/abs/2602.14224

  38. [38]

    Meld: A multimodal multi-party dataset for emotion recognition in conversations,

    S. Poria, D. Hazarika, N. Majumder, G. Naik, E. Cambria, and R. Mihalcea, “Meld: A multimodal multi-party dataset for emotion recognition in conversations,” in Proc. ACL, 2019

  39. [39]

    Towards multimodal sarcasm detection (an obviously perfect paper),

    S. Castro, D. Hazarika, V. Pérez-Rosas, R. Zimmermann, R. Mihalcea, and S. Poria, “Towards multimodal sarcasm detection (an obviously perfect paper),” in Proc. ACL, 2019

  40. [40]

    Dailytalk: Spoken dialogue dataset for conversational text-to-speech,

    K. Lee, K. Park, and D. Kim, “Dailytalk: Spoken dialogue dataset for conversational text-to-speech,” in Proc. ICASSP, 2023

  41. [41]

    Mustango: Toward controllable text-to-music generation,

    J. Melechovsky, Z. Guo, D. Ghosal, N. Majumder, D. Herremans, and S. Poria, “Mustango: Toward controllable text-to-music generation,” in Proc. NAACL-HLT, 2024

  42. [42]

    Fma: A dataset for music analysis,

    M. Defferrard, K. Benzi, P. Vandergheynst, and X. Bresson, “Fma: A dataset for music analysis,” arXiv preprint arXiv:1612.01840, 2016

  43. [43]

    Medley-solos-db: a cross-collection dataset for musical instrument recognition,

    V. Lostanlen, C.-E. Cella, R. Bittner, and S. Essid, “Medley-solos-db: a cross-collection dataset for musical instrument recognition,” 2019. [Online]. Available: https://doi.org/10.5281/zenodo.3464194

  44. [44]

    Audiobench: A universal benchmark for audio large language models,

    B. Wang, X. Zou, G. Lin, S. Sun, Z. Liu, W. Zhang, Z. Liu, A. Aw, and N. Chen, “Audiobench: A universal benchmark for audio large language models,” in Proc. NAACL-HLT, 2025

  45. [45]

    Air-bench: Benchmarking large audio-language models via generative comprehension,

    Q. Yang, J. Xu, W. Liu, Y. Chu, Z. Jiang, X. Zhou, Y. Leng, Y. Lv, Z. Zhao, C. Zhou et al., “Air-bench: Benchmarking large audio-language models via generative comprehension,” in Proc. ACL, 2024

  46. [46]

    MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

    S. Sakshi, U. Tyagi, S. Kumar, A. Seth, R. Selvakumar, O. Nieto, R. Duraiswami, S. Ghosh, and D. Manocha, “Mmau: A massive multi-task audio understanding and reasoning benchmark,” arXiv preprint arXiv:2410.19168, 2024

  47. [47]

    Mmsu: A massive multi-task spoken language understanding and reasoning benchmark,

    D. Wang, J. Wu, J. Li, D. Yang, X. Chen, T. Zhang, and H. Meng, “Mmsu: A massive multi-task spoken language understanding and reasoning benchmark,” arXiv preprint arXiv:2506.04779, 2025

  48. [48]

    Televal: A dynamic benchmark designed for spoken language models in Chinese interactive scenarios,

    Z. Li, H. Chen, Q. Wang, Y. Zhang, J. Zhou, H. Lv, M. Du, Y. Song, J. Lian, J. Kang et al., “Televal: A dynamic benchmark designed for spoken language models in Chinese interactive scenarios,” arXiv preprint arXiv:2507.18061, 2025

  49. [49]

    Uro-bench: Towards comprehensive evaluation for end-to-end spoken dialogue models,

    R. Yan, X. Li, W. Chen, Z. Niu, C. Yang, Z. Ma, K. Yu, and X. Chen, “Uro-bench: Towards comprehensive evaluation for end-to-end spoken dialogue models,” in Proc. EMNLP, 2025

  50. [50]

    Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities,

    Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle, and B. Catanzaro, “Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities,” arXiv preprint arXiv:2402.01831, 2024

  51. [51]

    Mellow: a small audio language model for reasoning,

    S. Deshmukh, S. Dixit, R. Singh, and B. Raj, “Mellow: a small audio language model for reasoning,” arXiv preprint arXiv:2503.08540, 2025