pith. machine review for the scientific record.

arxiv: 2605.00371 · v1 · submitted 2026-05-01 · 💻 cs.SD · cs.AI

Recognition: unknown

GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:08 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords: large multimodal models · music understanding · mixture of experts · temporal analysis · global music features · progressive training · MusicBench

The pith

GaMMA unifies global and temporal music understanding in one large multimodal model using mixture-of-experts audio encoders and staged training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GaMMA as a large multimodal model that inherits the LLaVA encoder-decoder structure to learn across music and language. It adds audio encoders arranged in a mixture-of-experts pattern so that both time-based and non-time-based music tasks run under a single set of parameters. The model trains in stages (pretraining on large data, then supervised fine-tuning, then reinforcement learning) to improve its grasp of musical content. The authors also introduce MusicBench, a benchmark of 3,739 human-curated multiple-choice questions that separately measures temporal and global music skills. These steps produce higher accuracy than earlier systems on standard music tests and on the new benchmark.

Core claim

GaMMA establishes new state-of-the-art results for music understanding by unifying global and temporal capabilities inside one model. It combines the LLaVA design with mixture-of-experts audio encoders that handle time-series and non-time-series tasks together. Progressive training through pretraining, supervised fine-tuning, and reinforcement learning on large curated datasets yields 79.1 percent accuracy on MuchoMusic, 79.3 percent on MusicBench-Temporal, and 81.3 percent on MusicBench-Global, outperforming previous approaches.
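
To make the staged recipe concrete, the sketch below runs one parameter set through a pretrain, SFT, and RL schedule in order. The abstract names the three stages but none of their recipes, so the toy model, losses, learning rates, and step counts here are illustrative assumptions, not GaMMA's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 4)  # stand-in for the full LMM; the same weights pass through every stage

def pretrain_loss(m, x):
    # Placeholder for a large-scale self-supervised objective.
    return m(x).pow(2).mean()

def sft_loss(m, x, y):
    # Supervised fine-tuning on labeled question-answer pairs.
    return F.cross_entropy(m(x), y)

def rl_loss(m, x, y, reward):
    # REINFORCE-style surrogate; the abstract only says "reinforcement learning (RL)".
    logp = F.log_softmax(m(x), dim=-1).gather(1, y[:, None])
    return -(reward[:, None] * logp).mean()

# Hypothetical schedule: (stage, learning rate, steps).
schedule = [("pretrain", 1e-3, 200), ("sft", 1e-4, 100), ("rl", 1e-5, 50)]
for stage, lr, steps in schedule:
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        x = torch.randn(8, 16)         # fake audio-text features
        y = torch.randint(0, 4, (8,))  # fake answer indices
        if stage == "pretrain":
            loss = pretrain_loss(model, x)
        elif stage == "sft":
            loss = sft_loss(model, x, y)
        else:
            loss = rl_loss(model, x, y, reward=torch.rand(8))
        opt.zero_grad(); loss.backward(); opt.step()
```

The point of the sketch is only structural: one set of weights moves through all three objectives in sequence, which is what the single-parameter-set unification claim requires.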

What carries the argument

Mixture-of-experts audio encoders that dynamically route different music signals to specialized sub-networks, allowing a single model to manage both timing details and overall structure.
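
As a concreteness aid, here is a minimal, hedged sketch of what per-frame soft routing over expert audio encoders could look like. The expert count, top-k rule, dimensions, and module names are illustrative assumptions, not GaMMA's disclosed architecture.

```python
import torch
import torch.nn as nn

class MoEAudioEncoder(nn.Module):
    """Soft top-k routing over expert encoders; the mixed output feeds a shared decoder."""
    def __init__(self, feat_dim=128, hidden=256, n_experts=4, top_k=2):
        super().__init__()
        # Experts might specialize in, e.g., temporal detail vs. global timbre.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.GELU(),
                          nn.Linear(hidden, hidden))
            for _ in range(n_experts))
        self.router = nn.Linear(feat_dim, n_experts)  # per-frame gate
        self.top_k = top_k

    def forward(self, frames):  # frames: (batch, time, feat_dim)
        weights = self.router(frames).softmax(dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True)  # renormalize top-k mass
        stacked = torch.stack([e(frames) for e in self.experts], dim=-2)
        idx = topi.unsqueeze(-1).expand(-1, -1, -1, stacked.size(-1))
        # Mix the selected experts' outputs per frame.
        return (stacked.gather(-2, idx) * topw.unsqueeze(-1)).sum(dim=-2)

tokens = MoEAudioEncoder()(torch.randn(2, 100, 128))  # 2 clips, 100 frames each
print(tokens.shape)                                   # torch.Size([2, 100, 256])
```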

If this is right

  • Music LMMs can address both timing precision and overall structure without needing separate models or extra parameters for each.
  • Staged training that moves from broad pretraining through fine-tuning to reinforcement learning lifts results across multiple music benchmarks at once.
  • MusicBench provides a shared test bed that separates temporal from global capabilities for consistent comparison of future models.
  • Applications such as music description, recommendation, or education tools gain access to more complete understanding of a piece in one system.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same expert-routing approach could transfer to other time-based audio tasks such as speech or environmental sound analysis without redesigning the core architecture.
  • Adding reinforcement learning after initial training may improve the model's ability to follow open-ended instructions about music content.
  • MusicBench-style splits could be adapted to diagnose whether other multimodal models truly integrate local and global features or merely memorize surface patterns.
  • The joint training method suggests a path for video or motion models that must also combine frame-level timing with scene-level meaning.

Load-bearing premise

The performance improvements reflect genuine joint global-temporal understanding rather than tuning that only works on the specific datasets and benchmarks used.

What would settle it

Test the model on a fresh collection of music questions that demand simultaneous global and temporal reasoning and that were never seen during any training stage or included in MusicBench. Sustained high accuracy would support the claim; a sharp drop would indicate benchmark-specific optimization.
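
Operationally, that test is a small evaluation harness over a held-out question set. A sketch follows, where the question schema and the answer_fn callback are assumptions, since no such held-out set exists yet.

```python
def evaluate_heldout(answer_fn, questions):
    """questions: dicts with 'prompt', 'choices', and 'gold' (index of the correct choice)."""
    correct = sum(int(answer_fn(q["prompt"], q["choices"]) == q["gold"])
                  for q in questions)
    return correct / len(questions)

# Toy usage with a stub model that always picks the first choice.
questions = [{"prompt": "Does the chorus at 1:10 lift the song's overall energy arc?",
              "choices": ["yes", "no"], "gold": 0}]
print(evaluate_heldout(lambda prompt, choices: 0, questions))  # 1.0 on this toy item
```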

Original abstract

In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA effectively unifies both time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, effectively pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both temporal and non-temporal capability of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes new SoTA in the music domain, achieving 79.1% accuracy on MuchoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes GaMMA, an LLaVA-style encoder-decoder LMM for music understanding that integrates mixture-of-experts audio encoders to unify time-series (temporal) and non-time-series (global) tasks within a single parameter set. It employs a progressive training pipeline (pretraining, SFT, RL) on large curated datasets and introduces MusicBench, a benchmark of 3,739 human-curated multiple-choice questions. The central empirical claim is new SoTA performance: 79.1% on MuchoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, outperforming prior methods.

Significance. If the joint global-temporal integration claim is substantiated, the work would advance multimodal music models by demonstrating unified handling of temporal dynamics and global structure in one architecture, with the large-scale MusicBench benchmark providing a reusable resource for the community. The MoE design and progressive training pipeline are potentially extensible contributions.

major comments (3)
  1. [Experiments] Experiments section: accuracies are reported separately on MusicBench-Temporal (79.3%) and MusicBench-Global (81.3%) with no evaluation on queries requiring simultaneous global structure and temporal dynamics in a single input. This partitioned evaluation does not support the claim of 'joint' understanding, as MoE routing could dispatch to independent experts without cross-talk, and the RL stage could optimize splits separately.
  2. [Method] Method section (MoE audio encoders and progressive pipeline): the unification claim rests on the assertion that MoE plus pretraining/SFT/RL produces genuine integration, yet no ablations isolate whether routing weights enable cross-expert communication or whether performance gains arise from benchmark-specific optimization on partitioned data.
  3. [Abstract / Experiments] Abstract and Experiments: headline accuracy figures are given without baseline implementation details, error bars, dataset composition statistics, or ablation tables, preventing verification that outperformance reflects the proposed joint mechanism rather than dataset scale or training hyperparameters.
minor comments (1)
  1. [Figures] Figure captions and architecture diagrams would benefit from explicit labeling of the MoE routing mechanism and how global vs. temporal experts interact during inference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of evaluation and evidence for our claims about joint global-temporal understanding in GaMMA. We respond to each major comment below and commit to revisions that strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: accuracies are reported separately on MusicBench-Temporal (79.3%) and MusicBench-Global (81.3%) with no evaluation on queries requiring simultaneous global structure and temporal dynamics in a single input. This partitioned evaluation does not support the claim of 'joint' understanding, as MoE routing could dispatch to independent experts without cross-talk, and the RL stage could optimize splits separately.

    Authors: We acknowledge that separate reporting on MusicBench-Temporal and MusicBench-Global does not directly test mixed queries within a single input. The model is trained jointly on a combined corpus of temporal and global tasks using a shared decoder, which we argue enables integration beyond independent dispatching. To address this directly, the revised manuscript will add a curated subset of mixed global-temporal questions to MusicBench, report accuracy on them, and include an analysis of MoE routing weights on these inputs to demonstrate cross-expert interaction. revision: yes
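
A hedged sketch of what the promised routing-weight analysis could look like, reusing the soft-routing interface assumed in the encoder sketch above. The split of experts into "temporal" and "global" groups and the 0.2 mass floor are illustrative choices, not the authors' protocol.

```python
import torch

def cross_expert_share(weights, temporal_ids, global_ids, floor=0.2):
    """weights: (batch, time, n_experts) softmax routing weights.

    Returns the fraction of frames where both expert groups receive
    non-trivial routing mass, a crude signal of cross-expert interaction."""
    t_mass = weights[..., temporal_ids].sum(dim=-1)
    g_mass = weights[..., global_ids].sum(dim=-1)
    return ((t_mass > floor) & (g_mass > floor)).float().mean().item()

w = torch.rand(2, 100, 4).softmax(dim=-1)  # stand-in routing weights
print(cross_expert_share(w, temporal_ids=[0, 1], global_ids=[2, 3]))
```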

  2. Referee: [Method] Method section (MoE audio encoders and progressive pipeline): the unification claim rests on the assertion that MoE plus pretraining/SFT/RL produces genuine integration, yet no ablations isolate whether routing weights enable cross-expert communication or whether performance gains arise from benchmark-specific optimization on partitioned data.

    Authors: The progressive pipeline (pretraining on large-scale mixed music data, followed by SFT and RL) is designed to encourage routing that integrates both task types through the common language model. We agree that explicit isolation of cross-talk is valuable. The revision will incorporate new ablation studies, including comparisons against non-MoE single-encoder variants, frozen-routing controls, and visualizations of expert activation patterns across task categories to show that performance gains arise from integrated routing rather than partitioned optimization. revision: yes
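
Of the proposed controls, the frozen-routing one is simple to state. A sketch under the same assumed module layout as the encoder sketch above:

```python
def freeze_routing(moe_encoder):
    """Disable router updates so expert dispatch cannot adapt during fine-tuning."""
    for p in moe_encoder.router.parameters():
        p.requires_grad_(False)
    return moe_encoder
```

If the frozen-routing model matches the trainable one on mixed questions, the gains would likely come from the experts themselves rather than from adaptive cross-expert dispatch.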

  3. Referee: [Abstract / Experiments] Abstract and Experiments: headline accuracy figures are given without baseline implementation details, error bars, dataset composition statistics, or ablation tables, preventing verification that outperformance reflects the proposed joint mechanism rather than dataset scale or training hyperparameters.

    Authors: We agree that reproducibility and attribution require these details. The revised Experiments section will expand to include full baseline implementation descriptions (with any adaptations noted), error bars from multiple random seeds, detailed statistics on dataset composition and splits, and comprehensive ablation tables for the MoE design and training stages. These additions will clarify that gains derive from the joint architecture rather than scale or hyperparameters alone. revision: yes
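
The promised error bars amount to straightforward bookkeeping. A sketch, assuming a hypothetical train_and_eval(seed) entry point that runs the full pipeline once and returns benchmark accuracy:

```python
import statistics

def seeded_accuracy(train_and_eval, seeds=(0, 1, 2, 3, 4)):
    """Run the full pipeline per seed; report mean and standard deviation of accuracy."""
    accs = [train_and_eval(seed) for seed in seeds]
    return statistics.mean(accs), statistics.stdev(accs)

mean, std = seeded_accuracy(lambda seed: 0.79 + 0.003 * (seed % 3))  # stub evaluator
print(f"{mean:.3f} +/- {std:.3f}")
```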

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

Full rationale

The paper presents GaMMA as an LLaVA-derived encoder-decoder LMM trained progressively (pretraining, SFT, RL) on curated music data, with performance measured on MuchoMusic and the newly introduced MusicBench splits. These accuracy figures (79.1%, 79.3%, 81.3%) are reported outcomes of training and evaluation, not quantities derived by construction from the model equations or from fitted parameters that are then renamed as predictions. No self-definitional loops, load-bearing self-citations, or ansatz smuggling appear in the abstract or the described pipeline; the MoE routing and the joint-understanding claim rest on an architectural description whose effectiveness is assessed externally rather than tautologically.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unverified effectiveness of the MoE audio encoders for unification and the quality of the human-curated datasets and progressive training stages; no independent evidence for these is provided in the abstract.

free parameters (2)
  • Mixture-of-experts routing weights
    The paper does not specify how experts are selected or weighted during training or inference.
  • Progressive training hyperparameters
    Stages of pretraining, SFT, and RL involve multiple unspecified learning rates, batch sizes, and reward model details.
axioms (2)
  • domain assumption LLaVA encoder-decoder design transfers effectively to music-language cross-modal learning
    The paper states it inherits this design without providing justification or ablation for the music domain.
  • domain assumption Human-curated multiple-choice questions in MusicBench accurately measure musical understanding
    No validation of question quality or inter-annotator agreement is mentioned.

pith-pipeline@v0.9.0 · 5517 in / 1467 out tokens · 37011 ms · 2026-05-09T19:08:14.868616+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 5 canonical work pages · 4 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022

  2. [2]

    Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv, 2019

    Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv, 2019

  3. [3]

    A general language assistant as a laboratory for alignment. arXiv, 2021

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv, 2021

  4. [4]

    Coig-cqia: Quality is all you need for chinese instruction fine-tuning. arXiv, 2024

    Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang Jin, Junting Zhou, Ziqiang Liu, Feiteng Fang, Mingshan Chang, Tianyu Zheng, Xincheng Zhang, et al. Coig-cqia: Quality is all you need for chinese instruction fine-tuning. arXiv, 2024

  5. [5]

    R1-v: Reinforcing super generalization ability in vision-language models with less than $3, 2025

    Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3, 2025

  6. [6]

    Vision transformer adapter for dense predictions. arXiv, 2022

    Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. arXiv, 2022

  7. [7]

    Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv, 2023

    Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv, 2023

  8. [8]

    Qwen2-audio technical report. arXiv, 2024

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report. arXiv, 2024

  9. [9]

    Training verifiers to solve math word problems. arXiv, 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv, 2021

  10. [10]

    Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

    Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

  11. [11]

    Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv, 2023

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv, 2023

  12. [12]

    Kimi-audio technical report. arXiv, 2025

    Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, et al. Kimi-audio technical report. arXiv, 2025

  13. [13]

    Lp-musiccaps: Llm-based pseudo music captioning

    Seungheon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam. Lp-musiccaps: Llm-based pseudo music captioning. In ISMIR, 2023

  14. [14]

    Palm-e: An embodied multimodal language model, 2023

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. Palm-e: An embodied multimodal language model, 2023

  15. [15]

    Aishell-2: Transforming mandarin asr research into industrial scale. arXiv, 2018

    Jiayu Du, Xingyu Na, Xuechen Liu, and Hui Bu. Aishell-2: Transforming mandarin asr research into industrial scale. arXiv, 2018

  16. [16]

    Clap learning audio concepts from natural language supervision

    Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap learning audio concepts from natural language supervision. In ICASSP, 2023

  17. [17]

    Audio set: An ontology and human-labeled dataset for audio events

    Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP, 2017

  18. [18]

    Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, et al. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models. arXiv preprint arXiv:2507.08128, 2025

  19. [19]

    SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

    Chunbo Hao, Ruibin Yuan, Jixun Yao, Qixin Deng, Xinyi Bai, Wei Xue, and Lei Xie. Songformer: Scaling music structure analysis with heterogeneous supervision, 2025. URL https://arxiv.org/abs/2510.02797

  20. [20]

    Liger-kernel: Efficient triton kernels for LLM training

    Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, and Zhipeng Wang. Liger-kernel: Efficient triton kernels for LLM training. In ICML, 2025

  21. [21]

    A study of bfloat16 for deep learning training. arXiv, 2019

    Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, et al. A study of bfloat16 for deep learning training. arXiv, 2019

  22. [22]

    Los angeles midi dataset: Sota kilo-scale midi dataset for mir and music ai purposes

    Aleksandr Lev. Los angeles midi dataset: Sota kilo-scale midi dataset for mir and music ai purposes. In GitHub, 2024

  23. [23]

    Reinforcement learning outperforms supervised fine-tuning: A case study on audio question answering. arXiv, 2025

    Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Junbo Zhang, and Jian Luan. Reinforcement learning outperforms supervised fine-tuning: A case study on audio question answering. arXiv, 2025

  24. [24]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  25. [25]

    M2ugen: Multi-modal music understanding and generation with the power of large language models. arXiv, 2023

    Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, and Ying Shan. M2ugen: Multi-modal music understanding and generation with the power of large language models. arXiv, 2023

  26. [26]

    Music understanding llama: Advancing text-to-music generation with question answering and captioning

    Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, and Ying Shan. Music understanding llama: Advancing text-to-music generation with question answering and captioning. In ICASSP, 2024

  27. [27]

    Deepseek-vl: towards real-world vision-language understanding. arXiv, 2024

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv, 2024

  28. [28]

    Deepstack: Deeply stacking visual tokens is surprisingly simple and effective for lmms

    Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, and Yu-Gang Jiang. Deepstack: Deeply stacking visual tokens is surprisingly simple and effective for lmms. In NeurIPS, 2024

  29. [29]

    Orca-math: Unlocking the potential of slms in grade school math. arXiv, 2024

    Arindam Mitra, Hamed Khanpour, Corby Rosset, and Ahmed Awadallah. Orca-math: Unlocking the potential of slms in grade school math. arXiv, 2024

  30. [30]

    The harmonix set: Beats, downbeats, and functional segment annotations of western popular music

    Oriol Nieto, Matthew C McCallum, Matthew EP Davies, Andrew Robertson, Adam M Stark, and Eran Egozy. The harmonix set: Beats, downbeats, and functional segment annotations of western popular music. In ISMIR, 2019

  31. [31]

    Instruction tuning with gpt-4. arXiv, 2023

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv, 2023

  32. [32]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In ICML, 2023

  33. [33]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD, 2020

  34. [34]

    Mmau: A massive multi-task audio understanding and reasoning benchmark

    S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. Mmau: A massive multi-task audio understanding and reasoning benchmark. arXiv, 2024

  35. [35]

    Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv, 2025

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv, 2025

  36. [36]

    Salmonn: Towards generic hearing abilities for large language models. arXiv, 2023

    Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models. arXiv, 2023

  37. [37]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv, 2024

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv, 2024

  38. [38]

    Qwen2 Technical Report

    Qwen Team et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024

  39. [39]

    Covost 2 and massively multilingual speech translation

    Changhan Wang, Anne Wu, Jiatao Gu, and Juan Pino. Covost 2 and massively multilingual speech translation. In Interspeech, 2021

  40. [40]

    Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl. arXiv, 2025

    Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl. arXiv, 2025

  41. [41]

    Muchomusic: Evaluating music understanding in multimodal audio-language models. arXiv, 2024

    Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, George Fazekas, and Dmitry Bogdanov. Muchomusic: Evaluating music understanding in multimodal audio-language models. arXiv, 2024

  42. [42]

    A foundation model for music informatics

    Minz Won, Yun-Ning Hung, and Duc Le. A foundation model for music informatics. In ICASSP, 2024

  43. [43]

    Synthrl: Scaling visual reasoning with verifiable data synthesis. arXiv preprint arXiv:2506.02096, 2025

    Zijian Wu, Jinjie Ni, Xiangyan Liu, Zichen Liu, Hang Yan, and Michael Qizhe Shieh. Synthrl: Scaling visual reasoning with verifiable data synthesis. arXiv preprint arXiv:2506.02096, 2025

  44. [44]

    Audio-reasoner: Improving reasoning capability in large audio language models. arXiv, 2025

    Zhifei Xie, Mingbao Lin, Zihang Liu, Pengcheng Wu, Shuicheng Yan, and Chunyan Miao. Audio-reasoner: Improving reasoning capability in large audio language models. arXiv, 2025

  45. [45]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-Omni technical report. arXiv, 2025

  46. [46]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  47. [47]

    Air-bench: Benchmarking large audio-language models via generative comprehension

    Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, et al. Air-bench: Benchmarking large audio-language models via generative comprehension. arXiv, 2024

  48. [48]

    Pix2cap-coco: Advancing visual comprehension via pixel-level captioning

    Zuyao You, Junke Wang, Lingyu Kong, Bo He, and Zuxuan Wu. Pix2cap-coco: Advancing visual comprehension via pixel-level captioning. arXiv, 2025
