pith. machine review for the scientific record.

arxiv: 2605.00371 · v1 · submitted 2026-05-01 · 💻 cs.SD · cs.AI

Recognition: unknown

GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 19:08 UTC · model grok-4.3

classification 💻 cs.SD cs.AI
keywords: large multimodal models · music understanding · mixture of experts · temporal analysis · global music features · progressive training · MusicBench

The pith

GaMMA unifies global and temporal music understanding in one large multimodal model using mixture-of-experts audio encoders and staged training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GaMMA as a large multimodal model that inherits the LLaVA encoder-decoder structure to learn across music and language. It adds audio encoders arranged in a mixture-of-experts pattern so that both time-based and non-time-based music tasks run under a single set of parameters. The model trains in stages (pretraining on large data, then supervised fine-tuning, then reinforcement learning) to improve its grasp of musical content. The authors also introduce MusicBench, a benchmark of 3,739 human-curated multiple-choice questions that separately measures temporal and global music skills. These steps produce higher accuracy than earlier systems on standard music tests and on the new benchmark.

Core claim

GaMMA establishes new state-of-the-art results for music understanding by unifying global and temporal capabilities inside one model. It combines the LLaVA design with mixture-of-experts audio encoders that handle time-series and non-time-series tasks together. Progressive training through pretraining, supervised fine-tuning, and reinforcement learning on large curated datasets yields 79.1 percent accuracy on MuchoMusic, 79.3 percent on MusicBench-Temporal, and 81.3 percent on MusicBench-Global, outperforming previous approaches.
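
To make the staged recipe concrete, the sketch below runs one parameter set through a pretrain, SFT, and RL schedule in order. The abstract names the three stages but none of their recipes, so the toy model, losses, learning rates, and step counts here are illustrative assumptions, not GaMMA's actual configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Linear(16, 4)  # stand-in for the full LMM; the same weights pass through every stage

def pretrain_loss(m, x):
    # Placeholder for a large-scale self-supervised objective.
    return m(x).pow(2).mean()

def sft_loss(m, x, y):
    # Supervised fine-tuning on labeled question-answer pairs.
    return F.cross_entropy(m(x), y)

def rl_loss(m, x, y, reward):
    # REINFORCE-style surrogate; the abstract only says "reinforcement learning (RL)".
    logp = F.log_softmax(m(x), dim=-1).gather(1, y[:, None])
    return -(reward[:, None] * logp).mean()

# Hypothetical schedule: (stage, learning rate, steps).
schedule = [("pretrain", 1e-3, 200), ("sft", 1e-4, 100), ("rl", 1e-5, 50)]
for stage, lr, steps in schedule:
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    for _ in range(steps):
        x = torch.randn(8, 16)         # fake audio-text features
        y = torch.randint(0, 4, (8,))  # fake answer indices
        if stage == "pretrain":
            loss = pretrain_loss(model, x)
        elif stage == "sft":
            loss = sft_loss(model, x, y)
        else:
            loss = rl_loss(model, x, y, reward=torch.rand(8))
        opt.zero_grad(); loss.backward(); opt.step()
```

The point of the sketch is only structural: one set of weights moves through all three objectives in sequence, which is what the single-parameter-set unification claim requires.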

What carries the argument

Mixture-of-experts audio encoders that dynamically route different music signals to specialized sub-networks, allowing a single model to manage both timing details and overall structure.
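
As a concreteness aid, here is a minimal, hedged sketch of what per-frame soft routing over expert audio encoders could look like. The expert count, top-k rule, dimensions, and module names are illustrative assumptions, not GaMMA's disclosed architecture.

```python
import torch
import torch.nn as nn

class MoEAudioEncoder(nn.Module):
    """Soft top-k routing over expert encoders; the mixed output feeds a shared decoder."""
    def __init__(self, feat_dim=128, hidden=256, n_experts=4, top_k=2):
        super().__init__()
        # Experts might specialize in, e.g., temporal detail vs. global timbre.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(feat_dim, hidden), nn.GELU(),
                          nn.Linear(hidden, hidden))
            for _ in range(n_experts))
        self.router = nn.Linear(feat_dim, n_experts)  # per-frame gate
        self.top_k = top_k

    def forward(self, frames):  # frames: (batch, time, feat_dim)
        weights = self.router(frames).softmax(dim=-1)
        topw, topi = weights.topk(self.top_k, dim=-1)
        topw = topw / topw.sum(dim=-1, keepdim=True)  # renormalize top-k mass
        stacked = torch.stack([e(frames) for e in self.experts], dim=-2)
        idx = topi.unsqueeze(-1).expand(-1, -1, -1, stacked.size(-1))
        # Mix the selected experts' outputs per frame.
        return (stacked.gather(-2, idx) * topw.unsqueeze(-1)).sum(dim=-2)

tokens = MoEAudioEncoder()(torch.randn(2, 100, 128))  # 2 clips, 100 frames each
print(tokens.shape)                                   # torch.Size([2, 100, 256])
```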

If this is right

  • Music LMMs can address both timing precision and overall structure without needing separate models or extra parameters for each.
  • Staged training that moves from broad pretraining through fine-tuning to reinforcement learning lifts results across multiple music benchmarks at once.
  • MusicBench provides a shared test bed that separates temporal from global capabilities for consistent comparison of future models.
  • Applications such as music description, recommendation, or education tools gain access to more complete understanding of a piece in one system.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same expert-routing approach could transfer to other time-based audio tasks such as speech or environmental sound analysis without redesigning the core architecture.
  • Adding reinforcement learning after initial training may improve the model's ability to follow open-ended instructions about music content.
  • MusicBench-style splits could be adapted to diagnose whether other multimodal models truly integrate local and global features or merely memorize surface patterns.
  • The joint training method suggests a path for video or motion models that must also combine frame-level timing with scene-level meaning.

Load-bearing premise

The performance improvements reflect genuine joint global-temporal understanding rather than tuning that only works on the specific datasets and benchmarks used.

What would settle it

Test the model on a fresh collection of music questions that demand simultaneous global and temporal reasoning and that were never seen during any training stage or included in MusicBench. Sustained high accuracy would support the claim; a sharp drop would indicate benchmark-specific optimization.
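
Operationally, that test is a small evaluation harness over a held-out question set. A sketch follows, where the question schema and the answer_fn callback are assumptions, since no such held-out set exists yet.

```python
def evaluate_heldout(answer_fn, questions):
    """questions: dicts with 'prompt', 'choices', and 'gold' (index of the correct choice)."""
    correct = sum(int(answer_fn(q["prompt"], q["choices"]) == q["gold"])
                  for q in questions)
    return correct / len(questions)

# Toy usage with a stub model that always picks the first choice.
questions = [{"prompt": "Does the chorus at 1:10 lift the song's overall energy arc?",
              "choices": ["yes", "no"], "gold": 0}]
print(evaluate_heldout(lambda prompt, choices: 0, questions))  # 1.0 on this toy item
```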

Original abstract

In this paper, we propose GaMMA, a state-of-the-art (SoTA) large multimodal model (LMM) designed to achieve comprehensive musical content understanding. GaMMA inherits the streamlined encoder-decoder design of LLaVA, enabling effective cross-modal learning between music and language. By incorporating audio encoders in a mixture-of-experts manner, GaMMA effectively unifies both time-series and non-time-series music understanding tasks within one set of parameters. Our approach combines carefully curated datasets at scale with a progressive training pipeline, effectively pushing the boundaries of music understanding via pretraining, supervised fine-tuning (SFT), and reinforcement learning (RL). To comprehensively assess both temporal and non-temporal capability of music LMMs, we introduce MusicBench, the largest music-oriented benchmark, comprising 3,739 human-curated multiple-choice questions covering diverse aspects of musical understanding. Extensive experiments demonstrate that GaMMA establishes new SoTA in the music domain, achieving 79.1% accuracy on MuchoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, consistently outperforming previous methods.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes GaMMA, an LLaVA-style encoder-decoder LMM for music understanding that integrates mixture-of-experts audio encoders to unify time-series (temporal) and non-time-series (global) tasks within a single parameter set. It employs a progressive training pipeline (pretraining, SFT, RL) on large curated datasets and introduces MusicBench, a benchmark of 3,739 human-curated multiple-choice questions. The central empirical claim is new SoTA performance: 79.1% on MuchoMusic, 79.3% on MusicBench-Temporal, and 81.3% on MusicBench-Global, outperforming prior methods.

Significance. If the joint global-temporal integration claim is substantiated, the work would advance multimodal music models by demonstrating unified handling of temporal dynamics and global structure in one architecture, with the large-scale MusicBench benchmark providing a reusable resource for the community. The MoE design and progressive training pipeline are potentially extensible contributions.

major comments (3)
  1. [Experiments] Experiments section: accuracies are reported separately on MusicBench-Temporal (79.3%) and MusicBench-Global (81.3%) with no evaluation on queries requiring simultaneous global structure and temporal dynamics in a single input. This partitioned evaluation does not support the claim of 'joint' understanding, as MoE routing could dispatch to independent experts without cross-talk, and the RL stage could optimize splits separately.
  2. [Method] Method section (MoE audio encoders and progressive pipeline): the unification claim rests on the assertion that MoE plus pretraining/SFT/RL produces genuine integration, yet no ablations isolate whether routing weights enable cross-expert communication or whether performance gains arise from benchmark-specific optimization on partitioned data.
  3. [Abstract / Experiments] Abstract and Experiments: headline accuracy figures are given without baseline implementation details, error bars, dataset composition statistics, or ablation tables, preventing verification that outperformance reflects the proposed joint mechanism rather than dataset scale or training hyperparameters.
minor comments (1)
  1. [Figures] Figure captions and architecture diagrams would benefit from explicit labeling of the MoE routing mechanism and how global vs. temporal experts interact during inference.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight important aspects of evaluation and evidence for our claims about joint global-temporal understanding in GaMMA. We respond to each major comment below and commit to revisions that strengthen the manuscript.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: accuracies are reported separately on MusicBench-Temporal (79.3%) and MusicBench-Global (81.3%) with no evaluation on queries requiring simultaneous global structure and temporal dynamics in a single input. This partitioned evaluation does not support the claim of 'joint' understanding, as MoE routing could dispatch to independent experts without cross-talk, and the RL stage could optimize splits separately.

    Authors: We acknowledge that separate reporting on MusicBench-Temporal and MusicBench-Global does not directly test mixed queries within a single input. The model is trained jointly on a combined corpus of temporal and global tasks using a shared decoder, which we argue enables integration beyond independent dispatching. To address this directly, the revised manuscript will add a curated subset of mixed global-temporal questions to MusicBench, report accuracy on them, and include an analysis of MoE routing weights on these inputs to demonstrate cross-expert interaction. revision: yes
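
A hedged sketch of what the promised routing-weight analysis could look like, reusing the soft-routing interface assumed in the encoder sketch above. The split of experts into "temporal" and "global" groups and the 0.2 mass floor are illustrative choices, not the authors' protocol.

```python
import torch

def cross_expert_share(weights, temporal_ids, global_ids, floor=0.2):
    """weights: (batch, time, n_experts) softmax routing weights.

    Returns the fraction of frames where both expert groups receive
    non-trivial routing mass, a crude signal of cross-expert interaction."""
    t_mass = weights[..., temporal_ids].sum(dim=-1)
    g_mass = weights[..., global_ids].sum(dim=-1)
    return ((t_mass > floor) & (g_mass > floor)).float().mean().item()

w = torch.rand(2, 100, 4).softmax(dim=-1)  # stand-in routing weights
print(cross_expert_share(w, temporal_ids=[0, 1], global_ids=[2, 3]))
```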

  2. Referee: [Method] Method section (MoE audio encoders and progressive pipeline): the unification claim rests on the assertion that MoE plus pretraining/SFT/RL produces genuine integration, yet no ablations isolate whether routing weights enable cross-expert communication or whether performance gains arise from benchmark-specific optimization on partitioned data.

    Authors: The progressive pipeline (pretraining on large-scale mixed music data, followed by SFT and RL) is designed to encourage routing that integrates both task types through the common language model. We agree that explicit isolation of cross-talk is valuable. The revision will incorporate new ablation studies, including comparisons against non-MoE single-encoder variants, frozen-routing controls, and visualizations of expert activation patterns across task categories to show that performance gains arise from integrated routing rather than partitioned optimization. revision: yes
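
Of the proposed controls, the frozen-routing one is simple to state. A sketch under the same assumed module layout as the encoder sketch above:

```python
def freeze_routing(moe_encoder):
    """Disable router updates so expert dispatch cannot adapt during fine-tuning."""
    for p in moe_encoder.router.parameters():
        p.requires_grad_(False)
    return moe_encoder
```

If the frozen-routing model matches the trainable one on mixed questions, the gains would likely come from the experts themselves rather than from adaptive cross-expert dispatch.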

  3. Referee: [Abstract / Experiments] Abstract and Experiments: headline accuracy figures are given without baseline implementation details, error bars, dataset composition statistics, or ablation tables, preventing verification that outperformance reflects the proposed joint mechanism rather than dataset scale or training hyperparameters.

    Authors: We agree that reproducibility and attribution require these details. The revised Experiments section will expand to include full baseline implementation descriptions (with any adaptations noted), error bars from multiple random seeds, detailed statistics on dataset composition and splits, and comprehensive ablation tables for the MoE design and training stages. These additions will clarify that gains derive from the joint architecture rather than scale or hyperparameters alone. revision: yes
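
The promised error bars amount to straightforward bookkeeping. A sketch, assuming a hypothetical train_and_eval(seed) entry point that runs the full pipeline once and returns benchmark accuracy:

```python
import statistics

def seeded_accuracy(train_and_eval, seeds=(0, 1, 2, 3, 4)):
    """Run the full pipeline per seed; report mean and standard deviation of accuracy."""
    accs = [train_and_eval(seed) for seed in seeds]
    return statistics.mean(accs), statistics.stdev(accs)

mean, std = seeded_accuracy(lambda seed: 0.79 + 0.003 * (seed % 3))  # stub evaluator
print(f"{mean:.3f} +/- {std:.3f}")
```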

Circularity Check

0 steps flagged

No circularity: empirical results on external benchmarks

Full rationale

The paper presents GaMMA as an LLaVA-derived encoder-decoder LMM trained progressively (pretraining, SFT, RL) on curated music data, with performance measured on MuchoMusic and the newly introduced MusicBench splits. These accuracy figures (79.1%, 79.3%, 81.3%) are reported outcomes of training and evaluation, not quantities derived by construction from the model equations or from fitted parameters that are then renamed as predictions. No self-definitional loops, load-bearing self-citations, or ansatz smuggling appear in the abstract or the described pipeline; the MoE routing and the joint-understanding claim rest on an architectural description whose effectiveness is assessed externally rather than tautologically.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unverified effectiveness of the MoE audio encoders for unification and the quality of the human-curated datasets and progressive training stages; no independent evidence for these is provided in the abstract.

free parameters (2)
  • Mixture-of-experts routing weights
    The paper does not specify how experts are selected or weighted during training or inference.
  • Progressive training hyperparameters
    Stages of pretraining, SFT, and RL involve multiple unspecified learning rates, batch sizes, and reward model details.
axioms (2)
  • domain assumption LLaVA encoder-decoder design transfers effectively to music-language cross-modal learning
    The paper states it inherits this design without providing justification or ablation for the music domain.
  • domain assumption Human-curated multiple-choice questions in MusicBench accurately measure musical understanding
    No validation of question quality or inter-annotator agreement is mentioned.

pith-pipeline@v0.9.0 · 5517 in / 1467 out tokens · 37011 ms · 2026-05-09T19:08:14.868616+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

52 extracted references · 5 canonical work pages · 4 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. In NeurIPS, 2022

  2. [2]

    Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv, 2019

    Aida Amini, Saadia Gabriel, Peter Lin, Rik Koncel-Kedziorski, Yejin Choi, and Hannaneh Hajishirzi. Mathqa: Towards interpretable math word problem solving with operation-based formalisms. arXiv, 2019

  3. [3]

    A general language assistant as a laboratory for alignment. arXiv, 2021

    Amanda Askell, Yuntao Bai, Anna Chen, Dawn Drain, Deep Ganguli, Tom Henighan, Andy Jones, Nicholas Joseph, Ben Mann, Nova DasSarma, et al. A general language assistant as a laboratory for alignment. arXiv, 2021

  4. [4]

    Coig-cqia: Quality is all you need for chinese instruction fine-tuning. arXiv, 2024

    Yuelin Bai, Xinrun Du, Yiming Liang, Yonggang Jin, Junting Zhou, Ziqiang Liu, Feiteng Fang, Mingshan Chang, Tianyu Zheng, Xincheng Zhang, et al. Coig-cqia: Quality is all you need for chinese instruction fine-tuning. arXiv, 2024

  5. [5]

    R1-v: Reinforcing super generalization ability in vision-language models with less than $3, 2025

    Liang Chen, Lei Li, Haozhe Zhao, Yifan Song, and Vinci. R1-v: Reinforcing super generalization ability in vision-language models with less than $3, 2025

  6. [6]

    Vision transformer adapter for dense predictions. arXiv, 2022

    Zhe Chen, Yuchen Duan, Wenhai Wang, Junjun He, Tong Lu, Jifeng Dai, and Yu Qiao. Vision transformer adapter for dense predictions. arXiv, 2022

  7. [7]

    Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv, 2023

    Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv, 2023

  8. [8]

    Qwen2-audio technical report. arXiv, 2024

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report. arXiv, 2024

  9. [9]

    Training verifiers to solve math word problems. arXiv, 2021

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. arXiv, 2021

  10. [10]

    Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

    Mike Conover, Matt Hayes, Ankit Mathur, Jianwei Xie, Jun Wan, Sam Shah, Ali Ghodsi, Patrick Wendell, Matei Zaharia, and Reynold Xin. Free dolly: Introducing the world’s first truly open instruction-tuned llm, 2023

  11. [11]

    Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv, 2023

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv, 2023

  12. [12]

    Kimi-audio technical report. arXiv, 2025

    Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, et al. Kimi-audio technical report. arXiv, 2025

  13. [13]

    Lp-musiccaps: Llm-based pseudo music captioning

    Seungheon Doh, Keunwoo Choi, Jongpil Lee, and Juhan Nam. Lp-musiccaps: Llm-based pseudo music captioning. In ISMIR, 2023

  14. [14]

    Palm-e: An embodied multimodal language model, 2023

    Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, Wenlong Huang, et al. Palm-e: An embodied multimodal language model, 2023

  15. [15]

    Aishell-2: Transforming mandarin asr research into industrial scale. arXiv, 2018

    Jiayu Du, Xingyu Na, Xuechen Liu, and Hui Bu. Aishell-2: Transforming mandarin asr research into industrial scale. arXiv, 2018

  16. [16]

    Clap learning audio concepts from natural language supervision

    Benjamin Elizalde, Soham Deshmukh, Mahmoud Al Ismail, and Huaming Wang. Clap learning audio concepts from natural language supervision. In ICASSP, 2023

  17. [17]

    Audio set: An ontology and human-labeled dataset for audio events

    Jort F Gemmeke, Daniel PW Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In ICASSP, 2017

  18. [18]

    Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    Arushi Goel, Sreyan Ghosh, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, et al. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models. arXiv preprint arXiv:2507.08128, 2025

  19. [19]

    SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

    Chunbo Hao, Ruibin Yuan, Jixun Yao, Qixin Deng, Xinyi Bai, Wei Xue, and Lei Xie. Songformer: Scaling music structure analysis with heterogeneous supervision, 2025. URL https://arxiv.org/abs/2510.02797

  20. [20]

    Liger-kernel: Efficient triton kernels for LLM training

    Pin-Lun Hsu, Yun Dai, Vignesh Kothapalli, Qingquan Song, Shao Tang, Siyu Zhu, Steven Shimizu, Shivam Sahni, Haowen Ning, Yanning Chen, and Zhipeng Wang. Liger-kernel: Efficient triton kernels for LLM training. In ICML, 2025

  21. [21]

    A study of bfloat16 for deep learning training. arXiv, 2019

    Dhiraj Kalamkar, Dheevatsa Mudigere, Naveen Mellempudi, Dipankar Das, Kunal Banerjee, Sasikanth Avancha, Dharma Teja Vooturi, Nataraj Jammalamadaka, Jianyu Huang, Hector Yuen, et al. A study of bfloat16 for deep learning training. arXiv, 2019

  22. [22]

    Los angeles midi dataset: Sota kilo-scale midi dataset for mir and music ai purposes

    Aleksandr Lev. Los angeles midi dataset: Sota kilo-scale midi dataset for mir and music ai purposes. In GitHub, 2024

  23. [23]

    Reinforcement learning outperforms supervised fine-tuning: A case study on audio question answering. arXiv, 2025

    Gang Li, Jizhong Liu, Heinrich Dinkel, Yadong Niu, Junbo Zhang, and Jian Luan. Reinforcement learning outperforms supervised fine-tuning: A case study on audio question answering. arXiv, 2025

  24. [24]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In NeurIPS, 2023

  25. [25]

    M2ugen: Multi-modal music understanding and generation with the power of large language models. arXiv, 2023

    Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, and Ying Shan. M2ugen: Multi-modal music understanding and generation with the power of large language models. arXiv, 2023

  26. [26]

    Music understanding llama: Advancing text-to-music generation with question answering and captioning

    Shansong Liu, Atin Sakkeer Hussain, Chenshuo Sun, and Ying Shan. Music understanding llama: Advancing text-to-music generation with question answering and captioning. In ICASSP, 2024

  27. [27]

    Deepseek-vl: towards real-world vision-language understanding. arXiv, 2024

    Haoyu Lu, Wen Liu, Bo Zhang, Bingxuan Wang, Kai Dong, Bo Liu, Jingxiang Sun, Tongzheng Ren, Zhuoshu Li, Hao Yang, et al. Deepseek-vl: towards real-world vision-language understanding. arXiv, 2024

  28. [28]

    Deepstack: Deeply stacking visual tokens is surprisingly simple and effective for lmms

    Lingchen Meng, Jianwei Yang, Rui Tian, Xiyang Dai, Zuxuan Wu, Jianfeng Gao, and Yu-Gang Jiang. Deepstack: Deeply stacking visual tokens is surprisingly simple and effective for lmms. In NeurIPS, 2024

  29. [29]

    Orca-math: Unlocking the potential of slms in grade school math. arXiv, 2024

    Arindam Mitra, Hamed Khanpour, Corby Rosset, and Ahmed Awadallah. Orca-math: Unlocking the potential of slms in grade school math. arXiv, 2024

  30. [30]

    The harmonix set: Beats, downbeats, and functional segment annotations of western popular music

    Oriol Nieto, Matthew C McCallum, Matthew EP Davies, Andrew Robertson, Adam M Stark, and Eran Egozy. The harmonix set: Beats, downbeats, and functional segment annotations of western popular music. In ISMIR, 2019

  31. [31]

    Instruction tuning with gpt-4. arXiv, 2023

    Baolin Peng, Chunyuan Li, Pengcheng He, Michel Galley, and Jianfeng Gao. Instruction tuning with gpt-4. arXiv, 2023

  32. [32]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In ICML, 2023

  33. [33]

    Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters

    Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In KDD, 2020

  34. [34]

    Mmau: A massive multi-task audio understanding and reasoning benchmark

    S Sakshi, Utkarsh Tyagi, Sonal Kumar, Ashish Seth, Ramaneswaran Selvakumar, Oriol Nieto, Ramani Duraiswami, Sreyan Ghosh, and Dinesh Manocha. Mmau: A massive multi-task audio understanding and reasoning benchmark. arXiv, 2024

  35. [35]

    Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv, 2025

    Haozhan Shen, Peng Liu, Jingcheng Li, Chunxin Fang, Yibo Ma, Jiajia Liao, Qiaoli Shen, Zilun Zhang, Kangjia Zhao, Qianqian Zhang, et al. Vlm-r1: A stable and generalizable r1-style large vision-language model. arXiv, 2025

  36. [36]

    Salmonn: Towards generic hearing abilities for large language models. arXiv, 2023

    Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models. arXiv, 2023

  37. [37]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv, 2024

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv, 2024

  38. [38]

    Qwen2 Technical Report

    Qwen Team et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671, 2024

  39. [39]

    Covost 2 and massively multilingual speech translation

    Changhan Wang, Anne Wu, Jiatao Gu, and Juan Pino. Covost 2 and massively multilingual speech translation. In Interspeech, 2021

  40. [40]

    Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl. arXiv, 2025

    Junke Wang, Zhi Tian, Xun Wang, Xinyu Zhang, Weilin Huang, Zuxuan Wu, and Yu-Gang Jiang. Simplear: Pushing the frontier of autoregressive visual generation through pretraining, sft, and rl. arXiv, 2025

  41. [41]

    Muchomusic: Evaluating music understanding in multimodal audio-language models. arXiv, 2024

    Benno Weck, Ilaria Manco, Emmanouil Benetos, Elio Quinton, George Fazekas, and Dmitry Bogdanov. Muchomusic: Evaluating music understanding in multimodal audio-language models. arXiv, 2024

  42. [42]

    A foundation model for music informatics

    Minz Won, Yun-Ning Hung, and Duc Le. A foundation model for music informatics. In ICASSP, 2024

  43. [43]

    Synthrl: Scaling visual reasoning with verifiable data synthesis. arXiv preprint arXiv:2506.02096, 2025

    Zijian Wu, Jinjie Ni, Xiangyan Liu, Zichen Liu, Hang Yan, and Michael Qizhe Shieh. Synthrl: Scaling visual reasoning with verifiable data synthesis. arXiv preprint arXiv:2506.02096, 2025

  44. [44]

    Audio-reasoner: Improving reasoning capability in large audio language models. arXiv, 2025

    Zhifei Xie, Mingbao Lin, Zihang Liu, Pengcheng Wu, Shuicheng Yan, and Chunyan Miao. Audio-reasoner: Improving reasoning capability in large audio language models. arXiv, 2025

  45. [45]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-Omni technical report. arXiv, 2025

  46. [46]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

  47. [47]

    Air-bench: Benchmarking large audio-language models via generative comprehension

    Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, et al. Air-bench: Benchmarking large audio-language models via generative comprehension. arXiv, 2024

  48. [48]

    Pix2cap-coco: Advancing visual comprehension via pixel-level captioning

    Zuyao You, Junke Wang, Lingyu Kong, Bo He, and Zuxuan Wu. Pix2cap-coco: Advancing visual comprehension via pixel-level captioning. arXiv, 2025
