Academic Text-to-Music Grand Challenge: Datasets, Baselines, and Evaluation Methods

Chun-Ping Wang; Fang-Chih Hsieh; Hao-Wen Dong; Hung-yi Lee; Wei-Jaw Lee; Yi-Hsuan Yang

arxiv: 2605.21538 · v1 · pith:QZVIEY3Gnew · submitted 2026-05-20 · 💻 cs.SD

Fang-Chih Hsieh, Wei-Jaw Lee, Chun-Ping Wang, Hung-yi Lee, Hao-Wen Dong, Yi-Hsuan Yang

Pith reviewed 2026-05-22 01:33 UTC · model grok-4.3

classification 💻 cs.SD

keywords: text-to-music generation, grand challenge, benchmark, MTG-Jamendo dataset, generative models, evaluation metrics, academic research

The pith

The ATTM Challenge creates a fair benchmark for academic text-to-music generation by requiring all models to train from scratch on a fixed public instrumental dataset.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces the Academic Text-to-Music Grand Challenge to remove the barrier posed by proprietary datasets and industrial compute in text-to-music research. It defines a standardized training resource consisting of a CC-licensed instrumental-only subset of the MTG-Jamendo dataset that every participant must use exclusively. The challenge splits into an Efficiency Track limited to 500 million parameters and a Performance Track with no size restriction. Evaluation proceeds through objective scores including Fréchet Audio Distance (FAD), CLAP score, and a new Concept Coverage Score (CCS), then concludes with subjective listening tests. Open baselines, preprocessing pipelines, reference captions, and public evaluation code for FAD and CLAP are released to support reproducible academic work.
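As an illustration of the Efficiency Track limit, the sketch below counts parameters from a table of weight shapes and compares the total with the 500-million cap. The layer names and shapes are hypothetical, and treating the cap as inclusive is an assumption; the challenge rules, not this sketch, decide which components (for example a frozen text encoder or audio codec) count toward the limit.

```python
from math import prod

# 500M-parameter limit stated for the Efficiency Track.
# Assumption: the cap is inclusive (a model with exactly 500M parameters passes).
EFFICIENCY_TRACK_CAP = 500_000_000


def count_parameters(shapes):
    """Sum the number of scalar entries over all weight tensors."""
    return sum(prod(shape) for shape in shapes.values())


def fits_efficiency_track(shapes, cap=EFFICIENCY_TRACK_CAP):
    """Return True if the total parameter count does not exceed the cap."""
    return count_parameters(shapes) <= cap


# Hypothetical toy model: names and shapes are invented for illustration only.
toy_model = {
    "text_proj.weight": (768, 1024),
    "blocks.attn.weight": (24, 4, 1024, 1024),
    "blocks.mlp.weight": (24, 2, 1024, 4096),
    "out_proj.weight": (1024, 128),
}

if __name__ == "__main__":
    total = count_parameters(toy_model)
    print(f"{total:,} parameters, fits cap: {fits_efficiency_track(toy_model)}")
```

In a real framework the same total would normally come from iterating over the model's trainable tensors rather than from a hand-written shape table.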

Core claim

The paper establishes that a controlled grand challenge using only a single open instrumental dataset and a multi-stage evaluation protocol can create fair conditions for academic teams to develop and compare text-to-music generative models without relying on massive private resources.

What carries the argument

The standardized CC-licensed instrumental subset of the MTG-Jamendo dataset, which serves as the sole allowed training resource to enforce equal starting conditions across all submissions.

If this is right

Every submitted model will be trained without any data beyond the provided instrumental subset.
The Efficiency Track will test whether competitive text-to-music performance remains possible under a 500-million-parameter cap.
The Concept Coverage Score will supply a new objective measure of semantic alignment between text prompts and generated audio.
Released evaluation code will allow any researcher to recompute FAD and CLAP results on new submissions.
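
For readers who recompute FAD, the quantity is the Fréchet distance between two Gaussians fitted to audio embeddings of a reference set and a generated set. The formula below is the standard definition of that distance; the symbol names are chosen here for illustration, and this page does not state which embedding model the challenge uses.

```latex
% \mu_r, \Sigma_r: mean and covariance of embeddings of the reference audio
% \mu_g, \Sigma_g: mean and covariance of embeddings of the generated audio
\[
\mathrm{FAD} \;=\; \lVert \mu_r - \mu_g \rVert_2^{2}
\;+\; \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,\bigl(\Sigma_r \Sigma_g\bigr)^{1/2} \right)
\]
```

Lower values indicate that the generated embeddings are statistically closer to the reference embeddings; the score is zero when both means and both covariances coincide.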

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The instrumental-only restriction may push later work to add vocal or mixed-genre tracks as a natural next step.
Public baselines and pipelines could lower the entry cost for smaller research groups and increase participation from universities.
The staged evaluation combining objective metrics with listening tests offers a template other audio generation challenges might adopt.

Load-bearing premise

The chosen instrumental subset of MTG-Jamendo supplies enough musical variety and audio quality to let models trained on it produce results that meaningfully advance text-to-music generation.

What would settle it

If models trained strictly on this dataset produce consistently low CLAP scores or fail to match textual concepts in the subjective listening test, the benchmark would not support further progress.
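
A CLAP score is commonly computed as the cosine similarity between a prompt's text embedding and the generated clip's audio embedding, averaged over all prompts. The sketch below shows only that arithmetic, under the assumption that the embeddings have already been extracted; the vectors are hand-written toy values rather than real CLAP outputs, the function names are hypothetical, and the challenge's exact variant may differ.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def clap_style_score(text_embeddings, audio_embeddings):
    """Mean text-audio cosine similarity over paired (prompt, clip) embeddings."""
    if not text_embeddings or len(text_embeddings) != len(audio_embeddings):
        raise ValueError("need the same non-zero number of text and audio embeddings")
    sims = [cosine_similarity(t, a) for t, a in zip(text_embeddings, audio_embeddings)]
    return sum(sims) / len(sims)


if __name__ == "__main__":
    # Toy 2-D embeddings: the first pair is aligned, the second is orthogonal.
    text = [[1.0, 0.0], [0.0, 1.0]]
    audio = [[1.0, 0.0], [1.0, 0.0]]
    print(clap_style_score(text, audio))
```

Under this definition, a consistently low score means the generated audio sits far from its prompt in the shared embedding space, which is the failure signal described above.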

Original abstract

This paper presents an overview and the technical framework of the ICME 2026 Grand Challenge on Academic Text-to-Music Generation (ATTM). Despite the rapid progress in text-to-music generation (TTM) systems, the field is currently dominated by models trained on massive proprietary datasets with industrial-scale computational resources, creating a significant barrier for academic research. To address this, the ATTM Challenge establishes a fair-play benchmark that requires participants to train generative models strictly from scratch using a standardized, CC-licensed subset of the MTG-Jamendo dataset containing only instrumental music. The challenge is divided into two tracks: the Efficiency Track (limited to 500M parameters) and the Performance Track (no parameter limit). Submissions are evaluated through a multi-stage process involving objective metrics, including Fréchet Audio Distance (FAD), CLAP score, and a novel Concept Coverage Score (CCS), followed by a subjective listening test. By providing open-source baselines, preprocessing pipelines, reference captions, and public evaluation code for computing FAD and CLAP, this challenge aims to facilitate and promote TTM research in academic contexts.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript presents the framework for the ICME 2026 Grand Challenge on Academic Text-to-Music Generation (ATTM). It establishes two tracks (Efficiency limited to 500M parameters and Performance with no limit) in which participants must train generative models from scratch on a standardized CC-licensed instrumental-only subset of the MTG-Jamendo dataset. Submissions are evaluated via objective metrics (FAD, CLAP score, and a novel Concept Coverage Score) followed by subjective listening tests, with the organizers supplying open baselines, preprocessing pipelines, reference captions, and public evaluation code.

Significance. If the promised artifacts are released and the evaluation protocol proves stable, the challenge would meaningfully lower barriers to academic TTM research by replacing reliance on proprietary data and industrial compute with a reproducible, CC-licensed benchmark. The explicit provision of baselines and evaluation code is a concrete strength that supports reproducibility and fair comparison.

major comments (1)

Evaluation Methods section: the novel Concept Coverage Score (CCS) is listed as one of the three objective metrics used to rank submissions, yet the manuscript supplies no formal definition, no computational procedure, and no preliminary validation (e.g., correlation with human ratings or behavior on existing TTM models). Because CCS participates directly in the multi-stage ranking, its untested status is load-bearing for the claim that the benchmark is fair and informative.

minor comments (2)

Dataset description: the paper asserts that the released subset contains only instrumental music, but does not report the filtering procedure, final track count, or genre/instrumental diversity statistics; adding these numbers would strengthen the claim that the resource is sufficient for meaningful training.
The abstract and framework description promise public evaluation code for FAD and CLAP; the final version should include permanent links (GitHub commit hash or DOI) to the exact code and dataset subset that participants will use.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for minor revision. We address the single major comment below and will update the manuscript accordingly to strengthen the description of the evaluation protocol.

Point-by-point responses

Referee: Evaluation Methods section: the novel Concept Coverage Score (CCS) is listed as one of the three objective metrics used to rank submissions, yet the manuscript supplies no formal definition, no computational procedure, and no preliminary validation (e.g., correlation with human ratings or behavior on existing TTM models). Because CCS participates directly in the multi-stage ranking, its untested status is load-bearing for the claim that the benchmark is fair and informative.

Authors: We agree that the current manuscript does not include a formal definition, computational procedure, or preliminary validation results for the Concept Coverage Score (CCS). This omission limits the transparency of the proposed evaluation protocol. In the revised version we will add a dedicated subsection that (1) provides the mathematical definition of CCS as the fraction of musically relevant concepts from the reference captions that are detected in the generated audio via a pre-trained concept classifier, (2) details the exact computational steps including concept extraction, audio feature computation, and scoring, and (3) reports preliminary validation experiments on a small set of existing TTM models showing moderate correlation with human preference ratings. These additions will make the ranking procedure fully reproducible and address the concern about its load-bearing role in the benchmark.

Revision: yes
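
The simulated rebuttal describes CCS as the fraction of reference-caption concepts that are detected in the generated audio. A minimal sketch of that ratio follows, assuming concept extraction and the audio concept classifier already exist and return plain string labels. Both of those components, the function name, and the example labels are hypothetical; the paper's actual procedure is not given on this page.

```python
def concept_coverage(reference_concepts, detected_concepts):
    """Fraction of reference concepts that also appear among the detected ones.

    Labels are compared case-insensitively after trimming whitespace; this
    matching rule is an assumption made for the sketch.
    """
    reference = {c.strip().lower() for c in reference_concepts}
    detected = {c.strip().lower() for c in detected_concepts}
    if not reference:
        raise ValueError("at least one reference concept is required")
    return len(reference & detected) / len(reference)


if __name__ == "__main__":
    # Hypothetical concepts taken from a reference caption.
    caption_concepts = ["piano", "slow tempo", "jazz", "melancholic"]
    # Hypothetical output of a concept classifier run on the generated clip.
    classifier_output = ["Piano", "jazz", "drums"]
    print(concept_coverage(caption_concepts, classifier_output))
```

Extra detected concepts ("drums" in the toy data) do not lower the score in this formulation, so a ratio of this kind measures coverage of the prompt rather than precision of the generated content.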

Circularity Check

0 steps flagged

Descriptive challenge announcement with no circular derivations

Full rationale

The paper is a descriptive overview of the ATTM Grand Challenge, focused on establishing a fair-play benchmark by supplying a standardized CC-licensed instrumental subset of the MTG-Jamendo dataset, open-source baselines, preprocessing pipelines, reference captions, and public evaluation code for metrics like FAD and CLAP. No mathematical derivations, predictive equations, fitted parameters, or self-referential claims appear in the text. The central premise rests directly on the explicit provision of these external artifacts for participants to use, rather than on any internal reduction, self-citation load-bearing argument, or ansatz that collapses to the paper's own inputs. This is a standard, self-contained competition framework paper with independent content.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The paper is an organizational framework contribution rather than a theoretical derivation, so it introduces minimal free parameters and relies primarily on standard domain assumptions about audio datasets and evaluation metrics.

axioms (1)

Domain assumption: Instrumental music excerpts from the MTG-Jamendo dataset form a sufficient and representative training resource for text-to-music generation tasks.
The challenge design restricts training data to this specific CC-licensed instrumental subset and treats it as the standardized starting point for all participants.

invented entities (1)

Concept Coverage Score (CCS): no independent evidence
Purpose: To quantify how comprehensively generated music covers the semantic concepts described in the input text prompt.
The paper presents CCS as a new objective metric to complement FAD and CLAP within the multi-stage evaluation process.


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

Paper passage: “The ATTM Challenge establishes a fair-play benchmark that requires participants to train generative models strictly from scratch using a standardized, CC-licensed subset of the MTG-Jamendo dataset containing only instrumental music.”

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

25 extracted references · 25 canonical work pages · 4 internal anchors

[1] Wei-Jaw Lee et al., “Training-efficient text-to-music generation with state-space modeling,” arXiv preprint arXiv:2601.14786, 2026.
[2] Dmitry Bogdanov et al., “The MTG-Jamendo dataset for automatic music tagging,” in ICML, 2019.
[3] Kevin Kilgour et al., “Fréchet Audio Distance: A metric for evaluating music enhancement algorithms,” arXiv preprint arXiv:1812.08466, 2019.
[4] Benjamin Elizalde et al., “CLAP: Learning audio concepts from natural language supervision,” in ICASSP, 2023.
[5] Yusong Wu et al., “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” arXiv preprint arXiv:2211.06687, 2024.
[6] Zhengcong Fei et al., “FLUX that plays music,” arXiv preprint arXiv:2409.00587, 2024.
[7] Alexandre Défossez et al., “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022.
[8] Ju-Chiang Wang et al., “Mel-Band Roformer for music source separation,” arXiv preprint arXiv:2310.01809, 2023.
[9] Yunfei Chu et al., “Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models,” arXiv preprint arXiv:2311.07919, 2023.
[10] Sreyan Ghosh et al., “Music Flamingo: Scaling music understanding in audio language models,” arXiv preprint arXiv:2511.10289, 2025.
[11] Zach Evans et al., “Stable Audio Open,” in ICASSP, 2025.
[12] Jade Copet et al., “Simple and controllable music generation,” in NeurIPS, 2023.
[13] Xiquan Li et al., “MeanAudio: Fast and faithful text-to-audio generation with mean flows,” arXiv preprint arXiv:2508.06098, 2025.
[14] Hyung Won Chung et al., “Scaling instruction-finetuned language models,” Journal of Machine Learning Research, 2024.
[15] Jin Xu et al., “Qwen3-Omni technical report,” arXiv preprint arXiv:2509.17765, 2025.
[16] Arijit Biswas and Lars Villemoes, “Towards evaluating generative audio: Insights from neural audio codec embedding distances,” arXiv preprint arXiv:2509.18823, 2025.
[17] Florian Grötschla et al., “Benchmarking music generation models and metrics via human preference studies,” in ICASSP, 2025.
[18] Weiwei Li, “MeanAudio-S with DACO: Efficient text-to-music generation via rectified flow and distribution-aware posterior refinement,” in ICME Grand Challenge Paper, 2026.
[19] Anthony Wang and Shlomo Dubnov, “Efficient text-to-music generation via flow matching with bidirectional mamba SSM,” in ICME Grand Challenge Paper, 2026.
[20] Yonghyun Kim et al., “Improving text-to-music generation with human preference rewards,” in ICME Grand Challenge Paper, 2026.
[21] Junyoung Koh, “Instrumental text-to-music generation with auxiliary conditioning branches,” in ICME Grand Challenge Paper, 2026.
[22] Huakang Chen et al., “S2Accompanist: A semantic-aware and structure-guided diffusion model for music accompaniment generation,” in ICME Grand Challenge Paper, 2026.
[23] Yun-Chen Cheng, Tzu-Hung Huang, and Chih-Pin Tan, “Making the most of limited data: Score-aware training for text-to-music generation,” in ICME Grand Challenge Paper, 2026.
[24] Shunsuke Yoshida et al., “UT-AISTimprt submission for ICME 2026 grand challenge on academic text-to-music generation,” in ICME Grand Challenge Paper, 2026.
[25] Yuqing Cheng, Xingyu Ma, Guochen Yu, and Xiaotao Gu, “Modeling music as a time-frequency image: A 2D tokenizer for music generation,” in ICME Grand Challenge Paper, 2026.
