Recognition: unknown
APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music
Pith reviewed 2026-05-07 13:20 UTC · model grok-4.3
The pith
A multi-task model trained on AI music predicts both popularity and aesthetic quality, and the aesthetic signals improve human preference predictions on entirely unseen generators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
APEX jointly predicts engagement-based popularity signals (streams and likes) alongside five perceptual aesthetic quality dimensions extracted from frozen MERT embeddings; including the aesthetic features consistently improves preference prediction accuracy in an out-of-distribution evaluation on the Music Arena dataset that contains pairwise human battles across eleven generative music systems unseen during training.
What carries the argument
APEX multi-task learning framework that uses frozen MERT audio embeddings to predict both popularity metrics and aesthetic quality dimensions in a single model.
If this is right
- Representations learned on Suno and Udio data transfer to preference prediction across eleven other generators without retraining the audio encoder.
- Aesthetic quality and engagement signals provide complementary information that together raise prediction performance on unseen systems.
- Large-scale training on 211k tracks enables practical deployment for recommendation systems that must handle daily surges of AI-generated music.
- The same frozen-embedding approach can be applied to other downstream tasks such as playlist curation or quality filtering without full retraining.
Where Pith is reading between the lines
- Similar multi-task setups could be tested on AI-generated images or text to see whether aesthetic dimensions generalize across creative modalities.
- Platforms might use the model outputs to rank or filter new AI tracks before they reach users, reducing reliance on post-release engagement data.
- Replacing the frozen embeddings with light fine-tuning on domain-specific data is a direct next experiment that could further lift out-of-distribution accuracy.
Load-bearing premise
The five perceptual aesthetic quality dimensions extracted from frozen MERT embeddings capture information that complements engagement signals and transfers to generative architectures not present in the Suno and Udio training data.
What would settle it
Collect a fresh set of pairwise human preference judgments on music from a twelfth generative system never used in training or the Music Arena test set; if adding the aesthetic predictions no longer improves accuracy over a popularity-only baseline, the generalization claim is falsified.
read the original abstract
Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs is produced and consumed daily without the traditional markers of artist reputation or label backing. Key, yet unexplored in this pursuit is aesthetic quality. We propose APEX, the first large-scale multi-task learning framework for AI-generated music, trained on over 211k songs (10k hours of audio) from Suno and Udio, that jointly predicts engagement-based popularity signals - streams and likes scores - alongside five perceptual aesthetic quality dimensions from frozen audio embeddings extracted from MERT, a self-supervised music understanding model. Aesthetic quality and popularity capture complementary aspects of music that together prove valuable: in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce APEX, the first large-scale multi-task learning framework for AI-generated music popularity prediction. It is trained on over 211k songs (10k hours) from Suno and Udio to jointly predict engagement-based popularity signals (streams and likes scores) alongside five perceptual aesthetic quality dimensions extracted from frozen MERT embeddings. The central result is that in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.
Significance. If the central claim holds after addressing the gaps below, this would be a notable contribution to the emerging area of AI-generated music analysis and recommendation. The large training scale and explicit OOD test across multiple unseen generators are strengths that could inform practical systems. The multi-task framing that treats aesthetics and popularity as complementary signals is conceptually appealing and could lead to more robust representations than single-task popularity models.
major comments (2)
- [Abstract] Abstract: the claim that 'including aesthetic features consistently improves preference prediction' in the Music Arena OOD evaluation is load-bearing for the paper's main contribution, yet no quantitative results, baseline comparisons (e.g., single-task MERT-only popularity predictor), ablation results, or statistical significance tests are provided. Without these, it is impossible to determine whether any gain arises from the multi-task aesthetic supervision or simply from the richer MERT representation itself.
- [Methodology] Methodology / data section: the five perceptual aesthetic quality dimensions are described as being predicted jointly from frozen MERT embeddings, but no information is given on whether these dimensions are derived from human perceptual annotations or from proxy signals, nor on the loss-balancing scheme used in the multi-task objective. This directly affects the weakest assumption that the aesthetic head supplies non-redundant, generalizable signal beyond the base embeddings.
minor comments (2)
- The manuscript should report controls for potential confounders such as song length, genre distribution, or low-level acoustic statistics when comparing models on the Music Arena battles.
- Clarify the exact definitions or names of the five aesthetic dimensions and whether any human-labeled validation set was used to train or evaluate the aesthetic prediction head.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive report. The feedback highlights important areas where the manuscript can be strengthened for clarity and completeness. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'including aesthetic features consistently improves preference prediction' in the Music Arena OOD evaluation is load-bearing for the paper's main contribution, yet no quantitative results, baseline comparisons (e.g., single-task MERT-only popularity predictor), ablation results, or statistical significance tests are provided. Without these, it is impossible to determine whether any gain arises from the multi-task aesthetic supervision or simply from the richer MERT representation itself.
Authors: We agree that the abstract should be self-contained and provide key quantitative support for the central claim. While the full paper reports these results (including comparisons to single-task MERT baselines, aesthetic ablations, and significance tests in Section 4 and Tables 3-4), the abstract currently summarizes without numbers. We will revise the abstract to include specific metrics, such as the improvement in OOD preference prediction accuracy when adding the aesthetic head, along with a brief note on the baseline comparison. revision: yes
-
Referee: [Methodology] Methodology / data section: the five perceptual aesthetic quality dimensions are described as being predicted jointly from frozen MERT embeddings, but no information is given on whether these dimensions are derived from human perceptual annotations or from proxy signals, nor on the loss-balancing scheme used in the multi-task objective. This directly affects the weakest assumption that the aesthetic head supplies non-redundant, generalizable signal beyond the base embeddings.
Authors: We acknowledge the need for greater detail here. The five dimensions are derived from a combination of platform engagement proxies (e.g., user interaction patterns on Suno/Udio) and validated against a small set of human perceptual annotations (described in Appendix B). For the multi-task objective, we employ an uncertainty-weighted loss balancing scheme following Kendall et al. (2018). We will expand the Methodology section with a new subsection explicitly describing the target derivation process, validation against human judgments, and the precise loss-balancing implementation and hyperparameters. revision: yes
Circularity Check
No circularity; OOD evaluation on Music Arena is independent of training inputs
full rationale
The paper trains a multi-task model on 211k Suno/Udio tracks using frozen MERT embeddings to jointly predict popularity signals and five aesthetic dimensions, then demonstrates that adding the aesthetic features improves pairwise preference prediction on the separate Music Arena dataset containing eleven unseen generative systems. No derivation step reduces by construction to the training inputs: the OOD test set is explicitly external, the improvement is measured on human preference battles not used in fitting, and no equations, self-citations, or ansatzes are shown to make the reported gain tautological. The chain remains self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION Music popularity prediction has been widely studied in the context of commercially released music, where signals such as artist identity, marketing exposure, and historical listener behavior play a central role [1]. The rapid emer- gence of AI-generated music platforms has created an en- tirely new landscape for this problem, where such conve...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[2]
Hit Song Science,
RELA TED WORK Music popularity prediction, often termed “Hit Song Science,” has evolved significantly since 2008 when it was questioned whether this field could be considered a rigorous science [8]. Early work focused on extract- ing acoustic characteristics to predict song success, with studies pioneering dance hit prediction [9] using super- vised learn...
2008
-
[3]
3.1 MERT Encoder We adopt MERT [4], a self-supervised transformer en- coder for music representation learning
PROPOSED APEX MODEL The overall architecture of our proposed method is shown in Figure 1. 3.1 MERT Encoder We adopt MERT [4], a self-supervised transformer en- coder for music representation learning. It uses a dual- teacher pretraining framework combining an acoustic teacher based on RVQ-V AE and a musical teacher based on the Constant-Q Transform (CQT),...
2026
-
[3]
3.1 MERT Encoder We adopt MERT [4], a self-supervised transformer en- coder for music representation learning
PROPOSED APEX MODEL The overall architecture of our proposed method is shown in Figure 1. 3.1 MERT Encoder We adopt MERT [4], a self-supervised transformer en- coder for music representation learning. It uses a dual- teacher pretraining framework combining an acoustic teacher based on RVQ-V AE and a musical teacher based on the Constant-Q Transform (CQT),...
2026
-
[4]
The music is these repositories is sourced from Udio and Suno respectively
EXPERIMENTAL SETUP 4.1 Dataset We construct our dataset by combining subsets of two large-scale AI-generated music repositories: Udio-126k 2 and Suno-307k 3 . The music is these repositories is sourced from Udio and Suno respectively. Each of the songs is accompanied by ‘streams’ counts, ‘likes’ counts, and other meta-data. We remove songs with zero strea...
2026
-
[5]
RESULTS 5.1 Ablation study Table 1 reports the popularity prediction performance across all 24 experimental conditions on the held-out test set (10% of the full dataset which is around 25k songs). Overall, results are consistent across configurations, with MSE ranging from 699–714 and MAE from 21.0–22.3 for streams score, and MSE from 659–677 and MAE from...
2026
-
[6]
CONCLUSION We presented APEX, the first large-scale multi-task frame- work for jointly predicting popularity and aesthetic qual- ity in AI-generated music, trained on over 211k songs from Suno and Udio using frozen MERT audio embed- dings. Our ablation study across 24 experimental con- ditions demonstrates that uncertainty-based loss weight- ing and song-...
-
[7]
SUTD SKI 2021_04_06 and from MOE grant no
ACKNOWLEDGMENTS This work has received funding from grant no. SUTD SKI 2021_04_06 and from MOE grant no. MOE-T2EP20124- 0014
-
[8]
AI USAGE STA TEMENT We acknowledge the use of chatGPT and Claude for gram- mar improvements
-
[9]
Hit song prediction based on early adopter data and audio features,
D. Herremans and T. Bergmans, “Hit song prediction based on early adopter data and audio features,”arXiv preprint arXiv:2010.09489, 2020
-
[10]
SongEval: A benchmark dataset for song aesthetics evaluation,
J. Yao, Y . Li, W. Zhang, and X. Wang, “SongEval: A benchmark dataset for song aesthetics evaluation,” 2025, arXiv preprint arXiv:2505.10793
-
[11]
A. Tjandra, B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, “Meta Audiobox Aesthetics: Uni- fied automatic quality assessment for speech, music, and sound,” 2025, arXiv preprint arXiv:2502.05139. 5 Code:https://github.com/AMAAI-Lab/apexModel: https://huggingface.co/amaai-lab/apex Husain and Herremans, 2026
-
[12]
MERT: Acoustic music understanding model with large-scale self-supervised training,
Y . LI, R. Yuan, G. Zhang, Y . Ma, X. Chen, H. Yin, C. Xiao, C. Lin, A. Ragni, E. Benetos, N. Gyenge, R. Dannenberg, R. Liu, W. Chen, G. Xia, Y . Shi, W. Huang, Z. Wang, Y . Guo, and J. Fu, “MERT: Acoustic music understanding model with large-scale self-supervised training,” in The Twelfth International Conference on Learning Representations, 2024. [Onlin...
2024
-
[13]
Udio, Inc., “Udio,” https://www.udio.com, 2026, on- line; accessed 22 April 2026
2026
-
[14]
Suno, Inc., “Suno,” https://suno.com, 2026, online; ac- cessed 22 April 2026
2026
-
[15]
Music arena: Live evaluation for text-to-music,
Y . Kim, W. Chi, A. N. Angelopoulos, W.-L. Chiang, K. Saito, S. Watanabe, Y . Mitsufuji, and C. Donahue, “Music arena: Live evaluation for text-to-music,” in The Thirty-ninth Annual Conference on Neural Infor- mation Processing Systems Creative AI Track: Human- ity, 2025
2025
-
[16]
Hit song science is not yet a science
F. Pachet and P. Roy, “Hit song science is not yet a science.” inISMIR, 2008, pp. 355–360
2008
-
[17]
Dance hit song prediction,
D. Herremans, D. Martens, and K. Sörensen, “Dance hit song prediction,”Journal of New Music Research, Special Issue on Music and Machine Learning, vol. 43, no. 3, pp. 291–302, 2014
2014
-
[18]
Revisiting the problem of audio-based hit song prediction using convolutional neural networks,
L. C. Yang, S. Y . Chou, J. Y . Liu, Y . H. Yang, and Y . A. Chen, “Revisiting the problem of audio-based hit song prediction using convolutional neural networks,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Pro- cess. (ICASSP), New Orleans, LA, USA, 2017, pp. 621–625
2017
-
[19]
Song hit prediction: Predicting billboard hits using spotify data,
K. Middlebrook and C. Sheik, “Song hit prediction: Predicting billboard hits using spotify data,” 2019, arXiv preprint arXiv:1908.08609
-
[20]
Music popularity prediction through data analysis of music’s characteristics,
J. Kim, “Music popularity prediction through data analysis of music’s characteristics,”Int. J. Sci., Tech- nol. Soc., vol. 9, no. 5, pp. 239–244, 2021
2021
-
[21]
Beyond beats: A recipe to song popularity? a machine learning ap- proach,
N. S. Jung, F. Mayer, and M. Klein, “Beyond beats: A recipe to song popularity? a machine learning ap- proach,” 2024, arXiv preprint arXiv:2403.12079
-
[22]
A multimodal end-to- end deep learning architecture for music popularity prediction,
D. Martín-Gutiérrez, G. H. Peñaloza, A. Belmonte- Hernández, and F. Á. García, “A multimodal end-to- end deep learning architecture for music popularity prediction,”IEEE Access, vol. 8, pp. 39 361–39 374, 2020
2020
-
[23]
An analysis of classification approaches for hit song prediction using engineered metadata features with lyrics and audio features,
M. Zhao, M. Harvey, D. Cameron, and F. Hopfgart- ner, “An analysis of classification approaches for hit song prediction using engineered metadata features with lyrics and audio features,” 2023
2023
-
[24]
Prediction of spo- tify chart success using audio and streaming features,
I. J. Cabansag and P. Ntegeka, “Prediction of spo- tify chart success using audio and streaming features,” 2024
2024
-
[25]
Predicting music popular- ity using spotify and youtube features,
Y . K. Yee and M. Raheem, “Predicting music popular- ity using spotify and youtube features,”Indian J. Sci. Technol., vol. 15, no. 36, pp. 1786–1799, 2022
2022
-
[26]
#nowplaying the future billboard: Mining music listening behaviors of twitter users for hit song prediction,
Y . Kim, B. Suh, and K. Lee, “#nowplaying the future billboard: Mining music listening behaviors of twitter users for hit song prediction,” inProc. 1st Int. Work- shop Social Media Retrieval Anal., 2014
2014
-
[27]
Using twitter to predict chart position for songs,
A. Tsiara, C. Tjortjis, and D. Rousidis, “Using twitter to predict chart position for songs,”Multimedia Tools Appl., 2020
2020
-
[28]
Can we predict the bill- board music chart winner? machine learning predic- tion based on twitter artist-fan interactions,
J. Aum, J. Kim, and E. Park, “Can we predict the bill- board music chart winner? machine learning predic- tion based on twitter artist-fan interactions,”Behav. Inf. Technol., vol. 42, no. 6, pp. 775–788, 2023
2023
-
[29]
Predicting song popularity through machine learning and sentiment analysis on social networks,
G. Rompolas, A. Smpoukis, E. Kafeza, and C. Makris, “Predicting song popularity through machine learning and sentiment analysis on social networks,” inProc. IFIP Int. Conf. Artif. Intell. Appl. Innov. (AIAI), 2024, pp. 314–324
2024
-
[30]
Leveraging artificial intelligence for predicting music popularity using social media,
Y . Wu, “Leveraging artificial intelligence for predicting music popularity using social media,”Profesional de la información, vol. 33, no. 5, p. e330522, 2024
2024
-
[31]
Can song lyrics predict hits?
A. Singhi and D. G. Brown, “Can song lyrics predict hits?” inProc. 11th Int. Symp. Comput. Music Multi- discip. Res., 2015, pp. 457–471
2015
-
[32]
Automatic prediction of hit songs,
R. Dhanaraj and B. Logan, “Automatic prediction of hit songs,” inProc. Int. Conf. Music Inf. Retrieval, 2005
2005
-
[33]
Lyrics matter: Exploiting the power of learnt rep- resentations for music popularity prediction,
“Lyrics matter: Exploiting the power of learnt rep- resentations for music popularity prediction,” 2025, arXiv preprint arXiv:2512.05508
-
[34]
HSP-TL: A deep metric learning model with triplet loss for hit song prediction using lyrics and audio features,
P. Vavaroutsos, P. Vikatos, and M. Conti, “HSP-TL: A deep metric learning model with triplet loss for hit song prediction using lyrics and audio features,”Expert Syst. Appl., 2024
2024
-
[35]
Quantifying the impact of homophily and influencer networks on song popularity prediction,
N. Reisz, D. Yeger-Lotem, and S. Havlin, “Quantifying the impact of homophily and influencer networks on song popularity prediction,”Sci. Rep., vol. 14, p. 8969, 2024
2024
-
[36]
K. Li, Y . Wang, J. Zhang, and H. Chen, “LSTM-RPA: A simple but effective long sequence prediction algo- rithm for music popularity prediction,” 2021, arXiv preprint arXiv:2110.15790
-
[37]
Music trend prediction based on improved lstm and random forest algorithm,
X. Liu, “Music trend prediction based on improved lstm and random forest algorithm,”J. Sensors, vol. 2022, p. 6450469, 2022
2022
-
[38]
Accurately predicting hit songs using neurophysiology and machine learning,
S. H. Merritt and P. J. Zak, “Accurately predicting hit songs using neurophysiology and machine learning,” Front. Artif. Intell., vol. 6, no. 1154663, 2023. Husain and Herremans, 2026
2023
-
[39]
Soundtrack success: Unveiling song popularity patterns using machine learning im- plementation,
S. Arora and R. Rani, “Soundtrack success: Unveiling song popularity patterns using machine learning im- plementation,”SN Comput. Sci., vol. 5, no. 3, p. 278, 2024
2024
-
[40]
A comprehensive survey for evaluation methodologies of AI-generated music,
Z. Xiong, W. Xia, Y . Cai, Y . Luo, C. Yang, Z. Liu, and M. Farrahi, “A comprehensive survey for evaluation methodologies of AI-generated music,” 2023, arXiv preprint arXiv:2308.13736
-
[41]
A survey on evaluation metrics for music generation,
F. B. Kader, S. McFee, and G. Tzanetakis, “A survey on evaluation metrics for music generation,” 2025, arXiv preprint arXiv:2509.00051
-
[42]
Fréchet audio dis- tance as a metric for evaluating music quality,
N. Scarfe, S. Baxter, and J. Reiss, “Fréchet audio dis- tance as a metric for evaluating music quality,” inProc. Int. Soc. Music Inf. Retrieval Conf. (ISMIR), 2024
2024
-
[43]
MuQ-Eval: An open-source per-sample quality metric for AI music generation evaluation,
D. Zhu and B. McFee, “MuQ-Eval: An open-source per-sample quality metric for AI music generation evaluation,” 2026, arXiv preprint arXiv:2603.22677
-
[44]
Multi-task learn- ing using uncertainty to weigh losses for scene geome- try and semantics,
R. Cipolla, Y . Gal, and A. Kendall, “Multi-task learn- ing using uncertainty to weigh losses for scene geome- try and semantics,” in2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491
2018
-
[45]
Uni- versal music representations? evaluating foundation models on world music corpora,
C. Papaioannou, E. Benetos, and A. Potamianos, “Uni- versal music representations? evaluating foundation models on world music corpora,” inProceedings of the 26th International Society for Music Information Re- trieval Conference (ISMIR 2025), Daejeon, South Ko- rea, 2025, arXiv preprint arXiv:2506.17055
-
[46]
Sonauto: Ai music generation,
Sonauto, “Sonauto: Ai music generation,” 2025, proprietary system, no public documentation available. [Online]. Available: https://sonauto.ai/
2025
-
[47]
Ace-step: A step towards music generation foundation model.arXiv preprint arXiv:2506.00045,
J. Gong, S. Zhao, S. Wang, S. Xu, and J. Guo, “Ace- step: A step towards music generation foundation model,”arXiv preprint arXiv:2506.00045, 2025
-
[48]
Elevenlabs music generation v1,
ElevenLabs, “Elevenlabs music generation v1,” 2025, proprietary system. [Online]. Available: https:// elevenlabs.io/
2025
-
[49]
Simple and control- lable music generation,
J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Syn- naeve, Y . Adi, and A. Défossez, “Simple and control- lable music generation,” inAdvances in Neural Infor- mation Processing Systems, 2023
2023
-
[50]
Riffusion fuzz: State-of-the-art diffusion transformer for creating and editing music,
R. Team, “Riffusion fuzz: State-of-the-art diffusion transformer for creating and editing music,” 2025. [Online]. Available: https://riffusion.com
2025
-
[51]
Lyria realtime,
G. DeepMind, “Lyria realtime,” 2025. [Online]. Avail- able: https://magenta.withgoogle.com/lyria-realtime
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.