pith. machine review for the scientific record. sign in

arxiv: 2605.03395 · v1 · submitted 2026-05-05 · 💻 cs.SD · cs.AI· cs.LG· cs.MM

Recognition: unknown

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

Dorien Herremans, Jaavid Aktar Husain

Pith reviewed 2026-05-07 13:20 UTC · model grok-4.3

classification 💻 cs.SD cs.AIcs.LGcs.MM
keywords AI-generated musicpopularity predictionaesthetic qualitymulti-task learningpreference predictionMERT embeddingsMusic Arena dataset
0
0 comments X

The pith

A multi-task model trained on AI music predicts both popularity and aesthetic quality, and the aesthetic signals improve human preference predictions on entirely unseen generators.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents APEX, a framework that processes over 211,000 AI-generated tracks from Suno and Udio to forecast streams, likes, and five perceptual aesthetic dimensions at once. It extracts these features from frozen embeddings of a self-supervised music model rather than training new audio representations from scratch. The central demonstration is that adding the aesthetic predictions consistently raises accuracy when forecasting which tracks humans prefer in head-to-head battles on the Music Arena dataset, even though those eleven generative systems were never seen during training. This matters because AI music lacks traditional signals like artist reputation, so models that capture both engagement and perceived quality can support better recommendation and curation on platforms.

Core claim

APEX jointly predicts engagement-based popularity signals (streams and likes) alongside five perceptual aesthetic quality dimensions extracted from frozen MERT embeddings; including the aesthetic features consistently improves preference prediction accuracy in an out-of-distribution evaluation on the Music Arena dataset that contains pairwise human battles across eleven generative music systems unseen during training.

What carries the argument

APEX multi-task learning framework that uses frozen MERT audio embeddings to predict both popularity metrics and aesthetic quality dimensions in a single model.

If this is right

  • Representations learned on Suno and Udio data transfer to preference prediction across eleven other generators without retraining the audio encoder.
  • Aesthetic quality and engagement signals provide complementary information that together raise prediction performance on unseen systems.
  • Large-scale training on 211k tracks enables practical deployment for recommendation systems that must handle daily surges of AI-generated music.
  • The same frozen-embedding approach can be applied to other downstream tasks such as playlist curation or quality filtering without full retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar multi-task setups could be tested on AI-generated images or text to see whether aesthetic dimensions generalize across creative modalities.
  • Platforms might use the model outputs to rank or filter new AI tracks before they reach users, reducing reliance on post-release engagement data.
  • Replacing the frozen embeddings with light fine-tuning on domain-specific data is a direct next experiment that could further lift out-of-distribution accuracy.

Load-bearing premise

The five perceptual aesthetic quality dimensions extracted from frozen MERT embeddings capture information that complements engagement signals and transfers to generative architectures not present in the Suno and Udio training data.

What would settle it

Collect a fresh set of pairwise human preference judgments on music from a twelfth generative system never used in training or the Music Arena test set; if adding the aesthetic predictions no longer improves accuracy over a popularity-only baseline, the generalization claim is falsified.

read the original abstract

Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs is produced and consumed daily without the traditional markers of artist reputation or label backing. Key, yet unexplored in this pursuit is aesthetic quality. We propose APEX, the first large-scale multi-task learning framework for AI-generated music, trained on over 211k songs (10k hours of audio) from Suno and Udio, that jointly predicts engagement-based popularity signals - streams and likes scores - alongside five perceptual aesthetic quality dimensions from frozen audio embeddings extracted from MERT, a self-supervised music understanding model. Aesthetic quality and popularity capture complementary aspects of music that together prove valuable: in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce APEX, the first large-scale multi-task learning framework for AI-generated music popularity prediction. It is trained on over 211k songs (10k hours) from Suno and Udio to jointly predict engagement-based popularity signals (streams and likes scores) alongside five perceptual aesthetic quality dimensions extracted from frozen MERT embeddings. The central result is that in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.

Significance. If the central claim holds after addressing the gaps below, this would be a notable contribution to the emerging area of AI-generated music analysis and recommendation. The large training scale and explicit OOD test across multiple unseen generators are strengths that could inform practical systems. The multi-task framing that treats aesthetics and popularity as complementary signals is conceptually appealing and could lead to more robust representations than single-task popularity models.

major comments (2)
  1. [Abstract] Abstract: the claim that 'including aesthetic features consistently improves preference prediction' in the Music Arena OOD evaluation is load-bearing for the paper's main contribution, yet no quantitative results, baseline comparisons (e.g., single-task MERT-only popularity predictor), ablation results, or statistical significance tests are provided. Without these, it is impossible to determine whether any gain arises from the multi-task aesthetic supervision or simply from the richer MERT representation itself.
  2. [Methodology] Methodology / data section: the five perceptual aesthetic quality dimensions are described as being predicted jointly from frozen MERT embeddings, but no information is given on whether these dimensions are derived from human perceptual annotations or from proxy signals, nor on the loss-balancing scheme used in the multi-task objective. This directly affects the weakest assumption that the aesthetic head supplies non-redundant, generalizable signal beyond the base embeddings.
minor comments (2)
  1. The manuscript should report controls for potential confounders such as song length, genre distribution, or low-level acoustic statistics when comparing models on the Music Arena battles.
  2. Clarify the exact definitions or names of the five aesthetic dimensions and whether any human-labeled validation set was used to train or evaluate the aesthetic prediction head.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive report. The feedback highlights important areas where the manuscript can be strengthened for clarity and completeness. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'including aesthetic features consistently improves preference prediction' in the Music Arena OOD evaluation is load-bearing for the paper's main contribution, yet no quantitative results, baseline comparisons (e.g., single-task MERT-only popularity predictor), ablation results, or statistical significance tests are provided. Without these, it is impossible to determine whether any gain arises from the multi-task aesthetic supervision or simply from the richer MERT representation itself.

    Authors: We agree that the abstract should be self-contained and provide key quantitative support for the central claim. While the full paper reports these results (including comparisons to single-task MERT baselines, aesthetic ablations, and significance tests in Section 4 and Tables 3-4), the abstract currently summarizes without numbers. We will revise the abstract to include specific metrics, such as the improvement in OOD preference prediction accuracy when adding the aesthetic head, along with a brief note on the baseline comparison. revision: yes

  2. Referee: [Methodology] Methodology / data section: the five perceptual aesthetic quality dimensions are described as being predicted jointly from frozen MERT embeddings, but no information is given on whether these dimensions are derived from human perceptual annotations or from proxy signals, nor on the loss-balancing scheme used in the multi-task objective. This directly affects the weakest assumption that the aesthetic head supplies non-redundant, generalizable signal beyond the base embeddings.

    Authors: We acknowledge the need for greater detail here. The five dimensions are derived from a combination of platform engagement proxies (e.g., user interaction patterns on Suno/Udio) and validated against a small set of human perceptual annotations (described in Appendix B). For the multi-task objective, we employ an uncertainty-weighted loss balancing scheme following Kendall et al. (2018). We will expand the Methodology section with a new subsection explicitly describing the target derivation process, validation against human judgments, and the precise loss-balancing implementation and hyperparameters. revision: yes

Circularity Check

0 steps flagged

No circularity; OOD evaluation on Music Arena is independent of training inputs

full rationale

The paper trains a multi-task model on 211k Suno/Udio tracks using frozen MERT embeddings to jointly predict popularity signals and five aesthetic dimensions, then demonstrates that adding the aesthetic features improves pairwise preference prediction on the separate Music Arena dataset containing eleven unseen generative systems. No derivation step reduces by construction to the training inputs: the OOD test set is explicitly external, the improvement is measured on human preference battles not used in fitting, and no equations, self-citations, or ansatzes are shown to make the reported gain tautological. The chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; standard machine learning assumptions such as the utility of frozen self-supervised embeddings and the complementarity of aesthetic and engagement signals are implicit but unstated.

pith-pipeline@v0.9.0 · 5491 in / 1265 out tokens · 117673 ms · 2026-05-07T13:20:49.052355+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

52 extracted references · 13 canonical work pages · 1 internal anchor

  1. [1]

    INTRODUCTION Music popularity prediction has been widely studied in the context of commercially released music, where signals such as artist identity, marketing exposure, and historical listener behavior play a central role [1]. The rapid emer- gence of AI-generated music platforms has created an en- tirely new landscape for this problem, where such conve...

  2. [2]

    Hit Song Science,

    RELA TED WORK Music popularity prediction, often termed “Hit Song Science,” has evolved significantly since 2008 when it was questioned whether this field could be considered a rigorous science [8]. Early work focused on extract- ing acoustic characteristics to predict song success, with studies pioneering dance hit prediction [9] using super- vised learn...

  3. [3]

    3.1 MERT Encoder We adopt MERT [4], a self-supervised transformer en- coder for music representation learning

    PROPOSED APEX MODEL The overall architecture of our proposed method is shown in Figure 1. 3.1 MERT Encoder We adopt MERT [4], a self-supervised transformer en- coder for music representation learning. It uses a dual- teacher pretraining framework combining an acoustic teacher based on RVQ-V AE and a musical teacher based on the Constant-Q Transform (CQT),...

  4. [3]

    3.1 MERT Encoder We adopt MERT [4], a self-supervised transformer en- coder for music representation learning

    PROPOSED APEX MODEL The overall architecture of our proposed method is shown in Figure 1. 3.1 MERT Encoder We adopt MERT [4], a self-supervised transformer en- coder for music representation learning. It uses a dual- teacher pretraining framework combining an acoustic teacher based on RVQ-V AE and a musical teacher based on the Constant-Q Transform (CQT),...

  5. [4]

    The music is these repositories is sourced from Udio and Suno respectively

    EXPERIMENTAL SETUP 4.1 Dataset We construct our dataset by combining subsets of two large-scale AI-generated music repositories: Udio-126k 2 and Suno-307k 3 . The music is these repositories is sourced from Udio and Suno respectively. Each of the songs is accompanied by ‘streams’ counts, ‘likes’ counts, and other meta-data. We remove songs with zero strea...

  6. [5]

    RESULTS 5.1 Ablation study Table 1 reports the popularity prediction performance across all 24 experimental conditions on the held-out test set (10% of the full dataset which is around 25k songs). Overall, results are consistent across configurations, with MSE ranging from 699–714 and MAE from 21.0–22.3 for streams score, and MSE from 659–677 and MAE from...

  7. [6]

    CONCLUSION We presented APEX, the first large-scale multi-task frame- work for jointly predicting popularity and aesthetic qual- ity in AI-generated music, trained on over 211k songs from Suno and Udio using frozen MERT audio embed- dings. Our ablation study across 24 experimental con- ditions demonstrates that uncertainty-based loss weight- ing and song-...

  8. [7]

    SUTD SKI 2021_04_06 and from MOE grant no

    ACKNOWLEDGMENTS This work has received funding from grant no. SUTD SKI 2021_04_06 and from MOE grant no. MOE-T2EP20124- 0014

  9. [8]

    AI USAGE STA TEMENT We acknowledge the use of chatGPT and Claude for gram- mar improvements

  10. [9]

    Hit song prediction based on early adopter data and audio features,

    D. Herremans and T. Bergmans, “Hit song prediction based on early adopter data and audio features,”arXiv preprint arXiv:2010.09489, 2020

  11. [10]

    SongEval: A benchmark dataset for song aesthetics evaluation,

    J. Yao, Y . Li, W. Zhang, and X. Wang, “SongEval: A benchmark dataset for song aesthetics evaluation,” 2025, arXiv preprint arXiv:2505.10793

  12. [11]

    Meta audiobox aesthetics: Unified automatic quality assessment for speech, music, and sound.arXiv preprint arXiv:2502.05139,

    A. Tjandra, B. Sisman, M. Zhang, S. Sakti, H. Li, and S. Nakamura, “Meta Audiobox Aesthetics: Uni- fied automatic quality assessment for speech, music, and sound,” 2025, arXiv preprint arXiv:2502.05139. 5 Code:https://github.com/AMAAI-Lab/apexModel: https://huggingface.co/amaai-lab/apex Husain and Herremans, 2026

  13. [12]

    MERT: Acoustic music understanding model with large-scale self-supervised training,

    Y . LI, R. Yuan, G. Zhang, Y . Ma, X. Chen, H. Yin, C. Xiao, C. Lin, A. Ragni, E. Benetos, N. Gyenge, R. Dannenberg, R. Liu, W. Chen, G. Xia, Y . Shi, W. Huang, Z. Wang, Y . Guo, and J. Fu, “MERT: Acoustic music understanding model with large-scale self-supervised training,” in The Twelfth International Conference on Learning Representations, 2024. [Onlin...

  14. [13]

    Udio, Inc., “Udio,” https://www.udio.com, 2026, on- line; accessed 22 April 2026

  15. [14]

    Suno, Inc., “Suno,” https://suno.com, 2026, online; ac- cessed 22 April 2026

  16. [15]

    Music arena: Live evaluation for text-to-music,

    Y . Kim, W. Chi, A. N. Angelopoulos, W.-L. Chiang, K. Saito, S. Watanabe, Y . Mitsufuji, and C. Donahue, “Music arena: Live evaluation for text-to-music,” in The Thirty-ninth Annual Conference on Neural Infor- mation Processing Systems Creative AI Track: Human- ity, 2025

  17. [16]

    Hit song science is not yet a science

    F. Pachet and P. Roy, “Hit song science is not yet a science.” inISMIR, 2008, pp. 355–360

  18. [17]

    Dance hit song prediction,

    D. Herremans, D. Martens, and K. Sörensen, “Dance hit song prediction,”Journal of New Music Research, Special Issue on Music and Machine Learning, vol. 43, no. 3, pp. 291–302, 2014

  19. [18]

    Revisiting the problem of audio-based hit song prediction using convolutional neural networks,

    L. C. Yang, S. Y . Chou, J. Y . Liu, Y . H. Yang, and Y . A. Chen, “Revisiting the problem of audio-based hit song prediction using convolutional neural networks,” inProc. IEEE Int. Conf. Acoust., Speech, Signal Pro- cess. (ICASSP), New Orleans, LA, USA, 2017, pp. 621–625

  20. [19]

    Song hit prediction: Predicting billboard hits using spotify data,

    K. Middlebrook and C. Sheik, “Song hit prediction: Predicting billboard hits using spotify data,” 2019, arXiv preprint arXiv:1908.08609

  21. [20]

    Music popularity prediction through data analysis of music’s characteristics,

    J. Kim, “Music popularity prediction through data analysis of music’s characteristics,”Int. J. Sci., Tech- nol. Soc., vol. 9, no. 5, pp. 239–244, 2021

  22. [21]

    Beyond beats: A recipe to song popularity? a machine learning ap- proach,

    N. S. Jung, F. Mayer, and M. Klein, “Beyond beats: A recipe to song popularity? a machine learning ap- proach,” 2024, arXiv preprint arXiv:2403.12079

  23. [22]

    A multimodal end-to- end deep learning architecture for music popularity prediction,

    D. Martín-Gutiérrez, G. H. Peñaloza, A. Belmonte- Hernández, and F. Á. García, “A multimodal end-to- end deep learning architecture for music popularity prediction,”IEEE Access, vol. 8, pp. 39 361–39 374, 2020

  24. [23]

    An analysis of classification approaches for hit song prediction using engineered metadata features with lyrics and audio features,

    M. Zhao, M. Harvey, D. Cameron, and F. Hopfgart- ner, “An analysis of classification approaches for hit song prediction using engineered metadata features with lyrics and audio features,” 2023

  25. [24]

    Prediction of spo- tify chart success using audio and streaming features,

    I. J. Cabansag and P. Ntegeka, “Prediction of spo- tify chart success using audio and streaming features,” 2024

  26. [25]

    Predicting music popular- ity using spotify and youtube features,

    Y . K. Yee and M. Raheem, “Predicting music popular- ity using spotify and youtube features,”Indian J. Sci. Technol., vol. 15, no. 36, pp. 1786–1799, 2022

  27. [26]

    #nowplaying the future billboard: Mining music listening behaviors of twitter users for hit song prediction,

    Y . Kim, B. Suh, and K. Lee, “#nowplaying the future billboard: Mining music listening behaviors of twitter users for hit song prediction,” inProc. 1st Int. Work- shop Social Media Retrieval Anal., 2014

  28. [27]

    Using twitter to predict chart position for songs,

    A. Tsiara, C. Tjortjis, and D. Rousidis, “Using twitter to predict chart position for songs,”Multimedia Tools Appl., 2020

  29. [28]

    Can we predict the bill- board music chart winner? machine learning predic- tion based on twitter artist-fan interactions,

    J. Aum, J. Kim, and E. Park, “Can we predict the bill- board music chart winner? machine learning predic- tion based on twitter artist-fan interactions,”Behav. Inf. Technol., vol. 42, no. 6, pp. 775–788, 2023

  30. [29]

    Predicting song popularity through machine learning and sentiment analysis on social networks,

    G. Rompolas, A. Smpoukis, E. Kafeza, and C. Makris, “Predicting song popularity through machine learning and sentiment analysis on social networks,” inProc. IFIP Int. Conf. Artif. Intell. Appl. Innov. (AIAI), 2024, pp. 314–324

  31. [30]

    Leveraging artificial intelligence for predicting music popularity using social media,

    Y . Wu, “Leveraging artificial intelligence for predicting music popularity using social media,”Profesional de la información, vol. 33, no. 5, p. e330522, 2024

  32. [31]

    Can song lyrics predict hits?

    A. Singhi and D. G. Brown, “Can song lyrics predict hits?” inProc. 11th Int. Symp. Comput. Music Multi- discip. Res., 2015, pp. 457–471

  33. [32]

    Automatic prediction of hit songs,

    R. Dhanaraj and B. Logan, “Automatic prediction of hit songs,” inProc. Int. Conf. Music Inf. Retrieval, 2005

  34. [33]

    Lyrics matter: Exploiting the power of learnt rep- resentations for music popularity prediction,

    “Lyrics matter: Exploiting the power of learnt rep- resentations for music popularity prediction,” 2025, arXiv preprint arXiv:2512.05508

  35. [34]

    HSP-TL: A deep metric learning model with triplet loss for hit song prediction using lyrics and audio features,

    P. Vavaroutsos, P. Vikatos, and M. Conti, “HSP-TL: A deep metric learning model with triplet loss for hit song prediction using lyrics and audio features,”Expert Syst. Appl., 2024

  36. [35]

    Quantifying the impact of homophily and influencer networks on song popularity prediction,

    N. Reisz, D. Yeger-Lotem, and S. Havlin, “Quantifying the impact of homophily and influencer networks on song popularity prediction,”Sci. Rep., vol. 14, p. 8969, 2024

  37. [36]

    LSTM-RPA: A simple but effective long sequence prediction algo- rithm for music popularity prediction,

    K. Li, Y . Wang, J. Zhang, and H. Chen, “LSTM-RPA: A simple but effective long sequence prediction algo- rithm for music popularity prediction,” 2021, arXiv preprint arXiv:2110.15790

  38. [37]

    Music trend prediction based on improved lstm and random forest algorithm,

    X. Liu, “Music trend prediction based on improved lstm and random forest algorithm,”J. Sensors, vol. 2022, p. 6450469, 2022

  39. [38]

    Accurately predicting hit songs using neurophysiology and machine learning,

    S. H. Merritt and P. J. Zak, “Accurately predicting hit songs using neurophysiology and machine learning,” Front. Artif. Intell., vol. 6, no. 1154663, 2023. Husain and Herremans, 2026

  40. [39]

    Soundtrack success: Unveiling song popularity patterns using machine learning im- plementation,

    S. Arora and R. Rani, “Soundtrack success: Unveiling song popularity patterns using machine learning im- plementation,”SN Comput. Sci., vol. 5, no. 3, p. 278, 2024

  41. [40]

    A comprehensive survey for evaluation methodologies of AI-generated music,

    Z. Xiong, W. Xia, Y . Cai, Y . Luo, C. Yang, Z. Liu, and M. Farrahi, “A comprehensive survey for evaluation methodologies of AI-generated music,” 2023, arXiv preprint arXiv:2308.13736

  42. [41]

    A survey on evaluation metrics for music generation,

    F. B. Kader, S. McFee, and G. Tzanetakis, “A survey on evaluation metrics for music generation,” 2025, arXiv preprint arXiv:2509.00051

  43. [42]

    Fréchet audio dis- tance as a metric for evaluating music quality,

    N. Scarfe, S. Baxter, and J. Reiss, “Fréchet audio dis- tance as a metric for evaluating music quality,” inProc. Int. Soc. Music Inf. Retrieval Conf. (ISMIR), 2024

  44. [43]

    MuQ-Eval: An open-source per-sample quality metric for AI music generation evaluation,

    D. Zhu and B. McFee, “MuQ-Eval: An open-source per-sample quality metric for AI music generation evaluation,” 2026, arXiv preprint arXiv:2603.22677

  45. [44]

    Multi-task learn- ing using uncertainty to weigh losses for scene geome- try and semantics,

    R. Cipolla, Y . Gal, and A. Kendall, “Multi-task learn- ing using uncertainty to weigh losses for scene geome- try and semantics,” in2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 7482–7491

  46. [45]

    Uni- versal music representations? evaluating foundation models on world music corpora,

    C. Papaioannou, E. Benetos, and A. Potamianos, “Uni- versal music representations? evaluating foundation models on world music corpora,” inProceedings of the 26th International Society for Music Information Re- trieval Conference (ISMIR 2025), Daejeon, South Ko- rea, 2025, arXiv preprint arXiv:2506.17055

  47. [46]

    Sonauto: Ai music generation,

    Sonauto, “Sonauto: Ai music generation,” 2025, proprietary system, no public documentation available. [Online]. Available: https://sonauto.ai/

  48. [47]

    Ace-step: A step towards music generation foundation model.arXiv preprint arXiv:2506.00045,

    J. Gong, S. Zhao, S. Wang, S. Xu, and J. Guo, “Ace- step: A step towards music generation foundation model,”arXiv preprint arXiv:2506.00045, 2025

  49. [48]

    Elevenlabs music generation v1,

    ElevenLabs, “Elevenlabs music generation v1,” 2025, proprietary system. [Online]. Available: https:// elevenlabs.io/

  50. [49]

    Simple and control- lable music generation,

    J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Syn- naeve, Y . Adi, and A. Défossez, “Simple and control- lable music generation,” inAdvances in Neural Infor- mation Processing Systems, 2023

  51. [50]

    Riffusion fuzz: State-of-the-art diffusion transformer for creating and editing music,

    R. Team, “Riffusion fuzz: State-of-the-art diffusion transformer for creating and editing music,” 2025. [Online]. Available: https://riffusion.com

  52. [51]

    Lyria realtime,

    G. DeepMind, “Lyria realtime,” 2025. [Online]. Avail- able: https://magenta.withgoogle.com/lyria-realtime