pith. sign in

arxiv: 2606.25713 · v2 · pith:OF75XY3Knew · submitted 2026-06-24 · 💻 cs.SD

Frequency-Aware Self-Supervised Music Representation Learning

Pith reviewed 2026-06-30 10:01 UTC · model grok-4.3

classification 💻 cs.SD
keywords self-supervised learningmusic information retrievalspectrogramJEPArepresentation learningtime-frequencyMARBLE benchmarkmasked prediction
0
0 comments X

The pith

A visual JEPA trained on 2D spectrograms learns stronger music representations than 1D sequence models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that music signals carry natural 2D structure in time-frequency space that 1D waveform or flattened-spectrogram models discard. It introduces PupuJEPA, which applies a Joint-Embedding Predictive Architecture directly to 2D spectrogram patches and predicts the latent embeddings of masked patches from visible context. Domain-specific changes to architecture, training, and inference are added to suit audio. Linear-probing results on the MARBLE benchmark show consistent gains over prior 1D SSL methods across several MIR tasks, and attention maps indicate the model attends to musically relevant patterns.

Core claim

PupuJEPA demonstrates that a visual JEPA trained to predict latent embeddings of masked 2D spectrogram patches, after music-specific adaptations, produces representations that outperform 1D sequence-based SSL models on multiple music information retrieval tasks when evaluated by linear probing on the MARBLE benchmark.

What carries the argument

PupuJEPA, a visual Joint-Embedding Predictive Architecture that predicts latent embeddings of masked 2D spectrogram patches from unmasked context rather than applying masked language modeling to 1D sequences.

If this is right

  • Representations from 2D patch prediction improve linear-probing accuracy on multiple MIR tasks relative to 1D sequence SSL models.
  • Attention maps inside the model highlight musically meaningful time-frequency patterns.
  • The same 2D JEPA approach can be applied to any audio task that benefits from explicit time-frequency grid structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the 2D advantage holds, future music SSL work may shift from waveform or flattened-spectrogram inputs toward native 2D patch models.
  • The approach could extend to other domains where signals have an intrinsic 2D grid, such as vibration analysis or radar returns.
  • Ablation results imply that the choice of masking strategy and prediction target in 2D space is at least as important as the base architecture.

Load-bearing premise

The domain-specific modifications to the visual JEPA architecture, training scheme, and inference are generally effective for music signals rather than tuned only to the MARBLE benchmark tasks.

What would settle it

Retraining the same base JEPA on the identical spectrograms but without the listed domain modifications and measuring whether linear-probing accuracy on MARBLE drops to or below the level of the best 1D SSL baseline.

Figures

Figures reproduced from arXiv: 2606.25713 by Jerry Li, Junan Zhang, Lauri Juvela, Yicheng Gu, Zhizheng Wu.

Figure 1
Figure 1. Figure 1: Illustration of the fundamental intuition behind PupuJEPA. The top panel displays a multitrack project in a Digital [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the PupuJEPA architecture and training scheme. The target encoder encodes the masked spectrogram [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Illustration of different masking strategies. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Illustration of different inference paradigms for global tasks. “GAP” means global average pooling. The top and bottom [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Illustration of different patch aggregation strategies. [PITH_FULL_IMAGE:figures/full_fig_p004_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Illustration of PupuJEPA’s predictor attention maps. The top panels display a multitrack project in a DAW, while the [PITH_FULL_IMAGE:figures/full_fig_p011_6.png] view at source ↗
Figure 1
Figure 1. Figure 1: These tools were used to improve paper writing and [PITH_FULL_IMAGE:figures/full_fig_p012_1.png] view at source ↗
read the original abstract

Self-supervised learning (SSL) has emerged as an essential paradigm for music information retrieval (MIR). While current SSL models achieve state-of-the-art performance across various MIR tasks, they typically treat audio as 1D sequences, either operating on time-domain waveforms or on flattened time-frequency-domain spectrograms. This discards the rich spatial and structural information in time-frequency representations and overlooks a fundamental intuition in music production. In particular, music is naturally represented as time-frequency grids in MIDI-based workflows, a structure that tightly corresponds to 2D spectrograms and inherently makes many MIR tasks trivial. Motivated by this intuition, we propose PupuJEPA, a visual Joint-Embedding Predictive Architecture (JEPA) that is trained directly on 2D spectrograms. Instead of applying masked language modeling (MLM) to 1D sequences, PupuJEPA learns robust representations by predicting the latent embeddings of masked 2D spectrogram patches from unmasked contexts. To optimally adapt such a visual framework to music signals, we also apply domain-specific modifications to model architecture, training scheme, and inference paradigm, with comprehensive ablation studies showing their effectiveness. Evaluations on the MARBLE benchmark show that PupuJEPA outperforms the 1D sequence-based SSL models across multiple MIR tasks in linear probing. Additionally, case studies of the attention maps also confirm that PupuJEPA captures musically meaningful patterns within the 2D time-frequency domain. Codes and checkpoints are available at: https://www.yichenggu.com/PupuJEPA/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes PupuJEPA, a visual JEPA model trained directly on 2D spectrograms for self-supervised music representation learning. It applies domain-specific modifications to architecture, training scheme, and inference, claims these are validated by comprehensive ablations, and reports that the model outperforms 1D sequence-based SSL baselines across multiple MIR tasks on the MARBLE benchmark under linear probing, with qualitative attention-map case studies supporting musically meaningful patterns. Code and checkpoints are released.

Significance. If the empirical claims hold and the modifications are shown to be general rather than benchmark-tuned, the work could meaningfully shift SSL approaches in MIR toward explicit 2D time-frequency modeling. The public release of code and checkpoints is a clear strength for reproducibility.

major comments (1)
  1. [Abstract] Abstract: the claim that 'comprehensive ablation studies' demonstrate effectiveness of the domain-specific modifications to architecture, training scheme, and inference is load-bearing for the central contribution, yet all evaluations and ablations are confined to the MARBLE benchmark; this setup cannot distinguish genuine domain adaptation from design choices that overfit the benchmark's particular task suite and label distributions.
minor comments (1)
  1. [Abstract] The abstract states outperformance 'across multiple MIR tasks' but does not list the specific tasks or report per-task metrics with error bars or statistical significance.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed review and constructive feedback on our manuscript. We address the major comment below and outline planned revisions.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that 'comprehensive ablation studies' demonstrate effectiveness of the domain-specific modifications to architecture, training scheme, and inference is load-bearing for the central contribution, yet all evaluations and ablations are confined to the MARBLE benchmark; this setup cannot distinguish genuine domain adaptation from design choices that overfit the benchmark's particular task suite and label distributions.

    Authors: We appreciate the referee's observation that our ablation studies and evaluations are performed exclusively on the MARBLE benchmark. MARBLE was selected because it aggregates multiple MIR tasks with heterogeneous label distributions and evaluation protocols, providing a more rigorous test than single-task benchmarks. The ablations demonstrate that each domain-specific modification yields measurable gains across this task suite rather than isolated improvements. Nevertheless, we acknowledge that this does not fully rule out benchmark-specific tuning. In the revised manuscript we will qualify the abstract claim to specify that the ablations show effectiveness on MARBLE, add a limitations paragraph discussing the scope of the current evaluation, and note the value of future cross-benchmark validation. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical proposal with external benchmark evaluation

full rationale

The paper proposes PupuJEPA as an adaptation of visual JEPA to 2D spectrograms, reports linear-probe results on the external MARBLE benchmark, and includes ablations confined to that benchmark. No mathematical derivation chain, equations, or predictions are presented that reduce by construction to fitted inputs or self-citations. The central claims rest on empirical comparisons rather than any self-definitional, fitted-prediction, or uniqueness-imported steps. This is the expected non-finding for a model-proposal paper whose results are externally falsifiable on a public benchmark.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities beyond the general claim of domain-specific modifications; full text would be needed to audit these.

pith-pipeline@v0.9.1-grok · 5818 in / 966 out tokens · 32048 ms · 2026-06-30T10:01:21.261109+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

45 extracted references · 9 canonical work pages · 5 internal anchors

  1. [1]

    MusicCNN: Pre-trained Convo- lutional Neural Networks for Music Audio Tagging,

    J. Pons and X. Serra, “MusicCNN: Pre-trained Convo- lutional Neural Networks for Music Audio Tagging,” in Proc. ISMIR, 2019

  2. [2]

    Codified Audio Language Modeling Learns Useful Representations for Music Information Retrieval,

    R. Castellon, C. Donahue, and P. Liang, “Codified Audio Language Modeling Learns Useful Representations for Music Information Retrieval,” inISMIR, 2021, pp. 88– 96

  3. [3]

    Jukebox: A Generative Model for Music

    P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,”arXiv:2005.00341, 2020

  4. [4]

    Contrastive Learning of Musical Representations,

    J. Spijkervet and J. A. Burgoyne, “Contrastive Learning of Musical Representations,” inISMIR, 2021, pp. 673– 681

  5. [5]

    Supervised and Unsu- pervised Learning of Audio Representations for Music Understanding,

    M. C. McCallum, F. Korzeniowski, S. Oramas, F. Gouyon, and A. F. Ehmann, “Supervised and Unsu- pervised Learning of Audio Representations for Music Understanding,” inISMIR, 2022, pp. 256–263

  6. [6]

    BERT: Pre-training of Deep Bidirectional Transformers for Lan- guage Understanding,

    J. Devlin, M. Chang, K. Lee, and K. Toutanova, “BERT: Pre-training of Deep Bidirectional Transformers for Lan- guage Understanding,” inProc. NAACL, 2019, pp. 4171– 4186

  7. [7]

    MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training,

    Y . Li, R. Yuan, G. Zhang, Y . Ma, X. Chen, H. Yin, C. Xiao, C. Lin, A. Ragni, E. Benetos, N. Gyenge, R. B. Dannenberg, R. Liu, W. Chen, G. Xia, Y . Shi, W. Huang, Z. Wang, Y . Guo, and J. Fu, “MERT: Acoustic Music Understanding Model with Large-Scale Self-supervised Training,” inProc. ICLR, 2024

  8. [8]

    High Fidelity Neural Audio Compression,

    A. D ´efossez, J. Copet, G. Synnaeve, and Y . Adi, “High Fidelity Neural Audio Compression,”Trans. Mach. Learn. Res., vol. 2023, 2023

  9. [9]

    A Foundation Model for Music Informatics,

    M. Won, Y . Hung, and D. Le, “A Foundation Model for Music Informatics,” inProc. Int. Conf. Acoust. Speech Signal Process., 2024, pp. 1226–1230

  10. [10]

    MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization,

    H. Zhu, Y . Zhou, H. Chen, J. Yu, Z. Ma, R. Gu, Y . Luo, W. Tan, and X. Chen, “MuQ: Self-Supervised Music Representation Learning with Mel Residual Vector Quantization,”IEEE/ACM Trans. Audio Speech Lang. Process., 2025

  11. [11]

    Masked Autoencoders Are Scalable Vision Learners,

    K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. B. Girshick, “Masked Autoencoders Are Scalable Vision Learners,” inProc. CVPR, 2022, pp. 15 979–15 988

  12. [12]

    Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture,

    M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. G. Rabbat, Y . LeCun, and N. Ballas, “Self-Supervised Learning from Images with a Joint-Embedding Predictive Architecture,” inProc. CVPR, 2023, pp. 15 619–15 629

  13. [13]

    Revisiting Fea- ture Prediction for Learning Visual Representations from Video,

    A. Bardes, Q. Garrido, J. Ponce, X. Chen, M. Rabbat, Y . LeCun, M. Assran, and N. Ballas, “Revisiting Fea- ture Prediction for Learning Visual Representations from Video,”Trans. Mach. Learn. Res., vol. 2024, 2024

  14. [14]

    V-JEPA 2: Self-Supervised Video Models Enable Understanding, Prediction and Planning

    M. Assran, A. Bardes, D. Fan, Q. Garrido, R. Howes, M. Muckley, A. Rizvi, C. Roberts, K. Sinha, A. Zho- luset al., “V-JEPA 2: Self-supervised Video Mod- els Enable Understanding, Prediction and Planning,” arXiv:2506.09985, 2025

  15. [15]

    iBOT: Image BERT Pre-Training with Online Tokenizer

    J. Zhou, C. Wei, H. Wang, W. Shen, C. Xie, A. Yuille, and T. Kong, “iBOT: Image BERT Pre-Training with Online Tokenizer,”arXiv:2111.07832, 2021

  16. [16]

    Emerging Properties in Self-Supervised Vision Transformers,

    M. Caron, H. Touvron, I. Misra, H. J ´egou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging Properties in Self-Supervised Vision Transformers,” inProc. ICCV, 2021, pp. 9630–9640. 13

  17. [17]

    DINOv2: Learning Robust Visual Features without Supervision,

    M. Oquab, T. Darcet, T. Moutakanni, H. V . V o, M. Szafraniec, V . Khalidov, P. Fernandez, D. Haziza, F. Massa, A. El-Nouby, M. Assran, N. Ballas, W. Galuba, R. Howes, P. Huang, S. Li, I. Misra, M. Rabbat, V . Sharma, G. Synnaeve, H. Xu, H. J ´egou, J. Mairal, P. Labatut, A. Joulin, and P. Bojanowski, “DINOv2: Learning Robust Visual Features without Super...

  18. [18]

    DINOv3

    O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoaet al., “Dinov3,”arXiv:2508.10104, 2025

  19. [19]

    Masked Autoencoders that Listen,

    P. Huang, H. Xu, J. Li, A. Baevski, M. Auli, W. Galuba, F. Metze, and C. Feichtenhofer, “Masked Autoencoders that Listen,” inProc. NeurIPS, 2022

  20. [20]

    AudioMAE++: Learning Better Masked Audio Representations with Swiglu FFNS,

    S. Yadav, S. Theodoridis, and Z. Tan, “AudioMAE++: Learning Better Masked Audio Representations with Swiglu FFNS,” inIEEE Int. Workshop Mach. Learn. Signal Process., 2025, pp. 1–6

  21. [21]

    Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Au- dio Representation,

    D. Niizumi, D. Takeuchi, Y . Ohishi, N. Harada, and K. Kashino, “Masked Spectrogram Modeling using Masked Autoencoders for Learning General-purpose Au- dio Representation,” inProc. NeurIPS, 2021, pp. 1–24

  22. [22]

    Masked Spectrogram Prediction for Self-Supervised Audio Pre- Training,

    D. Chong, H. Wang, P. Zhou, and Q. Zeng, “Masked Spectrogram Prediction for Self-Supervised Audio Pre- Training,” inProc. Int. Conf. Acoust. Speech Signal Process., 2023, pp. 1–5

  23. [23]

    MAE-AST: Masked Autoencoding Audio Spectrogram Transformer,

    A. Baade, P. Peng, and D. Harwath, “MAE-AST: Masked Autoencoding Audio Spectrogram Transformer,” inProc Interspeech, 2022, pp. 2438–2442

  24. [24]

    BEATs: Audio Pre-Training with Acoustic Tokenizers,

    S. Chen, Y . Wu, C. Wang, S. Liu, D. Tompkins, Z. Chen, W. Che, X. Yu, and F. Wei, “BEATs: Audio Pre-Training with Acoustic Tokenizers,” inProc. ICML, 2023, pp. 5178–5193

  25. [25]

    EAT: Self-Supervised Pre-Training with Efficient Audio Transformer,

    W. Chen, Y . Liang, Z. Ma, Z. Zheng, and X. Chen, “EAT: Self-Supervised Pre-Training with Efficient Audio Transformer,” inProc. IJCAI, 2024, pp. 3807–3815

  26. [26]

    Self-Supervised Audio Teacher-Student Transformer for Both Clip-Level and Frame-Level Tasks,

    X. Li, N. Shao, and X. Li, “Self-Supervised Audio Teacher-Student Transformer for Both Clip-Level and Frame-Level Tasks,”IEEE/ACM Trans. Audio Speech Lang. Process., vol. 32, pp. 1336–1351, 2024

  27. [27]

    Masked Modeling Duo: Towards a Univer- sal Audio Pre-Training Framework,

    D. Niizumi, D. Takeuchi, Y . Ohishi, N. Harada, and K. Kashino, “Masked Modeling Duo: Towards a Univer- sal Audio Pre-Training Framework,”IEEE/ACM Trans. Audio Speech Lang. Process., vol. 32, pp. 2391–2406, 2024

  28. [28]

    Inves- tigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learn- ing,

    A. Riou, S. Lattner, G. Hadjeres, and G. Peeters, “Inves- tigating Design Choices in Joint-Embedding Predictive Architectures for General Audio Representation Learn- ing,” inProc. Int. Conf. Acoust. Speech Signal Process. Workshop, 2024, pp. 680–684

  29. [29]

    Masked Latent Prediction and Classification for Self- Supervised Audio Representation Learning,

    A. Quelennec, P. Chouteau, G. Peeters, and S. Essid, “Masked Latent Prediction and Classification for Self- Supervised Audio Representation Learning,” inProc. Int. Conf. Acoust. Speech Signal Process., 2025, pp. 1–5

  30. [30]

    MATPAC++: Enhanced Masked Latent Predic- tion for Self-Supervised Audio Representation Learning,

    ——, “MATPAC++: Enhanced Masked Latent Predic- tion for Self-Supervised Audio Representation Learning,” arXiv:2508.12709, 2025

  31. [31]

    A-Jepa: Joint-Embedding Predictive Architecture can Listen,

    Z. Fei, M. Fan, and J. Huang, “A-Jepa: Joint-Embedding Predictive Architecture can Listen,”arXiv:2311.15830, 2023

  32. [32]

    Audio-JEPA: Joint-Embedding Predictive Architecture for Audio Representation Learning,

    L. Tuncay, E. Labb ´e, E. Benetos, and T. Pellegrini, “Audio-JEPA: Joint-Embedding Predictive Architecture for Audio Representation Learning,”arXiv:2507.02915, 2025

  33. [33]

    Scaling Up Masked Audio Encoder Learning for General Audio Classification,

    H. Dinkel, Z. Yan, Y . Wang, J. Zhang, Y . Wang, and B. Wang, “Scaling Up Masked Audio Encoder Learning for General Audio Classification,” inProc. Interspeech, 2024

  34. [34]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,

    A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weis- senborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Min- derer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby, “An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale,” inProc. ICLR, 2021

  35. [35]

    RoFormer: Enhanced transformer with Rotary Posi- tion Embedding,

    J. Su, M. H. M. Ahmed, Y . Lu, S. Pan, W. Bo, and Y . Liu, “RoFormer: Enhanced transformer with Rotary Posi- tion Embedding,”Neurocomputing, vol. 568, p. 127063, 2024

  36. [36]

    GLU Variants Improve Transformer

    N. Shazeer, “GLU Variants Improve Transformer,” arXiv:2002.05202, 2020

  37. [37]

    MARBLE: Music Audio Representation Benchmark for Universal Evaluation,

    R. Yuan, Y . Ma, Y . Li, G. Zhang, X. Chen, H. Yin, L. Zhuo, Y . Liu, J. Huang, Z. Tian, B. Deng, N. Wang, C. Lin, E. Benetos, A. Ragni, N. Gyenge, R. B. Dan- nenberg, W. Chen, G. Xia, W. Xue, S. Liu, S. Wang, R. Liu, Y . Guo, and J. Fu, “MARBLE: Music Audio Representation Benchmark for Universal Evaluation,” in Proc. NeurIPS, 2023

  38. [38]

    Layer-Wise Investigation of Large-Scale Self-Supervised Music Representation Models,

    Y . Zhou, H. Zhu, and H. Chen, “Layer-Wise Investigation of Large-Scale Self-Supervised Music Representation Models,”arXiv:2505.16306, 2025

  39. [39]

    1000 songs for emotional analysis of music,

    M. Soleymani, M. N. Caro, E. M. Schmidt, C. Sha, and Y . Yang, “1000 songs for emotional analysis of music,” inProc. ACM MM, 2013, pp. 1–6

  40. [40]

    Two Data Sets for Tempo Estimation and Key Detection in Electronic Dance Music Annotated from User Corrections,

    P. Knees, ´A. Faraldo, P. Herrera, R. V ogl, S. B ¨ock, F. H ¨orschl¨ager, and M. L. Goff, “Two Data Sets for Tempo Estimation and Key Detection in Electronic Dance Music Annotated from User Corrections,” inProc. ISMIR, 2015, pp. 364–370

  41. [41]

    Melody tran- scription via generative pre-training,

    C. Donahue, J. Thickstun, and P. Liang, “Melody tran- scription via generative pre-training,” inProc. ISMIR, 2022, pp. 485–492

  42. [42]

    MIR EV AL: A Transparent Implementation of Common MIR Met- rics,

    C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Ni- eto, D. Liang, D. P. Ellis, and C. C. Raffel, “MIR EV AL: A Transparent Implementation of Common MIR Met- rics,” inProc. ISMIR, vol. 10, 2014, p. 2014

  43. [43]

    Musical Genre Classifica- tion of Audio Signals,

    G. Tzanetakis and P. R. Cook, “Musical Genre Classifica- tion of Audio Signals,”IEEE/ACM Trans. Audio Speech Lang. Process., vol. 10, no. 5, pp. 293–302, 2002

  44. [44]

    The MTG-Jamendo Dataset for Automatic Music Tagging,

    D. Bogdanov, M. Won, P. Tovstogan, A. Porter, and X. Serra, “The MTG-Jamendo Dataset for Automatic Music Tagging,” inProc. ICML, 2019, pp. 1–3

  45. [45]

    Evaluation of Algorithms Using Games: The Case of Music Tagging,

    E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, “Evaluation of Algorithms Using Games: The Case of Music Tagging,” inProc. ISMIR, 2009, pp. 387–392