SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision
Music structure analysis (MSA) underpins music understanding and controllable generation, yet progress has been limited by small, inconsistent corpora. We present SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short- and long-window self-supervised learning representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to enable training with partial, noisy, and schema-mismatched labels. To support scaling and fair evaluation, we release SongFormDB, the largest MSA corpus to date (over 14k songs spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy, while remaining computationally efficient; it surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F). Code, datasets, and model are open-sourced at https://github.com/ASLP-lab/SongFormer.
Forward citations
Cited by 5 Pith papers
- YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance
  YingMusic-Singer-Plus is a diffusion model for singing voice synthesis that preserves melody from a reference clip while allowing flexible lyric changes without manual alignment, outperforming Vevo2 and introducing th...
- VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
  VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody, and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
- LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation
  LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestrati...
- AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State
  AllocMV uses a global planner to build a structured persistent state, then solves a Multiple-Choice Knapsack Problem to allocate High-Gen, Mid-Gen, and Reuse compute branches, achieving an optimal Cost-Quality Ratio un...
- GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models
  GaMMA unifies global and temporal music understanding in a single LMM via MoE audio encoders and progressive training, achieving new state-of-the-art accuracies on music benchmarks including 79.1% on MuchoMusic.