Pith. Machine review for the scientific record.

arxiv: 2510.02797 · v3 · submitted 2025-10-03 · 📡 eess.AS

Recognition: unknown

SongFormer: Scaling Music Structure Analysis with Heterogeneous Supervision

keywords: songformer, music, analysis, heterogeneous, scaling, songformbench, structure, supervision

Music structure analysis (MSA) underpins music understanding and controllable generation, yet progress has been limited by small, inconsistent corpora. We present SongFormer, a scalable framework that learns from heterogeneous supervision. SongFormer (i) fuses short- and long-window self-supervised learning representations to capture both fine-grained and long-range dependencies, and (ii) introduces a learned source embedding to enable training with partial, noisy, and schema-mismatched labels. To support scaling and fair evaluation, we release SongFormDB, the largest MSA corpus to date (over 14k songs spanning languages and genres), and SongFormBench, a 300-song expert-verified benchmark. On SongFormBench, SongFormer sets a new state of the art in strict boundary detection (HR.5F) and achieves the highest functional label accuracy, while remaining computationally efficient; it surpasses strong baselines and Gemini 2.5 Pro on these metrics and remains competitive under relaxed tolerance (HR3F). Code, datasets, and model are open-sourced at https://github.com/ASLP-lab/SongFormer.
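The abstract reports boundary detection under a strict (HR.5F) and a relaxed (HR3F) tolerance. As a minimal sketch, assuming these denote hit-rate F-measures with 0.5 s and 3 s tolerance windows (the usual convention in MSA evaluation), the metric can be illustrated as below. The function name `boundary_hit_rate_f` and the sample boundary times are hypothetical, and this is not the paper's evaluation code; the real benchmark may differ in details such as the matching procedure or trimming of the first and last boundaries.

```python
from typing import Sequence


def boundary_hit_rate_f(
    reference: Sequence[float],
    estimated: Sequence[float],
    tolerance: float = 0.5,
) -> float:
    """F-measure of boundary hits within +/- `tolerance` seconds.

    Each reference boundary may be matched by at most one estimated
    boundary (one-to-one matching via a two-pointer sweep over the
    sorted boundary times).
    """
    ref = sorted(reference)
    est = sorted(estimated)
    if not ref or not est:
        return 0.0

    hits = 0
    i = j = 0
    while i < len(ref) and j < len(est):
        if abs(ref[i] - est[j]) <= tolerance:
            # Boundaries agree within the window: count a hit, consume both.
            hits += 1
            i += 1
            j += 1
        elif est[j] < ref[i]:
            # Estimated boundary is too early to match any later reference.
            j += 1
        else:
            # Reference boundary has no estimate close enough.
            i += 1

    if hits == 0:
        return 0.0
    precision = hits / len(est)
    recall = hits / len(ref)
    return 2 * precision * recall / (precision + recall)


# Hypothetical boundary times in seconds, for illustration only.
reference_times = [0.0, 10.0, 20.0, 30.0]
estimated_times = [0.2, 10.9, 19.8, 25.0]

strict_f = boundary_hit_rate_f(reference_times, estimated_times, tolerance=0.5)
relaxed_f = boundary_hit_rate_f(reference_times, estimated_times, tolerance=3.0)
```

In this toy example the estimate at 10.9 s misses the 10.0 s boundary under the strict window but counts as a hit under the relaxed one, which is why a system can rank differently on the two metrics.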

This paper has not been read by Pith yet.


Forward citations

Cited by 5 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. YingMusic-Singer-Plus: Controllable Singing Voice Synthesis with Flexible Lyric Manipulation and Annotation-free Melody Guidance

    eess.AS 2026-03 unverdicted novelty 7.0

    YingMusic-Singer-Plus is a diffusion model for singing voice synthesis that preserves melody from a reference clip while allowing flexible lyric changes without manual alignment, outperforming Vevo2 and introducing th...

  2. VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

    cs.SD 2026-05 unverdicted novelty 6.0

    VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.

  3. LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation

    cs.SD 2026-04 unverdicted novelty 6.0

    LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestrati...

  4. AllocMV: Optimal Resource Allocation for Music Video Generation via Structured Persistent State

    cs.CV 2026-05 unverdicted novelty 5.0

    AllocMV uses a global planner to build a structured persistent state then solves a Multiple-Choice Knapsack Problem to allocate High-Gen, Mid-Gen, and Reuse compute branches, achieving an optimal Cost-Quality Ratio un...

  5. GaMMA: Towards Joint Global-Temporal Music Understanding in Large Multimodal Models

    cs.SD 2026-05 unverdicted novelty 5.0

    GaMMA unifies global and temporal music understanding in a single LMM via MoE audio encoders and progressive training, achieving new state-of-the-art accuracies on music benchmarks including 79.1% on MuchoMusic.