MIDI-Informed Singing Accompaniment Generation in a Compositional Song Pipeline
While end-to-end lyrics-to-song models offer convenience for casual users, professional songwriters require score-to-song systems that allow them to retain authorship over the core melody. However, existing score-to-song methods are limited to short-form snippets and fail to maintain coherence in long-form generation, particularly during vocal-silent sections like intros and bridges. To address this long-form bottleneck, we propose MIDI-informed singing accompaniment generation (MIDI-SAG). Unlike conventional audio-only models, MIDI-SAG utilizes symbolic timing and chord information derived from the vocal MIDI to provide a stable musical roadmap. By incorporating structure planning, which defines temporal boundaries and semantic labels, our framework facilitates consistent generation across both vocal and non-vocal sections. We demonstrate the feasibility of this compositional pipeline by leveraging specialized pre-trained modules, enabling data-efficient training on a single GPU. Our experiments show the potential of this approach for both professional score-to-song and general lyrics-to-song tasks. While an early exploration, MIDI-SAG suggests a promising direction for structured, long-form music synthesis. Audio demos are available, and the code will be open-sourced at https://composerflow.github.io/web_revealed/.
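To make the abstract's pipeline concrete, here is a minimal illustrative sketch (not the authors' code) of how a structure plan with temporal boundaries and semantic labels, plus chord and timing conditioning derived from the vocal MIDI, might be represented before being passed to an accompaniment generator. The class and function names (Section, Note, bar_roadmap), the fixed bar length, and all field choices are assumptions introduced for illustration only.

# Minimal sketch, assuming a bar-level conditioning "roadmap" built from
# a structure plan and vocal MIDI; names and fields are hypothetical.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Section:
    label: str        # semantic label, e.g. "intro", "verse", "chorus", "bridge"
    start: float      # temporal boundary in seconds
    end: float
    has_vocals: bool  # intros/bridges may be vocal-silent but are still planned

@dataclass
class Note:
    pitch: int        # MIDI note number from the vocal melody
    start: float
    end: float

def bar_roadmap(plan: List[Section],
                vocal_notes: List[Note],
                chords: List[Tuple[float, str]],   # (onset_time, chord_symbol)
                bar_seconds: float = 2.0) -> List[dict]:
    """Build one conditioning entry per bar across the whole song, so that
    vocal-silent sections still receive structure, chord, and timing tokens."""
    song_end = max(s.end for s in plan)
    roadmap = []
    t = 0.0
    while t < song_end:
        section = next((s for s in plan if s.start <= t < s.end), plan[-1])
        # Chord active at the start of this bar: last onset not after t.
        active = [sym for onset, sym in chords if onset <= t]
        chord = active[-1] if active else "N.C."
        # Vocal notes overlapping this bar; empty for vocal-silent bars.
        notes = [n for n in vocal_notes if n.start < t + bar_seconds and n.end > t]
        roadmap.append({
            "bar_start": t,
            "section": section.label,
            "chord": chord,
            "vocal_pitches": [n.pitch for n in notes],
        })
        t += bar_seconds
    return roadmap

# Toy usage: a 16-second song with a vocal-silent intro and outro.
plan = [Section("intro", 0.0, 4.0, has_vocals=False),
        Section("verse", 4.0, 12.0, has_vocals=True),
        Section("outro", 12.0, 16.0, has_vocals=False)]
vocals = [Note(60, 4.0, 5.5), Note(62, 6.0, 7.0), Note(64, 8.0, 10.0)]
chords = [(0.0, "C"), (4.0, "Am"), (8.0, "F"), (12.0, "G")]
for bar in bar_roadmap(plan, vocals, chords):
    print(bar)

In this toy run, the intro and outro bars carry empty vocal_pitches but still have section labels and chords, which is one plausible way a generator could stay coherent through vocal-silent sections; the actual MIDI-SAG conditioning format may differ.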
Forward citations
Cited by 1 Pith paper
- LaDA-Band: Language Diffusion Models for Vocal-to-Accompaniment Generation
  LaDA-Band applies discrete masked diffusion with dual-track conditioning and progressive training to generate vocal-to-accompaniment tracks that improve acoustic authenticity, global coherence, and dynamic orchestration.