Recognition: unknown
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
Pith reviewed 2026-05-10 17:23 UTC · model grok-4.3
The pith
A hierarchical autoregressive model generates coherent instrumental accompaniments from isolated vocals using dual-rate tokenization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HAFM generates instrumental music audio to accompany input vocals. Given isolated singing voice, HAFM produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. The system uses a dual-rate codec tokenization scheme with HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, a three-stage hierarchical autoregressive architecture with interleaved multi-codebook prediction and classifier-free guidance, plus modern Transformer design choices including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias.
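A minimal sketch of the bookkeeping the dual-rate scheme implies, assuming an illustrative 10-second clip and an assumed split of EnCodec codebooks into coarse and fine groups (the paper states the two rates but not the codebook split):

```python
# Illustrative frame-rate bookkeeping for the dual-rate tokenization.
# The two rates come from the paper; the clip length and the coarse/fine
# codebook split below are assumptions made for this example only.

VOCAL_RATE_HZ = 50       # HuBERT semantic tokens for vocals
INSTR_RATE_HZ = 75       # EnCodec acoustic frames for instrumentals
N_COARSE_CODEBOOKS = 4   # assumed
N_FINE_CODEBOOKS = 4     # assumed

def token_counts(duration_s: float) -> dict:
    """Sequence lengths implied by the two token rates for one clip."""
    n_semantic = int(VOCAL_RATE_HZ * duration_s)
    n_acoustic = int(INSTR_RATE_HZ * duration_s)
    return {
        "semantic_tokens": n_semantic,
        "acoustic_frames_per_codebook": n_acoustic,
        "coarse_tokens_interleaved": n_acoustic * N_COARSE_CODEBOOKS,
        "fine_tokens_interleaved": n_acoustic * N_FINE_CODEBOOKS,
        "semantic_to_acoustic_ratio": n_semantic / n_acoustic,  # fixed 2:3
    }

print(token_counts(10.0))
# e.g. 500 semantic tokens vs. 750 acoustic frames per codebook for 10 s
```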
What carries the argument
Three-stage hierarchical autoregressive architecture with interleaved multi-codebook prediction and dual-rate codec tokenization scheme that maps 50 Hz vocal semantic tokens to 75 Hz instrumental acoustic tokens.
If this is right
- Accompaniments can be produced directly from isolated vocals and mixed without post-processing synchronization.
- The model matches prior state-of-the-art audio quality on MUSDB18 while using fewer parameters than those systems.
- The staged prediction from semantic to coarse acoustic to fine acoustic tokens supports rate-independent yet time-aligned modeling.
- Retrieval-based baselines are outperformed on the same isolated-vocal input task.
- Classifier-free guidance and interleaved codebook prediction improve coherence during generation.
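On the classifier-free guidance point in the last item above: the paper does not spell out its exact CFG variant, so the sketch below assumes the standard formulation, a linear combination of conditional and unconditional next-token logits with a guidance scale (the scale value here is an assumption, not a reported setting):

```python
import numpy as np

def cfg_logits(cond_logits: np.ndarray,
               uncond_logits: np.ndarray,
               guidance_scale: float = 3.0) -> np.ndarray:
    """Standard classifier-free guidance mix of next-token logits.

    cond_logits:   logits computed with the vocal conditioning present
    uncond_logits: logits computed with the conditioning dropped
    guidance_scale: assumed value; the paper does not report one
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Toy usage over a 4-token vocabulary.
cond = np.array([2.0, 0.5, -1.0, 0.0])
uncond = np.array([1.0, 1.0, 0.0, 0.0])
guided = cfg_logits(cond, uncond)
probs = np.exp(guided - guided.max())
probs /= probs.sum()
print(probs)  # probability mass shifted toward the conditionally preferred token
```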
Where Pith is reading between the lines
- The same staged tokenization pattern could be tested on other conditional audio tasks such as generating backing tracks for speech or solo instruments.
- Fewer parameters combined with open-source code release may lower the barrier for integrating accompaniment generation into consumer music apps or DAWs.
- The hierarchical structure might reduce error accumulation on longer sequences compared to single-stage autoregressive baselines, though the paper does not measure sequence length effects directly.
Load-bearing premise
The dual-rate tokenization and three-stage architecture produce time-aligned and coherent accompaniments without needing additional explicit alignment or synchronization steps.
What would settle it
An objective temporal alignment score or human listening test on MUSDB18 showing that generated instrumentals drift in timing relative to the input vocals or yield FAD scores worse than the reported 2.08.
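For context, FAD is the Fréchet distance between Gaussians fitted to embedding statistics of reference and generated audio; it measures distributional quality, not input-conditioned synchronization. A minimal computation, assuming embeddings have already been extracted with a pretrained audio classifier (the embedding model itself is outside this sketch):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_ref, emb_gen: arrays of shape (num_clips, embedding_dim) produced by
    a pretrained audio embedding model; extracting those embeddings is
    outside this sketch.
    """
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # sqrtm can introduce tiny imaginary noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```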
read the original abstract
We present HAFM, a system that generates instrumental music audio to accompany input vocals. Given isolated singing voice, HAFM produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. We propose three key innovations over prior work: (1) a dual-rate codec tokenization scheme using HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, enabling time-aligned yet rate-independent modeling; (2) a three-stage hierarchical autoregressive architecture (semantic to coarse acoustic to fine acoustic) with interleaved multi-codebook prediction and classifier-free guidance; and (3) modern Transformer design choices including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias for improved training stability and sequence generalization. Experiments on MUSDB18 demonstrate that HAFM achieves a Fréchet Audio Distance (FAD) of 2.08 on isolated vocal inputs, outperforming retrieval baselines and matching prior state-of-the-art systems with fewer parameters. The source code is available at https://github.com/HackerHyper/HAFM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents HAFM, a hierarchical autoregressive foundation model for generating instrumental accompaniments from isolated vocal inputs. Key contributions include a dual-rate codec tokenization using HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, a three-stage architecture progressing from semantic to coarse acoustic to fine acoustic tokens with interleaved multi-codebook prediction and classifier-free guidance, and modern Transformer elements such as QK-norm, GEGLU, RMSNorm, and T5-style relative position bias. Experiments on MUSDB18 report a Fréchet Audio Distance (FAD) of 2.08 for HAFM on isolated vocal inputs, outperforming retrieval baselines and matching prior state-of-the-art systems while using fewer parameters; source code is released.
Significance. If the results and alignment claims hold under scrutiny, this work offers a parameter-efficient approach to conditional music generation that could advance foundation models for audio by demonstrating scalable hierarchical autoregression across mismatched token rates. The open-source code strengthens potential impact for reproducibility in the music information retrieval and generative audio communities.
major comments (2)
- [Abstract and architecture description] The abstract claims that the dual-rate tokenization scheme 'enables time-aligned yet rate-independent modeling,' but the three-stage hierarchical autoregressive architecture description provides no explicit mechanism (e.g., cross-rate alignment loss, upsampling layer, or phase-locking) to enforce frame correspondence between 50 Hz vocal tokens and 75 Hz instrumental tokens. This is load-bearing for the central claim of producing coherent, directly mixable accompaniments, as FAD evaluates distributional quality rather than input-conditioned synchronization.
- [Experiments] The experimental results section reports an FAD of 2.08 and comparisons to baselines/SOTA but omits details on training procedures, exact baseline implementations, statistical significance tests, or potential confounds in the MUSDB18 evaluation (e.g., vocal isolation quality or mixing process), preventing verification that the data supports the performance claims.
minor comments (2)
- [Method] Notation for token rates and stages could be clarified with a diagram or explicit equations showing how interleaving occurs across mismatched rates; one plausible interleaving pattern is sketched after this list.
- [Related work] The paper would benefit from additional references to prior work on hierarchical audio tokenization or rate-mismatch handling in autoregressive models.
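On the interleaving question raised in the first minor comment: the paper describes interleaved multi-codebook prediction without giving the exact pattern, so the sketch below shows one common choice, a frame-major round-robin flattening of EnCodec codebooks into a single autoregressive stream (the codebook count is an assumed value):

```python
import numpy as np

def interleave_codebooks(codes: np.ndarray) -> np.ndarray:
    """Flatten multi-codebook acoustic codes into one autoregressive stream.

    codes: int array of shape (num_codebooks, num_frames) at 75 Hz.
    Returns a 1-D sequence ordered frame-major, i.e. all codebooks for
    frame t appear before any codebook for frame t+1. This is one common
    interleaving; the paper does not specify which pattern it uses.
    """
    return codes.T.reshape(-1)

def deinterleave_codebooks(flat: np.ndarray, num_codebooks: int) -> np.ndarray:
    """Inverse of interleave_codebooks."""
    return flat.reshape(-1, num_codebooks).T

# Toy example: 4 codebooks (assumed), 3 frames.
codes = np.arange(12).reshape(4, 3)
flat = interleave_codebooks(codes)
assert np.array_equal(deinterleave_codebooks(flat, 4), codes)
print(flat)  # [0 3 6 9 1 4 7 10 2 5 8 11]
```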
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core claims.
read point-by-point responses
-
Referee: The abstract claims that the dual-rate tokenization scheme 'enables time-aligned yet rate-independent modeling,' but the three-stage hierarchical autoregressive architecture description provides no explicit mechanism (e.g., cross-rate alignment loss, upsampling layer, or phase-locking) to enforce frame correspondence between 50 Hz vocal tokens and 75 Hz instrumental tokens. This is load-bearing for the central claim of producing coherent, directly mixable accompaniments, as FAD evaluates distributional quality rather than input-conditioned synchronization.
Authors: We appreciate the referee highlighting the need for greater clarity on temporal alignment. In the HAFM architecture, alignment between the 50 Hz HuBERT vocal tokens and 75 Hz EnCodec instrumental tokens is achieved through direct conditioning in the autoregressive process: vocal semantic tokens condition the semantic stage, which then drives coarse and fine acoustic stages over the same underlying audio duration, with token counts scaled proportionally to their respective rates (no explicit upsampling layer or alignment loss is used, as the model learns rate-independent correspondences from synchronized training pairs). This enables direct mixing because generated instrumentals match the input vocal timing by construction. However, we agree the description in the architecture section was insufficiently explicit. We will revise the manuscript to add a dedicated paragraph detailing the temporal correspondence mechanism, including sequence length handling and rate scaling, and will expand the discussion to acknowledge FAD's limitations while noting that coherence is further supported by the task design and qualitative mixing results. revision: yes
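A small illustration of the "aligned by construction" argument in this response: because both token streams index the same clip clock, a semantic frame maps to an acoustic frame through a fixed rate ratio rather than a learned alignment module. The helper below is illustrative and not taken from the released code:

```python
VOCAL_RATE_HZ = 50   # HuBERT semantic frames (per the paper)
INSTR_RATE_HZ = 75   # EnCodec acoustic frames (per the paper)

def semantic_to_acoustic_frame(sem_idx: int) -> int:
    """Acoustic frame index covering the same instant as a semantic frame.

    Both streams are indexed against the same clip clock, so the mapping is
    a fixed rate ratio (3 acoustic frames per 2 semantic frames) rather than
    a learned alignment step.
    """
    time_s = sem_idx / VOCAL_RATE_HZ
    return int(time_s * INSTR_RATE_HZ)

# Semantic frame 100 sits at t = 2.0 s, i.e. acoustic frame 150.
print(semantic_to_acoustic_frame(100))  # 150
```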
-
Referee: The experimental results section reports an FAD of 2.08 and comparisons to baselines/SOTA but omits details on training procedures, exact baseline implementations, statistical significance tests, or potential confounds in the MUSDB18 evaluation (e.g., vocal isolation quality or mixing process), preventing verification that the data supports the performance claims.
Authors: We agree that these details are important for verification and reproducibility. Training procedures (including the three-stage curriculum, optimizer settings, batch size, and data preprocessing on MUSDB18) are fully specified in the appendix and the released GitHub code. Baseline implementations follow the original papers with minor adaptations for fair comparison, as noted in Section 4.2. We omitted statistical significance tests in the initial submission due to the substantial compute required for repeated full training runs, but we will add error bars computed over three random seeds in the revision. Regarding confounds, vocal isolation used a standard pre-trained model with no custom processing, and mixing is performed via direct sample-wise addition to match the isolated vocal duration exactly. We will move key details from the appendix into the main Experiments section and add a short paragraph addressing potential evaluation confounds. revision: yes
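On the mixing step described in this response, sample-wise addition is straightforward; the sketch below assumes both signals are mono, share a sample rate, and may need trimming and a clipping guard (those safeguards are illustrative choices, not details from the paper):

```python
import numpy as np

def mix_vocals_and_accompaniment(vocals: np.ndarray,
                                 instrumental: np.ndarray) -> np.ndarray:
    """Sample-wise addition of an isolated vocal and a generated instrumental.

    Both inputs are assumed to be mono float arrays at the same sample rate.
    Trimming to the shorter signal and peak normalization are illustrative
    safeguards, not steps specified by the paper.
    """
    n = min(len(vocals), len(instrumental))
    mix = vocals[:n] + instrumental[:n]
    peak = np.max(np.abs(mix))
    if peak > 1.0:                      # avoid clipping when written to disk
        mix = mix / peak
    return mix
```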
Circularity Check
No significant circularity; empirical results on public benchmark
full rationale
The paper proposes a hierarchical autoregressive architecture with dual-rate tokenization and evaluates it via FAD on the external MUSDB18 dataset. No derivation chain reduces a claimed prediction or result to a fitted parameter, self-definition, or self-citation by construction. Architectural details are presented as design choices whose effectiveness is measured externally rather than asserted tautologically.
Axiom & Free-Parameter Ledger
free parameters (1)
- Tokenization rates (50 Hz semantic for vocals, 75 Hz acoustic for instrumentals)
axioms (1)
- Domain assumption: HuBERT semantic tokens and EnCodec acoustic tokens provide suitable representations for aligned vocal-instrumental modeling.
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION Singing is one of the most intuitive ways to engage with music. While singing along to existing music is common, singing could also serve as a natural control mechanism for music creation—allowing anyone who can sing to generate personalized instrumental accompaniments. This motivates the task of vocal accompaniment generation: given an isola...
-
[2]
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
RELATED WORK Audio-domain accompaniment generation. SingSong [2] is the most closely related work, adapting AudioLM [3] for conditional audio-to-audio generation. It uses source separation to create training pairs, encodes vocals with w2v-BERT [10], and models instrumentals via SoundStream [11] tokens using a T5 encoder-decoder. Key to its generalization ...
2026
-
[3]
Problem Formulation Given a vocal waveform x ∈ R^{f_s T} of duration T seconds at sample rate f_s, we model the conditional distribution P(y | x) over instrumental waveforms y
PROPOSED METHOD 3.1. Problem Formulation Given a vocal waveform x ∈ R^{f_s T} of duration T seconds at sample rate f_s, we model the conditional distribution P(y | x) over instrumental waveforms y. Following AudioLM, we work with discrete proxy distributions over audio codes rather than raw waveforms. 3.2. Dual-Rate Codec Tokenization Vocal encoding. We extract semantic...
-
[4]
Setup Dataset. We use the FMA-Large dataset [17] (∼100K tracks) for training and MUSDB18 [18] for evaluation
EXPERIMENTS 4.1. Setup Dataset. We use the FMA-Large dataset [17] (∼100K tracks) for training and MUSDB18 [18] for evaluation. MUSDB18 provides studio-isolated vocal and instrumental stems for 150 songs, enabling direct evaluation on both source-separated and isolated vocals. Evaluation metrics. Following SingSong, we use Fréchet Audio Distance (FAD) [19] ...
-
[5]
CONCLUSION We presented HAFM, a three-stage hierarchical autoregressive system for vocal accompaniment generation. By combining dual-rate HuBERT/EnCodec tokenization, interleaved multi-codebook AR modeling with CFG, and modern Transformer components, HAFM achieves strong results on MUSDB18 while using fewer parameters than comparable systems. Future w...
-
[6]
Mysong: automatic accompaniment generation for vocal melodies,
Ian Simon, Dan Morris, and Sumit Basu, “Mysong: automatic accompaniment generation for vocal melodies,” in SIGCHI, 2008
2008
-
[7]
Singsong: Generating musical accompaniments from singing,
Chris Donahue, Antoine Caillon, Adam Roberts, Ethan Manilow, Philippe Esling, Andrea Agostinelli, Mauro Verzetti, Ian Simon, Olivier Pietquin, Neil Zeghidour, and Jesse Engel, “Singsong: Generating musical accompaniments from singing,” arXiv preprint arXiv:2301.12662, 2023
-
[8]
Audiolm: a language modeling approach to audio generation
Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour, “Audiolm: a language modeling approach to audio generation,” arXiv preprint arXiv:2209.03143, 2022
-
[9]
Hubert: Self-supervised speech representation learning by masked prediction of hidden units,
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, vol. 29, pp. 3451–3460
2021
-
[10]
High Fidelity Neural Audio Compression
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022
2022
-
[11]
Scaling vision transformers to 22 billion parameters,
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al., “Scaling vision transformers to 22 billion parameters,” International Conference on Machine Learning, 2023
2023
-
[12]
GLU Variants Improve Transformer
Noam Shazeer, “Glu variants improve transformer,” arXiv preprint arXiv:2002.05202, 2020
2020
-
[13]
Root mean square layer normalization,
Biao Zhang and Rico Sennrich, “Root mean square layer normalization,” Advances in Neural Information Processing Systems, vol. 32, 2019
2019
-
[14]
Exploring the limits of transfer learning with a unified text-to-text transformer,
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” JMLR, 2020
2020
-
[15]
w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,
Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu, “w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” 2021
2021
-
[16]
Soundstream: An end-to-end neural audio codec,
Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021
2021
-
[17]
MusicLM: Generating Music From Text
Andrea Agostinelli, Timo I Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank, “Musiclm: Generating music from text,” arXiv preprint arXiv:2301.11325, 2023
2023
-
[18]
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023
2023
-
[19]
Palm: Scaling language modeling with pathways,
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al., “Palm: Scaling language modeling with pathways,” JMLR, vol. 24, no. 240, pp. 1–113, 2023
2023
-
[20]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014
2014
-
[21]
Kuielab-mdx-net: A two-stream neural network for music demixing,
Minseok Kim, Woosung Choi, Jaehwa Chung, Daewon Lee, and Soonyoung Jung, “Kuielab-mdx-net: A two-stream neural network for music demixing,” Proceedings of the MDX Workshop, 2021
2021
-
[22]
FMA: A Dataset For Music Analysis
Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson, “Fma: A dataset for music analysis,” arXiv preprint arXiv:1612.01840, 2017
2017
-
[23]
Musdb18 - a corpus for music separation,
Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner, “Musdb18 - a corpus for music separation,” 2017
2017
-
[24]
Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,
Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matt Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” INTERSPEECH, 2019
2019
-
[25]
Cnn architectures for large-scale audio classification,
Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., “Cnn architectures for large-scale audio classification,” in ICASSP, 2017
2017