Recognition: unknown
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
Pith reviewed 2026-05-10 17:23 UTC · model grok-4.3
The pith
A hierarchical autoregressive model generates coherent instrumental accompaniments from isolated vocals using dual-rate tokenization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HAFM generates instrumental music audio to accompany input vocals. Given isolated singing voice, HAFM produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. The system uses a dual-rate codec tokenization scheme with HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, a three-stage hierarchical autoregressive architecture with interleaved multi-codebook prediction and classifier-free guidance, plus modern Transformer design choices including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias.
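A minimal sketch of the bookkeeping the dual-rate scheme implies, assuming an illustrative 10-second clip and an assumed split of EnCodec codebooks into coarse and fine groups (the paper states the two rates but not the codebook split):

```python
# Illustrative frame-rate bookkeeping for the dual-rate tokenization.
# The two rates come from the paper; the clip length and the coarse/fine
# codebook split below are assumptions made for this example only.

VOCAL_RATE_HZ = 50       # HuBERT semantic tokens for vocals
INSTR_RATE_HZ = 75       # EnCodec acoustic frames for instrumentals
N_COARSE_CODEBOOKS = 4   # assumed
N_FINE_CODEBOOKS = 4     # assumed

def token_counts(duration_s: float) -> dict:
    """Sequence lengths implied by the two token rates for one clip."""
    n_semantic = int(VOCAL_RATE_HZ * duration_s)
    n_acoustic = int(INSTR_RATE_HZ * duration_s)
    return {
        "semantic_tokens": n_semantic,
        "acoustic_frames_per_codebook": n_acoustic,
        "coarse_tokens_interleaved": n_acoustic * N_COARSE_CODEBOOKS,
        "fine_tokens_interleaved": n_acoustic * N_FINE_CODEBOOKS,
        "semantic_to_acoustic_ratio": n_semantic / n_acoustic,  # fixed 2:3
    }

print(token_counts(10.0))
# e.g. 500 semantic tokens vs. 750 acoustic frames per codebook for 10 s
```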
What carries the argument
Three-stage hierarchical autoregressive architecture with interleaved multi-codebook prediction and dual-rate codec tokenization scheme that maps 50 Hz vocal semantic tokens to 75 Hz instrumental acoustic tokens.
If this is right
- Accompaniments can be produced directly from isolated vocals and mixed without post-processing synchronization.
- The model matches prior state-of-the-art audio quality on MUSDB18 while using fewer parameters than those systems.
- The staged prediction from semantic to coarse acoustic to fine acoustic tokens supports rate-independent yet time-aligned modeling.
- Retrieval-based baselines are outperformed on the same isolated-vocal input task.
- Classifier-free guidance and interleaved codebook prediction improve coherence during generation.
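On the classifier-free guidance point in the last item above: the paper does not spell out its exact CFG variant, so the sketch below assumes the standard formulation, a linear combination of conditional and unconditional next-token logits with a guidance scale (the scale value here is an assumption, not a reported setting):

```python
import numpy as np

def cfg_logits(cond_logits: np.ndarray,
               uncond_logits: np.ndarray,
               guidance_scale: float = 3.0) -> np.ndarray:
    """Standard classifier-free guidance mix of next-token logits.

    cond_logits:   logits computed with the vocal conditioning present
    uncond_logits: logits computed with the conditioning dropped
    guidance_scale: assumed value; the paper does not report one
    """
    return uncond_logits + guidance_scale * (cond_logits - uncond_logits)

# Toy usage over a 4-token vocabulary.
cond = np.array([2.0, 0.5, -1.0, 0.0])
uncond = np.array([1.0, 1.0, 0.0, 0.0])
guided = cfg_logits(cond, uncond)
probs = np.exp(guided - guided.max())
probs /= probs.sum()
print(probs)  # probability mass shifted toward the conditionally preferred token
```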
Where Pith is reading between the lines
- The same staged tokenization pattern could be tested on other conditional audio tasks such as generating backing tracks for speech or solo instruments.
- Fewer parameters combined with open-source code release may lower the barrier for integrating accompaniment generation into consumer music apps or DAWs.
- The hierarchical structure might reduce error accumulation on longer sequences compared to single-stage autoregressive baselines, though the paper does not measure sequence length effects directly.
Load-bearing premise
The dual-rate tokenization and three-stage architecture produce time-aligned and coherent accompaniments without needing additional explicit alignment or synchronization steps.
What would settle it
An objective temporal alignment score or human listening test on MUSDB18 showing that generated instrumentals drift in timing relative to the input vocals or yield FAD scores worse than the reported 2.08.
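For context, FAD is the Fréchet distance between Gaussians fitted to embedding statistics of reference and generated audio; it measures distributional quality, not input-conditioned synchronization. A minimal computation, assuming embeddings have already been extracted with a pretrained audio classifier (the embedding model itself is outside this sketch):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_audio_distance(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two embedding sets.

    emb_ref, emb_gen: arrays of shape (num_clips, embedding_dim) produced by
    a pretrained audio embedding model; extracting those embeddings is
    outside this sketch.
    """
    mu_r, mu_g = emb_ref.mean(axis=0), emb_gen.mean(axis=0)
    cov_r = np.cov(emb_ref, rowvar=False)
    cov_g = np.cov(emb_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):   # sqrtm can introduce tiny imaginary noise
        covmean = covmean.real
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```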
read the original abstract
We present HAFM, a system that generates instrumental music audio to accompany input vocals. Given isolated singing voice, HAFM produces a coherent instrumental accompaniment that can be directly mixed with the input to create complete music. We propose three key innovations over prior work: (1) a dual-rate codec tokenization scheme using HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, enabling time-aligned yet rate-independent modeling; (2) a three-stage hierarchical autoregressive architecture (semantic to coarse acoustic to fine acoustic) with interleaved multi-codebook prediction and classifier-free guidance; and (3) modern Transformer design choices including QK-norm, GEGLU activations, RMSNorm, and T5-style relative position bias for improved training stability and sequence generalization. Experiments on MUSDB18 demonstrate that HAFM achieves a Fréchet Audio Distance (FAD) of 2.08 on isolated vocal inputs, outperforming retrieval baselines and matching prior state-of-the-art systems with fewer parameters. The source code is available at https://github.com/HackerHyper/HAFM.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents HAFM, a hierarchical autoregressive foundation model for generating instrumental accompaniments from isolated vocal inputs. Key contributions include a dual-rate codec tokenization using HuBERT semantic tokens at 50 Hz for vocals and EnCodec acoustic tokens at 75 Hz for instrumentals, a three-stage architecture progressing from semantic to coarse acoustic to fine acoustic tokens with interleaved multi-codebook prediction and classifier-free guidance, and modern Transformer elements such as QK-norm, GEGLU, RMSNorm, and T5-style relative position bias. Experiments on MUSDB18 report a Fréchet Audio Distance (FAD) of 2.08 for HAFM on isolated vocal inputs, outperforming retrieval baselines and matching prior state-of-the-art systems while using fewer parameters; source code is released.
Significance. If the results and alignment claims hold under scrutiny, this work offers a parameter-efficient approach to conditional music generation that could advance foundation models for audio by demonstrating scalable hierarchical autoregression across mismatched token rates. The open-source code strengthens potential impact for reproducibility in the music information retrieval and generative audio communities.
major comments (2)
- [Abstract and architecture description] The abstract claims that the dual-rate tokenization scheme 'enables time-aligned yet rate-independent modeling,' but the three-stage hierarchical autoregressive architecture description provides no explicit mechanism (e.g., cross-rate alignment loss, upsampling layer, or phase-locking) to enforce frame correspondence between 50 Hz vocal tokens and 75 Hz instrumental tokens. This is load-bearing for the central claim of producing coherent, directly mixable accompaniments, as FAD evaluates distributional quality rather than input-conditioned synchronization.
- [Experiments] The experimental results section reports an FAD of 2.08 and comparisons to baselines/SOTA but omits details on training procedures, exact baseline implementations, statistical significance tests, or potential confounds in the MUSDB18 evaluation (e.g., vocal isolation quality or mixing process), preventing verification that the data supports the performance claims.
minor comments (2)
- [Method] Notation for token rates and stages could be clarified with a diagram or explicit equations showing how interleaving occurs across mismatched rates; one plausible interleaving pattern is sketched after this list.
- [Related work] The paper would benefit from additional references to prior work on hierarchical audio tokenization or rate-mismatch handling in autoregressive models.
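On the interleaving question raised in the first minor comment: the paper describes interleaved multi-codebook prediction without giving the exact pattern, so the sketch below shows one common choice, a frame-major round-robin flattening of EnCodec codebooks into a single autoregressive stream (the codebook count is an assumed value):

```python
import numpy as np

def interleave_codebooks(codes: np.ndarray) -> np.ndarray:
    """Flatten multi-codebook acoustic codes into one autoregressive stream.

    codes: int array of shape (num_codebooks, num_frames) at 75 Hz.
    Returns a 1-D sequence ordered frame-major, i.e. all codebooks for
    frame t appear before any codebook for frame t+1. This is one common
    interleaving; the paper does not specify which pattern it uses.
    """
    return codes.T.reshape(-1)

def deinterleave_codebooks(flat: np.ndarray, num_codebooks: int) -> np.ndarray:
    """Inverse of interleave_codebooks."""
    return flat.reshape(-1, num_codebooks).T

# Toy example: 4 codebooks (assumed), 3 frames.
codes = np.arange(12).reshape(4, 3)
flat = interleave_codebooks(codes)
assert np.array_equal(deinterleave_codebooks(flat, 4), codes)
print(flat)  # [0 3 6 9 1 4 7 10 2 5 8 11]
```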
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment point by point below, providing clarifications and committing to revisions that strengthen the manuscript without altering its core claims.
read point-by-point responses
-
Referee: The abstract claims that the dual-rate tokenization scheme 'enables time-aligned yet rate-independent modeling,' but the three-stage hierarchical autoregressive architecture description provides no explicit mechanism (e.g., cross-rate alignment loss, upsampling layer, or phase-locking) to enforce frame correspondence between 50 Hz vocal tokens and 75 Hz instrumental tokens. This is load-bearing for the central claim of producing coherent, directly mixable accompaniments, as FAD evaluates distributional quality rather than input-conditioned synchronization.
Authors: We appreciate the referee highlighting the need for greater clarity on temporal alignment. In the HAFM architecture, alignment between the 50 Hz HuBERT vocal tokens and 75 Hz EnCodec instrumental tokens is achieved through direct conditioning in the autoregressive process: vocal semantic tokens condition the semantic stage, which then drives coarse and fine acoustic stages over the same underlying audio duration, with token counts scaled proportionally to their respective rates (no explicit upsampling layer or alignment loss is used, as the model learns rate-independent correspondences from synchronized training pairs). This enables direct mixing because generated instrumentals match the input vocal timing by construction. However, we agree the description in the architecture section was insufficiently explicit. We will revise the manuscript to add a dedicated paragraph detailing the temporal correspondence mechanism, including sequence length handling and rate scaling, and will expand the discussion to acknowledge FAD's limitations while noting that coherence is further supported by the task design and qualitative mixing results. revision: yes
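A small illustration of the "aligned by construction" argument in this response: because both token streams index the same clip clock, a semantic frame maps to an acoustic frame through a fixed rate ratio rather than a learned alignment module. The helper below is illustrative and not taken from the released code:

```python
VOCAL_RATE_HZ = 50   # HuBERT semantic frames (per the paper)
INSTR_RATE_HZ = 75   # EnCodec acoustic frames (per the paper)

def semantic_to_acoustic_frame(sem_idx: int) -> int:
    """Acoustic frame index covering the same instant as a semantic frame.

    Both streams are indexed against the same clip clock, so the mapping is
    a fixed rate ratio (3 acoustic frames per 2 semantic frames) rather than
    a learned alignment step.
    """
    time_s = sem_idx / VOCAL_RATE_HZ
    return int(time_s * INSTR_RATE_HZ)

# Semantic frame 100 sits at t = 2.0 s, i.e. acoustic frame 150.
print(semantic_to_acoustic_frame(100))  # 150
```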
-
Referee: The experimental results section reports an FAD of 2.08 and comparisons to baselines/SOTA but omits details on training procedures, exact baseline implementations, statistical significance tests, or potential confounds in the MUSDB18 evaluation (e.g., vocal isolation quality or mixing process), preventing verification that the data supports the performance claims.
Authors: We agree that these details are important for verification and reproducibility. Training procedures (including the three-stage curriculum, optimizer settings, batch size, and data preprocessing on MUSDB18) are fully specified in the appendix and the released GitHub code. Baseline implementations follow the original papers with minor adaptations for fair comparison, as noted in Section 4.2. We omitted statistical significance tests in the initial submission due to the substantial compute required for repeated full training runs, but we will add error bars computed over three random seeds in the revision. Regarding confounds, vocal isolation used a standard pre-trained model with no custom processing, and mixing is performed via direct sample-wise addition to match the isolated vocal duration exactly. We will move key details from the appendix into the main Experiments section and add a short paragraph addressing potential evaluation confounds. revision: yes
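On the mixing step described in this response, sample-wise addition is straightforward; the sketch below assumes both signals are mono, share a sample rate, and may need trimming and a clipping guard (those safeguards are illustrative choices, not details from the paper):

```python
import numpy as np

def mix_vocals_and_accompaniment(vocals: np.ndarray,
                                 instrumental: np.ndarray) -> np.ndarray:
    """Sample-wise addition of an isolated vocal and a generated instrumental.

    Both inputs are assumed to be mono float arrays at the same sample rate.
    Trimming to the shorter signal and peak normalization are illustrative
    safeguards, not steps specified by the paper.
    """
    n = min(len(vocals), len(instrumental))
    mix = vocals[:n] + instrumental[:n]
    peak = np.max(np.abs(mix))
    if peak > 1.0:                      # avoid clipping when written to disk
        mix = mix / peak
    return mix
```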
Circularity Check
No significant circularity; empirical results on public benchmark
full rationale
The paper proposes a hierarchical autoregressive architecture with dual-rate tokenization and evaluates it via FAD on the external MUSDB18 dataset. No derivation chain reduces a claimed prediction or result to a fitted parameter, self-definition, or self-citation by construction. Architectural details are presented as design choices whose effectiveness is measured externally rather than asserted tautologically.
Axiom & Free-Parameter Ledger
free parameters (1)
- Tokenization rates (50 Hz semantic for vocals, 75 Hz acoustic for instrumentals)
axioms (1)
- Domain assumption: HuBERT semantic tokens and EnCodec acoustic tokens provide suitable representations for aligned vocal-instrumental modeling.
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION Singing is one of the most intuitive ways to engage with music. While singing along to existing music is common, singing could also serve as a natural control mechanism for music creation—allowing anyone who can sing to generate personalized instrumental accompaniments. This motivates the task of vocal accompaniment generation: given an isola...
-
[2]
HAFM: Hierarchical Autoregressive Foundation Model for Music Accompaniment Generation
RELATED WORK Audio-domain accompaniment generation. SingSong [2] is the most closely related work, adapting AudioLM [3] for conditional audio-to-audio generation. It uses source separation to create training pairs, encodes vocals with w2v-BERT [10], and models instrumentals via SoundStream [11] tokens using a T5 encoder-decoder. Key to its generalization ...
2026
-
[3]
Problem Formulation Given a vocal waveform x ∈ R^{f_s T} of duration T seconds at sample rate f_s, we model the conditional distribution P(y | x) over instrumental waveforms y
PROPOSED METHOD 3.1. Problem Formulation Given a vocal waveform x ∈ R^{f_s T} of duration T seconds at sample rate f_s, we model the conditional distribution P(y | x) over instrumental waveforms y. Following AudioLM, we work with discrete proxy distributions over audio codes rather than raw waveforms. 3.2. Dual-Rate Codec Tokenization Vocal encoding. We extract semantic...
-
[4]
Setup Dataset. We use the FMA-Large dataset [17] (∼100K tracks) for training and MUSDB18 [18] for evaluation
EXPERIMENTS 4.1. Setup Dataset. We use the FMA-Large dataset [17] (∼100K tracks) for training and MUSDB18 [18] for evaluation. MUSDB18 provides studio-isolated vocal and instrumental stems for 150 songs, enabling direct evaluation on both source-separated and isolated vocals. Evaluation metrics. Following SingSong, we use Fréchet Audio Distance (FAD) [19] ...
-
[5]
CONCLUSION We presented HAFM, a three-stage hierarchical autoregressive system for vocal accompaniment generation. By combining dual-rate HuBERT/EnCodec tokenization, interleaved multi-codebook AR modeling with CFG, and modern Transformer components, HAFM achieves strong results on MUSDB18 while using fewer parameters than comparable systems. Future w...
-
[6]
Mysong: automatic accompaniment generation for vocal melodies,
Ian Simon, Dan Morris, and Sumit Basu, “Mysong: automatic accompaniment generation for vocal melodies,” in SIGCHI, 2008
2008
-
[7]
Singsong: Generating musical accompaniments from singing,
Chris Donahue, Antoine Caillon, Adam Roberts, Ethan Manilow, Philippe Esling, Andrea Agostinelli, Mauro Verzetti, Ian Simon, Olivier Pietquin, Neil Zeghidour, and Jesse Engel, “Singsong: Generating musical accompaniments from singing,” arXiv preprint arXiv:2301.12662, 2023
-
[8]
Audiolm: a language modeling approach to audio generation
Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Olivier Teboul, David Grangier, Marco Tagliasacchi, and Neil Zeghidour, “Audiolm: a language modeling approach to audio generation,” arXiv preprint arXiv:2209.03143, 2022
-
[9]
Hubert: Self-supervised speech representation learning by masked prediction of hidden units,
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, “Hubert: Self-supervised speech representation learning by masked prediction of hidden units,” in IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021, vol. 29, pp. 3451–3460
2021
-
[10]
High Fidelity Neural Audio Compression
Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi, “High fidelity neural audio compression,” arXiv preprint arXiv:2210.13438, 2022
2022
-
[11]
Scaling vision transformers to 22 billion parameters,
Mostafa Dehghani, Josip Djolonga, Basil Mustafa, Piotr Padlewski, Jonathan Heek, Justin Gilmer, Andreas Peter Steiner, Mathilde Caron, Robert Geirhos, Ibrahim Alabdulmohsin, et al., “Scaling vision transformers to 22 billion parameters,” International Conference on Machine Learning, 2023
2023
-
[12]
GLU Variants Improve Transformer
Noam Shazeer, “Glu variants improve transformer,” arXiv preprint arXiv:2002.05202, 2020
2020
-
[13]
Root mean square layer normalization,
Biao Zhang and Rico Sennrich, “Root mean square layer normalization,” Advances in Neural Information Processing Systems, vol. 32, 2019
2019
-
[14]
Exploring the limits of transfer learning with a unified text-to-text transformer,
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J Liu, et al., “Exploring the limits of transfer learning with a unified text-to-text transformer,” JMLR, 2020
2020
-
[15]
w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,
Yu-An Chung, Yu Zhang, Wei Han, Chung-Cheng Chiu, James Qin, Ruoming Pang, and Yonghui Wu, “w2v-bert: Combining contrastive learning and masked language modeling for self-supervised speech pre-training,” 2021
2021
-
[16]
Soundstream: An end-to-end neural audio codec,
Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, and Marco Tagliasacchi, “Soundstream: An end-to-end neural audio codec,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2021
2021
-
[17]
MusicLM: Generating Music From Text
Andrea Agostinelli, Timo I Denk, Zalán Borsos, Jesse Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matt Sharifi, Neil Zeghidour, and Christian Frank, “Musiclm: Generating music from text,” arXiv preprint arXiv:2301.11325, 2023
2023
-
[18]
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al., “Neural codec language models are zero-shot text to speech synthesizers,” arXiv preprint arXiv:2301.02111, 2023
2023
-
[19]
Palm: Scaling language modeling with pathways,
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al., “Palm: Scaling language modeling with pathways,” JMLR, vol. 24, no. 240, pp. 1–113, 2023
2023
-
[20]
Adam: A Method for Stochastic Optimization
Diederik P Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014
2014
-
[21]
Kuielab-mdx-net: A two-stream neural network for music demixing,
Minseok Kim, Woosung Choi, Jaehwa Chung, Daewon Lee, and Soonyoung Jung, “Kuielab-mdx-net: A two-stream neural network for music demixing,” Proceedings of the MDX Workshop, 2021
2021
-
[22]
FMA: A Dataset For Music Analysis
Michaël Defferrard, Kirell Benzi, Pierre Vandergheynst, and Xavier Bresson, “Fma: A dataset for music analysis,” arXiv preprint arXiv:1612.01840, 2017
2017
-
[23]
Musdb18 - a corpus for music separation,
Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner, “Musdb18 - a corpus for music separation,” 2017
2017
-
[24]
Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,
Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matt Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” INTERSPEECH, 2019
2019
-
[25]
Cnn architectures for large-scale audio classification,
Shawn Hershey, Sourish Chaudhuri, Daniel PW Ellis, Jort F Gemmeke, Aren Jansen, R Channing Moore, Manoj Plakal, Devin Platt, Rif A Saurous, Bryan Seybold, et al., “Cnn architectures for large-scale audio classification,” in ICASSP, 2017
2017