arxiv: 2606.03803 · v2 · pith:DEJ2FEPBnew · submitted 2026-06-02 · 💻 cs.SD · cs.AI · eess.AS

LiveBand: Live Accompaniment Generation in the Audio Domain

Pith reviewed 2026-06-28 08:34 UTC · model grok-4.3

classification 💻 cs.SD · cs.AI · eess.AS
keywords music accompaniment generation · real-time audio synthesis · causal transformer · audio autoencoder latent space · adversarial training · streaming generation · causal constraints

The pith

A causal transformer generates real-time music accompaniments from live audio using only past context and noise.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LiveBand as a system that produces music accompaniments to live input while obeying strict causality, so the model never accesses future audio frames. It achieves this by placing a generator inside the latent space of a pre-trained causal audio autoencoder, where each step receives only the current mix context plus noise and outputs the next accompaniment latents. Training runs in one parallel pass under causal masking with sequence-level adversarial supervision, and inference runs autoregressively with a rolling attention state so that train and inference computations match exactly. This design removes teacher forcing and the resulting exposure bias. The approach yields measurable gains on audio quality, beat alignment, and mix adherence while supporting streaming output on ordinary hardware.

Core claim

LiveBand trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder. At every timestep the generator receives only the causally available mix context and Gaussian noise and predicts accompaniment latents without any future mix frames or ground-truth target latents. Sequence-level adversarial supervision is supplied by a discriminator. Training occurs in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The matching of training and inference computations eliminates teacher forcing and exposure bias.
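
The claimed match between the parallel training pass and streaming inference can be illustrated with a single attention layer. The following is a minimal NumPy sketch, not the paper's architecture: the single head, the random weights, and the choice to add mix latents and noise into one input are all assumptions made for illustration. It checks that one masked pass over all steps gives the same outputs as a step-by-step pass that keeps a key/value cache.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def parallel_causal_attention(x, wq, wk, wv):
    """Training-style pass: all T steps at once under a causal mask."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    T = x.shape[0]
    future = np.triu(np.ones((T, T), dtype=bool), k=1)  # True where key index > query index
    scores = np.where(future, -np.inf, scores)          # hide future frames
    return softmax(scores) @ v

def streaming_attention(x, wq, wk, wv):
    """Inference-style pass: one step at a time with a growing key/value cache."""
    keys, values, outputs = [], [], []
    for x_t in x:
        keys.append(x_t @ wk)
        values.append(x_t @ wv)
        q_t = x_t @ wq
        K, V = np.stack(keys), np.stack(values)
        weights = softmax(K @ q_t / np.sqrt(q_t.shape[-1]))
        outputs.append(weights @ V)
    return np.stack(outputs)

rng = np.random.default_rng(0)
T, d = 6, 8
mix = rng.normal(size=(T, d))    # stand-in for causally available mix latents
noise = rng.normal(size=(T, d))  # per-step Gaussian noise
x = mix + noise                  # assumption: how the two inputs are combined is not stated
wq, wk, wv = (rng.normal(size=(d, d)) for _ in range(3))

out_parallel = parallel_causal_attention(x, wq, wk, wv)
out_stream = streaming_attention(x, wq, wk, wv)
max_gap = float(np.abs(out_parallel - out_stream).max())
print(max_gap)
```

The cache here grows without bound. A rolling state that evicts old frames would need the training mask to hide the same frames (a banded mask) for the two passes to stay identical.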

What carries the argument

Causal transformer generator inside the latent space of a pre-trained causal audio autoencoder, trained with sequence-level adversarial supervision from a discriminator.
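
Sequence-level adversarial supervision means the discriminator scores a whole latent sequence, not individual frames. The sketch below uses the hinge objective purely as an assumption: this page does not say which GAN loss the paper uses, and the mean-pooling critic is a hypothetical stand-in for the real discriminator.

```python
import numpy as np

def discriminator_hinge_loss(real_scores, fake_scores):
    """Hinge loss for the critic: push real scores above +1 and fake scores below -1."""
    real_term = np.mean(np.maximum(0.0, 1.0 - real_scores))
    fake_term = np.mean(np.maximum(0.0, 1.0 + fake_scores))
    return float(real_term + fake_term)

def generator_hinge_loss(fake_scores):
    """Hinge loss for the generator: raise the critic's score on generated sequences."""
    return float(-np.mean(fake_scores))

def toy_sequence_critic(latents, w):
    """Hypothetical critic: mean-pool over time, then one linear score per sequence."""
    return latents.mean(axis=1) @ w

rng = np.random.default_rng(0)
batch, T, d = 4, 16, 8
real = rng.normal(size=(batch, T, d))  # stand-in for ground-truth accompaniment latents
fake = rng.normal(size=(batch, T, d))  # stand-in for generated accompaniment latents
w = rng.normal(size=d)

real_scores = toy_sequence_critic(real, w)  # shape (batch,): one score per sequence
fake_scores = toy_sequence_critic(fake, w)
d_loss = discriminator_hinge_loss(real_scores, fake_scores)
g_loss = generator_hinge_loss(fake_scores)
print(d_loss, g_loss)
```

Because the score depends on the full sequence, the generator is penalized for drift that only shows up over many frames, which a per-frame reconstruction loss would not capture.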

If this is right

  • The generated accompaniments score higher than prior work on objective measures of audio quality, beat alignment, and mix adherence.
  • Streaming generation proceeds without any lookahead into future audio frames.
  • Training and inference computations are identical by design, removing exposure bias.
  • The system runs in real time on consumer hardware while respecting strict causal constraints.
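
The real-time claim has a concrete budget behind it: the paper excerpt quoted in the reference graph gives one latent frame as 4096 audio samples at 44.1 kHz. A minimal sketch of that arithmetic follows. The function names and the example latencies are hypothetical and are not measurements from the paper.

```python
def frame_budget_ms(samples_per_frame: int, sample_rate_hz: float) -> float:
    """Time covered by one latent frame, in milliseconds."""
    return 1000.0 * samples_per_frame / sample_rate_hz

def is_real_time(latency_ms: float, budget_ms: float) -> bool:
    """A streaming step keeps up if it finishes within one frame's duration."""
    return latency_ms <= budget_ms

budget = frame_budget_ms(4096, 44_100)
print(round(budget, 2))             # about 92.88 ms per frame

# Hypothetical per-step latencies, for illustration only.
print(is_real_time(40.0, budget))   # a 40 ms step fits the budget
print(is_real_time(100.0, budget))  # a 100 ms step does not
```

Generation plus decoding for each step has to fit inside this window, otherwise output falls progressively behind the live input.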

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same causal-latent approach could be tested on related tasks such as live effect generation or multi-track mixing.
  • If the latent space supports coherent structure from past context alone, similar generators might handle other causal audio problems like real-time source separation.
  • Practical live-performance tools could incorporate the method once the autoencoder is fixed, because no future buffering is required.
  • Extending the benchmark to include longer performances would test whether the rolling attention state maintains coherence over many minutes.

Load-bearing premise

The latent space learned by the pre-trained causal audio autoencoder is rich enough for the generator to learn coherent accompaniments from only causally available mix context and noise.

What would settle it

Running the model on the multi-instrument accompaniment benchmark and finding that it does not improve on at least two of the three reported objective measures (audio quality, beat alignment, mix adherence) or that it cannot sustain real-time generation without lookahead on consumer hardware.

Original abstract

We present LiveBand, a real-time system that generates high-fidelity music accompaniments to live audio input, respecting strict causal constraints. Our method trains a causal transformer generator in the continuous latent space of a pre-trained causal audio autoencoder, using adversarial sequence-level supervision from a discriminator. At each timestep, the generator receives only the causally available mix context and Gaussian noise, and predicts accompaniment latents without access to future mix frames or ground-truth target latents. Training is performed in a single parallel forward pass under causal masking, while streaming inference proceeds autoregressively with a rolling attention state. The model's training and inference computations are matched by design, eliminating teacher forcing and the associated exposure bias. On a multi-instrument music accompaniment benchmark, LiveBand improves over prior work on objective measures of audio quality, beat alignment, and mix adherence, while enabling real-time streaming generation without lookahead into the future on consumer hardware.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents LiveBand, a real-time music accompaniment generation system that operates strictly causally. It encodes input via a fixed pre-trained causal audio autoencoder, trains a causal transformer generator in that latent space using adversarial sequence-level discrimination, and produces accompaniment latents from only causally-masked mix context plus noise. Training uses a single parallel forward pass under causal masking; inference is autoregressive with rolling attention. The work claims objective improvements over prior methods on audio quality, beat alignment, and mix adherence for a multi-instrument benchmark, together with real-time streaming performance on consumer hardware without lookahead.

Significance. If the empirical results hold and the latent-space assumption is validated, the work would advance practical live-accompaniment systems by demonstrating that strict causality can be maintained while matching training and inference procedures, thereby avoiding exposure bias. The adversarial sequence-level supervision and explicit alignment of train/inference computation are concrete strengths.

major comments (2)
  1. [Method] Method section (description of the generator and autoencoder): the central claim that the generator produces coherent, beat-aligned, mix-adherent accompaniments from only causally available mix latents plus noise rests on the untested assumption that the fixed pre-trained causal autoencoder's continuous latent space already encodes the necessary harmonic, rhythmic, and timbral relations in a form accessible to the transformer. No ablation replacing the encoder, no analysis of latent-space musical structure preservation, and no comparison against a jointly trained encoder are reported; if the space collapses or entangles these features, the reported objective gains cannot be attributed to the generator training procedure.
  2. [Experiments] Experiments section (benchmark results): the abstract asserts improvements on objective measures of audio quality, beat alignment, and mix adherence, yet the provided text supplies neither the numerical values, error bars, dataset statistics, nor the precise baseline comparisons that would allow verification that the gains are statistically meaningful and not artifacts of the particular autoencoder choice.
minor comments (2)
  1. [Abstract] Abstract: the claim of improvement would be more informative if accompanied by at least the headline metric deltas rather than a purely qualitative statement.
  2. [Method] Notation: the distinction between 'causally available mix context' and the precise masking schedule used during the parallel training pass should be clarified with an equation or diagram to avoid ambiguity about what information is visible at each timestep.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, indicating planned changes to the manuscript where appropriate.

Point-by-point responses
  1. Referee: [Method] Method section (description of the generator and autoencoder): the central claim that the generator produces coherent, beat-aligned, mix-adherent accompaniments from only causally available mix latents plus noise rests on the untested assumption that the fixed pre-trained causal autoencoder's continuous latent space already encodes the necessary harmonic, rhythmic, and timbral relations in a form accessible to the transformer. No ablation replacing the encoder, no analysis of latent-space musical structure preservation, and no comparison against a jointly trained encoder are reported; if the space collapses or entangles these features, the reported objective gains cannot be attributed to the generator training procedure.

    Authors: We agree that the attribution of gains to the generator training procedure would be strengthened by explicit validation of the latent space. The fixed pre-trained causal autoencoder was chosen specifically to enforce strict causality and enable real-time inference without joint optimization overhead. While the original autoencoder publication reports strong reconstruction metrics on music, we did not include ablations, latent-space analyses, or joint-training comparisons in this work. In revision we will add a dedicated paragraph in Section 3.1 citing the autoencoder's reported preservation of harmonic and rhythmic features and explicitly noting the lack of encoder ablations as a limitation and avenue for future investigation. revision: partial

  2. Referee: [Experiments] Experiments section (benchmark results): the abstract asserts improvements on objective measures of audio quality, beat alignment, and mix adherence, yet the provided text supplies neither the numerical values, error bars, dataset statistics, nor the precise baseline comparisons that would allow verification that the gains are statistically meaningful and not artifacts of the particular autoencoder choice.

    Authors: The referee correctly observes that the main narrative does not quote the numerical results. These appear in Table 2 (with means and standard deviations) and Figure 3, along with dataset details in Section 4.1. We will revise Section 4.2 to inline the key metric values, explicitly reference the error bars and statistical comparisons, and add a sentence on dataset scale (number of tracks and total duration) to allow independent verification of the reported improvements. revision: yes
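
The reporting the authors commit to (means, standard deviations, and statistical comparisons) could be sketched as follows. All numbers below are synthetic placeholders, not results from the paper, and the paired bootstrap is one common choice assumed here, not a procedure the paper states.

```python
import numpy as np

def summarize(values):
    """Mean and sample standard deviation of a per-track metric."""
    values = np.asarray(values, dtype=float)
    return float(values.mean()), float(values.std(ddof=1))

def paired_bootstrap_ci(model_scores, baseline_scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the mean per-track difference (model minus baseline)."""
    deltas = np.asarray(model_scores, dtype=float) - np.asarray(baseline_scores, dtype=float)
    rng = np.random.default_rng(seed)
    idx = rng.integers(0, len(deltas), size=(n_resamples, len(deltas)))
    resampled_means = deltas[idx].mean(axis=1)
    lo, hi = np.quantile(resampled_means, [alpha / 2, 1 - alpha / 2])
    return float(deltas.mean()), float(lo), float(hi)

# Synthetic per-track scores for illustration only.
rng = np.random.default_rng(1)
baseline = rng.normal(0.50, 0.05, size=200)
model = baseline + rng.normal(0.03, 0.02, size=200)

mean_delta, ci_lo, ci_hi = paired_bootstrap_ci(model, baseline)
print(summarize(model), summarize(baseline))
print(mean_delta, ci_lo, ci_hi)
```

An interval that excludes zero is the kind of evidence the referee asks for when questioning whether the gains are statistically meaningful.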

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

Full rationale

The paper describes an independent adversarial training procedure for a causal transformer generator operating in the fixed latent space of an external pre-trained causal audio autoencoder. No equations, predictions, or central claims reduce by construction to fitted inputs, self-definitions, or self-citation chains. The benchmark improvements are presented as empirical outcomes of the described training, not as tautological renamings or forced results. The approach matches training and inference computations explicitly but does not create circularity in the claimed results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters or invented entities; the primary domain assumption is the adequacy of the pre-trained autoencoder latent space for the described task.

axioms (1)
  • domain assumption: The latent space of the pre-trained causal audio autoencoder supports learning of coherent accompaniments from causally available context only.
    Invoked when describing the generator input and output in the method

pith-pipeline@v0.9.1-grok · 5705 in / 1267 out tokens · 23714 ms · 2026-06-28T08:34:23.375758+00:00 · methodology

Reference graph

Works this paper leans on

65 extracted references · 15 canonical work pages · 10 internal anchors

  1. [1]

    LiveBand: Live Accompaniment Generation in the Audio Domain

    INTRODUCTION. Designing AI systems to jam (creating musical accompaniments in real time while listening to a live audio stream) is a long-standing goal at the intersection of music information retrieval, generative modelling, and human-computer interaction. Such a system would enable musicians to jam with an AI companion that responds naturally to their ...

  2. [2]

    Early systems range from rule-based and symbolic approaches [8–14] to recent neural models operating directly on audio [2–4, 15–18]

    RELATED WORK. Real-time accompaniment generation builds on prior work in both automated and interactive music generation. Early systems range from rule-based and symbolic approaches [8–14] to recent neural models operating directly on audio [2–4, 15–18]. While these latter systems show that high-quality accompaniment can be learned from acoustic con...

  3. [3]

    BACKGROUND. 3.1 Teacher/Student Forcing Drift. Autoregressive models are usually trained with teacher forcing: $\mathcal{L}_{\mathrm{TF}} = -\sum_t \log p_\theta(x_t \mid x^\star_{<t})$ (1), where $x^\star_{<t}$ is the ground-truth past. At inference, the same model must instead sample $\hat{x}_t \sim p_\theta(x_t \mid \hat{x}_{<t})$ (2). This mismatch is exposure bias: the model is optimized on histories from the data distribution, but depl...

  4. [4]

    Let m = (m_1, m_2,

    LIVEBAND. 4.1 Problem formulation. We consider the task of real-time accompaniment generation from a live input mix. Let $m = (m_1, m_2, \ldots)$ denote the sequence of mix latent frames and $a = (a_1, a_2, \ldots)$ the sequence of accompaniment latent frames to be generated. At streaming step $t$, the model has access to the causally available mix history $m_{\le t}$ = ($m_1$, ...

  5. [5]

    We form each training example by selecting one stem as the target accompaniment, randomly choosing a subset of [1,

    EXPERIMENTS. 5.1 Dataset. Unless otherwise stated, all models are trained and evaluated on the official Slakh2100 train/test split [48]. We form each training example by selecting one stem as the target accompaniment, randomly choosing a subset of [1, ..., N−1] remaining stems, and summing them to create the conditioning mix [2, 5]. Audio is encoded with...

  6. [6]

    6.1 Sink vs

    RESULTS. We provide audio examples at this link (footnote 2: https://sonycslparis.github.io/liveband-companion). 6.1 Sink vs. No-Sink. Table 2 reports the sink ablation, isolating long-form drift at the main real-time operating ...

    Model     ΔFAD_vgg  ΔFAD_clap  ΔBeat  ΔCOC_full  ΔCOC_harm  ΔCOC_perc
    w/o sink  -0.02     -0.06      +0.01  +0.29      +0.27      +0.34
    sink      -0.02     -0.05      +0.02  +0.31      +0.39      +0.36

    Ta...

  7. [7]

    The effective frame budget for τ = 0.1 s is exactly 92.88 ms, corresponding to one latent frame (4096 audio samples at 44.1 kHz)

    All measurements are averaged across 128 streaming steps. The effective frame budget for τ = 0.1 s is exactly 92.88 ms, corresponding to one latent frame (4096 audio samples at 44.1 kHz). In eager mode, end-to-end generation plus decoding already remains within this real-time budget. With torch.compile, latency is substantially reduced. These measurements confirm...

  8. [8]

    By pairing a causal transformer with sequence-level adversarial supervision, we eliminate teacher forcing and the associated exposure bias

    CONCLUSION We introduced LiveBand, a real-time system for live music accompaniment that operates under strict causal and latency constraints. By pairing a causal transformer with sequence-level adversarial supervision, we eliminate teacher forcing and the associated exposure bias. This fully aligns training with streaming inference, allowing the model to ...

  9. [9]

    Automated accompaniment generation systems raise questions around authorship, the impact on professional musicians, and the potential for misuse in generating deceptive content

    ETHICS STATEMENT. This work is intended for creative and artistic applications. Automated accompaniment generation systems raise questions around authorship, the impact on professional musicians, and the potential for misuse in generating deceptive content. We encourage the community to develop appropriate guidelines as these technologies mature

  10. [10]

    Live music models

    L. Team, A. Caillon, B. McWilliams, C. Tarakajian, I. Simon, I. Manco, J. Engel, N. Constant, Y. Li, T. I. Denk et al., “Live music models,” arXiv preprint arXiv:2508.04651, 2025

  11. [11]

    Diff-a-riff: Musical accompaniment co-creation via latent diffusion models,

    J. Nistal, M. Pasini, C. Aouameur, M. Grachten, and S. Lattner, “Diff-a-riff: Musical accompaniment co-creation via latent diffusion models,” in Proceedings of the 25th International Society for Music Information Retrieval Conference, 2024, pp. 272–280

  12. [12]

    Improved diff-a-riff: Musical accompaniment co-creation via latent diffusion models,

    ——, “Improved diff-a-riff: Musical accompaniment co-creation via latent diffusion models,” in NeurIPS 2024 Workshop, 2024

  13. [13]

    Stemgen: A music generation model that listens,

    J. Parker, J. Spijkervet, K. Kosta, F. Yesiler, B. Kuznetsov, J.-C. Wang, M. Avent, J. Chen, and D. Le, “Stemgen: A music generation model that listens,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024, pp. 1116–1120

  14. [14]

    Streaming generation for music accompaniment,

    Y. Wu, M. Wang, H. Lei, S. Brade, L. Blanchard, S.-L. Wu, A. C. Courville, and C.-Z. A. Huang, “Streaming generation for music accompaniment,” arXiv preprint arXiv:2510.22105, 2025

  15. [15]

    Why exposure bias matters: An imitation learning perspective of error accumulation in language generation,

    K. Arora, L. E. Asri, H. Bahuleyan, and J. C. K. Cheung, “Why exposure bias matters: An imitation learning perspective of error accumulation in language generation,” in Findings of the Association for Computational Linguistics: ACL 2022, 2022

  16. [16]

    Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

    X. Huang, Z. Li, G. He, M. Zhou, and E. Shechtman, “Self forcing: Bridging the train-test gap in autoregressive video diffusion,” arXiv preprint arXiv:2506.08009, 2025

  17. [17]

    An on-line algorithm for real-time accompaniment,

    R. B. Dannenberg, “An on-line algorithm for real-time accompaniment,” Proceedings of the 1984 International Computer Music Conference, 1984

  18. [18]

    Music transformer: Generating music with long-term structure,

    C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, I. Simon, C. Hawthorne, N. Shazeer, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck, “Music transformer: Generating music with long-term structure,” in 7th International Conference on Learning Representations (ICLR), 2019

  19. [19]

    SongDriver: Real-time music accompaniment generation without logical latency nor exposure bias,

    Z. Wang, K. Zhang, Y. Wang, C. Zhang, Q. Liang, P. Yu, Y. Feng, W. Liu, Y. Wang, Y. Bao, and Y. Yang, “SongDriver: Real-time music accompaniment generation without logical latency nor exposure bias,” in Proceedings of the 30th ACM International Conference on Multimedia (MM), 2022

  20. [20]

    Life with GenJam: Interacting with a musical IGA,

    J. A. Biles, “Life with GenJam: Interacting with a musical IGA,” in Proceedings of the 1999 IEEE International Conference on Systems, Man, and Cybernetics, vol. 3. Tokyo, Japan: IEEE, 1999, pp. 652–656

  21. [21]

    Realjam: Real-time human-ai music jamming with reinforcement learning-tuned transformers,

    A. Scarlatos, Y. Wu, I. Simon, A. Roberts, T. Cooijmans, N. Jaques, C. Tarakajian, and C. A. Huang, “Realjam: Real-time human-ai music jamming with reinforcement learning-tuned transformers,” in Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems, CHI EA 2025, Yokohama, Japan, 26 April 2025 - 1 May 2025

  22. [22]

    Generative adversarial post-training mitigates reward hacking in live human-AI music interaction,

    Y. Wu, S. Brade, T. Ma, T.-J. Fowler, E. Yang, B. Banar, A. Courville, N. Jaques, and C.-Z. A. Huang, “Generative adversarial post-training mitigates reward hacking in live human-AI music interaction,” in The Fourteenth International Conference on Learning Representations, 2026

  23. [23]

    Anticipatory music transformer,

    J. Thickstun, D. L. W. Hall, C. Donahue, and P. Liang, “Anticipatory music transformer,” Transactions on Machine Learning Research, 2024

  24. [24]

    Musika! fast infinite waveform music generation,

    M. Pasini and J. Schlüter, “Musika! fast infinite waveform music generation,” in Proceedings of the 23rd International Society for Music Information Retrieval Conference (ISMIR), 2022, pp. 568–575

  25. [25]

    Bassnet: A variational gated autoencoder for conditional generation of bass guitar tracks with learned interactive control,

    M. Grachten, S. Lattner, and E. Deruty, “Bassnet: A variational gated autoencoder for conditional generation of bass guitar tracks with learned interactive control,” Applied Sciences, 2020

  26. [26]

    Bass accompaniment generation via latent diffusion,

    M. Pasini, M. Grachten et al., “Bass accompaniment generation via latent diffusion,” in ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024

  27. [27]

    DRUMGAN: synthesis of drum sounds with timbral feature conditioning using generative adversarial networks,

    J. Nistal, S. Lattner et al., “DRUMGAN: synthesis of drum sounds with timbral feature conditioning using generative adversarial networks,” in Proceedings of the 21th International Society for Music Information Retrieval Conference (ISMIR), Oct. 2020

  28. [28]

    Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP

    T. Karchkhadze and S. Dubnov, “Towards real-time human-ai musical co-performance: Accompaniment generation with latent diffusion models and max/msp,” arXiv preprint arXiv:2604.07612, 2026

  29. [29]

    Time-series generative adversarial networks,

    J. Yoon, D. Jarrett, and M. van der Schaar, “Time-series generative adversarial networks,” in Advances in Neural Information Processing Systems 32 (NeurIPS), 2019

  30. [30]

    Adversarial audio synthesis,

    C. Donahue, J. J. McAuley et al., “Adversarial audio synthesis,” in 7th International Conference on Learning Representations (ICLR), May 2019

  31. [31]

    GANSynth: Adversarial neural audio synthesis,

    J. H. Engel, K. K. Agrawal et al., “GANSynth: Adversarial neural audio synthesis,” in 7th International Conference on Learning Representations (ICLR), May 2019

  32. [32]

    VQCPC-GAN: Variable-Length Adversarial Audio Synthesis Using Vector-Quantized Contrastive Predictive Coding,

    J. Nistal, C. Aouameur, S. Lattner, and G. Richard, “VQCPC-GAN: Variable-Length Adversarial Audio Synthesis Using Vector-Quantized Contrastive Predictive Coding,” in Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2021

  33. [33]

    The GAN is dead; long live the GAN! A modern GAN baseline,

    N. Huang, A. Gokaslan, V. Kuleshov, and J. Tompkin, “The GAN is dead; long live the GAN! A modern GAN baseline,” in Advances in Neural Information Processing Systems 37 (NeurIPS), 2024

  34. [34]

    Sequence level training with recurrent neural networks,

    M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level training with recurrent neural networks,” in 4th International Conference on Learning Representations (ICLR), 2016

  35. [35]

    Self-Forcing++: Towards Minute-Scale High-Quality Video Generation

    J. Cui, J. Wu, M. Li, T. Yang, X. Li, R. Wang, A. Bai, Y. Ban, and C.-J. Hsieh, “Self-forcing++: Towards minute-scale high-quality video generation,” arXiv preprint arXiv:2510.02283, 2025

  36. [36]

    Diffusion forcing: Next-token prediction meets full-sequence diffusion,

    B. Chen, D. Marti Monso, Y. Du, M. Simchowitz, R. Tedrake, and V. Sitzmann, “Diffusion forcing: Next-token prediction meets full-sequence diffusion,” in Advances in Neural Information Processing Systems 37 (NeurIPS), 2024

  37. [37]

    Continuous autoregressive models with noise augmentation avoid error accumulation,

    M. Pasini, J. Nistal, S. Lattner, and G. Fazekas, “Continuous autoregressive models with noise augmentation avoid error accumulation,” arXiv preprint arXiv:2411.18447, 2024

  38. [38]

    Context forcing: Consistent autoregressive video generation with long context

    S. Chen, C. Wei, S. Sun, P. Nie, K. Zhou, G. Zhang, M.-H. Yang, and W. Chen, “Context forcing: Consistent autoregressive video generation with long context,” arXiv preprint arXiv:2602.06028, 2026

  39. [39]

    Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

    H. Zhu, M. Zhao, G. He, H. Su, C. Li, and J. Zhu, “Causal forcing: Autoregressive diffusion distillation done right for high-quality real-time interactive video generation,” arXiv preprint arXiv:2602.02214, 2026

  40. [40]

    Rolling forcing: Autoregressive long video diffusion in real time,

    K. Liu, W. Hu, J. Xu, Y. Shan, and S. Lu, “Rolling forcing: Autoregressive long video diffusion in real time,” ICLR, 2026

  41. [41]

    Efficient streaming language models with attention sinks,

    G. Xiao, Y. Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,” in The Twelfth International Conference on Learning Representations (ICLR), 2024

  42. [42]

    gpt-oss-120b & gpt-oss-20b Model Card

    OpenAI, “gpt-oss-120b & gpt-oss-20b model card,” arXiv preprint arXiv:2508.10925, 2025

  43. [43]

    Live Music Diffusion Models: Efficient Fine-Tuning and Post-Training of Interactive Diffusion Music Generators

    Z. Novack, S. Brade, H. Kim, H. F. García, N. Shikarpur, C. Talegaonkar, S. Kim, V. K. Chen, J. McAuley, T. Berg-Kirkpatrick et al., “Live music diffusion models: Efficient fine-tuning and post-training of interactive diffusion music generators,” arXiv preprint arXiv:2605.22717, 2026

  44. [44]

    Generative adversarial nets,

    I. J. Goodfellow, J. Pouget-Abadie et al., “Generative adversarial nets,” in Advances in Neural Information Processing Systems 27, Dec. 2014

  45. [45]

    Geometric GAN

    J. H. Lim and J. C. Ye, “Geometric gan,” arXiv preprint arXiv:1705.02894, 2017

  46. [46]

    The relativistic discriminator: A key element missing from standard GAN,

    A. Jolicoeur-Martineau, “The relativistic discriminator: A key element missing from standard GAN,” in 7th International Conference on Learning Representations (ICLR), 2019

  47. [47]

    Which training methods for GANs do actually converge?

    L. M. Mescheder, A. Geiger et al., “Which training methods for GANs do actually converge?” in Proceedings of the 35th International Conference on Machine Learning (ICML), ser. Proceedings of Machine Learning Research, vol. 80, Jul. 2018

  48. [48]

    Analyzing and Improving the Image Quality of StyleGAN,

    T. Karras, S. Laine et al., “Analyzing and Improving the Image Quality of StyleGAN,” in 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2020, Seattle, WA, USA, June 13-19, 2020, 2020

  49. [49]

    CoDiCodec: Unifying continuous and discrete compressed representations of audio,

    M. Pasini, S. Lattner, and G. Fazekas, “CoDiCodec: Unifying continuous and discrete compressed representations of audio,” in Proceedings of the 26th International Society for Music Information Retrieval Conference (ISMIR), 2025

  50. [50]

    Arbitrary style transfer in real-time with adaptive instance normalization,

    X. Huang and S. J. Belongie, “Arbitrary style transfer in real-time with adaptive instance normalization,” in IEEE International Conference on Computer Vision (ICCV), Oct. 2017

  51. [51]

    Query-key normalization for transformers,

    A. Henry, P. R. Dachapally, S. S. Pawar, and Y. Chen, “Query-key normalization for transformers,” arXiv preprint arXiv:2010.04245, 2020

  52. [52]

    RoFormer: Enhanced transformer with rotary position embedding,

    J. Su, M. H. M. Ahmed, Y. Lu, S. Pan, W. Bo, and Y. Liu, “RoFormer: Enhanced transformer with rotary position embedding,” Neurocomputing, vol. 568, p. 127063, 2024

  53. [53]

    GLU Variants Improve Transformer

    N. Shazeer, “GLU variants improve transformer,” arXiv preprint arXiv:2002.05202, 2020

  54. [54]

    FlexAttention: A programming model for generating optimized attention kernels

    J. Dong, B. Feng, D. Guessous, Y. Liang, and H. He, “FlexAttention: A programming model for generating optimized attention kernels,” arXiv preprint arXiv:2408.00118, 2024

  55. [55]

    A convnet for the 2020s,

    Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, and S. Xie, “A convnet for the 2020s,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022, 2022

  56. [56]

    Adam: A method for stochastic optimization,

    D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in 3rd International Conference on Learning Representations (ICLR), May 2015

  57. [57]

    Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity,

    E. Manilow, G. Wichern, P. Seetharaman, and J. Le Roux, “Cutting music source separation some Slakh: A dataset to study the impact of training data quality and quantity,” in Proc. IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2019, pp. 45–49

  58. [58]

    Fréchet Audio Distance: A Reference-Free Metric for Evaluating Music Enhancement Algorithms,

    K. Kilgour, M. Zuluaga et al., “Fréchet Audio Distance: A Reference-Free Metric for Evaluating Music Enhancement Algorithms,” in 20th Annual Conference of the International Speech Communication Association (INTERSPEECH), Sep. 2019, place: Graz, Austria

  59. [59]

    CNN architectures for large-scale audio classification,

    S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, M. Slaney, R. J. Weiss, and K. W. Wilson, “CNN architectures for large-scale audio classification,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2017, New Orleans, LA, USA, March 5-9,

  60. [60]

    IEEE, 2017, pp. 131–135

  61. [61]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,

    Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, “Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5

  62. [62]

    madmom: A New Python Audio and Music Signal Processing Library,

    S. Böck, F. Korzeniowski et al., “madmom: A New Python Audio and Music Signal Processing Library,” in Proceedings of the 2016 ACM Conference on Multimedia Conference (MM), Oct. 2016, place: Amsterdam, The Netherlands

  63. [63]

    Beat this! accurate beat tracking without DBN postprocessing,

    F. Foscarin, J. Schlüter, and G. Widmer, “Beat this! accurate beat tracking without DBN postprocessing,” in Proceedings of the 25th International Society for Music Information Retrieval Conference, ISMIR 2024, San Francisco, California, USA and Online, November 10-14, 2024, 2024

  64. [64]

    Cocola: Coherence-oriented contrastive learning of musical audio representations,

    R. Ciranni, G. Mariani, M. Mancusi, E. Postolache, G. Fabbro, E. Rodolà, and L. Cosmo, “Cocola: Coherence-oriented contrastive learning of musical audio representations,” pp. 1–5, 2025

  65. [65]

    High-fidelity audio compression with improved RVQGAN,

    R. Kumar, P. Seetharaman, A. Luebs, I. Kumar, and K. Kumar, “High-fidelity audio compression with improved RVQGAN,” in Advances in Neural Information Processing Systems 36 (NeurIPS), 2023
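
The training-pair construction described in the experiments excerpt (entry 5 above) can be sketched in a few lines. This is an illustrative reading, not the authors' code: the function name is hypothetical, the stems are random arrays standing in for audio, and the subset size is assumed to be drawn uniformly from 1 to N−1.

```python
import numpy as np

def make_training_pair(stems, rng):
    """Pick one stem as the target and sum a random subset of the others as the mix.

    stems: array of shape (N, num_samples), one row per instrument stem.
    Returns (mix, target, target_idx, context_idx).
    """
    stems = np.asarray(stems, dtype=float)
    n = stems.shape[0]
    if n < 2:
        raise ValueError("need at least two stems to build a mix and a target")
    target_idx = int(rng.integers(n))
    remaining = [i for i in range(n) if i != target_idx]
    k = int(rng.integers(1, n))  # subset size in [1, N-1]; upper bound is exclusive
    chosen = rng.choice(remaining, size=k, replace=False)
    context_idx = sorted(int(i) for i in chosen)
    mix = stems[context_idx].sum(axis=0)  # conditioning mix excludes the target stem
    return mix, stems[target_idx], target_idx, context_idx

rng = np.random.default_rng(0)
stems = rng.normal(size=(5, 1024))  # synthetic stand-in for five audio stems
mix, target, target_idx, context_idx = make_training_pair(stems, rng)
print(target_idx, context_idx, mix.shape)
```

Drawing a different target and subset for each example exposes the generator to mixes of varying density, from a single instrument up to the full remaining band.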