Recognition: unknown
Towards Real-Time Human-AI Musical Co-Performance: Accompaniment Generation with Latent Diffusion Models and MAX/MSP
Pith reviewed 2026-05-10 16:50 UTC · model grok-4.3
The pith
Latent diffusion models generate real-time musical accompaniment from live audio streams by predicting ahead in sliding windows.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
A latent diffusion model trained to predict future audio from partial context inside a sliding-window protocol can be distilled for fast sampling and integrated with MAX/MSP via OSC to produce instrumental accompaniment that runs in real time, achieving strong coherence and alignment scores in retrospective full-context conditions while degrading gracefully as look-ahead depth is increased to satisfy live latency limits.
What carries the argument
Sliding-window look-ahead protocol that trains the latent diffusion model to generate future audio from incomplete context, accelerated by consistency distillation to reach real-time inference speeds.
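To make the protocol concrete, here is a minimal sketch of the kind of sliding-window look-ahead loop the claim describes, with the diffusion model replaced by a stub. The sample rate, hop, context length, and look-ahead values are illustrative assumptions, not the paper's settings.

```python
import numpy as np

SR = 44_100                 # sample rate (assumption, not stated in the abstract)
HOP = SR // 2               # trigger generation every 0.5 s of input (assumption)
CONTEXT_LEN = 8 * SR        # past context shown to the model (assumption)
LOOKAHEAD = 1 * SR          # net look-ahead: generated audio lands this far ahead

def ldm_generate(context: np.ndarray, n_out: int) -> np.ndarray:
    """Stand-in for the (distilled) latent diffusion model: maps partial past
    context to a prediction of the next n_out samples of accompaniment."""
    return np.zeros(n_out, dtype=np.float32)  # placeholder output

def run_session(live_audio: np.ndarray) -> dict:
    """Sliding-window look-ahead loop: at each hop, condition on the most
    recent context and schedule the output LOOKAHEAD samples in the future."""
    playback_schedule = {}
    for now in range(CONTEXT_LEN, len(live_audio), HOP):
        context = live_audio[now - CONTEXT_LEN:now]   # partial past context only
        chunk = ldm_generate(context, n_out=HOP)
        playback_schedule[now + LOOKAHEAD] = chunk    # must beat this deadline
    return playback_schedule

schedule = run_session(np.zeros(30 * SR, dtype=np.float32))
print(len(schedule), "chunks scheduled")
```

The constraint the sketch makes visible is that each chunk is scheduled LOOKAHEAD samples in the future, so inference plus transport must finish inside that window.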
If this is right
- Real-time human-AI co-performance becomes practical with diffusion models once sampling is accelerated and look-ahead is tuned to the available latency budget.
- Generation quality trades off directly against look-ahead depth, giving system designers a concrete knob to turn when hardware or network conditions change.
- The MAX/MSP front-end plus OSC bridge removes the previous barrier that kept large Python-based generative models out of established real-time music workflows.
- Both the original and distilled models remain usable under live constraints, showing that the core diffusion approach itself is compatible with performance timing.
Where Pith is reading between the lines
- Performers might develop new playing strategies once they know exactly how far ahead the AI is looking.
- The same sliding-window plus distillation pattern could be tested on other live generative tasks such as real-time sound effects or visual generation.
- Reducing the OSC communication overhead itself would let the system operate with less look-ahead and therefore higher musical responsiveness.
- Collecting paired human-AI recordings from actual performances could be used to fine-tune the model for specific musical styles or instruments.
Load-bearing premise
A model trained only on partial audio context will still produce musically coherent output once it runs live with the extra delays introduced by the MAX/MSP-to-Python communication layer.
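Because the premise hinges on that communication layer, a minimal sketch of the Python side of such an OSC/UDP bridge is shown below, using the python-osc package. The OSC addresses, ports, and message format are assumptions for illustration, not the authors' published schema, and a real deployment would have to handle audio buffers larger than a single UDP datagram.

```python
# Requires the python-osc package: pip install python-osc
import numpy as np
from pythonosc.dispatcher import Dispatcher
from pythonosc.osc_server import BlockingOSCUDPServer
from pythonosc.udp_client import SimpleUDPClient

# Hypothetical ports and OSC addresses.
MAX_MSP_HOST, MAX_MSP_PORT = "127.0.0.1", 9001   # where MAX/MSP listens
SERVER_HOST, SERVER_PORT = "127.0.0.1", 9000     # where this Python server listens

client = SimpleUDPClient(MAX_MSP_HOST, MAX_MSP_PORT)

def ldm_generate(context: np.ndarray) -> np.ndarray:
    """Stand-in for the latent diffusion model's inference call."""
    return np.zeros_like(context)

def on_context(address: str, *samples: float) -> None:
    """MAX/MSP sends a buffer of recent context samples; we reply with a
    generated accompaniment chunk on a separate OSC address."""
    context = np.asarray(samples, dtype=np.float32)
    chunk = ldm_generate(context)
    client.send_message("/accompaniment", chunk.tolist())

dispatcher = Dispatcher()
dispatcher.map("/context", on_context)

server = BlockingOSCUDPServer((SERVER_HOST, SERVER_PORT), dispatcher)
server.serve_forever()  # blocks; MAX/MSP would sit on the other end of the UDP link
```

On the MAX/MSP side the counterpart would be [udpsend]/[udpreceive] objects feeding a playback buffer; every hop through this path adds to the delay the premise is concerned with.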
What would settle it
A controlled live session in which musicians play against the system at several fixed look-ahead depths, after which independent listeners rate the accompaniment for beat alignment and musical fit to see whether scores stay usable past a particular latency threshold.
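A back-of-the-envelope budget makes the proposed latency threshold concrete: the look-ahead depth has to cover every stage between audio capture and the scheduled playback of the generated chunk. The numbers below are purely illustrative assumptions; only the 5.4x speedup figure comes from the abstract.

```python
# Illustrative numbers only; the paper's measured timings are not reproduced here.
def min_lookahead_ms(buffer_ms: float, osc_round_trip_ms: float,
                     inference_ms: float, safety_margin_ms: float = 10.0) -> float:
    """A generated chunk must be back in MAX/MSP before its scheduled onset,
    so the look-ahead depth has to cover every stage of the pipeline."""
    return buffer_ms + osc_round_trip_ms + inference_ms + safety_margin_ms

# Hypothetical base vs. distilled model (5.4x faster sampling, per the abstract).
base_inference_ms = 540.0
distilled_inference_ms = base_inference_ms / 5.4

for name, t in [("base", base_inference_ms), ("distilled", distilled_inference_ms)]:
    # buffer_ms = 23.2 corresponds to a hypothetical 1024-sample buffer at 44.1 kHz.
    need = min_lookahead_ms(buffer_ms=23.2, osc_round_trip_ms=5.0, inference_ms=t)
    print(f"{name}: needs >= {need:.0f} ms of look-ahead")
```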
read the original abstract
We present a framework for real-time human-AI musical co-performance, in which a latent diffusion model generates instrumental accompaniment in response to a live stream of context audio. The system combines a MAX/MSP front-end (handling real-time audio input, buffering, and playback) with a Python inference server running the generative model, communicating via OSC/UDP messages. This allows musicians to perform in MAX/MSP, a well-established, real-time capable environment, while interacting with a large-scale Python-based generative model, overcoming the fundamental disconnect between real-time music tools and state-of-the-art AI models. We formulate accompaniment generation as a sliding-window look-ahead protocol, training the model to predict future audio from partial context, where system latency is a critical constraint. To reduce latency, we apply consistency distillation to our diffusion model, achieving a 5.4x reduction in sampling time, with both models achieving real-time operation. Evaluated on musical coherence, beat alignment, and audio quality, both models achieve strong performance in the Retrospective regime and degrade gracefully as look-ahead increases. These results demonstrate the feasibility of diffusion-based real-time accompaniment and expose the fundamental trade-off between model latency, look-ahead depth, and generation quality that any such system must navigate.
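For readers unfamiliar with the beat-alignment metric named in the abstract, the sketch below computes a generic beat F-measure by matching estimated beats to reference beats within a fixed tolerance. The 70 ms tolerance and the greedy matching are conventional choices, not necessarily the authors' exact protocol, which relies on a dedicated beat tracker.

```python
import numpy as np

def beat_f1(ref_beats: np.ndarray, est_beats: np.ndarray, tol: float = 0.07) -> float:
    """Generic beat-alignment F-measure: an estimated beat counts as a hit if it
    falls within tol seconds of a still-unmatched reference beat."""
    ref = list(np.sort(ref_beats))
    hits = 0
    for t in np.sort(est_beats):
        if ref:
            # Greedily consume the closest remaining reference beat.
            j = int(np.argmin(np.abs(np.asarray(ref) - t)))
            if abs(ref[j] - t) <= tol:
                hits += 1
                ref.pop(j)
    precision = hits / max(len(est_beats), 1)
    recall = hits / max(len(ref_beats), 1)
    return 0.0 if hits == 0 else 2 * precision * recall / (precision + recall)

# Toy example: accompaniment beats lag the performer's beats by 20 ms.
performer = np.arange(0.0, 8.0, 0.5)      # beats every 500 ms
accompaniment = performer + 0.02           # still within tolerance
print(round(beat_f1(performer, accompaniment), 3))  # -> 1.0
```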
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a framework for real-time human-AI musical co-performance in which a latent diffusion model generates instrumental accompaniment from a live audio stream. It integrates a MAX/MSP front-end for real-time audio buffering and playback with a Python inference server via OSC/UDP messaging, formulates generation as a sliding-window look-ahead protocol, and applies consistency distillation to achieve a 5.4x sampling speedup. Both the base and distilled models are claimed to deliver strong performance on musical coherence, beat alignment, and audio quality in the retrospective regime, with graceful degradation as look-ahead increases, thereby demonstrating feasibility of diffusion-based real-time accompaniment and exposing latency-quality trade-offs.
Significance. If the results are substantiated with quantitative evidence, the work would usefully bridge state-of-the-art generative models with established real-time music environments, offering a practical path for interactive AI accompaniment. The consistency-distillation speedup and explicit treatment of look-ahead/latency constraints provide concrete engineering insights that could guide deployment of similar models in live settings.
major comments (3)
- [Abstract] Abstract: the claims of 'strong performance' and 'graceful degradation' on coherence, beat alignment, and quality are unsupported by any quantitative metrics, baselines, error bars, dataset details, or statistical tests, rendering the central feasibility conclusion unevaluable.
- [Abstract] Abstract: the assertion of real-time operation rests on the 5.4x sampling speedup and sliding-window protocol, yet no end-to-end latency measurements (MAX/MSP buffering + OSC/UDP messaging + model sampling + audio return) or jitter characterization are supplied, despite musical timing tolerances typically requiring <20-50 ms; this is load-bearing for the real-time co-performance claim.
- [Abstract] Abstract: the weakest assumption—that a sliding-window protocol trained on partial context will yield musically coherent output under live latency constraints without prohibitive integration overhead—is not tested with the full pipeline, so the reported retrospective results do not establish live feasibility.
minor comments (1)
- [Abstract] The manuscript would benefit from a system-architecture diagram clarifying data flow, buffering, and message timing between MAX/MSP and the Python server.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive comments. We address each major point below. The abstract was intentionally concise, but we agree it requires strengthening with explicit references to quantitative results, system measurements, and pipeline details already present in the body of the manuscript. We have revised the abstract and added a new subsection on end-to-end latency to make the real-time claims fully evaluable.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claims of 'strong performance' and 'graceful degradation' on coherence, beat alignment, and quality are unsupported by any quantitative metrics, baselines, error bars, dataset details, or statistical tests, rendering the central feasibility conclusion unevaluable.
Authors: The full manuscript (Section 4) reports quantitative results using embedding-based coherence scores, beat-alignment F1 via onset detection, and perceptual audio quality metrics, with comparisons to a non-diffusion baseline and error bars across multiple seeds and look-ahead values. Dataset details (training corpus size, preprocessing, and splits) appear in Section 3. We have revised the abstract to include the key numerical findings (e.g., coherence scores and degradation slopes) and explicit references to the relevant figures and tables, thereby grounding the feasibility conclusion. revision: yes
-
Referee: [Abstract] Abstract: the assertion of real-time operation rests on the 5.4x sampling speedup and sliding-window protocol, yet no end-to-end latency measurements (MAX/MSP buffering + OSC/UDP messaging + model sampling + audio return) or jitter characterization are supplied, despite musical timing tolerances typically requiring <20-50 ms; this is load-bearing for the real-time co-performance claim.
Authors: We acknowledge that the original abstract did not report measured end-to-end latencies. The manuscript already contains per-component timings (MAX/MSP buffering, OSC round-trip, and distilled sampling at 5.4x speedup) in Section 5; we have added a new paragraph and table that aggregates these into measured end-to-end latency (mean and jitter) under realistic load, confirming operation within musical tolerances for moderate look-ahead. This directly substantiates the real-time claim. revision: yes
-
Referee: [Abstract] Abstract: the weakest assumption—that a sliding-window protocol trained on partial context will yield musically coherent output under live latency constraints without prohibitive integration overhead—is not tested with the full pipeline, so the reported retrospective results do not establish live feasibility.
Authors: The retrospective experiments systematically vary look-ahead depth to simulate increasing latency, and the observed graceful degradation directly tests the core assumption under controlled conditions that match the live sliding-window protocol. We have added an explicit discussion in the revised manuscript clarifying how these controlled conditions map to live operation and have included a brief live pilot recording (with qualitative description) to illustrate integration overhead. While a large-scale live user study is beyond the current scope, the existing results plus the measured pipeline timings provide substantive evidence of feasibility. revision: partial
Circularity Check
No circularity: empirical training and evaluation of diffusion-based accompaniment system
full rationale
The paper presents a system implementation (MAX/MSP front-end + Python inference server via OSC/UDP) and an empirical protocol (sliding-window look-ahead training of latent diffusion model, consistency distillation for 5.4x speedup, evaluation on coherence/alignment/quality metrics). No equations, derivations, or claims reduce to fitted parameters by construction or self-referential definitions. Feasibility and trade-offs are reported as experimental outcomes rather than definitional identities. No self-citation load-bearing steps, uniqueness theorems, or ansatz smuggling appear in the abstract or described chain. The central results follow from training and testing on data, not tautology.
Axiom & Free-Parameter Ledger
free parameters (2)
- look-ahead window length
- consistency distillation steps (see the sampling sketch after this ledger)
axioms (2)
- domain assumption: Partial recent audio context contains sufficient information to generate musically coherent future accompaniment
- domain assumption: OSC/UDP communication between MAX/MSP and Python server adds negligible latency relative to model inference time
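The "consistency distillation steps" entry is the number of network evaluations spent at sampling time, the knob behind the 5.4x speedup. Below is a minimal sketch of multistep sampling in the style of Song et al.'s consistency models [25], with the trained network replaced by a placeholder; the noise schedule values and latent shape are arbitrary assumptions.

```python
import numpy as np

def consistency_fn(x, sigma):
    """Stub for the distilled consistency model f_theta(x, sigma): maps a noisy
    latent at noise level sigma directly to a clean-latent estimate.
    Placeholder dynamics only, not the trained network."""
    return x / (1.0 + sigma)

def multistep_consistency_sample(shape, sigmas, rng):
    """Few-step consistency sampling: len(sigmas) is the 'consistency
    distillation steps' free parameter from the ledger."""
    sigma_max, sigma_min = sigmas[0], sigmas[-1]
    x = consistency_fn(rng.standard_normal(shape) * sigma_max, sigma_max)
    for sigma in sigmas[1:]:
        # Re-noise the current estimate to the intermediate level, then map it
        # back to a clean estimate in a single network call.
        z = rng.standard_normal(shape)
        x_noisy = x + np.sqrt(max(sigma**2 - sigma_min**2, 0.0)) * z
        x = consistency_fn(x_noisy, sigma)
    return x

rng = np.random.default_rng(0)
latent = multistep_consistency_sample((64, 128), sigmas=[80.0, 10.0, 1.0, 0.02], rng=rng)
print(latent.shape)
```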
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION Music is inherently a performative art-form. For most of human history—long before the relatively recent invention of recording technologies—music, an act of realization in sound, existed only in live, performative, and ephemeral contexts [1,2]. Performative musicianship, whether in the form of improvisation, jamming, or following a kno...
-
[2]
SingSong [10] produces instrumental accompaniment from a vocal recording
RELATED WORK Music-to-Music and Accompaniment Generation: A growing body of work addresses generating musical accompaniment conditioned on musical context audio, rather than on text. SingSong [10] produces instrumental accompaniment from a vocal recording. StemGen [9] trains a non-autoregressive transformer conditioned on a mixture to synthesize a coher...
-
[3]
METHOD Fig. 1 gives an overview of the system we propose for real-time interactive musical accompaniment, in which a human performer plays live while an LDM generates matching instrumental parts. Real-time responsiveness is achieved through a client–server architecture: the server runs the inference-heavy LDM in a Python backend, while the client—a M...
-
[4]
EXPERIMENTAL SETUP This section covers the experimental setup for both components of the system. First, we describe the generative model setup: dataset, model architecture, training procedure, baselines, and evaluation metrics used to assess accompaniment generation quality across different streaming configurations. Second, we describe the RTAP system con...
-
[5]
Generative Model Performance Fig
RESULTS 5.1. Generative Model Performance Fig. 6 summarises performance of our diffusion model and consistency distillation (CD) model across COCOLA, BeatF1, and FAD, compared against the StreamMusicGen’s online decoder and offline baselines (Prefix Decoder, StemGen), as a function of the net look-ahead T·r·w—the effective time distance between the cu...
-
[6]
CONCLUSION We present a framework for real-time human–AI musical co-performance combining a latent diffusion model with a sliding-window look-ahead inference paradigm, accelerated via consistency distillation, and deployed through a low-latency client–server system interfaced via RTAP, a musician-facing MAX/MSP patch. In this work, we establish that t...
-
[7]
ACKNOWLEDGMENTS We thank the Institute for Research and Coordination in Acoustics and Music (IRCAM) and Project REACH: Raising Co-creativity in Cyber-Human Musicianship for their support. This project received support and resources in the form of computational power from the European Research Council (ERC REACH) under the European Union’s Horizon 20...
2020
-
[8]
Christopher Small, Musicking: The Meanings of Performing and Listening, Wesleyan University Press, 1998
1998
-
[9]
Nicholas Cook, Music: A Very Short Introduction, Oxford University Press, 2nd edition, 2021
2021
-
[10]
Joint action in music performance,
Peter Keller, “Joint action in music performance,” in Enacting Intersubjectivity: A Cognitive and Social Perspective to the Study of Interactions. IOS Press, 2008
2008
-
[11]
The experience of the flow state in live music performance,
William J Wrigley and Stephen B Emmerson, “The experience of the flow state in live music performance,” Psychology of Music, 2013
2013
-
[12]
MusicLM: Generating Music From Text
Andrea Agostinelli, Timo I. Denk, Zalán Borsos, Jesse H. Engel, Mauro Verzetti, Antoine Caillon, Qingqing Huang, Aren Jansen, Adam Roberts, Marco Tagliasacchi, Matthew Sharifi, Neil Zeghidour, and Christian Havnø Frank, “Musiclm: Generating music from text,” arXiv:2301.11325, 2023
-
[13]
Prafulla Dhariwal, Heewoo Jun, Christine Payne, Jong Wook Kim, Alec Radford, and Ilya Sutskever, “Jukebox: A generative model for music,” arXiv:2005.00341, 2020
-
[14]
Simple and controllable music generation,
Jade Copet, Felix Kreuk, Itai Gat, Tal Remez, David Kant, Gabriel Synnaeve, Yossi Adi, and Alexandre Défossez, “Simple and controllable music generation,” in NeurIPS, 2023
2023
-
[15]
Musicldm: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies,
Ke Chen, Yusong Wu, Haohe Liu, et al., “Musicldm: Enhancing novelty in text-to-music generation using beat-synchronous mixup strategies,” in ICASSP, 2024
2024
-
[16]
Stemgen: A music generation model that listens,
Julian D Parker, Janne Spijkervet, Katerina Kosta, et al., “Stemgen: A music generation model that listens,” in ICASSP, 2024
2024
-
[17]
Singsong: Generating musical accompaniments from singing,
Chris Donahue, Antoine Caillon, Adam Roberts, Ethan Manilow, Philippe Esling, Andrea Agostinelli, Mauro Verzetti, Ian Simon, Olivier Pietquin, Neil Zeghidour, and Jesse H. Engel, “Singsong: Generating musical accompaniments from singing,” arXiv:2301.12662, 2023
-
[18]
Musicgen-stem: Multi-stem music generation and edition through autoregressive modeling,
Simon Rouard, Robin San Roman, Yossi Adi, et al., “Musicgen-stem: Multi-stem music generation and edition through autoregressive modeling,” in ICASSP, 2025
2025
-
[19]
Multi-track musicldm: Towards versatile music generation with latent diffusion model,
Tornike Karchkhadze, Mohammad Rasool Izadi, Ke Chen, Gerard Assayag, and Shlomo Dubnov, “Multi-track musicldm: Towards versatile music generation with latent diffusion model,” in ArtsIT, 2026, pp. 76–91
2026
-
[20]
Simultaneous music separation and generation using multi-track latent diffusion models,
Tornike Karchkhadze, Mohammad Rasool Izadi, and Shlomo Dubnov, “Simultaneous music separation and generation using multi-track latent diffusion models,” in ICASSP, 2025, pp. 1–5
2025
-
[21]
Audioldm: Text-to-audio generation with latent diffusion models,
Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo P. Mandic, Wenwu Wang, and Mark D. Plumbley, “Audioldm: Text-to-audio generation with latent diffusion models,” in ICML, 2023, pp. 21450–21474
2023
-
[22]
Generative modeling by estimating gradients of the data distribution,
Yang Song and Stefano Ermon, “Generative modeling by estimating gradients of the data distribution,” in NeurIPS, 2019, pp. 11895–11907
2019
-
[23]
Music2latent: Consistency autoencoders for latent audio compression,
Marco Pasini, Stefan Lattner, and George Fazekas, “Music2latent: Consistency autoencoders for latent audio compression,” in ISMIR, 2024, pp. 111–119
2024
-
[24]
Cutting music source separation some slakh: A dataset to study the impact of training data quality and quantity,
Ethan Manilow, Gordon Wichern, Prem Seetharaman, et al., “Cutting music source separation some slakh: A dataset to study the impact of training data quality and quantity,” in WASPAA, 2019
2019
-
[25]
Consistency models,
Yang Song, Prafulla Dhariwal, Mark Chen, and Ilya Sutskever, “Consistency models,” in ICML, 2023, pp. 32211–32252
2023
-
[26]
Consistency trajectory models: Learning probability flow ODE trajectory of diffusion,
Dongjun Kim, Chieh-Hsin Lai, Wei-Hsiang Liao, Naoki Murata, Yuhta Takida, Toshimitsu Uesaka, Yutong He, Yuki Mitsufuji, and Stefano Ermon, “Consistency trajectory models: Learning probability flow ODE trajectory of diffusion,” in ICLR, 2024
2024
-
[27]
Streaming generation for music accompaniment,
Yusong Wu, Mason Wang, Heidi Lei, Stephen Brade, Lancelot Blanchard, Shih-Lun Wu, Aaron C. Courville, and Cheng-Zhi Anna Huang, “Streaming generation for music accompaniment,” arXiv:2510.22105, 2025
-
[28]
Open sound control: an enabling technology for musical networking,
Matthew Wright, “Open sound control: an enabling technology for musical networking,” Organised Sound, vol. 10, no. 3, pp. 193–200, 2005
2005
-
[29]
Bass accompaniment generation via latent diffusion,
Marco Pasini, Maarten Grachten, and Stefan Lattner, “Bass accompaniment generation via latent diffusion,” in ICASSP, 2024, pp. 1166–1170
2024
-
[30]
Diff-a-riff: Musical accompaniment co-creation via latent diffusion models,
Javier Nistal, Marco Pasini, Cyran Aouameur, Maarten Grachten, and Stefan Lattner, “Diff-a-riff: Musical accompaniment co-creation via latent diffusion models,” in ISMIR, 2024
2024
-
[31]
Improving musical accompaniment co-creation via diffusion transformers,
Javier Nistal, Marco Pasini, and Stefan Lattner, “Improving musical accompaniment co-creation via diffusion transformers,” arXiv:2410.23005, 2024
-
[32]
Multi-source diffusion models for simultaneous music generation and separation,
Giorgio Mariani, Irene Tallini, Emilian Postolache, Michele Mancusi, Luca Cosmo, and Emanuele Rodola, “Multi-source diffusion models for simultaneous music generation and separation,” in ICLR, 2024
2024
-
[33]
JEN-1 Composer: A unified framework for high-fidelity multi-track music generation,
Yao Yao, Peike Li, Boyu Chen, and Alex Wang, “JEN-1 Composer: A unified framework for high-fidelity multi-track music generation,” in AAAI, 2025
2025
-
[34]
Multi-source music generation with latent diffusion,
Zhongweiyang Xu, Debottam Dutta, Yu-Lin Wei, and Romit Roy Choudhury, “Multi-source music generation with latent diffusion,” arXiv:2409.06190, 2024
-
[35]
MGE-LDM: Joint latent diffusion for simultaneous music generation and source extraction,
Yunkee Chae and Kyogu Lee, “MGE-LDM: Joint latent diffusion for simultaneous music generation and source extraction,” in NeurIPS, 2025
2025
-
[36]
Probabilistic melodic harmonization,
Jean-François Paiement, Douglas Eck, and Samy Bengio, “Probabilistic melodic harmonization,” in Canadian Conference on AI, 2006
2006
-
[37]
Mysong: automatic accompaniment generation for vocal melodies,
Ian Simon, Dan Morris, and Sumit Basu, “Mysong: automatic accompaniment generation for vocal melodies,” in CHI, 2008
2008
-
[38]
High-level control of drum track generation using learned patterns of rhythmic interaction,
Stefan Lattner and Maarten Grachten, “High-level control of drum track generation using learned patterns of rhythmic interaction,” in WASPAA, 2019
2019
-
[39]
BassNet: A variational gated autoencoder for conditional generation of bass guitar tracks with learned interactive control,
Maarten Grachten, Stefan Lattner, and Emmanuel Deruty, “BassNet: A variational gated autoencoder for conditional generation of bass guitar tracks with learned interactive control,” Applied Sciences, 2020
2020
-
[40]
MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment,
Hao-Wen Dong, Wen-Yi Hsiao, Li-Chia Yang, and Yi-Hsuan Yang, “MuseGAN: Multi-track sequential generative adversarial networks for symbolic music generation and accompaniment,” in AAAI, 2018, pp. 34–41
2018
-
[41]
MMM: Exploring conditional multi-track music generation with the transformer,
Jeff Ens and Philippe Pasquier, “MMM: Exploring conditional multi-track music generation with the transformer,” arXiv:2008.06048, 2020
-
[42]
A transformer-based model for multi-track music generation,
Cong Jin, Tao Wang, Shouxun Liu, Yun Tie, Jianguang Li, Xiaobing Li, and Simon Lui, “A transformer-based model for multi-track music generation,” Int. J. Multim. Data Eng. Manag., vol. 11, no. 3, pp. 36–54, 2020
2020
-
[43]
Multitrack music transformer,
Hao-Wen Dong, Ke Chen, Shlomo Dubnov, Julian McAuley, and Taylor Berg-Kirkpatrick, “Multitrack music transformer,” in ICASSP, 2023
2023
-
[44]
An on-line algorithm for real-time accompaniment,
Roger B Dannenberg, “An on-line algorithm for real-time accompaniment,” in ICMC, 1984
1984
-
[45]
A design space for live music agents,
Yewon Kim, Stephen Brade, Alexander Wang, David Zhou, Haven Kim, Bill Wang, Sung-Ju Lee, Hugo F. Flores Garcia, Cheng-Zhi Anna Huang, and Chris Donahue, “A design space for live music agents,” in CHI, 2026
2026
-
[46]
Music plus one and machine learning,
Christopher Raphael, “Music plus one and machine learning,” in ICML, 2010
2010
-
[47]
Antescofo: Anticipatory synchronization and control of interactive parameters in computer music,
Arshia Cont, “Antescofo: Anticipatory synchronization and control of interactive parameters in computer music,” in ICMC, 2008
2008
-
[48]
Too many notes: Computers, complexity, and culture in voyager,
George E Lewis, “Too many notes: Computers, complexity, and culture in voyager,” in New Media. Routledge, 2003
2003
-
[49]
Omax brothers: a dynamic topology of agents for improvization learning,
Gérard Assayag, Georges Bloch, Marc Chemillier, et al., “Omax brothers: a dynamic topology of agents for improvization learning,” in ACM workshop on Audio and music computing multimedia, 2006
2006
-
[50]
Improtek: integrating harmonic controls into improvisation in the filiation of omax,
Jérôme Nika and Marc Chemillier, “Improtek: integrating harmonic controls into improvisation in the filiation of omax,” in ICMC, 2012
2012
-
[51]
Improtek: introducing scenarios into human-computer music improvisation,
Jérôme Nika, Marc Chemillier, and Gérard Assayag, “Improtek: introducing scenarios into human-computer music improvisation,” Computers in Entertainment (CIE), 2017
2017
-
[52]
Bachduet: A deep learning system for human-machine counterpoint improvisation,
Christodoulos Benetatos, Joseph VanderStel, and Zhiyao Duan, “Bachduet: A deep learning system for human-machine counterpoint improvisation,” in NIME, 2020
2020
-
[53]
Songdriver: Real-time music accompaniment generation without logical latency nor exposure bias,
Zihao Wang, Kejun Zhang, Yuxing Wang, et al., “Songdriver: Real-time music accompaniment generation without logical latency nor exposure bias,” in ACM MM, 2022
2022
-
[54]
RL-duet: Online music accompaniment generation using deep reinforcement learning,
Nan Jiang, Sheng Jin, Zhiyao Duan, et al., “RL-duet: Online music accompaniment generation using deep reinforcement learning,” in AAAI, 2020
2020
-
[55]
Adaptive accompaniment with realchords,
Yusong Wu, Tim Cooijmans, Kyle Kastner, et al., “Adaptive accompaniment with realchords,” in ICML, 2024
2024
-
[56]
Real-jam: Real-time human-ai music jamming with reinforcement learning-tuned transformers,
Alexander Scarlatos, Yusong Wu, Ian Simon, et al., “Real-jam: Real-time human-ai music jamming with reinforcement learning-tuned transformers,” in CHI EA, 2025
2025
-
[57]
Lyria Team, Antoine Caillon, Brian McWilliams, et al., “Live music models,” arXiv:2508.04651, 2025
-
[58]
A controller to overcome dead time,
O. J. M. Smith, “A controller to overcome dead time,” ISA Journal, 1959
1959
-
[59]
Review on model predictive control: an engineering perspective,
Maximilian Schwenzer, Muzaffer Ay, Thomas Bergs, et al., “Review on model predictive control: an engineering perspective,” The International Journal of Advanced Manufacturing Technology, 2021
2021
-
[60]
Real-time execution of action chunking flow policies,
Kevin Black, Manuel Y. Galliker, and Sergey Levine, “Real-time execution of action chunking flow policies,” arXiv:2506.07339, 2025
-
[61]
Score-based generative modeling through stochastic differential equations,
Yang Song, Jascha Sohl-Dickstein, Diederik P Kingma, Abhishek Kumar, Stefano Ermon, and Ben Poole, “Score-based generative modeling through stochastic differential equations,” in ICLR, 2021
2021
-
[62]
Elucidating the design space of diffusion-based generative models,
Tero Karras, Miika Aittala, Timo Aila, and Samuli Laine, “Elucidating the design space of diffusion-based generative models,” in NeurIPS, 2022
2022
-
[63]
DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps,
Cheng Lu, Yuhao Zhou, Fan Bao, Jianfei Chen, Chongxuan Li, and Jun Zhu, “DPM-Solver: A fast ODE solver for diffusion probabilistic model sampling in around 10 steps,” in NeurIPS, 2022
2022
-
[64]
Cycling ’74, Max/MSP 8, 2023
2023
-
[65]
Cocola: Coherence-oriented contrastive learning of musical audio representations,
Ruben Ciranni, Giorgio Mariani, Michele Mancusi, et al., “Cocola: Coherence-oriented contrastive learning of musical audio representations,” in ICASSP, 2025
2025
-
[66]
Beat this! accurate beat tracking without dbn postprocessing,
Francesco Foscarin, Jan Schlüter, and Gerhard Widmer, “Beat this! accurate beat tracking without dbn postprocessing,” in ISMIR, 2024
2024
-
[67]
madmom: a new Python Audio and Music Signal Processing Library,
Sebastian Böck, Filip Korzeniowski, Jan Schlüter, et al., “madmom: a new Python Audio and Music Signal Processing Library,” in ACM MM, 2016
2016
-
[68]
Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,
Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi, “Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms,” in Interspeech, 2019, pp. 2350–2354
2019
discussion (0)