Not that Groove: Zero-Shot Symbolic Music Editing
Pith reviewed 2026-05-22 16:20 UTC · model grok-4.3
The pith
Converting drum grooves to a spatial text grid lets off-the-shelf LLMs perform complex zero-shot edits from natural-language instructions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By converting drum grooves into drumroll notation, a spatial, syntax-driven text grid, off-the-shelf LLMs can deduce and apply edits that satisfy natural-language instructions without any fine-tuning or in-context examples.
What carries the argument
Drumroll notation, a text-based spatial grid that turns drum hit timing and velocity into a readable layout so LLMs can reason about musical constraints directly.
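A minimal Python sketch of how such a grid might be read back into step positions. It relies only on what the excerpts on this page state: one character per 16th note, four characters per beat, beats separated by |, and one row per instrument. The labels HH, SD, and BD, the hit symbols x and o, the rest symbol -, and the function names parse_drumroll and hit_steps are illustrative assumptions, not the paper's actual syntax.

```python
from typing import Dict, List

STEPS_PER_BEAT = 4  # one character per 16th note, four characters per beat


def parse_drumroll(text: str) -> Dict[str, List[str]]:
    """Parse 'LABEL: xxxx|xxxx|...' rows into {label: [one symbol per 16th note]}."""
    grid: Dict[str, List[str]] = {}
    for line in text.strip().splitlines():
        label, _, cells = line.partition(":")
        beats = cells.strip().split("|")
        if any(len(beat) != STEPS_PER_BEAT for beat in beats):
            raise ValueError(f"row {label!r} has a beat that is not 4 characters long")
        grid[label.strip()] = [symbol for beat in beats for symbol in beat]
    return grid


def hit_steps(row: List[str], rest: str = "-") -> List[int]:
    """Return the 0-based 16th-note steps that carry a hit (any non-rest symbol)."""
    return [step for step, symbol in enumerate(row) if symbol != rest]


# Hypothetical one-bar groove; labels and symbols are assumptions for illustration.
GROOVE = """
HH: x-x-|x-x-|x-x-|x-x-
SD: ----|o---|----|o---
BD: o---|----|o---|----
"""

grid = parse_drumroll(GROOVE)
print(hit_steps(grid["BD"]))  # [0, 8]
```

Because the layout is positional, a question such as "is there a kick on beat 3?" reduces to an index lookup, which is the kind of constraint reasoning the notation is meant to expose to an LLM.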
If this is right
- Music producers gain granular control over drum patterns using only text prompts and existing language models.
- Evaluation of symbolic edits becomes scalable and repeatable through code-based unit tests rather than subjective listening alone.
- The same notation-and-reasoning approach can be applied to other symbolic music tasks once similar representations are defined.
- Data collection for instruction-tuned music models is no longer a prerequisite for useful editing performance.
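To make the idea of code-based unit tests concrete, here is a hedged sketch of what a symbolic check could look like. The instruction ("more kick, but leave the hi-hat alone"), the row labels, the 16-character row format, and every function name are invented for illustration; the paper's real tests are not reproduced on this page.

```python
from typing import Callable, Dict, List

Groove = Dict[str, str]  # hypothetical: instrument label -> 16 symbols, one per 16th note
UnitTest = Callable[[Groove, Groove], bool]


def count_hits(groove: Groove, label: str, rest: str = "-") -> int:
    """Count non-rest symbols in one instrument row (a missing row counts as empty)."""
    return sum(1 for symbol in groove.get(label, "") if symbol != rest)


def kick_is_denser(original: Groove, edited: Groove) -> bool:
    """Assumed constraint: the edit adds bass-drum hits."""
    return count_hits(edited, "BD") > count_hits(original, "BD")


def hihat_is_untouched(original: Groove, edited: Groove) -> bool:
    """Assumed constraint: the hi-hat row is left exactly as it was."""
    return original.get("HH") == edited.get("HH")


def passes_all(original: Groove, edited: Groove, tests: List[UnitTest]) -> bool:
    """An edit counts as a success only if every test for the instruction passes."""
    return all(test(original, edited) for test in tests)


original = {"HH": "x-x-x-x-x-x-x-x-", "BD": "o-------o-------"}
edited = {"HH": "x-x-x-x-x-x-x-x-", "BD": "o---o---o---o---"}
tests = [kick_is_denser, hihat_is_untouched]

print(passes_all(original, edited, tests))  # True
```

Checks of this shape are cheap to rerun across models, which is what makes the evaluation repeatable; they also only see what the grid encodes.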
Where Pith is reading between the lines
- Similar grid notations could be designed for melody or chord sequences to extend zero-shot editing beyond percussion.
- The 68 percent success rate suggests that improvements in LLM step-by-step musical reasoning would directly raise editing reliability.
- Combining the drumroll method with existing audio generators might allow text instructions to control both symbolic structure and sound output.
- The benchmark itself supplies a reusable testbed for comparing future models or notations on controllable music editing.
Load-bearing premise
The drumroll notation plus the symbolic unit tests together capture all the musically important parts of a user's request without systematic blind spots.
What would settle it
An edited groove that passes every automated unit test yet is rated by professional musicians as failing to match the original natural-language instruction.
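A small sketch of how such counterexamples could be surfaced once both signals exist for the same edits. The record fields (id, passed_tests, musician_ok) and the four records are made up for illustration and are not data from the paper.

```python
from typing import Dict, List

Record = Dict[str, object]


def false_passes(records: List[Record]) -> List[object]:
    """Edits that pass every automated test but that musicians reject."""
    return [r["id"] for r in records if r["passed_tests"] and not r["musician_ok"]]


def agreement_rate(records: List[Record]) -> float:
    """Fraction of edits where the unit tests and the musicians agree."""
    if not records:
        raise ValueError("no records to compare")
    agreeing = sum(1 for r in records if r["passed_tests"] == r["musician_ok"])
    return agreeing / len(records)


# Hypothetical records, not results from the paper.
records = [
    {"id": "g1", "passed_tests": True, "musician_ok": True},
    {"id": "g2", "passed_tests": True, "musician_ok": False},
    {"id": "g3", "passed_tests": False, "musician_ok": False},
    {"id": "g4", "passed_tests": False, "musician_ok": True},
]

print(false_passes(records))    # ['g2']
print(agreement_rate(records))  # 0.5
```

A high agreement rate with a non-empty false-pass list would still count against the load-bearing premise, since the claim at stake is the absence of systematic blind spots, not average alignment.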
Original abstract
While recent advancements in AI music generation have predominantly focused on direct audio synthesis, these systems suffer from inherent rigidity, limiting their utility for professional music producers who require granular, highly malleable creative control. Symbolic music (e.g., MIDI) resolves this constraint by providing editable note-level parameters, yet the natural progression to instruction-driven symbolic music editing remains critically under-explored due to a severe scarcity of paired instruction-MIDI datasets. In this paper, we bypass this data bottleneck by formalizing zero-shot symbolic music editing as a structured reasoning task. We introduce a novel text-based "drumroll" notation that translates musical mechanics into a spatial, syntax-driven grid, empowering off-the-shelf Large Language Models (LLMs) to logically deduce and apply complex edits to drum grooves using only zero-shot prompting. To rigorously evaluate this paradigm, we propose Not that Groove, a comprehensive benchmark comprising thousands of drum grooves paired with specific, descriptive, and stylistic natural language instructions. Crucially, to overcome the prohibitive cost and subjectivity of human musical evaluation, we introduce a scalable, domain-informed automated unit-testing framework that symbolically verifies whether an edited groove satisfies the core constraints of the user's request. Our extensive experiments across eight state-of-the-art LLMs demonstrate the high efficacy of this approach, with the top-performing model achieving a 68% success rate on our automated unit tests. Furthermore, listening tests confirm that our programmatic unit tests align highly with the subjective judgments of professional musicians, establishing a robust, data-efficient, and scalable foundation for the future of controllable AI music production.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that converting drum grooves to a novel spatial 'drumroll' text notation enables off-the-shelf LLMs to perform complex zero-shot edits from natural-language instructions. It introduces the Not that Groove benchmark of thousands of groove-instruction pairs and a domain-informed automated unit-testing framework that symbolically verifies constraint satisfaction, reporting a 68% success rate for the best LLM; listening tests are said to align with professional musician judgments.
Significance. If the central claims hold, the work provides a data-efficient, training-free route to controllable symbolic music editing that could be practically useful for producers needing granular edits without large paired datasets. The scalable automated unit-test framework and its reported alignment with human listening tests constitute a methodological strength that could be adopted more broadly in music AI evaluation.
major comments (2)
- [Evaluation / unit-testing framework] Automated unit-testing framework (described in the evaluation section): the 68% success rate is load-bearing for the efficacy claim, yet the manuscript does not state whether the symbolic verification rules were fixed before any model outputs were inspected. Without this pre-specification or an explicit coverage analysis of continuous properties (swing, micro-timing, dynamics), it remains possible for an edit to pass the tests while failing to satisfy the full musical intent of the instruction.
- [Experiments and results] Benchmark construction and results tables: the reported success rates across eight LLMs should be accompanied by per-instruction-category breakdowns (e.g., timing vs. stylistic vs. density edits) and failure-mode statistics. The current aggregate 68% figure does not yet demonstrate that the drumroll notation plus unit tests systematically capture the constraints implied by the natural-language requests.
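The per-category breakdown requested above is simple to compute once each test outcome is tagged with its instruction category. The sketch below assumes (category, passed) pairs as input; the category names follow the referee comment, and the outcomes are invented placeholders rather than results from the paper.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_category(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Aggregate (category, passed) outcomes into a pass rate per category."""
    totals: Dict[str, int] = defaultdict(int)
    passed: Dict[str, int] = defaultdict(int)
    for category, ok in results:
        totals[category] += 1
        passed[category] += int(ok)
    return {category: passed[category] / totals[category] for category in totals}


# Hypothetical outcomes for illustration only.
results = [
    ("timing", True), ("timing", False),
    ("stylistic", True), ("stylistic", False), ("stylistic", False), ("stylistic", False),
    ("density", True), ("density", True),
]

print(pass_rate_by_category(results))
# {'timing': 0.5, 'stylistic': 0.25, 'density': 1.0}
```

An aggregate figure can hide exactly this kind of spread, which is why the breakdown matters for the efficacy claim.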
minor comments (2)
- [Abstract] The abstract states 'thousands of drum grooves' without giving the exact count or split statistics; the methods section should report these numbers for reproducibility.
- [Drumroll notation] Notation examples in the drumroll figure would benefit from an accompanying legend that explicitly maps each grid symbol to MIDI parameters (velocity, duration, offset).
Circularity Check
No significant circularity; empirical evaluation stands independently of inputs
full rationale
The paper presents an empirical application of off-the-shelf LLMs to zero-shot drum groove editing via a newly introduced drumroll notation and a custom benchmark with automated unit tests. The reported 68% success rate is measured directly on these tests, which symbolically check constraint satisfaction rather than being derived from or fitted to the same LLM outputs. No equations, self-definitional loops, or load-bearing self-citations reduce the central claim to its inputs by construction. The approach relies on external LLMs and domain-informed tests that are falsifiable against musician listening judgments, satisfying the criteria for non-circularity.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: A drum groove can be losslessly represented as a spatial text grid in which rows correspond to drum instruments and columns to discrete time steps.
- domain assumption: The automated unit tests verify the core musical constraints implied by the natural-language instruction.
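The first assumption can be probed with a short sketch: snapping onsets to a 16th-note grid discards whatever falls between grid lines. The function name, the onset values, and the choice of beats as the time unit are assumptions for illustration; whether the paper's notation carries micro-timing some other way is not stated in the excerpts on this page.

```python
from typing import List, Tuple

STEPS_PER_BEAT = 4  # 16th-note resolution


def quantize_to_sixteenths(onsets_in_beats: List[float]) -> List[Tuple[int, float]]:
    """Snap each onset to the nearest 16th-note step and keep what was lost.

    Returns (step, residual) pairs, where residual is the offset in beats
    that a pure grid representation cannot express.
    """
    quantized = []
    for onset in onsets_in_beats:
        step = round(onset * STEPS_PER_BEAT)
        residual = onset - step / STEPS_PER_BEAT
        quantized.append((step, residual))
    return quantized


# Hypothetical onsets in beats; the last hit is played slightly late.
pairs = quantize_to_sixteenths([0.0, 1.0, 2.5, 3.27])
steps = [step for step, _ in pairs]
largest_loss = max(abs(residual) for _, residual in pairs)

print(steps)                   # [0, 4, 10, 13]
print(round(largest_loss, 3))  # 0.02
```

Any non-zero residual is information a step-only grid drops, which is where the swing and micro-timing concern in the referee report would bite.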
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat orbit and 8-tick periodicity · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
We introduce a novel text-based “drumroll” notation that translates musical mechanics into a spatial, syntax-driven grid... each character represents a 16th note, grouped by 4 into beats separated by |.
- IndisputableMonolith/Cost/FunctionalEquation.lean · Jcost uniqueness and recognition cost · unclear
Relation between the paper passage and the cited Recognition theorem: unclear.
unit test t ... symbolically verifies whether an edited groove satisfies the core constraints
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] INTRODUCTION. Music generation has seen much development along transformers and large language models (LLMs). Most existing systems tackled music audio generation [1–3] given [footnote 1: The reader is encouraged to listen to the demos in the accompanying materials matching the speaker icons throughout the paper.] © Li Zhang. Licensed under a Creative Commons Attrib...
- [2] REPRESENTATION. We study a particular kind of symbolic music, the composition of a drum set, that is core to modern popular music. To interface LLMs for understanding and generation, we use the transposed drumroll representation inspired by the success of previous work [12]. An example can be seen in the rightmost of Figure 2. Each line corresponds to an ...
- [3] EXPERIMENTAL SETUP. 3.1 Formulation. We consider the task of editing symbolic music based on a natural language request. Concretely, a model is given g_o, an original one-bar groove in a drumroll notation (e.g., the example in Figure 2), and an instruction i describing a user request (e.g., “I want it to sound heavier”). The model should output a new groove g_e...
- [4] RESULTS. 4.1 Unit Test. The performance of all LLMs on both the development and test set, measured by whether the edited drum grooves pass [footnotes: 4 https://openai.com/index/gpt-4-1/, 5 https://qwenlm.github.io/blog/qwq-32b/] [Figure: bar chart for gpt-4.1-nano, gpt-4.1-mini, DeepSeek-R1-8B, DeepSeek-R1-70B, and QwQ-32B; axis from 0 to 100; bar labels as extracted: 22.3, 67.7, 25.8, 35.4, 64.5, 42.9, 79.1, 7.5, 41.4, 61.3; axis label "% passed unit t..."]
- [5] CONCLUSION AND LIMITATIONS. We presented a novel paradigm for AI-driven music generation by formalizing zero-shot symbolic music editing as a structured reasoning task. By introducing a text-based drumroll notation, we successfully interfaced the spatial reasoning and constraint-satisfaction capabilities of Large Language Models with the structural rul...
- [6] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” arXiv preprint arXiv:2005.00341, 2020.
- [7] A. Agostinelli, T. I. Denk, Z. Borsos, J. Engel, M. Verzetti, A. Caillon, Q. Huang, A. Jansen, A. Roberts, M. Tagliasacchi, M. Sharifi, N. Zeghidour, and C. Frank, “MusicLM: Generating music from text,” 2023. [Online]. Available: https://arxiv.org/abs/2301.11325
- [9] J. Copet, F. Kreuk, I. Gat, T. Remez, D. Kant, G. Synnaeve, Y. Adi, and A. Défossez, “Simple and controllable music generation,” 2024. [Online]. Available: https://arxiv.org/abs/2306.05284
- [10] D. Tencer, “25% of music producers are now using AI, survey says – but a majority shows strong resistance,” Music Business Worldwide, July 2024, accessed: 2025-05-09. [Online]. Available: https://www.musicbusinessworldwide.com/
- [11] Y. Zhang, Y. Ikemiya, G. Xia, N. Murata, M. A. Martínez-Ramírez, W.-H. Liao, Y. Mitsufuji, and S. Dixon, “MusicMagus: Zero-shot text-to-music editing via diffusion models,” 2024. [Online]. Available: https://arxiv.org/abs/2402.06178
- [12] F.-D. Tsai, S.-L. Wu, H. Kim, B.-Y. Chen, H.-C. Cheng, and Y.-H. Yang, “Audio prompt adapter: Unleashing music editing abilities for text-to-music with lightweight finetuning,” 2024. [Online]. Available: https://arxiv.org/abs/2407.16564
- [13] S. Hou, S. Liu, R. Yuan, W. Xue, Y. Shan, M. Zhao, and C. Zhang, “Editing music with melody and text: Using ControlNet for diffusion transformer,” 2025. [Online]. Available: https://arxiv.org/abs/2410.05151
- [14] G. L. Lan, B. Shi, Z. Ni, S. Srinivasan, A. Kumar, B. Ellis, D. Kant, V. Nagaraja, E. Chang, W.-N. Hsu, Y. Shi, and V. Chandra, “High fidelity text-guided music editing via single-stage flow matching,” 2024. [Online]. Available: https://arxiv.org/abs/2407.03648
- [15] G. Hadjeres, F. Pachet, and F. Nielsen, “DeepBach: a steerable model for Bach chorales generation,” 2017. [Online]. Available: https://arxiv.org/abs/1612.01010
- [16]
- [17] M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, and T.-Y. Liu, “MusicBERT: Symbolic music understanding with large-scale pre-training,” in Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, C. Zong, F. Xia, W. Li, and R. Navigli, Eds. Online: Association for Computational Linguistics, Aug. 2021, pp. 791–800. [Online]. Available: https: /...
- [18] L. Zhang and C. Callison-Burch, “Language models are drummers: Drum composition with natural language pre-training,” in The AAAI-23 Workshop on Creative AI Across Modalities, 2023. [Online]. Available: https://openreview.net/forum?id=haiht1U7pGL
- [19] K. Bhandari, A. Roy, K. Wang, G. Puri, S. Colton, and D. Herremans, “Text2midi: Generating symbolic music from captions,” 2024. [Online]. Available: https://arxiv.org/abs/2412.16526
- [20] Y. Huang, A. Ghatare, Y. Liu, Z. Hu, Q. Zhang, C. S. Sastry, S. Gururani, S. Oore, and Y. Yue, “Symbolic music generation with non-differentiable rule guided diffusion,” 2024. [Online]. Available: https://arxiv.org/abs/2402.14285
- [21] M. A. Witek, E. F. Clarke, M. Wallentin, M. L. Kringelbach, and P. Vuust, “Syncopation, body-movement and pleasure in groove music,” PLoS ONE, vol. 9, no. 4, p. e94446, 2014.
- [22] DeepSeek-AI, D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Bi, X. Zhang, X. Yu, Y. Wu, Z. F. Wu, Z. Gou, Z. Shao, Z. Li, Z. Gao, A. Liu, B. Xue, B. Wang, B. Wu, B. Feng, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan, D. Dai, D. Chen, D. Ji, E. Li, F. Lin, F. Dai, F. Luo, G. Hao, G. Chen, G. Li, H. Zhang, H. Bao, H. Xu, H. W..., “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” 2025.