arxiv: 2505.08203 · v2 · submitted 2025-05-13 · 💻 cs.SD · cs.CL · eess.AS

Not that Groove: Zero-Shot Symbolic Music Editing

Pith reviewed 2026-05-22 16:20 UTC · model grok-4.3

classification 💻 cs.SD · cs.CL · eess.AS
keywords zero-shot symbolic music editing · drumroll notation · LLM drum groove editing · automated unit testing for music · Not that Groove benchmark · controllable symbolic music · MIDI editing with language models

The pith

Converting drum grooves to a spatial text grid lets off-the-shelf LLMs perform complex zero-shot edits from natural-language instructions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that symbolic drum editing can be treated as a pure reasoning task by mapping MIDI patterns into a grid-style text notation that LLMs already understand. This removes the usual requirement for large paired datasets of instructions and music. The authors test the idea on thousands of examples using an automated verification system that checks whether the edited groove meets the explicit constraints in the user's request. The strongest model reaches 68 percent success on these checks, and separate listening tests show that the checks match what professional musicians consider correct.

Core claim

By converting drum grooves into drumroll notation, a spatial syntax-driven text grid, off-the-shelf LLMs can deduce and apply edits that satisfy natural-language instructions without any fine-tuning or in-context examples.

What carries the argument

Drumroll notation, a text-based spatial grid that turns drum hit timing and velocity into a readable layout so LLMs can reason about musical constraints directly.
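To make the idea of a spatial text grid concrete, here is a minimal sketch of how a one-bar groove might be rendered as text. The row and column layout, the 16-step resolution, the velocity thresholds, and the symbols `X`, `x`, `o`, and `-` are all assumptions chosen for illustration; this summary does not specify the paper's actual syntax.

```python
# Hypothetical layout: one row per instrument, one column per 16th-note step.
# The symbols and thresholds are illustrative, NOT the paper's actual notation.
STEPS_PER_BAR = 16


def velocity_symbol(velocity: int) -> str:
    """Map a MIDI velocity (1-127) to a one-character symbol (assumed scheme)."""
    if velocity >= 100:
        return "X"  # accented hit
    if velocity >= 50:
        return "x"  # normal hit
    return "o"      # ghost note


def render_grid(hits: dict[str, dict[int, int]]) -> str:
    """Render {instrument: {step: velocity}} as a fixed-width text grid."""
    width = max(len(name) for name in hits)
    lines = []
    for name, steps in hits.items():
        cells = "".join(
            velocity_symbol(steps[s]) if s in steps else "-"
            for s in range(STEPS_PER_BAR)
        )
        lines.append(f"{name.ljust(width)} |{cells}|")
    return "\n".join(lines)


# A made-up rock groove: 8th-note hi-hats, backbeat snare, kick with a ghost note.
groove = {
    "hihat": {s: 70 for s in range(0, 16, 2)},
    "snare": {4: 110, 12: 110},
    "kick": {0: 120, 8: 120, 10: 40},
}
print(render_grid(groove))
# hihat |x-x-x-x-x-x-x-x-|
# snare |----X-------X---|
# kick  |X-------X-o-----|
```

Because every column lines up vertically, an instruction such as "move the ghost kick one step later" becomes a character-level edit on one row, which is the kind of operation the summary says an LLM can reason about directly.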

If this is right

  • Music producers gain granular control over drum patterns using only text prompts and existing language models.
  • Evaluation of symbolic edits becomes scalable and repeatable through code-based unit tests rather than subjective listening alone.
  • The same notation-and-reasoning approach can be applied to other symbolic music tasks once similar representations are defined.
  • Data collection for instruction-tuned music models is no longer a prerequisite for useful editing performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar grid notations could be designed for melody or chord sequences to extend zero-shot editing beyond percussion.
  • The 68 percent success rate suggests that improvements in LLM step-by-step musical reasoning would directly raise editing reliability.
  • Combining the drumroll method with existing audio generators might allow text instructions to control both symbolic structure and sound output.
  • The benchmark itself supplies a reusable testbed for comparing future models or notations on controllable music editing.

Load-bearing premise

The drumroll notation plus the symbolic unit tests together capture all the musically important parts of a user's request without systematic blind spots.
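A hedged sketch of what one symbolic unit test could look like, and of how a blind spot can arise. The grid format, the function names `parse_grid` and `check_kick_on_last_eighth`, and the instruction "add a kick on the last 8th note" are hypothetical; the paper's real tests are not reproduced in this summary.

```python
# Hypothetical grids: rows are instruments, columns are 16th-note steps,
# "-" is silence and any other character is a hit. Not the paper's format.
def parse_grid(text: str) -> dict[str, str]:
    """Parse 'name |cells|' lines into {name: cells}."""
    rows = {}
    for line in text.strip().splitlines():
        name, cells = line.split("|")[:2]
        rows[name.strip()] = cells
    return rows


def check_kick_on_last_eighth(original: str, edited: str) -> bool:
    """Assumed check for 'add a kick on the last 8th note' (step 14 of 16)."""
    before, after = parse_grid(original), parse_grid(edited)
    kick_added = after["kick"][14] != "-"
    # Every instrument other than the kick must be left untouched.
    others_kept = all(after[n] == before[n] for n in before if n != "kick")
    return kick_added and others_kept


original = "snare |----X-------X---|\nkick  |X-------X-------|"
good = "snare |----X-------X---|\nkick  |X-------X-----X-|"
bad = "snare |----X-----------|\nkick  |X-------X-----X-|"  # snare hit removed

print(check_kick_on_last_eighth(original, good))  # True
print(check_kick_on_last_eighth(original, bad))   # False
```

Note that this check ignores how loud the new kick is: a barely audible ghost note at step 14 would pass. That is one concrete way a test could be satisfied while the musical intent is not, which is the risk the premise has to rule out.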

What would settle it

An edited groove that passes every automated unit test yet is rated by professional musicians as failing to match the original natural-language instruction.
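One way to look for such counterexamples is to cross-tabulate unit-test outcomes against musician ratings. The sketch below assumes each edited groove has a boolean test result and a boolean human verdict; the function name `agreement_report` and the ratings are made up for illustration and are not data from the paper.

```python
from collections import Counter


def agreement_report(pairs: list[tuple[bool, bool]]) -> dict[str, float]:
    """pairs: (passed_unit_tests, musician_judged_correct) per edited groove."""
    counts = Counter(pairs)
    total = len(pairs)
    agree = counts[(True, True)] + counts[(False, False)]
    return {
        "agreement": agree / total,
        # Passes every test but musicians reject it: the falsifying case.
        "false_pass_rate": counts[(True, False)] / total,
        # Fails a test although musicians accept it: tests are too strict.
        "false_fail_rate": counts[(False, True)] / total,
    }


# Made-up ratings for illustration only; not results from the paper.
ratings = (
    [(True, True)] * 6
    + [(False, False)] * 2
    + [(True, False), (False, True)]
)
report = agreement_report(ratings)
print(report)
```

With these invented ratings, agreement is 0.8 and the false-pass rate is 0.1. A high overall agreement can coexist with a non-zero false-pass rate, so the false passes are worth inspecting individually rather than only in aggregate.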

Original abstract

While recent advancements in AI music generation have predominantly focused on direct audio synthesis, these systems suffer from inherent rigidity, limiting their utility for professional music producers who require granular, highly malleable creative control. Symbolic music (e.g., MIDI) resolves this constraint by providing editable note-level parameters, yet the natural progression to instruction-driven symbolic music editing remains critically under-explored due to a severe scarcity of paired instruction-MIDI datasets. In this paper, we bypass this data bottleneck by formalizing zero-shot symbolic music editing as a structured reasoning task. We introduce a novel text-based "drumroll" notation that translates musical mechanics into a spatial, syntax-driven grid, empowering off-the-shelf Large Language Models (LLMs) to logically deduce and apply complex edits to drum grooves using only zero-shot prompting. To rigorously evaluate this paradigm, we propose Not that Groove, a comprehensive benchmark comprising thousands of drum grooves paired with specific, descriptive, and stylistic natural language instructions. Crucially, to overcome the prohibitive cost and subjectivity of human musical evaluation, we introduce a scalable, domain-informed automated unit-testing framework that symbolically verifies whether an edited groove satisfies the core constraints of the user's request. Our extensive experiments across eight state-of-the-art LLMs demonstrate the high efficacy of this approach, with the top-performing model achieving a 68% success rate on our automated unit tests. Furthermore, listening tests confirm that our programmatic unit tests align highly with the subjective judgments of professional musicians, establishing a robust, data-efficient, and scalable foundation for the future of controllable AI music production.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript claims that converting drum grooves to a novel spatial 'drumroll' text notation enables off-the-shelf LLMs to perform complex zero-shot edits from natural-language instructions. It introduces the Not that Groove benchmark of thousands of groove-instruction pairs and a domain-informed automated unit-testing framework that symbolically verifies constraint satisfaction, reporting a 68% success rate for the best LLM; listening tests are said to align with professional musician judgments.

Significance. If the central claims hold, the work provides a data-efficient, training-free route to controllable symbolic music editing that could be practically useful for producers needing granular edits without large paired datasets. The scalable automated unit-test framework and its reported alignment with human listening tests constitute a methodological strength that could be adopted more broadly in music AI evaluation.

major comments (2)
  1. [Evaluation / unit-testing framework] The 68% success rate is load-bearing for the efficacy claim, yet the manuscript does not state whether the symbolic verification rules were fixed before any model outputs were inspected. Without such pre-specification, or an explicit coverage analysis of continuous properties (swing, micro-timing, dynamics), an edit could pass the tests while failing to satisfy the full musical intent of the instruction.
  2. [Experiments and results] Benchmark construction and results tables: the reported success rates across eight LLMs should be accompanied by per-instruction-category breakdowns (e.g., timing vs. stylistic vs. density edits) and failure-mode statistics. The current aggregate 68% figure does not yet demonstrate that the drumroll notation plus unit tests systematically capture the constraints implied by the natural-language requests.
minor comments (2)
  1. [Abstract] The abstract states 'thousands of drum grooves' without giving the exact count or split statistics; the methods section should report these numbers for reproducibility.
  2. [Drumroll notation] Notation examples in the drumroll figure would benefit from an accompanying legend that explicitly maps each grid symbol to MIDI parameters (velocity, duration, offset).
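The per-category breakdown requested in major comment 2 is straightforward to compute once each test outcome is tagged with an instruction category. The sketch below is an illustration only: the function name `pass_rate_by_category`, the record layout, and the outcomes are invented, and the category names simply follow the referee's examples.

```python
from collections import defaultdict


def pass_rate_by_category(results):
    """results: iterable of (category, passed) -> {category: (rate, n)}."""
    tally = defaultdict(lambda: [0, 0])  # category -> [passes, total]
    for category, passed in results:
        tally[category][0] += int(passed)
        tally[category][1] += 1
    return {c: (p / n, n) for c, (p, n) in tally.items()}


# Made-up outcomes for illustration; not results from the paper.
results = [
    ("timing", True), ("timing", True), ("timing", False), ("timing", True),
    ("density", True), ("density", False),
    ("stylistic", False), ("stylistic", False),
    ("stylistic", True), ("stylistic", False),
]

for category, (rate, n) in pass_rate_by_category(results).items():
    print(f"{category:<10} {rate:6.1%}  (n={n})")
```

In this invented sample the aggregate pass rate is 50%, while the categories range from 25% (stylistic) to 75% (timing). A single headline figure can hide exactly this kind of spread.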

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation stands independently of inputs

full rationale

The paper presents an empirical application of off-the-shelf LLMs to zero-shot drum groove editing via a newly introduced drumroll notation and a custom benchmark with automated unit tests. The reported 68% success rate is measured directly on these tests, which symbolically check constraint satisfaction rather than being derived from or fitted to the same LLM outputs. No equations, self-definitional loops, or load-bearing self-citations reduce the central claim to its inputs by construction. The approach relies on external LLMs and domain-informed tests that are falsifiable against musician listening judgments, satisfying the criteria for non-circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that a text grid can faithfully encode the editable aspects of a drum groove and that the unit tests correctly operationalize musical intent. No free parameters are mentioned in the abstract. No new physical entities are introduced.

axioms (2)
  • domain assumption: A drum groove can be losslessly represented as a spatial text grid in which rows correspond to drum instruments and columns to discrete time steps.
    This premise is required for the LLM to perform edits by editing the grid text; it is invoked when the paper states that the notation 'translates musical mechanics into a spatial, syntax-driven grid'.
  • domain assumption: The automated unit tests verify the core musical constraints implied by the natural-language instruction.
    The paper relies on this to claim that 68% success on unit tests indicates successful editing; the listening-test alignment is offered as supporting evidence.
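The first axiom can be probed with a simple round trip: snap onset times to grid columns, map them back, and see what is lost. The sketch below assumes a plain 16th-note grid with no per-hit offset field; whether the paper's notation encodes micro-timing some other way is not stated in this summary, so treat this as an illustration of the concern rather than a finding.

```python
STEPS_PER_BEAT = 4  # assumed 16th-note resolution in 4/4


def to_step(onset_beats: float) -> int:
    """Snap an onset time (in beats) to the nearest grid column."""
    return round(onset_beats * STEPS_PER_BEAT)


def to_beats(step: int) -> float:
    """Recover the onset time implied by a grid column."""
    return step / STEPS_PER_BEAT


# Made-up onsets; 1.53 is a slightly late (laid-back) hit.
onsets = [0.0, 1.0, 1.53, 2.0]
steps = [to_step(t) for t in onsets]
recovered = [to_beats(s) for s in steps]
errors = [round(t - r, 2) for t, r in zip(onsets, recovered)]

print(steps)      # [0, 4, 6, 8]
print(recovered)  # [0.0, 1.0, 1.5, 2.0]
print(errors)     # [0.0, 0.0, 0.03, 0.0]
```

Under these assumptions the 0.03-beat delay disappears after the round trip, so the representation is lossless only for grooves that already sit exactly on the grid. Any feel that lives between columns would need an extra field in the notation to survive.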
