Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding
Pith reviewed 2026-06-27 17:48 UTC · model grok-4.3
The pith
Large language models can generate executable LilyPond notation in zero-shot settings, yet structural understanding tasks remain difficult even when composer and genre recognition succeeds.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Systematic disagreements between descriptor-based and embedding-based metrics suggest that symbolic music evaluation benefits from metric triangulation rather than single-score ranking.
What carries the argument
The LilyBench benchmark: a 200-prompt generation suite plus ten understanding tasks adapted from ABC-Eval. Generation is scored by compile rate, MusPy descriptor Jensen-Shannon similarity, and LilyBERT-based Fréchet Music Distance.
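To make the descriptor-based score concrete, here is a minimal sketch of a Jensen-Shannon similarity between two descriptor histograms. It rests on assumptions the paper does not confirm here: similarity is taken as one minus the base-2 Jensen-Shannon divergence, and the descriptor is a simple categorical one such as pitch class. The function names `histogram`, `js_divergence`, and `js_similarity` are illustrative and do not come from MusPy or from the LilyBench code.

```python
import math
from collections import Counter


def histogram(values, bins):
    """Normalised histogram of `values` over the fixed, ordered `bins`."""
    counts = Counter(values)
    total = sum(counts[b] for b in bins)
    if total == 0:
        raise ValueError("no values fall into the given bins")
    return [counts[b] / total for b in bins]


def js_divergence(p, q):
    """Jensen-Shannon divergence with base-2 logs, bounded in [0, 1]."""
    # Mixture distribution; wherever p[i] > 0 the mixture is also > 0.
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]

    def kl(a, b):
        # Terms with a[i] == 0 contribute nothing by convention.
        return sum(ai * math.log2(ai / bi) for ai, bi in zip(a, b) if ai > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def js_similarity(p, q):
    """Assumed definition: 1 - JS divergence, so 1.0 means identical histograms."""
    return 1.0 - js_divergence(p, q)


# Hypothetical pitch-class sequences (0 = C, 4 = E, 7 = G), for illustration only.
PITCH_CLASSES = list(range(12))
reference = histogram([0, 4, 7, 0, 4, 7], PITCH_CLASSES)
generated = histogram([0, 4, 7, 7, 7, 7], PITCH_CLASSES)
similarity = js_similarity(reference, generated)
```

A score built this way only sees the marginal distribution of one descriptor, which is one plausible reason it can disagree with an embedding-based distance.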
If this is right
- Zero-shot prompts suffice to produce compilable LilyPond output on the evaluated models.
- Composer and genre identification tasks yield higher accuracy than structural sequencing or syntax tasks.
- Descriptor distributions and embedding distances often rank the same outputs differently.
- Releasing the prompt bank and evaluation scripts enables direct comparison across future models.
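The claim that descriptor distributions and embedding distances rank outputs differently can be quantified with a rank correlation. The sketch below computes Kendall's tau-a (no tie correction) in plain Python. The scores are hypothetical placeholders, not values from the paper, and the choice of Kendall's tau is an assumption about how one might measure the disagreement rather than the authors' stated procedure.

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall tau-a between two score lists over the same items.

    Returns +1.0 for identical rankings, -1.0 for fully reversed rankings.
    Tied pairs count as neither concordant nor discordant.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long lists with at least 2 items")
    concordant = discordant = 0
    for i, j in combinations(range(len(scores_a)), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(scores_a) * (len(scores_a) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical per-model scores, for illustration only.
js_similarity_scores = [0.80, 0.75, 0.70, 0.65]  # higher is better
fmd_scores = [12.0, 9.0, 15.0, 10.0]             # lower is better

# Negate the distance so that "higher is better" holds for both lists.
tau = kendall_tau(js_similarity_scores, [-d for d in fmd_scores])
```

A tau near zero or below would indicate that the two metric families order the same systems differently, which is the situation where reporting both is more informative than a single ranking.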
Where Pith is reading between the lines
- Surface-level notation patterns appear easier for current LLMs to capture than deeper musical relations among notes and sections.
- Adding fine-tuning stages on LilyPond data might narrow the gap on structural tasks, though the paper does not test this.
- The observed metric disagreements imply that any single automatic score will miss aspects of musical quality that other scores capture.
Load-bearing premise
The chosen prompts, tasks, and metrics together form a representative test of what LLMs can do with symbolic music notation.
What would settle it
Running the same generation prompts on one of the tested models and finding that the compile rate drops below 50 percent, or finding that structural sequencing accuracy exceeds 80 percent on the same models, would directly contradict the reported pattern.
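Checking the first condition requires measuring the compile rate. A minimal sketch follows, assuming that "compiles" means the `lilypond` command-line tool exits with status 0 on the generated file; the paper's exact criterion may differ. The helper names, the 60-second timeout, and the output file name are illustrative choices, not taken from the LilyBench code.

```python
import subprocess
import tempfile
from pathlib import Path


def compiles_with_lilypond(source: str, timeout_s: float = 60.0) -> bool:
    """Return True if the `lilypond` CLI exits with status 0 on `source`.

    Assumes `lilypond` is installed and on PATH; a missing binary raises
    FileNotFoundError instead of being counted as a failed compile.
    """
    with tempfile.TemporaryDirectory() as tmp:
        ly_path = Path(tmp) / "candidate.ly"
        ly_path.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                ["lilypond", "-o", str(Path(tmp) / "out"), str(ly_path)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # Treat a hung compile as a failure (an assumption).
            return False
        return result.returncode == 0


def compile_rate(sources, compiler=compiles_with_lilypond) -> float:
    """Fraction of `sources` for which `compiler` returns True."""
    sources = list(sources)
    if not sources:
        raise ValueError("no sources to evaluate")
    return sum(1 for s in sources if compiler(s)) / len(sources)


def contradicts_reported_pattern(rate: float, threshold: float = 0.5) -> bool:
    """True if the measured compile rate falls below the 50 percent threshold."""
    return rate < threshold
```

Passing the compiler as a parameter keeps the rate computation testable without a LilyPond installation and makes it easy to swap in a stricter success criterion.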
Original abstract
Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, MusPy descriptor distributions via Jensen-Shannon similarity, and LilyBERT-based Fr\'echet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Our experiments also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code to support future research in symbolic music generation and understanding at https://github.com/CSCPadova/lilybench
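For readers unfamiliar with Fréchet-style metrics: they fit a Gaussian to reference embeddings and another to generated embeddings, then compute ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). The sketch below is a deliberately simplified version that assumes diagonal covariance matrices, in which case the trace term reduces to a sum of squared differences of per-dimension standard deviations. The full metric uses full covariance matrices and a matrix square root; the embedding vectors here are hypothetical stand-ins for LilyBERT outputs, and the function names are illustrative.

```python
import math


def mean_and_variance(embeddings):
    """Per-dimension mean and unbiased variance of a list of equal-length vectors."""
    n = len(embeddings)
    if n < 2:
        raise ValueError("need at least two embeddings to estimate variance")
    dim = len(embeddings[0])
    means = [sum(v[d] for v in embeddings) / n for d in range(dim)]
    variances = [
        sum((v[d] - means[d]) ** 2 for v in embeddings) / (n - 1)
        for d in range(dim)
    ]
    return means, variances


def diagonal_frechet_distance(reference, generated):
    """Squared Fréchet distance between two Gaussians with diagonal covariance.

    Simplification (an assumption, not the full metric): off-diagonal
    covariance terms are ignored, so the trace term becomes
    sum_d (sqrt(var_r[d]) - sqrt(var_g[d])) ** 2.
    """
    mu_r, var_r = mean_and_variance(reference)
    mu_g, var_g = mean_and_variance(generated)
    mean_term = sum((a - b) ** 2 for a, b in zip(mu_r, mu_g))
    cov_term = sum(
        (math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var_r, var_g)
    )
    return mean_term + cov_term


# Hypothetical 2-dimensional embeddings, for illustration only.
reference_embeddings = [[0.0, 0.0], [2.0, 2.0]]
generated_embeddings = [[1.0, 1.0], [3.0, 3.0]]
distance = diagonal_frechet_distance(reference_embeddings, generated_embeddings)
```

Because the distance depends entirely on the embedding model, it can capture regularities that hand-designed descriptors miss, and the reverse also holds.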
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LilyBench, a LilyPond-based benchmark for LLM symbolic music generation and understanding. It comprises a 200-prompt generation suite evaluated via compile rate, MusPy descriptor Jensen-Shannon similarity, and LilyBERT Fréchet Music Distance (FMD), plus ten understanding tasks adapted from ABC-Eval covering syntax, metadata, structural sequencing, and recognition tasks. Experiments on four open-weight models indicate that zero-shot executable LilyPond generation is achievable while structural understanding remains challenging (despite strong composer/genre recognition). The work also observes disagreements across metric families and releases the benchmark, prompt bank, and evaluation code.
Significance. If the benchmark tasks and prompts prove representative, the results would usefully document a generation-understanding gap in LilyPond and motivate metric triangulation for symbolic music. The public release of the prompt bank and code is a concrete strength that supports reproducibility and extension by the community.
major comments (2)
- [§3] §3 (Benchmark construction): The 200-prompt generation suite is described only at high level; no details are given on prompt sourcing, difficulty stratification, coverage of LilyPond-specific constructs (e.g., relative vs. absolute pitch, markup, or multi-voice structures), or sampling procedure. Because the headline claim that “executable LilyPond generation is achievable in zero-shot settings” rests on performance on this suite, the absence of this information prevents assessment of whether the observed success reflects model capability or benchmark design.
- [§4] §4 (Task adaptation): The ten understanding tasks are stated to be “adapted from ABC-Eval,” yet the manuscript supplies no mapping, no discussion of how LilyPond syntax/semantics alter task difficulty or validity, and no validation that structural-sequencing items remain comparable to their ABC originals. This directly affects the second central claim that structural understanding tasks remain challenging.
minor comments (3)
- [Abstract] Abstract: “Fr\'echet” should be rendered as “Fréchet”.
- [Evaluation] Evaluation section: The manuscript reports raw metric values but omits any statistical significance tests or confidence intervals on the reported differences between models or between metric families.
- [Related Work] Related work: Prior symbolic-music LLM benchmarks (e.g., those using ABC or MusicXML) are mentioned only briefly; a short comparative table would clarify LilyBench’s incremental contribution.
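The missing uncertainty estimates flagged in the Evaluation comment could be supplied with a percentile bootstrap. The sketch below computes a confidence interval for a compile rate from per-prompt 0/1 outcomes. The outcome counts are hypothetical, and the percentile bootstrap is one reasonable choice among several rather than a method the paper is known to use.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `outcomes`.

    `outcomes` is a list of per-prompt results (1 = compiled, 0 = failed).
    Returns (lower, upper) bounds for a (1 - alpha) interval.
    """
    n = len(outcomes)
    if n == 0:
        raise ValueError("no outcomes to resample")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(
        sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper


# Hypothetical result: 150 of 200 prompts compiled.
outcomes = [1] * 150 + [0] * 50
lower, upper = bootstrap_ci(outcomes)
```

If the intervals of two models overlap heavily, a difference in raw compile rate between them should not be read as a clear ranking.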
Simulated Author's Rebuttal
We thank the referee for the constructive comments on LilyBench. We agree that the manuscript would benefit from expanded details on benchmark construction and task adaptation to better support the central claims. We will revise accordingly.
Point-by-point responses
- Referee: [§3] §3 (Benchmark construction): The 200-prompt generation suite is described only at high level; no details are given on prompt sourcing, difficulty stratification, coverage of LilyPond-specific constructs (e.g., relative vs. absolute pitch, markup, or multi-voice structures), or sampling procedure. Because the headline claim that “executable LilyPond generation is achievable in zero-shot settings” rests on performance on this suite, the absence of this information prevents assessment of whether the observed success reflects model capability or benchmark design.
Authors: We acknowledge the high-level description in the current §3. In the revised manuscript we will add: (i) sourcing details (music-theory references, LilyPond manual excerpts, and curated public-domain scores), (ii) explicit stratification by construct complexity (basic monophonic, relative-pitch, markup, multi-voice), (iii) coverage statistics for the listed LilyPond features, and (iv) the exact sampling procedure with random seed. These additions will allow independent evaluation of whether success reflects model capability or benchmark design. revision: yes
- Referee: [§4] §4 (Task adaptation): The ten understanding tasks are stated to be “adapted from ABC-Eval,” yet the manuscript supplies no mapping, no discussion of how LilyPond syntax/semantics alter task difficulty or validity, and no validation that structural-sequencing items remain comparable to their ABC originals. This directly affects the second central claim that structural understanding tasks remain challenging.
Authors: We agree a fuller account is required. The revision will include a task-mapping table, a short discussion of LilyPond-specific syntactic differences (relative octaves, explicit durations, relative vs. absolute) and their potential impact on difficulty/validity, and brief validation examples confirming that the structural-sequencing items preserve the intended challenge level relative to the ABC originals. This will strengthen support for the generation-understanding gap observation. revision: yes
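The stratified sampling with a fixed random seed promised in the first response could look like the following sketch. The stratum labels come from the rebuttal text; the function name, the per-stratum quota, the seed value, and the placeholder prompt pool are hypothetical, since the actual procedure has not been published in the material reviewed here.

```python
import random
from collections import defaultdict


def stratified_sample(prompts, per_stratum, seed):
    """Draw `per_stratum` prompts from each stratum, reproducibly.

    `prompts` is an iterable of (stratum, text) pairs.
    Returns a dict mapping each stratum to its sampled list of texts.
    """
    rng = random.Random(seed)
    by_stratum = defaultdict(list)
    for stratum, text in prompts:
        by_stratum[stratum].append(text)

    sample = {}
    # Iterate strata in sorted order so the draw does not depend on input order
    # of the strata themselves.
    for stratum in sorted(by_stratum):
        pool = by_stratum[stratum]
        if len(pool) < per_stratum:
            raise ValueError(f"stratum {stratum!r} has only {len(pool)} prompts")
        sample[stratum] = rng.sample(pool, per_stratum)
    return sample


# Strata named in the rebuttal; the prompt texts are placeholders.
STRATA = ["basic monophonic", "relative-pitch", "markup", "multi-voice"]
pool = [(s, f"{s} prompt {i}") for s in STRATA for i in range(10)]
selection = stratified_sample(pool, per_stratum=3, seed=42)
```

Publishing the seed alongside the prompt pool would let others regenerate exactly the same subset, which addresses the referee's concern about the sampling procedure.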
Circularity Check
No circularity: empirical benchmark results on external models
Full rationale
The paper introduces LilyBench as a new evaluation suite and reports direct zero-shot performance measurements (compile rates, MusPy descriptor Jensen-Shannon similarity, LilyBERT FMD) on four open-weight LLMs. No derivation chain, fitted parameters, or predictions are defined in terms of the target quantities. Task adaptation from ABC-Eval is described as external reuse rather than self-referential fitting. No self-citations are load-bearing for the central claims, and no equations or ansatzes reduce the reported results to the benchmark inputs by construction. The evaluation remains an independent empirical probe.
Axiom & Free-Parameter Ledger
free parameters (2)
- Generation prompt count = 200
- Understanding task count = 10
axioms (2)
- domain assumption: LilyPond compilation and descriptor distributions validly measure symbolic music generation quality
- domain assumption: The ten adapted tasks comprehensively probe music understanding