Can LLMs understand LilyPond? A benchmark for symbolic music generation and understanding

Andrea Poltronieri; Antonio Rodà; Matteo Spanio; Mohammad Torabi

arxiv: 2606.08722 · v1 · pith:DQ62XXYNnew · submitted 2026-06-07 · 💻 cs.SD · cs.CL

Pith reviewed 2026-06-27 17:48 UTC · model grok-4.3

keywords: LilyPond · symbolic music · LLM benchmark · music generation · music understanding · zero-shot evaluation · evaluation metrics

The pith

Large language models can generate executable LilyPond notation in zero-shot settings, yet structural understanding tasks remain difficult even when composer and genre recognition succeeds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LilyBench to evaluate open-weight LLMs on both generating and understanding symbolic music written in LilyPond. Tests across four models show that zero-shot prompts, with no examples provided, often produce code that compiles. Recognition of metadata such as composer or genre works better than tasks that require ordering musical elements or handling syntax details. Two quality measures, one based on musical feature distributions and the other on embedding distances, frequently disagree about which outputs are better. The work releases the full set of prompts, tasks, and scoring code so others can run the same tests.

Core claim

Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Systematic disagreements between descriptor-based and embedding-based metrics suggest that symbolic music evaluation benefits from metric triangulation rather than single-score ranking.

What carries the argument

LilyBench benchmark consisting of a 200-prompt generation suite plus ten understanding tasks adapted from ABC-Eval, scored by compile rate, MusPy descriptor Jensen-Shannon similarity, and LilyBERT Fréchet Music Distance.
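The descriptor-based score compares distributions of musical features between generated and reference pieces. A minimal sketch of one way to turn a Jensen-Shannon divergence into a similarity is given here. It assumes the similarity is defined as 1 minus the base-2 divergence over pre-binned histograms; the function names, the binning, and that definition are assumptions for illustration and may differ from what LilyBench implements.

```python
import math


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two histograms of equal length."""
    if len(p) != len(q):
        raise ValueError("histograms must have the same number of bins")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("histograms must have positive mass")
    # Normalise raw counts to probability distributions.
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution M = (P + Q) / 2.
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(a, b):
        # Bins with zero mass in `a` contribute nothing to KL(a || b).
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def js_similarity(p, q):
    """Assumed similarity: 1 - JSD, so 1.0 means identical distributions."""
    return 1.0 - js_divergence(p, q)
```

With base-2 logarithms the divergence lies in [0, 1], so the assumed similarity also lies in [0, 1], with 0 for histograms that share no bins.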

If this is right

Zero-shot prompts suffice to produce compilable LilyPond output on the evaluated models.
Composer and genre identification tasks yield higher accuracy than structural sequencing or syntax tasks.
Descriptor distributions and embedding distances often rank the same outputs differently.
Releasing the prompt bank and evaluation scripts enables direct comparison across future models.
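Compile rate is the simplest of the three generation metrics, and a sketch of how it could be measured follows. It assumes the `lilypond` executable is on the PATH and that a zero exit status counts as a successful compile; the paper's exact success criterion is not given here, and the helper names are hypothetical rather than taken from the LilyBench repository.

```python
import subprocess
import tempfile
from pathlib import Path


def lilypond_compiles(source: str, timeout_s: float = 60.0) -> bool:
    """Return True if `lilypond` exits with status 0 on the given source.

    Assumption: exit status 0 is treated as "compiles". A missing
    executable raises FileNotFoundError instead of counting as a failure.
    """
    with tempfile.TemporaryDirectory() as tmp:
        ly_path = Path(tmp) / "sample.ly"
        ly_path.write_text(source, encoding="utf-8")
        try:
            result = subprocess.run(
                ["lilypond", "-o", tmp, str(ly_path)],
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            # A run that never finishes is counted as a failed compile.
            return False
        return result.returncode == 0


def compile_rate(sources, checker=lilypond_compiles) -> float:
    """Fraction of sources for which `checker` returns True."""
    sources = list(sources)
    if not sources:
        raise ValueError("need at least one generated source")
    return sum(1 for s in sources if checker(s)) / len(sources)
```

Passing the checker as a parameter keeps the rate computation testable without a LilyPond installation, for example by substituting a stub.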

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Surface-level notation patterns appear easier for current LLMs to capture than deeper musical relations among notes and sections.
Adding fine-tuning stages on LilyPond data might narrow the gap on structural tasks, though the paper does not test this.
The observed metric disagreements imply that any single automatic score will miss aspects of musical quality that other scores capture.
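One way to quantify how strongly two metrics disagree about a set of models is a rank correlation. The sketch below computes Kendall's tau-a over per-model scores. It is an illustrative choice, not a procedure the paper is known to use, and it assumes both score lists are oriented so that higher means better (a distance such as FMD would be negated first).

```python
from itertools import combinations


def kendall_tau(scores_a, scores_b):
    """Kendall's tau-a between two score lists over the same models.

    +1.0 means the two metrics rank the models identically,
    -1.0 means the rankings are exactly reversed. Tied pairs count
    as neither concordant nor discordant.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("score lists must cover the same models")
    n = len(scores_a)
    if n < 2:
        raise ValueError("need at least two models to compare rankings")
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        sign = (scores_a[i] - scores_a[j]) * (scores_b[i] - scores_b[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) // 2)
```

A value near zero or below for a descriptor-based score against a negated embedding distance would be one concrete signature of the disagreement described above.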

Load-bearing premise

The chosen prompts, tasks, and metrics together form a representative test of what LLMs can do with symbolic music notation.

What would settle it

Running the same generation prompts on one of the tested models and finding that the compile rate drops below 50 percent, or finding that structural sequencing accuracy exceeds 80 percent on the same models, would directly contradict the reported pattern.

Original abstract

Symbolic music evaluation for large language models remains fragmented across representations, datasets, and metrics. We introduce LilyBench, a LilyPond-based benchmark that jointly evaluates symbolic music generation and music understanding on the same family of open-weight LLMs. The benchmark includes a 200-prompt generation suite and ten understanding tasks adapted from ABC-Eval, covering syntax, metadata prediction, structural sequencing, and music recognition. Generation quality is evaluated using compile rate, MusPy descriptor distributions via Jensen-Shannon similarity, and LilyBERT-based Fr\'echet Music Distance (FMD). Experiments on four open-weight models show that executable LilyPond generation is achievable in zero-shot settings, while structural understanding tasks remain challenging despite strong performance on composer and genre recognition. Our experiments also reveal systematic disagreements between descriptor-based and embedding-based metrics, suggesting that symbolic music evaluation benefits from metric triangulation rather than single-score ranking. We release the benchmark, prompt bank, and evaluation code to support future research in symbolic music generation and understanding at https://github.com/CSCPadova/lilybench
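For readers unfamiliar with Fréchet-style metrics, the usual form of the distance is written out here. It assumes FMD follows the standard construction shared with Fréchet Inception Distance and Fréchet Audio Distance: Gaussians are fitted to embeddings of the reference set and the generated set, and the distance between them is computed in closed form. The symbols are introduced for illustration, with $\mu_r, \Sigma_r$ the mean and covariance of reference embeddings and $\mu_g, \Sigma_g$ those of generated embeddings.

```latex
\[
\mathrm{FMD}(r, g)
  = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g
      - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
\]
```

Lower values indicate that the generated embeddings are distributed more like the reference embeddings, which is the opposite orientation from a similarity score.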

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LilyBench is a released benchmark for LilyPond generation and understanding that shows zero-shot compilation is feasible while structural tasks lag, but its value depends on unshown details of prompt and task design.

This paper introduces LilyBench, a benchmark for LLMs using LilyPond notation that covers both generation from prompts and ten understanding tasks adapted from prior ABC work. The authors test four open-weight models and report that the models can produce compilable LilyPond in zero-shot settings but struggle more with structural sequencing, while doing well on composer and genre identification. They also note that different metrics don't always agree.

What stands out is the release of the benchmark, prompt bank, and evaluation code on GitHub. That's concrete and helpful for anyone wanting to build on it. Using multiple metrics like compile rate, MusPy descriptors, and LilyBERT FMD is a good move, and calling out their disagreements shows they're not just pushing one number. The joint generation-understanding setup on the same models is a reasonable step beyond fragmented prior evaluations.

The weak part is the lack of information on how the 200 prompts were selected or constructed, and exactly how the tasks were adapted from ABC-Eval to LilyPond. Without that, it's hard to know if the performance differences reflect model limits or benchmark choices. The abstract mentions empirical outcomes but skips prompt details, statistical tests, or error analysis, so the central claims feel a bit thin on evidence. The representativeness of the suite remains the main open question.

This is for researchers in AI music generation who need an evaluation suite for symbolic formats. Someone looking for a new benchmark to use or extend would find it worth looking at the repo.

I would send it to peer review because the benchmark itself is new and the code is out, even if the paper needs more on methods to stand alone.

Referee Report

2 major / 3 minor

Summary. The paper introduces LilyBench, a LilyPond-based benchmark for LLM symbolic music generation and understanding. It comprises a 200-prompt generation suite evaluated via compile rate, MusPy descriptor Jensen-Shannon similarity, and LilyBERT Fréchet Music Distance (FMD), plus ten understanding tasks adapted from ABC-Eval covering syntax, metadata, structural sequencing, and recognition tasks. Experiments on four open-weight models indicate that zero-shot executable LilyPond generation is achievable while structural understanding remains challenging (despite strong composer/genre recognition). The work also observes disagreements across metric families and releases the benchmark, prompt bank, and evaluation code.

Significance. If the benchmark tasks and prompts prove representative, the results would usefully document a generation-understanding gap in LilyPond and motivate metric triangulation for symbolic music. The public release of the prompt bank and code is a concrete strength that supports reproducibility and extension by the community.

major comments (2)

[§3] §3 (Benchmark construction): The 200-prompt generation suite is described only at high level; no details are given on prompt sourcing, difficulty stratification, coverage of LilyPond-specific constructs (e.g., relative vs. absolute pitch, markup, or multi-voice structures), or sampling procedure. Because the headline claim that “executable LilyPond generation is achievable in zero-shot settings” rests on performance on this suite, the absence of this information prevents assessment of whether the observed success reflects model capability or benchmark design.
[§4] §4 (Task adaptation): The ten understanding tasks are stated to be “adapted from ABC-Eval,” yet the manuscript supplies no mapping, no discussion of how LilyPond syntax/semantics alter task difficulty or validity, and no validation that structural-sequencing items remain comparable to their ABC originals. This directly affects the second central claim that structural understanding tasks remain challenging.

minor comments (3)

[Abstract] Abstract: “Fr\'echet” should be rendered as “Fréchet”.
[Evaluation] Evaluation section: The manuscript reports raw metric values but omits any statistical significance tests or confidence intervals on the reported differences between models or between metric families.
[Related Work] Related work: Prior symbolic-music LLM benchmarks (e.g., those using ABC or MusicXML) are mentioned only briefly; a short comparative table would clarify LilyBench’s incremental contribution.
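The missing uncertainty estimates noted in the Evaluation comment could be supplied in several ways. One option is sketched here: a paired percentile bootstrap over prompts for the difference in compile rate between two models. The function name and defaults are hypothetical, and the pairing assumes both models were run on the same prompts in the same order.

```python
import random


def bootstrap_diff_ci(outcomes_a, outcomes_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(outcomes_a) - mean(outcomes_b).

    outcomes_a[i] and outcomes_b[i] are per-prompt outcomes (for example
    1 if the output compiled, else 0) for the same prompt i.
    """
    if len(outcomes_a) != len(outcomes_b) or not outcomes_a:
        raise ValueError("need two non-empty outcome lists of equal length")
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(outcomes_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample prompt indices with replacement, keeping pairs together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(outcomes_a[i] - outcomes_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lower, upper
```

An interval that excludes zero would suggest the difference between the two models is unlikely to be a resampling artifact; an interval that straddles zero would caution against ranking them.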

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on LilyBench. We agree that the manuscript would benefit from expanded details on benchmark construction and task adaptation to better support the central claims. We will revise accordingly.

Referee: [§3] §3 (Benchmark construction): The 200-prompt generation suite is described only at high level; no details are given on prompt sourcing, difficulty stratification, coverage of LilyPond-specific constructs (e.g., relative vs. absolute pitch, markup, or multi-voice structures), or sampling procedure. Because the headline claim that “executable LilyPond generation is achievable in zero-shot settings” rests on performance on this suite, the absence of this information prevents assessment of whether the observed success reflects model capability or benchmark design.

Authors: We acknowledge the high-level description in the current §3. In the revised manuscript we will add: (i) sourcing details (music-theory references, LilyPond manual excerpts, and curated public-domain scores), (ii) explicit stratification by construct complexity (basic monophonic, relative-pitch, markup, multi-voice), (iii) coverage statistics for the listed LilyPond features, and (iv) the exact sampling procedure with random seed. These additions will allow independent evaluation of whether success reflects model capability or benchmark design. revision: yes
Referee: [§4] §4 (Task adaptation): The ten understanding tasks are stated to be “adapted from ABC-Eval,” yet the manuscript supplies no mapping, no discussion of how LilyPond syntax/semantics alter task difficulty or validity, and no validation that structural-sequencing items remain comparable to their ABC originals. This directly affects the second central claim that structural understanding tasks remain challenging.

Authors: We agree a fuller account is required. The revision will include a task-mapping table, a short discussion of LilyPond-specific syntactic differences (relative octaves, explicit durations, relative vs. absolute) and their potential impact on difficulty/validity, and brief validation examples confirming that the structural-sequencing items preserve the intended challenge level relative to the ABC originals. This will strengthen support for the generation-understanding gap observation. revision: yes
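The stratified, seeded sampling procedure promised in the simulated rebuttal could look roughly like the sketch below. The stratum labels echo the ones named in that response; the function itself and the data layout are hypothetical and not taken from the paper or its repository.

```python
import random
from collections import defaultdict


def stratified_sample(prompts, per_stratum, seed=0):
    """Draw `per_stratum` prompts from each stratum, reproducibly.

    `prompts` is an iterable of (stratum_label, prompt_text) pairs.
    Returns a dict mapping each stratum label to its sampled prompts.
    """
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    by_stratum = defaultdict(list)
    for stratum, text in prompts:
        by_stratum[stratum].append(text)
    sample = {}
    # Iterate strata in sorted order so the draw does not depend on input order
    # of the strata themselves.
    for stratum in sorted(by_stratum):
        pool = by_stratum[stratum]
        if len(pool) < per_stratum:
            raise ValueError(f"stratum {stratum!r} has only {len(pool)} prompts")
        sample[stratum] = rng.sample(pool, per_stratum)
    return sample
```

Publishing the seed alongside the prompt pool would let others regenerate the exact suite, which addresses the referee's request for a documented sampling procedure.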

Circularity Check

0 steps flagged

No circularity: empirical benchmark results on external models

The paper introduces LilyBench as a new evaluation suite and reports direct zero-shot performance measurements (compile rates, MusPy JSD, LilyBERT FMD) on four open-weight LLMs. No derivation chain, fitted parameters, or predictions are defined in terms of the target quantities. Task adaptation from ABC-Eval is described as external reuse rather than self-referential fitting. No self-citations are load-bearing for the central claims, and no equations or ansatzes reduce the reported results to the benchmark inputs by construction. The evaluation remains an independent empirical probe.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about LilyPond as a proxy and the chosen metrics as valid quality measures; no free parameters are fitted to target results but task and prompt counts are selected without external justification.

free parameters (2)

Generation prompt count = 200: 200 prompts selected for the suite.
Understanding task count = 10: ten tasks adapted from ABC-Eval.

axioms (2)

Domain assumption: LilyPond compilation and descriptor distributions validly measure symbolic music generation quality. Invoked as the basis for compile rate and MusPy/FMD evaluation.
Domain assumption: The ten adapted tasks comprehensively probe music understanding. Used to claim structural understanding remains challenging.

Reference graph

Works this paper leans on

23 extracted references · 3 canonical work pages

[1] Y. Ma, A. Oland, A. Ragni, C. Saitis, C. Donahue, C. Lin, et al., Foundation models for music: A survey, 2024. arXiv:2408.14340

[2] M. Spanio, M. Zampini, A. Rodà, F. Pierucci, A multimodal symphony: integrating taste and sound through generative ai, Frontiers in Computer Science 7 (2025). doi:10.3389/fcomp.2025.1575741

[3] Y. Wang, S. Wu, J. Hu, X. Du, Y. Peng, Y. Huang, S. Fan, X. Li, F. Yu, M. Sun, Notagen: advancing musicality in symbolic music generation with large language model training paradigms, in: Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI ’25, 2025. doi:10.24963/ijcai.2025/1134

[4] L. Casini, B. L. T. Sturm, Investigating the viability of masked language modeling for symbolic music generation in ABC notation, in: EvoMUSART, LNCS, 2024

[5] J. Zhao, Y. Li, W. Li, K. Yoshii, Abc-eval: Benchmarking large language models on symbolic music understanding and instruction following, in: ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026, pp. 16072–16076

[6] R. Yuan, H. Lin, Y. Wang, Z. Tian, S. Wu, T. Shen, G. Zhang, Y. Wu, C. Liu, Z. Zhou, L. Xue, Z. Ma, Q. Liu, T. Zheng, Y. Li, Y. Ma, Y. Liang, X. Chi, R. Liu, Z. Wang, C. Lin, Q. Liu, T. Jiang, W. Huang, W. Chen, J. Fu, E. Benetos, G. Xia, R. Dannenberg, W. Xue, S. Kang, Y. Guo, ChatMusician: Understanding and generating music intrinsically with LLM, in: L... doi:10.18653/v1/2024

[7] LilyPond Developers, LilyPond official website, 2026. URL: https://lilypond.org/

[8] M. Spanio, I. Guler, A. Rodà, Bmdataset: A musicologically curated lilypond dataset, 2026. arXiv:2604.10628

[9] D.-V.-T. Le, L. Bigo, D. Herremans, M. Keller, Natural language processing methods for symbolic music generation and information retrieval: A survey, ACM Computing Surveys 57 (2025)

[10] M. Zhou, X. Li, F. Yu, W. Li, Emelodygen: Emotion-conditioned melody generation in abc notation with the musical feature template, in: 2025 IEEE International Conference on Multimedia and Expo Workshops (ICMEW), 2025, pp. 1–6. doi:10.1109/ICMEW68306.2025.11152266

[11] K. Kilgour, M. Zuluaga, D. Roblek, M. Sharifi, Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms, in: Interspeech, 2019, pp. 2350–2354

[12] F. B. Kader, S. Karmaker, A survey on evaluation metrics for music generation, arXiv preprint arXiv:2509.00051 (2025)

[13] H. Dong, K. Chen, J. J. McAuley, T. Berg-Kirkpatrick, Muspy: A toolkit for symbolic music generation, CoRR abs/2008.01951 (2020)

[14] L. Qian, H. Gu, D. Li, B. Cao, Q. Liu, Pianoroll-event: A novel score representation for symbolic music, in: ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026, pp. 1–5

[15] M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, T.-Y. Liu, MusicBERT: Symbolic music understanding with large-scale pre-training, in: Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021, Association for Computational Linguistics, Online, 2021, pp. 791–800

[16] S. Wu, D. Yu, X. Tan, M. Sun, Clamp: Contrastive language-music pre-training for cross-modal symbolic music information retrieval, in: International Society for Music Information Retrieval Conference, 2023

[17] J. Retkowski, J. Stępniak, M. Modrzejewski, Fréchet music distance: A metric for generative symbolic music evaluation, 2024. arXiv:2412.07948

[18] Z. Zhou, Y. Wu, Z. Wu, X. Zhang, R. Yuan, Y. Ma, L. Wang, E. Benetos, W. Xue, Y. Guo, Can llms “reason” in music? an evaluation of llms’ capability of music understanding and generation, 2024. arXiv:2407.21531

[19] M. Abdin, J. Aneja, H. Behl, S. Bubeck, R. Eldan, et al., Phi-4 technical report, 2024. arXiv:2412.08905

[20] B. Hui, J. Yang, Z. Cui, J. Yang, D. Liu, et al., Qwen2.5-coder technical report, 2024. arXiv:2409.12186

[21] Q. Zhu, D. Guo, Z. Shao, D. Yang, P. Wang, et al., Deepseek-coder-v2: Breaking the barrier of closed-source models in code intelligence, 2024. arXiv:2406.11931

[22] Mistral AI, Codestral-22b-v0.1 model card, 2024

[23] G. Morais, M. Fuentes, Investigating modality contribution in audio llms for music, in: ICASSP 2026 - 2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026, pp. 3496–3500
