pith. machine review for the scientific record.

arxiv: 2604.10628 · v1 · submitted 2026-04-12 · 💻 cs.SD · cs.CL · cs.IR

Recognition: 2 Lean theorem links

BMdataset: A Musicologically Curated LilyPond Dataset

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:45 UTC · model grok-4.3

classification: 💻 cs.SD · cs.CL · cs.IR
keywords: LilyPond · symbolic music · Baroque dataset · music representation learning · composer classification · style classification · curated dataset · LilyBERT

The pith

Fine-tuning on a small curated LilyPond dataset of 393 Baroque scores yields better composer and style classification than continuous pre-training on 15 billion tokens from the much larger PDMX corpus.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents BMdataset, a collection of 393 expert-transcribed LilyPond scores covering 2,646 movements from original Baroque manuscripts, complete with metadata on composers, forms, and instrumentation. It also introduces LilyBERT, a CodeBERT model extended with 115 LilyPond-specific tokens and pre-trained via masked language modeling on this data. Linear probing experiments on the out-of-domain Mutopia corpus show that fine-tuning solely on BMdataset's roughly 90 million tokens beats continuous pre-training on the much larger PDMX corpus for both composer and style classification tasks. A reader would care because the result indicates that expert curation and format choice can deliver stronger representations for symbolic music than scale alone, while also showing that the two data approaches complement each other when combined.
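One way the vocabulary-extension step described above could be implemented is sketched below with the Hugging Face transformers library; the CodeBERT checkpoint name and the handling of the added tokens are assumptions, and only a few of the 115 LilyPond-specific tokens are actually named in the paper.

```python
# Sketch: extend a CodeBERT checkpoint's vocabulary with LilyPond command
# tokens and resize its embedding matrix before MLM pre-training.
# The checkpoint name and the token list beyond the commands named in the
# paper (\relative, \clef, \key, \time) are illustrative assumptions.
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")

lilypond_tokens = ["\\relative", "\\clef", "\\key", "\\time"]  # + ~111 more
num_added = tokenizer.add_tokens(lilypond_tokens)

# Newly added embedding rows start from random initialisation and are then
# learned during masked-language-model pre-training on the LilyPond corpus.
model.resize_token_embeddings(len(tokenizer))
print(f"added {num_added} tokens; vocabulary size is now {len(tokenizer)}")
```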

Core claim

Fine-tuning LilyBERT on BMdataset alone outperforms continuous pre-training on the full PDMX corpus for composer and style classification on the out-of-domain Mutopia corpus, despite BMdataset containing only about 90 million tokens compared to PDMX's 15 billion; combining broad pre-training with domain-specific fine-tuning on BMdataset produces the highest overall accuracy of 84.3 percent for composer classification.

What carries the argument

BMdataset, a musicologically curated set of 393 LilyPond scores transcribed directly from Baroque manuscripts, paired with LilyBERT, a CodeBERT encoder adapted through vocabulary extension and masked language model pre-training on the dataset.

If this is right

  • Small, high-quality expert-curated symbolic music datasets can produce stronger task performance than much larger but noisier corpora.
  • LilyPond-based representations complement MIDI-based ones and can serve as a baseline for future representation learning work.
  • Combining large-scale pre-training with targeted fine-tuning on curated data yields the strongest results overall.
  • Expert transcription from original manuscripts provides metadata and fidelity that automated or MIDI-derived datasets lack.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Researchers working with other engraving formats or historical periods might achieve similar gains by investing in expert curation rather than scraping larger volumes.
  • The result raises the question of whether analogous quality-over-quantity effects appear in non-music domains that use structured symbolic data, such as code or chemical notations.
  • Future work could test whether the advantage persists when the evaluation corpus is also drawn from Baroque sources or when downstream tasks move beyond linear probing to full fine-tuning.

Load-bearing premise

The linear probing tests on Mutopia give a fair comparison between fine-tuning on the small curated BMdataset and continuous pre-training on the large PDMX corpus under matched model size and evaluation conditions.

What would settle it

Re-running the linear probing experiments on Mutopia with identical model capacity, training steps, and data splits for both the BMdataset fine-tuned model and the PDMX-pre-trained model, then finding that the small-dataset version no longer shows higher accuracy on composer or style classification.
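A linear probe of the kind this comparison relies on can be sketched as follows; the layer index, mean pooling, classifier choice, and the variable names for the Mutopia splits are assumptions rather than the paper's exact protocol (the checkpoint name comes from the abstract's release link).

```python
# Sketch of a linear probe: freeze the encoder, mean-pool one hidden layer
# per LilyPond source, and fit a linear classifier on composer labels.
# Layer index, pooling, and classifier settings are assumptions.
import torch
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("csc-unipd/lilybert")
encoder = AutoModel.from_pretrained("csc-unipd/lilybert",
                                    output_hidden_states=True)
encoder.eval()

def embed(ly_source: str, layer: int = 6) -> torch.Tensor:
    """Mean-pooled hidden state of one encoder layer for a LilyPond string."""
    inputs = tokenizer(ly_source, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs).hidden_states[layer]  # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)

def run_probe(train_texts, train_labels, test_texts, test_labels):
    """Fit a frozen-encoder linear probe; arguments are hypothetical splits."""
    X_train = torch.stack([embed(t) for t in train_texts]).numpy()
    X_test = torch.stack([embed(t) for t in test_texts]).numpy()
    probe = LogisticRegression(max_iter=1000).fit(X_train, train_labels)
    return accuracy_score(test_labels, probe.predict(X_test))
```

Under a matched comparison, the same probe would be run on embeddings from each pre-training regime with identical splits and classifier settings, so that only the encoder weights differ.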

Figures

Figures reproduced from arXiv: 2604.10628 by Antonio Rodà, Ilay Guler, Matteo Spanio.

Figure 1: BMdataset dataset statistics. The dataset is dominated by Late Baroque works and string-centric instrumentation.
Figure 2: Layer-wise probing accuracy for composer and style classification.
Figure 3: t-SNE projection of layer-6 embeddings.
Figure 4: Row-normalised confusion matrix for composer classification.
read the original abstract

Symbolic music research has relied almost exclusively on MIDI-based datasets; text-based engraving formats such as LilyPond remain unexplored for music understanding. We present BMdataset, a musicologically curated dataset of 393 LilyPond scores (2,646 movements) transcribed by experts directly from original Baroque manuscripts, with metadata covering composer, musical form, instrumentation, and sectional attributes. Building on this resource, we introduce LilyBERT (weights can be found at https://huggingface.co/csc-unipd/lilybert), a CodeBERT-based encoder adapted to symbolic music through vocabulary extension with 115 LilyPond-specific tokens and masked language model pre-training. Linear probing on the out-of-domain Mutopia corpus shows that, despite its modest size (~90M tokens), fine-tuning on BMdataset alone outperforms continuous pre-training on the full PDMX corpus (~15B tokens) for both composer and style classification, demonstrating that small, expertly curated datasets can be more effective than large, noisy corpora for music understanding. Combining broad pre-training with domain-specific fine-tuning yields the best results overall (84.3% composer accuracy), confirming that the two data regimes are complementary. We release the dataset, tokenizer, and model to establish a baseline for representation learning on LilyPond.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces BMdataset, a curated LilyPond dataset of 393 expert-transcribed Baroque scores comprising 2,646 movements with rich metadata, and LilyBERT, a CodeBERT model whose vocabulary is extended with 115 LilyPond-specific tokens and which is pre-trained via masked language modeling. It reports that fine-tuning this model on the ~90M-token BMdataset alone surpasses continuous pre-training on the ~15B-token PDMX corpus in linear probing for composer and style classification on the out-of-domain Mutopia corpus, with the combination of the two data regimes yielding the highest performance at 84.3% composer accuracy. The dataset, tokenizer, and model are released publicly.

Significance. Should the empirical comparison prove robust under matched experimental conditions, the finding would be significant for symbolic music processing by demonstrating that expert curation can be more impactful than data scale, potentially shifting focus from large-scale scraping to quality-controlled resources in the field. The public release of the dataset and model weights establishes a reproducible baseline for LilyPond representation learning and opens avenues for non-MIDI based music understanding research.

major comments (2)
  1. [§4 (Experiments)] The headline result—that fine-tuning on BMdataset outperforms PDMX pre-training—depends on the linear probing setup on Mutopia. However, the manuscript does not provide a table or description comparing the two regimes on key factors such as base model initialization, total training tokens seen, optimizer settings, masking rates, or the exact train/val/test splits used for evaluation. This omission makes it impossible to rule out that differences in training effort or protocol, rather than data quality, explain the accuracy gap.
  2. [Abstract and §5 (Results)] The reported 84.3% composer accuracy is presented without accompanying details on the number of runs, standard deviations, or statistical tests comparing the different pre-training strategies. This weakens the ability to assess the reliability of the outperformance claim.
minor comments (3)
  1. [Abstract] The parenthetical note on model weights availability is useful but the full paper should include the exact Hugging Face repository details and any usage instructions in the main text or appendix.
  2. [Dataset description] While the size (~90M tokens) is given, additional information on the distribution of composers, forms, or instrumentation would help readers understand the dataset's coverage and potential biases.
  3. [Model architecture] The adaptation of CodeBERT with 115 LilyPond-specific tokens is described at a high level; a table listing the new tokens or examples of their usage would improve clarity for readers unfamiliar with LilyPond syntax.
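As a flavour of the usage examples the third minor comment asks for, the sketch below contrasts stock CodeBERT tokenization of a LilyPond fragment with a vocabulary extended by the commands named in the paper; the example fragment and the commented splits are illustrative, not output reproduced from the paper.

```python
# Sketch: a stock byte-level BPE tokenizer (as in CodeBERT) fragments
# LilyPond commands into subwords; adding them to the vocabulary keeps
# them atomic. The fragment and commented outputs are illustrative.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("microsoft/codebert-base")
snippet = "\\relative c' { \\clef treble \\key g \\major \\time 3/4 }"

print(tok.tokenize(snippet))
# Without extension, commands fragment into subwords, roughly
# ['\\', 'rel', 'ative', ...], the failure mode the paper describes.

tok.add_tokens(["\\relative", "\\clef", "\\key", "\\time"])
print(tok.tokenize(snippet))
# With the extended vocabulary, '\\relative', '\\clef', '\\key', '\\time'
# now appear as single atomic tokens.
```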

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of experimental reproducibility and statistical rigor. We address each major comment below and indicate the revisions we will make to the manuscript.

read point-by-point responses
  1. Referee: [§4 (Experiments)] The headline result—that fine-tuning on BMdataset outperforms PDMX pre-training—depends on the linear probing setup on Mutopia. However, the manuscript does not provide a table or description comparing the two regimes on key factors such as base model initialization, total training tokens seen, optimizer settings, masking rates, or the exact train/val/test splits used for evaluation. This omission makes it impossible to rule out that differences in training effort or protocol, rather than data quality, explain the accuracy gap.

    Authors: We agree that the original manuscript omitted a direct side-by-side comparison of the training protocols, which is necessary to isolate the effect of data quality. In the revised version we will insert a new table in §4 that lists, for both regimes: base model (identical CodeBERT initialization), total tokens seen during the relevant phase (90M for BMdataset fine-tuning vs. 15B for PDMX continuous pre-training), optimizer (AdamW, learning rate 5e-5, weight decay 0.01), masking rate (15%), batch size (32), and the precise Mutopia splits (composer-stratified 70/15/15 train/val/test). With these factors now matched or explicitly quantified, the performance gap can be more confidently attributed to curation quality rather than protocol differences. revision: yes

  2. Referee: [Abstract and §5 (Results)] The reported 84.3% composer accuracy is presented without accompanying details on the number of runs, standard deviations, or statistical tests comparing the different pre-training strategies. This weakens the ability to assess the reliability of the outperformance claim.

    Authors: We concur that single-run point estimates limit confidence in the claims. The revised manuscript will report results averaged over five independent runs with distinct random seeds for each pre-training strategy. Mean accuracies and standard deviations will be added to §5 (e.g., 84.3% ± 1.1% for the combined regime) and the abstract will be updated accordingly. We will also include paired t-tests between strategies, with p-values, to quantify statistical significance of the observed differences. revision: yes
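For concreteness, the fine-tuning regime described in response 1 above could be wired up roughly as follows; the dataset object, epoch count, and checkpoint name are placeholders, and the hyperparameters are those quoted in the simulated rebuttal rather than values verified against the paper.

```python
# Sketch of the MLM fine-tuning regime from response 1: AdamW (the Trainer
# default optimizer), lr 5e-5, weight decay 0.01, 15% masking, batch size 32.
# In the paper's setup the vocabulary-extended CodeBERT would be used here;
# the tokenized dataset and epoch count are hypothetical placeholders.
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

def finetune_mlm(bm_tokenized_dataset):
    tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
    model = AutoModelForMaskedLM.from_pretrained("microsoft/codebert-base")
    collator = DataCollatorForLanguageModeling(
        tokenizer=tokenizer, mlm=True, mlm_probability=0.15)
    args = TrainingArguments(
        output_dir="lilybert-bm-mlm",
        per_device_train_batch_size=32,
        learning_rate=5e-5,
        weight_decay=0.01,
        num_train_epochs=3,  # placeholder: not specified in the text
    )
    Trainer(model=model, args=args, data_collator=collator,
            train_dataset=bm_tokenized_dataset).train()
    return model
```

The reporting plan in response 2 above (five seeds per strategy, mean ± std, paired t-tests) amounts to a few lines of standard statistics; the accuracy arrays below are hypothetical placeholders, not results from the paper.

```python
# Sketch of the statistics proposed in response 2: aggregate five runs per
# strategy and compare them with a paired t-test. Values are hypothetical.
import numpy as np
from scipy import stats

acc_combined = np.array([0.85, 0.83, 0.84, 0.85, 0.84])   # hypothetical seeds
acc_pdmx_only = np.array([0.81, 0.80, 0.82, 0.80, 0.81])  # hypothetical seeds

print(f"combined: {acc_combined.mean():.3f} ± {acc_combined.std(ddof=1):.3f}")
t_stat, p_value = stats.ttest_rel(acc_combined, acc_pdmx_only)
print(f"paired t-test: t = {t_stat:.2f}, p = {p_value:.4f}")
```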

Circularity Check

0 steps flagged

No circularity: central claim is empirical linear-probing result on external corpus

full rationale

The paper's strongest claim rests on linear probing experiments conducted on the out-of-domain Mutopia corpus, where fine-tuning LilyBERT on the ~90M-token BMdataset is reported to outperform continuous pre-training on the ~15B-token PDMX corpus for composer and style classification. This is a direct empirical measurement obtained after separate training regimes, not a mathematical derivation or prediction that reduces by construction to any fitted parameter, self-defined quantity, or self-citation chain. No equations appear in the provided text that would equate reported accuracies to inputs by definition. The result is falsifiable via replication on Mutopia with controlled protocols; any mismatch in model capacity or training details would affect validity but does not create circularity under the specified patterns. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work rests on standard transformer pre-training assumptions and the premise that expert transcription from manuscripts yields higher-quality symbolic data than automated MIDI conversion; no new physical or mathematical entities are introduced.

axioms (1)
  • domain assumption: Masked language modeling on LilyPond token sequences produces useful representations for downstream composer and style classification.
    Invoked when the authors pre-train LilyBERT and then evaluate via linear probing.

pith-pipeline@v0.9.0 · 5530 in / 1285 out tokens · 57857 ms · 2026-05-10T15:45:45.160125+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

29 extracted references · 6 canonical work pages · 2 internal anchors

  1. [1]

    INTRODUCTION Recent work in music information retrieval (MIR) and generative AI has renewed attention to symbolic music representations [1–3]. While audio-domain models have seen rapid commercial development, symbolic formats remain essential for musicological analysis, score generation, and artist-in-the-loop workflows where fine-grained control over...

  2. [2]

    3. LilyBERT, a CodeBERT-based encoder adapted to symbolic music through masked language model (MLM) pre-training on LilyPond corpora

    A LilyPond tokenizer that extends the CodeBERT vocabulary with 115 domain-specific tokens, ensuring that musically meaningful commands (e.g., \relative, \clef, \key) are represented as atomic units rather than fragmented subwords. 3. LilyBERT, a CodeBERT-based encoder adapted to symbolic music through masked language model (MLM) pre-training on LilyPond corpora

  3. [3]

    A systematic probing evaluation on the Mutopia corpus comparing four model variants that isolate the effects of pre-training data source, corpus size, and domain-specific fine-tuning on downstream composer and style classification

  4. [4]

    BMdataset: A Musicologically Curated LilyPond Dataset

    RELATED WORK 2.1 Symbolic Music Datasets Table 1 summarises existing symbolic music datasets. Most large-scale datasets use MIDI [4, 7, 8, 10] or MusicXML [5, 9] formats. MIDI is compact and widely... Table 1. Comparison of symbolic music...

  5. [5]

    Each transcription is annotated with a reference to the original manuscript and its catalogue number, so that every score can be traced to its primary source

    BMDATASET DATASET 3.1 Source and Compilation The BMdataset dataset originates from BaroqueMusic.it, a collection of 383 LilyPond projects transcribed by musicologists who worked directly from original manuscript sources. Each transcription is annotated with a reference to the original manuscript and its catalogue number, so that every score can be traced...

  6. [6]

    A standard BPE tokenizer, as used in CodeBERT, fragments these commands into meaningless subwords

    LILYBERT MODEL AND TOKENIZER 4.1 Tokenizer Design LilyPond’s syntax relies heavily on backslash-prefixed commands (e.g., \relative, \clef, \key, \time), which carry strong musical semantics. A standard BPE tokenizer, as used in CodeBERT, fragments these commands into meaningless subwords. For example, \relative might be split into \, rel, ative, destroying the ...

  7. [7]

    Linear probing results (layer 6, mean±std over 5 folds)

    EXPERIMENTS 5.1 Probing Setup To evaluate whether LilyBERT’s representations encode musically meaningful properties, we conduct linear probing experiments on the Mutopia corpus — a community-maintained collection of 2,123 public-domain LilyPond … Table 2. Linear probing results (layer 6, mean±std over 5 folds). Best in bold, second best underlined. Compo...

  8. [8]

    CONCLUSION We introduced BMdataset, a musicologically curated LilyPond dataset, and LilyBERT, a CodeBERT-based encoder adapted to symbolic music through vocabulary extension and MLM pre-training. The central finding from our probing experiments is that 90M tokens of expert-curated data outperform 15B tokens of automatically converted data for both compo...

  9. [9]

    Multitrack music transformer,

    H.-W. Dong, K. Chen, S. Dubnov, J. McAuley, and T. Berg-Kirkpatrick, “Multitrack music transformer,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023

  10. [10]

    MusicBERT: Symbolic music understanding with large-scale pre-training,

    M. Zeng, X. Tan, R. Wang, Z. Ju, T. Qin, and T.-Y. Liu, “MusicBERT: Symbolic music understanding with large-scale pre-training,” in Findings of the Association for Computational Linguistics (ACL), 2021

  11. [11]

    Anticipatory music transformer,

    J. Thickstun, D. Hall, C. Donahue, and P. Liang, “Anticipatory music transformer,” in Advances in Neural Information Processing Systems (NeurIPS), 2023

  12. [12]

    Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching,

    C. Raffel, “Learning-based methods for comparing sequences, with applications to audio-to-MIDI alignment and matching,” Ph.D. dissertation, Columbia University, 2016

  13. [13]

    PDMX: A large-scale public domain MusicXML dataset for symbolic music processing,

    P. Long, Z. Novack, T. Berg-Kirkpatrick, and J. McAuley, “PDMX: A large-scale public domain MusicXML dataset for symbolic music processing,” in Proc. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025

  14. [14]

    CodeBERT: A pre-trained model for programming and natural languages,

    Z. Feng, D. Guo, D. Tang, N. Duan, X. Feng, M. Gong, L. Shou, B. Qin, T. Liu, D. Jiang, and M. Zhou, “CodeBERT: A pre-trained model for programming and natural languages,” in Findings of the Association for Computational Linguistics (EMNLP), 2020

  15. [15]

    SymphonyNet: A benchmarking framework for large-scale symbolic music generation,

    J. Liu, Y. Dong, Z. Cheng, X. Zhang, X. Li, F. Yu, and M. Sun, “SymphonyNet: A benchmarking framework for large-scale symbolic music generation,” arXiv preprint arXiv:2207.05694, 2022

  16. [16]

    Enabling factored piano music modeling and generation with the MAESTRO dataset,

    C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, “Enabling factored piano music modeling and generation with the MAESTRO dataset,” in Proc. International Conference on Learning Representations (ICLR), 2019

  17. [17]

    Wikifonia lead sheet dataset,

    “Wikifonia lead sheet dataset,” [Online]. Available: https://www.wikifonia.org, accessed: 2025

  18. [18]

    POP909: A pop-song dataset for music arrangement generation,

    Z. Wang, K. Chen, J. Jiang, Y. Zhang, M. Xu, S. Dai, X. Gu, and G. Xia, “POP909: A pop-song dataset for music arrangement generation,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2020

  19. [19]

    Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription,

    N. Boulanger-Lewandowski, Y. Bengio, and P. Vincent, “Modeling temporal dependencies in high-dimensional sequences: Application to polyphonic music generation and transcription,” in Proc. International Conference on Machine Learning (ICML), 2012

  20. [20]

    TunesFormer: Forming Irish tunes with control codes by bar patching,

    S. Wu, X. Li, F. Yu, and M. Sun, “TunesFormer: Forming Irish tunes with control codes by bar patching,” in Proc. AAAI Conference on Artificial Intelligence, 2023

  21. [21]

    MIDI-BERT-Piano: Large-scale pre-training for symbolic music understanding,

    C.-H. Chou, I.-T. Lee, and Y.-H. Yang, “MIDI-BERT-Piano: Large-scale pre-training for symbolic music understanding,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2021

  22. [22]

    Byte pair encoding for symbolic music,

    N. Fradet, J.-P. Briot, N. Gutowski, A. Choffrut, and G. Hadjeres, “Byte pair encoding for symbolic music,” in Proc. International Society for Music Information Retrieval Conference (ISMIR), 2023

  23. [23]

    MIDI-GPT: A controllable generative model for computer-assisted multitrack music composition,

    P. Pasquier, V. Cema, T. Paquet, and M. E. Yacoubi, “MIDI-GPT: A controllable generative model for computer-assisted multitrack music composition,” arXiv preprint arXiv:2501.17011, 2025

  24. [24]

    NotaGen: Advancing musicality in symbolic music generation with large language model training paradigms,

    Y. Wang, W. Liang, J. Han, G. Zhang, Z. Wang, L. Zhang, Y. Ding, W. Gao, and T.-S. Chua, “NotaGen: Advancing musicality in symbolic music generation with large language model training paradigms,” arXiv preprint arXiv:2504.00572, 2025

  25. [25]

    ABC standard v2.1,

    “ABC standard v2.1,” [Online]. Available: https://abcnotation.com/wiki/abc:standard:v2.1, accessed: 2026

  26. [26]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov, “RoBERTa: A robustly optimized BERT pretraining approach,” arXiv preprint arXiv:1907.11692, 2019

  27. [27]

    GraphCodeBERT: Pre-training code representations with data flow,

    D. Guo, S. Ren, S. Lu, Z. Feng, D. Tang, S. Liu, L. Zhou, N. Duan, A. Svyatkovskiy, S. Fu, M. Tufano, S. K. Deng, C. Clement, D. Drain, N. Sundaresan, J. Yin, D. Jiang, and M. Zhou, “GraphCodeBERT: Pre-training code representations with data flow,” in Proc. International Conference on Learning Representations (ICLR), 2021

  28. [28]

    Getting the most out of your tokenizer for pre-training and domain adaptation

    G. Dagan, O. Lieber, and R. Tsarfaty, “Getting the most out of your tokenizer for pre-training and domain adaptation,” arXiv preprint arXiv:2402.01035, 2024

  29. [29]

    BERT rediscovers the classical NLP pipeline,

    I. Tenney, D. Das, and E. Pavlick, “BERT rediscovers the classical NLP pipeline,” in Proc. Association for Computational Linguistics (ACL), 2019