Recognition: 1 theorem link · Lean Theorem
Text2Score: Generating Sheet Music From Textual Prompts
Pith reviewed 2026-05-14 19:34 UTC · model grok-4.3
The pith
Text2Score generates sheet music from text by first using an LLM to create structured measure-wise plans, then conditioning a generative model on those plans to output ABC notation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Text2Score establishes that separating generation into an LLM-driven planning stage, which produces measure-wise attribute plans, and a subsequent execution stage, which generates plan-conditioned ABC notation, yields sheet music that is more playable, readable, and prompt-adherent than either pure LLM agentic methods or end-to-end trained models, while bypassing the need for scarce paired text-music datasets.
What carries the argument
The two-stage framework in which an LLM orchestrator translates prompts into structured measure-wise plans of musical attributes and a generative model then produces interleaved ABC notation conditioned on those plans.
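To make the two-stage split concrete, here is a minimal sketch of what a measure-wise plan and its conditioning prefix could look like. The schema and field names are illustrative assumptions, not the authors' published format.

```python
# Hypothetical measure-wise plan of the kind the planning stage could emit.
# The field names (measure, key, time_sig, harmony, instruments) are
# illustrative assumptions, not the paper's actual schema.
plan = [
    {"measure": 1, "key": "G", "time_sig": "3/4",
     "harmony": "G", "instruments": ["Violin", "Piano"]},
    {"measure": 2, "key": "G", "time_sig": "3/4",
     "harmony": "D7", "instruments": ["Violin", "Piano"]},
]

def plan_to_prefix(plan):
    """Flatten a plan into a text prefix an execution model could condition on."""
    lines = []
    for m in plan:
        lines.append(
            f"[M{m['measure']}] key={m['key']} ts={m['time_sig']} "
            f"chord={m['harmony']} inst={','.join(m['instruments'])}"
        )
    return "\n".join(lines)

print(plan_to_prefix(plan))
# First line: [M1] key=G ts=3/4 chord=G inst=Violin,Piano
```

An execution model would see such a prefix before (or interleaved with) the ABC tokens it generates; how the plan is actually encoded as conditioning input is one of the details the referee asks the authors to clarify.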
If this is right
- Symbolic music generation becomes feasible without large aligned text-music corpora.
- Outputs respect explicit constraints on key, meter, and harmony more reliably than direct generation.
- The same planning-plus-execution split can be applied to other notation formats or instrument sets.
- A reusable evaluation protocol now exists for measuring playability, readability, and prompt match in generated sheet music.
- Open-sourced dataset, code, and prompts lower the barrier for follow-on research in text-driven symbolic music.
Where Pith is reading between the lines
- Non-musicians could use short text descriptions to obtain usable starter scores for further editing.
- The planning stage could be reused as a controllable interface for iterative refinement of existing compositions.
- Similar two-stage decomposition may improve controllability in other symbolic generation tasks such as chord progression or lyric alignment.
Load-bearing premise
The LLM-generated plans accurately capture and constrain all relevant musical attributes so that the execution stage can produce valid sheet music without harmony, rhythm, or playability errors.
What would settle it
The central claim would be undermined if expert musicians consistently rated the generated scores below the baselines on prompt adherence, or identified systematic harmony or playability violations that the plans were supposed to prevent.
read the original abstract
Developing text-driven symbolic music generation models remains challenging due to the scarcity of aligned text-music datasets and the unreliability of automated captioning pipelines. While most efforts have focused on MIDI, sheet music representations are largely underexplored in text-driven generation. We present Text2Score, a two-stage framework comprising a planning stage and an execution stage for generating sheet music from natural language prompts. By deriving supervision signals directly from symbolic XML data, we propose an alternative training paradigm that bypasses noisy or scarce text-music pairs. In the planning stage, an LLM orchestrator translates a natural language prompt into a structured measure-wise plan defining musical attributes such as instruments, key, time signatures, harmony, etc. This plan is then consumed by a generative model in the execution stage to produce interleaved ABC notation conditioned on the plan's structural constraints. To assess output quality, we introduce an evaluation framework covering playability, readability, instrument utilization, structural complexity, and prompt adherence, validated by expert musicians. Text2Score consistently outperforms both a pure LLM-based agentic framework and three end-to-end baselines across objective and subjective dimensions. We open-source the dataset, code, evaluation set and LLM prompts used in this work; a demo is available on our project page (https://keshavbhandari.github.io/portfolio/text2score).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Text2Score, a two-stage framework for generating sheet music (ABC notation) from natural language prompts. The planning stage uses an LLM to translate prompts into measure-wise structured plans specifying attributes such as instruments, key, time signatures, and harmony. The execution stage employs a generative model to produce interleaved ABC notation conditioned on these plans. Supervision signals are derived directly from symbolic XML data to bypass the need for aligned text-music pairs. An evaluation framework is proposed covering playability, readability, instrument utilization, structural complexity, and prompt adherence, with validation by expert musicians. The authors claim that Text2Score consistently outperforms a pure LLM-based agentic framework and three end-to-end baselines on both objective and subjective metrics, and they release the dataset, code, evaluation set, and prompts.
Significance. If the outperformance claims hold after verification of plan correctness and baseline details, the work would be significant for addressing data scarcity in text-driven symbolic music generation. By using XML-derived supervision and a planning-execution decomposition, it offers a practical alternative to direct text-music pairing. The emphasis on sheet music representations (ABC) rather than MIDI fills an underexplored area, and the open-sourcing of resources plus the expert-validated evaluation suite could support reproducibility and more reliable assessment in music generation research.
major comments (2)
- [Planning Stage] Planning stage description: No mechanism is described for validating the correctness of LLM-generated plans (e.g., automated checks via music21 for key-harmony consistency, time-signature uniformity across measures, or playability constraints). This is load-bearing for the central outperformance claim because the execution model generates ABC conditioned on the plan; unvalidated plans can propagate harmonic or rhythmic errors while still producing syntactically valid output that scores well on prompt adherence.
- [Evaluation and Results] Evaluation and results sections: The claim of consistent outperformance across objective and subjective dimensions lacks reported quantitative values, statistical significance tests, or ablation studies isolating the planning stage's contribution. Without these, it is not possible to assess whether the two-stage separation demonstrably improves over the pure LLM baseline or the three end-to-end models.
minor comments (2)
- [Method] The manuscript would benefit from including one or two concrete examples of an input prompt, the corresponding LLM plan, and the generated ABC notation (with any post-processing) directly in the main text or a dedicated figure to illustrate the pipeline.
- [Execution Stage] Clarify the exact architecture and training details of the execution-stage generative model (e.g., whether it is a fine-tuned transformer or diffusion model) and how the plan is encoded as conditioning input.
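The validation gap flagged in the first major comment can be sketched as a lightweight pre-execution check. This is plain Python over a hypothetical plan schema, not the authors' pipeline; a real implementation could use music21 for deeper key-harmony analysis.

```python
# Sketch of automated plan checks of the kind the referee requests.
# The plan schema and the instrument table are illustrative assumptions;
# the MIDI ranges could later be used to bound generated pitches.
KNOWN_INSTRUMENTS = {"Violin": (55, 103), "Piano": (21, 108)}  # approx. MIDI ranges

def validate_plan(plan):
    """Return human-readable violations; an empty list means the plan passes."""
    errors = []
    time_sigs = {m["time_sig"] for m in plan}
    if len(time_sigs) > 1:  # the referee's time-signature uniformity check
        errors.append(f"mixed time signatures: {sorted(time_sigs)}")
    for m in plan:
        for inst in m["instruments"]:
            if inst not in KNOWN_INSTRUMENTS:
                errors.append(f"measure {m['measure']}: unknown instrument {inst}")
    return errors

bad_plan = [
    {"measure": 1, "time_sig": "3/4", "instruments": ["Violin"]},
    {"measure": 2, "time_sig": "4/4", "instruments": ["Theremin"]},
]
print(validate_plan(bad_plan))
```

Reporting the pass rate of such checks alongside the main results would let readers separate planning errors from execution errors.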
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below, indicating the revisions we will make to improve the manuscript's clarity and rigor.
read point-by-point responses
- Referee: [Planning Stage] Planning stage description: No mechanism is described for validating the correctness of LLM-generated plans (e.g., automated checks via music21 for key-harmony consistency, time-signature uniformity across measures, or playability constraints). This is load-bearing for the central outperformance claim because the execution model generates ABC conditioned on the plan; unvalidated plans can propagate harmonic or rhythmic errors while still producing syntactically valid output that scores well on prompt adherence.
Authors: We agree that the absence of explicit validation for the LLM-generated plans represents a limitation, as erroneous plans could affect downstream generation quality even if syntactic validity is maintained. The current framework relies on the LLM's planning capabilities and the execution model's training to follow structural constraints, with prompt adherence evaluated subjectively by experts. To strengthen the work, we will add an automated validation step using music21 to enforce consistency checks on key, harmony, time signatures, and basic playability (e.g., note range per instrument). We will describe this module in the revised planning stage section and report validation pass rates alongside the main results. revision: yes
- Referee: [Evaluation and Results] Evaluation and results sections: The claim of consistent outperformance across objective and subjective dimensions lacks reported quantitative values, statistical significance tests, or ablation studies isolating the planning stage's contribution. Without these, it is not possible to assess whether the two-stage separation demonstrably improves over the pure LLM baseline or the three end-to-end models.
Authors: We acknowledge that the manuscript's presentation of results could be strengthened by including explicit numerical values, statistical tests, and targeted ablations. The current version reports comparative outcomes across metrics but does not detail exact scores or significance testing in the main text. In the revision, we will expand the evaluation section to include the full quantitative tables with mean scores, add statistical significance tests (e.g., paired t-tests or Wilcoxon tests with p-values), and introduce an ablation study that isolates the planning stage by comparing against a direct-prompt execution baseline. This will more clearly demonstrate the contribution of the two-stage decomposition. revision: yes
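The significance testing the rebuttal promises can be sketched with a paired permutation test, which avoids the normality assumption of a paired t-test. The per-item scores below are made up for illustration; they are not the paper's results.

```python
import random

# Hypothetical per-item subjective scores (illustrative numbers only,
# not taken from the paper).
text2score = [4.2, 3.9, 4.5, 4.1, 4.4, 3.8]
baseline = [3.5, 3.7, 3.9, 3.6, 4.0, 3.4]

def paired_permutation_p(a, b, n_perm=5000, seed=0):
    """Two-sided p-value for the mean paired difference via random sign flips."""
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        if abs(flipped) >= observed:
            hits += 1
    return hits / n_perm

print(paired_permutation_p(text2score, baseline))  # small p: differences are consistent
```

Note that with only six items the smallest attainable two-sided p-value is 2/2^6 ≈ 0.031, which is one reason per-item scores and a sufficiently large evaluation set matter for the claimed comparisons.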
Circularity Check
No significant circularity in the two-stage framework
full rationale
The paper describes an empirical two-stage pipeline (LLM planning from text prompts followed by conditioned ABC generation) whose supervision is taken directly from existing symbolic XML data rather than from any fitted parameter or self-referential target. No equations, uniqueness theorems, or self-citations are invoked that would force a prediction to equal its own input by construction. Evaluation metrics (playability, prompt adherence, etc.) are defined externally and validated by human experts, keeping the central claim of outperformance independent of the training loop itself.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Large language models can translate natural language prompts into accurate structured plans specifying instruments, key, time signatures, harmony, and related musical attributes per measure.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Unclear: the relation between the paper passage and the cited Recognition theorem could not be established.
Linked passage: "two-stage framework comprising a planning stage and an execution stage... LLM orchestrator translates... structured measure-wise plan... generative model... interleaved ABC notation"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] We introduce Text2Score, a two-stage framework pairing an LLM orchestrator for structural planning with a hierarchical decoder for execution to bridge natural language prompts and sheet music generation
- [2] We present an evaluation framework designed to quantify the readability and playability of generated scores, which is further validated by expert musicians
- [3] Text2Score: Generating Sheet Music From Textual Prompts. We release the ABC notation dataset used in this work strictly for non-commercial research purposes to support further studies in symbolic sheet music generation. arXiv:2605.13431v1 [cs.SD], 13 May 2026.
- [4] aligned sentences with musical sequences via a cross-modal VAE latent space, while [8] predicted intermediate attributes from text to condition token decoding. Recent advancements have shifted toward end-to-end training paradigms. Text2midi [9] and Text2midi-InferAlign [10] pair a text encoder with an autoregressive decoder, while [4, 11] adapted LLM...
- [5] applies motif development rules, while [13] supports multiple input modalities with emotional control. Both require extensive pre-training on large-scale paired datasets. LLM-Based Agentic Composition: A burgeoning area of research investigates the “musical world” knowledge implicitly held by LLMs trained solely on text. As shown in [14], text-only LLMs...
- [6] Prompt Adherence: How accurately does the generated music reflect the constraints of the text prompt?
- [7] Readability & Engraving: How clear and standard is the musical notation for a performing musician? (footnote: https://www.gold.ac.uk/music-mind-brain/gold-msi/)
- [8] Musicality & Expressive Intent: How aesthetically pleasing and musically expressive is the composition?
- [9] Authenticity to Professional Composition: How closely does the generated score resemble the work of a professional human composer?
- [10] Usability for Professional Composition: To what extent could this score serve as a viable foundation for a professional composer requiring only minimal edits?
- [11] D.-V.-T. Le, "Modeling Symbolic Music with Natural Language Processing Approaches," PhD thesis, Université de Lille, Nov. 2025. Available: https://hal.science/tel-05426752
- [12] W. Xu, J. McAuley, T. Berg-Kirkpatrick, S. Dubnov, and H.-W. Dong, "Generating symbolic music from natural language prompts using an LLM-enhanced dataset," arXiv preprint arXiv:2410.02084, 2024.
- [13] K. Bhandari and S. Colton, "Motifs, phrases, and beyond: The modelling of structure in symbolic music generation," in International Conference on Computational Intelligence in Music, Sound, Art and Design (Part of EvoStar). Springer, 2024, pp. 33–51.
- [14] S. Li, D. Choi, and Y. Sung, "MidiLM: A dual-path model for controllable text-to-midi generation," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 28, 2026, pp. 23160–23168.
- [15] S. Doh, K. Choi, J. Lee, and J. Nam, "LP-MusicCaps: LLM-based pseudo music captioning," arXiv preprint arXiv:2307.16372, 2023.
- [16] Y. Wang, S. Wu, J. Hu, X. Du, Y. Peng, Y. Huang, S. Fan, X. Li, F. Yu, and M. Sun, "NotaGen: Advancing musicality in symbolic music generation with large language model training paradigms," arXiv preprint arXiv:2502.18008, 2025.
- [17] Y. Zhang, Z. Wang, D. Wang, and G. Xia, "BUTTER: A representation learning framework for bi-directional music-sentence retrieval and generation," in Proceedings of the 1st Workshop on NLP for Music and Audio (NLP4MusA), 2020, pp. 54–58.
- [18] P. Lu, X. Xu, C. Kang, B. Yu, C. Xing, X. Tan, and J. Bian, "MuseCoco: Generating symbolic music from text," arXiv preprint arXiv:2306.00110, 2023.
- [19] K. Bhandari, A. Roy, K. Wang, G. Puri, S. Colton, and D. Herremans, "Text2midi: Generating symbolic music from captions," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 22, 2025, pp. 23478–23486.
- [20] A. Roy, G. Puri, and D. Herremans, "Text2midi-InferAlign: Improving symbolic music generation with inference-time alignment," arXiv preprint arXiv:2505.12669, 2025.
- [21] S.-L. Wu, Y. Kim, and C.-Z. A. Huang, "Midi-LLM: Adapting large language models for text-to-midi music generation," arXiv preprint arXiv:2511.03942, 2025.
- [22] Y. Wang, W. Yang, Z. Dai, Y. Zhang, K. Zhao, and H. Wang, "MeloTrans: A text to symbolic music generation model following human composition habit," arXiv preprint arXiv:2410.13419, 2024.
- [23] S. Tian, C. Zhang, W. Yuan, W. Tan, and W. Zhu, "XMusic: Towards a generalized and controllable symbolic music generation framework," IEEE Transactions on Multimedia, vol. 27, pp. 6857–6871, 2025.
- [24] A. Shin and K. Kaneko, "Large language models' internal perception of symbolic music," arXiv preprint arXiv:2507.12808, 2025.
- [25] Q. Deng, Q. Yang, R. Yuan, Y. Huang, Y. Wang, X. Liu, Z. Tian, J. Pan, G. Zhang, H. Lin et al., "ComposerX: Multi-agent symbolic music composition with LLMs," arXiv preprint arXiv:2404.18081, 2024.
- [26] P. Xing, A. Plaat, and N. van Stein, "CoComposer: LLM multi-agent collaborative music composition," arXiv preprint arXiv:2509.00132, 2025.
- [27] J. Poćwiardowski, M. Modrzejewski, and M. S. Tatara, "M6(GPT)3: Generating multitrack modifiable multi-minute MIDI music from text using genetic algorithms, probabilistic methods and GPT models in any progression and time signature," in 2025 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). IEEE, 2025, pp. 1–6.
- [28] J. Wu, C. Hu, Y. Wang, X. Hu, and J. Zhu, "A hierarchical recurrent neural network for symbolic melody generation," IEEE Transactions on Cybernetics, vol. 50, no. 6, pp. 2749–2757, 2019.
- [29] G. Wu, S. Liu, and X. Fan, "The power of fragmentation: A hierarchical transformer model for structural segmentation in symbolic music generation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 31, pp. 1409–1420, 2023.
- [30] G. Zixun, D. Makris, and D. Herremans, "Hierarchical recurrent neural networks for conditional melody generation with long-term structure," in 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021, pp. 1–8.
- [31] S. Dai, Z. Jin, C. Gomes, and R. B. Dannenberg, "Controllable deep melody generation via hierarchical music structure representation," arXiv preprint arXiv:2109.00663, 2021.
- [32] X. Zhang, J. Zhang, Y. Qiu, L. Wang, and J. Zhou, "Structure-enhanced pop music generation via harmony-aware learning," in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 1204–1213.
- [33] B. Sturm, J. F. Santos, and I. Korshunova, "Folk music style modelling by recurrent neural networks with long short term memory units," in 16th International Society for Music Information Retrieval Conference, 2015.
- [34] S. Wu, X. Li, F. Yu, and M. Sun, "TunesFormer: Forming Irish tunes with control codes by bar patching," arXiv preprint arXiv:2301.02884, 2023.
- [35] X. Qu, Y. Bai, Y. Ma, Z. Zhou, K. M. Lo, J. Liu, R. Yuan, L. Min, X. Liu, T. Zhang et al., "MuPT: A generative symbolic music pretrained transformer," arXiv preprint arXiv:2404.06393, 2024.
- [36] M. Zhou, X. Li, F. Yu, and W. Li, "EMelodyGen: Emotion-conditioned melody generation in ABC notation with the musical feature template," in 2025 IEEE International Conference on Multimedia and Expo Workshops (ICMEW). IEEE, 2025, pp. 1–6.
- [37] X. Liang, X. Du, J. Lin, P. Zou, Y. Wan, and B. Zhu, "ByteComposer: A human-like melody composition method based on language model agent," arXiv preprint arXiv:2402.17785, 2024.
- [38] S. Wu, Y. Wang, X. Li, F. Yu, and M. Sun, "MelodyT5: A unified score-to-score transformer for symbolic music processing," arXiv preprint arXiv:2407.02277, 2024.
- [39] D. Kumar, E. Karystinaios, G. Widmer, and M. Schedl, "How far can pretrained LLMs go in symbolic music? Controlled comparisons of supervised and preference-based adaptation," 2026. Available: https://arxiv.org/abs/2601.22764
- [40] B. Warner, A. Chaffin, B. Clavié, O. Weller, O. Hallström, S. Taghadouini, A. Gallagher, R. Biswas, F. Ladhak, T. Aarsen et al., "Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics, 2025.
- [41] D. Meredith, "COSIATEC and SIATECCompress: Pattern discovery by geometric compression," in International Society for Music Information Retrieval Conference, no. 14, 2013.
- [42] Y. Wu, K. Chen, T. Zhang, Y. Hui, T. Berg-Kirkpatrick, and S. Dubnov, "Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation," in ICASSP 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–5.
- [43] S. Wu, Z. Guo, R. Yuan, J. Jiang, S. Doh, G. Xia, J. Nam, X. Li, F. Yu, and M. Sun, "CLaMP 3: Universal music information retrieval across unaligned modalities and unseen languages," 2025. Available: https://arxiv.org/abs/2502.10362
- [44] M. S. Cuthbert and C. Ariza, "music21: A toolkit for computer-aided musicology and symbolic music data," in 11th International Society for Music Information Retrieval Conference (ISMIR 2010), 2010, pp. 637–642. Available: https://ismir2010.ismir.net/proceedings/ismir2010-108.pdf
- [45] P. Long, Z. Novack, T. Berg-Kirkpatrick, and J. McAuley, "PDMX: A large-scale public domain MusicXML dataset for symbolic music processing," in ICASSP 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
- [46] J. Liu, Y. Dong, Z. Cheng, X. Zhang, X. Li, F. Yu, and M. Sun, "Symphony generation with permutation invariant language model," arXiv preprint arXiv:2205.05448, 2022.
- [47] F. Simonetta, F. Carnovalini, N. Orio, and A. Rodà, "Symbolic music similarity through a graph-based representation," in Proceedings of the Audio Mostly 2018 on Sound in Immersion and Emotion (AM'18). ACM Press, 2018.
- [48] F. Foscarin, A. McLeod, P. Rigaux, F. Jacquemard, and M. Sakai, "ASAP: A dataset of aligned scores and performances for piano transcription," in Proceedings of the 21st International Society for Music Information Retrieval Conference, 2020, pp. 534–541.
discussion (0)