ONOTE: Benchmarking Omnimodal Notation Processing for Expert-level Music Intelligence
Pith reviewed 2026-05-09 23:00 UTC · model grok-4.3
The pith
The ONOTE benchmark shows omnimodal models achieve perceptual accuracy on music notation yet lack music-theoretic comprehension.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Omnimodal Notation Processing requires alignment across auditory, visual, and symbolic domains, yet existing models remain limited to isolated transcription tasks that do not capture musical logic. The ONOTE benchmark applies a deterministic pipeline grounded in canonical pitch projection to remove subjective scoring biases and test diverse notation systems. Evaluation of leading models demonstrates a consistent disconnect between high perceptual accuracy and weak music-theoretic comprehension, supplying an objective diagnostic for reasoning failures in rule-constrained settings.
What carries the argument
ONOTE, a multi-format benchmark that applies a deterministic pipeline grounded in canonical pitch projection to measure music-theoretic comprehension objectively across notation systems.
If this is right
- Future model development can target the specific gap between perception and rule application in structured domains.
- Evaluation standards for omnimodal AI shift from subjective LLM judges to deterministic pipelines in expert tasks.
- The benchmark framework can be adapted to diagnose reasoning limits in other rule-heavy fields such as formal logic or chemistry.
- Training data and objectives can be redesigned to emphasize logical consistency over perceptual fidelity alone.
Where Pith is reading between the lines
- The same objective-pipeline approach could expose similar perception-versus-comprehension gaps in non-music domains that use multiple representation formats.
- If the disconnect persists, it suggests current scaling methods improve surface recognition faster than rule internalization.
- Integration of ONOTE-style testing into model training loops could produce systems that generate notation while obeying theoretical constraints.
Load-bearing premise
A deterministic pipeline based on canonical pitch projection can remove subjective biases and accurately measure genuine music-theoretic comprehension rather than surface pattern recognition.
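This premise can be made concrete with a toy sketch. Assuming, hypothetically, that "canonical pitch projection" means mapping each notation system's tokens onto one shared MIDI-number sequence, deterministic scoring reduces to sequence comparison. The lookup tables, token format, and use of edit distance below are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical sketch of "canonical pitch projection": map two different
# notation systems onto the same canonical MIDI-number sequence so that
# scores can be compared deterministically. All mappings are illustrative.

# ABC-style note letters (assuming C major, one octave) -> MIDI numbers
ABC_TO_MIDI = {"C": 60, "D": 62, "E": 64, "F": 65, "G": 67, "A": 69, "B": 71}

# Jianpu (numbered notation) degrees 1-7 -> the same MIDI numbers
JIANPU_TO_MIDI = {"1": 60, "2": 62, "3": 64, "4": 65, "5": 67, "6": 69, "7": 71}

def project(tokens, table):
    """Project a token sequence onto the canonical MIDI pitch sequence."""
    return [table[t] for t in tokens if t in table]

def levenshtein(a, b):
    """Edit distance between two pitch sequences: a deterministic score
    with no subjective judge in the loop."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

abc = project(list("CDEG"), ABC_TO_MIDI)        # staff-style letters
jianpu = project(list("1235"), JIANPU_TO_MIDI)  # numbered notation
assert abc == jianpu == [60, 62, 64, 67]        # identical canonical projection
print(levenshtein(abc, project(list("CDEF"), ABC_TO_MIDI)))  # -> 1
```

The point of the sketch is the referee's worry in miniature: an edit distance over projected pitches is fully objective, but by itself it only measures pitch-level perception, not rule application.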
What would settle it
A controlled test in which models that score low on ONOTE nevertheless correctly answer expert-level music-theory questions about the same scores, or models that score high on ONOTE fail those same theory questions.
Original abstract
Omnimodal Notation Processing (ONP) represents a unique frontier for omnimodal AI due to the rigorous, multi-dimensional alignment required across auditory, visual, and symbolic domains. Current research remains fragmented, focusing on isolated transcription tasks that fail to bridge the gap between superficial pattern recognition and the underlying musical logic. This landscape is further complicated by severe notation biases toward Western staff and the inherent unreliability of "LLM-as-a-judge" metrics, which often mask structural reasoning failures with systemic hallucinations. To establish a more rigorous standard, we introduce ONOTE, a multi-format benchmark that utilizes a deterministic pipeline--grounded in canonical pitch projection--to eliminate subjective scoring biases across diverse notation systems. Our evaluation of leading omnimodal models exposes a fundamental disconnect between perceptual accuracy and music-theoretic comprehension, providing a necessary framework for diagnosing reasoning vulnerabilities in complex, rule-constrained domains.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ONOTE, a multi-format benchmark for Omnimodal Notation Processing (ONP) that employs a deterministic pipeline grounded in canonical pitch projection to evaluate leading omnimodal models on music notation tasks across diverse systems. It claims this evaluation exposes a fundamental disconnect between perceptual accuracy and music-theoretic comprehension, providing a framework for diagnosing reasoning vulnerabilities in rule-constrained domains.
Significance. If the pipeline is validated to isolate higher-order music-theoretic reasoning (beyond low-level pitch matching) and the disconnect is substantiated with reproducible results, ONOTE could serve as a useful standardized tool for assessing AI capabilities in music and other structured domains, addressing limitations of fragmented transcription tasks and subjective LLM-as-a-judge metrics.
major comments (2)
- [Abstract] The manuscript asserts that 'our evaluation of leading omnimodal models exposes a fundamental disconnect,' yet supplies no model names, performance metrics, error analyses, tables, or specific failure examples to support this finding. Without these, the central claim cannot be verified or assessed for soundness.
- [Abstract / Benchmark Description] The deterministic pipeline grounded in canonical pitch projection is presented as eliminating subjective biases and measuring music-theoretic comprehension, but this risks reducing evaluation to transposition-invariant pitch sequences. It does not clearly demonstrate probing of rule-constrained elements such as rhythmic hierarchy, voice leading, or harmonic function, undermining the distinction from perceptual pattern recognition.
minor comments (1)
- [Abstract] The abstract references 'severe notation biases toward Western staff' and 'non-Western or non-staff notations' but does not detail how the benchmark specifically incorporates or scores diverse notation systems beyond the pitch projection step.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, providing clarifications based on the full manuscript content and indicating where revisions have been made to strengthen the presentation.
Point-by-point responses
-
Referee: [Abstract] The manuscript asserts that 'our evaluation of leading omnimodal models exposes a fundamental disconnect,' yet supplies no model names, performance metrics, error analyses, tables, or specific failure examples to support this finding. Without these, the central claim cannot be verified or assessed for soundness.
Authors: The full manuscript provides these details in Sections 4 and 5, including the specific omnimodal models evaluated, quantitative metrics, error breakdowns, tables, and concrete failure examples that substantiate the disconnect between perceptual accuracy and music-theoretic comprehension. The abstract summarizes the overall finding at a high level due to length constraints. We have revised the abstract to include a concise summary of key models tested and aggregate metrics demonstrating the claimed disconnect. revision: yes
-
Referee: [Abstract / Benchmark Description] The deterministic pipeline grounded in canonical pitch projection is presented as eliminating subjective biases and measuring music-theoretic comprehension, but this risks reducing evaluation to transposition-invariant pitch sequences. It does not clearly demonstrate probing of rule-constrained elements such as rhythmic hierarchy, voice leading, or harmonic function, undermining the distinction from perceptual pattern recognition.
Authors: The canonical pitch projection serves as the deterministic core to normalize across notation systems and remove subjective biases, but the full pipeline integrates this with multi-format alignment tasks that explicitly require higher-order music-theoretic reasoning. The manuscript includes task designs and results showing model failures on rhythmic hierarchy, voice leading, and harmonic function even when pitch sequences match correctly. We agree the abstract description is brief and have expanded the benchmark pipeline section to more explicitly detail how these rule-constrained elements are probed beyond basic pitch matching. revision: partial
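The rebuttal's claim that rule-constrained elements can be checked deterministically can be illustrated with a toy rhythmic verifier. Assuming 4/4 time and a hypothetical '|'-delimited duration format (neither detail is taken from the paper itself), each measure's beat values must sum to exactly 4.0:

```python
# Minimal sketch of a deterministic rhythmic check: in 4/4 time, the note
# durations between two "|" bar lines must sum to exactly 4.0 beats.
# The token format here is hypothetical, not the paper's actual encoding.
from fractions import Fraction

def verify_measures(score, beats_per_measure=Fraction(4)):
    """Return per-measure pass/fail for a '|'-delimited duration string.

    Each measure is a space-separated list of beat values, e.g. "1 1 1 1"
    or "2 1 0.5 0.5". Fractions avoid float round-off in the comparison.
    """
    results = []
    for measure in score.strip("|").split("|"):
        total = sum(Fraction(tok) for tok in measure.split())
        results.append(total == beats_per_measure)
    return results

print(verify_measures("|1 1 1 1|2 1 0.5 0.5|3 0.5|"))  # -> [True, True, False]
```

A check of this shape is binary and reproducible, which is what distinguishes it from an LLM judge; the open question the referee raises is whether the full pipeline contains analogous deterministic checks for voice leading and harmonic function.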
Circularity Check
No circularity: ONOTE is an externally introduced benchmark without self-referential derivations
full rationale
The paper introduces ONOTE as a new multi-format benchmark employing a deterministic pipeline grounded in canonical pitch projection to score model outputs across notation systems. No equations, fitted parameters, or predictions are derived that reduce by construction to the paper's own inputs or prior self-citations. The central claims about a disconnect between perceptual accuracy and music-theoretic comprehension rest on empirical evaluations of external models rather than any closed-loop construction or load-bearing self-reference. This is a standard benchmark contribution with independent content.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Canonical pitch projection supplies an objective ground truth for aligning auditory, visual, and symbolic music notation.