From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR
Pith reviewed 2026-05-09 23:05 UTC · model grok-4.3
The pith
A two-stage pipeline decodes visual music symbols into editable polyphonic scores using topology recognition.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that, given symbol and event candidates from a visual pipeline, second-stage decoding for polyphonic OMR can be solved by recognizing the underlying topology of notes and events with probability-guided search (BeadSolver), combined with a hybrid data-generation strategy, yielding editable and exportable score structures.
What carries the argument
BeadSolver, a probability-guided search procedure that performs topology recognition to resolve voice separation and intra-measure timing from candidate symbols.
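The paper does not formally specify BeadSolver, so as a hedged illustration only, probability-guided search over symbol candidates could look like the uniform-cost search below: each event is assigned a (voice, tick) pair, and partial assignments are scored by the negative log-probability of the chosen candidates. All data structures and names here are hypothetical, not taken from the paper.

```python
import heapq
import math

def solve(events, measure_ticks, n_best=8):
    """Hypothetical probability-guided search: assign each event a
    (voice, tick) pair, expanding the lowest-cost partial assignment
    first. Cost is the accumulated negative log-probability of the
    chosen candidates; only the n_best successors of each state are
    kept, as a stand-in for the paper's (unspecified) pruning."""
    frontier = [(0.0, 0, ())]  # (cost, next event index, picks so far)
    while frontier:
        cost, i, picks = heapq.heappop(frontier)
        if i == len(events):
            # With non-negative step costs, the first completed state
            # popped is the lowest-cost full assignment (Dijkstra).
            return picks, cost
        successors = []
        for voice, tick, prob in events[i]["candidates"]:
            if tick + events[i]["duration"] > measure_ticks:
                continue  # prune assignments that overflow the measure
            successors.append((cost - math.log(prob), i + 1,
                               picks + ((voice, tick),)))
        for s in sorted(successors)[:n_best]:
            heapq.heappush(frontier, s)
    return None, math.inf
```

A real decoder would also have to score voice consistency, beam continuity, and timing constraints; this sketch shows only the search skeleton implied by the term "probability-guided search".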
If this is right
- Yields a usable decoding component for existing OMR systems handling complex piano notation.
- Enables accumulation of large-scale structured score data from image sources.
- Opens a route to train future end-to-end multimodal and reinforcement-learning OMR models.
Where Pith is reading between the lines
- The method could expose the current limits of visual detection stages when run on noisy real-world inputs.
- Structured outputs produced this way could serve as supervision for models that skip explicit symbol detection entirely.
- The topology-search framing may generalize to other dense symbolic notations beyond Western staff music.
Load-bearing premise
The visual pipeline supplies symbol and event candidates accurate enough for the decoder to resolve voice separation and timing ambiguities without frequent failure.
What would settle it
Apply the full pipeline to a set of real piano score images, supply the actual noisy symbol candidates produced by a standard visual detector, and check whether the resulting scores match ground-truth transcriptions at a rate substantially higher than current one-stage or rule-based baselines.
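One hedged way to operationalize that check is a measure-level exact-match rate between decoded and ground-truth voice structures. The paper defines no evaluation metric, so the function below is purely illustrative.

```python
def measure_match_rate(decoded, ground_truth):
    """Fraction of measures whose decoded voice structure exactly
    matches the ground truth, ignoring voice order. Illustrative
    metric only; the paper specifies no evaluation protocol."""
    if len(decoded) != len(ground_truth):
        raise ValueError("score lengths differ")

    def canon(measure):
        # A measure is a list of voices; each voice a list of event
        # labels. Compare as a set so voice ordering does not matter.
        return frozenset(tuple(voice) for voice in measure)

    hits = sum(canon(d) == canon(g) for d, g in zip(decoded, ground_truth))
    return hits / len(decoded)
```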
Original abstract
We propose a new approach for a practical two-stage Optical Music Recognition (OMR) pipeline, with a particular focus on its second stage. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphonic staff notation, especially piano scores, where voice separation and intra-measure timing are the main bottlenecks. Our approach formulates second-stage decoding as a structure decoding problem and uses topology recognition with probability-guided search (BeadSolver) as its core method. We also describe a data strategy that combines procedural generation with recognition-feedback annotations. The result is a practical decoding component for real OMR systems and a path to accumulate structured score data for future end-to-end, multimodal, and RL-style methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage Optical Music Recognition (OMR) pipeline focused on the second stage for complex polyphonic staff notation, especially piano scores. Given symbol and event candidates from the visual pipeline, it decodes them into an editable score structure by formulating the task as structure decoding and applying topology recognition with probability-guided search (BeadSolver) to resolve voice separation and intra-measure timing. It also outlines a data strategy combining procedural generation with recognition-feedback annotations to support the decoder and enable future end-to-end or RL-style methods.
Significance. If empirically validated on realistic inputs, the approach could provide a practical, modular decoding component for OMR systems handling polyphonic complexities that current visual pipelines struggle with, while also generating structured score data to bootstrap more advanced multimodal models.
major comments (2)
- [Abstract] The central claim that the method yields 'a practical decoding component for real OMR systems' is unsupported: the manuscript supplies only a high-level description and a data-generation strategy, with no quantitative results, failure-mode analysis, or comparisons against existing voice-separation or timing-resolution baselines.
- [Method] The description of BeadSolver (topology recognition plus probability-guided search) is presented without equations, pseudocode, or complexity analysis, preventing assessment of whether the search is guaranteed to produce verifiable scores or scales to dense polyphony.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which highlight important areas for strengthening the manuscript. We address each major comment below and commit to revisions that clarify the scope and add necessary technical details without misrepresenting the current contribution.
Point-by-point responses
-
Referee: [Abstract] The central claim that the method yields 'a practical decoding component for real OMR systems' is unsupported: the manuscript supplies only a high-level description and a data-generation strategy, with no quantitative results, failure-mode analysis, or comparisons against existing voice-separation or timing-resolution baselines.
Authors: We agree that the manuscript as submitted provides only a high-level methodological description and data strategy without quantitative results or baseline comparisons, so the claim of yielding a 'practical decoding component for real OMR systems' is not yet empirically supported. The paper's primary contribution is the formulation of second-stage decoding as a structure decoding problem together with the outlined data-generation approach. In revision we will modify the abstract and introduction to tone down this claim, explicitly framing the work as a proposed framework whose practicality remains to be validated, and we will add a dedicated section discussing planned empirical evaluation, failure-mode analysis, and comparisons against existing voice-separation and timing-resolution methods. revision: yes
-
Referee: [Method] The description of BeadSolver (topology recognition plus probability-guided search) is presented without equations, pseudocode, or complexity analysis, preventing assessment of whether the search is guaranteed to produce verifiable scores or scales to dense polyphony.
Authors: We acknowledge that the current description of BeadSolver remains high-level and lacks formal specification. In the revised manuscript we will supply the missing mathematical formulation of the topology recognition step, pseudocode for the probability-guided search procedure, and a complexity analysis. We will also include a discussion of the conditions under which the search produces verifiable scores and its expected scaling behavior with respect to polyphonic density. revision: yes
Circularity Check
No significant circularity detected
Full rationale
The manuscript describes a two-stage OMR pipeline at the level of a methodological proposal, formulating the second stage as a structure-decoding task solved via topology recognition and probability-guided search (BeadSolver) together with a procedural data-generation strategy. No equations, fitted parameters, predictions, or derivations appear in the provided text. No self-citations are invoked as load-bearing premises, and the central claim does not reduce to any input by construction. The approach is therefore self-contained as a forward description of a proposed component rather than a closed derivation.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] D. S. Prerau. DO-RE-MI: a program that recognizes music notation. Computers and the Humanities, 9(1):25–29, 1975.
- [2] M. Good. MusicXML: an internet-friendly format for sheet music. In Proceedings of XML 2001, Orlando, FL, 2001.
- [3] D. Bainbridge and T. Bell. A music notation construction engine for optical music recognition. Software: Practice and Experience, 33(2):173–200, 2003. doi:10.1002/spe.502
- [4] H.-W. Nienhuys and J. Nieuwenhuizen. LilyPond, a system for automated music engraving. In Proceedings of the XIV Colloquium on Musical Informatics (XIV CIM), Florence, Italy, 2003. https://lilypond.org
- [5] P. Bellini, I. Bruno, and P. Nesi. Assessing optical music recognition tools. Computer Music Journal, 31(1):68–93, 2007. doi:10.1162/comj.2007.31.1.68
- [6] R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Lecture Notes in Computer Science, pages 72–83, 2007. doi:10.1007/978-3-540-75538-8_7
- [7] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. S. Marcal, C. Guedes, and J. S. Cardoso. Optical music recognition: state-of-the-art and open issues. International Journal of Multimedia Information Retrieval, 1(3):173–190, 2012. doi:10.1007/s13735-012-0004-6
- [8] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012. doi:10.1109/TCIAIG.2012.2186810
- [9] C. Raphael and R. Jin. Optical music recognition on the International Music Score Library Project. In SPIE Proceedings, page 90210F, 2013.
- [10] J. Hajič and P. Pecina. The MUSCIMA++ dataset for handwritten optical music recognition. In 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), pages 39–46, 2017. doi:10.1109/ICDAR.2017.16
- [11] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017. doi:10.1038/nature24270
- [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS), pages 5998–6008, 2017. arXiv:1706.03762
- [13] J. Calvo-Zaragoza and D. Rizo. End-to-end neural optical music recognition of monophonic scores. Applied Sciences, 8(4):606, 2018. doi:10.3390/app8040606
- [14]
- [15] L. Tuggener, I. Elezi, J. Schmidhuber, M. Pelillo, and T. Stadelmann. DeepScores: a dataset for segmentation, detection and classification of tiny objects. In 24th International Conference on Pattern Recognition (ICPR), pages 3704–3709, 2018. doi:10.1109/ICPR.2018.8545307
- [16] J. Calvo-Zaragoza, J. Hajič Jr., and A. Pacha. Understanding optical music recognition. ACM Computing Surveys, 53(4):1–35, 2020. doi:10.1145/3397499
- [17] L. Tuggener, Y. P. Satyawan, A. Pacha, J. Schmidhuber, and T. Stadelmann. The DeepScoresV2 dataset and benchmark for music object detection. In 25th International Conference on Pattern Recognition (ICPR), pages 9188–9195, 2021. doi:10.1109/ICPR48806.2021.9412290
- [18] A. Liu, L. Zhang, Y. Mei, B. Han, Z. Cai, Z. Zhu, and J. Xiao. Residual recurrent CRNN for end-to-end optical music recognition on monophonic scores. In Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding (MMPT@ICMR), pages 23–27, 2021. doi:10.1145/3463945.3469056
- [19] S. Geng, M. Josifoski, M. Peyrard, and R. West. Grammar-constrained decoding for structured NLP tasks without finetuning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 10932–10952, 2023.
- [20] A. Ríos-Vila, D. Rizo, J. M. Iñesta, and J. Calvo-Zaragoza. End-to-end optical music recognition for pianoform sheet music. International Journal on Document Analysis and Recognition (IJDAR), 26(3):347–362, 2023. doi:10.1007/s10032-023-00432-z
- [21] P. Torras, S. Biswas, and A. Fornés. A unified representation framework for the evaluation of optical music recognition systems. International Journal on Document Analysis and Recognition (IJDAR), 27:379–393, 2024. doi:10.1007/s10032-024-00485-8
- [22]
- [23] B. Meyer, L. Tuggener, S. Hänzi, D. Schmid, E. Ayfer, B. F. Grewe, A. Abdulkadir, and T. Stadelmann. A document is worth a structured record: Principled inductive bias design for document recognition. arXiv:2507.08458, 2025.
Appendix excerpts
Event attributes. Each event i carries:
- an integer order o_i (consecutive integers mark the same voice; a gap of more than one starts a new voice);
- a tick offset t_i ≥ 0;
- a division class d_i ∈ {0, …, 8} (whole through 256th) and a dot count δ_i ∈ {0, 1, 2};
- scalar attributes (beam state, stem direction, grace, timeWarp, fullMeasure).
The ordering and duration assignments together determine the voice structure and the tick layout; scalar attributes are resolved deterministically from predispositions after search.
Three-phase action sequence. Each search step handles exactly one event through three sub-deci...
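The attribute definitions above are enough to sketch two small helpers: converting a division class and dot count to a tick duration, and grouping events into voices from their order integers. The tick base of 1920 per whole note (quarter = 480) is an assumption inferred from the tick values quoted in the annotation excerpts, not stated by the paper.

```python
WHOLE_TICKS = 1920  # ASSUMPTION: quarter note = 480 ticks; the paper
                    # does not state its tick base explicitly.

def event_ticks(division, dots):
    """Duration in ticks for division class d in {0..8} (0 = whole,
    8 = 256th) with 0-2 augmentation dots, per the ranges above."""
    if not (0 <= division <= 8 and 0 <= dots <= 2):
        raise ValueError("outside the ranges given in the appendix")
    base = WHOLE_TICKS / 2 ** division   # halve per division step
    return base * (2 - 2 ** -dots)       # dots scale by 1x, 1.5x, 1.75x

def voices_from_order(orders):
    """Group event indices into voices: consecutive order integers
    share a voice; a gap of more than one starts a new voice."""
    voices = []
    prev = None
    for idx, order in enumerate(orders):
        if prev is None or order - prev > 1:
            voices.append([])  # start a new voice
        voices[-1].append(idx)
        prev = order
    return voices
```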
Annotation pitfalls.
- Excessive voice splitting (most common): creating 3–4 voices when 1–2 would suffice. If events are sequential (non-overlapping) on the same staff, they are one voice; stem direction changes alone do not justify a new voice.
- Blindly copying timeWarp from the regulation: unless actual tuplet brackets are visible in the image, set timeWarp: null.
- Wrong division leading to cascading tick errors: always verify each event's division against feature.divisions and the image before computing ticks.
- Not verifying feature.divisions confidence: the ML classifier's confidence array is often more reliable than the regulation's assigned division.
- Mixing staves in one voice: events with different event.staff values must never be in the same voice array.
Output format. Output only a JSON block with fixes. Each fix is a RegulationSolution plus measureIndex and status. Fix fields: measureIndex (index in spartito); events (array with required id, tick, tickGroup, timeWarp, and optional division/dots/beam/grace ov...
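The fix-format fields named above can be shape-checked with a small validator. The excerpt truncates mid-list, so the optional-field set is likely incomplete, and the semantics of tickGroup are not given; this is a sketch over only the fields the excerpt names.

```python
REQUIRED_EVENT_FIELDS = {"id", "tick", "tickGroup", "timeWarp"}
# The source excerpt is truncated after "grace", so this optional set
# is probably incomplete.
OPTIONAL_EVENT_FIELDS = {"division", "dots", "beam", "grace"}

def validate_fix(fix):
    """Shape-check one fix object against the fields named in the
    excerpt: measureIndex, status, and an events array whose entries
    carry the required keys (timeWarp may be null)."""
    if not isinstance(fix.get("measureIndex"), int):
        raise ValueError("measureIndex must be an integer index in spartito")
    if "status" not in fix:
        raise ValueError("missing status")
    for ev in fix.get("events", []):
        missing = REQUIRED_EVENT_FIELDS - set(ev)
        if missing:
            raise ValueError(f"event {ev.get('id')} missing {sorted(missing)}")
    return True
```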
Worked annotation example.
- beamBroken = true: the beam group 8–9–10 has events 5 and 7 between them, but they are in different voices.
- The voices are currently [1, 2, 3], [4, 6, 9, 10], [5, 7, 8]: three voices.
- spaceTime = 0.375 (1440 × 0.375 = 540 ticks of unused time). Examining the feature data reveals a critical mismatch: event 6, which should be a quarter note, has its highest division confidence at index 2 (quarter note) with a value of 5.39, while event 7, an 8th rest, shows strongest confidence for an 8th note. Events 6 and 7 are very close (x = 19.56 vs 1...
- Most helpful principles: beamed notes must be consecutive in a single voice (critical for identifying why the beam was broken, since events 8, 9, 10 were split across voices); cross-referencing images with event data (the x-positions of events 6 and 7, both ≈ 19.6, confirmed they were simultaneous, requiring voice separation); voice separation for simul...
- Suggested additions: prioritize beam continuity over spaceTime minimization (a measure with a continuous beam but some spaceTime gaps is preferred over a "perfect" voice assignment with broken beams); stem direction as a voice hint (notes with opposite stem directions at the same tick are strong indicators they belong to different voices); simultaneous...
- Common patterns and pitfalls: broken beams from split voice assignments (the beamed group 8, 9, 10 was split with 8 in voice 2 and 9, 10 in voice 1, breaking the beam even though the timing worked); misinterpreting simultaneous events (events 6, a quarter note, and 7, an eighth rest, at tick 480 appeared to conflict but belong to different voices); over-opt...
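The beam-continuity principle from the log above can be expressed as a small check: a beamed group is broken if its events span multiple voices, or if other events are interleaved between them within their voice. Voices here are lists of event ids, mirroring the [1, 2, 3] / [4, 6, 9, 10] / [5, 7, 8] example; this is an illustration of the stated principle, not the paper's code.

```python
def beam_broken(beam_group, voices):
    """True if a beamed group violates 'beamed notes must be
    consecutive in a single voice': either its events sit in
    different voices, or other events are interleaved between
    them within the shared voice."""
    home = []
    for event in beam_group:
        home.append(next(i for i, v in enumerate(voices) if event in v))
    if len(set(home)) > 1:
        return True  # split across voices
    voice = voices[home[0]]
    positions = sorted(voice.index(e) for e in beam_group)
    # Consecutive positions within the voice <=> no interleaved events.
    return positions != list(range(positions[0], positions[-1] + 1))
```

With the three voices quoted in the log, beam_broken([8, 9, 10], voices) is True, consistent with the beamBroken = true diagnosis.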