pith. machine review for the scientific record.

arxiv: 2604.20522 · v3 · submitted 2026-04-22 · 💻 cs.SD · cs.CV

Recognition: unknown

From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:05 UTC · model grok-4.3

classification 💻 cs.SD cs.CV
keywords Optical Music Recognition · Polyphonic Notation · Structure Decoding · Topology Recognition · Voice Separation · Two-Stage Pipeline · BeadSolver

The pith

A two-stage pipeline decodes visual music symbols into editable polyphonic scores using topology recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a practical two-stage Optical Music Recognition pipeline and concentrates on its second stage: turning symbol and event candidates into a complete score structure. It targets complex polyphonic notation such as piano music, where separating voices and determining intra-measure timings are the main difficulties. The approach treats structure decoding as a topology recognition problem solved by probability-guided search, supported by a data strategy that mixes procedural generation with feedback-based annotations. If successful, this yields verifiable, exportable scores that can feed real OMR systems and supply structured data for later multimodal or learning-based methods.
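To make the decoding target concrete, here is a minimal sketch of the kind of per-measure data the two stages would exchange. All names and the tick resolution are illustrative assumptions, not the paper's schema; the source only specifies division classes 0–8 (whole through 256th) and dot counts 0–2.

```python
from dataclasses import dataclass

@dataclass
class EventCandidate:
    """One note/rest candidate emitted by the visual pipeline (stage one)."""
    event_id: int
    staff: int      # staff index within the system (0 = upper, 1 = lower)
    x: float        # horizontal position within the measure
    division: int   # 0..8: whole note through 256th
    dots: int = 0   # augmentation dots, 0..2

@dataclass
class DecodedEvent:
    """The same event after stage-two structure decoding."""
    candidate: EventCandidate
    voice: int      # voice index assigned by voice separation
    tick: int       # onset time within the measure, in ticks

def duration_ticks(division: int, dots: int = 0, whole: int = 1920) -> int:
    """Nominal duration at an assumed 1920 ticks per whole note;
    each augmentation dot extends the value by half the previous step."""
    base = whole // (2 ** division)
    return base * (2 ** (dots + 1) - 1) // (2 ** dots)

# A dotted quarter (division 2, one dot): 480 * 3 // 2 = 720 ticks.
```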

Core claim

The paper claims that given symbol and event candidates from a visual pipeline, second-stage decoding for polyphonic OMR can be solved by recognizing the underlying topology of notes and events through probability-guided search (BeadSolver), together with a hybrid data-generation approach that yields editable and exportable score structures.

What carries the argument

BeadSolver, a probability-guided search procedure that performs topology recognition to resolve voice separation and intra-measure timing from candidate symbols.
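The review can only gesture at BeadSolver's internals, since the text provides no pseudocode. One plausible reading of "probability-guided search" is a beam search over per-event voice assignments scored by a learned model; the sketch below is that reading only, with a toy stand-in scorer, and every name in it is hypothetical.

```python
import heapq
from typing import Callable, Sequence

def beadsolver_sketch(
    events: Sequence[int],
    num_voices: int,
    score_step: Callable[[tuple, int, int], float],
    beam_width: int = 32,
) -> tuple:
    """Beam search over voice assignments, handling one event per step.

    score_step(prefix, event, voice) plays the role of the learned
    topology-recognition model: it returns a log-probability-like score
    for assigning `event` to `voice` given the partial assignment `prefix`.
    Returns the highest-scoring complete assignment (voice index per event).
    """
    frontier = [(0.0, ())]  # (accumulated negative score, partial assignment)
    for ev in events:
        candidates = []
        for neg_score, prefix in frontier:
            for voice in range(num_voices):
                step = score_step(prefix, ev, voice)
                candidates.append((neg_score - step, prefix + (voice,)))
        # keep only the beam_width most promising partial topologies
        frontier = heapq.nsmallest(beam_width, candidates)
    return min(frontier)[1]

# Toy scorer: mildly prefer keeping consecutive events in the same voice.
def toy_score(prefix: tuple, ev: int, voice: int) -> float:
    return 0.0 if (not prefix or prefix[-1] == voice) else -1.0

assignment = beadsolver_sketch(range(4), num_voices=2, score_step=toy_score)
```

With this toy scorer the search keeps all four events in one voice. The real system would presumably score candidates with the model's outputs plus measure-level constraints (time-signature totals, beam continuity), none of which are specified here in enough detail to reproduce.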

If this is right

  • Yields a usable decoding component for existing OMR systems handling complex piano notation.
  • Enables accumulation of large-scale structured score data from image sources.
  • Opens a route to train future end-to-end multimodal and reinforcement-learning OMR models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could expose the current limits of visual detection stages when run on noisy real-world inputs.
  • Structured outputs produced this way could serve as supervision for models that skip explicit symbol detection entirely.
  • The topology-search framing may generalize to other dense symbolic notations beyond Western staff music.

Load-bearing premise

The visual pipeline supplies symbol and event candidates accurate enough for the decoder to resolve voice separation and timing ambiguities without frequent failure.

What would settle it

Apply the full pipeline to a set of real piano score images, supply the actual noisy symbol candidates produced by a standard visual detector, and check whether the resulting scores match ground-truth transcriptions at a rate substantially higher than current one-stage or rule-based baselines.
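The check described above amounts to measure-level agreement between decoded and ground-truth structures. A minimal scoring harness might look like the following, under two assumptions the paper does not fix: each measure is represented as a list of (voice, tick, duration) tuples, and exact match is the criterion.

```python
def measure_accuracy(decoded, ground_truth):
    """Fraction of measures whose decoded structure exactly matches the
    ground truth. Each measure is a list of (voice, tick, duration) tuples;
    exact multiset match is one possible criterion, and voice indices are
    compared literally (a permutation-invariant match would be fairer to
    the decoder but is omitted for brevity)."""
    assert len(decoded) == len(ground_truth), "score lists must align"
    if not ground_truth:
        return 1.0
    hits = sum(sorted(d) == sorted(g) for d, g in zip(decoded, ground_truth))
    return hits / len(ground_truth)
```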

Figures

Figures reproduced from arXiv: 2604.20522 by Nan Xu, Shengchao Hou, Shiheng Li.

Figure 1. Examples of structural ambiguity in complex piano notation. (a) Multiple voices can…
Figure 2. A compact overview of the Starry OMR pipeline.
Figure 3. Visual pipeline for candidate generation. The page-level stage detects systems and…
Figure 4. Examples of visual predictors used before symbolic assembly. Panels (a), (c), and (d) show…
Figure 5. Simplified pipeline from semantic recognition to event-candidate assembly. Dense semantic…
Figure 6. Regulation target on a difficult piano measure. (a) The image shows the original notation,…
Figure 7. Measure regulation as chained structure recovery. (a) A polyphonic measure example. (b)…
Figure 8. A simplified illustration of the overall tree-search workflow in principle. In the Pass step,…
Figure 9. x–tick geometry consistency. Left: an ambiguous measure; there are 2 potential topology candidates for regulation. Right: events plotted in the (x, t) plane, where t is the cumulative tick position obtained by accumulating event durations along the voice chain; each group is normalised by its total measure duration. A well-regulated voice (black circles) distributes its events nearly uniformly along both a…
Figure 10. BeadPicker architecture for topology recognition. The model reads measure-level event…
Figure 11. Training-data pipeline for topology recognition. Structured symbolic music is rendered…
Figure 12. A preliminary agent-assisted annotation loop for issue measures. The agent requests…
Figure 13. Representative failure cases illustrating the performance boundary of each regulation…
Figure 14. Rendered output of the multi-voice cross-staff Paraff example. Three voices share and…
Figure 15. A sample score generated by a learned Paraff generation model via constrained autore…
Figure 16. Measure 274 initial state. Left: composite stave image read by the agent. Right: topology…
Figure 17. Attempt 1 topology. Merging all staff-1 events into one voice places ev. 6 and ev. 7 (both…
Figure 18. Fix summary topology. Voice 0 (red): staff-0 events [1,2,3]. Voice 1 (green): staff-1…
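The x–tick geometry consistency idea from Figure 9 can be sketched numerically: normalise each event's horizontal position and cumulative onset tick to [0, 1] and score how closely the two sequences track each other. The concrete score below (one minus the mean absolute deviation) is an assumption; the caption describes the uniformity criterion but gives no formula.

```python
def xt_consistency(events, measure_ticks, measure_width):
    """Score one voice's x-tick geometry consistency.

    `events` is a time-ordered list of (x_position, duration_ticks) for a
    single voice chain. Each event's x position and cumulative onset tick
    are normalised to [0, 1]; a well-regulated voice keeps the two
    sequences close. Returns 1.0 for perfect agreement, lower otherwise.
    """
    tick = 0
    deviations = []
    for x, duration in events:
        deviations.append(abs(x / measure_width - tick / measure_ticks))
        tick += duration
    return 1.0 - sum(deviations) / len(deviations)

# Four evenly spaced quarter notes in a 4/4 measure line up exactly.
uniform = [(0, 480), (25, 480), (50, 480), (75, 480)]
score = xt_consistency(uniform, measure_ticks=1920, measure_width=100)
```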
Original abstract

We propose a new approach for a practical two-stage Optical Music Recognition (OMR) pipeline, with a particular focus on its second stage. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphonic staff notation, especially piano scores, where voice separation and intra-measure timing are the main bottlenecks. Our approach formulates second-stage decoding as a structure decoding problem and uses topology recognition with probability-guided search (BeadSolver) as its core method. We also describe a data strategy that combines procedural generation with recognition-feedback annotations. The result is a practical decoding component for real OMR systems and a path to accumulate structured score data for future end-to-end, multimodal, and RL-style methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes a two-stage Optical Music Recognition (OMR) pipeline focused on the second stage for complex polyphonic staff notation, especially piano scores. Given symbol and event candidates from the visual pipeline, it decodes them into an editable score structure by formulating the task as structure decoding and applying topology recognition with probability-guided search (BeadSolver) to resolve voice separation and intra-measure timing. It also outlines a data strategy combining procedural generation with recognition-feedback annotations to support the decoder and enable future end-to-end or RL-style methods.

Significance. If empirically validated on realistic inputs, the approach could provide a practical, modular decoding component for OMR systems handling polyphonic complexities that current visual pipelines struggle with, while also generating structured score data to bootstrap more advanced multimodal models.

major comments (2)
  1. [Abstract] The central claim that the method yields 'a practical decoding component for real OMR systems' is unsupported: the manuscript supplies only a high-level description and data-generation strategy, with no quantitative results, failure-mode analysis, or comparisons against existing voice-separation or timing-resolution baselines.
  2. [Method] The description of BeadSolver (topology recognition plus probability-guided search) is presented without equations, pseudocode, or complexity analysis, preventing assessment of whether the search is guaranteed to produce verifiable scores or scales to dense polyphony.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for strengthening the manuscript. We address each major comment below and commit to revisions that clarify the scope and add necessary technical details without misrepresenting the current contribution.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the method yields 'a practical decoding component for real OMR systems' is unsupported: the manuscript supplies only a high-level description and data-generation strategy, with no quantitative results, failure-mode analysis, or comparisons against existing voice-separation or timing-resolution baselines.

    Authors: We agree that the manuscript as submitted provides only a high-level methodological description and data strategy without quantitative results or baseline comparisons, so the claim of yielding a 'practical decoding component for real OMR systems' is not yet empirically supported. The paper's primary contribution is the formulation of second-stage decoding as a structure decoding problem together with the outlined data-generation approach. In revision we will modify the abstract and introduction to tone down this claim, explicitly framing the work as a proposed framework whose practicality remains to be validated, and we will add a dedicated section discussing planned empirical evaluation, failure-mode analysis, and comparisons against existing voice-separation and timing-resolution methods. revision: yes

  2. Referee: [Method] The description of BeadSolver (topology recognition plus probability-guided search) is presented without equations, pseudocode, or complexity analysis, preventing assessment of whether the search is guaranteed to produce verifiable scores or scales to dense polyphony.

    Authors: We acknowledge that the current description of BeadSolver remains high-level and lacks formal specification. In the revised manuscript we will supply the missing mathematical formulation of the topology recognition step, pseudocode for the probability-guided search procedure, and a complexity analysis. We will also include a discussion of the conditions under which the search produces verifiable scores and its expected scaling behavior with respect to polyphonic density. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The manuscript describes a two-stage OMR pipeline at the level of a methodological proposal, formulating the second stage as a structure-decoding task solved via topology recognition and probability-guided search (BeadSolver) together with a procedural data-generation strategy. No equations, fitted parameters, predictions, or derivations appear in the provided text. No self-citations are invoked as load-bearing premises, and the central claim does not reduce to any input by construction. The approach is therefore self-contained as a forward description of a proposed component rather than a closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the method name BeadSolver appears to be an internal label for the proposed search procedure rather than a new physical entity.

pith-pipeline@v0.9.0 · 5436 in / 1116 out tokens · 27984 ms · 2026-05-09T23:05:02.207008+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 18 canonical work pages · 2 internal anchors

  1. [1] D. S. Prerau. DO-RE-MI: a program that recognizes music notation. Computers and the Humanities, 9(1):25–29, 1975.

  2. [2] M. Good. MusicXML: an internet-friendly format for sheet music. In Proceedings of XML 2001, Orlando, FL, 2001.

  3. [3] D. Bainbridge and T. Bell. A music notation construction engine for optical music recognition. Software: Practice and Experience, 33(2):173–200, 2003. doi:10.1002/spe.502

  4. [4] H.-W. Nienhuys and J. Nieuwenhuizen. LilyPond, a system for automated music engraving. In Proceedings of the XIV Colloquium on Musical Informatics (XIV CIM), Florence, Italy, 2003. https://lilypond.org

  5. [5] P. Bellini, I. Bruno, and P. Nesi. Assessing optical music recognition tools. Computer Music Journal, 31(1):68–93, 2007. doi:10.1162/comj.2007.31.1.68

  6. [6] R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Lecture Notes in Computer Science, pages 72–83, 2007. doi:10.1007/978-3-540-75538-8_7

  7. [7] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. S. Marcal, C. Guedes, and J. S. Cardoso. Optical music recognition: state-of-the-art and open issues. International Journal of Multimedia Information Retrieval, 1(3):173–190, 2012. doi:10.1007/s13735-012-0004-6

  8. [8] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012. doi:10.1109/TCIAIG.2012.2186810

  9. [9] C. Raphael and R. Jin. Optical music recognition on the International Music Score Library Project. In SPIE Proceedings, page 90210F, 2013.

  10. [10] J. Hajič and P. Pecina. The MUSCIMA++ dataset for handwritten optical music recognition. In 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), pages 39–46, 2017. doi:10.1109/ICDAR.2017.16

  11. [11] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017. doi:10.1038/nature24270

  12. [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS), pages 5998–6008, 2017. arXiv:1706.03762

  13. [13] J. Calvo-Zaragoza and D. Rizo. End-to-end neural optical music recognition of monophonic scores. Applied Sciences, 8(4):606, 2018. doi:10.3390/app8040606

  14. [14] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck. Music Transformer: generating music with long-term structure. arXiv:1809.04281, 2018.

  15. [15] L. Tuggener, I. Elezi, J. Schmidhuber, M. Pelillo, and T. Stadelmann. DeepScores: a dataset for segmentation, detection and classification of tiny objects. In 24th International Conference on Pattern Recognition (ICPR), pages 3704–3709, 2018. doi:10.1109/ICPR.2018.8545307

  16. [16] J. Calvo-Zaragoza, J. Hajič Jr., and A. Pacha. Understanding optical music recognition. ACM Computing Surveys, 53(4):1–35, 2020. doi:10.1145/3397499

  17. [17] L. Tuggener, Y. P. Satyawan, A. Pacha, J. Schmidhuber, and T. Stadelmann. The DeepScoresV2 dataset and benchmark for music object detection. In 25th International Conference on Pattern Recognition (ICPR), pages 9188–9195, 2021. doi:10.1109/ICPR48806.2021.9412290

  18. [18] A. Liu, L. Zhang, Y. Mei, B. Han, Z. Cai, Z. Zhu, and J. Xiao. Residual recurrent CRNN for end-to-end optical music recognition on monophonic scores. In Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding (MMPT@ICMR), pages 23–27, 2021. doi:10.1145/3463945.3469056

  19. [19] S. Geng, M. Josifoski, M. Peyrard, and R. West. Grammar-constrained decoding for structured NLP tasks without finetuning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 10932–10952, 2023.

  20. [20] A. Ríos-Vila, D. Rizo, J. M. Iñesta, and J. Calvo-Zaragoza. End-to-end optical music recognition for pianoform sheet music. International Journal on Document Analysis and Recognition (IJDAR), 26(3):347–362, 2023. doi:10.1007/s10032-023-00432-z

  21. [21] P. Torras, S. Biswas, and A. Fornés. A unified representation framework for the evaluation of optical music recognition systems. International Journal on Document Analysis and Recognition (IJDAR), 27:379–393, 2024. doi:10.1007/s10032-024-00485-8

  22. [22] G. Yang, M. Zhang, L. Qiu, Y. Wan, and N. A. Smith. Toward a more complete OMR solution. Zenodo record 14877483, 2024. https://zenodo.org/records/14877483

  23. [23] B. Meyer, L. Tuggener, S. Hänzi, D. Schmid, E. Ayfer, B. F. Grewe, A. Abdulkadir, and T. Stadelmann. A document is worth a structured record: Principled inductive bias design for document recognition. arXiv:2507.08458, 2025.
