pith. machine review for the scientific record.

arxiv: 2604.20522 · v3 · submitted 2026-04-22 · 💻 cs.SD · cs.CV

Recognition: unknown

From Image to Music Language: A Two-Stage Structure Decoding Approach for Complex Polyphonic OMR

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 23:05 UTC · model grok-4.3

classification 💻 cs.SD cs.CV
keywords Optical Music Recognition · Polyphonic Notation · Structure Decoding · Topology Recognition · Voice Separation · Two-Stage Pipeline · BeadSolver

The pith

A two-stage pipeline decodes visual music symbols into editable polyphonic scores using topology recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a practical two-stage Optical Music Recognition pipeline and concentrates on its second stage: turning symbol and event candidates into a complete score structure. It targets complex polyphonic notation such as piano music, where separating voices and determining intra-measure timings are the main difficulties. The approach treats structure decoding as a topology recognition problem solved by probability-guided search, supported by a data strategy that mixes procedural generation with feedback-based annotations. If successful, this yields verifiable, exportable scores that can feed real OMR systems and supply structured data for later multimodal or learning-based methods.
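To make the decoding target concrete, here is a minimal sketch of the kind of per-measure data the two stages would exchange. All names and the tick resolution are illustrative assumptions, not the paper's schema; the source only specifies division classes 0–8 (whole through 256th) and dot counts 0–2.

```python
from dataclasses import dataclass

@dataclass
class EventCandidate:
    """One note/rest candidate emitted by the visual pipeline (stage one)."""
    event_id: int
    staff: int      # staff index within the system (0 = upper, 1 = lower)
    x: float        # horizontal position within the measure
    division: int   # 0..8: whole note through 256th
    dots: int = 0   # augmentation dots, 0..2

@dataclass
class DecodedEvent:
    """The same event after stage-two structure decoding."""
    candidate: EventCandidate
    voice: int      # voice index assigned by voice separation
    tick: int       # onset time within the measure, in ticks

def duration_ticks(division: int, dots: int = 0, whole: int = 1920) -> int:
    """Nominal duration at an assumed 1920 ticks per whole note;
    each augmentation dot extends the value by half the previous step."""
    base = whole // (2 ** division)
    return base * (2 ** (dots + 1) - 1) // (2 ** dots)

# A dotted quarter (division 2, one dot): 480 * 3 // 2 = 720 ticks.
```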

Core claim

The paper claims that given symbol and event candidates from a visual pipeline, second-stage decoding for polyphonic OMR can be solved by recognizing the underlying topology of notes and events through probability-guided search (BeadSolver), together with a hybrid data-generation approach that yields editable and exportable score structures.

What carries the argument

BeadSolver, a probability-guided search procedure that performs topology recognition to resolve voice separation and intra-measure timing from candidate symbols.
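The review can only gesture at BeadSolver's internals, since the text provides no pseudocode. One plausible reading of "probability-guided search" is a beam search over per-event voice assignments scored by a learned model; the sketch below is that reading only, with a toy stand-in scorer, and every name in it is hypothetical.

```python
import heapq
from typing import Callable, Sequence

def beadsolver_sketch(
    events: Sequence[int],
    num_voices: int,
    score_step: Callable[[tuple, int, int], float],
    beam_width: int = 32,
) -> tuple:
    """Beam search over voice assignments, handling one event per step.

    score_step(prefix, event, voice) plays the role of the learned
    topology-recognition model: it returns a log-probability-like score
    for assigning `event` to `voice` given the partial assignment `prefix`.
    Returns the highest-scoring complete assignment (voice index per event).
    """
    frontier = [(0.0, ())]  # (accumulated negative score, partial assignment)
    for ev in events:
        candidates = []
        for neg_score, prefix in frontier:
            for voice in range(num_voices):
                step = score_step(prefix, ev, voice)
                candidates.append((neg_score - step, prefix + (voice,)))
        # keep only the beam_width most promising partial topologies
        frontier = heapq.nsmallest(beam_width, candidates)
    return min(frontier)[1]

# Toy scorer: mildly prefer keeping consecutive events in the same voice.
def toy_score(prefix: tuple, ev: int, voice: int) -> float:
    return 0.0 if (not prefix or prefix[-1] == voice) else -1.0

assignment = beadsolver_sketch(range(4), num_voices=2, score_step=toy_score)
```

With this toy scorer the search keeps all four events in one voice. The real system would presumably score candidates with the model's outputs plus measure-level constraints (time-signature totals, beam continuity), none of which are specified here in enough detail to reproduce.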

If this is right

  • Yields a usable decoding component for existing OMR systems handling complex piano notation.
  • Enables accumulation of large-scale structured score data from image sources.
  • Opens a route to train future end-to-end multimodal and reinforcement-learning OMR models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could expose the current limits of visual detection stages when run on noisy real-world inputs.
  • Structured outputs produced this way could serve as supervision for models that skip explicit symbol detection entirely.
  • The topology-search framing may generalize to other dense symbolic notations beyond Western staff music.

Load-bearing premise

The visual pipeline supplies symbol and event candidates accurate enough for the decoder to resolve voice separation and timing ambiguities without frequent failure.

What would settle it

Apply the full pipeline to a set of real piano score images, supply the actual noisy symbol candidates produced by a standard visual detector, and check whether the resulting scores match ground-truth transcriptions at a rate substantially higher than current one-stage or rule-based baselines.
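The check described above amounts to measure-level agreement between decoded and ground-truth structures. A minimal scoring harness might look like the following, under two assumptions the paper does not fix: each measure is represented as a list of (voice, tick, duration) tuples, and exact match is the criterion.

```python
def measure_accuracy(decoded, ground_truth):
    """Fraction of measures whose decoded structure exactly matches the
    ground truth. Each measure is a list of (voice, tick, duration) tuples;
    exact multiset match is one possible criterion, and voice indices are
    compared literally (a permutation-invariant match would be fairer to
    the decoder but is omitted for brevity)."""
    assert len(decoded) == len(ground_truth), "score lists must align"
    if not ground_truth:
        return 1.0
    hits = sum(sorted(d) == sorted(g) for d, g in zip(decoded, ground_truth))
    return hits / len(ground_truth)
```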

Figures

Figures reproduced from arXiv: 2604.20522 by Nan Xu, Shengchao Hou, Shiheng Li.

Figure 1. Examples of structural ambiguity in complex piano notation. (a) Multiple voices can…
Figure 2. A compact overview of the Starry OMR pipeline.
Figure 3. Visual pipeline for candidate generation. The page-level stage detects systems and…
Figure 4. Examples of visual predictors used before symbolic assembly. Panels (a), (c), and (d) show…
Figure 5. Simplified pipeline from semantic recognition to event-candidate assembly. Dense semantic…
Figure 6. Regulation target on a difficult piano measure. (a) The image shows the original notation,…
Figure 7. Measure regulation as chained structure recovery. (a) A polyphonic measure example. (b)…
Figure 8. A simplified illustration of the overall tree-search workflow in principle. In the Pass step,…
Figure 9. x–tick geometry consistency. Left: an ambiguous measure; there are 2 potential topology candidates for regulation. Right: events plotted in the (x, t) plane, where t is the cumulative tick position obtained by accumulating event durations along the voice chain; each group is normalised by its total measure duration. A well-regulated voice (black circles) distributes its events nearly uniformly along both a…
Figure 10. BeadPicker architecture for topology recognition. The model reads measure-level event…
Figure 11. Training-data pipeline for topology recognition. Structured symbolic music is rendered…
Figure 12. A preliminary agent-assisted annotation loop for issue measures. The agent requests…
Figure 13. Representative failure cases illustrating the performance boundary of each regulation…
Figure 14. Rendered output of the multi-voice cross-staff Paraff example. Three voices share and…
Figure 15. A sample score generated by a learned Paraff generation model via constrained autore…
Figure 16. Measure 274 initial state. Left: composite stave image read by the agent. Right: topology…
Figure 17. Attempt 1 topology. Merging all staff-1 events into one voice places ev. 6 and ev. 7 (both…
Figure 18. Fix summary topology. Voice 0 (red): staff-0 events [1,2,3]. Voice 1 (green): staff-1…
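The x–tick geometry consistency idea from Figure 9 can be sketched numerically: normalise each event's horizontal position and cumulative onset tick to [0, 1] and score how closely the two sequences track each other. The concrete score below (one minus the mean absolute deviation) is an assumption; the caption describes the uniformity criterion but gives no formula.

```python
def xt_consistency(events, measure_ticks, measure_width):
    """Score one voice's x-tick geometry consistency.

    `events` is a time-ordered list of (x_position, duration_ticks) for a
    single voice chain. Each event's x position and cumulative onset tick
    are normalised to [0, 1]; a well-regulated voice keeps the two
    sequences close. Returns 1.0 for perfect agreement, lower otherwise.
    """
    tick = 0
    deviations = []
    for x, duration in events:
        deviations.append(abs(x / measure_width - tick / measure_ticks))
        tick += duration
    return 1.0 - sum(deviations) / len(deviations)

# Four evenly spaced quarter notes in a 4/4 measure line up exactly.
uniform = [(0, 480), (25, 480), (50, 480), (75, 480)]
score = xt_consistency(uniform, measure_ticks=1920, measure_width=100)
```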
Original abstract

We propose a new approach for a practical two-stage Optical Music Recognition (OMR) pipeline, with a particular focus on its second stage. Given symbol and event candidates from the visual pipeline, we decode them into an editable, verifiable, and exportable score structure. We focus on complex polyphonic staff notation, especially piano scores, where voice separation and intra-measure timing are the main bottlenecks. Our approach formulates second-stage decoding as a structure decoding problem and uses topology recognition with probability-guided search (BeadSolver) as its core method. We also describe a data strategy that combines procedural generation with recognition-feedback annotations. The result is a practical decoding component for real OMR systems and a path to accumulate structured score data for future end-to-end, multimodal, and RL-style methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes a two-stage Optical Music Recognition (OMR) pipeline focused on the second stage for complex polyphonic staff notation, especially piano scores. Given symbol and event candidates from the visual pipeline, it decodes them into an editable score structure by formulating the task as structure decoding and applying topology recognition with probability-guided search (BeadSolver) to resolve voice separation and intra-measure timing. It also outlines a data strategy combining procedural generation with recognition-feedback annotations to support the decoder and enable future end-to-end or RL-style methods.

Significance. If empirically validated on realistic inputs, the approach could provide a practical, modular decoding component for OMR systems handling polyphonic complexities that current visual pipelines struggle with, while also generating structured score data to bootstrap more advanced multimodal models.

major comments (2)
  1. [Abstract] The central claim that the method yields 'a practical decoding component for real OMR systems' is unsupported: the manuscript supplies only a high-level description and data-generation strategy, with no quantitative results, failure-mode analysis, or comparisons against existing voice-separation or timing-resolution baselines.
  2. [Method] The description of BeadSolver (topology recognition plus probability-guided search) is presented without equations, pseudocode, or complexity analysis, preventing assessment of whether the search is guaranteed to produce verifiable scores or scales to dense polyphony.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed comments, which highlight important areas for strengthening the manuscript. We address each major comment below and commit to revisions that clarify the scope and add necessary technical details without misrepresenting the current contribution.

Point-by-point responses
  1. Referee: [Abstract] The central claim that the method yields 'a practical decoding component for real OMR systems' is unsupported: the manuscript supplies only a high-level description and data-generation strategy, with no quantitative results, failure-mode analysis, or comparisons against existing voice-separation or timing-resolution baselines.

    Authors: We agree that the manuscript as submitted provides only a high-level methodological description and data strategy without quantitative results or baseline comparisons, so the claim of yielding a 'practical decoding component for real OMR systems' is not yet empirically supported. The paper's primary contribution is the formulation of second-stage decoding as a structure decoding problem together with the outlined data-generation approach. In revision we will modify the abstract and introduction to tone down this claim, explicitly framing the work as a proposed framework whose practicality remains to be validated, and we will add a dedicated section discussing planned empirical evaluation, failure-mode analysis, and comparisons against existing voice-separation and timing-resolution methods. revision: yes

  2. Referee: [Method] The description of BeadSolver (topology recognition plus probability-guided search) is presented without equations, pseudocode, or complexity analysis, preventing assessment of whether the search is guaranteed to produce verifiable scores or scales to dense polyphony.

    Authors: We acknowledge that the current description of BeadSolver remains high-level and lacks formal specification. In the revised manuscript we will supply the missing mathematical formulation of the topology recognition step, pseudocode for the probability-guided search procedure, and a complexity analysis. We will also include a discussion of the conditions under which the search produces verifiable scores and its expected scaling behavior with respect to polyphonic density. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

Full rationale

The manuscript describes a two-stage OMR pipeline at the level of a methodological proposal, formulating the second stage as a structure-decoding task solved via topology recognition and probability-guided search (BeadSolver) together with a procedural data-generation strategy. No equations, fitted parameters, predictions, or derivations appear in the provided text. No self-citations are invoked as load-bearing premises, and the central claim does not reduce to any input by construction. The approach is therefore self-contained as a forward description of a proposed component rather than a closed derivation.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are identifiable from the abstract; the method name BeadSolver appears to be an internal label for the proposed search procedure rather than a new physical entity.

pith-pipeline@v0.9.0 · 5436 in / 1116 out tokens · 27984 ms · 2026-05-09T23:05:02.207008+00:00 · methodology


Reference graph

Works this paper leans on

38 extracted references · 18 canonical work pages · 2 internal anchors

  1. [1] D. S. Prerau. DO-RE-MI: a program that recognizes music notation. Computers and the Humanities, 9(1):25–29, 1975.

  2. [2] M. Good. MusicXML: an internet-friendly format for sheet music. In Proceedings of XML 2001, Orlando, FL, 2001.

  3. [3] D. Bainbridge and T. Bell. A music notation construction engine for optical music recognition. Software: Practice and Experience, 33(2):173–200, 2003. doi:10.1002/spe.502

  4. [4] H.-W. Nienhuys and J. Nieuwenhuizen. LilyPond, a system for automated music engraving. In Proceedings of the XIV Colloquium on Musical Informatics (XIV CIM), Florence, Italy, 2003. https://lilypond.org

  5. [5] P. Bellini, I. Bruno, and P. Nesi. Assessing optical music recognition tools. Computer Music Journal, 31(1):68–93, 2007. doi:10.1162/comj.2007.31.1.68

  6. [6] R. Coulom. Efficient selectivity and backup operators in Monte-Carlo tree search. In Lecture Notes in Computer Science, pages 72–83, 2007. doi:10.1007/978-3-540-75538-8_7

  7. [7] A. Rebelo, I. Fujinaga, F. Paszkiewicz, A. R. S. Marcal, C. Guedes, and J. S. Cardoso. Optical music recognition: state-of-the-art and open issues. International Journal of Multimedia Information Retrieval, 1(3):173–190, 2012. doi:10.1007/s13735-012-0004-6

  8. [8] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lucas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez, S. Samothrakis, and S. Colton. A survey of Monte Carlo tree search methods. IEEE Transactions on Computational Intelligence and AI in Games, 4(1):1–43, 2012. doi:10.1109/TCIAIG.2012.2186810

  9. [9] C. Raphael and R. Jin. Optical music recognition on the International Music Score Library Project. In SPIE Proceedings, page 90210F, 2013.

  10. [10] J. Hajič and P. Pecina. The MUSCIMA++ dataset for handwritten optical music recognition. In 14th IAPR International Conference on Document Analysis and Recognition (ICDAR), pages 39–46, 2017. doi:10.1109/ICDAR.2017.16

  11. [11] D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, Y. Chen, T. Lillicrap, F. Hui, L. Sifre, G. van den Driessche, T. Graepel, and D. Hassabis. Mastering the game of Go without human knowledge. Nature, 550:354–359, 2017. doi:10.1038/nature24270

  12. [12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS), pages 5998–6008, 2017. arXiv:1706.03762

  13. [13] J. Calvo-Zaragoza and D. Rizo. End-to-end neural optical music recognition of monophonic scores. Applied Sciences, 8(4):606, 2018. doi:10.3390/app8040606

  14. [14] C.-Z. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, I. Simon, C. Hawthorne, A. M. Dai, M. D. Hoffman, M. Dinculescu, and D. Eck. Music Transformer: generating music with long-term structure. arXiv:1809.04281, 2018.

  15. [15] L. Tuggener, I. Elezi, J. Schmidhuber, M. Pelillo, and T. Stadelmann. DeepScores: a dataset for segmentation, detection and classification of tiny objects. In 24th International Conference on Pattern Recognition (ICPR), pages 3704–3709, 2018. doi:10.1109/ICPR.2018.8545307

  16. [16] J. Calvo-Zaragoza, J. Hajič Jr., and A. Pacha. Understanding optical music recognition. ACM Computing Surveys, 53(4):1–35, 2020. doi:10.1145/3397499

  17. [17] L. Tuggener, Y. P. Satyawan, A. Pacha, J. Schmidhuber, and T. Stadelmann. The DeepScoresV2 dataset and benchmark for music object detection. In 25th International Conference on Pattern Recognition (ICPR), pages 9188–9195, 2021. doi:10.1109/ICPR48806.2021.9412290

  18. [18] A. Liu, L. Zhang, Y. Mei, B. Han, Z. Cai, Z. Zhu, and J. Xiao. Residual recurrent CRNN for end-to-end optical music recognition on monophonic scores. In Proceedings of the 2021 Workshop on Multi-Modal Pre-Training for Multimedia Understanding (MMPT@ICMR), pages 23–27, 2021. doi:10.1145/3463945.3469056

  19. [19] S. Geng, M. Josifoski, M. Peyrard, and R. West. Grammar-constrained decoding for structured NLP tasks without finetuning. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 10932–10952, 2023.

  20. [20] A. Ríos-Vila, D. Rizo, J. M. Iñesta, and J. Calvo-Zaragoza. End-to-end optical music recognition for pianoform sheet music. International Journal on Document Analysis and Recognition (IJDAR), 26(3):347–362, 2023. doi:10.1007/s10032-023-00432-z

  21. [21] P. Torras, S. Biswas, and A. Fornés. A unified representation framework for the evaluation of optical music recognition systems. International Journal on Document Analysis and Recognition (IJDAR), 27:379–393, 2024. doi:10.1007/s10032-024-00485-8

  22. [22] G. Yang, M. Zhang, L. Qiu, Y. Wan, and N. A. Smith. Toward a more complete OMR solution. Zenodo record 14877483, 2024. https://zenodo.org/records/14877483

  23. [23] B. Meyer, L. Tuggener, S. Hänzi, D. Schmid, E. Ayfer, B. F. Grewe, A. Abdulkadir, and T. Stadelmann. A document is worth a structured record: Principled inductive bias design for document recognition. arXiv:2507.08458, 2025.
