Precise and Simple Audio-to-Score Alignment
Pith reviewed 2026-05-20 03:52 UTC · model grok-4.3
The pith
Audio onset and spectral features can be matched directly to symbolic score positions using dynamic programming to achieve precise alignment without transcription or score synthesis.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors derive a bespoke dynamic programming-based matching algorithm from symbolic alignment methods to match sequential audio features encoding onset and spectral activation directly to score positions. This produces an alignment method that is both more precise than widely used audio-to-audio approaches based on synthesized scores and adaptable to diverse timbral characteristics without a separate transcription model, while maintaining at worst linear algorithmic complexity in the lengths of the score and audio feature sequence.
What carries the argument
The bespoke dynamic programming-based matching algorithm that directly aligns audio onset and spectral activation features to score positions, derived from symbolic alignment methods.
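To make the idea of matching score positions to audio frames concrete, here is a minimal sketch of a monotonic dynamic programming assignment. It is not the authors' algorithm. The function name `align_events_to_frames`, the precomputed `cost[i][t]` matrix, and the rule that each score event takes exactly one strictly later frame than its predecessor are all assumptions made for illustration. The sketch fills the whole event-by-frame table and does not attempt to reproduce the complexity bound claimed in the paper.

```python
from typing import List, Sequence


def align_events_to_frames(cost: Sequence[Sequence[float]]) -> List[int]:
    """Assign each score event to one audio frame, frames strictly increasing.

    cost[i][t] is a hypothetical mismatch between score event i and audio
    frame t (lower is better). Returns the chosen frame index per event.
    """
    n_events = len(cost)
    n_frames = len(cost[0]) if n_events else 0
    if n_events == 0 or n_events > n_frames:
        raise ValueError("need at least one event and no more events than frames")

    inf = float("inf")
    total = [[inf] * n_frames for _ in range(n_events)]
    back = [[-1] * n_frames for _ in range(n_events)]
    total[0] = list(cost[0])

    for i in range(1, n_events):
        best_prev, best_prev_t = inf, -1
        for t in range(n_frames):
            # Running minimum over the previous event's frames 0..t-1.
            if t > 0 and total[i - 1][t - 1] < best_prev:
                best_prev, best_prev_t = total[i - 1][t - 1], t - 1
            if best_prev < inf:
                total[i][t] = cost[i][t] + best_prev
                back[i][t] = best_prev_t

    # Backtrack from the cheapest frame of the last event.
    t = min(range(n_frames), key=lambda k: total[n_events - 1][k])
    path = [t]
    for i in range(n_events - 1, 0, -1):
        t = back[i][t]
        path.append(t)
    path.reverse()
    return path
```

The running minimum keeps each row of the table at one pass over the frames, so the sketch costs time proportional to events times frames; a practical method would restrict the frames considered per event.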
Load-bearing premise
That audio features for onset and spectral activation can be reliably matched to score positions by the dynamic programming algorithm without first transcribing the audio into symbolic notes.
What would settle it
Run the alignment on a large dataset of solo piano recordings and measure the alignment error. If the error is not lower than that of audio-to-audio baselines that use synthesized scores, the precision claim is falsified.
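A comparison of this kind needs an agreed error measure. The sketch below shows one common choice: absolute note-onset error in milliseconds, summarized by its median, its mean, and the share of notes within a tolerance. The function name `alignment_error_stats` and the 50 ms default tolerance are assumptions for illustration; the paper's own metric and thresholds are not stated in this review.

```python
from statistics import mean, median
from typing import Dict, Sequence


def alignment_error_stats(
    predicted_s: Sequence[float],
    reference_s: Sequence[float],
    tolerance_ms: float = 50.0,  # hypothetical threshold, not from the paper
) -> Dict[str, float]:
    """Summarize absolute onset errors between two lists of times in seconds."""
    if not predicted_s or len(predicted_s) != len(reference_s):
        raise ValueError("inputs must be non-empty and of equal length")

    # Per-note absolute error, converted from seconds to milliseconds.
    errors_ms = [abs(p - r) * 1000.0 for p, r in zip(predicted_s, reference_s)]
    within = sum(e <= tolerance_ms for e in errors_ms) / len(errors_ms)
    return {
        "median_ms": median(errors_ms),
        "mean_ms": mean(errors_ms),
        "within_tolerance": within,
    }
```

Computing the same statistics for the proposed method and for the baseline on identical recordings gives the direct comparison the test calls for.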
Original abstract
Audio-to-score alignment is a long-standing challenge in music information retrieval and arguably the most widely applicable alignment task for music research. Alignment algorithms match two versions of a piece of music, and for this to work these versions need to be in comparable formats. Audio-to-audio alignment matches audio features; when matching audio files to scores, they must either synthesize the score or derive audio-like features by means of piano rolls or similar feature sequences. Symbolic alignment, by contrast, matches symbolically encoded notes; in an audio-to-score scenario these would be obtained by a transcription of the audio file. In this article, we present an algorithm that bridges audio-like and symbol-level features directly. Sequential audio features encoding onset and spectral activation are matched to score positions by a bespoke dynamic programming-based matching algorithm derived from symbolic alignment methods. The resulting method is both precise - surpassing widely used audio-to-audio approaches based on synthesized scores -, and remains flexible in its digital signal processing components, i.e., the method is adaptable to diverse timbral characteristics without requiring a separate transcription model. Furthermore it inherits some of the symbolic alignment runtime advantages with an algorithmic complexity that is at worst linear in the length of the (typically short) symbolic score and (typically long) audio feature sequence. In the following sections, we provide a detailed algorithm description and evaluate its alignment quality on a large-scale dataset of solo piano recordings.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a direct audio-to-score alignment algorithm that matches sequential audio features encoding onset and spectral activation to symbolic score positions via a bespoke dynamic programming procedure adapted from symbolic alignment methods. It claims superior precision over synthesized-score audio-to-audio baselines, flexibility across timbres without a separate transcription model, and at-worst linear complexity in the lengths of the score and audio sequence. A detailed algorithm description is provided together with an evaluation of alignment quality on a large-scale dataset of solo piano recordings.
Significance. If the central claims hold, the work would offer a practically useful simplification for music information retrieval tasks such as score following and performance analysis by eliminating the need for either score synthesis or audio transcription while retaining symbolic-style runtime scaling. The explicit provision of the algorithm together with large-scale piano evaluation constitutes a clear strength that supports reproducibility and applicability.
major comments (1)
- [§3.2] §3.2 (Dynamic Programming Matching), cost function and transition rules: the manuscript does not specify explicit mechanisms (e.g., local warping windows, onset tolerance bands, or timbre-normalized distances) to compensate for audio onset jitter and continuous spectral mismatch when mapping to discrete score events. Because the central claim of precision superiority rests on the DP successfully bridging these domains without transcription, the absence of these adaptations is load-bearing and requires clarification or additional experiments.
minor comments (2)
- [Table 1] Table 1: the reported alignment error statistics lack units or normalization details (e.g., whether errors are in beats or seconds), which hinders direct comparison with prior audio-to-audio baselines.
- [§4] §4 (Evaluation): the dataset description should include the range of performance tempi and recording conditions to substantiate the claim of adaptability to diverse timbral characteristics.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address the single major comment below and have revised the manuscript to provide the requested clarifications on the dynamic programming components.
Referee: [§3.2] §3.2 (Dynamic Programming Matching), cost function and transition rules: the manuscript does not specify explicit mechanisms (e.g., local warping windows, onset tolerance bands, or timbre-normalized distances) to compensate for audio onset jitter and continuous spectral mismatch when mapping to discrete score events. Because the central claim of precision superiority rests on the DP successfully bridging these domains without transcription, the absence of these adaptations is load-bearing and requires clarification or additional experiments.
Authors: We agree that the original manuscript would have benefited from more explicit description of how the cost function and transition rules address audio-specific variations such as onset jitter and spectral mismatch. In the revised version we have expanded §3.2 with the following details: the local cost function combines an onset-strength term (derived from the spectral flux feature) with a frame-wise spectral activation distance; a tolerance band of one feature frame on either side of each score event is built into the cost computation to absorb typical onset jitter arising from the 20 ms hop size. The transition rules, adapted from the symbolic alignment literature, explicitly permit limited insertion and deletion paths (up to two consecutive audio frames or score events) without incurring prohibitive cost, thereby accommodating continuous timing deviations. Because the evaluation is restricted to solo piano recordings, timbre normalization is not applied; the spectral features are already L2-normalized per frame, which proved sufficient for the reported precision gains. We have not introduced new experiments, as the existing large-scale piano dataset already quantifies the end-to-end alignment accuracy, but we have added a short parameter-sensitivity paragraph confirming that the chosen tolerance values are stable across the test set. We believe these additions directly address the load-bearing concern while preserving the manuscript’s focus.
Revision: partial
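The local cost described in the response above can be sketched as follows. This is an illustration under stated assumptions, not the manuscript's formula: the function names, the use of cosine distance on L2-normalized vectors, the choice of the strongest onset inside the band, and the equal weighting of the two terms are all hypothetical. Only the ingredients themselves (an onset-strength term, a frame-wise spectral distance, a band of one frame on either side, per-frame L2 normalization) come from the response.

```python
import math
from typing import List, Sequence


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale a vector to unit L2 norm; an all-zero vector is returned as-is."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def local_cost(
    onset_strength: Sequence[float],   # per-frame onset strength in [0, 1]
    spectra: Sequence[Sequence[float]],  # per-frame spectral activation
    template: Sequence[float],         # expected activation of the score event
    frame: int,
    band: int = 1,                     # tolerance band in frames, each side
    onset_weight: float = 0.5,         # hypothetical weighting of the two terms
) -> float:
    """Cost of placing a score event at `frame` (lower is better)."""
    lo = max(0, frame - band)
    hi = min(len(onset_strength) - 1, frame + band)
    # Tolerance band: credit the strongest onset within +/- band frames.
    onset_term = 1.0 - max(onset_strength[lo:hi + 1])

    # Cosine distance between the unit-norm frame spectrum and the template.
    a = l2_normalize(spectra[frame])
    b = l2_normalize(template)
    spectral_term = 1.0 - sum(x * y for x, y in zip(a, b))

    return onset_weight * onset_term + (1.0 - onset_weight) * spectral_term
```

With a band of one frame, an onset detected one hop early or late still counts as a full onset match, which is how a band of this kind would absorb jitter on the order of the hop size.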
Circularity Check
No circularity: bespoke DP adaptation is independent of inputs
The paper describes a new matching algorithm that directly aligns sequential audio features (onset and spectral activation) to score positions using a dynamic programming approach adapted from symbolic methods. No equations or steps reduce by construction to fitted parameters, self-definitions, or self-citation chains. The central claim of precision without transcription rests on the algorithm design and dataset evaluation rather than tautological renaming or imported uniqueness theorems. This is a standard non-circular finding for a methods paper presenting an adapted DSP technique.
Reference graph
Works this paper leans on
-
[1]
INTRODUCTION Audio-to-score alignment is a long-standing challenge in music information retrieval and arguably the most widely applicable alignment task for music research. Alignment algorithms match two versions of a piece of music, and for this to work these versions need to be in comparable formats. Audio-to-audio alignment matches audio features; ...
-
[2]
ALIGNMENT METHOD 2.1 Signal Processing The audio signal is processed into two feature sequences, one for onset (time) information, the other for spectral (pitch) information. As a first step, the stereo signal is summed to mono and then sent through an IIR filterbank of second-order Butterworth filters. The filterbank consists of 88 filters centered at th...
-
[3]
EVALUATION We evaluate our algorithm on over 300 piano performances from the (n)ASAP Dataset [7]. We compare it to an audio-to-audio alignment baseline which uses Dynamic Time Warping on both onset-related and spectral features. The implementation is given by the synctoolbox library [8]. Audio-to-audio alignment based on a mix of features and synthesiz...
-
[4]
CONCLUSION We introduce an audio-to-score algorithm which uses both onset and spectral audio features in a note-based matching procedure typically found in symbolic alignment. Our method leverages dynamic beat period estimates and score-informed pitch-wise onset and spectral processing to produce highly precise alignments. It relies on standard digital ...
-
[5]
T. Kwon, D. Jeong, and J. Nam, “Audio-to-score alignment of piano music using rnn-based automatic music transcription,” in Proceedings of the 14th Sound and Music Computing Conference (SMC), 2017.
-
[6]
S. Murgul, M. Reiser, M. Heizmann, and C. Seibert, “Fine-tuning midi-to-audio alignment using a neural network on piano roll and cqt representations,” arXiv preprint arXiv:2506.22237, 2025.
-
[7]
J. Zeitler, B. Maman, and M. Müller, “Robust and accurate audio synchronization using raw features from transcription models,” in Proceedings of the International Society of Music Information Retrieval Conference (ISMIR), 2024, pp. 120–127.
-
[8]
F. Simonetta, S. Ntalampiras, and F. Avanzini, “Audio-to-score alignment using deep automatic music transcription,” in 23rd International Workshop on Multimedia Signal Processing (MMSP), 2021.
-
[9]
S. D. Peter, P. Hu, and G. Widmer, “Pairing real-time piano transcription with symbol-level tracking for precise and robust score following,” in Proceedings of the Sound and Music Computing Conference (SMC), 2025.
-
[10]
S. Böck and G. Widmer, “Maximum filter vibrato suppression for onset detection,” in Proceedings of the 16th International Conference on Digital Audio Effects (DAFx-13), Maynooth, Ireland, September 2013.
-
[11]
S. D. Peter, C. E. Cancino-Chacón, F. Foscarin, A. P. McLeod, F. Henkel, E. Karystinaios, and G. Widmer, “Automatic note-level score-to-performance alignments in the asap dataset,” Transactions of the International Society for Music Information Retrieval (TISMIR), 2023.
-
[12]
M. Müller, Y. Özer, M. Krause, T. Prätzlich, and J. Driedger, “Sync toolbox: A python package for efficient, robust, and accurate music synchronization,” Journal of Open Source Software, vol. 6, no. 64, p. 3434, 2021.