Precise and Simple Audio-to-Score Alignment

Gerhard Widmer; Patricia Hu; Silvan Peter

arxiv: 2605.20014 · v1 · pith:LWXPRG7Cnew · submitted 2026-05-19 · 💻 cs.SD

Precise and Simple Audio-to-Score Alignment

Silvan Peter , Patricia Hu , Gerhard Widmer This is my paper

Pith reviewed 2026-05-20 03:52 UTC · model grok-4.3

classification 💻 cs.SD

keywords audio-to-score alignmentdynamic programmingmusic information retrievalonset detectionspectral featuressymbolic alignmentsolo piano recordings

0 comments

The pith

Audio onset and spectral features can be matched directly to symbolic score positions using dynamic programming to achieve precise alignment without transcription or score synthesis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents an algorithm for aligning audio recordings of music to their corresponding scores by extracting sequential features that capture note onsets and spectral activations from the audio signal. These features are then aligned to positions in the symbolic score using a dynamic programming method adapted from techniques used in symbolic music alignment. This direct bridging avoids the common steps of transcribing the audio into notes or synthesizing the score into an audio-like signal. A reader would care because it offers a more accurate and flexible way to synchronize performances with scores, which is fundamental for analyzing large collections of music recordings and studying performance practices.

Core claim

The authors derive a bespoke dynamic programming-based matching algorithm from symbolic alignment methods to match sequential audio features encoding onset and spectral activation directly to score positions. This produces an alignment method that is both more precise than widely used audio-to-audio approaches based on synthesized scores and adaptable to diverse timbral characteristics without a separate transcription model, while maintaining at worst linear algorithmic complexity in the lengths of the score and audio feature sequence.

What carries the argument

The bespoke dynamic programming-based matching algorithm that directly aligns audio onset and spectral activation features to score positions, derived from symbolic alignment methods.

Load-bearing premise

That audio features for onset and spectral activation can be reliably matched to score positions by the dynamic programming algorithm without first transcribing the audio into symbolic notes.

What would settle it

Running the alignment on a large dataset of solo piano recordings and measuring the alignment error; if the error is not lower than that of synthesized audio-to-audio baselines, the precision claim would be falsified.

Figures

Figures reproduced from arXiv: 2605.20014 by Gerhard Widmer, Patricia Hu, Silvan Peter.

**Figure 1.** Figure 1: Spectral (left) and onset (right) activation features on the first ten seconds of a piano recording. There are 88 [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

read the original abstract

Audio-to-score alignment is a long-standing challenge in music information retrieval and arguably the most widely applicable alignment task for music research. Alignment algorithms match two versions of a piece of music, and for this to work these versions need to be in comparable formats. Audio-to-audio alignment matches audio features; when matching audio files to scores, they must either synthesize the score or derive audio-like features by means of piano rolls or similar feature sequences. Symbolic alignment, by contrast, matches symbolically encoded notes; in an audio-to-score scenario these would be obtained by a transcription of the audio file. In this article, we present an algorithm that bridges audio-like and symbol-level features directly. Sequential audio features encoding onset and spectral activation are matched to score positions by a bespoke dynamic programming-based matching algorithm derived from symbolic alignment methods. The resulting method is both precise - surpassing widely used audio-to-audio approaches based on synthesized scores -, and remains flexible in its digital signal processing components, i.e., the method is adaptable to diverse timbral characteristics without requiring a separate transcription model. Furthermore it inherits some of the symbolic alignment runtime advantages with an algorithmic complexity that is at worst linear in the length of the (typically short) symbolic score and (typically long) audio feature sequence. In the following sections, we provide a detailed algorithm description and evaluate its alignment quality on a large-scale dataset of solo piano recordings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper gives a direct DP-based way to align audio features to scores without synthesis or transcription, with a claimed precision edge and linear runtime, though robustness to audio jitter needs checking.

read the letter

The main thing here is a method that matches sequential audio features for onsets and spectral activation straight to score positions using a dynamic programming algorithm adapted from symbolic alignment. It skips both score synthesis into audio and audio transcription into symbols, which keeps things simpler and more flexible across different timbres. The linear complexity in the audio sequence length is a practical plus for longer recordings against short scores. They evaluate on a large dataset of solo piano recordings, which adds some concrete support to the precision claims over synthesized audio-to-audio baselines. That evaluation step is useful and shows they are not just describing an idea in the abstract. The adaptation itself looks like a legitimate extension rather than a big conceptual leap, but it fills a gap for cases where you want to stay in audio features without extra models. On the soft side, the stress-test point about onset jitter and timbre-induced mismatches is worth watching. Symbolic DP works on clean discrete events, and audio features bring detection errors, continuous energy, and performance timing shifts. If the cost function or transition rules do not build in explicit tolerance bands or normalization, the precision advantage could narrow in real data. The abstract leaves this implicit, so the full algorithm section and any error breakdowns need to show how they handled it. If those details are there and the results hold, the central argument is fine. This is for MIR researchers or performance analysts who need accurate alignments without adding transcription steps. A reader who wants implementable code or dataset-backed comparisons would get something out of it. I would send it to peer review so experts can look at the DP adaptations and the quantitative results in detail.

Referee Report

1 major / 2 minor

Summary. The manuscript presents a direct audio-to-score alignment algorithm that matches sequential audio features encoding onset and spectral activation to symbolic score positions via a bespoke dynamic programming procedure adapted from symbolic alignment methods. It claims superior precision over synthesized-score audio-to-audio baselines, flexibility across timbres without a separate transcription model, and at-worst linear complexity in the lengths of the score and audio sequence. A detailed algorithm description is provided together with an evaluation of alignment quality on a large-scale dataset of solo piano recordings.

Significance. If the central claims hold, the work would offer a practically useful simplification for music information retrieval tasks such as score following and performance analysis by eliminating the need for either score synthesis or audio transcription while retaining symbolic-style runtime scaling. The explicit provision of the algorithm together with large-scale piano evaluation constitutes a clear strength that supports reproducibility and applicability.

major comments (1)

[§3.2] §3.2 (Dynamic Programming Matching), cost function and transition rules: the manuscript does not specify explicit mechanisms (e.g., local warping windows, onset tolerance bands, or timbre-normalized distances) to compensate for audio onset jitter and continuous spectral mismatch when mapping to discrete score events. Because the central claim of precision superiority rests on the DP successfully bridging these domains without transcription, the absence of these adaptations is load-bearing and requires clarification or additional experiments.

minor comments (2)

[Table 1] Table 1: the reported alignment error statistics lack units or normalization details (e.g., whether errors are in beats or seconds), which hinders direct comparison with prior audio-to-audio baselines.
[§4] §4 (Evaluation): the dataset description should include the range of performance tempi and recording conditions to substantiate the claim of adaptability to diverse timbral characteristics.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the single major comment below and have revised the manuscript to provide the requested clarifications on the dynamic programming components.

read point-by-point responses

Referee: [§3.2] §3.2 (Dynamic Programming Matching), cost function and transition rules: the manuscript does not specify explicit mechanisms (e.g., local warping windows, onset tolerance bands, or timbre-normalized distances) to compensate for audio onset jitter and continuous spectral mismatch when mapping to discrete score events. Because the central claim of precision superiority rests on the DP successfully bridging these domains without transcription, the absence of these adaptations is load-bearing and requires clarification or additional experiments.

Authors: We agree that the original manuscript would have benefited from more explicit description of how the cost function and transition rules address audio-specific variations such as onset jitter and spectral mismatch. In the revised version we have expanded §3.2 with the following details: the local cost function combines an onset-strength term (derived from the spectral flux feature) with a frame-wise spectral activation distance; a tolerance band of one feature frame on either side of each score event is built into the cost computation to absorb typical onset jitter arising from the 20 ms hop size. The transition rules, adapted from the symbolic alignment literature, explicitly permit limited insertion and deletion paths (up to two consecutive audio frames or score events) without incurring prohibitive cost, thereby accommodating continuous timing deviations. Because the evaluation is restricted to solo piano recordings, timbre normalization is not applied; the spectral features are already L2-normalized per frame, which proved sufficient for the reported precision gains. We have not introduced new experiments, as the existing large-scale piano dataset already quantifies the end-to-end alignment accuracy, but we have added a short parameter-sensitivity paragraph confirming that the chosen tolerance values are stable across the test set. We believe these additions directly address the load-bearing concern while preserving the manuscript’s focus. revision: partial

Circularity Check

0 steps flagged

No circularity: bespoke DP adaptation is independent of inputs

full rationale

The paper describes a new matching algorithm that directly aligns sequential audio features (onset and spectral activation) to score positions using a dynamic programming approach adapted from symbolic methods. No equations or steps reduce by construction to fitted parameters, self-definitions, or self-citation chains. The central claim of precision without transcription rests on the algorithm design and dataset evaluation rather than tautological renaming or imported uniqueness theorems. This is a standard non-circular finding for a methods paper presenting an adapted DSP technique.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to identify specific free parameters, axioms, or invented entities; the method appears to rely on standard dynamic programming for sequence matching.

pith-pipeline@v0.9.0 · 5771 in / 990 out tokens · 39440 ms · 2026-05-20T03:52:19.524002+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 1 internal anchor

[1]

Alignment algorithms match two versions of a piece of music, and for this to work these versions need to be in compara- ble formats

INTRODUCTION Audio-to-score alignment is a long-standing challenge in music information retrieval and arguably the most widely applicable alignment task for music research. Alignment algorithms match two versions of a piece of music, and for this to work these versions need to be in compara- ble formats. Audio-to-audio alignment matches audio fea- tures; ...

work page
[2]

Precise and Simple Audio-to-Score Alignment

ALIGNMENT METHOD 2.1 Signal Processing The audio signal is processed into two feature sequences, one for onset (time) information, the other for spectral (pitch) information. As a first step, the stereo signal is summed to mono and then sent through an IIR filterbank of second-order Butterworth filters. The filterbank consists of 88 filters centered at th...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[3]

DualDTWMatcher

EV ALUA TION We evaluate our algorithm on over 300 piano performances from the (n)ASAP Dataset [7]. We compare it to an audio- to-audio alignment baseline which uses Dynamic Time Warping on both onset-related and spectral features. The implementation is given by the synctoolbox library [8]. Audio-to-audio alignment based on a mix of features and synthesiz...

work page
[4]

Our method leverages dynamic beat period estimates and score-informed pitch-wise onset and spectral processing to produce highly precise alignments

CONCLUSION We introduce an audio-to-score algorithm which uses both onset and spectral audio features in a note-based match- ing procedure typically found in symbolic alignment. Our method leverages dynamic beat period estimates and score-informed pitch-wise onset and spectral processing to produce highly precise alignments. It relies on standard digital ...

work page 2020
[5]

Audio-to-score align- ment of piano music using rnn-based automatic music transcription,

T. Kwon, D. Jeong, and J. Nam, “Audio-to-score align- ment of piano music using rnn-based automatic music transcription,” inProceedings of the 14th Sound and Music Computing Conference (SMC), 2017

work page 2017
[6]

Fine-tuning midi-to-audio alignment using a neural network on piano roll and cqt representations,

S. Murgul, M. Reiser, M. Heizmann, and C. Seibert, “Fine-tuning midi-to-audio alignment using a neural network on piano roll and cqt representations,”arXiv preprint arXiv:2506.22237, 2025

work page arXiv 2025
[7]

Robust and ac- curate audio synchronization using raw features from transcription models

J. Zeitler, B. Maman, and M. M ¨uller, “Robust and ac- curate audio synchronization using raw features from transcription models.” inProceedings of the Interna- tional Society of Music Information Retrieval Confer- ence (ISMIR), 2024, pp. 120–127

work page 2024
[8]

Audio- to-score alignment using deep automatic music tran- scription,

F. Simonetta, S. Ntalampiras, and F. Avanzini, “Audio- to-score alignment using deep automatic music tran- scription,” in23rd International Workshop on Multi- media Signal Processing (MMSP), 2021

work page 2021
[9]

Pairing real-time piano transcription with symbol-level tracking for pre- cise and robust score following,

S. D. Peter, P. Hu, and G. Widmer, “Pairing real-time piano transcription with symbol-level tracking for pre- cise and robust score following,” inProceedings of the Sound and Music Computing Conference (SMC), 2025

work page 2025
[10]

Maximum filter vibrato sup- pression for onset detection,

S. B ¨ock and G. Widmer, “Maximum filter vibrato sup- pression for onset detection,” inProceedings of the 16th International Conference on Digital Audio Effects (DAFx-13), Maynooth, Ireland, September 2013

work page 2013
[11]

Automatic note-level score-to-performance align- ments in the asap dataset,

S. D. Peter, C. E. Cancino-Chac ´on, F. Foscarin, A. P. McLeod, F. Henkel, E. Karystinaios, and G. Widmer, “Automatic note-level score-to-performance align- ments in the asap dataset,”Transactions of the In- ternational Society for Music Information Retrieval (TISMIR), 2023

work page 2023
[12]

Sync toolbox: A python package for ef- ficient, robust, and accurate music synchronization,

M. M ¨uller, Y . ¨Ozer, M. Krause, T. Pr ¨atzlich, and J. Driedger, “Sync toolbox: A python package for ef- ficient, robust, and accurate music synchronization,” Journal of Open Source Software, vol. 6, no. 64, p. 3434, 2021

work page 2021

[1] [1]

Alignment algorithms match two versions of a piece of music, and for this to work these versions need to be in compara- ble formats

INTRODUCTION Audio-to-score alignment is a long-standing challenge in music information retrieval and arguably the most widely applicable alignment task for music research. Alignment algorithms match two versions of a piece of music, and for this to work these versions need to be in compara- ble formats. Audio-to-audio alignment matches audio fea- tures; ...

work page

[2] [2]

Precise and Simple Audio-to-Score Alignment

ALIGNMENT METHOD 2.1 Signal Processing The audio signal is processed into two feature sequences, one for onset (time) information, the other for spectral (pitch) information. As a first step, the stereo signal is summed to mono and then sent through an IIR filterbank of second-order Butterworth filters. The filterbank consists of 88 filters centered at th...

work page internal anchor Pith review Pith/arXiv arXiv 2026

[3] [3]

DualDTWMatcher

EV ALUA TION We evaluate our algorithm on over 300 piano performances from the (n)ASAP Dataset [7]. We compare it to an audio- to-audio alignment baseline which uses Dynamic Time Warping on both onset-related and spectral features. The implementation is given by the synctoolbox library [8]. Audio-to-audio alignment based on a mix of features and synthesiz...

work page

[4] [4]

Our method leverages dynamic beat period estimates and score-informed pitch-wise onset and spectral processing to produce highly precise alignments

CONCLUSION We introduce an audio-to-score algorithm which uses both onset and spectral audio features in a note-based match- ing procedure typically found in symbolic alignment. Our method leverages dynamic beat period estimates and score-informed pitch-wise onset and spectral processing to produce highly precise alignments. It relies on standard digital ...

work page 2020

[5] [5]

Audio-to-score align- ment of piano music using rnn-based automatic music transcription,

T. Kwon, D. Jeong, and J. Nam, “Audio-to-score align- ment of piano music using rnn-based automatic music transcription,” inProceedings of the 14th Sound and Music Computing Conference (SMC), 2017

work page 2017

[6] [6]

Fine-tuning midi-to-audio alignment using a neural network on piano roll and cqt representations,

S. Murgul, M. Reiser, M. Heizmann, and C. Seibert, “Fine-tuning midi-to-audio alignment using a neural network on piano roll and cqt representations,”arXiv preprint arXiv:2506.22237, 2025

work page arXiv 2025

[7] [7]

Robust and ac- curate audio synchronization using raw features from transcription models

J. Zeitler, B. Maman, and M. M ¨uller, “Robust and ac- curate audio synchronization using raw features from transcription models.” inProceedings of the Interna- tional Society of Music Information Retrieval Confer- ence (ISMIR), 2024, pp. 120–127

work page 2024

[8] [8]

Audio- to-score alignment using deep automatic music tran- scription,

F. Simonetta, S. Ntalampiras, and F. Avanzini, “Audio- to-score alignment using deep automatic music tran- scription,” in23rd International Workshop on Multi- media Signal Processing (MMSP), 2021

work page 2021

[9] [9]

Pairing real-time piano transcription with symbol-level tracking for pre- cise and robust score following,

S. D. Peter, P. Hu, and G. Widmer, “Pairing real-time piano transcription with symbol-level tracking for pre- cise and robust score following,” inProceedings of the Sound and Music Computing Conference (SMC), 2025

work page 2025

[10] [10]

Maximum filter vibrato sup- pression for onset detection,

S. B ¨ock and G. Widmer, “Maximum filter vibrato sup- pression for onset detection,” inProceedings of the 16th International Conference on Digital Audio Effects (DAFx-13), Maynooth, Ireland, September 2013

work page 2013

[11] [11]

Automatic note-level score-to-performance align- ments in the asap dataset,

S. D. Peter, C. E. Cancino-Chac ´on, F. Foscarin, A. P. McLeod, F. Henkel, E. Karystinaios, and G. Widmer, “Automatic note-level score-to-performance align- ments in the asap dataset,”Transactions of the In- ternational Society for Music Information Retrieval (TISMIR), 2023

work page 2023

[12] [12]

Sync toolbox: A python package for ef- ficient, robust, and accurate music synchronization,

M. M ¨uller, Y . ¨Ozer, M. Krause, T. Pr ¨atzlich, and J. Driedger, “Sync toolbox: A python package for ef- ficient, robust, and accurate music synchronization,” Journal of Open Source Software, vol. 6, no. 64, p. 3434, 2021

work page 2021