pith. machine review for the scientific record.

arxiv: 2604.24386 · v1 · submitted 2026-04-27 · 💻 cs.SD · eess.AS


An event-based sequence modeling approach to recognizing non-triad chords with oversegmentation minimization

Jonghun Park, Leekyung Kim

Pith reviewed 2026-05-07 17:40 UTC · model grok-4.3

classification 💻 cs.SD eess.AS
keywords automatic chord recognition · sequence-to-sequence modeling · oversegmentation minimization · non-triad chords · auto-regressive prediction · token representations · encoder pre-training · music audio analysis

The pith

Reformulating chord recognition as segment-level auto-regressive sequence prediction improves detection of complex non-triad chords while reducing oversegmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that automatic chord recognition can be reframed as predicting sequences of chords over learned audio segments rather than labeling every audio frame. This approach detects changes only at segment boundaries, which curbs the tendency of models to fragment the output into spuriously short chord segments. By using auto-regressive prediction and specially designed token representations with encoder pre-training, the method achieves better results on infrequent complex chords that suffer from data scarcity in standard datasets. A sympathetic reader would care because existing systems produce fragmented outputs and fail on musically rich chords that appear less often in training data.

Core claim

The central discovery is that casting automatic chord recognition as a segment-level sequence-to-sequence task, where chords are predicted auto-regressively at detected boundaries, combined with two types of token representations and encoder pre-training, leads to improved chord recognition accuracy and reduced oversegmentation, with particular benefits for non-triad chords.

What carries the argument

segment-level sequence-to-sequence auto-regressive prediction that identifies chord changes only at learned segment boundaries instead of per frame
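To make the mechanism concrete, here is a toy sketch (not the authors' model) contrasting per-frame argmax decoding with boundary-constrained decoding. The chord vocabulary, posteriors, and boundary positions are invented for illustration, and the boundaries are given here whereas the paper learns them.

```python
# Toy illustration (NOT the authors' model): per-frame argmax vs.
# boundary-constrained decoding on hand-made posteriors.
import numpy as np

CHORDS = ["C:maj", "A:min7", "G:7"]  # tiny vocabulary for the sketch

# 9 frames; ground truth is C:maj (frames 0-3) then A:min7 (frames 4-8),
# i.e. exactly one chord change. Frame 2 carries a noisy posterior.
post = np.array([
    [0.8, 0.1, 0.1],
    [0.7, 0.2, 0.1],
    [0.3, 0.2, 0.5],  # noise momentarily favors G:7
    [0.8, 0.1, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.8, 0.1],
    [0.2, 0.7, 0.1],
    [0.1, 0.7, 0.2],
    [0.1, 0.8, 0.1],
])

def count_changes(labels):
    """Number of chord-change boundaries implied by a frame label sequence."""
    return sum(a != b for a, b in zip(labels, labels[1:]))

def frame_level(post):
    """Independent per-frame argmax: one noisy frame splits a segment."""
    return [CHORDS[i] for i in post.argmax(axis=1)]

def segment_level(post, boundaries):
    """One prediction per segment: chord changes only at given boundaries."""
    labels = []
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        pooled = post[start:end].mean(axis=0)  # pool frames in the segment
        labels += [CHORDS[pooled.argmax()]] * (end - start)
    return labels

frame_changes = count_changes(frame_level(post))                  # oversegmented
segment_changes = count_changes(segment_level(post, [0, 4, 9]))   # one change
```

On this toy input, frame-wise decoding produces three chord changes against one true change, while boundary-constrained decoding recovers the single change, because pooling within a segment averages away the noisy frame.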

If this is right

  • Overall chord recognition accuracy rises, with larger gains on complex and infrequent chord types.
  • Oversegmentation decreases because predictions are restricted to segment boundaries.
  • Structured token representations and encoder pre-training support better time-aligned modeling of music events.
  • Data imbalance for rare non-triad chords is mitigated through the sequence-level formulation.
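The abstract does not specify the two token representations, so the following interleaved (onset, chord) scheme is a purely hypothetical sketch of what a time-aligned chord token stream could look like:

```python
# Hypothetical tokenization sketch: the paper introduces two token
# representations for time-aligned chords, but the abstract does not
# specify them; this interleaved onset/chord scheme is illustrative only.

def encode(annotations):
    """annotations: list of (onset_frame, chord_label) pairs -> token list."""
    tokens = ["<bos>"]
    for onset, chord in annotations:
        tokens += [f"<t={onset}>", chord]  # one time token, then one chord token
    tokens.append("<eos>")
    return tokens

def decode(tokens):
    """Invert encode(), dropping <bos>/<eos>."""
    body = tokens[1:-1]
    return [(int(body[i][3:-1]), body[i + 1]) for i in range(0, len(body), 2)]

seq = encode([(0, "C:maj"), (16, "G:maj"), (48, "A:min7")])
```

The round trip `decode(encode(x)) == x` is what makes a representation like this usable as a seq2seq target: the decoder emits alternating time and chord tokens, and the output can be mapped back to a time-aligned annotation without loss.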

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same boundary-constrained auto-regressive approach could reduce fragmentation in related audio tasks such as note transcription.
  • Pre-training the encoder on unlabeled music audio may transfer to other music information retrieval problems like key detection.
  • Extending the token set to encode harmonic relationships could address remaining ambiguities in chord labeling.

Load-bearing premise

That chord changes occur only at the learned segment boundaries and that auto-regressive prediction at those points will not miss transitions or accumulate identification errors.

What would settle it

An experiment on recordings with rapid non-triad chord changes: if the model merges distinct chords into single segments and scores below a frame-by-frame baseline there, the boundary-constrained formulation fails on its own terms.

read the original abstract

Automatic chord recognition (ACR) extracts time-aligned chord labels from music audio recordings. Despite recent advances, ACR still struggles with oversegmentation, data scarcity, and imbalance, especially in recognizing complex chords such as non-triads, which are unpopular in existing datasets. To address these challenges, we reformulate ACR as a segment-level sequence-to-sequence prediction task, where chord sequences are predicted auto-regressively rather than frame by frame. This design mitigates excessive segmentation by detecting chord changes only at segment boundaries. We further introduce two types of token representations and an encoder pre-training method, both specifically designed for time-aligned chord modeling. Experimental results show that our model improves performance in both chord recognition and segmentation, with notable gains for complex and infrequent chord types. These findings demonstrate the effectiveness of segment-level sequence modeling, structured tokenization, and representation learning for advancing chord recognition systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper reformulates automatic chord recognition (ACR) as a segment-level sequence-to-sequence task in which chord labels are predicted auto-regressively only at learned segment boundaries rather than frame-by-frame. It introduces two token representations and an encoder pre-training scheme designed for time-aligned chord modeling, with the goal of reducing oversegmentation while improving recognition of rare non-triad chords. The abstract states that experiments demonstrate gains in both chord recognition accuracy and segmentation quality, particularly for complex and infrequent chord types.

Significance. If the reported gains are reproducible and attributable to the segment-level auto-regressive reformulation, the work would address two persistent ACR problems—oversegmentation and severe class imbalance for non-triads—using a modeling change that is conceptually clean. The explicit design of tokenization and pre-training for chord sequences is a positive contribution that could be adopted more broadly. However, the significance is currently limited by the absence of quantitative results, baselines, and controls in the provided description, making it impossible to assess whether the central modeling choice drives the improvements or whether they arise from representation learning alone.

major comments (3)
  1. [Experimental section] Experimental section (assumed §4): No ablation is reported that applies the same encoder pre-training and token representations inside a standard frame-level classifier. Without this control, it is impossible to determine whether the claimed reduction in oversegmentation and gains on non-triads are due to the segment-level auto-regressive prediction or to the pre-training and tokenization components, which are plausible confounds.
  2. [§3 (Method)] Abstract and §3 (Method): The central claim that “detecting chord changes only at segment boundaries” mitigates oversegmentation rests on the assumption that the learned boundaries are reliable; yet no quantitative analysis of boundary precision/recall or comparison against a forced-alignment baseline is supplied, leaving the mechanism unverified.
  3. [Results tables] Results tables (assumed §4): The abstract asserts “notable gains for complex and infrequent chord types” but supplies neither per-class F1 scores, confusion matrices, nor error bars. Given the acknowledged data scarcity for non-triads, aggregate metrics alone cannot substantiate the specific claim.
minor comments (2)
  1. [§3] Notation for the two token representations is introduced without a compact tabular comparison of their vocabularies and alignment properties, making it harder to replicate the structured tokenization.
  2. [§3] The description of the encoder pre-training objective lacks an explicit equation or loss formulation, forcing the reader to infer the exact self-supervised task from prose.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects for strengthening the empirical validation of our segment-level modeling approach. We address each major comment below and have revised the manuscript to include the requested controls, analyses, and detailed metrics.

read point-by-point responses
  1. Referee: [Experimental section] Experimental section (assumed §4): No ablation is reported that applies the same encoder pre-training and token representations inside a standard frame-level classifier. Without this control, it is impossible to determine whether the claimed reduction in oversegmentation and gains on non-triads are due to the segment-level auto-regressive prediction or to the pre-training and tokenization components, which are plausible confounds.

    Authors: We agree that this ablation is essential to isolate the contribution of the segment-level auto-regressive reformulation. In the revised manuscript, we have added the requested control experiment applying identical encoder pre-training and token representations within a standard frame-level classifier. The results demonstrate that while pre-training and tokenization yield improvements, the segment-level auto-regressive prediction provides additional gains in segmentation quality and non-triad recognition, supporting our central modeling claim. revision: yes

  2. Referee: [§3 (Method)] Abstract and §3 (Method): The central claim that “detecting chord changes only at segment boundaries” mitigates oversegmentation rests on the assumption that the learned boundaries are reliable; yet no quantitative analysis of boundary precision/recall or comparison against a forced-alignment baseline is supplied, leaving the mechanism unverified.

    Authors: We acknowledge that verifying boundary reliability is key to substantiating the oversegmentation reduction. The revised manuscript now includes quantitative boundary precision and recall metrics evaluated against ground-truth chord change points, along with a comparison to a forced-alignment baseline using oracle boundaries. These results confirm that the model learns reliable segment boundaries that directly contribute to the observed improvements. revision: yes

  3. Referee: [Results tables] Results tables (assumed §4): The abstract asserts “notable gains for complex and infrequent chord types” but supplies neither per-class F1 scores, confusion matrices, nor error bars. Given the acknowledged data scarcity for non-triads, aggregate metrics alone cannot substantiate the specific claim.

    Authors: We agree that aggregate metrics are insufficient given class imbalance and data scarcity for non-triads. The revised version now reports per-class F1 scores across all chord types (with emphasis on non-triads), full confusion matrices, and error bars computed over multiple random seeds with statistical significance testing. These additions substantiate the specific gains for complex and infrequent chords. revision: yes
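The boundary metrics promised in response 2 can be sketched as follows; the tolerance-window matching scheme and the ±1-frame default are assumptions for illustration, not values taken from the paper.

```python
# Sketch of boundary precision/recall/F1 against ground-truth chord change
# points, with a tolerance window. The tol=1 (frames) default is an
# assumption, not a value from the paper.

def boundary_prf(pred, ref, tol=1):
    """pred, ref: sorted lists of boundary frame indices."""
    matched_ref = set()
    tp = 0
    for p in pred:
        # greedily match each predicted boundary to an unused reference
        # boundary within the tolerance window
        hit = next((r for r in ref
                    if abs(p - r) <= tol and r not in matched_ref), None)
        if hit is not None:
            matched_ref.add(hit)
            tp += 1
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# pred 4 matches ref 4; pred 31 matches ref 30 within tol; pred 10 is a
# false positive, so precision = 2/3, recall = 1.0, F1 = 0.8.
p, r, f = boundary_prf(pred=[4, 10, 31], ref=[4, 30], tol=1)
```

In practice one would also report these numbers at several tolerance values, since a single window can hide systematic early or late boundary placement.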

Circularity Check

0 steps flagged

No circularity: reformulation presented as independent modeling choice

full rationale

The paper reformulates automatic chord recognition as a segment-level sequence-to-sequence task with auto-regressive prediction at learned boundaries, introduces two token representations and encoder pre-training designed for time-aligned chord modeling, and reports experimental gains on chord recognition and segmentation (especially non-triads). This is a methodological design choice to address oversegmentation and data issues, not a derivation that reduces outputs to inputs by construction, a fitted parameter renamed as prediction, or a load-bearing self-citation chain. No equations, uniqueness theorems, or ansatzes are shown to be self-referential; the central claims rest on external experimental validation rather than internal equivalence. The derivation chain is therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is limited to the abstract; no explicit free parameters, invented entities, or detailed axioms are stated. The approach implicitly assumes that chord changes align with detectable segment boundaries and that standard sequence modeling techniques transfer to time-aligned chord data.

axioms (1)
  • domain assumption Chord changes in music audio occur at identifiable segment boundaries that can be modeled separately from frame-level labeling.
    This underpins the shift from frame-by-frame to segment-level prediction.

pith-pipeline@v0.9.0 · 5451 in / 1135 out tokens · 79423 ms · 2026-05-07T17:40:59.219564+00:00 · methodology


Reference graph

Works this paper leans on

27 extracted references · 2 canonical work pages · 2 internal anchors

  1. [1]

    INTRODUCTION Automatic Chord Recognition (ACR) is the task of identifying the sequence of chords in musical audio, where chords—a group of simultaneously played notes—form the harmonic foundation of a musical piece [1]. ACR remains an active area of research due to complex musical structures and diverse instrumentation, which make chord annotation tim...

  2. [2]

    PROPOSED METHOD 2.1. Problem Specification Most traditional ACR models predict the probability of chords at each time frame from an input spectrogram X_spec ∈ ℝ^(N_T × N_F), where N_T is the number of time frames and N_F is the number of frequency bins. This can be formulated as ŷ_i = argmax_{y_i ∈ V} P(y_i | X_spec) for i = 1, …, N_T, where V is a pre-defined chord v...

  3. [3]

    Experimental Setups Data. We use the same dataset as BTC [3], consisting of 471 pop songs with manually aligned audio and chord annotations

    EXPERIMENTS 3.1. Experimental Setups Data. We use the same dataset as BTC [3], consisting of 471 pop songs with manually aligned audio and chord annotations. A 5-fold cross-validation was conducted by splitting the whole data into five subsets. In each fold, one subset was used for validation and the remaining four subsets for training, applied identically to bo...

  4. [4]

    Our method carried out segment-level auto-regressive prediction, which alleviates oversegmentation, in contrast to traditional frame-level classification

    CONCLUSION In this paper, we proposed a seq2seq approach to automatic chord recognition using a Transformer encoder-decoder. Our method carried out segment-level auto-regressive prediction, which alleviates oversegmentation, in contrast to traditional frame-level classification. We also introduced two types of token representations for time-aligned chords...

  5. [5]

    20 years of automatic chord recognition from audio,

    Johan Pauwels, Ken O'Hanlon, Emilia Gómez Gutiérrez, and Mark Sandler, “20 years of automatic chord recognition from audio,” 2019

  6. [6]

    Attention is all you need,

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Lukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017

  7. [7]

    A bi-directional transformer for musical chord recognition,

    Jonggwon Park, Kyoyun Choi, Sungwook Jeon, Dokyun Kim, and Jonghun Park, “A bi-directional transformer for musical chord recognition,” in ISMIR, 2019, pp. 620–627

  8. [8]

    Harmony transformer: Incorporating chord segmentation into harmony recognition,

    Tsung-Ping Chen, Li Su, et al., “Harmony transformer: Incorporating chord segmentation into harmony recognition,” Neural Netw, vol. 12, pp. 15, 2019

  9. [9]

    Attend to chords: Improving harmonic analysis of symbolic music using transformer-based models,

    Tsung-Ping Chen and Li Su, “Attend to chords: Improving harmonic analysis of symbolic music using transformer-based models,” Transactions of the International Society for Music Information Retrieval, vol. 4, no. 1, pp. 1–14, 2021

  10. [10]

    BERT: pre-training of deep bidirectional transformers for language understanding,

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova, “BERT: pre-training of deep bidirectional transformers for language understanding,” in Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2019, pp. 4171–4186, Association for Computational Linguistics

  11. [11]

    A hybrid gaussian-hmm-deep learning approach for automatic chord estimation with very large vocabulary,

    Jun-qi Deng and Yu-Kwong Kwok, “A hybrid gaussian-hmm-deep learning approach for automatic chord estimation with very large vocabulary,” in ISMIR, 2016, pp. 812–818

  12. [12]

    Large-vocabulary chord transcription via chord structure decomposition,

    Junyan Jiang, Ke Chen, Wei Li, and Gus Xia, “Large-vocabulary chord transcription via chord structure decomposition,” in ISMIR, 2019, pp. 644–651

  13. [13]

    Structured training for large-vocabulary chord recognition,

    Brian McFee and Juan Pablo Bello, “Structured training for large-vocabulary chord recognition,” in ISMIR, 2017, pp. 188–194

  14. [14]

    Large vocabulary automatic chord estimation with an even chance training scheme,

    Jun-qi Deng and Yu-Kwong Kwok, “Large vocabulary automatic chord estimation with an even chance training scheme,” in ISMIR, 2017, pp. 531–536

  15. [15]

    Curriculum learning for imbalanced classification in large vocabulary automatic chord recognition,

    Luke O Rowe and George Tzanetakis, “Curriculum learning for imbalanced classification in large vocabulary automatic chord recognition,” in ISMIR, 2021, pp. 586–593

  16. [16]

    Improving the classification of rare chords with unlabeled data,

    Marcelo Bortolozzo, Rodrigo Schramm, and Claudio R Jung, “Improving the classification of rare chords with unlabeled data,” in ICASSP 2021-2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2021, pp. 3390–3394

  17. [17]

    Deep semi-supervised learning with contrastive learning in large vocabulary automatic chord recognition,

    Chen Li, Yu Li, Hui Song, and Lihua Tian, “Deep semi-supervised learning with contrastive learning in large vocabulary automatic chord recognition,” in 2023 IEEE 13th Annual Computing and Communication Workshop and Conference (CCWC). IEEE, 2023, pp. 1065–1069

  18. [18]

    Using online chord databases to enhance chord recognition,

    Matt McVicar, Yizhao Ni, Raul Santos-Rodriguez, and Tijl De Bie, “Using online chord databases to enhance chord recognition,” Journal of New Music Research, vol. 40, no. 2, pp. 139–152, 2011

  19. [19]

    Improving balance in automatic chord recognition with random forests,

    Jeff Miller, Ken O’Hanlon, and Mark B Sandler, “Improving balance in automatic chord recognition with random forests,” in 2022 30th European Signal Processing Conference (EUSIPCO). IEEE, 2022, pp. 244–248

  20. [20]

    On the futility of learning complex frame-level language models for chord recognition,

    Filip Korzeniowski and Gerhard Widmer, “On the futility of learning complex frame-level language models for chord recognition,” in Audio Engineering Society Conference: 2017 AES International Conference on Semantic Audio. Audio Engineering Society, 2017

  21. [21]

    Conditional random fields: Probabilistic models for segmenting and labeling sequence data,

    John Lafferty, Andrew McCallum, and Fernando CN Pereira, “Conditional random fields: Probabilistic models for segmenting and labeling sequence data,” 2001

  22. [22]

    A fully convolutional deep auditory model for musical chord recognition,

    Filip Korzeniowski and Gerhard Widmer, “A fully convolutional deep auditory model for musical chord recognition,” in 2016 IEEE 26th International Workshop on Machine Learning for Signal Processing (MLSP). IEEE, 2016, pp. 1–6

  23. [23]

    thesis, Department of Electronic Engineering, Queen Mary, University of London, 2010

    Christopher Harte, Towards automatic extraction of harmony information from music signals, Ph.D. thesis, Department of Electronic Engineering, Queen Mary, University of London, 2010

  24. [24]

    Calculation of a constant q spectral transform,

    Judith C Brown, “Calculation of a constant q spectral transform,” The Journal of the Acoustical Society of America, vol. 89, no. 1, pp. 425–434, 1991

  25. [25]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba, “Adam: A method for stochastic optimization,” CoRR, vol. abs/1412.6980, 2014

  26. [26]

    thesis, Citeseer, 2010

    Matthias Mauch, Automatic Chord Transcription from Audio Using Computational Models of Musical Context, Ph.D. thesis, Citeseer, 2010

  27. [27]

    Umap: Uniform manifold approximation and projection,

    Leland McInnes, John Healy, Nathaniel Saul, and Lukas Grossberger, “Umap: Uniform manifold approximation and projection,” The Journal of Open Source Software, vol. 3, no. 29, pp. 861, 2018