STRUM: A Spectral Transcription and Rhythm Understanding Model for End-to-End Generation of Playable Rhythm-Game Charts
Pith reviewed 2026-05-13 04:42 UTC · model grok-4.3
The pith
STRUM converts raw audio into playable Clone Hero charts for five instruments using a hybrid neural pipeline without metadata.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
STRUM is a multi-stage hybrid pipeline that maps raw audio to playable rhythm-game charts for drums, guitar, bass, vocals, and keys without oracle metadata. It employs a two-stage CRNN onset detector plus a six-model ensemble classifier for drums, neural onset detection with monophonic pitch tracking for guitar and bass, word-aligned ASR for vocals, and spectral keyboard detection for keys. On a 30-song benchmark filtered for high median 1-second drum-stem RMS after source separation, it attains the stated F1 scores at ±100 ms tolerance with a per-song global offset search, supported by an ablation of seven drum-pipeline components, a timing-distribution analysis, and a drum-classifier confusion matrix.
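The ±100 ms figure of merit can be made concrete with a minimal sketch of tolerance-based onset F1, assuming greedy nearest-neighbor matching (the paper's released evaluation code may match differently):

```python
def onset_f1(pred, ref, tol=0.100):
    """Greedy one-to-one matching of predicted onsets to reference onsets
    within +/- tol seconds; returns (precision, recall, f1).
    A sketch of the metric, not the authors' evaluation code."""
    used = [False] * len(ref)
    matched = 0
    for p in sorted(pred):
        # find the nearest unused reference onset within tolerance
        best_i, best_d = None, tol
        for i, r in enumerate(ref):
            d = abs(r - p)
            if not used[i] and d <= best_d:
                best_i, best_d = i, d
        if best_i is not None:
            used[best_i] = True
            matched += 1
    precision = matched / len(pred) if pred else 0.0
    recall = matched / len(ref) if ref else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

With `pred = [0.0, 1.0, 2.5]` and `ref = [0.05, 1.0, 2.0]`, the first two predictions match within 100 ms and the third does not, giving precision, recall, and F1 of 2/3 each.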
What carries the argument
The multi-stage hybrid pipeline that routes audio through instrument-specific neural onset detectors, pitch trackers, classifiers, and spectral analyzers to produce timed note events for game charts.
If this is right
- Charts become available for any song whose audio meets basic clarity thresholds rather than only those already transcribed by volunteers.
- Ablation results identify which drum-pipeline stages contribute measurable accuracy gains and which can be simplified.
- Per-song offset search corrects global timing mismatches between audio and generated charts.
- The released benchmark and timing-distribution analysis supply a standardized testbed and reference for future chart-generation work.
- Instrument-specific modules can be swapped or retrained independently while keeping the rest of the pipeline intact.
Where Pith is reading between the lines
- The same pipeline structure could be adapted to generate charts for other rhythm-game engines or mobile titles that use similar note layouts.
- Performance on vocals points to a clear next target: replacing the current ASR step with models trained on singing rather than speech.
- Combining STRUM with existing source-separation models would allow chart generation directly from mixed stereo tracks without separate stems.
- If the drum-onset detector generalizes, the system could support real-time chart previews during live performances or DJ sets.
Load-bearing premise
The 30 songs chosen for strong drum-stem energy after source separation represent the audio conditions and genres users will actually want to convert into charts.
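The screening statistic behind this premise can be sketched directly: median RMS over non-overlapping 1-second windows of the separated drum stem. The windowing details and acceptance threshold here are illustrative, not the paper's exact values:

```python
import numpy as np

def median_window_rms(drum_stem, sr, win_s=1.0):
    """Median RMS over non-overlapping windows of a mono drum stem,
    e.g. the 'drums' output of htdemucs_6s. A sketch of the screening
    statistic; the paper's exact windowing and threshold may differ."""
    win = int(sr * win_s)
    n = len(drum_stem) // win
    frames = drum_stem[: n * win].reshape(n, win)  # one row per window
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return float(np.median(rms))
```

A song would pass screening when this statistic exceeds some cutoff, which is exactly why the benchmark favors loud, well-separated drums.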
What would settle it
Running the released STRUM models on a larger, unscreened collection of songs (lower audio quality, different genres, live recordings) and checking whether F1 scores stay within 10 percent of the reported values or drop sharply.
Original abstract
We present STRUM (Spectral Transcription and Rhythm Understanding Model), an audio-to-chart pipeline that converts raw recordings into playable Clone Hero / YARG charts for drums, guitar, bass, vocals, and keys without any oracle metadata. STRUM is a multi-stage hybrid: a two-stage CRNN onset detector and a six-model ensemble classifier for drums; neural onset detectors with monophonic pitch tracking for guitar and bass; word-aligned ASR for vocals; and spectral keyboard detection for keys. We evaluate on a 30-song in-envelope benchmark constructed by screening candidate songs on a single audio-quality criterion -- the median 1-second drum-stem RMS after htdemucs_6s source separation. On this benchmark STRUM achieves drums onset F1 = 0.838, bass F1 = 0.694, guitar F1 = 0.651, and vocals F1 = 0.539 at a +/- 100 ms tolerance with per-song global offset search. We report a complete ablation of seven drum-pipeline components with paired per-song Wilcoxon tests, an analysis of ground-truth-to-audio timing distributions in community Clone Hero charts, and a per-class confusion matrix for the drum classifier. Code, model weights, and the full benchmark manifest are released.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces STRUM, a hybrid audio-to-chart pipeline for generating playable rhythm game charts from raw audio recordings for drums, guitar, bass, vocals, and keys. It uses a two-stage CRNN for drum onsets and ensemble classifier, neural methods for guitar/bass, ASR for vocals, and spectral detection for keys. Evaluation is performed on a 30-song benchmark selected based on high median 1-second drum-stem RMS after htdemucs source separation, achieving F1 scores of 0.838 for drums, 0.694 for bass, 0.651 for guitar, and 0.539 for vocals at ±100 ms tolerance with per-song global offset search. The authors provide ablations with statistical tests, timing analysis, confusion matrices, and release all code, weights, and the benchmark manifest.
Significance. Should the reported performance prove robust on more diverse and unfiltered audio corpora, this work would represent a meaningful advance in end-to-end chart generation for rhythm games, potentially reducing manual effort in community chart creation. The public release of code, model weights, and the complete benchmark manifest is a notable strength that supports reproducibility and allows independent verification of the results and ablations.
Major comments (3)
- §4 (Benchmark Construction): The 30-song benchmark is constructed by screening solely on high median 1-second drum-stem RMS after htdemucs_6s source separation. This criterion selects for tracks with loud, clean drums, which simplifies onset detection and may inflate the headline F1 scores (drums F1 = 0.838). No performance results are reported on unfiltered or standard corpora, undermining claims of general applicability.
- Evaluation Protocol: The per-song global offset search is applied during evaluation to align predictions with ground truth. This step is not available in a blind, real-world deployment scenario for chart generation and should be distinguished from the core model's performance.
- Pipeline Description: The htdemucs_6s model is used both to construct the benchmark (via drum-stem RMS screening) and as a pre-trained component in the STRUM pipeline for source separation. This overlap creates a potential closed loop that could mask domain shift or overfitting to the screened songs.
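The per-song global offset search criticized above can be sketched as a grid search over a single time shift applied to all predictions, keeping the shift that maximizes onset F1. The grid bounds and step here are illustrative; the paper's search range is not restated:

```python
import numpy as np

def best_global_offset(pred, ref, tol=0.100, search_ms=200, step_ms=5):
    """Grid-search a single per-song offset (seconds) added to all
    predicted onsets, returning the offset maximizing onset F1.
    Greedy first-match within tolerance; bounds/step are illustrative."""
    def f1_at(offset):
        shifted = [p + offset for p in pred]
        used = [False] * len(ref)
        matched = 0
        for p in shifted:
            for i, r in enumerate(ref):
                if not used[i] and abs(r - p) <= tol:
                    used[i] = True
                    matched += 1
                    break
        prec = matched / len(shifted) if shifted else 0.0
        rec = matched / len(ref) if ref else 0.0
        return 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    offsets = np.arange(-search_ms, search_ms + step_ms, step_ms) / 1000.0
    return max(offsets, key=f1_at)
```

Because the offset is fit against ground truth for each song, it is exactly the kind of oracle alignment the referee says is unavailable at deployment time.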
Minor comments (1)
- Abstract: The term 'in-envelope benchmark' is used without a clear definition or reference to its meaning in the context of the screening process.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments. We address each major comment below, indicating where revisions have been made to improve clarity and acknowledge limitations.
Point-by-point responses
Referee: §4 (Benchmark Construction): The 30-song benchmark is constructed by screening solely on high median 1-second drum-stem RMS after htdemucs_6s source separation. This criterion selects for tracks with loud, clean drums, which simplifies onset detection and may inflate the headline F1 scores (drums F1 = 0.838). No performance results are reported on unfiltered or standard corpora, undermining claims of general applicability.
Authors: We appreciate the referee pointing out this selection bias. The benchmark was deliberately constructed around songs yielding high-quality drum stems after source separation to ensure reliable ground-truth timing for evaluation; this is explicitly described as an 'in-envelope' benchmark in the manuscript. We agree that the criterion favors tracks with prominent drums and that headline scores may not generalize to noisier or less-separated audio. In the revised manuscript we have expanded the limitations paragraph in Section 4 to state this bias explicitly and to avoid any implication of broad applicability. We have not added results on unfiltered corpora, as constructing and annotating such a set would require substantial new effort outside the current revision scope. revision: partial
Referee: Evaluation Protocol: The per-song global offset search is applied during evaluation to align predictions with ground truth. This step is not available in a blind, real-world deployment scenario for chart generation and should be distinguished from the core model's performance.
Authors: We concur that the per-song global offset search is an evaluation-only alignment step and is unavailable in blind deployment. The manuscript already notes its use, but we have revised the evaluation section to more clearly separate the core model output from the offset-aligned results. A new table now reports F1 scores computed without the global offset search, allowing readers to assess standalone performance directly. revision: yes
Referee: Pipeline Description: The htdemucs_6s model is used both to construct the benchmark (via drum-stem RMS screening) and as a pre-trained component in the STRUM pipeline for source separation. This overlap creates a potential closed loop that could mask domain shift or overfitting to the screened songs.
Authors: htdemucs_6s is used strictly as a fixed, pre-trained, off-the-shelf model; none of its parameters were updated using the benchmark. The screening step merely filters candidate tracks on drum-stem energy, while the pipeline applies the same frozen model to separate test audio. This does not constitute a training loop or permit overfitting to the screened songs. We have nevertheless added explicit wording in the pipeline overview to emphasize that htdemucs_6s remains an unmodified external component and to note the possibility of domain effects when the same separator is used for both selection and inference. revision: partial
Circularity Check
No circularity; evaluation is direct measurement against external ground truth
Full rationale
The paper describes a hybrid multi-stage pipeline (CRNN onset detectors, ensemble classifiers, pitch tracking, ASR, spectral detection) and reports F1 scores computed directly from model outputs versus independent community Clone Hero ground-truth charts on a transparently filtered 30-song set. The benchmark construction criterion (median 1-second drum-stem RMS after htdemucs_6s separation) is an explicit selection step that affects difficulty but does not redefine any fitted parameter, model output, or prediction as its own input. No equations reduce by construction, no self-citations supply load-bearing uniqueness theorems, and no ansatz or renaming of known results occurs. The derivation from raw audio to playable charts remains externally falsifiable via the released benchmark manifest and annotations.
Axiom & Free-Parameter Ledger
Free parameters (2)
- onset detection thresholds
- ensemble classifier decision rules
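The first free parameter can be illustrated with a generic peak-picking rule on a frame-level onset activation curve, a common pattern in onset detectors. The threshold, hop size, and minimum gap below are assumed values, not the authors' settings:

```python
import numpy as np

def pick_onsets(activation, threshold=0.5, hop_s=0.010, min_gap_s=0.05):
    """Turn a frame-level onset activation curve into onset times:
    keep local maxima at or above `threshold`, spaced at least
    `min_gap_s` apart. All three parameters are illustrative."""
    onsets = []
    last = -np.inf
    for i in range(1, len(activation) - 1):
        a = activation[i]
        # local maximum above threshold
        if a >= threshold and a >= activation[i - 1] and a > activation[i + 1]:
            t = i * hop_s
            if t - last >= min_gap_s:
                onsets.append(t)
                last = t
    return onsets
```

Sweeping `threshold` trades precision against recall, which is why it appears in the ledger as a free parameter rather than a learned quantity.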
Axioms (1)
- Domain assumption: htdemucs_6s source separation produces reliable drum-stem RMS values for audio-quality screening.
Reference graph
Works this paper leans on
- [1] Yu-Te Wu, Yin-Jyun Luo, Tsung-Ping Chen, I-Chieh Wei, Jui-Yang Hsu, Yi-Chin Chuang, and Li Su. Omnizart: A general toolbox for automatic music transcription. Journal of Open Source Software, 6(68):3391, 2021. https://doi.org/10.21105/joss.03391
- [2] Josh Gardner, Ian Simon, Ethan Manilow, Curtis Hawthorne, and Jesse Engel. MT3: Multi-task multitrack music transcription. International Conference on Learning Representations (ICLR), 2022. https://arxiv.org/abs/2111.03017
- [3] Carl Southall, Ryan Stables, and Jason Hockman. Automatic drum transcription using bidirectional recurrent neural networks. Proc. International Society for Music Information Retrieval Conference (ISMIR), pages 591–597, 2016.
- [4] thejorseman. CloneCharter: A Clone Hero chart generator. Hugging Face Model Repository, 2026. https://huggingface.co/thejorseman/CloneCharter
- [5] Simon Rouard, Francisco Massa, and Alexandre Défossez. Hybrid Transformers for music source separation. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023. https://arxiv.org/abs/2211.08553
- [6] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. International Conference on Machine Learning (ICML), 2023. https://arxiv.org/abs/2212.04356
- [7] Matthias Mauch and Simon Dixon. pYIN: A fundamental frequency estimator using probabilistic threshold distributions. IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 659–663, 2014.
- [8] Brian McFee, Colin Raffel, Dawen Liang, Daniel P. W. Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. librosa: Audio and music signal analysis in Python. Proc. of the 14th Python in Science Conference (SciPy), pages 18–25, 2015.