HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark

Seonghyeon Go; Yumin Kim

arxiv: 2606.01686 · v1 · pith:URFOORUWnew · submitted 2026-06-01 · 💻 cs.SD · cs.AI

HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark

Seonghyeon Go , Yumin Kim This is my paper

Pith reviewed 2026-06-28 13:11 UTC · model grok-4.3

classification 💻 cs.SD cs.AI

keywords AI music detectionmusic production trackinghybrid AI-human audiodataset benchmarkgenerative audio evaluationaudio forensics

0 comments

The pith

HAIM dataset enables tracking of specific AI integration stages in music production instead of binary AI-or-human classification.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current detectors classify music tracks only as AI-generated or human-made, yet real production mixes the two through vocal synthesis, arrangement, mastering, and post-processing. The paper introduces HAIM, a dataset that assigns labels to distinct production stages and AI intervention points, including hybrid cases where AI and human work combine. Evaluation of existing detectors on HAIM shows they fail systematically on these mixed examples. Releasing the dataset creates a benchmark that measures granular AI use across the workflow rather than a single yes-no decision. This approach aligns evaluation with how generative tools are actually applied in contemporary music creation.

Core claim

The paper defines AI Music Tracking as the task of identifying specific points of AI intervention across the full music production spectrum and releases HAIM as a dataset containing diverse labels for those stages, including hybrid production and agent-level tracking. When state-of-the-art detectors are tested on HAIM they exhibit systemic flaws that binary classification cannot capture, particularly with adversarial post-processing such as human mastering applied to AI tracks.

What carries the argument

HAIM dataset with labels that isolate stages of AI intervention in music production.

If this is right

Evaluation metrics must move from overall binary accuracy to per-stage detection performance.
Detectors will need to handle cases where human mastering is applied to AI-generated material or vice versa.
Research focus shifts toward identifying AI contributions at specific workflow points such as arrangement or mastering.
Adversarial tactics that combine AI and human steps become measurable rather than invisible to binary systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Standardized taxonomies of production stages may become necessary for future datasets and detectors.
Commercial music platforms could adopt stage-level disclosure requirements once tracking becomes reliable.
The benchmark opens the possibility of training models that output production-stage probabilities rather than single binary outputs.

Load-bearing premise

The production-stage labels assigned to tracks in HAIM accurately reflect consistent real-world patterns of AI use.

What would settle it

Re-labeling a random subset of HAIM tracks by independent music production experts and measuring low agreement with the original stage labels would show the benchmark does not capture actual integration patterns.

Figures

Figures reproduced from arXiv: 2606.01686 by Seonghyeon Go, Yumin Kim.

**Figure 1.** Figure 1: Comparison of AI Music Detection and AI Music Tracking. Detection focuses on wholetrack binary classification, whereas tracking operationalizes detail analysis to identify the nuances of AI integration. Figure shows the one case of analysis, multilabel classification. as either entirely human or entirely synthetic—is fundamentally misaligned with the reality of modern music production. In contemporary… view at source ↗

**Figure 2.** Figure 2: Overview of the HAIM dataset generation pipeline. [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

**Figure 3.** Figure 3: Spectral feature statistics across all HAIM categories. Bar heights indicate per-track means; error bars show standard deviations. Category Centroid (Hz) Flatness RMS Harm. R. Bandwidth (Hz) Contrast (dB) A1 1847 ± 720 .024 ± .031 .146 ± .080 .624 ± .188 1982 ± 591 23.1 ± 2.1 A2 1924 ± 770 .016 ± .018 .113 ± .049 .614 ± .190 2125 ± 562 23.2 ± 1.7 B1 1869 ± 743 .023 ± .028 .201 ± .084 .670 ± .196 2001 ± 613… view at source ↗

**Figure 4.** Figure 4: Kernel density estimates of seven acoustic features comparing two post-production pipelines. Left column: human music before and after AI mastering (A1 → B1). Right column: AI-generated music before and after human-referenced AI style transfer (A1 → B2), and human post-production (A2 → B3, B4). modified variant as MuQ-FST. 5.1 Architecture Stage 1: Segment Detection. An input track is divided into non-ove… view at source ↗

**Figure 5.** Figure 5: Temporal AI probability curves (10 s window, 1 s hop) for representative C1 (top) and C2 (bottom) tracks. Green/red shading: ground-truth human/AI segments. Dashed red: ground-truth boundaries; dashdotted blue: predicted boundaries. Orange: crossfade zone (C2 only). cating that the model cannot yet disentangle these roles from the audio signal alone, especially in Lyricist. As shown in B6, where lyrics a… view at source ↗

read the original abstract

As generative platforms such as Suno and Udio reach human-grade audio quality, the scope of AI's utility has expanded across the entire music production workflow. Beyond simple track generation, these advancements have catalyzed the adoption of AI-driven methodologies in diverse forms. These include vocal synthesis, arrangement, and professional mastering. However, current detection research remains largely confined to a binary `AI-or-human' paradigm. It fails to reflect the realities of contemporary music production workflows. In real-world production, AI tools are increasingly used to refine or master human-produced tracks, and human engineers likewise post-process AI-generated material to ensure professional quality. Moreover, users often employ adversarial tactics to bypass AI detectors, such as applying human mastering to AI-generated tracks. This creates a grey area that a simple binary classification fails to capture. In this paper, we define and investigate ``AI Music Tracking'': the challenge of identifying specific AI integration across the multifaceted spectrum of music production. To this end, we introduce HAIM, a dataset with diverse labels for stages of music production. It is designed to isolate stages of AI intervention, including hybrid production and agent-level tracking. Our evaluation of state-of-the-art detectors reveals systemic flaws. By releasing HAIM, we propose a new benchmark that shifts the field beyond binary classification toward a granular, structured evaluation of AI music.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HAIM introduces a dataset and framing for multi-stage AI music tracking but supplies zero details on label construction or validation.

read the letter

The paper's main move is releasing HAIM to push AI music detection past binary AI-or-human labels toward tracking specific production stages like vocal synthesis, arrangement, mastering, hybrids, and adversarial post-processing. That framing matches real workflows better than current detectors and is the clearest new element.

It correctly identifies the gap: binary classifiers miss cases where AI refines human tracks or humans clean up AI output. The proposal for granular, stage-level evaluation follows directly from that observation.

The weakness is straightforward and central. The abstract and stress-test note give no information on how labels were assigned, what sources were used for ground truth, inter-annotator checks, expert review, or any test against actual production logs. Without those steps the benchmark cannot be shown to capture the claimed grey areas rather than arbitrary buckets. The reported flaws in existing detectors therefore rest on unverified categories.

No equations or fitted parameters appear, so the circularity burden is zero. The work is a dataset and benchmark proposal, not a derivation.

This is for researchers building or testing audio content detectors who need staged rather than binary test sets. A reader focused on generative media benchmarks would find the motivation useful if the label process is later documented.

The topic is timely enough that a serious editor should send it to peer review rather than desk reject, even though the current version will need substantial revision on methods and validation.

Referee Report

1 major / 0 minor

Summary. The paper introduces the HAIM dataset, which annotates music tracks with granular labels for stages of AI integration in production workflows (including vocal synthesis, arrangement, mastering, hybrid human-AI, and adversarial post-processing). It evaluates existing state-of-the-art AI music detectors on this benchmark and concludes that they exhibit systemic flaws, proposing HAIM as a new standard that moves the field beyond binary AI-vs-human classification toward structured tracking of specific integration points.

Significance. If the dataset labels prove accurate and representative of real production practices, HAIM could meaningfully advance AI music forensics by enabling evaluation of detectors on realistic hybrid and post-processed cases rather than idealized binary distinctions. The contribution is primarily a dataset and benchmark proposal rather than a new method or theoretical result.

major comments (1)

[Abstract / Dataset description] The manuscript provides no information on dataset construction, including annotation methodology, source of ground truth for production-stage labels, inter-annotator agreement, expert validation, or any check against actual industry workflows (Abstract and any dataset section). This is load-bearing for the central claim, as the benchmark's ability to demonstrate 'systemic flaws' in detectors and to capture 'grey area' phenomena depends entirely on the labels accurately reflecting real-world AI integration patterns rather than arbitrary categorizations.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting the importance of dataset construction details, which are indeed central to validating the benchmark. We agree that additional information is required and will incorporate it in the revision.

read point-by-point responses

Referee: [Abstract / Dataset description] The manuscript provides no information on dataset construction, including annotation methodology, source of ground truth for production-stage labels, inter-annotator agreement, expert validation, or any check against actual industry workflows (Abstract and any dataset section). This is load-bearing for the central claim, as the benchmark's ability to demonstrate 'systemic flaws' in detectors and to capture 'grey area' phenomena depends entirely on the labels accurately reflecting real-world AI integration patterns rather than arbitrary categorizations.

Authors: We acknowledge that the manuscript as submitted does not include sufficient detail on dataset construction. In the revised version we will add a dedicated Dataset Construction section that specifies: (1) the annotation protocol and guidelines provided to annotators, (2) the provenance of the ground-truth production-stage labels (including how hybrid and post-processed cases were identified), (3) inter-annotator agreement statistics (Cohen’s kappa or equivalent), (4) any expert validation steps performed, and (5) explicit mapping to documented industry workflows. These additions will directly address the load-bearing concern and allow readers to assess whether the labels reflect realistic AI-integration patterns. revision: yes

Circularity Check

0 steps flagged

No circularity: dataset release with no derivations or self-referential claims

full rationale

The paper is a dataset and benchmark proposal paper with no equations, fitted parameters, derivation chains, or load-bearing self-citations. Its central contribution is the release of HAIM labels for production stages, and the claim that this enables granular evaluation rests on the dataset itself rather than any reduction to prior inputs or self-defined constructs. No steps match the enumerated circularity patterns; the work is self-contained as an empirical resource release.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Dataset paper with no mathematical model; no free parameters, axioms, or invented entities required for the central claim.

pith-pipeline@v0.9.1-grok · 5769 in / 876 out tokens · 18535 ms · 2026-06-28T13:11:01.250522+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

20 extracted references · 1 linked inside Pith

[1]

Afchar, D., Meseguer-Brocal, G., Akesbi, K., and Hen- nequin, R. (2025a). A fourier explanation of ai- music artifacts.arXiv preprint arXiv:2506.19108

arXiv
[2]

Afchar, D., Meseguer-Brocal, G., and Hennequin, R. (2024). Detecting music deepfakes is easy but ac- tually hard.arXiv preprint arXiv:2405.04181

arXiv 2024
[3]

Afchar, D., Meseguer-Brocal, G., and Hennequin, R. (2025b). Ai-generated music detection and its challenges. InICASSP 2025-2025 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE

2025
[4]

Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self- supervised learning of speech representations.Ad- vances in neural information processing systems, 33:12449–12460

2020
[5]

Batra, A., Sharma, D., Thukral, K., Bhatia, R., Batra, N., and Gautam, A. (2025). Melody or machine: Detecting synthetic music with dual-stream con- trastive learning.arXiv preprint arXiv:2512.00621

arXiv 2025
[6]

Serra, X. (2019). The mtg-jamendo dataset for automatic music tagging. InMachine learning for music discovery workshop, international conference on machine learning (ICML 2019), pages 1–3. Long

2019
[7]

Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Syn- naeve, G., Adi, Y., and Défossez, A. (2023). Sim- ple and controllable music generation. volume 36, pages 47704–47720. Deezer (2025). 28% of all delivered music is now fully AI-generated.https://newsroom-deezer. com/2025/09/28-fully-ai-generated-music/. Défossez, A. (2021). Hybrid spectrogram and 13...

arXiv 2023
[8]

Schedl, M., and Hennequin, R. (2025). Ai- generated song detection via lyrics transcripts. arXiv preprint arXiv:2506.18488

arXiv 2025
[9]

Gong, J., Song, Y., Zhao, W., Wang, S., Xu, S., Guo, J., and Yang, X. (2026). Ace-step 1.5: Pushing the boundaries of open-source music generation.arXiv preprint arXiv:2602.00744. Grötschla, F., Solak, A., Lanzendörfer, L. A., and Wat- tenhofer, R. (2025). Benchmarking music gener- ation models and metrics via human preference studies. InICASSP 2025-2025 ...

arXiv 2026
[10]

Kim, H., Jung, J., Jeong, D., and Nam, J. (2024). K- pop lyric translation: Dataset, analysis, and neural- modelling. InProceedings of the 2024 Joint Inter- national Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9974–9987

2024
[11]

Kim, W., Wi, H., Park, S., Kim, T., Keum, S., Kim, K., Kim, T., Jung, J., Kim, T., Guerrero, G., et al. (2025). From generation to attribution: Music ai agent architectures for the post-streaming era. arXiv preprint arXiv:2510.20276

arXiv 2025
[12]

and Go, S

Kim, Y. and Go, S. (2026). Fusion segment trans- former: Bi-directional attention guided fusion net- work for ai-generated music detection.arXiv preprint arXiv:2601.13647

arXiv 2026
[13]

A., Liao, W.-H., Uhlich, S., Lee, K., and Mitsufuji, Y

Koo, J., Martínez-Ramírez, M. A., Liao, W.-H., Uhlich, S., Lee, K., and Mitsufuji, Y. (2023). Music mixing style transfer: A contrastive learning approach to disentangle audio effects. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE

2023
[14]

Li, Y., Yuan, R., Zhang, G., Ma, Y., Chen, X., Yin, H., Xiao, C., Lin, C., Ragni, A., Benetos, E., et al. (2024). Mert: Acoustic music understand- ing model with large-scale self-supervised training. InInternational Conference on Learning Representa- tions, volume 2024, pages 12181–12204. Melechovsky , J., Mehrish, A., Roy , A., and Herremans, D. (2025). ...

Pith/arXiv arXiv 2024
[15]

and Xie, S

Peebles, W. and Xie, S. (2023). Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 4195–4205

2023
[16]

Rahman, A., Hakim, Z. I. A., Sarker, N. H., Paul, B., and Fattah, A. (2025). Sonics: Synthetic or not- identifying counterfeit songs. InInternational Con- ference on Learning Representations, volume 2025, pages 21421–21451. Skywork AI (2024). Mureka: AI music generation plat- form.https://www.mureka.ai

2025
[17]

Suno, Inc. (2024). Suno AI music generator.https: //suno.com. Udio (2024). Udio AI music generation platform. https://www.udio.com

2024
[18]

C., Sturm, B

Vila, L. C., Sturm, B. L., Casini, L., and Dalmazzo, D. (2025). The ai music arms race: On the detection of ai-generated music.Transactions of the Interna- tional Society for Music Information Retrieval, 8(1)

2025
[19]

Zang, Y., Zhang, Y., Heydari, M., and Duan, Z. (2024). Singfake: Singing voice deepfake detec- tion. InICASSP 2024-2024 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP), pages 12156–12160. IEEE

2024
[20]

Luo, Y., Tan, W., and Chen, X. (2025). Muq: Self- supervised music representation learning with mel residual vector quantization.IEEE Transactions on

2025

[1] [1]

Afchar, D., Meseguer-Brocal, G., Akesbi, K., and Hen- nequin, R. (2025a). A fourier explanation of ai- music artifacts.arXiv preprint arXiv:2506.19108

arXiv

[2] [2]

Afchar, D., Meseguer-Brocal, G., and Hennequin, R. (2024). Detecting music deepfakes is easy but ac- tually hard.arXiv preprint arXiv:2405.04181

arXiv 2024

[3] [3]

Afchar, D., Meseguer-Brocal, G., and Hennequin, R. (2025b). Ai-generated music detection and its challenges. InICASSP 2025-2025 IEEE Interna- tional Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE

2025

[4] [4]

Baevski, A., Zhou, Y., Mohamed, A., and Auli, M. (2020). wav2vec 2.0: A framework for self- supervised learning of speech representations.Ad- vances in neural information processing systems, 33:12449–12460

2020

[5] [5]

Batra, A., Sharma, D., Thukral, K., Bhatia, R., Batra, N., and Gautam, A. (2025). Melody or machine: Detecting synthetic music with dual-stream con- trastive learning.arXiv preprint arXiv:2512.00621

arXiv 2025

[6] [6]

Serra, X. (2019). The mtg-jamendo dataset for automatic music tagging. InMachine learning for music discovery workshop, international conference on machine learning (ICML 2019), pages 1–3. Long

2019

[7] [7]

Copet, J., Kreuk, F., Gat, I., Remez, T., Kant, D., Syn- naeve, G., Adi, Y., and Défossez, A. (2023). Sim- ple and controllable music generation. volume 36, pages 47704–47720. Deezer (2025). 28% of all delivered music is now fully AI-generated.https://newsroom-deezer. com/2025/09/28-fully-ai-generated-music/. Défossez, A. (2021). Hybrid spectrogram and 13...

arXiv 2023

[8] [8]

Schedl, M., and Hennequin, R. (2025). Ai- generated song detection via lyrics transcripts. arXiv preprint arXiv:2506.18488

arXiv 2025

[9] [9]

Gong, J., Song, Y., Zhao, W., Wang, S., Xu, S., Guo, J., and Yang, X. (2026). Ace-step 1.5: Pushing the boundaries of open-source music generation.arXiv preprint arXiv:2602.00744. Grötschla, F., Solak, A., Lanzendörfer, L. A., and Wat- tenhofer, R. (2025). Benchmarking music gener- ation models and metrics via human preference studies. InICASSP 2025-2025 ...

arXiv 2026

[10] [10]

Kim, H., Jung, J., Jeong, D., and Nam, J. (2024). K- pop lyric translation: Dataset, analysis, and neural- modelling. InProceedings of the 2024 Joint Inter- national Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 9974–9987

2024

[11] [11]

Kim, W., Wi, H., Park, S., Kim, T., Keum, S., Kim, K., Kim, T., Jung, J., Kim, T., Guerrero, G., et al. (2025). From generation to attribution: Music ai agent architectures for the post-streaming era. arXiv preprint arXiv:2510.20276

arXiv 2025

[12] [12]

and Go, S

Kim, Y. and Go, S. (2026). Fusion segment trans- former: Bi-directional attention guided fusion net- work for ai-generated music detection.arXiv preprint arXiv:2601.13647

arXiv 2026

[13] [13]

A., Liao, W.-H., Uhlich, S., Lee, K., and Mitsufuji, Y

Koo, J., Martínez-Ramírez, M. A., Liao, W.-H., Uhlich, S., Lee, K., and Mitsufuji, Y. (2023). Music mixing style transfer: A contrastive learning approach to disentangle audio effects. InICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE

2023

[14] [14]

Li, Y., Yuan, R., Zhang, G., Ma, Y., Chen, X., Yin, H., Xiao, C., Lin, C., Ragni, A., Benetos, E., et al. (2024). Mert: Acoustic music understand- ing model with large-scale self-supervised training. InInternational Conference on Learning Representa- tions, volume 2024, pages 12181–12204. Melechovsky , J., Mehrish, A., Roy , A., and Herremans, D. (2025). ...

Pith/arXiv arXiv 2024

[15] [15]

and Xie, S

Peebles, W. and Xie, S. (2023). Scalable diffusion models with transformers. InProceedings of the IEEE/CVF international conference on computer vi- sion, pages 4195–4205

2023

[16] [16]

Rahman, A., Hakim, Z. I. A., Sarker, N. H., Paul, B., and Fattah, A. (2025). Sonics: Synthetic or not- identifying counterfeit songs. InInternational Con- ference on Learning Representations, volume 2025, pages 21421–21451. Skywork AI (2024). Mureka: AI music generation plat- form.https://www.mureka.ai

2025

[17] [17]

Suno, Inc. (2024). Suno AI music generator.https: //suno.com. Udio (2024). Udio AI music generation platform. https://www.udio.com

2024

[18] [18]

C., Sturm, B

Vila, L. C., Sturm, B. L., Casini, L., and Dalmazzo, D. (2025). The ai music arms race: On the detection of ai-generated music.Transactions of the Interna- tional Society for Music Information Retrieval, 8(1)

2025

[19] [19]

Zang, Y., Zhang, Y., Heydari, M., and Duan, Z. (2024). Singfake: Singing voice deepfake detec- tion. InICASSP 2024-2024 IEEE International Con- ference on Acoustics, Speech and Signal Processing (ICASSP), pages 12156–12160. IEEE

2024

[20] [20]

Luo, Y., Tan, W., and Chen, X. (2025). Muq: Self- supervised music representation learning with mel residual vector quantization.IEEE Transactions on

2025