Embedding-Based Intrusive Evaluation Metrics for Musical Source Separation Using MERT Representations
Pith reviewed 2026-05-09 23:20 UTC · model grok-4.3
The pith
MERT embedding metrics correlate more strongly with perceptual audio quality than BSS-Eval in musical source separation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce and evaluate intrusive metrics computed directly on MERT embeddings, specifically MSE between reference and estimated source embeddings plus an intrusive Fréchet Audio Distance. Experiments on two independent datasets demonstrate that these metrics correlate more strongly with perceptual audio quality ratings from listening tests than conventional BSS-Eval metrics, and the result holds across all analyzed stem and model types.
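The two metrics can be made concrete with a short sketch. This is not the authors' code: the MERT extraction step is omitted, and random arrays stand in for frame-level embeddings of the reference and separated signals. The embedding MSE compares time-aligned frames directly; the intrusive FAD fits a Gaussian to each signal's embedding frames and takes the Fréchet distance between the two Gaussians.

```python
# Illustrative sketch (not the authors' implementation) of the two proposed
# metrics, computed on pre-extracted embedding sequences. In real use these
# arrays would be frame-level MERT embeddings of the reference and estimate.
import numpy as np
from scipy.linalg import sqrtm

def embedding_mse(ref_emb, est_emb):
    """MSE between time-aligned embedding sequences of shape (n_frames, dim)."""
    return float(np.mean((ref_emb - est_emb) ** 2))

def intrusive_fad(ref_emb, est_emb, eps=1e-6):
    """Frechet distance between Gaussians fitted to the two embedding sets:
    ||mu_r - mu_e||^2 + Tr(C_r + C_e - 2 (C_r C_e)^{1/2})."""
    mu_r, mu_e = ref_emb.mean(axis=0), est_emb.mean(axis=0)
    c_r = np.cov(ref_emb, rowvar=False) + eps * np.eye(ref_emb.shape[1])
    c_e = np.cov(est_emb, rowvar=False) + eps * np.eye(est_emb.shape[1])
    covmean = sqrtm(c_r @ c_e)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_e
    return float(diff @ diff + np.trace(c_r + c_e - 2.0 * covmean))

rng = np.random.default_rng(0)
ref = rng.standard_normal((500, 8))               # stand-in for embeddings
est = ref + 0.1 * rng.standard_normal((500, 8))   # mildly degraded estimate
print(embedding_mse(ref, est))   # small positive value for the degraded estimate
print(intrusive_fad(ref, ref))   # ~0 for identical inputs
```

Both metrics are intrusive in the sense that they require the reference embeddings; the FAD variant discards temporal alignment and compares only the distributions of frames, which is why the two can disagree.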
What carries the argument
MERT embeddings, the latent representations produced by a large self-supervised music understanding model, used as the space in which to compute intrusive distance measures between reference and separated audio.
If this is right
- These embedding metrics provide a more perceptually aligned automatic way to evaluate musical source separation performance.
- The stronger correlation holds consistently across different audio stems such as vocals and accompaniment and across various separation models.
- Improved automatic metrics could allow faster iteration on new source separation algorithms by reducing dependence on repeated human listening tests.
- The approach suggests that distance measures in self-supervised embedding spaces can substitute for signal-level metrics in audio evaluation.
Where Pith is reading between the lines
- If the result generalizes, evaluation pipelines in music information retrieval could shift from signal-based to representation-based intrusive metrics.
- The finding points to a possible link between self-supervised pretraining objectives and human auditory perception that could be tested on other audio tasks.
- Combining MERT embeddings with embeddings from additional models might further improve correlation with perceptual ratings.
Load-bearing premise
MERT embeddings capture the perceptual features relevant to audio quality, and listening-test ratings serve as a reliable gold standard, without further validation of their consistency or generalizability.
What would settle it
A new listening test on a different dataset or set of separation models in which BSS-Eval metrics show equal or higher correlation with the ratings than the MERT-based MSE and intrusive FAD.
Original abstract
Evaluation of musical source separation (MSS) has traditionally relied on Blind Source Separation Evaluation (BSS-Eval) metrics. However, recent work suggests that BSS-Eval metrics exhibit low correlation with perceptual audio quality ratings from a listening test, which is considered the gold standard evaluation method. As an alternative approach in singing voice separation, embedding-based intrusive metrics that leverage latent representations from large self-supervised audio models such as Music undERstanding with large-scale self-supervised Training (MERT) embeddings have been introduced. In this work, we analyze the correlation of perceptual audio quality ratings with two intrusive embedding-based metrics: a mean squared error (MSE) and an intrusive variant of the Fréchet Audio Distance (FAD) calculated on MERT embeddings. Experiments on two independent datasets show that these metrics correlate more strongly with perceptual audio quality ratings than traditional BSS-Eval metrics across all analyzed stem and model types.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes two intrusive embedding-based metrics for musical source separation evaluation using MERT representations: mean squared error (MSE) on embeddings and an intrusive variant of Fréchet Audio Distance (FAD). Experiments on two independent datasets are reported to show stronger correlations between these metrics and perceptual audio quality ratings from listening tests than those achieved by traditional BSS-Eval metrics, across stems and separation models.
Significance. If the central empirical claims hold after validation, the work would provide a practical advance in MSS evaluation by offering metrics that align better with human perception than BSS-Eval, addressing a documented weakness in current practice. The use of large self-supervised embeddings like MERT is a timely direction, and the intrusive formulation allows direct comparison to reference signals.
major comments (2)
- [Listening test methodology] Section on listening test data collection and validation: the paper treats perceptual ratings as the gold standard without reporting inter-rater reliability (ICC, Cronbach’s α), number of raters, screening criteria, or any cross-validation across listener pools or test conditions. This is load-bearing for the claim of stronger perceptual alignment, because unquantified noise or bias in the ratings could produce artifactual correlation advantages for the MERT-based metrics.
- [Section 4] Section 4 (Results and correlation analysis): no statistical tests for the significance of correlation differences, no confidence intervals, and insufficient detail on dataset sizes, stem/model breakdowns, or potential confounds are provided. Without these, the assertion that the embedding metrics 'correlate more strongly ... across all analyzed stem and model types' cannot be rigorously evaluated.
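The reliability statistic requested above is cheap to compute once the per-rater scores are available. A minimal sketch with synthetic data (the rater count, noise level, and rating scale here are illustrative, not taken from the paper):

```python
# Sketch: Cronbach's alpha for a listening test, from a ratings matrix of
# shape (n_items, n_raters). All data below are synthetic placeholders.
import numpy as np

def cronbach_alpha(ratings):
    """Treats each rater as one 'test item' in the classical alpha formula:
    alpha = k/(k-1) * (1 - sum(var_i) / var(total))."""
    n_raters = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1)        # variance per rater
    total_var = ratings.sum(axis=1).var(ddof=1)    # variance of summed scores
    return n_raters / (n_raters - 1) * (1.0 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)
true_quality = rng.uniform(1, 5, size=50)          # latent per-item quality
ratings = true_quality[:, None] + rng.normal(0, 0.4, size=(50, 6))
print(round(cronbach_alpha(ratings), 2))           # high alpha: raters agree
```

Reporting this alongside an ICC (which additionally separates rater bias from residual noise) would let readers judge how much headroom the ratings leave for any metric to correlate with them.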
minor comments (2)
- [Abstract] The abstract and introduction would benefit from explicit numerical correlation values (e.g., Pearson r for each metric) rather than qualitative statements of 'more strongly.'
- [Section 3] Notation for the intrusive FAD variant should be defined more clearly with respect to the standard FAD formulation to avoid ambiguity in the embedding space.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving methodological transparency and statistical rigor, which we will address in the revision.
Point-by-point responses
-
Referee: [Listening test methodology] Section on listening test data collection and validation: the paper treats perceptual ratings as the gold standard without reporting inter-rater reliability (ICC, Cronbach’s α), number of raters, screening criteria, or any cross-validation across listener pools or test conditions. This is load-bearing for the claim of stronger perceptual alignment, because unquantified noise or bias in the ratings could produce artifactual correlation advantages for the MERT-based metrics.
Authors: We agree that these details are necessary to fully validate the perceptual ratings. In the revised manuscript we will expand the listening-test section to report the number of raters, screening criteria, inter-rater reliability statistics (ICC and Cronbach’s α), and any cross-validation steps that were performed. This addition will allow readers to assess the reliability of the human judgments underlying our correlation claims. revision: yes
-
Referee: [Section 4] Section 4 (Results and correlation analysis): no statistical tests for the significance of correlation differences, no confidence intervals, and insufficient detail on dataset sizes, stem/model breakdowns, or potential confounds are provided. Without these, the assertion that the embedding metrics 'correlate more strongly ... across all analyzed stem and model types' cannot be rigorously evaluated.
Authors: We acknowledge the value of formal statistical comparison. The revised Section 4 will include (i) bootstrap confidence intervals for all reported correlations, (ii) statistical tests (Fisher’s z-transformation) for differences between correlation coefficients, and (iii) expanded tables and text detailing dataset sizes, per-stem and per-model breakdowns, and discussion of potential confounds such as signal duration and genre distribution. These additions will make the comparative claims quantitatively verifiable. revision: yes
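The bootstrap comparison promised in this response can be sketched as follows. Because both metrics are scored against the same ratings, items are resampled jointly so the dependence between the two correlations is preserved; all data here are synthetic, and the authors' actual resampling scheme may differ.

```python
# Sketch: paired bootstrap 95% CI for the difference in Spearman correlation
# with perceptual ratings between two metrics evaluated on the same items.
import numpy as np
from scipy.stats import spearmanr

def corr_diff_ci(ratings, metric_a, metric_b, n_boot=1000, seed=0):
    """CI for rho(metric_a, ratings) - rho(metric_b, ratings)."""
    rng = np.random.default_rng(seed)
    n = len(ratings)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample items with replacement
        ra, _ = spearmanr(metric_a[idx], ratings[idx])
        rb, _ = spearmanr(metric_b[idx], ratings[idx])
        diffs[b] = ra - rb
    return np.percentile(diffs, [2.5, 97.5])

rng = np.random.default_rng(2)
ratings = rng.normal(size=200)                     # synthetic MOS-like scores
metric_a = ratings + 0.3 * rng.normal(size=200)    # strongly correlated metric
metric_b = ratings + 1.5 * rng.normal(size=200)    # weakly correlated metric
lo, hi = corr_diff_ci(ratings, metric_a, metric_b)
print(lo > 0)   # CI excludes zero here: metric_a correlates more strongly
```

A Fisher z-test on independent samples would understate significance for this paired design, so the joint resampling above (or a dependent-correlation test such as Steiger's) is the safer default.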
Circularity Check
No circularity: purely empirical correlation study with no derivations or self-referential reductions
Full rationale
The paper reports experimental correlations between two intrusive embedding-based metrics (MSE and intrusive FAD on MERT embeddings) and perceptual ratings from listening tests, comparing them to BSS-Eval metrics across datasets and stem types. No equations, fitted parameters, predictions, or derivations are present in the provided text; the central claim rests on direct computation against external listening-test data rather than any self-definition, ansatz smuggling, or load-bearing self-citation. The analysis is self-contained and does not reduce any result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Perceptual audio quality ratings from listening tests are the gold standard for evaluation.
Reference graph
Works this paper leans on
- [1] International Telecommunication Union, "ITU-R BS.1534: Method for the subjective assessment of intermediate quality level of audio systems," ITU, Geneva, Switzerland, Tech. Rep., Oct. 2015. [Online]. Available: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1534-3-201510-I!!PDF-E.pdf
- [2] International Telecommunication Union, "ITU-T Recommendation P.808: Subjective evaluation of speech quality with a crowdsourcing approach," ITU, Geneva, Switzerland, Tech. Rep., Jun. 2021. [Online]. Available: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-P.808-202106-I!!PDF-E&type=items
- [3] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006.
- [4] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR – half-baked or well done?" in Proc. ICASSP, 2019, pp. 626–630.
- [5] E. Cano, D. FitzGerald, and K. Brandenburg, "Evaluation of quality of sound source separation algorithms: Human perception vs quantitative metrics," in Proc. EUSIPCO, 2016, pp. 1758–1762.
- [6] D. Ward, H. Wierstorf, R. D. Mason, E. M. Grais, and M. D. Plumbley, "BSS Eval or PEASS? Predicting the perception of singing-voice separation," in Proc. ICASSP, 2018, pp. 596–600.
- [7] M. Torcoli, T. Kastner, and J. Herre, "Objective measures of perceptual audio quality reviewed: An evaluation of their application domain dependence," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 1530–1541, 2021.
- [8] P. A. Bereuter, B. Stahl, M. D. Plumbley, and A. Sontacchi, "Towards reliable objective evaluation metrics for generative singing voice separation models," in Proc. WASPAA, 2025, pp. 1–5.
- [9] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Xiao, C. Lin, A. Ragni, E. Benetos, N. Gyenge, R. Dannenberg, R. Liu, W. Chen, G. Xia, Y. Shi, W. Huang, Z. Wang, Y. Guo, and J. Fu, "MERT: Acoustic music understanding model with large-scale self-supervised training," in Proc. ICLR, 2024, pp. 12181–12204.
- [10] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, "Fréchet Audio Distance: A reference-free metric for evaluating music enhancement algorithms," in Proc. ISCA, 2019, pp. 2350–2354.
- [11] A. Gui, H. Gamper, S. Braun, and D. Emmanouilidou, "Adapting Fréchet Audio Distance for generative music evaluation," in Proc. ICASSP, 2024, pp. 1331–1335.
- [12] N. Jaffe and J. A. Burgoyne, "Musical source separation bake-off: Comparing objective metrics with human perception," in Proc. WASPAA, 2025, pp. 1–5.
- [13] F.-R. Stöter, A. Liutkus, and N. Ito, "The 2018 signal separation evaluation campaign," in Latent Variable Analysis and Signal Separation, Y. Deville, S. Gannot, R. Mason, M. D. Plumbley, and D. Ward, Eds. Cham: Springer International Publishing, 2018, pp. 293–305.
- [14] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, "The MUSDB18 corpus for music separation," Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1117372
- [15] C. Spearman, "The proof and measurement of association between two things," The American Journal of Psychology, vol. 15, no. 1, pp. 72–101, 1904.
- [16] K. Pearson, "VII. Mathematical contributions to the theory of evolution.—III. Regression, heredity, and panmixia," Philosophical Transactions of the Royal Society of London, Series A, no. 187, pp. 253–318, 1896.