Embedding-Based Intrusive Evaluation Metrics for Musical Source Separation Using MERT Representations
Pith reviewed 2026-05-09 23:20 UTC · model grok-4.3
The pith
MERT embedding metrics correlate more strongly with perceptual audio quality than BSS-Eval in musical source separation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors introduce and evaluate intrusive metrics computed directly on MERT embeddings, specifically MSE between reference and estimated source embeddings plus an intrusive Fréchet Audio Distance. Experiments on two independent datasets demonstrate that these metrics correlate more strongly with perceptual audio quality ratings from listening tests than conventional BSS-Eval metrics, and the result holds across all analyzed stem and model types.
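The two metrics can be made concrete with a short sketch. This is not the authors' code: the MERT extraction step is omitted, and random arrays stand in for frame-level embeddings of the reference and separated signals. The embedding MSE compares time-aligned frames directly; the intrusive FAD fits a Gaussian to each signal's embedding frames and takes the Fréchet distance between the two Gaussians.

```python
# Illustrative sketch (not the authors' implementation) of the two proposed
# metrics, computed on pre-extracted embedding sequences. In real use these
# arrays would be frame-level MERT embeddings of the reference and estimate.
import numpy as np
from scipy.linalg import sqrtm

def embedding_mse(ref_emb, est_emb):
    """MSE between time-aligned embedding sequences of shape (n_frames, dim)."""
    return float(np.mean((ref_emb - est_emb) ** 2))

def intrusive_fad(ref_emb, est_emb, eps=1e-6):
    """Frechet distance between Gaussians fitted to the two embedding sets:
    ||mu_r - mu_e||^2 + Tr(C_r + C_e - 2 (C_r C_e)^{1/2})."""
    mu_r, mu_e = ref_emb.mean(axis=0), est_emb.mean(axis=0)
    c_r = np.cov(ref_emb, rowvar=False) + eps * np.eye(ref_emb.shape[1])
    c_e = np.cov(est_emb, rowvar=False) + eps * np.eye(est_emb.shape[1])
    covmean = sqrtm(c_r @ c_e)
    if np.iscomplexobj(covmean):  # discard tiny imaginary parts from sqrtm
        covmean = covmean.real
    diff = mu_r - mu_e
    return float(diff @ diff + np.trace(c_r + c_e - 2.0 * covmean))

rng = np.random.default_rng(0)
ref = rng.standard_normal((500, 8))               # stand-in for embeddings
est = ref + 0.1 * rng.standard_normal((500, 8))   # mildly degraded estimate
print(embedding_mse(ref, est))   # small positive value for the degraded estimate
print(intrusive_fad(ref, ref))   # ~0 for identical inputs
```

Both metrics are intrusive in the sense that they require the reference embeddings; the FAD variant discards temporal alignment and compares only the distributions of frames, which is why the two can disagree.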
What carries the argument
MERT embeddings, the latent representations produced by a large self-supervised music understanding model, used as the space in which to compute intrusive distance measures between reference and separated audio.
If this is right
- These embedding metrics provide a more perceptually aligned automatic way to evaluate musical source separation performance.
- The stronger correlation holds consistently across different audio stems such as vocals and accompaniment and across various separation models.
- Improved automatic metrics could allow faster iteration on new source separation algorithms by reducing dependence on repeated human listening tests.
- The approach suggests that distance measures in self-supervised embedding spaces can substitute for signal-level metrics in audio evaluation.
Where Pith is reading between the lines
- If the result generalizes, evaluation pipelines in music information retrieval could shift from signal-based to representation-based intrusive metrics.
- The finding points to a possible link between self-supervised pretraining objectives and human auditory perception that could be tested on other audio tasks.
- Combining MERT embeddings with embeddings from additional models might further improve correlation with perceptual ratings.
Load-bearing premise
MERT embeddings capture the perceptual features relevant to audio quality, and listening-test ratings serve as a reliable gold standard, without further validation of their consistency or generalizability.
What would settle it
A new listening test on a different dataset or set of separation models in which BSS-Eval metrics show equal or higher correlation with the ratings than the MERT-based MSE and intrusive FAD.
Original abstract
Evaluation of musical source separation (MSS) has traditionally relied on Blind Source Separation Evaluation (BSS-Eval) metrics. However, recent work suggests that BSS-Eval metrics exhibit low correlation with perceptual audio quality ratings from a listening test, which is considered the gold standard evaluation method. As an alternative approach in singing voice separation, embedding-based intrusive metrics that leverage latent representations from large self-supervised audio models such as Music undERstanding with large-scale self-supervised Training (MERT) embeddings have been introduced. In this work, we analyze the correlation of perceptual audio quality ratings with two intrusive embedding-based metrics: a mean squared error (MSE) and an intrusive variant of the Fréchet Audio Distance (FAD) calculated on MERT embeddings. Experiments on two independent datasets show that these metrics correlate more strongly with perceptual audio quality ratings than traditional BSS-Eval metrics across all analyzed stem and model types.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes two intrusive embedding-based metrics for musical source separation evaluation using MERT representations: mean squared error (MSE) on embeddings and an intrusive variant of Fréchet Audio Distance (FAD). Experiments on two independent datasets are reported to show stronger correlations between these metrics and perceptual audio quality ratings from listening tests than those achieved by traditional BSS-Eval metrics, across stems and separation models.
Significance. If the central empirical claims hold after validation, the work would provide a practical advance in MSS evaluation by offering metrics that align better with human perception than BSS-Eval, addressing a documented weakness in current practice. The use of large self-supervised embeddings like MERT is a timely direction, and the intrusive formulation allows direct comparison to reference signals.
major comments (2)
- [Listening test methodology] Section on listening test data collection and validation: the paper treats perceptual ratings as the gold standard without reporting inter-rater reliability (ICC, Cronbach’s α), number of raters, screening criteria, or any cross-validation across listener pools or test conditions. This is load-bearing for the claim of stronger perceptual alignment, because unquantified noise or bias in the ratings could produce artifactual correlation advantages for the MERT-based metrics.
- [Section 4] Section 4 (Results and correlation analysis): no statistical tests for the significance of correlation differences, no confidence intervals, and insufficient detail on dataset sizes, stem/model breakdowns, or potential confounds are provided. Without these, the assertion that the embedding metrics 'correlate more strongly ... across all analyzed stem and model types' cannot be rigorously evaluated.
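The reliability statistic requested above is cheap to compute once the per-rater scores are available. A minimal sketch with synthetic data (the rater count, noise level, and rating scale here are illustrative, not taken from the paper):

```python
# Sketch: Cronbach's alpha for a listening test, from a ratings matrix of
# shape (n_items, n_raters). All data below are synthetic placeholders.
import numpy as np

def cronbach_alpha(ratings):
    """Treats each rater as one 'test item' in the classical alpha formula:
    alpha = k/(k-1) * (1 - sum(var_i) / var(total))."""
    n_raters = ratings.shape[1]
    item_vars = ratings.var(axis=0, ddof=1)        # variance per rater
    total_var = ratings.sum(axis=1).var(ddof=1)    # variance of summed scores
    return n_raters / (n_raters - 1) * (1.0 - item_vars.sum() / total_var)

rng = np.random.default_rng(1)
true_quality = rng.uniform(1, 5, size=50)          # latent per-item quality
ratings = true_quality[:, None] + rng.normal(0, 0.4, size=(50, 6))
print(round(cronbach_alpha(ratings), 2))           # high alpha: raters agree
```

Reporting this alongside an ICC (which additionally separates rater bias from residual noise) would let readers judge how much headroom the ratings leave for any metric to correlate with them.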
minor comments (2)
- [Abstract] The abstract and introduction would benefit from explicit numerical correlation values (e.g., Pearson r for each metric) rather than qualitative statements of 'more strongly.'
- [Section 3] Notation for the intrusive FAD variant should be defined more clearly with respect to the standard FAD formulation to avoid ambiguity in the embedding space.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for improving methodological transparency and statistical rigor, which we will address in the revision.
Point-by-point responses
-
Referee: [Listening test methodology] Section on listening test data collection and validation: the paper treats perceptual ratings as the gold standard without reporting inter-rater reliability (ICC, Cronbach’s α), number of raters, screening criteria, or any cross-validation across listener pools or test conditions. This is load-bearing for the claim of stronger perceptual alignment, because unquantified noise or bias in the ratings could produce artifactual correlation advantages for the MERT-based metrics.
Authors: We agree that these details are necessary to fully validate the perceptual ratings. In the revised manuscript we will expand the listening-test section to report the number of raters, screening criteria, inter-rater reliability statistics (ICC and Cronbach’s α), and any cross-validation steps that were performed. This addition will allow readers to assess the reliability of the human judgments underlying our correlation claims. revision: yes
-
Referee: [Section 4] Section 4 (Results and correlation analysis): no statistical tests for the significance of correlation differences, no confidence intervals, and insufficient detail on dataset sizes, stem/model breakdowns, or potential confounds are provided. Without these, the assertion that the embedding metrics 'correlate more strongly ... across all analyzed stem and model types' cannot be rigorously evaluated.
Authors: We acknowledge the value of formal statistical comparison. The revised Section 4 will include (i) bootstrap confidence intervals for all reported correlations, (ii) statistical tests (Fisher’s z-transformation) for differences between correlation coefficients, and (iii) expanded tables and text detailing dataset sizes, per-stem and per-model breakdowns, and discussion of potential confounds such as signal duration and genre distribution. These additions will make the comparative claims quantitatively verifiable. revision: yes
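The bootstrap comparison promised in this response can be sketched as follows. Because both metrics are scored against the same ratings, items are resampled jointly so the dependence between the two correlations is preserved; all data here are synthetic, and the authors' actual resampling scheme may differ.

```python
# Sketch: paired bootstrap 95% CI for the difference in Spearman correlation
# with perceptual ratings between two metrics evaluated on the same items.
import numpy as np
from scipy.stats import spearmanr

def corr_diff_ci(ratings, metric_a, metric_b, n_boot=1000, seed=0):
    """CI for rho(metric_a, ratings) - rho(metric_b, ratings)."""
    rng = np.random.default_rng(seed)
    n = len(ratings)
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # resample items with replacement
        ra, _ = spearmanr(metric_a[idx], ratings[idx])
        rb, _ = spearmanr(metric_b[idx], ratings[idx])
        diffs[b] = ra - rb
    return np.percentile(diffs, [2.5, 97.5])

rng = np.random.default_rng(2)
ratings = rng.normal(size=200)                     # synthetic MOS-like scores
metric_a = ratings + 0.3 * rng.normal(size=200)    # strongly correlated metric
metric_b = ratings + 1.5 * rng.normal(size=200)    # weakly correlated metric
lo, hi = corr_diff_ci(ratings, metric_a, metric_b)
print(lo > 0)   # CI excludes zero here: metric_a correlates more strongly
```

A Fisher z-test on independent samples would understate significance for this paired design, so the joint resampling above (or a dependent-correlation test such as Steiger's) is the safer default.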
Circularity Check
No circularity: purely empirical correlation study with no derivations or self-referential reductions
Full rationale
The paper reports experimental correlations between two intrusive embedding-based metrics (MSE and intrusive FAD on MERT embeddings) and perceptual ratings from listening tests, comparing them to BSS-Eval metrics across datasets and stem types. No equations, fitted parameters, predictions, or derivations are present in the provided text; the central claim rests on direct computation against external listening-test data rather than any self-definition, ansatz smuggling, or load-bearing self-citation. The analysis is self-contained and does not reduce any result to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Perceptual audio quality ratings from listening tests are the gold standard for evaluation.
Reference graph
Works this paper leans on
- [1] International Telecommunication Union, "ITU-R BS.1534: Method for the subjective assessment of intermediate quality level of audio systems," ITU, Geneva, Switzerland, Tech. Rep., Oct. 2015. [Online]. Available: https://www.itu.int/dms_pubrec/itu-r/rec/bs/R-REC-BS.1534-3-201510-I!!PDF-E.pdf
- [2] International Telecommunication Union, "ITU-T Recommendation P.808: Subjective evaluation of speech quality with a crowdsourcing approach," ITU, Geneva, Switzerland, Tech. Rep., Jun. 2021. [Online]. Available: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-P.808-202106-I!!PDF-E&type=items
- [3] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech, Lang. Process., vol. 14, no. 4, pp. 1462–1469, 2006.
- [4] J. Le Roux, S. Wisdom, H. Erdogan, and J. R. Hershey, "SDR – half-baked or well done?" in Proc. ICASSP, 2019, pp. 626–630.
- [5] E. Cano, D. FitzGerald, and K. Brandenburg, "Evaluation of quality of sound source separation algorithms: Human perception vs quantitative metrics," in Proc. EUSIPCO, 2016, pp. 1758–1762.
- [6] D. Ward, H. Wierstorf, R. D. Mason, E. M. Grais, and M. D. Plumbley, "BSS Eval or PEASS? Predicting the perception of singing-voice separation," in Proc. ICASSP, 2018, pp. 596–600.
- [7] M. Torcoli, T. Kastner, and J. Herre, "Objective measures of perceptual audio quality reviewed: An evaluation of their application domain dependence," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 1530–1541, 2021.
- [8] P. A. Bereuter, B. Stahl, M. D. Plumbley, and A. Sontacchi, "Towards reliable objective evaluation metrics for generative singing voice separation models," in Proc. WASPAA, 2025, pp. 1–5.
- [9] Y. Li, R. Yuan, G. Zhang, Y. Ma, X. Chen, H. Yin, C. Xiao, C. Lin, A. Ragni, E. Benetos, N. Gyenge, R. Dannenberg, R. Liu, W. Chen, G. Xia, Y. Shi, W. Huang, Z. Wang, Y. Guo, and J. Fu, "MERT: Acoustic music understanding model with large-scale self-supervised training," in Proc. ICLR, 2024, pp. 12181–12204.
- [10] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, "Fréchet Audio Distance: A reference-free metric for evaluating music enhancement algorithms," in Proc. ISCA, 2019, pp. 2350–2354.
- [11] A. Gui, H. Gamper, S. Braun, and D. Emmanouilidou, "Adapting Fréchet Audio Distance for generative music evaluation," in Proc. ICASSP, 2024, pp. 1331–1335.
- [12] N. Jaffe and J. A. Burgoyne, "Musical source separation bake-off: Comparing objective metrics with human perception," in Proc. WASPAA, 2025, pp. 1–5.
- [13] F.-R. Stöter, A. Liutkus, and N. Ito, "The 2018 signal separation evaluation campaign," in Latent Variable Analysis and Signal Separation, Y. Deville, S. Gannot, R. Mason, M. D. Plumbley, and D. Ward, Eds. Cham: Springer International Publishing, 2018, pp. 293–305.
- [14] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, "The MUSDB18 corpus for music separation," Dec. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.1117372
- [15] C. Spearman, "The proof and measurement of association between two things," The American Journal of Psychology, vol. 15, no. 1, pp. 72–101, 1904.
- [16] K. Pearson, "VII. Mathematical contributions to the theory of evolution.—III. Regression, heredity, and panmixia," Philosophical Transactions of the Royal Society of London, Series A, no. 187, pp. 253–318, 1896.