Scoring Backends Matter More Than Pooling: A Systematic Study of Training-Free Anomalous Sound Detection under Domain Shift
Pith reviewed 2026-06-26 19:05 UTC · model grok-4.3
The pith
In training-free anomalous sound detection, scoring backend choice affects target-domain AUC far more than temporal pooling.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Switching the backend moves target-domain AUC by 13.8 points on average (up to 53.8), whereas switching the pooling moves it by only 3.2 points: in this training-free regime, the backend, not the pooling, dominates domain-shift robustness. No backend wins everywhere, but the machine-dependent pattern reproduces on the DCASE 2025 development data.
What carries the argument
Four classical scoring backends (nearest-neighbor cosine distance, Mahalanobis distance, locally density-normalized kNN, PCA-subspace reconstruction residual) crossed with three temporal poolings (mean, GeM, max), evaluated by target-domain AUC.
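As a rough illustration of what is being crossed, the sketch below gives one plausible NumPy form of the three poolings and the four backends. The paper's exact definitions are not reproduced on this page, so the function names, the GeM exponent `p`, the covariance ridge, the neighbour count `k`, the number of PCA components, and the LOF-style form of the density normalization are all assumptions.

```python
import numpy as np

# --- temporal poolings: frame features F with shape (T, d) -> embedding (d,) ---
def mean_pool(F):
    return F.mean(axis=0)

def max_pool(F):
    return F.max(axis=0)

def gem_pool(F, p=3.0, eps=1e-6):
    # Generalized-mean pooling; p=3 and the clamp to eps are assumed values.
    return np.mean(np.clip(F, eps, None) ** p, axis=0) ** (1.0 / p)

# --- scoring backends: bank B with shape (N, d), test embedding e with shape (d,) ---
def _unit(X):
    return X / np.linalg.norm(X, axis=-1, keepdims=True)

def nn_cosine(B, e):
    # Cosine distance to the nearest bank embedding.
    return float(1.0 - np.max(_unit(B) @ _unit(e)))

def mahalanobis(B, e, ridge=1e-3):
    # Squared Mahalanobis distance to the bank mean; the ridge is an assumed regulariser.
    mu = B.mean(axis=0)
    cov = np.cov(B, rowvar=False) + ridge * np.eye(B.shape[1])
    diff = e - mu
    return float(diff @ np.linalg.solve(cov, diff))

def density_normalized_knn(B, e, k=4):
    # Mean kNN distance of e, divided by the mean kNN distance that its
    # neighbours have inside the bank (a LOF-like ratio; assumed form).
    Bu = _unit(B)
    d_test = 1.0 - Bu @ _unit(e)
    nbrs = np.argsort(d_test)[:k]
    d_bank = 1.0 - Bu @ Bu.T
    np.fill_diagonal(d_bank, np.inf)  # a bank item is not its own neighbour
    local = np.sort(d_bank[nbrs], axis=1)[:, :k].mean()
    return float(d_test[nbrs].mean() / (local + 1e-12))

def pca_residual(B, e, n_components=8):
    # Norm of the part of (e - mean) that lies outside the top principal subspace.
    mu = B.mean(axis=0)
    _, _, Vt = np.linalg.svd(B - mu, full_matrices=False)
    V = Vt[:n_components]
    diff = e - mu
    return float(np.linalg.norm(diff - V.T @ (V @ diff)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    bank = rng.normal(size=(100, 16))                 # synthetic stand-in for a normal bank
    clip = mean_pool(rng.normal(size=(50, 16)))       # synthetic stand-in for frame features
    for backend in (nn_cosine, mahalanobis, density_normalized_knn, pca_residual):
        print(backend.__name__, backend(bank, clip))
```

Each backend returns a scalar where larger means more anomalous, so any of them can be paired with any pooling without further changes.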
If this is right
- No single backend wins on all machine types, but the machine-dependent pattern reproduces on the DCASE 2025 development data (fan, bearing).
- A label-free fusion that z-normalizes each backend's scores with its training-bank self-scores and takes the minimum reaches 63.3% harmonic-mean target AUC, close to the 64.4% of the per-machine oracle.
- Selecting a backend by source-domain pseudo-validation with proxy outliers fails because all backends saturate on that proxy task.
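The fusion in the second point can be sketched in a few lines. This is a minimal reading of "z-normalize with training-bank self-scores, then take the minimum"; the function name, the dictionary layout, and the use of the population standard deviation are assumptions, and the numbers in the usage lines are made up for illustration.

```python
import numpy as np

def fuse_min_z(test_scores, bank_self_scores):
    """Fuse raw anomaly scores from several backends.

    test_scores, bank_self_scores: dict mapping a backend name to a 1-D
    sequence of raw scores (test clips and normal bank clips respectively).
    """
    z_scores = []
    for name, raw in test_scores.items():
        ref = np.asarray(bank_self_scores[name], dtype=float)
        mu, sigma = ref.mean(), ref.std()
        # Put every backend on a comparable scale using only normal training data.
        z_scores.append((np.asarray(raw, dtype=float) - mu) / (sigma + 1e-12))
    # Minimum over backends: a clip scores high only if every backend scores it high.
    return np.min(np.stack(z_scores, axis=0), axis=0)

# Illustrative values only, not results from the paper.
fused = fuse_min_z(
    {"nn_cosine": [0.1, 0.5], "mahalanobis": [20.0, 80.0]},
    {"nn_cosine": [0.1, 0.2, 0.3], "mahalanobis": [10.0, 20.0, 30.0]},
)
print(fused)
```

No labels and no target-domain statistics enter the function, which is what makes the rule label-free in this reading.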
Where Pith is reading between the lines
- The observed dominance of backend choice may extend to other frozen pretrained encoders beyond the one tested.
- In deployment without labels, the proposed fusion could be used to adapt to unknown domain shifts by combining multiple backends.
- The failure of proxy validation suggests that source-only outlier proxies do not reliably simulate target-domain shift for backend selection.
Load-bearing premise
That the performance patterns observed with the BEATs encoder on these specific DCASE machine types will hold for other encoders and different types of domain shifts.
What would settle it
Running the same cross of backends and poolings with a different frozen audio encoder on the same DCASE sets and checking whether switching the backend still moves target-domain AUC by roughly 13.8 points on average while switching the pooling moves it by only about 3.2.
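A replication would need a concrete definition of how far a switch "moves" AUC. The sketch below uses one plausible choice, the max-minus-min range over the switched factor with the other factor held fixed; the paper's own definition is not given on this page, so this definition, the function name, and the array layout are assumptions, and the grid in the usage lines is random placeholder data.

```python
import numpy as np

def switch_effects(auc):
    """auc: array of shape (machines, backends, poolings), target-domain AUC in points.

    Returns (mean backend range, max backend range, mean pooling range).
    """
    auc = np.asarray(auc, dtype=float)
    # Range across backends for each (machine, pooling) pair.
    backend_range = auc.max(axis=1) - auc.min(axis=1)
    # Range across poolings for each (machine, backend) pair.
    pooling_range = auc.max(axis=2) - auc.min(axis=2)
    return backend_range.mean(), backend_range.max(), pooling_range.mean()

# Placeholder grid: 7 machines x 4 backends x 3 poolings of random values.
rng = np.random.default_rng(0)
grid = rng.uniform(40.0, 90.0, size=(7, 4, 3))
print(switch_effects(grid))
```

Applying the same function to the grid from a second encoder would make the two sets of deltas directly comparable.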
Original abstract
Training-free anomalous sound detection (ASD) scores a test clip against a memory bank of normal embeddings from a frozen pretrained audio encoder. Recent work attributes domain-shift robustness mainly to how frame-level features are pooled over time; the scoring backend applied on top of the pooled embedding has received far less systematic attention. Using a single frozen BEATs encoder on the DCASE 2023 Task 2 development set (all seven machine types), we cross four classical backends -- nearest-neighbor cosine distance, Mahalanobis distance, locally density-normalized kNN, and PCA-subspace reconstruction residual -- with three temporal poolings (mean, GeM, max). Switching the backend moves target-domain AUC by 13.8 points on average (up to 53.8), whereas switching the pooling moves it by only 3.2 points: in this training-free regime, the backend, not the pooling, dominates domain-shift robustness. No backend wins everywhere, but the machine-dependent pattern reproduces on the DCASE 2025 development data (fan, bearing). Exploiting this, we propose a label-free score fusion that z-normalizes each backend with its training-bank self-scores and takes the minimum; it reaches a harmonic-mean target AUC of 63.3% versus 64.4% for the per-machine oracle, surpassing every fixed single backend while preserving source-domain accuracy. We also report a negative result: selecting a backend by source-domain pseudo-validation with proxy outliers fails, because all backends saturate on the proxy task.
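The headline fusion numbers are harmonic means. A minimal sketch of that aggregation follows, assuming the mean is taken over per-machine target-domain AUCs (the abstract does not spell out the aggregation unit); the function name and the two example values are hypothetical.

```python
from statistics import harmonic_mean

def hmean_auc(per_machine_auc):
    """per_machine_auc: dict mapping a machine type to its AUC in percent."""
    return harmonic_mean(list(per_machine_auc.values()))

# Made-up values to show the behaviour, not results from the paper.
print(hmean_auc({"machine_a": 50.0, "machine_b": 100.0}))
```

Compared with an arithmetic mean, the harmonic mean is pulled toward the weakest machine, so a method cannot reach a high aggregate by excelling on a few machine types only.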
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a systematic empirical comparison in training-free anomalous sound detection (ASD) under domain shift. Using a single frozen BEATs encoder on the DCASE 2023 Task 2 development set (seven machine types), it crosses four scoring backends (nearest-neighbor cosine distance, Mahalanobis distance, locally density-normalized kNN, PCA-subspace reconstruction residual) with three temporal poolings (mean, GeM, max). It reports that backend choice affects target-domain AUC by 13.8 points on average (up to 53.8), while pooling affects it by only 3.2 points. The machine-dependent pattern reproduces on DCASE 2025 data (fan, bearing). A label-free fusion of backends via z-normalization on training-bank self-scores and taking the minimum is proposed, achieving 63.3% harmonic-mean target AUC (vs. 64.4% per-machine oracle). A negative result on source-domain pseudo-validation for backend selection is also reported.
Significance. If the findings hold, the work redirects attention in training-free ASD from pooling mechanisms to the design and combination of scoring backends for improved domain-shift robustness. The systematic ablation across multiple backends and poolings on standard DCASE benchmarks, the reproduction on a second dataset, the practical fusion method that approaches oracle performance, and the explicit negative result on pseudo-validation are all strengths that enhance the paper's value. The concrete AUC numbers provide clear, falsifiable benchmarks for future work.
major comments (2)
- [Abstract / Experimental Setup] The headline result that backends dominate pooling (13.8 vs. 3.2 AUC points) is obtained exclusively with one frozen BEATs encoder. Since backend scoring operates on the geometry of the pooled vector and pooling on the temporal axis, both are mediated by the covariance and temporal statistics of BEATs frame embeddings. The manuscript provides no experiments with alternative encoders (e.g., AST or HuBERT) to test whether the dominance reverses when embedding statistics differ, which is load-bearing for the general claim in the title and abstract that 'in this training-free regime, the backend, not the pooling, dominates'.
- [Proposed Fusion] The label-free fusion (which z-normalizes each backend with its training-bank self-scores and takes the minimum) reaches 63.3% harmonic mean vs. 64.4% per-machine oracle. The description does not specify the exact procedure for computing the self-score normalization statistics (e.g., whether they are computed per-backend on the full source normal bank or with any cross-validation), which is needed to confirm the method introduces no implicit target-domain information.
minor comments (2)
- [Abstract] The term 'locally density-normalized kNN' is introduced without a brief definition or citation to the specific density normalization formula used.
- A table listing per-machine AUC for all 4 backends × 3 poolings (and the fusion) on both DCASE 2023 and 2025 sets would allow direct verification of the reported average deltas and machine-dependent patterns.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive report. We address the two major comments point-by-point below.
- Referee: [Abstract / Experimental Setup] The headline result that backends dominate pooling (13.8 vs. 3.2 AUC points) is obtained exclusively with one frozen BEATs encoder. Since backend scoring operates on the geometry of the pooled vector and pooling on the temporal axis, both are mediated by the covariance and temporal statistics of BEATs frame embeddings. The manuscript provides no experiments with alternative encoders (e.g., AST or HuBERT) to test whether the dominance reverses when embedding statistics differ, which is load-bearing for the general claim in the title and abstract that 'in this training-free regime, the backend, not the pooling, dominates'.
Authors: We agree that the experiments use a single encoder (BEATs) and that the dominance result is therefore specific to the embedding statistics produced by that model. The manuscript already states explicitly that a single frozen BEATs encoder is used, and the title/abstract claim is framed within the training-free regime studied under that setup. While additional encoders would strengthen generality, the core contribution is the systematic backend-vs-pooling ablation on standard DCASE benchmarks rather than an exhaustive encoder sweep. We will revise the abstract and discussion to qualify the claim more precisely (e.g., “with a frozen BEATs encoder”) and note the limitation; we do not intend to add new encoder experiments in the current revision. revision: partial
- Referee: [Proposed Fusion] The label-free fusion (which z-normalizes each backend with its training-bank self-scores and takes the minimum) reaches 63.3% harmonic mean vs. 64.4% per-machine oracle. The description does not specify the exact procedure for computing the self-score normalization statistics (e.g., whether they are computed per-backend on the full source normal bank or with any cross-validation), which is needed to confirm the method introduces no implicit target-domain information.
Authors: The self-score normalization statistics are computed independently for each backend on the full source normal bank (all available normal training clips for the given machine type) with no cross-validation folds and no target-domain samples. This is done once per backend before any test-time scoring. We will expand the method description (including the exact normalization formula and a clarifying sentence) in the revised manuscript to make this procedure explicit and to confirm the absence of target information. revision: yes
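Written out, the procedure described in this response might take the following form. This is a sketch inferred from the wording above, not the formula promised for the revision. The symbols are assumptions: $s_b$ is the raw score of backend $b$, and $e_1, \ldots, e_N$ are the embeddings of the normal training clips for one machine type.

```latex
\mu_b = \frac{1}{N}\sum_{i=1}^{N} s_b(e_i), \qquad
\sigma_b = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(s_b(e_i) - \mu_b\bigr)^2}, \qquad
s_{\mathrm{fused}}(e) = \min_{b}\, \frac{s_b(e) - \mu_b}{\sigma_b}
```

The response leaves open whether a bank clip is removed from the bank when its own self-score is computed. For a nearest-neighbour backend, a clip scored against a bank that still contains it has distance zero, so some form of self-exclusion seems likely.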
Circularity Check
No circularity: purely empirical comparison with direct held-out evaluation
The paper reports measured AUC differences from exhaustive cross-product experiments (4 backends × 3 poolings) on fixed DCASE 2023/2025 development sets using one frozen BEATs encoder. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. All headline deltas (13.8 vs. 3.2 AUC points) and the proposed fusion are evaluated directly against target-domain labels; the design contains no self-definitional reductions or ansatz smuggling. A circularity score of zero is the expected outcome for a controlled empirical ablation study.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] INTRODUCTION. Unsupervised ASD for machine condition monitoring, as standardized by the DCASE Task 2 series [1, 2, 3, 4], asks a system to detect anomalous sounds having heard only normal clips of a machine. Since 2022 the task has emphasized domain shift: the bank of normal training clips is dominated by a source domain (990 clips), while only 10 clips...
- [2] METHOD. 2.1. Training-free ASD pipeline. For each machine type we are given $N = 1{,}000$ normal training clips (990 source, 10 target). A frozen encoder maps a clip $x$ to frame features $F(x) \in \mathbb{R}^{T \times d}$; a temporal pooling $\phi$ produces the clip embedding $e = \phi(F(x)) \in \mathbb{R}^{d}$. The bank is $B = \{e_1, \ldots, e_N\}$. At test time, a backend $s(\cdot)$ maps a test embedding to a scalar anomaly score. We ...
- [3] EXPERIMENTS. Setup. DCASE 2023 Task 2 development set [3] (ToyADMOS2
- [4] and MIMII DG [31] recordings), all seven machines; per machine, 1,000 normal training clips and 200 test clips (50 normal + 50 anomalous per domain). Audio is 16 kHz; clips are 10 s (ToyCar/ToyTrain: first 10 of 12 s). Following the official protocol [3], AUC is computed per domain using that domain’s normal clips against all anomalous clips of the section, an...
- [5] CONCLUSION. With backbone, bank, and pooling held fixed, the scoring backend is the dominant (and underexamined) design choice for domain-robust training-free ASD with a frozen BEATs backbone: it moves target-domain AUC 4–6× more than temporal pooling, its machine-dependent strengths are stable across benchmark editions, and a simple training-bank-calibra...
- [6] Yuma Koizumi, Yohei Kawaguchi, Keisuke Imoto, et al., “Description and discussion on DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring,” in Proc. DCASE Workshop, 2020.
- [7] Kota Dohi, Keisuke Imoto, Noboru Harada, et al., “Description and discussion on DCASE 2022 challenge task 2: Unsupervised anomalous sound detection for machine condition monitoring applying domain generalization techniques,” in Proc. DCASE Workshop, 2022.
- [8] Kota Dohi, Keisuke Imoto, Noboru Harada, Daisuke Niizumi, Yuma Koizumi, Tomoya Nishida, Harsh Purohit, Takashi Endo, and Yohei Kawaguchi, “Description and discussion on DCASE 2023 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,” in Proc. DCASE Workshop, 2023.
- [9] Tomoya Nishida, Noboru Harada, Daisuke Niizumi, Davide Albertini, Roberto Sannino, Simone Pradolini, Filippo Augusti, Keisuke Imoto, Kota Dohi, Harsh Purohit, Takashi Endo, and Yohei Kawaguchi, “Description and discussion on DCASE 2025 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,” in Proc. DCASE Wo... (2025)
- [10] Kevin Wilkinghoff and Fabian Fritz, “On using pre-trained embeddings for detecting anomalous sounds with limited training data,” in Proc. EUSIPCO, 2023, pp. 186–190.
- [11] Kevin Wilkinghoff, Sarthak Yadav, and Zheng-Hua Tan, “Temporal pooling strategies for training-free anomalous sound detection with self-supervised audio embeddings,” arXiv:2603.04605, 2026.
- [12] Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, and Furu Wei, “BEATs: Audio pre-training with acoustic tokenizers,” in Proc. ICML, 2023.
- [13] Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, 2020.
- [14] Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3451–3460, 2021.
- [15] Yuan Gong, Yu-An Chung, and James Glass, “AST: Audio spectrogram transformer,” in Proc. Interspeech, 2021.
- [16] Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley, “PANNs: Large-scale pretrained audio neural networks for audio pattern recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 2880–2894, 2020.
- [17] Liron Bergman, Niv Cohen, and Yedid Hoshen, “Deep nearest neighbor anomaly detection,” arXiv:2002.10445, 2020.
- [18] Oliver Rippel, Patrick Mertens, and Dorit Merhof, “Modeling the distribution of normal data in pre-trained deep features for anomaly detection,” in Proc. ICPR, 2021.
- [19] Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier, “PaDiM: A patch distribution modeling framework for anomaly detection and localization,” in Proc. ICPR Workshops, 2021.
- [20] Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler, “Towards total recall in industrial anomaly detection,” in Proc. CVPR, 2022.
- [21] Simon Damm, Mike Laszkiewicz, Johannes Lederer, and Asja Fischer, “AnomalyDINO: Boosting patch-based few-shot anomaly detection with DINOv2,” in Proc. WACV, 2025, pp. 1319–1329.
- [22] Xurui Li, Ziming Huang, Feng Xue, and Yu Zhou, “MuSc: Zero-shot industrial anomaly classification and segmentation with mutual scoring of the unlabeled images,” in Proc. ICLR, 2024.
- [23] Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, and Pingyi Fan, “AnoPatch: Towards better consistency in machine anomalous sound detection,” in Proc. Interspeech, 2024, pp. 107–111.
- [24] Kevin Wilkinghoff, “Design choices for learning embeddings from auxiliary tasks for domain generalization in anomalous sound detection,” in Proc. ICASSP, 2023.
- [25] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim, “Efficient algorithms for mining outliers from large data sets,” in Proc. ACM SIGMOD, 2000.
- [26] Prasanta Chandra Mahalanobis, “On the generalised distance in statistics,” Proc. National Institute of Sciences of India, vol. 2, no. 1, pp. 49–55, 1936.
- [27] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander, “LOF: Identifying density-based local outliers,” in Proc. ACM SIGMOD, 2000.
- [28] Heiko Hoffmann, “Kernel PCA for novelty detection,” Pattern Recognition, vol. 40, no. 3, pp. 863–874, 2007.
- [29] Filip Radenović, Giorgos Tolias, and Ondřej Chum, “Fine-tuning CNN image retrieval with no human annotation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 7, pp. 1655–1668, 2019.
- [30] Olivier Ledoit and Michael Wolf, “A well-conditioned estimator for large-dimensional covariance matrices,” Journal of Multivariate Analysis, vol. 88, no. 2, pp. 365–411, 2004.
- [31] Kevin Wilkinghoff, Haici Yang, Janek Ebbers, François G. Germain, Gordon Wichern, and Jonathan Le Roux, “Local density-based anomaly score normalization for domain generalization,” IEEE Trans. Audio, Speech, Lang. Process., vol. 33, pp. 4642–4652, 2025.
- [32] Anil Jain, Karthik Nandakumar, and Arun Ross, “Score normalization in multimodal biometric systems,” Pattern Recognition, vol. 38, no. 12, pp. 2270–2285, 2005.
- [33] Josef Kittler, Mohamad Hatef, Robert P. W. Duin, and Jiri Matas, “On combining classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, 1998.
- [34] Charu C. Aggarwal and Saket Sathe, “Theoretical foundations and algorithms for outlier ensembles,” ACM SIGKDD Explorations, vol. 17, no. 1, pp. 24–47, 2015.
- [35] Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito, “ToyADMOS2: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions,” in Proc. DCASE Workshop, 2021.
- [36] Kota Dohi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Masaaki Yamamoto, Yuki Nikaido, and Yohei Kawaguchi, “MIMII DG: Sound dataset for malfunctioning industrial machine investigation and inspection for domain generalization task,” in Proc. DCASE Workshop, 2022.
- [37] Noboru Harada, Daisuke Niizumi, Yasunori Ohishi, Daiki Takeuchi, and Masahiro Yasuda, “First-shot anomaly sound detection for machine condition monitoring: A domain generalization baseline,” in Proc. EUSIPCO, 2023, pp. 191–195.