Scoring Backends Matter More Than Pooling: A Systematic Study of Training-Free Anomalous Sound Detection under Domain Shift

Jingwen Zhou; Mingzhe Wang

arxiv: 2606.19269 · v1 · pith:PHTF4JAEnew · submitted 2026-06-17 · 💻 cs.SD


Pith reviewed 2026-06-26 19:05 UTC · model grok-4.3

classification 💻 cs.SD

keywords anomalous sound detection · domain shift · training-free · scoring backend · temporal pooling · score fusion · DCASE Task 2


The pith

In training-free anomalous sound detection, scoring backend choice affects target-domain AUC far more than temporal pooling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that for training-free anomalous sound detection under domain shift, the scoring backend applied to pooled embeddings determines robustness much more than the choice of temporal pooling. Using a fixed pretrained encoder, experiments across multiple backends and poolings on DCASE development sets show backend switches cause large performance variations while pooling does not. This leads to a proposed fusion of backends that improves average performance without machine labels. The finding redirects attention from pooling design to backend selection in this setting.

Core claim

Switching the backend moves target-domain AUC by 13.8 points on average (up to 53.8), whereas switching the pooling moves it by only 3.2 points: in this training-free regime, the backend, not the pooling, dominates domain-shift robustness. No backend wins everywhere, but the machine-dependent pattern reproduces on the DCASE 2025 development data.
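One way to read "moves target-domain AUC by X points on average" is as a spread over a results grid. The sketch below assumes the spread is the max-minus-min AUC along one factor, averaged over every combination of the remaining factors; the summary above does not give the paper's exact definition, so treat this as an illustration. The function name `mean_spread` and the random placeholder grid are hypothetical and are not the paper's numbers.

```python
import numpy as np

def mean_spread(auc, axis):
    """Average (max - min) AUC along `axis`, over all remaining grid cells."""
    return float(np.mean(auc.max(axis=axis) - auc.min(axis=axis)))

# Placeholder grid, NOT the paper's results: shape (machines, backends, poolings).
rng = np.random.default_rng(0)
auc = rng.uniform(40.0, 90.0, size=(7, 4, 3))

# Vary the backend while holding machine and pooling fixed.
backend_effect = mean_spread(auc, axis=1)
# Vary the pooling while holding machine and backend fixed.
pooling_effect = mean_spread(auc, axis=2)
# Largest single backend-induced swing anywhere in the grid.
max_backend_swing = float((auc.max(axis=1) - auc.min(axis=1)).max())

print(f"backend: {backend_effect:.1f}  pooling: {pooling_effect:.1f}  max: {max_backend_swing:.1f}")
```

With the real per-machine AUC table in place of the placeholder grid, the same three quantities would correspond to the reported average backend effect, average pooling effect, and maximum backend swing, provided the paper uses this definition.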

What carries the argument

Four classical scoring backends (nearest-neighbor cosine distance, Mahalanobis distance, locally density-normalized kNN, PCA-subspace reconstruction residual) crossed with three temporal poolings (mean, GeM, max), evaluated by target-domain AUC.
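To make the grid concrete, here is a minimal NumPy sketch of the three poolings and the four backend families. It is an illustration under assumptions, not the authors' implementation: the function names, the GeM exponent `p`, the clipping of features before the GeM power, the ridge term in the covariance, the neighborhood sizes, and the number of PCA components are all hypothetical choices.

```python
import numpy as np

def pool(frames, kind="mean", p=3.0, eps=1e-6):
    """Pool frame features of shape (T, d) into one clip embedding of shape (d,)."""
    if kind == "mean":
        return frames.mean(axis=0)
    if kind == "max":
        return frames.max(axis=0)
    if kind == "gem":
        # Generalized mean; clipping to eps is an assumption for signed features.
        return np.mean(np.clip(frames, eps, None) ** p, axis=0) ** (1.0 / p)
    raise ValueError(f"unknown pooling: {kind}")

def _unit(x):
    """L2-normalize along the last axis."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-12)

def knn_cosine(bank, e, k=1):
    """Mean cosine distance from e to its k nearest bank embeddings."""
    d = 1.0 - _unit(bank) @ _unit(e)
    return float(np.sort(d)[:k].mean())

def mahalanobis(bank, e, ridge=1e-3):
    """Mahalanobis distance to the bank mean (ridge-regularized covariance)."""
    mu = bank.mean(axis=0)
    cov = np.cov(bank, rowvar=False) + ridge * np.eye(bank.shape[1])
    diff = e - mu
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def density_normalized_knn(bank, e, k_local=5):
    """Nearest-neighbor cosine distance divided by that neighbor's local scale."""
    ub = _unit(bank)
    d = 1.0 - ub @ _unit(e)
    nn = int(np.argmin(d))
    d_nn = 1.0 - ub @ ub[nn]
    d_nn[nn] = np.inf  # exclude the neighbor itself
    local = np.sort(d_nn)[:k_local].mean()
    return float(d[nn] / (local + 1e-12))

def pca_residual(bank, e, n_components=8):
    """Norm of the part of (e - mean) outside the top principal subspace."""
    mu = bank.mean(axis=0)
    _, _, vt = np.linalg.svd(bank - mu, full_matrices=False)
    basis = vt[:n_components]
    diff = e - mu
    return float(np.linalg.norm(diff - basis.T @ (basis @ diff)))
```

Each backend takes the same bank of pooled normal embeddings and returns a scalar where larger means more anomalous, which is what lets the study cross backends with poolings without touching the encoder.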

If this is right

No single backend wins on all machine types, but patterns are consistent across the 2023 and 2025 DCASE development sets.
A label-free fusion that z-normalizes each backend's scores against its own training-bank self-scores and then takes the minimum reaches 63.3% harmonic-mean target AUC, close to the 64.4% per-machine oracle.
Selecting a backend by source-domain pseudo-validation with proxy outliers fails because all backends saturate on that proxy task.
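The fusion rule in the list above can be sketched in a few lines. This is a hedged illustration: the function names and the dictionary layout are hypothetical, and how the bank self-scores are produced (for example, whether a nearest-neighbor backend excludes the clip itself) is not specified in this summary and is left to the caller.

```python
import numpy as np

def znorm_min_fusion(test_scores, bank_self_scores):
    """Z-normalize each backend against its bank self-scores, then take the minimum.

    test_scores:      dict backend_name -> array (n_test,) of raw anomaly scores
    bank_self_scores: dict backend_name -> array (n_bank,) of scores on the normal bank
    """
    z = []
    for name, s in test_scores.items():
        ref = np.asarray(bank_self_scores[name], dtype=float)
        # Population standard deviation; the paper's exact estimator is an assumption here.
        z.append((np.asarray(s, dtype=float) - ref.mean()) / (ref.std() + 1e-12))
    return np.min(np.stack(z), axis=0)

def harmonic_mean(aucs):
    """Harmonic mean of per-machine AUC values (all must be positive)."""
    aucs = np.asarray(aucs, dtype=float)
    return float(len(aucs) / np.sum(1.0 / aucs))
```

Taking the minimum means a clip is flagged only when every calibrated backend finds it unusual, and the harmonic mean penalizes a method that collapses on any single machine type more than an arithmetic mean would.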

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The observed dominance of backend choice may extend to other frozen pretrained encoders beyond the one tested.
In deployment without labels, the proposed fusion could be used to adapt to unknown domain shifts by combining multiple backends.
The failure of proxy validation suggests that source-only outlier proxies do not reliably simulate target-domain shift for backend selection.

Load-bearing premise

That the performance patterns observed with the BEATs encoder on these specific DCASE machine types will hold for other encoders and different types of domain shifts.

What would settle it

Running the same cross of backends and poolings with a different frozen audio encoder on the same DCASE sets and checking if the average AUC difference between backend switches remains around 13.8 points.

Original abstract

Training-free anomalous sound detection (ASD) scores a test clip against a memory bank of normal embeddings from a frozen pretrained audio encoder. Recent work attributes domain-shift robustness mainly to how frame-level features are pooled over time; the scoring backend applied on top of the pooled embedding has received far less systematic attention. Using a single frozen BEATs encoder on the DCASE 2023 Task 2 development set (all seven machine types), we cross four classical backends -- nearest-neighbor cosine distance, Mahalanobis distance, locally density-normalized kNN, and PCA-subspace reconstruction residual -- with three temporal poolings (mean, GeM, max). Switching the backend moves target-domain AUC by 13.8 points on average (up to 53.8), whereas switching the pooling moves it by only 3.2 points: in this training-free regime, the backend, not the pooling, dominates domain-shift robustness. No backend wins everywhere, but the machine-dependent pattern reproduces on the DCASE 2025 development data (fan, bearing). Exploiting this, we propose a label-free score fusion that z-normalizes each backend with its training-bank self-scores and takes the minimum; it reaches a harmonic-mean target AUC of 63.3% versus 64.4% for the per-machine oracle, surpassing every fixed single backend while preserving source-domain accuracy. We also report a negative result: selecting a backend by source-domain pseudo-validation with proxy outliers fails, because all backends saturate on the proxy task.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Backend choice drives most of the domain-shift gains here, and the label-free fusion is a practical addition worth testing, but the whole thing is locked to one BEATs encoder.


The central finding is that, with a frozen BEATs encoder on the DCASE 2023 development data, swapping the scoring backend changes target AUC by roughly 14 points on average, while changing the temporal pooling changes it by only about 3. The paper runs the four-by-three grid cleanly across seven machine types and shows that the machine-dependent pattern repeats on the DCASE 2025 development data (fan, bearing). That comparison itself is new in the training-free ASD literature.

They also introduce a simple fusion: z-normalize each backend against its own scores on the normal training bank, then take the minimum. It reaches 63.3% harmonic-mean target AUC against a 64.4% per-machine oracle and keeps source-domain performance intact. The negative result on source-domain pseudo-validation is useful too; it shows why you cannot simply pick the best backend that way.

The main limitation is scope. All numbers come from one pretrained encoder. The relative size of the backend and pooling effects could easily shift if the frame embeddings had a different covariance or temporal structure, as embeddings from other models often do. The experiments also stay inside the DCASE development sets, so the size of the gains may not carry over to domain shifts that look different in practice.

This is worth a reading group if your group works on training-free audio anomaly detection. The fusion recipe is cheap to try and the grid is reproducible from the description. It deserves peer review because the empirical claim is narrow, the controls are explicit, and the negative result adds value rather than being buried.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a systematic empirical comparison in training-free anomalous sound detection (ASD) under domain shift. Using a single frozen BEATs encoder on the DCASE 2023 Task 2 development set (seven machine types), it crosses four scoring backends (nearest-neighbor cosine distance, Mahalanobis distance, locally density-normalized kNN, PCA-subspace reconstruction residual) with three temporal poolings (mean, GeM, max). It reports that backend choice affects target-domain AUC by 13.8 points on average (up to 53.8), while pooling affects it by only 3.2 points. The machine-dependent pattern reproduces on DCASE 2025 data (fan, bearing). A label-free fusion of backends via z-normalization on training-bank self-scores and taking the minimum is proposed, achieving 63.3% harmonic-mean target AUC (vs. 64.4% per-machine oracle). A negative result on source-domain pseudo-validation for backend selection is also reported.

Significance. If the findings hold, the work redirects attention in training-free ASD from pooling mechanisms to the design and combination of scoring backends for improved domain-shift robustness. The systematic ablation across multiple backends and poolings on standard DCASE benchmarks, the reproduction on a second dataset, the practical fusion method that approaches oracle performance, and the explicit negative result on pseudo-validation are all strengths that enhance the paper's value. The concrete AUC numbers provide clear, falsifiable benchmarks for future work.

major comments (2)

[Abstract / Experimental Setup] The headline result that backends dominate pooling (13.8 vs. 3.2 AUC points) is obtained exclusively with one frozen BEATs encoder. Since backend scoring operates on the geometry of the pooled vector and pooling on the temporal axis, both are mediated by the covariance and temporal statistics of BEATs frame embeddings. The manuscript provides no experiments with alternative encoders (e.g., AST or HuBERT) to test whether the dominance reverses when embedding statistics differ, which is load-bearing for the general claim in the title and abstract that 'in this training-free regime, the backend, not the pooling, dominates'.
[Proposed Fusion] The label-free fusion (z-normalizes each backend with its training-bank self-scores and takes the minimum) reaches 63.3% harmonic mean vs. 64.4% per-machine oracle. The description does not specify the exact procedure for computing the self-score normalization statistics (e.g., whether they are computed per-backend on the full source normal bank or with any cross-validation), which is needed to confirm the method introduces no implicit target-domain information.

minor comments (2)

[Abstract] The term 'locally density-normalized kNN' is introduced without a brief definition or citation to the specific density normalization formula used.
A table listing per-machine AUC for all 4 backends × 3 poolings (and the fusion) on both DCASE 2023 and 2025 sets would allow direct verification of the reported average deltas and machine-dependent patterns.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments point-by-point below.


Referee: [Abstract / Experimental Setup] The headline result that backends dominate pooling (13.8 vs. 3.2 AUC points) is obtained exclusively with one frozen BEATs encoder. Since backend scoring operates on the geometry of the pooled vector and pooling on the temporal axis, both are mediated by the covariance and temporal statistics of BEATs frame embeddings. The manuscript provides no experiments with alternative encoders (e.g., AST or HuBERT) to test whether the dominance reverses when embedding statistics differ, which is load-bearing for the general claim in the title and abstract that 'in this training-free regime, the backend, not the pooling, dominates'.

Authors: We agree that the experiments use a single encoder (BEATs) and that the dominance result is therefore specific to the embedding statistics produced by that model. The manuscript already states explicitly that a single frozen BEATs encoder is used, and the title/abstract claim is framed within the training-free regime studied under that setup. While additional encoders would strengthen generality, the core contribution is the systematic backend-vs-pooling ablation on standard DCASE benchmarks rather than an exhaustive encoder sweep. We will revise the abstract and discussion to qualify the claim more precisely (e.g., “with a frozen BEATs encoder”) and note the limitation; we do not intend to add new encoder experiments in the current revision.
Revision: partial
Referee: [Proposed Fusion] The label-free fusion (z-normalizes each backend with its training-bank self-scores and takes the minimum) reaches 63.3% harmonic mean vs. 64.4% per-machine oracle. The description does not specify the exact procedure for computing the self-score normalization statistics (e.g., whether they are computed per-backend on the full source normal bank or with any cross-validation), which is needed to confirm the method introduces no implicit target-domain information.

Authors: The self-score normalization statistics are computed independently for each backend on the full source normal bank (all available normal training clips for the given machine type) with no cross-validation folds and no target-domain samples. This is done once per backend before any test-time scoring. We will expand the method description (including the exact normalization formula and a clarifying sentence) in the revised manuscript to make this procedure explicit and to confirm the absence of target information.
Revision: yes
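
Written out, the procedure described in this response would look roughly as follows. The notation is an assumption of this note, since the paper's own formula is not shown here: $s_b$ is the raw score of backend $b$, $e_1, \dots, e_N$ are the normal bank embeddings used for calibration, $\mathcal{K}$ is the set of backends, and the use of the population (rather than sample) standard deviation is a guess.

```latex
\mu_b = \frac{1}{N} \sum_{i=1}^{N} s_b(e_i), \qquad
\sigma_b = \sqrt{\frac{1}{N} \sum_{i=1}^{N} \bigl(s_b(e_i) - \mu_b\bigr)^2}, \qquad
\tilde{s}_b(e) = \frac{s_b(e) - \mu_b}{\sigma_b}, \qquad
s_{\mathrm{fuse}}(e) = \min_{b \in \mathcal{K}} \tilde{s}_b(e)
```

Because $\mu_b$ and $\sigma_b$ depend only on normal training clips, no test clip or target-domain label enters the calibration, which is the property the referee asked the authors to confirm.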

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with direct held-out evaluation


The paper reports measured AUC differences from exhaustive cross-product experiments (4 backends × 3 poolings) on fixed DCASE 2023/2025 development sets using one frozen BEATs encoder. No equations, derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the provided text. All headline deltas (13.8 vs. 3.2 AUC points) and the proposed fusion's results are obtained by direct evaluation against target-domain test labels; the design contains no self-definitional reductions or smuggled ansatz. Zero flagged steps is the expected outcome for a controlled empirical ablation study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical comparative study; no free parameters are fitted to produce the central claim, no new axioms beyond standard statistical evaluation of AUC are invoked, and no new entities are postulated.



Reference graph

Works this paper leans on

37 extracted references · 2 linked inside Pith

[1]

INTRODUCTION Unsupervised ASD for machine condition monitoring, as standardized by the DCASE Task 2 series [1, 2, 3, 4], asks a system to detect anomalous sounds having heard only normal clips of a machine. Since 2022 the task has emphasized domain shift: the bank of normal training clips is dominated by a source domain (990 clips), while only 10 clips...

Pith/arXiv arXiv 2022
[2]

Training-free ASD pipeline. For each machine type we are given $N = 1{,}000$ normal training clips (990 source, 10 target)

METHOD 2.1. Training-free ASD pipeline. For each machine type we are given $N = 1{,}000$ normal training clips (990 source, 10 target). A frozen encoder maps a clip $x$ to frame features $F(x) \in \mathbb{R}^{T \times d}$; a temporal pooling $\phi$ produces the clip embedding $e = \phi(F(x)) \in \mathbb{R}^{d}$. The bank is $B = \{e_1, \dots, e_N\}$. At test time, a backend $s(\cdot)$ maps a test embedding to a scalar anomaly score. We ...
[3]

EXPERIMENTS Setup. DCASE 2023 Task 2 development set [3] (ToyADMOS2

2023
[4]

Audio is 16 kHz; clips are 10 s (ToyCar/ToyTrain: first 10 of 12 s)

and MIMII DG [31] recordings), all seven machines; per machine, 1,000 normal training clips and 200 test clips (50 normal + 50 anomalous per domain). Audio is 16 kHz; clips are 10 s (ToyCar/ToyTrain: first 10 of 12 s). Following the official protocol [3], AUC is computed per domain using that domain’s normal clips against all anomalous clips of the section, an...

2025
[5]

CONCLUSION With backbone, bank, and pooling held fixed, the scoring backend is the dominant, and underexamined, design choice for domain-robust training-free ASD with a frozen BEATs backbone: it moves target-domain AUC 4–6× more than temporal pooling, its machine-dependent strengths are stable across benchmark editions, and a simple training-bank-calibra...

2023
[6]

Description and discussion on DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring,

Yuma Koizumi, Yohei Kawaguchi, Keisuke Imoto, et al., “Description and discussion on DCASE2020 challenge task2: Unsupervised anomalous sound detection for machine condition monitoring,” in Proc. DCASE Workshop, 2020

2020
[7]

Description and discussion on DCASE 2022 challenge task 2: Unsupervised anomalous sound detection for machine condition monitoring applying domain generalization techniques,

Kota Dohi, Keisuke Imoto, Noboru Harada, et al., “Description and discussion on DCASE 2022 challenge task 2: Unsupervised anomalous sound detection for machine condition monitoring applying domain generalization techniques,” in Proc. DCASE Workshop, 2022

2022
[8]

Description and discussion on DCASE 2023 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,

Kota Dohi, Keisuke Imoto, Noboru Harada, Daisuke Niizumi, Yuma Koizumi, Tomoya Nishida, Harsh Purohit, Takashi Endo, and Yohei Kawaguchi, “Description and discussion on DCASE 2023 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,” in Proc. DCASE Workshop, 2023

2023
[9]

Description and discussion on DCASE 2025 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,

Tomoya Nishida, Noboru Harada, Daisuke Niizumi, Davide Albertini, Roberto Sannino, Simone Pradolini, Filippo Augusti, Keisuke Imoto, Kota Dohi, Harsh Purohit, Takashi Endo, and Yohei Kawaguchi, “Description and discussion on DCASE 2025 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,” in Proc. DCASE Wo...

2025
[10]

On using pre-trained embeddings for detecting anomalous sounds with limited training data,

Kevin Wilkinghoff and Fabian Fritz, “On using pre-trained embeddings for detecting anomalous sounds with limited training data,” in Proc. EUSIPCO, 2023, pp. 186–190

2023
[11]

Temporal pooling strategies for training-free anomalous sound detection with self-supervised audio embeddings,

Kevin Wilkinghoff, Sarthak Yadav, and Zheng-Hua Tan, “Temporal pooling strategies for training-free anomalous sound detection with self-supervised audio embeddings,” arXiv:2603.04605, 2026

Pith/arXiv arXiv 2026
[12]

BEATs: Audio pre-training with acoustic tokenizers,

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, and Furu Wei, “BEATs: Audio pre-training with acoustic tokenizers,” in Proc. ICML, 2023

2023
[13]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. NeurIPS, 2020

2020
[14]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3451–3460, 2021

2021
[15]

AST: Audio spectrogram transformer,

Yuan Gong, Yu-An Chung, and James Glass, “AST: Audio spectrogram transformer,” in Proc. Interspeech, 2021

2021
[16]

PANNs: Large-scale pre-trained audio neural networks for audio pattern recognition,

Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley, “PANNs: Large-scale pre-trained audio neural networks for audio pattern recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 2880–2894, 2020

2020
[17]

Deep nearest neighbor anomaly detection,

Liron Bergman, Niv Cohen, and Yedid Hoshen, “Deep nearest neighbor anomaly detection,” arXiv:2002.10445, 2020

arXiv 2020
[18]

Modeling the distribution of normal data in pre-trained deep features for anomaly detection,

Oliver Rippel, Patrick Mertens, and Dorit Merhof, “Modeling the distribution of normal data in pre-trained deep features for anomaly detection,” in Proc. ICPR, 2021

2021
[19]

PaDiM: A patch distribution modeling framework for anomaly detection and localization,

Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Romaric Audigier, “PaDiM: A patch distribution modeling framework for anomaly detection and localization,” in Proc. ICPR Workshops, 2021

2021
[20]

Towards total recall in industrial anomaly detection,

Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler, “Towards total recall in industrial anomaly detection,” in Proc. CVPR, 2022

2022
[21]

AnomalyDINO: Boosting patch-based few-shot anomaly detection with DINOv2,

Simon Damm, Mike Laszkiewicz, Johannes Lederer, and Asja Fischer, “AnomalyDINO: Boosting patch-based few-shot anomaly detection with DINOv2,” in Proc. WACV, 2025, pp. 1319–1329

2025
[22]

MuSc: Zero-shot industrial anomaly classification and segmentation with mutual scoring of the unlabeled images,

Xurui Li, Ziming Huang, Feng Xue, and Yu Zhou, “MuSc: Zero-shot industrial anomaly classification and segmentation with mutual scoring of the unlabeled images,” in Proc. ICLR, 2024

2024
[23]

AnoPatch: Towards better consistency in machine anomalous sound detection,

Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, and Pingyi Fan, “AnoPatch: Towards better consistency in machine anomalous sound detection,” in Proc. Interspeech, 2024, pp. 107–111

2024
[24]

Design choices for learning embeddings from auxiliary tasks for domain generalization in anomalous sound detection,

Kevin Wilkinghoff, “Design choices for learning embeddings from auxiliary tasks for domain generalization in anomalous sound detection,” in Proc. ICASSP, 2023

2023
[25]

Efficient algorithms for mining outliers from large data sets,

Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim, “Efficient algorithms for mining outliers from large data sets,” in Proc. ACM SIGMOD, 2000

2000
[26]

On the generalised distance in statistics,

Prasanta Chandra Mahalanobis, “On the generalised distance in statistics,” Proc. National Institute of Sciences of India, vol. 2, no. 1, pp. 49–55, 1936

1936
[27]

LOF: Identifying density-based local outliers,

Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander, “LOF: Identifying density-based local outliers,” in Proc. ACM SIGMOD, 2000

2000
[28]

Kernel PCA for novelty detection,

Heiko Hoffmann, “Kernel PCA for novelty detection,” Pattern Recognition, vol. 40, no. 3, pp. 863–874, 2007

2007
[29]

Fine-tuning CNN image retrieval with no human annotation,

Filip Radenović, Giorgos Tolias, and Ondřej Chum, “Fine-tuning CNN image retrieval with no human annotation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 7, pp. 1655–1668, 2019

2019
[30]

A well-conditioned estimator for large-dimensional covariance matrices,

Olivier Ledoit and Michael Wolf, “A well-conditioned estimator for large-dimensional covariance matrices,” Journal of Multivariate Analysis, vol. 88, no. 2, pp. 365–411, 2004

2004
[31]

Local density-based anomaly score normalization for domain generalization,

Kevin Wilkinghoff, Haici Yang, Janek Ebbers, François G. Germain, Gordon Wichern, and Jonathan Le Roux, “Local density-based anomaly score normalization for domain generalization,” IEEE Trans. Audio, Speech, Lang. Process., vol. 33, pp. 4642–4652, 2025

2025
[32]

Score normalization in multimodal biometric systems,

Anil Jain, Karthik Nandakumar, and Arun Ross, “Score normalization in multimodal biometric systems,” Pattern Recognition, vol. 38, no. 12, pp. 2270–2285, 2005

2005
[33]

On combining classifiers,

Josef Kittler, Mohamad Hatef, Robert P. W. Duin, and Jiri Matas, “On combining classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, 1998

1998
[34]

Theoretical foundations and algorithms for outlier ensembles,

Charu C. Aggarwal and Saket Sathe, “Theoretical foundations and algorithms for outlier ensembles,” ACM SIGKDD Explorations, vol. 17, no. 1, pp. 24–47, 2015

2015
[35]

ToyADMOS2: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions,

Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito, “ToyADMOS2: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions,” in Proc. DCASE Workshop, 2021

2021
[36]

MIMII DG: Sound dataset for malfunctioning industrial machine investigation and inspection for domain generalization task,

Kota Dohi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Masaaki Yamamoto, Yuki Nikaido, and Yohei Kawaguchi, “MIMII DG: Sound dataset for malfunctioning industrial machine investigation and inspection for domain generalization task,” in Proc. DCASE Workshop, 2022

2022
[37]

First-shot anomaly sound detection for machine condition monitoring: A domain generalization baseline,

Noboru Harada, Daisuke Niizumi, Yasunori Ohishi, Daiki Takeuchi, and Masahiro Yasuda, “First-shot anomaly sound detection for machine condition monitoring: A domain generalization baseline,” in Proc. EUSIPCO, 2023, pp. 191–195

2023

[1] [1]

INTRODUCTION Unsupervised ASD for machine condition monitoring, as standard- ized by the DCASE Task 2 series [1, 2, 3, 4], asks a system to de- tect anomalous sounds having heard onlynormalclips of a machine. Since 2022 the task has emphasizeddomain shift: the bank of nor- mal training clips is dominated by a source domain (990 clips), while only 10 clips...

Pith/arXiv arXiv 2022

[2] [2]

Training-free ASD pipeline For each machine type we are givenN=1,000normal training clips (990source,10target)

METHOD 2.1. Training-free ASD pipeline For each machine type we are givenN=1,000normal training clips (990source,10target). A frozen encoder maps a clipxto frame featuresF(x)∈R T×d ; a temporal poolingϕproduces the clip em- beddinge=ϕ(F(x))∈R d. The bank isB={e 1, . . . ,eN }. At test time, a backends(·)maps a test embedding to a scalar anomaly score. We ...

[3] [3]

EXPERIMENTS Setup.DCASE 2023 Task 2 development set [3] (ToyADMOS2

2023

[4] [4]

Audio is 16 kHz; clips are 10 s (Toy- Car/ToyTrain: first 10 of 12 s)

and MIMII DG [31] recordings), all seven machines; per ma- chine,1,000normal training clips and200test clips (50normal+50 anomalous per domain). Audio is 16 kHz; clips are 10 s (Toy- Car/ToyTrain: first 10 of 12 s). Following the official protocol [3], AUC is computed per domain using that domain’s normal clips againstallanomalous clips of the section, an...

2025

[5] [5]

CONCLUSION With backbone, bank, and pooling held fixed, the scoring backend is the dominant—and underexamined—design choice for domain- robust training-free ASD with a frozen BEATs backbone: it moves target-domain AUC 4–6×more than temporal pooling, its machine- dependent strengths are stable across benchmark editions, and a sim- ple training-bank-calibra...

2023

[6] [6]

De- scription and discussion on DCASE2020 challenge task2: Un- supervised anomalous sound detection for machine condition monitoring,

Yuma Koizumi, Yohei Kawaguchi, Keisuke Imoto, et al., “De- scription and discussion on DCASE2020 challenge task2: Un- supervised anomalous sound detection for machine condition monitoring,” inProc. DCASE Workshop, 2020

2020

[7] [7]

Description and discussion on DCASE 2022 challenge task 2: Unsuper- vised anomalous sound detection for machine condition mon- itoring applying domain generalization techniques,

Kota Dohi, Keisuke Imoto, Noboru Harada, et al., “Description and discussion on DCASE 2022 challenge task 2: Unsuper- vised anomalous sound detection for machine condition mon- itoring applying domain generalization techniques,” inProc. DCASE Workshop, 2022

2022

[8] [8]

Description and discussion on DCASE 2023 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,

Kota Dohi, Keisuke Imoto, Noboru Harada, Daisuke Niizumi, Yuma Koizumi, Tomoya Nishida, Harsh Purohit, Takashi Endo, and Yohei Kawaguchi, “Description and discussion on DCASE 2023 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,” inProc. DCASE Workshop, 2023

2023

[9] [9]

Description and discussion on DCASE 2025 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,

Tomoya Nishida, Noboru Harada, Daisuke Niizumi, Davide Albertini, Roberto Sannino, Simone Pradolini, Filippo Au- gusti, Keisuke Imoto, Kota Dohi, Harsh Purohit, Takashi Endo, and Yohei Kawaguchi, “Description and discussion on DCASE 2025 challenge task 2: First-shot unsupervised anomalous sound detection for machine condition monitoring,” inProc. DCASE Wo...

2025

[10] [10]

On using pre-trained em- beddings for detecting anomalous sounds with limited training data,

Kevin Wilkinghoff and Fabian Fritz, “On using pre-trained em- beddings for detecting anomalous sounds with limited training data,” inProc. EUSIPCO, 2023, pp. 186–190

2023

[11] [11]

Temporal pooling strategies for training-free anomalous sound detection with self-supervised audio embeddings,

Kevin Wilkinghoff, Sarthak Yadav, and Zheng-Hua Tan, “Temporal pooling strategies for training-free anomalous sound detection with self-supervised audio embeddings,” arXiv:2603.04605, 2026

Pith/arXiv arXiv 2026

[12] [12]

BEATs: Audio pre- training with acoustic tokenizers,

Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, and Furu Wei, “BEATs: Audio pre- training with acoustic tokenizers,” inProc. ICML, 2023

2023

[13] [13]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

Alexei Baevski, Yuhao Zhou, Abdelrahman Mohamed, and Michael Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” inProc. NeurIPS, 2020

2020

[14] [14]

HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,

Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed, “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units,”IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 3451–3460, 2021

2021

[15] [15]

AST: Audio spectrogram transformer,

Yuan Gong, Yu-An Chung, and James Glass, “AST: Audio spectrogram transformer,” inProc. Interspeech, 2021

2021

[16] [16]

PANNs: Large-scale pre- trained audio neural networks for audio pattern recognition,

Qiuqiang Kong, Yin Cao, Turab Iqbal, Yuxuan Wang, Wenwu Wang, and Mark D. Plumbley, “PANNs: Large-scale pre- trained audio neural networks for audio pattern recognition,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 28, pp. 2880–2894, 2020

2020

[17] [17]

Deep nearest neighbor anomaly detection,

Liron Bergman, Niv Cohen, and Yedid Hoshen, “Deep nearest neighbor anomaly detection,” arXiv:2002.10445, 2020

arXiv 2002

[18] [18]

Modeling the distribution of normal data in pre-trained deep features for anomaly detection,

Oliver Rippel, Patrick Mertens, and Dorit Merhof, “Modeling the distribution of normal data in pre-trained deep features for anomaly detection,” inProc. ICPR, 2021

2021

[19] [19]

PaDiM: A patch distribution modeling frame- work for anomaly detection and localization,

Thomas Defard, Aleksandr Setkov, Angelique Loesch, and Ro- maric Audigier, “PaDiM: A patch distribution modeling frame- work for anomaly detection and localization,” inProc. ICPR Workshops, 2021

2021

[20] Karsten Roth, Latha Pemula, Joaquin Zepeda, Bernhard Schölkopf, Thomas Brox, and Peter Gehler, “Towards total recall in industrial anomaly detection,” in Proc. CVPR, 2022

[21] Simon Damm, Mike Laszkiewicz, Johannes Lederer, and Asja Fischer, “AnomalyDINO: Boosting patch-based few-shot anomaly detection with DINOv2,” in Proc. WACV, 2025, pp. 1319–1329

[22] Xurui Li, Ziming Huang, Feng Xue, and Yu Zhou, “MuSc: Zero-shot industrial anomaly classification and segmentation with mutual scoring of the unlabeled images,” in Proc. ICLR, 2024

[23] Anbai Jiang, Bing Han, Zhiqiang Lv, Yufeng Deng, Wei-Qiang Zhang, Xie Chen, Yanmin Qian, Jia Liu, and Pingyi Fan, “AnoPatch: Towards better consistency in machine anomalous sound detection,” in Proc. Interspeech, 2024, pp. 107–111

[24] Kevin Wilkinghoff, “Design choices for learning embeddings from auxiliary tasks for domain generalization in anomalous sound detection,” in Proc. ICASSP, 2023

[25] Sridhar Ramaswamy, Rajeev Rastogi, and Kyuseok Shim, “Efficient algorithms for mining outliers from large data sets,” in Proc. ACM SIGMOD, 2000

[26] Prasanta Chandra Mahalanobis, “On the generalised distance in statistics,” Proc. National Institute of Sciences of India, vol. 2, no. 1, pp. 49–55, 1936

[27] Markus M. Breunig, Hans-Peter Kriegel, Raymond T. Ng, and Jörg Sander, “LOF: Identifying density-based local outliers,” in Proc. ACM SIGMOD, 2000

[28] Heiko Hoffmann, “Kernel PCA for novelty detection,” Pattern Recognition, vol. 40, no. 3, pp. 863–874, 2007

[29] Filip Radenović, Giorgos Tolias, and Ondřej Chum, “Fine-tuning CNN image retrieval with no human annotation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 7, pp. 1655–1668, 2019

[30] Olivier Ledoit and Michael Wolf, “A well-conditioned estimator for large-dimensional covariance matrices,” Journal of Multivariate Analysis, vol. 88, no. 2, pp. 365–411, 2004

[31] Kevin Wilkinghoff, Haici Yang, Janek Ebbers, François G. Germain, Gordon Wichern, and Jonathan Le Roux, “Local density-based anomaly score normalization for domain generalization,” IEEE Trans. Audio, Speech, Lang. Process., vol. 33, pp. 4642–4652, 2025

[32] Anil Jain, Karthik Nandakumar, and Arun Ross, “Score normalization in multimodal biometric systems,” Pattern Recognition, vol. 38, no. 12, pp. 2270–2285, 2005

[33] Josef Kittler, Mohamad Hatef, Robert P. W. Duin, and Jiri Matas, “On combining classifiers,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 3, pp. 226–239, 1998

[34] Charu C. Aggarwal and Saket Sathe, “Theoretical foundations and algorithms for outlier ensembles,” ACM SIGKDD Explorations, vol. 17, no. 1, pp. 24–47, 2015

[35] Noboru Harada, Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Masahiro Yasuda, and Shoichiro Saito, “ToyADMOS2: Another dataset of miniature-machine operating sounds for anomalous sound detection under domain shift conditions,” in Proc. DCASE Workshop, 2021

[36] Kota Dohi, Tomoya Nishida, Harsh Purohit, Ryo Tanabe, Takashi Endo, Masaaki Yamamoto, Yuki Nikaido, and Yohei Kawaguchi, “MIMII DG: Sound dataset for malfunctioning industrial machine investigation and inspection for domain generalization task,” in Proc. DCASE Workshop, 2022

[37] Noboru Harada, Daisuke Niizumi, Yasunori Ohishi, Daiki Takeuchi, and Masahiro Yasuda, “First-shot anomaly sound detection for machine condition monitoring: A domain generalization baseline,” in Proc. EUSIPCO, 2023, pp. 191–195