arxiv: 2605.26068 · v3 · pith:RWLFDV6Tnew · submitted 2026-05-25 · 💻 cs.LG · cs.AI

Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark

Pith reviewed 2026-06-29 22:24 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords weak supervision · anomaly detection · benchmark · label noise · tabular foundation models · incomplete supervision · inexact supervision · inaccurate supervision

The pith

A single benchmark across weak supervision scenarios in anomaly detection finds strong intrinsic correlations between them and shows specialized methods lose to general models once labels exceed extreme scarcity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds WSADBench to test anomaly detection under incomplete, inexact, and inaccurate supervision within one evaluation framework instead of treating them as separate tracks. It runs more than 700,000 experiments that systematically change how many labels are available, how coarse they are, and how noisy they are, across dozens of algorithms and four data types. Results indicate the three supervision types share fundamental mechanisms rather than presenting distinct challenges. A sympathetic reader would care because this questions whether current research directions are pursuing truly independent problems and shows when investing in specialized anomaly detectors stops being worthwhile.

Core claim

WSADBench shows that the three primary weak supervision scenarios exhibit strong intrinsic correlations, that specialized WSAD algorithms only outperform others in the most extreme label-scarcity regimes and are quickly surpassed by tabular foundation models and general classification methods as supervision increases or in out-of-distribution cases, that unlabeled data yields inconsistent and marginal gains compared with label refinement, and that models display asymmetric sensitivity to different forms of label noise.

What carries the argument

WSADBench, the benchmark that applies standardized protocols for varying label quantity, granularity, and quality to compare 36 algorithms across four modalities in a unified way.
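
The three knobs named above (label quantity, granularity, and quality) can be illustrated with a minimal Python sketch. This is not WSADBench's code: the function names, the use of `None` for unlabeled points, the "a bag is anomalous if any member is" rule, and the toy data are all assumptions made for illustration.

```python
import random

def incomplete_labels(y, n_labeled_anomalies, rng):
    """Quantity: keep labels for a few anomalies; all other points become unlabeled (None)."""
    anomaly_idx = [i for i, label in enumerate(y) if label == 1]
    keep = set(rng.sample(anomaly_idx, min(n_labeled_anomalies, len(anomaly_idx))))
    return [1 if i in keep else None for i in range(len(y))]

def inexact_labels(y, bag_size):
    """Granularity: coarsen instance labels into bag labels (bag is 1 if any member is 1)."""
    return [max(y[i:i + bag_size]) for i in range(0, len(y), bag_size)]

def inaccurate_labels(y, flip_normal_ratio, flip_anomaly_ratio, rng):
    """Quality: flip normal and anomaly labels with separate probabilities."""
    noisy = []
    for label in y:
        p = flip_anomaly_ratio if label == 1 else flip_normal_ratio
        noisy.append(1 - label if rng.random() < p else label)
    return noisy

# Toy ground truth (hypothetical): 200 points, every 20th one is an anomaly.
rng = random.Random(0)
y_true = [1 if i % 20 == 0 else 0 for i in range(200)]

y_incomplete = incomplete_labels(y_true, n_labeled_anomalies=5, rng=rng)
y_inexact = inexact_labels(y_true, bag_size=10)
y_inaccurate = inaccurate_labels(y_true, 0.05, 0.25, rng)
```

Deriving all three weak-label views from one ground-truth vector is what makes a cross-scenario comparison possible, and it is also the design choice the referee report questions as a possible source of induced correlation.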

If this is right

  • Strong correlations between incomplete, inexact, and inaccurate supervision challenge the practice of isolating research on each direction.
  • Specialized WSAD algorithms are competitive only under extreme label scarcity and lose to foundation models and general classifiers otherwise or in OOD settings.
  • Unlabeled data provides inconsistent and smaller benefits than improving the quality or granularity of existing labels.
  • Model performance reacts differently to different kinds of label noise, with some noise types hurting more than others.
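
One way to see why the direction of a label flip can matter is simple counting under class imbalance. The sketch below is expected-count arithmetic on a hypothetical dataset (950 normals, 50 anomalies); the numbers and the function name are illustrative assumptions, and the paper may attribute the asymmetry to other mechanisms.

```python
def noisy_anomaly_label_stats(n_normal, n_anomaly, flip_normal_ratio, flip_anomaly_ratio):
    """Expected precision and recall of the noisy 'anomaly' label set after flips."""
    true_kept = n_anomaly * (1 - flip_anomaly_ratio)   # real anomalies still labeled anomaly
    false_added = n_normal * flip_normal_ratio         # normals mislabeled as anomaly
    labeled = true_kept + false_added
    precision = true_kept / labeled if labeled else 0.0
    recall = true_kept / n_anomaly if n_anomaly else 0.0
    return precision, recall

# Flip 10% of normals to "anomaly": the anomaly label set is flooded with false positives.
p_fn, r_fn = noisy_anomaly_label_stats(950, 50, flip_normal_ratio=0.10, flip_anomaly_ratio=0.0)

# Flip 10% of anomalies to "normal": the anomaly label set stays clean but shrinks.
p_fa, r_fa = noisy_anomaly_label_stats(950, 50, flip_normal_ratio=0.0, flip_anomaly_ratio=0.10)

print(f"flip normals:   precision={p_fn:.3f} recall={r_fn:.3f}")
print(f"flip anomalies: precision={p_fa:.3f} recall={r_fa:.3f}")
```

With these toy numbers the same 10% flip rate leaves about a third of the anomaly labels correct in one direction and all of them correct in the other, so equal nominal noise rates are not equal perturbations.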

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Research effort might shift toward methods designed to exploit the shared structure across supervision types rather than building separate tools for each.
  • Similar unified benchmarks could be useful in other weakly supervised domains to check whether apparent distinctions are real or artifacts of isolated evaluation.
  • In practice, teams may gain more by investing in label cleaning than by collecting additional unlabeled examples.
  • The observed asymmetry in noise sensitivity suggests targeted noise-robust training techniques could be developed for the most damaging noise types.

Load-bearing premise

The chosen collection of 36 algorithms, four modalities, and the particular ways of changing label quantity, granularity, and quality are representative enough to draw general conclusions about performance boundaries and correlations in real-world settings.

What would settle it

An independent replication that applies the same variation protocols to a substantially different set of algorithms or modalities and finds either no correlations between the supervision scenarios or continued dominance of specialized methods outside extreme scarcity would falsify the central claims.

Figures

Figures reproduced from arXiv: 2605.26068 by Chaochuan Hou, Hailiang Huang, Minqi Jiang, Shiping Wang, Shuang Liang, Siyuan Zhou, Songqiao Han, Xu Yao, Zhenbo Wu.

Figure 1: Overview of the WSADBench. It integrates datasets spanning diverse modalities and varied supervision scenarios into …
Figure 2: Tabular Foundation model performance (AUCPR) …
Figure 3: AUCPR results on tabular datasets under varying …
Figure 4: Model sensitivity to label noise: AUCPR degradation …
Figure 5: Analysis of incomplete OOD across 3 Settings.
Figure 6: Anomaly score decision boundaries on Metal_nut …
Figure 8: Model ranking radar chart.
… outperform specialized inexact approaches on video MIL tasks: DeepSAD achieves mean AUCPR 0.453, surpassing the specialized MIL model AR-Net’s AUCPR 0.441. Conversely, specialized inexact methods show mixed transferability to tabular MIL. Sultani transfers relatively well and ranks second, but it is still outperformed by TabPFN, while GCN-Anomaly remains less effective in both s…
Figure 9: Critical Difference (CD) diagrams under extremely limited supervision for each modality. The top row (a-c) compares …
Figure 10: Visualization of incomplete OOD results. Row …
Figure 11: Visualization of AUCROC and AUCPR distributions under three different incomplete OOD settings.
Figure 12: Heatmap of spearman correlations between dataset meta-features (denoted by symbols) and model performance …
Figure 13: AUCPR (a) and AUCROC (b) results on Tabular datasets under varying labeled ( …
Figure 14: Performance comparison under inaccurate conditions (label noise) on tabular datasets.
Figure 16: 3D surfaces visualizing AUCPR degradation of different models under varying flip normal ratios (FNR) and flip …
Original abstract

Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. However, these directions remain isolated, lacking a unified framework to assess whether they address unique challenges or share fundamental mechanisms. This paper introduces WSADBench, the first benchmark that unifies evaluation across distinct weakly supervised scenarios, benchmarking diverse approaches from specialized WSAD methods to advanced tabular foundation models. WSADBench establishes standardized protocols to evaluate 36 algorithms across 4 modalities by systematically varying label quantity, granularity, and quality, revealing the performance boundaries of various methods. Based on over 700K experiments, WSADBench reveals four critical insights: (i) Strong intrinsic correlations exist between these weak supervision scenarios, challenging the isolation of current research directions. (ii) Specialized WSAD algorithms excel only in extreme label-scarcity regimes but are quickly dominated by tabular foundation models and general classification methods as supervision increases or in OOD scenarios. (iii) Unlabeled data shows inconsistent utility across settings, with marginal gains compared to label refinement. (iv) Models exhibit asymmetric sensitivity to different types of label noise. We release WSADBench as an open-source benchmark with code and datasets to facilitate future WSAD research: https://github.com/SUFE-AILAB/WSADBench.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces WSADBench, the first unified benchmark for weakly supervised anomaly detection across incomplete, inexact, and inaccurate supervision. It evaluates 36 algorithms across 4 modalities using standardized protocols that systematically vary label quantity, granularity, and quality, conducting over 700K experiments to derive four insights: strong intrinsic correlations between the supervision scenarios, specialized WSAD methods being dominated by tabular foundation models and general classifiers outside extreme scarcity or in OOD settings, inconsistent utility of unlabeled data relative to label refinement, and asymmetric model sensitivity to label noise types. Code and datasets are released.

Significance. If the experimental choices prove representative, the work supplies large-scale empirical evidence that isolated WSAD research directions may share mechanisms and that method superiority is regime-specific, which could guide future algorithm development and evaluation standards. The scale of the benchmark and open release are clear strengths for reproducibility.

major comments (2)
  1. [Experimental setup] Experimental setup (protocols for label quantity/granularity/quality and algorithm selection): the four insights, particularly (i) on intrinsic correlations and (ii) on method dominance, rest on the assumption that the fixed set of 36 algorithms and simulation protocols are sufficiently representative. Reuse of the same base datasets and noise models across scenarios risks inducing the observed correlations as artifacts rather than intrinsic properties; a sensitivity analysis to alternative dataset families or method classes (e.g., recent graph-based WSAD) is needed to support generalizability.
  2. [Results and analysis] Results and analysis sections: insight (iii) states unlabeled data shows 'marginal gains' and 'inconsistent utility,' yet no quantitative definition of marginal gain, statistical significance tests, or comparison baselines against label refinement are provided, weakening the claim that refinement is preferable.
minor comments (1)
  1. [Abstract] Abstract and introduction: the claim of 'strong intrinsic correlations' would benefit from a brief parenthetical on the correlation metric (e.g., Spearman rank or Pearson on performance surfaces) used to establish them.
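
To make the minor comment concrete, here is a minimal sketch of a Spearman rank correlation between per-model scores in two supervision scenarios. The source does not say which metric the paper uses for this claim, so treating it as Spearman is an assumption, and the AUCPR values below are invented for illustration.

```python
def average_ranks(values):
    """Rank values from 1..n in ascending order, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5

def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical AUCPR of the same five models under two supervision scenarios.
aucpr_incomplete = [0.62, 0.55, 0.48, 0.40, 0.33]
aucpr_inaccurate = [0.58, 0.57, 0.41, 0.43, 0.30]
rho = spearman(aucpr_incomplete, aucpr_inaccurate)
print(round(rho, 3))
```

A rank-based metric answers "do the same models win in both scenarios?", while Pearson on raw scores also rewards matching score magnitudes; stating which one was used changes how "strong intrinsic correlations" should be read.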

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and recommendation for minor revision. We address the major comments point by point below.

Point-by-point responses
  1. Referee: [Experimental setup] Experimental setup (protocols for label quantity/granularity/quality and algorithm selection): the four insights, particularly (i) on intrinsic correlations and (ii) on method dominance, rest on the assumption that the fixed set of 36 algorithms and simulation protocols are sufficiently representative. Reuse of the same base datasets and noise models across scenarios risks inducing the observed correlations as artifacts rather than intrinsic properties; a sensitivity analysis to alternative dataset families or method classes (e.g., recent graph-based WSAD) is needed to support generalizability.

    Authors: Our selection of 36 algorithms spans specialized WSAD methods, general classifiers, and tabular foundation models across four modalities to represent core paradigms. Consistent base datasets and noise models are required to isolate the effects of supervision type and enable the benchmark's unification goal. The correlations and dominance patterns hold consistently across 700K experiments and multiple modalities, indicating intrinsic properties. We will add a dedicated paragraph in the revised manuscript discussing the scope of our algorithm and dataset choices and outlining directions for future sensitivity analyses. revision: partial

  2. Referee: [Results and analysis] Results and analysis sections: insight (iii) states unlabeled data shows 'marginal gains' and 'inconsistent utility,' yet no quantitative definition of marginal gain, statistical significance tests, or comparison baselines against label refinement are provided, weakening the claim that refinement is preferable.

    Authors: We agree that insight (iii) requires more rigorous quantification to be fully convincing. In the revision we will introduce an explicit definition of marginal gains (relative improvement below 5%), report statistical significance via paired tests across runs, and add direct side-by-side comparisons of unlabeled-data utility versus label-refinement baselines. revision: yes
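
The quantification promised in this response could look roughly like the sketch below. The 5% threshold comes from the rebuttal; everything else is an assumption: the rebuttal does not name a specific paired test, so an exact paired sign-flip permutation test stands in here, and the function names and example values are hypothetical.

```python
from itertools import product

def relative_improvement(baseline, candidate):
    """Relative change of candidate over baseline, e.g. 0.025 means +2.5%."""
    return (candidate - baseline) / baseline

def is_marginal(baseline, candidate, threshold=0.05):
    """True if the relative improvement is below the threshold (includes regressions)."""
    return relative_improvement(baseline, candidate) < threshold

def paired_sign_flip_pvalue(a, b):
    """Exact two-sided paired permutation test on the sum of per-run differences.

    Enumerates all 2^n sign assignments, so it is only practical for small n.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    count = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        stat = abs(sum(s * d for s, d in zip(signs, diffs)))
        if stat >= observed - 1e-12:  # small tolerance for floating-point ties
            count += 1
    return count / total

# Hypothetical per-seed AUCPR for one dataset: with vs without unlabeled data.
with_unlabeled = [0.412, 0.405, 0.420, 0.398, 0.415]
labels_only = [0.400, 0.401, 0.410, 0.399, 0.409]

mean_with = sum(with_unlabeled) / len(with_unlabeled)
mean_without = sum(labels_only) / len(labels_only)
print(is_marginal(mean_without, mean_with))
print(paired_sign_flip_pvalue(with_unlabeled, labels_only))
```

Pairing by run matters because the same seeds and splits are reused across conditions; with only a handful of runs the smallest achievable two-sided p-value is 2 / 2^n, which bounds how strong any significance claim can be.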

Circularity Check

0 steps flagged

Empirical benchmark with no derivation chain or self-referential reductions

full rationale

The paper conducts a large-scale empirical evaluation of 36 algorithms on 4 modalities using standardized protocols for label quantity, granularity, and quality, generating results from external algorithm implementations and public datasets. No equations, fitted parameters, or predictions are defined in terms of the paper's own outputs. The four insights are direct summaries of experimental performance surfaces rather than reductions to self-definitions, self-citations, or ansatzes. Self-citations, if present, are not load-bearing for any claimed derivation. This is a standard non-circular benchmark paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

This is an empirical benchmarking study with no mathematical derivations, fitted constants, or new postulated entities. It relies on the domain assumption that the selected algorithms and modalities adequately sample the space of WSAD methods.

axioms (1)
  • domain assumption: The 36 algorithms and 4 modalities are representative of current WSAD practice.
    Invoked when generalizing the four insights beyond the specific experimental runs.

pith-pipeline@v0.9.1-grok · 5787 in / 1218 out tokens · 38588 ms · 2026-06-29T22:24:11.101441+00:00 · methodology

