pith. machine review for the scientific record

arxiv: 2605.13642 · v1 · submitted 2026-05-13 · 📊 stat.ML · cs.LG · stat.CO

Recognition: no theorem link

Conformal Anomaly Detection in Python: Moving Beyond Heuristic Thresholds with 'nonconform'

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 17:43 UTC · model grok-4.3

classification 📊 stat.ML · cs.LG · stat.CO
keywords conformal anomaly detection · Python package · p-values · anomaly detection · scikit-learn integration · false discovery rate · exchangeability

The pith

The nonconform package converts anomaly scores into calibrated p-values valid under data exchangeability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Most anomaly detection systems produce raw scores that force users to pick thresholds without statistical grounding. Conformal anomaly detection solves this by turning scores into p-values whose validity is guaranteed when data points can be treated as exchangeable. The paper presents the nonconform Python package, which wraps detectors from scikit-learn and pyod and supplies a single interface for calibration, p-value computation, and false-discovery-rate control. It walks through basic split-conformal methods up to more efficient and shift-aware variants, backed by code examples and empirical checks. The result is a practical route to anomaly decisions that carry explicit error-rate guarantees inside standard machine-learning pipelines.

Core claim

The paper claims that the nonconform package implements conformal anomaly detection methods that convert existing anomaly scores into p-values with valid coverage under the exchangeability assumption; that it supports multiple calibration strategies, integrates directly with scikit-learn and pyod, and enables false-discovery-rate control; and that it thereby allows statistically principled anomaly detection without heuristic thresholds.

What carries the argument

The nonconform package, which supplies a unified interface that takes any anomaly detector, applies conformal calibration to produce p-values, and supports false-discovery-rate control.
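The recipe that interface wraps can be sketched from scratch. The following is a minimal split-conformal p-value computation using scikit-learn's IsolationForest as the score-producing detector; it illustrates the statistical step the package automates, and deliberately does not assume nonconform's actual class or method names, which the review does not show.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)

# Training and calibration sets drawn from the same nominal distribution (exchangeable).
X_train = rng.normal(size=(500, 2))
X_calib = rng.normal(size=(200, 2))
# Test batch: five inliers followed by five far-out anomalies.
X_test = np.vstack([rng.normal(size=(5, 2)),
                    rng.normal(loc=6.0, size=(5, 2))])

# Any score-producing detector works; flip the sign so higher = more anomalous.
det = IsolationForest(random_state=0).fit(X_train)
calib_scores = -det.score_samples(X_calib)
test_scores = -det.score_samples(X_test)

# Split-conformal p-value: (1 + #{calibration scores >= test score}) / (n + 1).
n = len(calib_scores)
p_values = (1 + (calib_scores[None, :] >= test_scores[:, None]).sum(axis=1)) / (n + 1)
print(p_values.round(3))
```

Note the floor: with 200 calibration points the smallest attainable p-value is 1/201 ≈ 0.005, which is what the obvious anomalies receive.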

If this is right

  • Users can replace arbitrary score thresholds with explicit p-value cutoffs that control error rates.
  • The same detector code works for both point-anomaly and group-anomaly tasks once wrapped by nonconform.
  • Data-efficient and shift-aware conformal strategies inside the package reduce the calibration data needed and tolerate covariate shift between calibration and deployment.
  • False-discovery-rate procedures become directly applicable to anomaly lists produced by the package.
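The FDR step in the last bullet is, in its standard form, the Benjamini-Hochberg procedure applied to the conformal p-values. A from-scratch sketch (the package's own FDR interface is not shown in the review):

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.1):
    """Boolean mask of discoveries under Benjamini-Hochberg at nominal FDR alpha."""
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    passed = p[order] <= alpha * np.arange(1, m + 1) / m
    keep = np.zeros(m, dtype=bool)
    if passed.any():
        k = np.max(np.nonzero(passed)[0])  # largest rank whose p-value passes
        keep[order[:k + 1]] = True         # reject everything up to that rank
    return keep

# Hypothetical conformal p-values: three likely anomalies, five roughly uniform inliers.
p = np.array([0.001, 0.004, 0.009, 0.32, 0.55, 0.71, 0.84, 0.93])
print(benjamini_hochberg(p, alpha=0.1))  # flags exactly the first three
```

Because conformal p-values from a shared calibration set satisfy the positive-dependence condition under which BH is valid, this combination is what yields the finite-sample FDR guarantee the review refers to.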

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same wrapper pattern could be reused for other conformal guarantees such as prediction intervals.
  • Empirical checks on mildly non-exchangeable data would quantify how quickly coverage degrades.
  • Production pipelines could log the p-value distribution as a continuous monitor of exchangeability.

Load-bearing premise

The p-values remain valid only under the assumption of data exchangeability.

What would settle it

Generate p-values on a large set of exchangeable inliers with no anomalies present; the resulting p-values must be approximately uniform on (0, 1] (more precisely super-uniform: P(p ≤ t) ≤ t for every t), or the calibration guarantee is violated.
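That settling experiment is cheap to run. A sketch using scikit-learn's IsolationForest and a from-scratch split-conformal step, with all splits drawn from one distribution and no anomalies injected, so the p-values should come out close to uniform:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(1)

# All three splits i.i.d. from one distribution: the exchangeable, anomaly-free case.
X_train = rng.normal(size=(1000, 3))
X_calib = rng.normal(size=(1000, 3))
X_test = rng.normal(size=(2000, 3))

det = IsolationForest(random_state=1).fit(X_train)
calib = -det.score_samples(X_calib)   # flip sign: higher = more anomalous
test = -det.score_samples(X_test)

n = len(calib)
p = (1 + (calib[None, :] >= test[:, None]).sum(axis=1)) / (n + 1)

# Uniform p-values have mean ~0.5 and put ~alpha mass below any cutoff alpha.
print(f"mean = {p.mean():.3f}, P(p <= 0.1) = {(p <= 0.1).mean():.3f}")
```

A systematic excess of small p-values in this anomaly-free setting would falsify the calibration claim directly.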

Figures

Figures reproduced from arXiv: 2605.13642 by Christine Preisach, Maximilian Kirsch, Oliver Hennhöfer.

Figure 1
Figure 1. Recall and FDR depend on the calibration-set size. Top: distribution of recall at nominal FDR level α = 0.1 for training-set sizes |Dtrain| ∈ {250, 500, 1000}. Bottom: empirical FDR across corresponding nominal FDR levels. Depicted are the standard conformal and the probabilistic (P.) approach. The Split variant is calibrated on Dtrain/2. Results are averaged over 50 randomized trials. JaB+ uses n bootst… view at source ↗
Figure 2
Figure 2. Calibration-conditional error guarantees are stricter than marginal error guarantees. Left: 90th-percentile empirical FDR across 20 randomized trials as a function of the nominal FDR target, illustrating tail behavior under conditional calibration with δ = 0.1, with marginal control as reference. Right: distribution of recall across trials at nominal level α = 0.1. Methods correspond to JaB+ with marg… view at source ↗
Figure 3
Figure 3. Importance weighting restores valid FDR control under covariate shift. Left: density of the standardized first principal-component shift score demonstrating covariate shift between the calibration and target distribution. Right: distributions of empirical FDR and recall at nominal level α = 0.1 for uniform weighting, estimated weighting (Random Forest), and Oracle weighting. … view at source ↗
Figure 4
Figure 4. Exchangeability martingales detect distributional change in an online stream. A stream of 2,000 observations is processed with an Isolation Forest, producing p-values that are approximately uniform before the change point and become non-uniform as the anomaly rate increases linearly from 0% to 100% after t = 1000. The standard martingale and restarted Ville martingale evaluate the p-values sequentially and… view at source ↗
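The monitoring idea in Figure 4 can be sketched with the simple power martingale: bet ε·p^(ε−1) on each incoming p-value, so that sustained small p-values compound the wealth. A log-space toy version on synthetic p-values follows; it is not the package's martingale implementation, just the underlying arithmetic under an assumed change point at t = 1000.

```python
import numpy as np

def log_power_martingale(p_values, epsilon=0.5):
    """Cumulative log-wealth of the power martingale with betting factor eps * p**(eps - 1).

    Under exchangeability the p-values are roughly uniform and the log-wealth drifts
    down; once small p-values dominate the stream, it climbs, flagging the change."""
    p = np.asarray(p_values)
    return np.cumsum(np.log(epsilon) + (epsilon - 1) * np.log(p))

rng = np.random.default_rng(2)
p_before = rng.uniform(size=1000)          # exchangeable regime: uniform p-values
p_after = rng.uniform(size=1000) ** 3      # post-change: p-values piled near zero
log_wealth = log_power_martingale(np.concatenate([p_before, p_after]))
print(f"log-wealth at change point: {log_wealth[999]:.1f}, at end: {log_wealth[-1]:.1f}")
```

Working in log space avoids the floating-point overflow that the raw product would hit on long post-change streams.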
read the original abstract

Most anomaly detection systems output scores rather than calibrated decisions, leaving practitioners to choose thresholds heuristically and without clear statistical interpretation. Conformal anomaly detection addresses this limitation by converting anomaly scores into calibrated p-values that are valid under the statistical assumption of data exchangeability, with a growing literature extending this idea beyond that setting. We present 'nonconform', a Python package for applying conformal anomaly detection within existing machine-learning workflows, and use it as the basis for an implementation-grounded introduction to the field. The package integrates with 'scikit-learn', 'pyod', and custom anomaly detectors, and provides a unified interface for calibration, p-value generation, and false discovery rate control. It supports several conformalization strategies, ranging from simple split-conformal calibration to more data-efficient and shift-aware extensions. Through a progression from foundational concepts to advanced conformalization strategies, complemented by code examples, the paper connects the statistical ideas behind conformal anomaly detection to their practical use in 'nonconform'. Empirical results demonstrate that the implemented methods enable statistically principled anomaly detection. Together, the package and exposition aim to make core conformal anomaly detection workflows more accessible and reproducible in experimental and production-oriented settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces the 'nonconform' Python package for conformal anomaly detection. It integrates with scikit-learn and pyod to convert anomaly scores into exchangeability-calibrated p-values, supports split-conformal and shift-aware conformalization strategies, provides FDR control, and includes code examples plus empirical results that claim to demonstrate statistically principled anomaly detection.

Significance. If the empirical claims hold under realistic conditions, the package would make calibrated anomaly detection more accessible and reproducible for practitioners, reducing reliance on heuristic thresholds. The unified interface and support for extensions beyond basic exchangeability represent a practical contribution that could encourage wider adoption of conformal methods in anomaly detection workflows.

major comments (2)
  1. [Empirical evaluation] Empirical evaluation section: The reported results demonstrate p-value calibration and FDR control only on synthetic data under full exchangeability. Because anomaly detection by definition involves non-exchangeable anomalies and real deployments frequently exhibit mild distribution shift or dependence, the experiments must include tests under such violations to substantiate the central claim that the methods enable 'statistically principled anomaly detection' in practice; otherwise the empirical support is weaker than stated.
  2. [§4] §4 (conformalization strategies): The description of shift-aware extensions is too brief to allow readers to assess whether they preserve marginal validity or FDR control when exchangeability is mildly violated; explicit statements of the assumptions retained by each strategy and a small simulation under controlled shift would strengthen the exposition.
minor comments (2)
  1. [Abstract] The abstract states that the package 'supports several conformalization strategies' but does not name them; listing the supported methods (e.g., split-conformal, inductive, etc.) would improve clarity.
  2. [Tutorial sections] Code examples in the tutorial sections would benefit from explicit handling of the case when the calibration set is empty or too small, as this is a common practical failure mode.
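The failure mode in the second minor comment has a hard floor worth guarding explicitly: the smallest split-conformal p-value is 1/(n+1), so a calibration set smaller than 1/α − 1 points can never flag anything at level α. A hypothetical guard in that spirit (not part of nonconform's documented API):

```python
import math

def check_calibration_size(n_calib, alpha=0.1):
    """Raise if the calibration set cannot support detections at level alpha.

    The smallest attainable split-conformal p-value is 1 / (n_calib + 1)."""
    if n_calib <= 0:
        raise ValueError("empty calibration set: conformal p-values are undefined")
    min_p = 1.0 / (n_calib + 1)
    if min_p > alpha:
        needed = math.ceil(1.0 / alpha) - 1
        raise ValueError(
            f"calibration set too small: smallest attainable p-value {min_p:.4f} "
            f"exceeds alpha={alpha}; need at least {needed} calibration points"
        )
    return min_p

print(check_calibration_size(199, alpha=0.01))  # smallest p-value 0.005, acceptable
```

Surfacing this as an error rather than silently returning p-values that can never cross the threshold would address the practical failure mode the referee describes.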

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and have made revisions to strengthen the empirical support and exposition as suggested.

read point-by-point responses
  1. Referee: [Empirical evaluation] Empirical evaluation section: The reported results demonstrate p-value calibration and FDR control only on synthetic data under full exchangeability. Because anomaly detection by definition involves non-exchangeable anomalies and real deployments frequently exhibit mild distribution shift or dependence, the experiments must include tests under such violations to substantiate the central claim that the methods enable 'statistically principled anomaly detection' in practice; otherwise the empirical support is weaker than stated.

    Authors: We agree that the original experiments focused on the exchangeable setting to isolate and validate the core calibration guarantees. To better support the claim of practical utility, the revised manuscript now includes additional simulation studies under controlled distribution shifts and mild temporal dependence. These new results show that the p-values remain approximately valid and FDR control is largely preserved for the shift-aware strategies, consistent with the theoretical literature on conformal methods under mild violations. revision: yes

  2. Referee: [§4] §4 (conformalization strategies): The description of shift-aware extensions is too brief to allow readers to assess whether they preserve marginal validity or FDR control when exchangeability is mildly violated; explicit statements of the assumptions retained by each strategy and a small simulation under controlled shift would strengthen the exposition.

    Authors: We accept this criticism. Section 4 has been expanded with explicit statements of the assumptions retained by each conformalization strategy (e.g., the shift-aware methods require an estimable shift model but preserve marginal validity for the calibration set under exchangeability). A small controlled-shift simulation has also been added to illustrate the empirical behavior of p-value calibration and FDR control under mild violations. revision: yes

Circularity Check

0 steps flagged

No circularity: package applies established conformal procedures without deriving new quantities from fits

full rationale

The paper presents an implementation of existing conformal anomaly detection methods (split-conformal calibration, p-value generation, FDR control) that integrate with scikit-learn and pyod. No equations derive new predictions from fitted parameters by construction, no uniqueness theorems are imported via self-citation, and no ansatz is smuggled in. The empirical demonstrations are presented as applications of the standard exchangeability assumption rather than forced outputs. The derivation chain is therefore self-contained and checkable against external benchmarks in the conformal prediction literature.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard exchangeability assumption required for conformal p-value validity; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption Data exchangeability is required for the conformal p-values to be valid
    Explicitly referenced in the abstract as the basis for calibrated p-values.

pith-pipeline@v0.9.0 · 5518 in / 1047 out tokens · 29747 ms · 2026-05-14T17:43:16.682774+00:00 · methodology

