Benchmarking non-conformity score functions in conformal prediction
Pith reviewed 2026-06-30 11:37 UTC · model grok-4.3
The pith
A new method for measuring prediction set sizes enables direct comparison of non-conformity score functions in conformal prediction.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce an original method of evaluating the prediction set sizes of conformal predictors and use it to provide a comparison between non-conformity score functions. We also examine efficacy of different non-conformity score functions for class-conditional conformal prediction in a setting with imbalanced classes.
What carries the argument
An original method of evaluating the prediction set sizes of conformal predictors that is used to benchmark non-conformity score functions.
If this is right
- Non-conformity score functions produce measurably different prediction set sizes when assessed with the new evaluation procedure.
- Original modifications to existing score functions can be ranked against published ones using the same size metric.
- In imbalanced class settings the relative performance of score functions changes under class-conditional conformal prediction.
- The method supplies a uniform yardstick that was previously absent for comparing score functions.
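As a rough illustration of the class-conditional setting mentioned above, the sketch below computes one calibration threshold per class instead of a single global one. It assumes a generic non-conformity score where larger means less conforming; the function names `class_conditional_thresholds` and `class_conditional_set` are hypothetical and are not taken from the paper.

```python
import math
from collections import defaultdict

def class_conditional_thresholds(cal_scores, cal_labels, alpha):
    """One conformal threshold per class, using only that class's calibration scores."""
    by_class = defaultdict(list)
    for score, label in zip(cal_scores, cal_labels):
        by_class[label].append(score)
    thresholds = {}
    for label, scores in by_class.items():
        n = len(scores)
        rank = math.ceil((n + 1) * (1 - alpha))
        # With too few examples of a class, the threshold is infinite,
        # so that class is always included in the prediction set.
        thresholds[label] = math.inf if rank > n else sorted(scores)[rank - 1]
    return thresholds

def class_conditional_set(class_scores, thresholds):
    """class_scores[k] is the score of candidate class k for one test example."""
    return [k for k, s in enumerate(class_scores)
            if s <= thresholds.get(k, math.inf)]
```

With imbalanced data, a rare class has few calibration scores, so its threshold tends to be loose (or infinite), which is one plausible way prediction sets can grow in this setting.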
Where Pith is reading between the lines
- A practitioner facing a new dataset could apply the method to pick the score that keeps sets smallest while maintaining coverage.
- The evaluation procedure might be extended to regression tasks or to settings with multiple models to check consistency of rankings.
- If the method reveals that certain scores systematically enlarge sets on imbalanced data, new score designs could target that failure mode.
Load-bearing premise
The introduced original method of evaluating prediction set sizes yields a fair and informative comparison of non-conformity score functions.
What would settle it
Running the new evaluation method on a collection of standard non-conformity scores and finding that the resulting size rankings contradict those obtained from direct measurement of average set size on the same data would undermine the method.
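One way to quantify whether two size rankings agree is a rank correlation. The sketch below computes Kendall's tau (the tau-a variant, which ignores ties) in plain Python; choosing this statistic is an assumption made for illustration, and the page does not say which comparison the authors would use.

```python
from itertools import combinations

def kendall_tau(x, y):
    """Kendall's tau-a between two equally long lists of metric values."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        sign = (x[i] - x[j]) * (y[i] - y[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / pairs
```

Here `x` would hold the new size metric for each score function and `y` the average set size for the same score functions on the same data. A value near 1 means the two rankings agree, and a value near -1 means they are reversed.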
Original abstract
Conformal prediction is a useful and versatile alternative to model calibration in machine learning classification. It replaces single-class prediction with prediction sets, guaranteeing that the a priori probability of the prediction sets containing the true class is larger than or equal to a pre-specified rate. The size and usefulness of the prediction sets rely heavily on the choice of the non-conformity score function. The scientific literature contains many examples of non-conformity score functions but there is an absence of studies examining their properties and effectiveness. In this paper, we give an overview of properties of non-conformity score functions. We give examples of non-conformity score functions in the existing literature and introduce original modifications. We introduce an original method of evaluating the prediction set sizes of conformal predictors and use it to provide a comparison between non-conformity score functions. We also examine the efficacy of different non-conformity score functions for class-conditional conformal prediction in a setting with imbalanced classes.
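To make the coverage guarantee in the abstract concrete, here is a minimal sketch of split conformal classification. It assumes the simple score "one minus the predicted probability of the true class", which is a common choice but not necessarily one of the scores benchmarked in the paper. The function names and the calibration numbers are made up for illustration.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        # Too few calibration points for this alpha: every class must be kept.
        return math.inf
    return sorted(cal_scores)[rank - 1]

def prediction_set(probs, threshold):
    """Keep every class whose score 1 - p is at most the threshold."""
    return [k for k, p in enumerate(probs) if 1.0 - p <= threshold]

# Illustrative (made-up) calibration data: predicted probabilities and true labels.
cal_probs = [[0.9, 0.1], [0.6, 0.4], [0.2, 0.8], [0.3, 0.7]]
cal_labels = [0, 0, 1, 0]
cal_scores = [1.0 - p[y] for p, y in zip(cal_probs, cal_labels)]
tau = conformal_threshold(cal_scores, alpha=0.25)
print(prediction_set([0.55, 0.45], tau))
```

Under exchangeability of calibration and test examples, sets built this way contain the true class with probability at least 1 - alpha. The choice of score function affects the size of the sets, not the validity of that guarantee.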
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript surveys properties of non-conformity score functions for conformal prediction, presents examples from the literature along with original modifications, introduces a novel method for evaluating prediction set sizes, applies this method to benchmark multiple non-conformity scores, and studies their performance under class-conditional conformal prediction with imbalanced classes.
Significance. A well-validated benchmarking study of non-conformity scores, particularly one that addresses class imbalance, would help practitioners select scores that produce smaller, more informative prediction sets while preserving coverage guarantees.
major comments (2)
- [Abstract] Abstract: The central contribution is an 'original method of evaluating the prediction set sizes' used to benchmark non-conformity scores, yet the manuscript supplies no formulation, theoretical justification, or comparison to standard efficiency metrics (e.g., average set size at fixed marginal coverage). Without evidence that the method avoids artifacts or produces unbiased rankings, the reported comparisons cannot be attributed to the scores themselves.
- [Abstract] Abstract: No empirical results, tables, error bars, or ablation studies are described, so it is impossible to evaluate whether observed differences between scores (including in the imbalanced class-conditional setting) are statistically meaningful or reproducible.
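For reference, the standard efficiency metric the referee mentions (average set size, reported together with empirical marginal coverage) can be computed as in the sketch below. The helper name `coverage_and_size` is hypothetical, and this is the conventional baseline metric, not the paper's new evaluation method.

```python
def coverage_and_size(pred_sets, labels):
    """Empirical marginal coverage and average prediction set size.

    pred_sets: list of prediction sets (each a list of class indices)
    labels:    list of true class indices, same length as pred_sets
    """
    n = len(labels)
    covered = sum(1 for s, y in zip(pred_sets, labels) if y in s)
    avg_size = sum(len(s) for s in pred_sets) / n
    return covered / n, avg_size
```

Two score functions are usually compared at the same target error rate: if both reach roughly the same coverage, the one with the smaller average set size is considered more efficient.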
Simulated Author's Rebuttal
We thank the referee for their review and recommendation. We agree that the abstract is too high-level and will revise the manuscript to include the requested formulation, justification, comparisons, and empirical details.
- Referee: [Abstract] Abstract: The central contribution is an 'original method of evaluating the prediction set sizes' used to benchmark non-conformity scores, yet the manuscript supplies no formulation, theoretical justification, or comparison to standard efficiency metrics (e.g., average set size at fixed marginal coverage). Without evidence that the method avoids artifacts or produces unbiased rankings, the reported comparisons cannot be attributed to the scores themselves.
  Authors: We agree the current abstract provides no formulation or justification. In revision we will add a concise formulation of the original evaluation method to the abstract, include its theoretical motivation, and add explicit comparisons to standard metrics such as average set size at fixed marginal coverage. We will also insert supporting analysis (e.g., controlled simulations) demonstrating that the method produces consistent rankings without obvious artifacts.
  Revision: yes
- Referee: [Abstract] Abstract: No empirical results, tables, error bars, or ablation studies are described, so it is impossible to evaluate whether observed differences between scores (including in the imbalanced class-conditional setting) are statistically meaningful or reproducible.
  Authors: We agree the abstract does not describe the empirical results. The revised version will update the abstract to summarize the key findings, including performance differences across scores in both marginal and class-conditional imbalanced settings. The main text will be augmented with tables, error bars, and ablation studies that allow assessment of statistical significance and reproducibility.
  Revision: yes
Circularity Check
No circularity: empirical benchmarking with new evaluation method has no self-referential derivations
full rationale
The paper is an empirical benchmarking study that overviews non-conformity scores, introduces modifications, and proposes an original evaluation method for prediction set sizes to compare them (including class-conditional on imbalanced data). No equations, fitted parameters, or derivation chains appear in the provided text. The central contribution is the new evaluation procedure itself, presented as original rather than derived from prior results by the same authors. No self-citations are load-bearing, no ansatzes are smuggled, and no predictions reduce to inputs by construction. The work is self-contained against external benchmarks as a comparative study.
Reference graph
Works this paper leans on
- [1] Anastasios N. Angelopoulos and Stephen Bates. A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification, 2021.
- [2] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. 2007. doi: 10.48550/ARXIV.0706.3188.
- [3] Vladimir Vovk. Conditional validity of inductive conformal predictors, 2012. URL https://arxiv.org/abs/1209.2673. Version Number: 2.
- [4] Mauricio Sadinle, Jing Lei, and Larry Wasserman. Least Ambiguous Set-Valued Classifiers with Bounded Error Levels. 2016. doi: 10.48550/ARXIV.1609.00451.
- [5] Tuve Löfström, Henrik Boström, Henrik Linusson, and Ulf Johansson. Bias reduction through conditional conformal prediction. Intelligent Data Analysis, 19(6):1355–1375, November 2015. ISSN 1088467X, 15714128. doi: 10.3233/IDA-150786.
- [6] Yaniv Romano, Matteo Sesia, and Emmanuel J. Candès. Classification with Valid and Adaptive Coverage. 2020. doi: 10.48550/ARXIV.2006.02544.
- [7] Anastasios Angelopoulos, Stephen Bates, Jitendra Malik, and Michael I. Jordan. Uncertainty Sets for Image Classifiers using Conformal Prediction, 2020.
- [8] Jianguo Huang, Huajun Xi, Linjun Zhang, Huaxiu Yao, Yue Qiu, and Hongxin Wei. Conformal Prediction for Deep Classifier via Label Ranking, 2023.
- [9] Jiaye Teng, Chuan Wen, Dinghuai Zhang, Yoshua Bengio, Yang Gao, and Yang Yuan. Predictive Inference with Feature Conformal Prediction, 2022.
- [10] Zihao Tang, Boyuan Wang, Chuan Wen, and Jiaye Teng. Predictive Inference With Fast Feature Conformal Prediction, 2024.
- [11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition, 2015.
[12]
Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convo- lutional Neural Networks, September 2020. 17 Appendices A Complete version of table 1 Dataset Model accuracy (top-1 / top-5) Score function Conformal domain Distance metricI 1 ImageNet 76.15% / 92.87% Label Distance Probability space Euclidean distance 137.7 Label Distance Prob...
2020
1. Given a predictive classification model and a calibration data set $C$ not seen during training, denote by $X_i$ the $i$-th data point and by $y_i$ its class (i.e., its label). Denote by $\hat{\pi}_i$ the ordered list of probabilities for the $i$-th example.
2. Sample a uniform random variable $U_i \sim \mathrm{Uniform}(0, 1)$ for each example.
3. Define the (non-conformity) score function

$$s(y_i, U_i, \hat{\pi}_i) = \min\{\tau : y_i \in S_i(U_i, \hat{\pi}_i, \tau)\}, \tag{24}$$

$$S_i(U_i, \hat{\pi}_i, \tau) = \begin{cases} \left\{ y_i^{(j)} : j < \min\left\{ j' \,\middle|\, \sum_{k=1}^{j'} \hat{\pi}_i^{(k)} \ge \tau \right\} \right\} & \text{if } U_i \le V_i(\hat{\pi}_i, \tau) \\ \left\{ y_i^{(j)} : j \le \min\left\{ j' \,\middle|\, \sum_{k=1}^{j'} \hat{\pi}_i^{(k)} \ge \tau \right\} \right\} & \text{if } U_i > V_i(\hat{\pi}_i, \tau) \end{cases}, \tag{25}$$

$$V_i(\hat{\pi}_i, \tau) = \frac{1}{\hat{\pi}_i^{(c)}} \left( \sum_{\hat{\pi}^{(j)} \ge \hat{\pi}^{(c)}} \hat{\pi}^{(j)} - \tau \right). \tag{26}$$

Here, $y_i$ is the label of the $i$-th example, $y_i^{(j)}$ ...
4. Calibrate the conformal predictor by calculating $s_i$ for all the examples in $C$.
5. Choose an allowed error rate (sometimes called significance) $\alpha$.
6. Construct the prediction set for a new test example as $S_{i+1}(U_{i+1}, \hat{\pi}_{i+1}, \tau_c)$ with $\tau_c$ the $\frac{\lceil (|C|+1)(1-\alpha) \rceil}{|C|}$-th quantile of $s_i$ on $C$.

A naïve implementation of the above algorithm in pseudo-code could look like

    def calibration
        vars label, sorted_indices, sorted_scores
        c = cumsum(sorted_scores)
        while calibrating:
            L = index(sorted_indices == label)
            U ...
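Since the pseudo-code above is cut off in this extract, the following is a separate, minimal Python sketch of the same steps rather than a completion of the authors' listing. It assumes the usual closed form of the randomized cumulative-probability score, namely the cumulative sorted probability up to and including the label minus $U$ times the label's own probability. It also assumes the prediction set can be built by thresholding that score for every candidate class, which should match the piecewise definition in (25) up to ties. All function names are hypothetical.

```python
import math
import random

def aps_score(probs, label, u):
    """Cumulative probability down to `label` (classes sorted by decreasing
    probability), minus u times the probability of `label` itself."""
    order = sorted(range(len(probs)), key=lambda k: -probs[k])
    cum = 0.0
    for k in order:
        cum += probs[k]
        if k == label:
            return cum - u * probs[k]
    raise ValueError("label is not a valid class index")

def calibrate(cal_probs, cal_labels, alpha, rng):
    """Steps 2, 4 and 6: draw U_i, score the calibration set, take the quantile."""
    scores = sorted(aps_score(p, y, rng.random())
                    for p, y in zip(cal_probs, cal_labels))
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    # If rank exceeds n the quantile is undefined; fall back to "include everything".
    return math.inf if rank > n else scores[rank - 1]

def predict_set(probs, tau, u):
    """Step 6: classes whose randomized score does not exceed the threshold."""
    return [k for k in range(len(probs)) if aps_score(probs, k, u) <= tau]
```

A typical call would be `tau = calibrate(cal_probs, cal_labels, alpha=0.1, rng=random.Random(0))` followed by `predict_set(test_probs, tau, u)` with a fresh uniform draw `u` for each test example. Passing a seeded `random.Random` keeps the randomization reproducible.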