Benchmarking non-conformity score functions in conformal prediction

Sol Erika Boman

arxiv: 2605.24983 · v1 · pith:I6L6Q4GFnew · submitted 2026-05-24 · 💻 cs.LG

Pith reviewed 2026-06-30 11:37 UTC · model grok-4.3

classification 💻 cs.LG

keywords conformal prediction · non-conformity score · prediction sets · class-conditional · imbalanced classes · benchmarking · machine learning classification

The pith

A new method for measuring prediction set sizes enables direct comparison of non-conformity score functions in conformal prediction.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an original method to evaluate the sizes of prediction sets generated by conformal predictors. This method is then applied to benchmark various non-conformity score functions drawn from the literature along with some original modifications. The same approach is used to test how different scores perform in class-conditional conformal prediction when classes are imbalanced. A sympathetic reader would care because the size of the prediction sets determines how informative the conformal output is in practice. If the evaluation method works, it supplies a concrete basis for choosing one score function over another.

Core claim

We introduce an original method of evaluating the prediction set sizes of conformal predictors and use it to provide a comparison between non-conformity score functions. We also examine the efficacy of different non-conformity score functions for class-conditional conformal prediction in a setting with imbalanced classes.

What carries the argument

An original method of evaluating the prediction set sizes of conformal predictors that is used to benchmark non-conformity score functions.

If this is right

Non-conformity score functions produce measurably different prediction set sizes when assessed with the new evaluation procedure.
Original modifications to existing score functions can be ranked against published ones using the same size metric.
In imbalanced class settings the relative performance of score functions changes under class-conditional conformal prediction.
The method supplies a uniform yardstick that was previously absent for comparing score functions.
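
The class-conditional setting mentioned in the list above can be illustrated with a short sketch. It calibrates one threshold per class instead of a single global one, so that each class gets its own coverage level. The score `1 - p` (one minus the predicted probability of a class), the function names, and the choice to always include classes that have too little calibration data are assumptions made for illustration; the paper's own scores and procedure are not described on this page.

```python
import numpy as np

def class_thresholds(cal_probs, cal_labels, alpha):
    """One threshold per class, each calibrated only on examples of that class.

    Assumes 0 < alpha < 1 and uses the illustrative score 1 - p(class).
    """
    n_classes = cal_probs.shape[1]
    # Assumption: a class with too few calibration examples gets an infinite
    # threshold, i.e. it is always included in the prediction set.
    thresholds = np.full(n_classes, np.inf)
    for c in range(n_classes):
        s = np.sort(1.0 - cal_probs[cal_labels == c, c])
        k = int(np.ceil((len(s) + 1) * (1 - alpha)))  # finite-sample corrected rank
        if 1 <= k <= len(s):
            thresholds[c] = s[k - 1]
    return thresholds

def predict_sets(test_probs, thresholds):
    """Boolean mask: entry (i, c) is True when class c is in the set of example i."""
    return (1.0 - test_probs) <= thresholds  # broadcasts one threshold per class column
```

With imbalanced classes, a rare class has few calibration scores, so its threshold is estimated from little data and may become infinite; this is one plausible reason set sizes can grow in that setting.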

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

A practitioner facing a new dataset could apply the method to pick the score that keeps sets smallest while maintaining coverage.
The evaluation procedure might be extended to regression tasks or to settings with multiple models to check consistency of rankings.
If the method reveals that certain scores systematically enlarge sets on imbalanced data, new score designs could target that failure mode.

Load-bearing premise

The introduced original method of evaluating prediction set sizes yields a fair and informative comparison of non-conformity score functions.

What would settle it

Running the new evaluation method on a collection of standard non-conformity scores and finding that the resulting size rankings contradict those obtained from direct measurement of average set size on the same data would undermine the method.

Original abstract

Conformal prediction is a useful and versatile alternative to model calibration in machine learning classification. It replaces single-class prediction with prediction sets, guaranteeing that the a priori probability of the prediction sets containing the true class is larger than or equal to a pre-specified rate. The size and usefulness of the prediction sets rely heavily on the choice of the non-conformity score function. The scientific literature contains many examples of non-conformity score functions but there is an absence of studies examining their properties and effectiveness. In this paper, we give an overview of properties of non-conformity score functions. We give examples of non-conformity score functions in the existing literature and introduce original modifications. We introduce an original method of evaluating the prediction set sizes of conformal predictors and use it to provide a comparison between non-conformity score functions. We also examine the efficacy of different non-conformity score functions for class-conditional conformal prediction in a setting with imbalanced classes.
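
To make the coverage guarantee in the abstract concrete, here is a minimal split-conformal sketch. It assumes a classifier that outputs class probabilities and a held-out calibration set, and it uses the simple score `1 - p(true class)` purely for illustration; that score and the function names `calibrate` and `predict_sets` are assumptions, not the paper's method.

```python
import numpy as np

def calibrate(cal_probs, cal_labels, alpha):
    """Return the score threshold computed on a held-out calibration set."""
    n = len(cal_labels)
    # Non-conformity of the true class: large when the model gave it low probability.
    scores = 1.0 - cal_probs[np.arange(n), cal_labels]
    k = int(np.ceil((n + 1) * (1 - alpha)))  # finite-sample corrected rank
    # If the rank exceeds n, no finite threshold gives the guarantee.
    return np.inf if k > n else np.sort(scores)[k - 1]

def predict_sets(test_probs, threshold):
    """Boolean mask: entry (i, c) is True when class c is in the set of example i."""
    return (1.0 - test_probs) <= threshold
```

Swapping the line that computes `scores` (and the matching comparison in `predict_sets`) for a different non-conformity score changes the set sizes but not the coverage guarantee, which is why the choice of score is the quantity worth benchmarking.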

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags a real gap in comparing non-conformity scores but rests its main claims on an unevaluated 'original method' with no details or validation supplied.

The core takeaway is that this is a benchmarking study on non-conformity score functions for conformal prediction. It promises an overview of their properties, some original modifications, a new way to assess prediction set sizes, and a look at class-conditional performance under imbalance. That last part and the gap it identifies are the parts that could matter to people already working in this corner of uncertainty quantification.

What the paper does is straightforward: it notes the absence of systematic comparisons in the literature and sets out to fill it with both properties and an empirical ranking. If the full version actually delivers clean experiments on standard datasets and shows the modifications clearly, that would be a modest but usable contribution.

The soft spot is the central comparison tool itself. It is described only as an 'original method of evaluating the prediction set sizes,' with no account of how it is constructed, why it would be preferable to average set size at fixed coverage, or any check that it produces stable rankings. The abstract contains no numbers, no ablation, and no justification, so any observed differences between scores cannot yet be attributed to the scores themselves. That makes the soundness claim hard to assess from what is here.

This is the kind of paper that would interest a narrow group already running conformal predictors on classification tasks and wanting practical guidance on score choice. A reader outside that subfield will not find much. It is worth sending to referees if the full manuscript supplies the missing method description, the actual results, and some comparison to existing efficiency metrics; without those pieces it is too thin to review.

Referee Report

2 major / 0 minor

Summary. The manuscript surveys properties of non-conformity score functions for conformal prediction, presents examples from the literature along with original modifications, introduces a novel method for evaluating prediction set sizes, applies this method to benchmark multiple non-conformity scores, and studies their performance under class-conditional conformal prediction with imbalanced classes.

Significance. A well-validated benchmarking study of non-conformity scores, particularly one that addresses class imbalance, would help practitioners select scores that produce smaller, more informative prediction sets while preserving coverage guarantees.

major comments (2)

[Abstract] The central contribution is an 'original method of evaluating the prediction set sizes' used to benchmark non-conformity scores, yet the manuscript supplies no formulation, theoretical justification, or comparison to standard efficiency metrics (e.g., average set size at fixed marginal coverage). Without evidence that the method avoids artifacts or produces unbiased rankings, the reported comparisons cannot be attributed to the scores themselves.
[Abstract] No empirical results, tables, error bars, or ablation studies are described, so it is impossible to evaluate whether observed differences between scores (including in the imbalanced class-conditional setting) are statistically meaningful or reproducible.
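
For reference, the standard efficiency metric named in the first comment, together with a simple error bar of the kind requested in the second, can be computed as in the sketch below. It assumes prediction sets are stored as a boolean mask of shape (examples, classes); the function name `coverage_and_size` and the use of a standard error of the mean as the error bar are illustrative choices, not something taken from the manuscript.

```python
import numpy as np

def coverage_and_size(set_mask, labels):
    """Empirical marginal coverage, mean set size, and standard error of the size."""
    set_mask = np.asarray(set_mask, dtype=bool)
    labels = np.asarray(labels)
    covered = set_mask[np.arange(len(labels)), labels]  # is the true class in each set?
    sizes = set_mask.sum(axis=1)                        # number of classes in each set
    stderr = sizes.std(ddof=1) / np.sqrt(len(sizes))    # error bar for the mean size
    return covered.mean(), sizes.mean(), stderr
```

Two scores are comparable on mean set size only when both are calibrated to the same target coverage, so the first returned value serves as a sanity check before the second is read.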

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their review and recommendation. We agree that the abstract is too high-level and will revise the manuscript to include the requested formulation, justification, comparisons, and empirical details.

Referee: [Abstract] The central contribution is an 'original method of evaluating the prediction set sizes' used to benchmark non-conformity scores, yet the manuscript supplies no formulation, theoretical justification, or comparison to standard efficiency metrics (e.g., average set size at fixed marginal coverage). Without evidence that the method avoids artifacts or produces unbiased rankings, the reported comparisons cannot be attributed to the scores themselves.

Authors: We agree the current abstract provides no formulation or justification. In revision we will add a concise formulation of the original evaluation method to the abstract, include its theoretical motivation, and add explicit comparisons to standard metrics such as average set size at fixed marginal coverage. We will also insert supporting analysis (e.g., controlled simulations) demonstrating that the method produces consistent rankings without obvious artifacts. Revision: yes.

Referee: [Abstract] No empirical results, tables, error bars, or ablation studies are described, so it is impossible to evaluate whether observed differences between scores (including in the imbalanced class-conditional setting) are statistically meaningful or reproducible.

Authors: We agree the abstract does not describe the empirical results. The revised version will update the abstract to summarize the key findings, including performance differences across scores in both marginal and class-conditional imbalanced settings. The main text will be augmented with tables, error bars, and ablation studies that allow assessment of statistical significance and reproducibility. Revision: yes.

Circularity Check

0 steps flagged

No circularity: this is an empirical benchmarking study with a new evaluation method, and it contains no self-referential derivations.

The paper is an empirical benchmarking study that overviews non-conformity scores, introduces modifications, and proposes an original evaluation method for prediction set sizes to compare them (including class-conditional on imbalanced data). No equations, fitted parameters, or derivation chains appear in the provided text. The central contribution is the new evaluation procedure itself, presented as original rather than derived from prior results by the same authors. No self-citations are load-bearing, no ansatzes are smuggled, and no predictions reduce to inputs by construction. The work is self-contained against external benchmarks as a comparative study.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract; the described work is an empirical benchmarking study that appears to rest on standard assumptions of conformal prediction and supervised classification without introducing new free parameters, axioms, or invented entities.

Reference graph

Works this paper leans on

18 extracted references · 4 canonical work pages · 3 internal anchors

[1] Anastasios N. Angelopoulos and Stephen Bates. A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification, 2021.
[2] Glenn Shafer and Vladimir Vovk. A tutorial on conformal prediction. 2007. doi: 10.48550/ARXIV.0706.3188
[3] Vladimir Vovk. Conditional validity of inductive conformal predictors, 2012. URL https://arxiv.org/abs/1209.2673. Version Number: 2
[4] Mauricio Sadinle, Jing Lei, and Larry Wasserman. Least Ambiguous Set-Valued Classifiers with Bounded Error Levels. 2016. doi: 10.48550/ARXIV.1609.00451
[5] Tuve Löfström, Henrik Boström, Henrik Linusson, and Ulf Johansson. Bias reduction through conditional conformal prediction. Intelligent Data Analysis, 19(6):1355–1375, November 2015. ISSN 1088467X, 15714128. doi: 10.3233/IDA-150786
[6] Yaniv Romano, Matteo Sesia, and Emmanuel J. Candès. Classification with Valid and Adaptive Coverage. 2020. doi: 10.48550/ARXIV.2006.02544
[7] Anastasios Angelopoulos, Stephen Bates, Jitendra Malik, and Michael I. Jordan. Uncertainty Sets for Image Classifiers using Conformal Prediction, 2020.
[8] Jianguo Huang, Huajun Xi, Linjun Zhang, Huaxiu Yao, Yue Qiu, and Hongxin Wei. Conformal Prediction for Deep Classifier via Label Ranking, 2023.
[9] Jiaye Teng, Chuan Wen, Dinghuai Zhang, Yoshua Bengio, Yang Gao, and Yang Yuan. Predictive Inference with Feature Conformal Prediction, 2022.
[10] Zihao Tang, Boyuan Wang, Chuan Wen, and Jiaye Teng. Predictive Inference With Fast Feature Conformal Prediction, 2024.
[11] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition, 2015.

Mingxing Tan and Quoc V. Le. EfficientNet: Rethinking Model Scaling for Convo- lutional Neural Networks, September 2020. 17 Appendices A Complete version of table 1 Dataset Model accuracy (top-1 / top-5) Score function Conformal domain Distance metricI 1 ImageNet 76.15% / 92.87% Label Distance Probability space Euclidean distance 137.7 Label Distance Prob...

2020
[13] Given a predictive classification model and a calibration data set $C$ not seen during training, denote by $X_i$ the $i$-th data point and $y_i$ its class (i.e., its label). Denote by $\hat{\pi}_i$ the ordered list of probabilities for the $i$-th example.
[14] Sample a uniform random variable $U_i \sim \mathrm{Uniform}(0,1)$ for each example.
[15] Define the (non-conformity) score function

$$s(y_i, U_i, \hat{\pi}_i) = \min\{\tau : y_i \in S_i(U_i, \hat{\pi}_i, \tau)\}, \tag{24}$$

$$S_i(U_i, \hat{\pi}_i, \tau) = \begin{cases} \{y_i^{(j)} : j < \min\{j' \mid \sum_{k=1}^{j'} \hat{\pi}_i^{(k)} \ge \tau\}\} & \text{if } U_i \le V_i(\hat{\pi}_i, \tau), \\ \{y_i^{(j)} : j \le \min\{j' \mid \sum_{k=1}^{j'} \hat{\pi}_i^{(k)} \ge \tau\}\} & \text{if } U_i > V_i(\hat{\pi}_i, \tau), \end{cases} \tag{25}$$

$$V_i(\hat{\pi}_i, \tau) = \frac{1}{\hat{\pi}_i^{(c)}} \Big( \sum_{\hat{\pi}^{(j)} \ge \hat{\pi}^{(c)}} \hat{\pi}^{(j)} - \tau \Big). \tag{26}$$

Here, $y_i$ is the label of the $i$-th example, $y_i^{(j)}$ ...
[16] Calibrate the conformal predictor by calculating $s_i$ for all the examples in $C$.
[17] Choose an allowed error rate (sometimes called significance) $\alpha$.
[18] Construct the prediction set for a new test example as $S_{i+1}(U_{i+1}, \hat{\pi}_{i+1}, \tau_c)$ with $\tau_c$ the $\frac{\lceil (|C|+1)(1-\alpha) \rceil}{|C|}$-th quantile of $s_i$ on $C$. A naïve implementation of the above algorithm in pseudo-code could look like: def calibration vars label, sorted_indices, sorted_scores c = cumsum(sorted_scores) while calibrating: L = index(sorted_indices == label) U ...
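
The steps extracted above can be sketched in runnable form. Because the source pseudo-code is truncated, the following is a minimal NumPy sketch under stated assumptions rather than the paper's implementation. It assumes the score in (24) can be written in closed form as the cumulative probability up to and including the true label minus `U` times the true label's own probability, which is the smallest `tau` at which the randomized rule admits the true label. The function names `aps_scores`, `conformal_threshold`, and `prediction_set` are hypothetical, and returning an infinite threshold when the corrected rank exceeds the calibration set size is an assumed convention.

```python
import numpy as np

def aps_scores(probs, labels, u):
    """Randomized cumulative-probability score for each calibration example."""
    order = np.argsort(-probs, axis=1)                   # classes by descending probability
    sorted_p = np.take_along_axis(probs, order, axis=1)
    cum = np.cumsum(sorted_p, axis=1)
    rank = np.argmax(order == labels[:, None], axis=1)   # position of the true label
    rows = np.arange(len(labels))
    # Assumed closed form of (24): mass up to the true label, minus a random
    # fraction u of the true label's own probability.
    return cum[rows, rank] - u * sorted_p[rows, rank]

def conformal_threshold(scores, alpha):
    """The ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # Assumption: with too few calibration points, fall back to an infinite threshold.
    return np.inf if k > n else np.sort(scores)[k - 1]

def prediction_set(p, u, tau):
    """Class indices in the prediction set of one test example."""
    order = np.argsort(-p)
    cum = np.cumsum(p[order])
    keep = cum - u * p[order] <= tau   # non-decreasing in rank, so this selects a prefix
    return order[keep]
```

In use, `u` would be drawn once per example, for instance with `np.random.default_rng().uniform(size=n)`. The randomization is what lets the last class be included only part of the time, so that coverage is not systematically above the target.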
