Empirical Bayes Conformal Prediction for Vision and Language Models

Jiapeng Zeng; Michael A. Newton; Vikas Singh; Yogesh Prabhu; Zhanpeng Zeng

arxiv: 2605.23189 · v1 · pith:L3UBFJHCnew · submitted 2026-05-22 · 💻 cs.LG

Pith reviewed 2026-05-25 04:42 UTC · model grok-4.3

classification 💻 cs.LG

keywords conformal prediction · empirical Bayes · r-value · nonconformity score · vision language models · prediction sets · uncertainty quantification

The pith

r-value nonconformity scores preserve conformal coverage while reducing high-variance false candidates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops an empirical Bayes framework for conformal prediction that turns observed score variability into an r-value nonconformity score. Standard approaches rely on one realization or a point estimate and therefore admit unstable candidates whose apparent strength comes from noise. The r-value estimates how likely a candidate's latent score truly belongs to the top group after adjusting for both its mean and its uncertainty, using either a closed-form Normal-Normal model or nonparametric sampling. This matters for vision and language models because multiple realizations (prompts, samples) are routinely available; when variability carries signal, the resulting prediction sets become smaller and more stable while the distribution-free coverage guarantee is retained. Experiments on image classification, CLIP VLMs, and LLMs confirm the behavior reverts to ordinary conformal prediction when variability disappears.

Core claim

The central claim is that the r-value, obtained from an empirical Bayes estimator of the latent score distribution, serves as a nonconformity score that preserves the target conformal coverage while provably lowering the inclusion rate of high-variance false candidates under mild regularity conditions; both Normal-Normal closed-form and nonparametric posterior-sampling estimators are supplied, and the approach is shown to improve ranking stability and reduce set size on image classification, CLIP-based VLM, and LLM benchmarks whenever variability is informative.

What carries the argument

The r-value nonconformity score, which converts mean score and observed variability into the estimated probability that a candidate's latent score belongs to the top-ranked group.
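To make the "mean plus variability" idea concrete, here is a minimal sketch of one ingredient under a Normal-Normal model: the posterior probability that a candidate's latent score lies in the top `alpha` fraction of the prior. This is not the paper's estimator. The function name, the symbols `mu`, `tau2`, and `alpha`, and the toy inputs are all assumptions made for illustration, and the hyperparameters are taken as given rather than estimated.

```python
from statistics import NormalDist

def top_group_probability(means, variances, mu, tau2, alpha):
    """Posterior probability that each latent score is in the top-alpha prior group.

    Hypothetical helper: assumes observed mean ~ N(theta, s2), theta ~ N(mu, tau2),
    with tau2 > 0 and 0 < alpha < 1.
    """
    # Cutoff above which a latent score counts as "top alpha" under the prior.
    cutoff = mu + (tau2 ** 0.5) * NormalDist().inv_cdf(1.0 - alpha)
    probs = []
    for x, s2 in zip(means, variances):
        shrink = s2 / (s2 + tau2)                 # weight placed on the prior mean
        post_mean = shrink * mu + (1.0 - shrink) * x
        post_var = (1.0 - shrink) * s2            # equals s2 * tau2 / (s2 + tau2)
        if post_var == 0.0:
            # No observation noise: the posterior is a point mass at the observed mean.
            probs.append(1.0 if post_mean >= cutoff else 0.0)
        else:
            probs.append(1.0 - NormalDist(post_mean, post_var ** 0.5).cdf(cutoff))
    return probs

# Two candidates with the same observed mean but very different noise levels.
probs = top_group_probability(
    means=[2.0, 2.0], variances=[0.1, 4.0], mu=0.0, tau2=1.0, alpha=0.1
)
```

In this toy input the low-noise candidate keeps a high top-group probability while the noisy one is shrunk toward the prior mean and drops sharply, which is the qualitative behavior the section describes.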

If this is right

Target coverage is maintained on image classification, VLM, and LLM tasks.
Prediction-set size decreases and ranking stability improves when variability distinguishes signal from noise.
Behavior reverts to ordinary conformal prediction when variability vanishes.
Both closed-form Normal-Normal and nonparametric sampling estimators are available for practical use.
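The coverage statement in the list above relies on the usual split conformal recipe, which works with any fixed nonconformity score. The sketch below shows that recipe in generic form; the function names and the toy scores are assumptions, and the scores could equally be r-values or any other nonconformity measure.

```python
import math

def conformal_threshold(cal_scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration nonconformity scores."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1.0 - alpha))   # rank of the order statistic to use
    if k > n:
        return math.inf                       # too few calibration points: keep everything
    return sorted(cal_scores)[k - 1]

def prediction_set(candidate_scores, threshold):
    """Keep every candidate whose nonconformity score is at most the threshold."""
    return [label for label, s in candidate_scores.items() if s <= threshold]

# Toy calibration scores of the true labels (smaller means more conforming).
cal = [0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.55, 0.60, 0.80]
q = conformal_threshold(cal, alpha=0.25)
labels = prediction_set({"cat": 0.10, "dog": 0.50, "fox": 0.90}, q)
```

Swapping in a different score changes which candidates fall under the threshold, and hence the set size, while the calibration step itself stays the same.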

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same variability-aware construction could be applied to other multi-realization regimes such as ensemble or Bayesian neural-network outputs.
The approach suggests that modeling uncertainty directly inside the nonconformity score may yield efficiency gains in online or adaptive conformal settings.
Testing the regularity conditions on a wider range of model families would clarify when the provable reduction in false inclusions holds.

Load-bearing premise

Observed variability in scores reliably distinguishes stable high-scoring candidates from noise-driven ones and the empirical Bayes estimator recovers this distinction accurately from the available realizations.

What would settle it

A controlled experiment in which the fraction of high-variance false candidates entering the conformal sets is no smaller (or is larger) under the r-value rule than under standard single-score or averaged CP, or in which empirical coverage on held-out data falls below the nominal target.
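The quantities named in this test can be computed with a few lines of bookkeeping. The following is a minimal sketch, assuming prediction sets are Python sets of labels and that the "high-variance" candidates for each example have already been flagged by some external rule; both function names and the toy data are hypothetical.

```python
def coverage_and_size(pred_sets, true_labels):
    """Empirical coverage and mean prediction-set size over held-out examples."""
    n = len(true_labels)
    covered = sum(1 for s, y in zip(pred_sets, true_labels) if y in s)
    return covered / n, sum(len(s) for s in pred_sets) / n

def false_inclusion_rate(pred_sets, true_labels, flagged):
    """Fraction of flagged false candidates that end up inside the prediction sets."""
    included = total = 0
    for s, y, f in zip(pred_sets, true_labels, flagged):
        for label in f:
            if label != y:            # only false candidates count
                total += 1
                included += label in s
    return included / total if total else 0.0

sets = [{"a", "b"}, {"c"}, {"a", "c"}, {"b"}]
truth = ["a", "c", "b", "b"]
high_variance = [{"b"}, {"a"}, {"c"}, {"a", "c"}]

cov, size = coverage_and_size(sets, truth)
rate = false_inclusion_rate(sets, truth, high_variance)
```

Running the same bookkeeping on sets produced by two different scoring rules gives the side-by-side comparison the paragraph calls for.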

Figures

Figures reproduced from arXiv: 2605.23189 by Jiapeng Zeng, Michael A. Newton, Vikas Singh, Yogesh Prabhu, Zhanpeng Zeng.

**Figure 1.** Posterior sampled models yield different CP sets for the same image, revealing epistemic instability. This instability is not just due to posterior model sampling. In vision models, it may arise from posterior uncertainty over model parameters; in VLMs and LLMs, it may arise from prompt paraphrasing or model-based evaluation of candidate responses. A high score can mean two very different things: stable e…

**Figure 2.** Left: WBB approximates model uncertainty efficiently by training adapter modules [13] instead of fully retraining the model. Right: Comparison of CP and CPr-value on ViT-Base image classification in probability and logit settings.

**Figure 3.** Left: Single-image comparison of CP and CPr-value, where CPr-value incorporates model variability to often rank the correct class higher; smaller r-values are better. Right: Multi-model comparison showing that CPr-value produces smaller, more stable coverage sets than CP. See Appendix D.

**Figure 4.** Pipeline for generating score variability via paraphrasing. For each discrete input, either …

**Figure 5.** Comparison of conformal prediction and CPr-value on image classification using ResNet18, ResNet50, ViT-Base, and ViT-Large. We evaluate both probability and logit settings, analyzing coverage and set size across significance levels. In the single-backbone setting, WBB adapters introduce only modest posterior variability, so CPr-value often behaves similarly to standard CP.

**Figure 6.** Smoothed distribution of judge scores for the selected answer

Original abstract

Conformal prediction (CP) gives distribution-free coverage for modern vision and language models, but it is often forced to make a ranking decision from a single unstable nonconformity score. Standard CP uses one realization, while average-then-calibrate variants smooth multiple realizations into a point estimate. Both options discard the inconsistency that can help identify whether a candidate is indeed stable. A weak answer can enter the conformal set even if the evidence is not strong, simply because one posterior sample or prompt phrasing made it look strong. But variability can help distinguish a stable signal from noise-driven fluctuations. We describe an empirical Bayes conformal prediction framework that uses $r$-values to convert score variability into an uncertainty-informed nonconformity score. The resulting $r$-value estimates how likely a candidate's latent score belongs to the top-ranked group after accounting for both its mean score and its uncertainty. It admits both a closed-form Normal-Normal empirical Bayes estimator and a nonparametric posterior-sampling estimator. Using the $r$-value as the nonconformity score preserves the target conformal coverage while provably reducing the inclusion of high-variance false candidates under mild regularity conditions. Across image classification, CLIP-based VLM benchmarks, and LLMs, we show that $r$-value conformal prediction preserves target coverage while improving ranking stability and reducing set size when variability is informative, and reverting to CP-like behavior when variability vanishes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper turns score variability into r-value nonconformity scores via empirical Bayes, which looks like a direct way to shrink conformal sets for VLMs and LLMs without breaking coverage when the variability signal is real.

The main contribution is replacing a single or averaged nonconformity score with an r-value that explicitly trades off mean score against observed variability, using either a Normal-Normal empirical Bayes estimator or a nonparametric sampler. This targets the practical issue that one noisy realization can push a weak candidate into the conformal set for large vision and language models.

The abstract states that coverage is preserved and set size is reduced under mild conditions when variability distinguishes stable signals from noise, which aligns with known problems in prompt-sensitive or sampling-based models. The two estimator options are a plus for flexibility, and the claim that it reverts to standard CP behavior when variability is absent keeps the method from overclaiming.

The soft spot is the data-dependent fitting of the empirical Bayes hyperparameters on the calibration scores themselves. This introduces a circularity that standard conformal theory avoids, and while the paper qualifies the reduction as holding under mild regularity conditions, the lack of an explicit derivation or statement of those conditions in the abstract leaves the guarantee harder to verify than the coverage preservation part. Experiments are described across image classification, CLIP VLMs, and LLMs showing improved ranking stability and smaller sets when variability helps, but without quantitative details here it is difficult to judge effect sizes or failure cases.

This work is for researchers already applying conformal prediction to unstable large models who want a lightweight way to incorporate variability. It deserves peer review because the framing is clean, the problem is genuine, and the central coverage claim is stated with appropriate qualifiers even if the variability reduction needs closer checking on the proofs and data.

Referee Report

2 major / 1 minor

Summary. The paper proposes an empirical Bayes conformal prediction framework for vision and language models. It uses r-values—computed via either a closed-form Normal-Normal empirical Bayes estimator or a nonparametric posterior-sampling estimator—as nonconformity scores that incorporate both mean score and score variability. The central claims are that this construction preserves the target conformal coverage guarantee and provably reduces inclusion of high-variance false candidates under mild regularity conditions, with experiments on image classification, CLIP VLMs, and LLMs showing preserved coverage, improved ranking stability, and smaller sets when variability is informative.

Significance. If the coverage preservation holds despite the data-dependent construction of the nonconformity score, the result would be significant for conformal prediction in unstable settings such as LLMs. It would demonstrate how to fold empirical-Bayes uncertainty quantification into the nonconformity measure while retaining distribution-free guarantees, potentially yielding more efficient sets without post-hoc adjustments. The dual parametric/nonparametric estimators and the explicit qualifier that the method reverts to standard CP when variability vanishes strengthen applicability.

major comments (2)

[Abstract and §3 (method)] The manuscript asserts that the r-value nonconformity score 'preserves the target conformal coverage' and yields a 'provable reduction' under 'mild regularity conditions,' yet supplies neither a derivation nor an explicit statement of those conditions. Because the coverage guarantee is the load-bearing claim, the absence of the proof or regularity assumptions must be addressed before the result can be evaluated.
[§3.2 (empirical Bayes estimator)] The r-value is obtained by fitting the Normal-Normal (or nonparametric) empirical Bayes model to the observed scores; the resulting nonconformity score therefore depends on quantities estimated from the calibration data itself. This dependence must be shown not to violate the exchangeability required for marginal coverage; the current description leaves the effect on the guarantee unaddressed.

minor comments (1)

[Experiments] The experimental section would benefit from explicit quantitative reporting of coverage deviation, set-size reduction, and the fraction of cases in which variability was deemed informative, rather than qualitative statements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the centrality of the coverage guarantee. We will revise the manuscript to include the missing theoretical derivations and explicit conditions as detailed below.

Referee: [Abstract and §3 (method)] The manuscript asserts that the r-value nonconformity score 'preserves the target conformal coverage' and yields a 'provable reduction' under 'mild regularity conditions,' yet supplies neither a derivation nor an explicit statement of those conditions. Because the coverage guarantee is the load-bearing claim, the absence of the proof or regularity assumptions must be addressed before the result can be evaluated.

Authors: We agree the proof and conditions are required for evaluation. The revised manuscript will add a formal theorem in §3 (with full proof in the appendix) establishing that the r-value nonconformity score preserves marginal coverage under exchangeability of the underlying scores. The mild regularity conditions will be stated explicitly: (i) the latent scores are exchangeable, (ii) the empirical Bayes estimator (Normal-Normal or nonparametric) is a symmetric function of the calibration scores, and (iii) the r-value mapping is monotone in the score. The provable reduction in high-variance false positives will be shown via a stochastic dominance argument under these conditions. Revision: yes.

Referee: [§3.2 (empirical Bayes estimator)] The r-value is obtained by fitting the Normal-Normal (or nonparametric) empirical Bayes model to the observed scores; the resulting nonconformity score therefore depends on quantities estimated from the calibration data itself. This dependence must be shown not to violate the exchangeability required for marginal coverage; the current description leaves the effect on the guarantee unaddressed.

Authors: The data dependence is symmetric across calibration and test points because the empirical Bayes parameters are estimated from the pooled set of scores and the r-value for every point is computed identically using those fixed parameters. Consequently the vector of r-values remains exchangeable whenever the original scores are exchangeable. We will add a short lemma in §3.2 formalizing this invariance and confirming that the standard conformal coverage argument applies unchanged to the r-value scores. Revision: yes.
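The symmetry argument in this response can be illustrated with a small check: if the hyperparameters are fitted by a function that ignores the order of the pooled scores, permuting the points leaves the fit unchanged. The sketch below uses simple moment estimates as a stand-in; the fitting rule and the name `fit_normal_normal` are assumptions for illustration, not the estimator the authors describe.

```python
from statistics import fmean, variance

def fit_normal_normal(means, variances):
    """Moment estimates of the prior mean and prior variance from pooled scores.

    Hypothetical stand-in for an empirical Bayes fit; needs at least two points.
    """
    mu_hat = fmean(means)
    # The spread of observed means is roughly prior variance plus average noise
    # variance, so subtract the noise part and clip at zero.
    tau2_hat = max(variance(means) - fmean(variances), 0.0)
    return mu_hat, tau2_hat

means = [1.0, 2.0, 4.0, 5.0]
noise = [0.5, 0.5, 1.0, 1.0]
fit_a = fit_normal_normal(means, noise)

# Reorder the pooled points (for example, swap calibration and test positions).
order = [2, 0, 3, 1]
fit_b = fit_normal_normal([means[i] for i in order], [noise[i] for i in order])
```

Both fits agree, which is the order-invariance the response appeals to. Whether the full r-value map preserves exchangeability still depends on the lemma the authors promise.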

Circularity Check

0 steps flagged

No significant circularity identified

The paper defines an r-value nonconformity score via empirical Bayes (Normal-Normal or nonparametric) applied to observed score variability, then invokes standard conformal prediction to guarantee coverage for any fixed nonconformity function. The coverage claim is distribution-free and does not reduce to a tautology or self-fit of the estimator; the EB step is used only to construct the score, while validity follows from exchangeability of the resulting scores. No quoted derivation equates the coverage result to the fitted hyperparameters by construction, nor imports uniqueness via self-citation, nor renames a known result. The derivation remains self-contained against the external CP benchmark.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on an empirical Bayes model whose prior hyperparameters must be estimated from data and on regularity conditions whose content is not specified; no new physical entities are introduced.

free parameters (1)

prior hyperparameters in Normal-Normal empirical Bayes estimator
The closed-form estimator requires fitting prior parameters to the collection of realized scores.
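
For orientation, the textbook Normal-Normal model that such an estimator would fit can be written as follows. The notation ($\bar{x}_i$ for a candidate's observed mean score, $s_i^2$ for its sampling variance, $\mu$ and $\tau^2$ for the prior hyperparameters) is assumed here and may differ from the paper's.

$$\bar{x}_i \mid \theta_i \sim \mathcal{N}(\theta_i, s_i^2), \qquad \theta_i \sim \mathcal{N}(\mu, \tau^2),$$

$$\theta_i \mid \bar{x}_i \sim \mathcal{N}\big(B_i \mu + (1 - B_i)\,\bar{x}_i,\; (1 - B_i)\, s_i^2\big), \qquad B_i = \frac{s_i^2}{s_i^2 + \tau^2}.$$

The free parameters in this ledger entry are $\mu$ and $\tau^2$. As $s_i^2 \to 0$ the shrinkage weight $B_i$ goes to zero and the posterior concentrates on the observed mean, which is consistent with the stated reversion to ordinary CP behavior when variability vanishes.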

axioms (2)

domain assumption: Score variability is generated from a distribution compatible with the chosen empirical Bayes model (Normal-Normal or nonparametric)
The r-value construction and its claimed properties presuppose this modeling choice.
domain assumption: Mild regularity conditions hold that allow the provable reduction in high-variance false candidates
The guarantee of reduced inclusion of unstable candidates is stated to hold only under these unspecified conditions.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: "Using the r-value as the nonconformity score preserves the target conformal coverage while provably reducing the inclusion of high variance false candidates under mild regularity conditions."

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · embed_injective · unclear

Paper passage: "the r-value estimates how likely a candidate’s latent score belongs to the top-ranked group after accounting for both its mean score and its uncertainty"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · 4 internal anchors

[1] Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U. Rajendra Acharya, Vladimir Makarenkov, and Saeid Nahavandi. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76:243–297, 5 2021. doi: 10.1016/j.inffus.2021.05.008

[2] Anastasios N. Angelopoulos and Stephen Bates. Conformal prediction: A gentle introduction. Foundations and Trends® in Machine Learning, 16(4):494–591, 1 2023. doi: 10.1561/2200000101. URL https://doi.org/10.1561/2200000101

[3] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control, 10 2021. URL https://arxiv.org/abs/2110.01052

[4] Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal Risk Control. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=33XGfHLtZg

[5] Michele Caprio, David Stutz, Shuo Li, and Arnaud Doucet. Conformalized Credal Regions for Classification with Ambiguous Ground Truth. OpenReview, 2025. URL https://openreview.net/forum?id=L7sQ8CW2FY

[6] Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen Zhao, Samuel R. Bowman, and Kyunghyun Cho. Two failures of Self-Consistency in the Multi-Step reasoning of LLMs, 5 2023. URL https://arxiv.org/abs/2305.14279

[7] Victor Chernozhukov, Kaspar Wüthrich, and Yinchu Zhu. Distributional Conformal Prediction. Proceedings of the National Academy of Sciences, 118(48):e2107794118, 2021. doi: 10.1073/pnas.2107794118. URL https://www.pnas.org/doi/10.1073/pnas.2107794118

[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https:...

[9] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: representing model uncertainty in deep learning, 6 2015. URL https://arxiv.org/abs/1506.02142

[10] Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X. Wang, and Sadid Hasan. Does prompt formatting have any impact on llm performance?, 2024. URL https://arxiv.org/abs/2411.10541

[11] Nicholas C. Henderson and Michael A. Newton. Making the cut: Improved ranking and selection for Large-Scale inference. Journal of the Royal Statistical Society Series B (Statistical Methodology), 78(4):781–804, 11 2015. doi: 10.1111/rssb.12131. URL https://doi.org/10.1111/rssb.12131

[12] Eliahu Horwitz and Yedid Hoshen. Conffusion: confidence intervals for diffusion models, 11 … URL https://arxiv.org/abs/2211.09795

[13] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-Efficient Transfer Learning for NLP, 2 2019. URL https://arxiv.org/abs/1902.00751

[14] Alireza Javanmardi, David Stutz, and Eyke Hüllermeier. Conformalized Credal Set Predictors. In Advances in Neural Information Processing Systems, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/d42a8bf2f40555d4a5120300f98c88f6-Paper-Conference.pdf

[15] Laurent Valentin Jospin, Hamid Laga, Farid Boussaid, Wray Buntine, and Mohammed Bennamoun. Hands-On Bayesian Neural Networks—A tutorial for deep learning users. IEEE Computational Intelligence Magazine, 17(2):29–48, 4 2022. doi: 10.1109/mci.2022.3155327. URL https://doi.org/10.1109/mci.2022.3155327

[16] Shayan Kiyani, George Pappas, and Hamed Hassani. Length Optimization in Conformal Prediction, 2024. URL https://arxiv.org/abs/2406.18814

[17] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles, 12 2016. URL https://arxiv.org/abs/1612.01474

[18] Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, and Guangyao Shi. Benchmark Evaluations, Applications, and Challenges of large Vision Language Models: a survey, 1 2025. URL https://arxiv.org/abs/2501.02189

[19] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods, 2022. URL https://arxiv.org/abs/2109.07958

[20] Rui Luo and Zhixin Zhou. Conformity Score Averaging for Classification. In Proceedings of the 42nd International Conference on Machine Learning, 2025. URL https://proceedings.mlr.press/v267/luo25v.html

[21] Rui Luo and Zhixin Zhou. Trustworthy classification through rank-based conformal prediction sets. Pattern Recognition, 172:112330, 2026. doi: 10.1016/j.patcog.2025.112330

[22] Eric Mitchell, Joseph J. Noh, Siyan Li, William S. Armstrong, Ananth Agarwal, Patrick Liu, Chelsea Finn, and Christopher D. Manning. Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference, 11 2022. URL https://arxiv.org/abs/2211.11875

[23] M. A. Newton, N. G. Polson, and J. Xu. Weighted Bayesian bootstrap for scalable posterior distributions. Canadian Journal of Statistics, 49(2):421–437, 2020. doi: 10.1002/cjs.11570

[24] Daniel Nolte, Souparno Ghosh, and Ranadip Pal. Efficient Normalized Conformal Prediction and Uncertainty Quantification for Anti-Cancer Drug Sensitivity Prediction with Deep Regression Forests, 2024. URL https://arxiv.org/abs/2402.14080

[25] Eduardo Ochoa Rivera, Yash Patel, and Ambuj Tewari. Conformal Prediction for Ensembles: Improving Efficiency via Score-Based Aggregation. In Advances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=sNTqqdPVSv

[26] Vincent Plassier, Nikita Kotelevskii, Aleksandr Rubashevskii, Fedor Noskov, Maksim Velikanov, Alexander Fishkov, Samuel Horvath, Martin Takác, Éric Moulines, and Maxim Panov. Efficient Conformal Prediction under Data Heterogeneity, 2024.

[27] Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling, 6 2023. URL https://arxiv.org/abs/2306.10193

[28] David Stutz, Abhijit Guha Roy, Tatiana Matejovicova, Patricia Strachan, Ali Taylan Cemgil, and Arnaud Doucet. Conformal prediction under ambiguous ground truth. Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=CAd6V2qXxc

[29] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005. URL https://link.springer.com/book/10.1007/b98835

[30] Yasin Abbasi Yadkori, Ilja Kuzborskij, David Stutz, András György, Adam Fisch, Arnaud Doucet, Iuliya Beloshapka, Wei-Hung Weng, Yao-Yuan Yang, Csaba Szepesvári, Ali Taylan Cemgil, and Nenad Tomasev. Mitigating LLM hallucinations via conformal abstention, 4 2024. URL https://arxiv.org/abs/2405.01563

[31] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers, 6 2021. URL https://arxiv.org/abs/2106.04560

A Code Availability. We provide the full implementation of the r-value method and demo for CLIP at: https://github.com/Yogesh914/conformal-rvalue

B Computation Efficiency Analysis. Table 3: Training time for 1,000...
For any selected conformal set $C$, the expected size of the set can be decomposed as

$$\mathbb{E}[|C|] = \mathbb{E}\Big[\mathbf{1}(y_{\mathrm{true}} \in C) + \sum_{y' \neq y_{\mathrm{true}}} \mathbf{1}(y' \in C)\Big] = \Pr(y_{\mathrm{true}} \in C) + \sum_{y' \neq y_{\mathrm{true}}} \Pr(y' \in C),$$

where $\mathbf{1}(\cdot)$ is the indicator function and $y_{\mathrm{true}}$ is the true label. We want to show that

$$\mathbb{E}[|C_r|] - \mathbb{E}[|C_{\mathrm{avg}}|] = \big(P(y_{\mathrm{true}} \in C_r) - P(y_{\mathrm{true}} \in C_{\mathrm{avg}})\big) + \sum_{y' \neq y_{\mathrm{true}}} \big(P(y' \in C_r) - P(y' \in C_{\mathrm{avg}})\big)$$

is non-positive. …

Uniform dominance of inclusion probabilities. For every moderately small $\sigma^2 \ge 0$ and every false label $y'$ with $\theta_{y'} = \mu_0$,

$$P^{r}_{\mathrm{incl}}(\sigma^2 \mid \mu_0) \le P^{\mathrm{std}}_{\mathrm{incl}}(\sigma^2 \mid \mu_0).$$

Then the r-value conformal predictor produces, on average, a smaller prediction set than standard conformal prediction: $\mathbb{E}[|C_r|] \le \mathbb{E}[|C_{\mathrm{std}}|]$. Proof. Let $Y_{\mathrm{true}}$ denote the true label, and let $y' \neq Y_{\mathrm{true}}$ …

Expectation-level guarantee. If $f(\delta) \ge 0$, then $\mathbb{E}[G_k] \ge 0$. This implies $\mathbb{E}[|C_{\mathrm{avg}}|] \ge \mathbb{E}[|C_r|]$.

High-probability guarantee. Note that for each $i$, the difference $\mathbf{1}(i \in C_{\mathrm{avg}}) - \mathbf{1}(i \in C_r)$ lies in $[-1, 1]$. Since these are independent across $i$, we may apply Hoeffding's inequality:

$$P(G_k \le 0) = P\big(G_k - \mathbb{E}[G_k] \le -\mathbb{E}[G_k]\big) \le \exp\!\left(-\frac{2\,\mathbb{E}[G_k]^2}{4k}\right) \le \exp\!\left(-\frac{2\,(k f(\delta))^2}{4k}\right) = \exp\!\left(-\tfrac{1}{2}\, k f(\delta)^2\right).$$

To ensure $P(|C_r| < |C_{\mathrm{avg}}|) \ge 1 - \eta$, we require

$$\exp\!\left(-\tfrac{1}{2}\, k f(\delta)^2\right) \le \eta \iff k \ge \frac{2 \log(1/\eta)}{f(\delta)^2}.$$

Th…
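The final inequality in the Hoeffding step, $k \ge 2\log(1/\eta)/f(\delta)^2$, is easy to evaluate numerically. The sketch below does only that arithmetic; the function name and the example values `f_delta=0.2` and `eta=0.05` are assumptions chosen for illustration, not values from the paper.

```python
import math

def min_terms(f_delta, eta):
    """Smallest integer k satisfying exp(-0.5 * k * f_delta**2) <= eta."""
    if f_delta <= 0 or not 0 < eta < 1:
        raise ValueError("need f_delta > 0 and 0 < eta < 1")
    return math.ceil(2.0 * math.log(1.0 / eta) / f_delta ** 2)

# Illustrative values only: a per-term gap of 0.2 and a 5% failure probability.
k = min_terms(f_delta=0.2, eta=0.05)
bound_at_k = math.exp(-0.5 * k * 0.2 ** 2)
```

Because the requirement scales as $1/f(\delta)^2$, halving the per-term gap roughly quadruples the number of terms needed for the same failure probability.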
…
1. Read the description carefully
2. Generate {num_rephrases} diverse and natural-sounding rephrasings of the description
3. Keep the core meaning the same
4. Return a JSON object with the key rephrased_descriptions, containing a list of strings.

User prompt. Original description: {input_image_description}

GPQA Rephrasing Prompt

System prompt. You are an expert exam question rephraser. Your task is to:
1. Read the question and its answer options carefully
2. Generate different rephrased versions of the question while maintaining relevance to the answer options
3. Return the rephrased versions as a JSON object with the key rephrased_questions, containing a list of strings
4. Return exactly the requested number of rephrased questions.

User prompt. Rephrase the following question while considering the answer options. Question: {input_text} Answer options: {options_text} Number of rephrased questions to return: {num_rephrases}

TruthfulQA Evaluation Prompt

System prompt. You are an expert evaluator assessing how well answers address...
1. Read each question and the provided answer carefully
2. Evaluate how directly and completely the answer addresses each question
3. Consider factors such as accuracy, relevance, and comprehensiveness
4. Assign a score from 0 to 10, with one decimal place, for each question
5. Return the scores as a JSON object with a scores array
6. Use the full range of decimal scores, avoid round numbers, and avoid assigning the same score to every question.

User prompt. Evaluate how well the following answer addresses each question. Return only a JSON object with a scores array containing scores from 0 to 10 with one decimal place. Answer to evaluate: {answer} Questions: {formatted_questions}

I Broa...

[1] Moloud Abdar, Farhad Pourpanah, Sadiq Hussain, Dana Rezazadegan, Li Liu, Mohammad Ghavamzadeh, Paul Fieguth, Xiaochun Cao, Abbas Khosravi, U. Rajendra Acharya, Vladimir Makarenkov, and Saeid Nahavandi. A review of uncertainty quantification in deep learning: Techniques, applications and challenges. Information Fusion, 76:243–297, 5 2021. doi: 10.1016/j.inffus.2021.05.008.

[2] Anastasios N. Angelopoulos and Stephen Bates. Conformal prediction: A gentle introduction. Foundations and Trends® in Machine Learning, 16(4):494–591, 1 2023. doi: 10.1561/2200000101. URL https://doi.org/10.1561/2200000101

[3] Anastasios N. Angelopoulos, Stephen Bates, Emmanuel J. Candès, Michael I. Jordan, and Lihua Lei. Learn then Test: Calibrating Predictive Algorithms to Achieve Risk Control, 10 2021. URL https://arxiv.org/abs/2110.01052

[4] Anastasios N. Angelopoulos, Stephen Bates, Adam Fisch, Lihua Lei, and Tal Schuster. Conformal Risk Control. In International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=33XGfHLtZg

[5] Michele Caprio, David Stutz, Shuo Li, and Arnaud Doucet. Conformalized Credal Regions for Classification with Ambiguous Ground Truth. OpenReview, 2025. URL https://openreview.net/forum?id=L7sQ8CW2FY

[6] Angelica Chen, Jason Phang, Alicia Parrish, Vishakh Padmakumar, Chen Zhao, Samuel R. Bowman, and Kyunghyun Cho. Two failures of Self-Consistency in the Multi-Step reasoning of LLMs, 5 2023. URL https://arxiv.org/abs/2305.14279

[7] Victor Chernozhukov, Kaspar Wüthrich, and Yinchu Zhu. Distributional Conformal Prediction. Proceedings of the National Academy of Sciences, 118(48):e2107794118, 2021. doi: 10.1073/pnas.2107794118. URL https://www.pnas.org/doi/10.1073/pnas.2107794118

[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. URL https:...

[9] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: representing model uncertainty in deep learning, 6 2015. URL https://arxiv.org/abs/1506.02142

[10] Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X. Wang, and Sadid Hasan. Does prompt formatting have any impact on LLM performance?, 2024. URL https://arxiv.org/abs/2411.10541

[11] Nicholas C. Henderson and Michael A. Newton. Making the cut: Improved ranking and selection for Large-Scale inference. Journal of the Royal Statistical Society Series B (Statistical Methodology), 78(4):781–804, 11 2015. doi: 10.1111/rssb.12131. URL https://doi.org/10.1111/rssb.12131

[12] Eliahu Horwitz and Yedid Hoshen. Conffusion: confidence intervals for diffusion models, 11 2022. URL https://arxiv.org/abs/2211.09795

[14] Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin de Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. Parameter-Efficient Transfer Learning for NLP, 2 2019. URL https://arxiv.org/abs/1902.00751

[15] Alireza Javanmardi, David Stutz, and Eyke Hüllermeier. Conformalized Credal Set Predictors. In Advances in Neural Information Processing Systems, 2024. URL https://proceedings.neurips.cc/paper_files/paper/2024/file/d42a8bf2f40555d4a5120300f98c88f6-Paper-Conference.pdf

[16] Laurent Valentin Jospin, Hamid Laga, Farid Boussaid, Wray Buntine, and Mohammed Bennamoun. Hands-On Bayesian Neural Networks—A tutorial for deep learning users. IEEE Computational Intelligence Magazine, 17(2):29–48, 4 2022. doi: 10.1109/mci.2022.3155327. URL https://doi.org/10.1109/mci.2022.3155327

[17] Shayan Kiyani, George Pappas, and Hamed Hassani. Length Optimization in Conformal Prediction, 2024. URL https://arxiv.org/abs/2406.18814

[18] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and Scalable Predictive Uncertainty Estimation using Deep Ensembles, 12 2016. URL https://arxiv.org/abs/1612.01474

[19] Zongxia Li, Xiyang Wu, Hongyang Du, Huy Nghiem, and Guangyao Shi. Benchmark Evaluations, Applications, and Challenges of large Vision Language Models: a survey, 1 2025. URL https://arxiv.org/abs/2501.02189

[20] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods, 2022. URL https://arxiv.org/abs/2109.07958

[21] Rui Luo and Zhixin Zhou. Conformity Score Averaging for Classification. In Proceedings of the 42nd International Conference on Machine Learning, 2025. URL https://proceedings.mlr.press/v267/luo25v.html

[22] Rui Luo and Zhixin Zhou. Trustworthy classification through rank-based conformal prediction sets. Pattern Recognition, 172:112330, 2026. doi: 10.1016/j.patcog.2025.112330

[23] Eric Mitchell, Joseph J. Noh, Siyan Li, William S. Armstrong, Ananth Agarwal, Patrick Liu, Chelsea Finn, and Christopher D. Manning. Enhancing Self-Consistency and Performance of Pre-Trained Language Models through Natural Language Inference, 11 2022. URL https://arxiv.org/abs/2211.11875

[24] M. A. Newton, N. G. Polson, and J. Xu. Weighted Bayesian bootstrap for scalable posterior distributions. Canadian Journal of Statistics, 49(2):421–437, 2020. doi: 10.1002/cjs.11570

[25] Daniel Nolte, Souparno Ghosh, and Ranadip Pal. Efficient Normalized Conformal Prediction and Uncertainty Quantification for Anti-Cancer Drug Sensitivity Prediction with Deep Regression Forests, 2024. URL https://arxiv.org/abs/2402.14080

[26] Eduardo Ochoa Rivera, Yash Patel, and Ambuj Tewari. Conformal Prediction for Ensembles: Improving Efficiency via Score-Based Aggregation. In Advances in Neural Information Processing Systems, 2025. URL https://openreview.net/forum?id=sNTqqdPVSv

[27] Vincent Plassier, Nikita Kotelevskii, Aleksandr Rubashevskii, Fedor Noskov, Maksim Velikanov, Alexander Fishkov, Samuel Horvath, Martin Takác, Éric Moulines, Maxim Panov, Lagrange Mathematics, Computing Research Center, Paris CMAP, Ecole Polytechnique, Skolkovo Institute of Science, Technology, Moscow HSE University, Technology Innovation Institute, a... Efficient Conformal Prediction under Data Heterogeneity. 2024.

[28] Victor Quach, Adam Fisch, Tal Schuster, Adam Yala, Jae Ho Sohn, Tommi S. Jaakkola, and Regina Barzilay. Conformal language modeling, 6 2023. URL https://arxiv.org/abs/2306.10193

[29] David Stutz, Abhijit Guha Roy, Tatiana Matejovicova, Patricia Strachan, Ali Taylan Cemgil, and Arnaud Doucet. Conformal prediction under ambiguous ground truth. Transactions on Machine Learning Research, 2023. URL https://openreview.net/forum?id=CAd6V2qXxc

[30] Vladimir Vovk, Alexander Gammerman, and Glenn Shafer. Algorithmic Learning in a Random World. Springer, 2005. URL https://link.springer.com/book/10.1007/b98835

[31] Yasin Abbasi Yadkori, Ilja Kuzborskij, David Stutz, András György, Adam Fisch, Arnaud Doucet, Iuliya Beloshapka, Wei-Hung Weng, Yao-Yuan Yang, Csaba Szepesvári, Ali Taylan Cemgil, and Nenad Tomasev. Mitigating LLM hallucinations via conformal abstention, 4 2024. URL https://arxiv.org/abs/2405.01563

[32] Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers, 6 2021. URL https://arxiv.org/abs/2106.04560

A Code Availability

We provide the full implementation of the r-value method and demo for CLIP at: https://github.com/Yogesh914/conformal-rvalue

B Computation Efficiency Analysis

Table 3: Training time for 1,000...

For any selected conformal set $C$, the expected size of the set can be decomposed as

$$\mathbb{E}[|C|] = \mathbb{E}_X\Big[\mathbb{1}(y_{\text{true}} \in C) + \sum_{y' \neq y_{\text{true}}} \mathbb{1}(y' \in C)\Big] = \Pr(y_{\text{true}} \in C) + \sum_{y' \neq y_{\text{true}}} \Pr(y' \in C),$$

where $\mathbb{1}(\cdot)$ is the indicator function and $y_{\text{true}}$ is the true label. We want to show that

$$\mathbb{E}[|C_r|] - \mathbb{E}[|C_{\text{avg}}|] = \big(P(y_{\text{true}} \in C_r) - P(y_{\text{true}} \in C_{\text{avg}})\big) + \sum_{y' \neq y_{\text{true}}} \big(P(y' \in C_r) - P(y' \in C_{\text{avg}})\big)$$

is non-positive. ...
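The decomposition of expected set size into per-label inclusion probabilities can be checked numerically. The sketch below is only an illustration: the inclusion probabilities are made-up values, the function name `simulate_mean_set_size` is hypothetical, and labels are drawn independently purely for convenience (the identity itself follows from linearity of expectation and does not need independence).

```python
import random


def simulate_mean_set_size(incl_probs, n_trials=20000, seed=0):
    """Monte Carlo estimate of E[|C|] when label j enters the set with
    probability incl_probs[j] (independent draws: a simulation assumption)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_trials):
        # Size of one simulated prediction set = number of included labels.
        total += sum(1 for p in incl_probs if rng.random() < p)
    return total / n_trials


# Hypothetical inclusion probabilities: first entry plays the role of the
# true label, the rest are false labels. These numbers are not from the paper.
incl_probs = [0.9, 0.3, 0.1, 0.05]

# Right-hand side of the decomposition: sum of inclusion probabilities.
expected_size = sum(incl_probs)

# Left-hand side, estimated by simulation.
estimated_size = simulate_mean_set_size(incl_probs)

print(f"sum of inclusion probabilities: {expected_size:.3f}")
print(f"simulated mean set size:        {estimated_size:.3f}")
```

Under this reading, lowering any false label's inclusion probability while holding the true label's fixed lowers the expected set size by the same amount.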

Uniform dominance of inclusion probabilities. For every moderately small $\sigma^2 \ge 0$ and every false label $y'$ with $\theta_{y'} = \mu_0$,

$$P^{r}_{\text{incl}}(\sigma^2 \mid \mu_0) \le P^{\text{std}}_{\text{incl}}(\sigma^2 \mid \mu_0).$$

Then the r-value conformal predictor produces, on average, a smaller prediction set than standard conformal prediction: $\mathbb{E}[|C_r|] \le \mathbb{E}[|C_{\text{std}}|]$.

Proof. Let $Y_{\text{true}}$ denote the true label, and let $y' \neq Y_{\text{true}}$...

Expectation-level guarantee. If $f(\delta) \ge 0$, then $\mathbb{E}[G_k] \ge 0$. This implies $\mathbb{E}[|C_{\text{avg}}|] \ge \mathbb{E}[|C_r|]$.

High-probability guarantee. Note that for each $i$, the difference $\mathbb{1}(i \in C_{\text{avg}}) - \mathbb{1}(i \in C_r)$ lies in $[-1, 1]$. Since these are independent across $i$, we may apply Hoeffding's inequality:

$$P(G_k \le 0) = P\big(G_k - \mathbb{E}[G_k] \le -\mathbb{E}[G_k]\big) \le \exp\!\left(-\frac{2\,\mathbb{E}[G_k]^2}{4k}\right) \le \exp\!\left(-\frac{2\,(k f(\delta))^2}{4k}\right) = \exp\!\left(-\tfrac{1}{2}\, k f(\delta)^2\right).$$

To ensure $P(|C_r| < |C_{\text{avg}}|) \ge 1 - \eta$, we require:

$$\exp\!\left(-\tfrac{1}{2}\, k f(\delta)^2\right) \le \eta \iff k \ge \frac{2 \log(1/\eta)}{f(\delta)^2}.$$

Th...
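The sample-size condition from the Hoeffding bound can be evaluated directly. The following is a minimal sketch, assuming $f(\delta) > 0$ and $0 < \eta < 1$; the function names `hoeffding_bound` and `min_terms`, and the example values $f(\delta) = 0.1$ and $\eta = 0.05$, are hypothetical choices for illustration and are not taken from the paper.

```python
import math


def hoeffding_bound(k: int, f_delta: float) -> float:
    """Upper bound exp(-k * f_delta^2 / 2) on P(G_k <= 0)."""
    return math.exp(-0.5 * k * f_delta ** 2)


def min_terms(f_delta: float, eta: float) -> int:
    """Smallest integer k satisfying k >= 2 * log(1/eta) / f_delta^2."""
    if f_delta <= 0:
        raise ValueError("f_delta must be positive for the bound to be informative")
    if not 0.0 < eta < 1.0:
        raise ValueError("eta must lie strictly between 0 and 1")
    return math.ceil(2.0 * math.log(1.0 / eta) / f_delta ** 2)


# Illustrative (assumed) values: per-term gap f(delta) = 0.1, failure level eta = 0.05.
f_delta, eta = 0.1, 0.05
k_min = min_terms(f_delta, eta)

print(f"minimum k: {k_min}")
print(f"bound at k_min:     {hoeffding_bound(k_min, f_delta):.4f}")
print(f"bound at k_min - 1: {hoeffding_bound(k_min - 1, f_delta):.4f}")
```

Because the requirement scales as $1/f(\delta)^2$, halving the per-term gap roughly quadruples the number of independent terms needed for the same failure level.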

1. Read the description carefully
2. Generate {num_rephrases} diverse and natural-sounding rephrasings of the description
3. Keep the core meaning the same
4. Return a JSON object with the key rephrased_descriptions, containing a list of strings.

User prompt. Original description: {input_image_description}

GPQA Rephrasing Prompt

System prompt. You are an expert exam question rephraser. Your task is to:

1. Read the question and its answer options carefully
2. Generate different rephrased versions of the question while maintaining relevance to the answer options
3. Return the rephrased versions as a JSON object with the key rephrased_questions, containing a list of strings
4. Return exactly the requested number of rephrased questions.

User prompt. Rephrase the following question while considering the answer options. Question: {input_text} Answer options: {options_text} Number of rephrased questions to return: {num_rephrases}

TruthfulQA Evaluation Prompt

System prompt. You are an expert evaluator assessing how well answers address...

1. Read each question and the provided answer carefully
2. Evaluate how directly and completely the answer addresses each question
3. Consider factors such as accuracy, relevance, and comprehensiveness
4. Assign a score from 0 to 10, with one decimal place, for each question
5. Return the scores as a JSON object with a scores array
6. Use the full range of decimal scores, avoid round numbers, and avoid assigning the same score to every question.

User prompt. Evaluate how well the following answer addresses each question. Return only a JSON object with a scores array containing scores from 0 to 10 with one decimal place. Answer to evaluate: {answer} Questions: {formatted_questions}

I Broa...