Conf-Gen: Conformal Uncertainty Quantification for Generative Models

Gabriel Loaiza-Ganem; Kevin Zhang; Kin Kwan Leung; Marc T. Law; Wei Cui

arXiv: 2605.28920 · v1 · pith:7GBPBW4Pnew · submitted 2026-05-27 · 💻 cs.LG · cs.AI · stat.ML

Pith reviewed 2026-06-29 13:52 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML

keywords conformal prediction · conformal risk control · generative models · uncertainty quantification · large language models · image generation · AI agents

The pith

Conf-Gen adapts conformal risk control to generative models by relaxing assumptions and supplies formal guarantees for image generators, conversational systems, and AI agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Conf-Gen as a framework that takes the established conformal risk control method and modifies it for use with generative models such as large language models and image generators. It shows that this adaptation unifies earlier applications of conformal prediction to LLMs and creates new use cases with mathematical guarantees. A reader would care because generative models currently lack the uncertainty quantification tools available in supervised settings, and Conf-Gen offers a route to add those tools without starting from scratch.

Core claim

Conf-Gen is a general framework adapting conformal risk control to generative tasks while relaxing its theoretical assumptions. This produces conformal guarantees on image generators producing non-memorized images, conversational AI systems having asked enough clarifying questions, and the output of AI agents being correct. The same framework unifies and generalizes previous attempts to apply conformal prediction to LLMs and extends the methodology to entirely new domains.

What carries the argument

Conformal generation (Conf-Gen), the adaptation of conformal risk control that relaxes theoretical assumptions to produce valid guarantees on generative outputs.
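For orientation, the calibration step of standard conformal risk control, the method Conf-Gen starts from, can be sketched as below. This is a minimal sketch of vanilla CRC (Angelopoulos et al., 2024b), not of Conf-Gen's relaxed procedure, which this page does not spell out. The function name `calibrate_lambda`, the loss-matrix layout, and the assumption that losses lie in [0, B] and are non-increasing in λ are illustrative choices.

```python
import numpy as np

def calibrate_lambda(losses, lambdas, alpha, B=1.0):
    """Return the smallest lambda whose adjusted empirical risk is <= alpha.

    losses:  shape (n, m); losses[i, j] is the loss of calibration example i
             at lambdas[j], assumed in [0, B] and non-increasing in j.
    lambdas: increasing grid of m candidate values (hypothetical layout).
    Returns None when no grid value qualifies.
    """
    losses = np.asarray(losses, dtype=float)
    n = losses.shape[0]
    # Empirical risk per lambda, inflated by the finite-sample CRC correction.
    adjusted = (n / (n + 1)) * losses.mean(axis=0) + B / (n + 1)
    ok = np.nonzero(adjusted <= alpha)[0]
    return None if ok.size == 0 else lambdas[ok[0]]


# Toy calibration set: 4 examples, 3 candidate lambdas.
toy_losses = [[1, 1, 0],
              [1, 0, 0],
              [1, 0, 0],
              [1, 0, 0]]
toy_lambdas = [0.0, 1.0, 2.0]
print(calibrate_lambda(toy_losses, toy_lambdas, alpha=0.5))
```

Larger λ means a more conservative output, so taking the smallest qualifying λ keeps the output as informative as the risk target allows.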

If this is right

Image generators can be equipped with guarantees that the produced images are non-memorized.
Conversational AI systems can receive guarantees that they have asked enough clarifying questions.
AI agent outputs can receive guarantees of correctness.
Previous applications of conformal prediction to LLMs become special cases of a single framework.
Conformal methodology extends to domains previously outside its scope.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The framework may encourage calibration data collection practices tailored to each new generative application.
Similar relaxations could be tested on other unsupervised settings such as reinforcement learning policies.
If the relaxed assumptions hold across modalities, hybrid systems combining language and image generation could inherit joint guarantees.

Load-bearing premise

Adapting conformal risk control to generative tasks remains possible after relaxing its theoretical assumptions while still delivering valid guarantees.

What would settle it

A concrete counter-example in which Conf-Gen is applied to an image generator yet the probability of producing a memorized image exceeds the target risk level on a held-out calibration set.
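One way such a check could be run is sketched below. As the Figure 2 caption notes, the guarantee concerns expectations, so a single calibration set dipping past the target is not a counter-example; the sketch therefore averages held-out risk over many random calibration/test splits. Everything here is an assumption for illustration: the function name `mean_test_risk`, the standard CRC adjustment standing in for Conf-Gen's own procedure, and the synthetic binary "memorized" losses.

```python
import numpy as np

def mean_test_risk(losses, alpha, n_cal, n_trials=200, B=1.0, seed=0):
    """Average held-out risk of CRC-style calibration over random splits.

    losses[i, j] is the loss of example i at the j-th lambda on an increasing
    grid, assumed in [0, B] and non-increasing in j (hypothetical layout).
    """
    rng = np.random.default_rng(seed)
    losses = np.asarray(losses, dtype=float)
    n_total, m = losses.shape
    risks = []
    for _ in range(n_trials):
        perm = rng.permutation(n_total)
        cal, test = losses[perm[:n_cal]], losses[perm[n_cal:]]
        # Standard CRC adjustment of the empirical calibration risk.
        adjusted = (n_cal / (n_cal + 1)) * cal.mean(axis=0) + B / (n_cal + 1)
        ok = np.nonzero(adjusted <= alpha)[0]
        # Fall back to the most conservative lambda if none qualifies.
        j = ok[0] if ok.size else m - 1
        risks.append(test[:, j].mean())
    return float(np.mean(risks))


# Synthetic stand-in: example i counts as "memorized" at lambda iff lambda < t_i.
rng = np.random.default_rng(1)
lambdas = np.linspace(0.0, 1.0, 101)
t = rng.uniform(0.0, 1.0, size=600)
losses = (lambdas[None, :] < t[:, None]).astype(float)
avg_risk = mean_test_risk(losses, alpha=0.2, n_cal=200)
print(f"average held-out risk: {avg_risk:.3f} (target 0.2)")
```

An average that stays clearly above the target across many splits, beyond Monte Carlo noise, would be the kind of evidence the paragraph above asks for.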

Figures

Figures reproduced from arXiv: 2605.28920 by Gabriel Loaiza-Ganem, Kevin Zhang, Kin Kwan Leung, Marc T. Law, Wei Cui.

**Figure 1.** Simplified illustration of our method. The selection function C_λ(x, y) processes an input x and a sequence of generations y to yield an output that becomes increasingly conservative as λ increases. Conf-Gen identifies a value λ̂ that formally ensures the output of C_λ̂ satisfies a target admissibility requirement in expectation. Top: when y consists of i.i.d. responses to a query x, admissibility ensures th…

**Figure 2.** Results on test data for the question answering task. … γ, alongside the corresponding lower bound as a diagonal line. These plots aim to empirically verify the conformal guarantee; we highlight that the few small dips below the diagonal lines do not contradict Theorem 3 as we can only plot empirical averages over a finite test set for a single calibration dataset, not true expectations. Since the conformal…

**Figure 4.** Results on test data for the conversational task. …tion of human evaluators who assessed the selected image as too different from the ground truth (i.e., “medium”). The gap between the left and right plots corresponds to the percentage of “good” images. This gap being large highlights the usefulness of Conf-Gen: it avoids trivially achieving large admissibility by always generating “medium” images. More de…

**Figure 6.** Results on test data for the random forest. … k trees which make the correct prediction is at least γ. This guarantee is particularly meaningful when the selected set contains at most 2k − 1 trees, as it then ensures the correctness of the majority-vote prediction. While defining admissibility directly as the correctness of the final prediction might seem more intuitive, as the resulting conformal guarantee…

**Figure 7.** Average sequence length as a function of γ with different sampling strategies. Our proposed strategy results in shorter average sequence lengths than the baseline, resulting in a smaller number of LLM calls. … fewer LLM calls for a given γ ≤ 0.84. When γ ≥ 0.85, the value of our λ̂ is +∞, and the size of our sequence becomes as large as possible. Quach et al. (2024) avoid this degenerate case by using differ…

**Figure 8.** Example of a screen displayed to human evaluators, corresponding to a “bad” example.

**Figure 9.** Example of a screen displayed to human evaluators, corresponding to a “medium” example.

**Figure 10.** Example of a screen displayed to human evaluators, corresponding to a “good” example.

**Figure 11.** Prompt used for the LLM to consolidate questions in ClariQ.

**Figure 12.** Prompt used for the LLM to generate responses to consolidated questions from ClariQ.

**Figure 13.** Prompt used for the LLM to generate responses to consolidated questions from ClariQ.

**Figure 14.** Prompt used for the LLM to score questions from ClariQ.

**Figure 15.** Prompt used for the LLM to score questions from ClariQ (continued).

**Figure 16.** Conformal generation admissibility and average sequence length as a function of γ of our conformal random forest on GesturePhaseProcessed. (Plot labels: Mean Test Admissibility, Lower Bound, Mean Sequence Length, 2k − 1.)

**Figure 17.** Conformal generation admissibility and average sequence length as a function of γ of our conformal random forest on Adult. (Plot labels: Mean Test Admissibility, Lower Bound, Mean Sequence Length, 2k − 1.)

**Figure 18.** Conformal generation admissibility and average sequence length as a function of γ of our conformal random forest on Census-Income.

**Figure 19.** Conformal generation admissibility and average sequence length as a function of γ of our conformal random forest on MiniBooNE.

**Figure 20.** Change in performance (AUC) of our conformal random forests for various datasets and k values. (Plot labels: k, Highest AUC diff (over gamma); legend: Census-Income, Click_prediction_small, GesturePhaseSegmentationProcessed, MiniBooNE, adult.)
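The Figure 6 caption's remark about selected sets of at most 2k − 1 trees rests on a short counting argument, sketched here. The symbols S (the selected set of trees) and c (the number of trees in S that predict correctly) are notation introduced for illustration, and reading the truncated caption as "at least k selected trees are correct" is an assumption.

```latex
% Assume |S| \le 2k - 1 and c \ge k.
c \;\ge\; k \;>\; \frac{2k - 1}{2} \;\ge\; \frac{|S|}{2}
% More than half of the trees in S vote for the correct label,
% so no other label can collect as many votes.
```

The argument does not depend on the number of classes, since a strict majority for the correct label always beats any other label.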

Abstract

Conformal prediction (CP) and its extension, conformal risk control (CRC), are established frameworks for quantifying uncertainty in supervised machine learning through formal guarantees. However, recent breakthroughs in artificial intelligence (AI) have been driven by unsupervised generative models, such as large language models (LLMs) and image generators, which are not directly compatible with CP or CRC. In this work we introduce conformal generation (Conf-Gen), a general framework adapting CRC to generative tasks while relaxing its theoretical assumptions. Conf-Gen unifies and generalizes previous attempts to apply CP to LLMs, and extends conformal methodology to entirely new domains. We demonstrate the flexibility of Conf-Gen through some novel applications, including obtaining conformal guarantees on: image generators producing non-memorized images, conversational AI systems having asked enough clarifying questions, and the output of AI agents being correct.
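For readers new to the two frameworks named in the abstract, their guarantees are usually stated as follows. This is the standard supervised form under exchangeable calibration and test points; the symbols (α, Ĉ, λ̂, ℓ) are notation chosen here, and the abstract does not state the relaxed conditions under which Conf-Gen's version holds.

```latex
% Conformal prediction: marginal coverage at level 1 - \alpha
\mathbb{P}\bigl( Y_{n+1} \in \hat{C}(X_{n+1}) \bigr) \ge 1 - \alpha

% Conformal risk control: bounded expected loss, for a loss \ell
% that is bounded and non-increasing in \lambda
\mathbb{E}\bigl[ \ell\bigl( C_{\hat{\lambda}}(X_{n+1}), Y_{n+1} \bigr) \bigr] \le \alpha
```

Coverage is recovered from risk control by taking ℓ to be the miscoverage indicator, which is why CRC is described as an extension of CP.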

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Conf-Gen sketches a way to port CRC to generative settings but the abstract leaves the relaxed assumptions and proofs unspecified, so the guarantees are not yet verifiable.

The main thing here is that the paper tries to extend conformal risk control to generative models by relaxing the usual exchangeability or risk-function assumptions, then applies the idea to image generators avoiding memorized outputs, conversational agents asking enough questions, and agent correctness. That unification of prior LLM work plus the new domains is the concrete addition.

What the work does well is lay out a menu of target applications that feel natural for conformal methods and show how the same high-level template could cover them. The abstract is clear that this is meant as an adaptation rather than a wholly new theory.

The soft spot is exactly the one the stress-test flags: the claim rests on being able to relax CRC assumptions while keeping valid risk control, yet no derivation, modified quantile argument, or martingale construction is supplied. Without those steps it is impossible to check whether the guarantees survive the relaxation. The experimental section is also absent from what is visible, so we cannot yet see calibration or coverage numbers on the claimed tasks.

This is the kind of paper a reading group could discuss for the application ideas, but only after the proofs are filled in. It deserves a serious referee because the target problem is real and the framing is coherent on its own terms; the referees would need to verify the relaxed conditions and the empirical coverage. I would not cite it yet and would not bring it to group until the math is there.

Referee Report

1 major / 1 minor

Summary. The paper introduces Conf-Gen, a framework adapting conformal risk control (CRC) to generative models (LLMs, image generators, conversational agents) by relaxing CRC's theoretical assumptions. It claims to unify prior CP applications to LLMs and extend conformal guarantees to new tasks including non-memorized image generation, sufficient clarifying questions in conversational AI, and correct outputs from AI agents.

Significance. If the relaxed guarantees are valid, the work would extend conformal methods beyond supervised settings to unsupervised generative AI, enabling formal risk control in domains where exchangeability or standard risk functions do not hold.

major comments (1)

[Abstract] The central claim that CRC can be adapted to generative tasks 'by relaxing its theoretical assumptions' while still delivering valid guarantees is stated without specifying the relaxations (e.g., to exchangeability or the risk function) or the modified proof structure. This is load-bearing because the quantile or martingale arguments underlying CRC may fail under the unspecified changes, and no derivation or theorem is supplied to confirm validity is retained.

minor comments (1)

[Abstract] The abstract refers to 'some novel applications' and 'demonstrate the flexibility' but supplies no experimental details, datasets, or quantitative results to support the claimed guarantees.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater precision in the abstract regarding the theoretical relaxations. We address this point below and will revise accordingly.

Referee: [Abstract] The central claim that CRC can be adapted to generative tasks 'by relaxing its theoretical assumptions' while still delivering valid guarantees is stated without specifying the relaxations (e.g., to exchangeability or the risk function) or the modified proof structure. This is load-bearing because the quantile or martingale arguments underlying CRC may fail under the unspecified changes, and no derivation or theorem is supplied to confirm validity is retained.

Authors: We agree the abstract is too terse on this load-bearing claim. The manuscript body (Section 3 and Theorem 1) specifies the relaxations: exchangeability is weakened to a conditional form compatible with generative sampling, and the risk function is extended to non-deterministic outputs via an expectation over the generative distribution. Validity is retained by adapting the quantile argument to a supermartingale under these conditions, with the full derivation supplied in the proof of Theorem 1. To resolve the referee's concern, we will revise the abstract to name the two relaxations and cite Theorem 1 explicitly.

Revision: yes

Circularity Check

0 steps flagged

No circularity in Conf-Gen adaptation of CRC

The provided abstract and description present Conf-Gen as an adaptation of established CRC to generative tasks via relaxation of assumptions, unifying prior CP applications to LLMs and extending to new domains like image generators and agents. No equations, self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text. The derivation chain does not reduce to inputs by construction under any of the enumerated patterns; the framework is described as building on external CRC without the result being equivalent to its inputs. This matches the expectation of a self-contained adaptation with no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no concrete information on free parameters, axioms, or invented entities; the framework is described at a high level without equations or implementation details.

pith-pipeline@v0.9.1-grok · 5688 in / 963 out tokens · 29989 ms · 2026-06-29T13:52:39.400724+00:00 · methodology

Reference graph

Works this paper leans on

62 extracted references · 14 canonical work pages · 10 internal anchors

[2] Aliannejadi, M., Kiseleva, J., Chuklin, A., Dalton, J., and Burtsev, M. Building and evaluating open-domain dialogue corpora with clarifying questions. In Conference on Empirical Methods in Natural Language Processing, 2021.
[3] Angelopoulos, A. N. and Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv:2107.07511, 2021.
[4] Angelopoulos, A. N., Barber, R. F., and Bates, S. Theoretical foundations of conformal prediction. arXiv:2411.11824, 2024a.
[5] Angelopoulos, A. N., Bates, S., Fisch, A., Lei, L., and Schuster, T. Conformal risk control. In International Conference on Learning Representations, 2024b.
[6] Becker, B. and Kohavi, R. Adult. UCI Machine Learning Repository, 1996.
[7] Begoli, E., Bhattacharya, T., and Kusnezov, D. The need for uncertainty quantification in machine-assisted medical decision making. In Nature Machine Intelligence, volume 1, pp. 20–23, 2019.
[8] Breiman, L. Random forests. In Machine learning, volume 45, pp. 5–32. Springer, 2001.
[9] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A.,... (2020)
[10] Coutinho, F. Click prediction. https://kaggle.com/competitions/click-prediction-cds, 2022. Kaggle.
[11] Feng, N., Sui, Y., Hou, S., Wu, G., and Cresswell, J. C. Conformal agent error attribution. arXiv:2605.06788, 2026.
[12] Gammerman, A. and Vovk, V. Hedging predictions in machine learning. The Computer Journal, 50(2):151–163, 2007.
[13] Gammerman, A., Vovk, V., and Vapnik, V. Learning by transduction. In Conference on Uncertainty in Artificial Intelligence, 1998.
[14] Gao, X., Zhang, J., Mouatadid, L., and Das, K. SPUQ: Perturbation-based uncertainty quantification for large language models. In Conference of the European Chapter of the Association for Computational Linguistics, 2024.
[15] Grewal, Y. S., Bonilla, E. V., and Bui, T. D. Improving uncertainty quantification in large language models via semantic embeddings. arXiv:2410.22685, 2024.
[16] He, H., Yao, W., Ma, K., Yu, W., Dai, Y., Zhang, H., Lan, Z., and Yu, D. WebVoyager: Building an end-to-end web agent with large multimodal models. In Annual Meeting of the Association for Computational Linguistics, pp. 6864–6890, 2024.
[17] Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv:2207.12598, 2022.
[18] Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.
[19] Hou, B., Liu, Y., Qian, K., Andreas, J., Chang, S., and Zhang, Y. Decomposing uncertainty for large language models through input clarification ensembling. In International Conference on Machine Learning, 2024.
[20] Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017.
[21] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv:1705.03551, 2017.
[22] Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv:2207.05221, 2022.
[23] Kamkari, H., Ross, B. L., Hosseinzadeh, R., Cresswell, J. C., and Loaiza-Ganem, G. A geometric view of data complexity: Efficient local intrinsic dimension estimation with diffusion models. In Advances in Neural Information Processing Systems, volume 37, 2024.
[24] Kladny, K.-R., Schölkopf, B., and Muehlebach, M. Conformal generative modeling with improved sample efficiency through sequential greedy filtering. In International Conference on Learning Representations, 2025.
[25] Kuhn, L., Gal, Y., and Farquhar, S. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations, 2023.
[26] Kuwahara, B., Lin, C.-Y., Huang, X. S., Leung, K. K., Yapeter, J. A., Stanevich, I., Perez, F., and Cresswell, J. C. Document summarization with conformal importance guarantees. arXiv:2509.20461, 2025.
[27] Laufer-Goldshtein, B., Fisch, A., Barzilay, R., and Jaakkola, T. S. Efficiently controlling multiple risks with Pareto testing. In International Conference on Learning Representations, 2023.
[28] Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. Distribution-free predictive inference for regression. In Journal of the American Statistical Association, volume 113, pp. 1094–1111, 2018.
[29] Leung, K. K., Hosseinzadeh, R., and Loaiza-Ganem, G. On convolutions, intrinsic dimension, and diffusion models. In Transactions on Machine Learning Research, 2025.
[30] Lin, Z., Trivedi, S., and Sun, J. Generating with confidence: Uncertainty quantification for black-box large language models. In Transactions on Machine Learning Research, 2024.
[31] Madeo, R., Wagner, P., and Peres, S. Gesture Phase Segmentation. UCI Machine Learning Repository, 2013.
[32] Mohri, C. and Hashimoto, T. Language models with conformal factuality guarantees. In International Conference on Machine Learning, 2024.
[33] Nikitin, A., Kossen, J., Gal, Y., and Marttinen, P. Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities. In Advances in Neural Information Processing Systems, 2024.
[34] OpenAI. GPT-4 technical report. arXiv:2303.08774, 2024.
[35] Orrick, W. H. Andersen v. Stability AI Ltd., 2023. URL https://casetext.com/case/andersen-v-stability-ai-ltd
[36] Papadopoulos, H., Proedrou, K., Vovk, V., and Gammerman, A. Inductive confidence machines for regression. In European Conference on Machine Learning, 2002.
[37] Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the annual ACM symposium on user interface software and technology, 2023.
[38] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
[39] Potapenko, A. B., Meyer, C., Kohl, S. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., and Hassabis, D. Highly accurate protein structure prediction with AlphaFold. In Nature, volume 596, pp. 583–589, 2021.
[40] Qiu, X. and Miikkulainen, R. Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space. In Advances in Neural Information Processing Systems, 2024.
[41] Quach, V., Fisch, A., Schuster, T., Yala, A., Sohn, J. H., Jaakkola, T. S., and Barzilay, R. Conformal language modeling. In International Conference on Learning Representations, 2024.
[42] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.
[43] Roe, B. MiniBooNE particle identification. UCI Machine Learning Repository, 2005.
[44] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022.
[45] Ross, B. L., Kamkari, H., Wu, T., Hosseinzadeh, R., Liu, Z., Stein, G., Cresswell, J. C., and Loaiza-Ganem, G. A geometric framework for understanding memorization in generative models. In International Conference on Learning Representations, 2025.
[46] Ross, B. L., Vouitsis, N., Ghomi, A. A., Hosseinzadeh, R., Xin, J., Liu, Z., Sui, Y., Hou, S., Leung, K. K., Loaiza-Ganem, G., and Cresswell, J. C. Textual Bayes: Quantifying prompt uncertainty in LLM-based systems. In International Conference on Learning Representations, 2026.
[47] Saunders, C., Gammerman, A., and Vovk, V. Transduction with confidence and credibility. In International Joint Conference on Artificial Intelligence, 1999.
[48] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree sear... (2016)
[49] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2015.
[50] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.
[51] Teneggi, J., Tivnan, M., Stayman, W., and Sulam, J. How to trust your diffusion model: A convex optimization approach to conformal risk control. In International Conference on Machine Learning, 2023.
[53] U.S. Census Bureau. Census-Income (KDD). UCI Machine Learning Repository, 2000.
[54] Vovk, V., Gammerman, A., and Saunders, C. Machine-learning applications of algorithmic randomness. In International Conference on Machine Learning, 1999.
[55] Vovk, V., Gammerman, A., and Shafer, G. Algorithmic learning in a random world. Springer, 2005.
[56] Wang, H., Fu, T., Du, Y., Gao, W., Huang, K., Liu, Z., Chandak, P., Liu, S., Van Katwyk, P., Deac, A., Anandkumar, A., Bergen, K., Gomes, C., Ho, S., Kohli, P., Lasenby, J., Leskovec, J., Liu, T., Manrai, A., Marks, D., Ramsundar, B., Song, L., Sun, J., Tang, J., Veličković, P., Welling, M., Zhang, L., Coley, C., Bengio, Y., and Zitnik, M. Scientif... (2023)
[57] Wang, X., Aitchison, L., and Rudolph, M. LoRA ensembles for large language model fine-tuning. arXiv:2310.00035, 2023b.
[58] Wang, Z. and Holmes, C. On subjective uncertainty quantification and calibration in natural language generation. In International Conference on Learning Representations, 2025.
[59] Webster, R. A reproducible extraction of training images from diffusion models. arXiv:2305.08694, 2023.
[60] Wen, Y., Liu, Y., Chen, C., and Lyu, L. Detecting, explaining, and mitigating memorization in diffusion models. In The Twelfth International Conference on Learning Representations, 2023.
[61] Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Łukasz Kaiser, Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean,... (2016)
[62] Yang, A. X., Robeyns, M., Wang, X., and Aitchison, L. Bayesian low-rank adaptation for large language models. In International Conference on Learning Representations, 2024a.
[63] Yang, D., Tsai, Y.-H. H., and Yamada, M. On verbalized confidence scores for LLMs. arXiv:2412.14737, 2024b.

[1] [1]

write newline

" write newline "" before.all 'output.state := FUNCTION n.dashify 't := "" t empty not t #1 #1 substring "-" = t #1 #2 substring "--" = not "--" * t #2 global.max substring 't := t #1 #1 substring "-" = "-" * t #2 global.max substring 't := while if t #1 #1 substring * t #2 global.max substring 't := if while FUNCTION format.date year duplicate empty "emp...

[2] Aliannejadi, M., Kiseleva, J., Chuklin, A., Dalton, J., and Burtsev, M. Building and evaluating open-domain dialogue corpora with clarifying questions. In Conference on Empirical Methods in Natural Language Processing, 2021.

[3] Angelopoulos, A. N. and Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv:2107.07511, 2021.

[4] Angelopoulos, A. N., Barber, R. F., and Bates, S. Theoretical foundations of conformal prediction. arXiv:2411.11824, 2024a.

[5] Angelopoulos, A. N., Bates, S., Fisch, A., Lei, L., and Schuster, T. Conformal risk control. In International Conference on Learning Representations, 2024b.

[6] Becker, B. and Kohavi, R. Adult. UCI Machine Learning Repository, 1996.

[7] Begoli, E., Bhattacharya, T., and Kusnezov, D. The need for uncertainty quantification in machine-assisted medical decision making. In Nature Machine Intelligence, volume 1, pp. 20–23, 2019.

[8] Breiman, L. Random forests. In Machine Learning, volume 45, pp. 5–32. Springer, 2001.

[9] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A.,...

[10] Coutinho, F. Click prediction. https://kaggle.com/competitions/click-prediction-cds, 2022. Kaggle.

[11] Feng, N., Sui, Y., Hou, S., Wu, G., and Cresswell, J. C. Conformal agent error attribution. arXiv:2605.06788, 2026.

[12] Gammerman, A. and Vovk, V. Hedging predictions in machine learning. The Computer Journal, 50(2):151–163, 2007.

[13] Gammerman, A., Vovk, V., and Vapnik, V. Learning by transduction. In Conference on Uncertainty in Artificial Intelligence, 1998.

[14] Gao, X., Zhang, J., Mouatadid, L., and Das, K. SPUQ: Perturbation-based uncertainty quantification for large language models. In Conference of the European Chapter of the Association for Computational Linguistics, 2024.

[15] Grewal, Y. S., Bonilla, E. V., and Bui, T. D. Improving uncertainty quantification in large language models via semantic embeddings. arXiv:2410.22685, 2024.

[16] He, H., Yao, W., Ma, K., Yu, W., Dai, Y., Zhang, H., Lan, Z., and Yu, D. WebVoyager: Building an end-to-end web agent with large multimodal models. In Annual Meeting of the Association for Computational Linguistics, pp. 6864–6890, 2024.

[17] Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv:2207.12598, 2022.

[18] Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020.

[19] Hou, B., Liu, Y., Qian, K., Andreas, J., Chang, S., and Zhang, Y. Decomposing uncertainty for large language models through input clarification ensembling. In International Conference on Machine Learning, 2024.

[20] Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017.

[21] Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv:1705.03551, 2017.

[22] Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv:2207.05221, 2022.

[23] Kamkari, H., Ross, B. L., Hosseinzadeh, R., Cresswell, J. C., and Loaiza-Ganem, G. A geometric view of data complexity: Efficient local intrinsic dimension estimation with diffusion models. In Advances in Neural Information Processing Systems, volume 37, 2024.

[24] Kladny, K.-R., Schölkopf, B., and Muehlebach, M. Conformal generative modeling with improved sample efficiency through sequential greedy filtering. In International Conference on Learning Representations, 2025.

[25] Kuhn, L., Gal, Y., and Farquhar, S. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations, 2023.

[26] Kuwahara, B., Lin, C.-Y., Huang, X. S., Leung, K. K., Yapeter, J. A., Stanevich, I., Perez, F., and Cresswell, J. C. Document summarization with conformal importance guarantees. arXiv:2509.20461, 2025.

[27] Laufer-Goldshtein, B., Fisch, A., Barzilay, R., and Jaakkola, T. S. Efficiently controlling multiple risks with Pareto testing. In International Conference on Learning Representations, 2023.

[28] Lei, J., G’Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. Distribution-free predictive inference for regression. In Journal of the American Statistical Association, volume 113, pp. 1094–1111, 2018.

[29] Leung, K. K., Hosseinzadeh, R., and Loaiza-Ganem, G. On convolutions, intrinsic dimension, and diffusion models. In Transactions on Machine Learning Research, 2025.

[30] Lin, Z., Trivedi, S., and Sun, J. Generating with confidence: Uncertainty quantification for black-box large language models. In Transactions on Machine Learning Research, 2024.

[31] Madeo, R., Wagner, P., and Peres, S. Gesture Phase Segmentation. UCI Machine Learning Repository, 2013.

[32] Mohri, C. and Hashimoto, T. Language models with conformal factuality guarantees. In International Conference on Machine Learning, 2024.

[33] Nikitin, A., Kossen, J., Gal, Y., and Marttinen, P. Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities. In Advances in Neural Information Processing Systems, 2024.

[34] OpenAI. GPT-4 technical report. arXiv:2303.08774, 2024.

[35] Orrick, W. H. Andersen v. Stability AI Ltd., 2023. URL https://casetext.com/case/andersen-v-stability-ai-ltd

[36] Papadopoulos, H., Proedrou, K., Vovk, V., and Gammerman, A. Inductive confidence machines for regression. In European Conference on Machine Learning, 2002.

[37] Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the Annual ACM Symposium on User Interface Software and Technology, 2023.

[38] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[39] Potapenko, A. B., Meyer, C., Kohl, S. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., and Hassabis, D. Highly accurate protein structure prediction with AlphaFold. In Nature, volume 596, pp. 583–589, 2021.

[40] Qiu, X. and Miikkulainen, R. Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space. In Advances in Neural Information Processing Systems, 2024.

[41] Quach, V., Fisch, A., Schuster, T., Yala, A., Sohn, J. H., Jaakkola, T. S., and Barzilay, R. Conformal language modeling. In International Conference on Learning Representations, 2024.

[42] Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022.

[43] Roe, B. MiniBooNE particle identification. UCI Machine Learning Repository, 2005.

[44] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022.

[45] Ross, B. L., Kamkari, H., Wu, T., Hosseinzadeh, R., Liu, Z., Stein, G., Cresswell, J. C., and Loaiza-Ganem, G. A geometric framework for understanding memorization in generative models. In International Conference on Learning Representations, 2025.

[46] Ross, B. L., Vouitsis, N., Ghomi, A. A., Hosseinzadeh, R., Xin, J., Liu, Z., Sui, Y., Hou, S., Leung, K. K., Loaiza-Ganem, G., and Cresswell, J. C. Textual Bayes: Quantifying prompt uncertainty in LLM-based systems. In International Conference on Learning Representations, 2026.

[47] Saunders, C., Gammerman, A., and Vovk, V. Transduction with confidence and credibility. In International Joint Conference on Artificial Intelligence, 1999.

[48] Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree sear...

[49] Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2015.

[50] Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021.

[51] Teneggi, J., Tivnan, M., Stayman, W., and Sulam, J. How to trust your diffusion model: A convex optimization approach to conformal risk control. In International Conference on Machine Learning, 2023.

[53] U.S. Census Bureau. Census-Income (KDD). UCI Machine Learning Repository, 2000.

[54] Vovk, V., Gammerman, A., and Saunders, C. Machine-learning applications of algorithmic randomness. In International Conference on Machine Learning, 1999.

[55] Vovk, V., Gammerman, A., and Shafer, G. Algorithmic learning in a random world. Springer, 2005.

[56] Wang, H., Fu, T., Du, Y., Gao, W., Huang, K., Liu, Z., Chandak, P., Liu, S., Van Katwyk, P., Deac, A., Anandkumar, A., Bergen, K., Gomes, C., Ho, S., Kohli, P., Lasenby, J., Leskovec, J., Liu, T., Manrai, A., Marks, D., Ramsundar, B., Song, L., Sun, J., Tang, J., Veličković, P., Welling, M., Zhang, L., Coley, C., Bengio, Y., and Zitnik, M. Scientif...

[57] Wang, X., Aitchison, L., and Rudolph, M. LoRA ensembles for large language model fine-tuning. arXiv:2310.00035, 2023b.
