arxiv: 2605.28920 · v1 · pith:7GBPBW4Pnew · submitted 2026-05-27 · 💻 cs.LG · cs.AI· stat.ML

Conf-Gen: Conformal Uncertainty Quantification for Generative Models

Pith reviewed 2026-06-29 13:52 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · stat.ML
keywords conformal prediction · conformal risk control · generative models · uncertainty quantification · large language models · image generation · AI agents

The pith

Conf-Gen adapts conformal risk control to generative models by relaxing assumptions and supplies formal guarantees for image generators, conversational systems, and AI agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Conf-Gen as a framework that takes the established conformal risk control method and modifies it for use with generative models such as large language models and image generators. It shows that this adaptation unifies earlier applications of conformal prediction to LLMs and creates new use cases with mathematical guarantees. A reader would care because generative models currently lack the uncertainty quantification tools available in supervised settings, and Conf-Gen offers a route to add those tools without starting from scratch.

Core claim

Conf-Gen is a general framework adapting conformal risk control to generative tasks while relaxing its theoretical assumptions. This produces conformal guarantees on image generators producing non-memorized images, conversational AI systems having asked enough clarifying questions, and the output of AI agents being correct. The same framework unifies and generalizes previous attempts to apply conformal prediction to LLMs and extends the methodology to entirely new domains.

What carries the argument

Conformal generation (Conf-Gen), the adaptation of conformal risk control that relaxes theoretical assumptions to produce valid guarantees on generative outputs.
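
For orientation, the standard CRC calibration step that Conf-Gen starts from can be sketched as follows. This is the textbook rule from Angelopoulos et al. (2024b), which picks the smallest λ with n/(n+1) · R̂(λ) + B/(n+1) ≤ α; it is not the paper's relaxed procedure. It assumes exchangeable calibration data, a loss bounded by `B`, and a loss that does not increase as λ grows. The function name, the synthetic admissibility labels, and the "keep the first λ generations" loss are illustrative assumptions.

```python
import numpy as np

def calibrate_lambda(losses, lambdas, alpha, B=1.0):
    """Standard CRC: smallest lambda whose adjusted empirical risk is <= alpha.

    losses:  array of shape (n, m); losses[i, j] is the loss of calibration
             example i at lambdas[j], assumed non-increasing along axis 1.
    lambdas: increasing grid of m candidate values.
    Returns np.inf when no grid value qualifies (the most conservative choice).
    """
    losses = np.asarray(losses, dtype=float)
    n = losses.shape[0]
    risk = losses.mean(axis=0)                      # empirical risk per lambda
    adjusted = (n / (n + 1)) * risk + B / (n + 1)   # finite-sample correction
    ok = np.nonzero(adjusted <= alpha)[0]
    return lambdas[ok[0]] if ok.size else np.inf

# Hypothetical example: lambda = number of generations kept, and the loss is 1
# when none of the first lambda generations is admissible.
rng = np.random.default_rng(0)
admissible = (rng.random((500, 10)) < 0.3).astype(int)    # synthetic labels
losses = 1.0 - np.maximum.accumulate(admissible, axis=1)  # non-increasing in lambda
lambdas = np.arange(1, 11)
lam_hat = calibrate_lambda(losses, lambdas, alpha=0.1)
print(lam_hat)
```

Returning `np.inf` when no grid value passes mirrors the degenerate case mentioned in the Figure 7 caption, where the selected sequence becomes as large as possible.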

If this is right

  • Image generators can be equipped with guarantees that the produced images are non-memorized.
  • Conversational AI systems can receive guarantees that they have asked enough clarifying questions.
  • AI agent outputs can receive guarantees of correctness.
  • Previous applications of conformal prediction to LLMs become special cases of a single framework.
  • Conformal methodology extends to domains previously outside its scope.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework may encourage calibration data collection practices tailored to each new generative application.
  • Similar relaxations could be tested on other unsupervised settings such as reinforcement learning policies.
  • If the relaxed assumptions hold across modalities, hybrid systems combining language and image generation could inherit joint guarantees.

Load-bearing premise

Conformal risk control still delivers valid guarantees once its theoretical assumptions are relaxed to fit generative tasks.

What would settle it

A concrete counter-example in which Conf-Gen is applied to an image generator, yet the probability of producing a memorized image on held-out test data, averaged over calibration draws, exceeds the target risk level.
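
One way to look for such a counter-example empirically is sketched below. Because a conformal guarantee of this kind holds in expectation over both calibration and test draws (the Figure 2 caption makes the same point), the sketch averages held-out risk over many random calibration/test splits rather than judging a single split. It uses the standard CRC rule as a stand-in for Conf-Gen's calibration; the loss matrix, the grid, and the function names are hypothetical.

```python
import numpy as np

def crc_index(cal_losses, alpha, B=1.0):
    """Index of the smallest grid value passing the standard CRC test, or None."""
    n = cal_losses.shape[0]
    adjusted = (n / (n + 1)) * cal_losses.mean(axis=0) + B / (n + 1)
    ok = np.nonzero(adjusted <= alpha)[0]
    return int(ok[0]) if ok.size else None

def mean_test_risk(losses, alpha, n_cal, n_splits=200, seed=0):
    """Average held-out risk over random calibration/test splits.

    losses[i, j] is the loss of example i at the j-th grid value, assumed
    non-increasing in j and bounded by 1. Assumption: when no grid value
    qualifies, the most conservative choice incurs zero loss.
    """
    rng = np.random.default_rng(seed)
    risks = []
    for _ in range(n_splits):
        perm = rng.permutation(losses.shape[0])
        cal, test = losses[perm[:n_cal]], losses[perm[n_cal:]]
        j = crc_index(cal, alpha)
        risks.append(0.0 if j is None else test[:, j].mean())
    return float(np.mean(risks))

# Synthetic stand-in data: loss is 1 until the first admissible generation.
rng = np.random.default_rng(1)
admissible = (rng.random((1000, 10)) < 0.3).astype(int)
losses = 1.0 - np.maximum.accumulate(admissible, axis=1)
print(mean_test_risk(losses, alpha=0.1, n_cal=200))
```

An averaged risk that sits above the target by clearly more than the Monte Carlo error of the resampling would be the kind of evidence the paragraph above describes; a single split dipping past the target would not.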

Figures

Figures reproduced from arXiv: 2605.28920 by Gabriel Loaiza-Ganem, Kevin Zhang, Kin Kwan Leung, Marc T. Law, Wei Cui.

Figure 1
Simplified illustration of our method. The selection function C_λ(x, y) processes an input x and a sequence of generations y to yield an output that becomes increasingly conservative as λ increases. Conf-Gen identifies a value λ̂ that formally ensures the output of C_λ̂ satisfies a target admissibility requirement in expectation. Top: when y consists of i.i.d. responses to a query x, admissibility ensures th…
Figure 2
Results on test data for the question answering task. … γ, alongside the corresponding lower bound as a diagonal line. These plots aim to empirically verify the conformal guarantee; we highlight that the few small dips below the diagonal lines do not contradict Theorem 3 as we can only plot empirical averages over a finite test set for a single calibration dataset, not true expectations. Since the conformal…
Figure 4
Results on test data for the conversational task. …tion of human evaluators who assessed the selected image as too different from the ground truth (i.e., “medium”). The gap between the left and right plots corresponds to the percentage of “good” images. This gap being large highlights the usefulness of Conf-Gen: it avoids trivially achieving large admissibility by always generating “medium” images. More de…
Figure 6
Results on test data for the random forest. … k trees which make the correct prediction is at least γ. This guarantee is particularly meaningful when the selected set contains at most 2k − 1 trees, as it then ensures the correctness of the majority-vote prediction. While defining admissibility directly as the correctness of the final prediction might seem more intuitive, as the resulting conformal guarantee…
Figure 7
Average sequence length as a function of γ with different sampling strategies. Our proposed strategy results in shorter average sequence lengths than the baseline, resulting in a smaller number of LLM calls. … fewer LLM calls for a given γ ≤ 0.84. When γ ≥ 0.85, the value of our λ̂ is +∞, and the size of our sequence becomes as large as possible. Quach et al. (2024) avoid this degenerate case by using differ…
Figure 8
Example of a screen displayed to human evaluators, corresponding to a “bad” example.
Figure 9
Example of a screen displayed to human evaluators, corresponding to a “medium” example.
Figure 10
Example of a screen displayed to human evaluators, corresponding to a “good” example.
Figure 11
Prompt used for the LLM to consolidate questions in ClariQ.
Figure 12
Prompt used for the LLM to generate responses to consolidated questions from ClariQ.
Figure 13
Prompt used for the LLM to generate responses to consolidated questions from ClariQ.
Figure 14
Prompt used for the LLM to score questions from ClariQ.
Figure 15
Prompt used for the LLM to score questions from ClariQ (continued).
Figure 16
Conformal generation admissibility and average sequence length as a function of γ of our conformal random forest on GesturePhaseProcessed.
Figure 17
Conformal generation admissibility and average sequence length as a function of γ of our conformal random forest on Adult.
Figure 18
Conformal generation admissibility and average sequence length as a function of γ of our conformal random forest on Census-Income.
Figure 19
Conformal generation admissibility and average sequence length as a function of γ of our conformal random forest on MiniBooNE.
Figure 20
Change in performance (AUC) of our conformal random forests for various datasets and k values.
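
The counting argument in the Figure 6 caption excerpt (at least k correct trees among at most 2k − 1 selected trees implies a correct majority vote) can be checked with a short sketch. The function names and the toy votes below are illustrative assumptions, not the paper's code.

```python
from collections import Counter

def majority_vote(predictions):
    """Return the most common label among the selected trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

def guarantee_implies_correct_vote(predictions, true_label, k):
    """True when >= k of at most 2k - 1 predictions are correct.

    Under that condition the correct label holds a strict majority, so the
    majority vote must equal true_label.
    """
    n_correct = sum(p == true_label for p in predictions)
    return len(predictions) <= 2 * k - 1 and n_correct >= k

votes = ["a", "b", "a", "c", "a"]   # 5 = 2k - 1 trees with k = 3
if guarantee_implies_correct_vote(votes, true_label="a", k=3):
    assert majority_vote(votes) == "a"
```

With more than 2k − 1 trees the implication can fail, since k correct votes no longer form a strict majority; this is why the caption singles out that set size.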
Original abstract

Conformal prediction (CP) and its extension, conformal risk control (CRC), are established frameworks for quantifying uncertainty in supervised machine learning through formal guarantees. However, recent breakthroughs in artificial intelligence (AI) have been driven by unsupervised generative models, such as large language models (LLMs) and image generators, which are not directly compatible with CP or CRC. In this work we introduce conformal generation (Conf-Gen), a general framework adapting CRC to generative tasks while relaxing its theoretical assumptions. Conf-Gen unifies and generalizes previous attempts to apply CP to LLMs, and extends conformal methodology to entirely new domains. We demonstrate the flexibility of Conf-Gen through some novel applications, including obtaining conformal guarantees on: image generators producing non-memorized images, conversational AI systems having asked enough clarifying questions, and the output of AI agents being correct.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces Conf-Gen, a framework adapting conformal risk control (CRC) to generative models (LLMs, image generators, conversational agents) by relaxing CRC's theoretical assumptions. It claims to unify prior CP applications to LLMs and extend conformal guarantees to new tasks including non-memorized image generation, sufficient clarifying questions in conversational AI, and correct outputs from AI agents.

Significance. If the relaxed guarantees are valid, the work would extend conformal methods beyond supervised settings to unsupervised generative AI, enabling formal risk control in domains where exchangeability or standard risk functions do not hold.

major comments (1)
  1. [Abstract] The central claim that CRC can be adapted to generative tasks 'by relaxing its theoretical assumptions' while still delivering valid guarantees is stated without specifying the relaxations (e.g., to exchangeability or the risk function) or the modified proof structure. This is load-bearing because the quantile or martingale arguments underlying CRC may fail under the unspecified changes, and no derivation or theorem is supplied to confirm that validity is retained.
minor comments (1)
  1. [Abstract] The abstract refers to 'some novel applications' and 'demonstrate the flexibility' but supplies no experimental details, datasets, or quantitative results to support the claimed guarantees.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting the need for greater precision in the abstract regarding the theoretical relaxations. We address this point below and will revise accordingly.

Point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that CRC can be adapted to generative tasks 'by relaxing its theoretical assumptions' while still delivering valid guarantees is stated without specifying the relaxations (e.g., to exchangeability or the risk function) or the modified proof structure. This is load-bearing because the quantile or martingale arguments underlying CRC may fail under the unspecified changes, and no derivation or theorem is supplied to confirm validity is retained.

    Authors: We agree the abstract is too terse on this load-bearing claim. The manuscript body (Section 3 and Theorem 1) specifies the relaxations: exchangeability is weakened to a conditional form compatible with generative sampling, and the risk function is extended to non-deterministic outputs via an expectation over the generative distribution. Validity is retained by adapting the quantile argument to a supermartingale under these conditions, with the full derivation supplied in the proof of Theorem 1. To resolve the referee's concern, we will revise the abstract to name the two relaxations and cite Theorem 1 explicitly. revision: yes

Circularity Check

0 steps flagged

No circularity in Conf-Gen adaptation of CRC

full rationale

The provided abstract and description present Conf-Gen as an adaptation of established CRC to generative tasks via relaxation of assumptions, unifying prior CP applications to LLMs and extending to new domains like image generators and agents. No equations, self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations appear in the text. The derivation chain does not reduce to inputs by construction under any of the enumerated patterns; the framework is described as building on external CRC without the result being equivalent to its inputs. This matches the expectation of a self-contained adaptation with no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no concrete information on free parameters, axioms, or invented entities; the framework is described at a high level without equations or implementation details.

pith-pipeline@v0.9.1-grok · 5688 in / 963 out tokens · 29989 ms · 2026-06-29T13:52:39.400724+00:00 · methodology

Reference graph

Works this paper leans on

62 extracted references · 14 canonical work pages · 10 internal anchors

  2. [2]

    Building and evaluating open-domain dialogue corpora with clarifying questions

    Aliannejadi, M., Kiseleva, J., Chuklin, A., Dalton, J., and Burtsev, M. Building and evaluating open-domain dialogue corpora with clarifying questions. In Conference on Empirical Methods in Natural Language Processing, 2021

  3. [3]

    Angelopoulos, A. N. and Bates, S. A gentle introduction to conformal prediction and distribution-free uncertainty quantification. arXiv:2107.07511, 2021

  4. [4]

    Theoretical Foundations of Conformal Prediction

    Angelopoulos, A. N., Barber, R. F., and Bates, S. Theoretical foundations of conformal prediction. arXiv:2411.11824, 2024a

  5. [5]

    Conformal risk control

    Angelopoulos, A. N., Bates, S., Fisch, A., Lei, L., and Schuster, T. Conformal risk control. In International Conference on Learning Representations, 2024b

  6. [6]

    Adult

    Becker, B. and Kohavi, R. Adult. UCI Machine Learning Repository, 1996

  7. [7]

    The need for uncertainty quantification in machine-assisted medical decision making

    Begoli, E., Bhattacharya, T., and Kusnezov, D. The need for uncertainty quantification in machine-assisted medical decision making. In Nature Machine Intelligence, volume 1, pp. 20–23, 2019

  8. [8]

    Random forests

    Breiman, L. Random forests. In Machine learning, volume 45, pp. 5–32. Springer, 2001

  9. [9]

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A.,...

  10. [10]

    Click prediction

    Coutinho, F. Click prediction. https://kaggle.com/competitions/click-prediction-cds, 2022. Kaggle

  11. [11]

    Feng, N., Sui, Y., Hou, S., Wu, G., and Cresswell, J. C. Conformal agent error attribution. arXiv:2605.06788, 2026

  12. [12]

    Hedging predictions in machine learning

    Gammerman, A. and Vovk, V. Hedging predictions in machine learning. The Computer Journal, 50(2):151–163, 2007

  13. [13]

    Learning by transduction

    Gammerman, A., Vovk, V., and Vapnik, V. Learning by transduction. In Conference on Uncertainty in Artificial Intelligence, 1998

  14. [14]

    SPUQ: Perturbation-based uncertainty quantification for large language models

    Gao, X., Zhang, J., Mouatadid, L., and Das, K. SPUQ: Perturbation-based uncertainty quantification for large language models. In Conference of the European Chapter of the Association for Computational Linguistics, 2024

  15. [15]

    Improving uncertainty quantification in large language models via semantic embeddings

    Grewal, Y. S., Bonilla, E. V., and Bui, T. D. Improving uncertainty quantification in large language models via semantic embeddings. arXiv:2410.22685, 2024

  16. [16]

    WebVoyager: Building an end-to-end web agent with large multimodal models

    He, H., Yao, W., Ma, K., Yu, W., Dai, Y., Zhang, H., Lan, Z., and Yu, D. WebVoyager: Building an end-to-end web agent with large multimodal models. In Annual Meeting of the Association for Computational Linguistics, pp. 6864–6890, 2024

  17. [17]

    Classifier-Free Diffusion Guidance

    Ho, J. and Salimans, T. Classifier-free diffusion guidance. arXiv:2207.12598, 2022

  18. [18]

    Denoising diffusion probabilistic models

    Ho, J., Jain, A., and Abbeel, P. Denoising diffusion probabilistic models. In Advances in Neural Information Processing Systems, 2020

  19. [19]

    Decomposing uncertainty for large language models through input clarification ensembling

    Hou, B., Liu, Y., Qian, K., Andreas, J., Chang, S., and Zhang, Y. Decomposing uncertainty for large language models through input clarification ensembling. In International Conference on Machine Learning, 2024

  20. [20]

    Google's multilingual neural machine translation system: Enabling zero-shot translation

    Johnson, M., Schuster, M., Le, Q. V., Krikun, M., Wu, Y., Chen, Z., Thorat, N., Viégas, F., Wattenberg, M., Corrado, G., Hughes, M., and Dean, J. Google's multilingual neural machine translation system: Enabling zero-shot translation. Transactions of the Association for Computational Linguistics, 5:339–351, 2017

  21. [21]

    TriviaQA: A Large Scale Distantly Supervised Challenge Dataset for Reading Comprehension

    Joshi, M., Choi, E., Weld, D. S., and Zettlemoyer, L. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. arXiv:1705.03551, 2017

  22. [22]

    Language Models (Mostly) Know What They Know

    Kadavath, S., Conerly, T., Askell, A., Henighan, T., Drain, D., Perez, E., Schiefer, N., Hatfield-Dodds, Z., DasSarma, N., Tran-Johnson, E., et al. Language models (mostly) know what they know. arXiv:2207.05221, 2022

  23. [23]

    A geometric view of data complexity: Efficient local intrinsic dimension estimation with diffusion models

    Kamkari, H., Ross, B. L., Hosseinzadeh, R., Cresswell, J. C., and Loaiza-Ganem, G. A geometric view of data complexity: Efficient local intrinsic dimension estimation with diffusion models. In Advances in Neural Information Processing Systems, volume 37, 2024

  24. [24]

    Conformal generative modeling with improved sample efficiency through sequential greedy filtering

    Kladny, K.-R., Schölkopf, B., and Muehlebach, M. Conformal generative modeling with improved sample efficiency through sequential greedy filtering. In International Conference on Learning Representations, 2025

  25. [25]

    Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation

    Kuhn, L., Gal, Y., and Farquhar, S. Semantic uncertainty: Linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations, 2023

  26. [26]

    Document summarization with conformal importance guarantees

    Kuwahara, B., Lin, C.-Y., Huang, X. S., Leung, K. K., Yapeter, J. A., Stanevich, I., Perez, F., and Cresswell, J. C. Document summarization with conformal importance guarantees. arXiv:2509.20461, 2025

  27. [27]

    Laufer-Goldshtein, B., Fisch, A., Barzilay, R., and Jaakkola, T. S. Efficiently controlling multiple risks with pareto testing. In International Conference on Learning Representations, 2023

  28. [28]

    Distribution-free predictive inference for regression

    Lei, J., G'Sell, M., Rinaldo, A., Tibshirani, R. J., and Wasserman, L. Distribution-free predictive inference for regression. In Journal of the American Statistical Association, volume 113, pp. 1094–1111, 2018

  29. [29]

    On convolutions, intrinsic dimension, and diffusion models

    Leung, K. K., Hosseinzadeh, R., and Loaiza-Ganem, G. On convolutions, intrinsic dimension, and diffusion models. In Transactions on Machine Learning Research, 2025

  30. [30]

    Generating with confidence: Uncertainty quantification for black-box large language models

    Lin, Z., Trivedi, S., and Sun, J. Generating with confidence: Uncertainty quantification for black-box large language models. In Transactions on Machine Learning Research, 2024

  31. [31]

    Gesture Phase Segmentation

    Madeo, R., Wagner, P., and Peres, S. Gesture Phase Segmentation. UCI Machine Learning Repository, 2013

  32. [32]

    Language models with conformal factuality guarantees

    Mohri, C. and Hashimoto, T. Language models with conformal factuality guarantees. In International Conference on Machine Learning, 2024

  33. [33]

    Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities

    Nikitin, A., Kossen, J., Gal, Y., and Marttinen, P. Kernel language entropy: Fine-grained uncertainty quantification for LLMs from semantic similarities. In Advances in Neural Information Processing Systems, 2024

  34. [34]

    GPT-4 Technical Report

    OpenAI. GPT-4 technical report. arXiv:2303.08774, 2024

  35. [35]

    Orrick, W. H. Andersen v. Stability AI Ltd., 2023. URL https://casetext.com/case/andersen-v-stability-ai-ltd

  36. [36]

    Inductive confidence machines for regression

    Papadopoulos, H., Proedrou, K., Vovk, V., and Gammerman, A. Inductive confidence machines for regression. In European Conference on Machine Learning, 2002

  37. [37]

    Generative agents: Interactive simulacra of human behavior

    Park, J. S., O'Brien, J., Cai, C. J., Morris, M. R., Liang, P., and Bernstein, M. S. Generative agents: Interactive simulacra of human behavior. In Proceedings of the annual ACM symposium on user interface software and technology, 2023

  38. [38]

    Scikit-learn: Machine learning in Python

    Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011

  39. [39]

    Highly accurate protein structure prediction with AlphaFold

    Potapenko, A. B., Meyer, C., Kohl, S. A., Ballard, A. J., Cowie, A., Romera-Paredes, B., Nikolov, S., Jain, R., and Hassabis, D. Highly accurate protein structure prediction with AlphaFold. In Nature, volume 596, pp. 583–589, 2021

  40. [40]

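    Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space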

    Qiu, X. and Miikkulainen, R. Semantic density: Uncertainty quantification for large language models through confidence measurement in semantic space. In Advances in Neural Information Processing Systems, 2024

  41. [41]

    Conformal language modeling

    Quach, V., Fisch, A., Schuster, T., Yala, A., Sohn, J. H., Jaakkola, T. S., and Barzilay, R. Conformal language modeling. In International Conference on Learning Representations, 2024

  42. [42]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Ramesh, A., Dhariwal, P., Nichol, A., Chu, C., and Chen, M. Hierarchical text-conditional image generation with CLIP latents. arXiv:2204.06125, 2022

  43. [43]

    MiniBooNE particle identification

    Roe, B. MiniBooNE particle identification. UCI Machine Learning Repository, 2005

  44. [44]

    High-resolution image synthesis with latent diffusion models

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., and Ommer, B. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022

  45. [45]

    A geometric framework for understanding memorization in generative models

    Ross, B. L., Kamkari, H., Wu, T., Hosseinzadeh, R., Liu, Z., Stein, G., Cresswell, J. C., and Loaiza-Ganem, G. A geometric framework for understanding memorization in generative models. In International Conference on Learning Representations, 2025

  46. [46]

    Textual Bayes: Quantifying prompt uncertainty in LLM-based systems

    Ross, B. L., Vouitsis, N., Ghomi, A. A., Hosseinzadeh, R., Xin, J., Liu, Z., Sui, Y., Hou, S., Leung, K. K., Loaiza-Ganem, G., and Cresswell, J. C. Textual Bayes: Quantifying prompt uncertainty in LLM-based systems. In International Conference on Learning Representations, 2026

  47. [47]

    Transduction with confidence and credibility

    Saunders, C., Gammerman, A., and Vovk, V. Transduction with confidence and credibility. In International Joint Conference on Artificial Intelligence, 1999

  48. [48]

    Silver, D., Huang, A., Maddison, C. J., Guez, A., Sifre, L., van den Driessche, G., Schrittwieser, J., Antonoglou, I., Panneershelvam, V., Lanctot, M., Dieleman, S., Grewe, D., Nham, J., Kalchbrenner, N., Sutskever, I., Lillicrap, T., Leach, M., Kavukcuoglu, K., Graepel, T., and Hassabis, D. Mastering the game of Go with deep neural networks and tree sear...

  49. [49]

    Deep unsupervised learning using nonequilibrium thermodynamics

    Sohl-Dickstein, J., Weiss, E., Maheswaranathan, N., and Ganguli, S. Deep unsupervised learning using nonequilibrium thermodynamics. In International Conference on Machine Learning, 2015

  50. [50]

    Score-based generative modeling through stochastic differential equations

    Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., and Poole, B. Score-based generative modeling through stochastic differential equations. In International Conference on Learning Representations, 2021

  51. [51]

    How to trust your diffusion model: A convex optimization approach to conformal risk control

    Teneggi, J., Tivnan, M., Stayman, W., and Sulam, J. How to trust your diffusion model: A convex optimization approach to conformal risk control. In International Conference on Machine Learning, 2023

  52. [53]

    Census-Income (KDD)

    U.S. Census Bureau. Census-Income (KDD). UCI Machine Learning Repository, 2000

  53. [54]

    Machine-learning applications of algorithmic randomness

    Vovk, V., Gammerman, A., and Saunders, C. Machine-learning applications of algorithmic randomness. In International Conference on Machine Learning, 1999

  54. [55]

    Algorithmic learning in a random world

    Vovk, V., Gammerman, A., and Shafer, G. Algorithmic learning in a random world. Springer, 2005

  55. [56]

    Scientific discovery in the age of artificial intelligence

    Wang, H., Fu, T., Du, Y., Gao, W., Huang, K., Liu, Z., Chandak, P., Liu, S., Van Katwyk, P., Deac, A., Anandkumar, A., Bergen, K., Gomes, C., Ho, S., Kohli, P., Lasenby, J., Leskovec, J., Liu, T., Manrai, A., Marks, D., Ramsundar, B., Song, L., Sun, J., Tang, J., Veličković, P., Welling, M., Zhang, L., Coley, C., Bengio, Y., and Zitnik, M. Scientif...

  56. [57]

    LoRA ensembles for large language model fine-tuning

    Wang, X., Aitchison, L., and Rudolph, M. LoRA ensembles for large language model fine-tuning. arXiv:2310.00035, 2023b

  57. [58]

    On subjective uncertainty quantification and calibration in natural language generation

    Wang, Z. and Holmes, C. On subjective uncertainty quantification and calibration in natural language generation. In International Conference on Learning Representations, 2025

  58. [59]

    A reproducible extraction of training images from diffusion models

    Webster, R. A reproducible extraction of training images from diffusion models. arXiv:2305.08694, 2023

  59. [60]

    Detecting, explaining, and mitigating memorization in diffusion models

    Wen, Y., Liu, Y., Chen, C., and Lyu, L. Detecting, explaining, and mitigating memorization in diffusion models. In The Twelfth International Conference on Learning Representations, 2023

  60. [61]

    Wu, Y., Schuster, M., Chen, Z., Le, Q. V., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey, K., Klingner, J., Shah, A., Johnson, M., Liu, X., Łukasz Kaiser, Gouws, S., Kato, Y., Kudo, T., Kazawa, H., Stevens, K., Kurian, G., Patil, N., Wang, W., Young, C., Smith, J., Riesa, J., Rudnick, A., Vinyals, O., Corrado, G., Hughes, M., and Dean,...

  61. [62]

    Bayesian low-rank adaptation for large language models

    Yang, A. X., Robeyns, M., Wang, X., and Aitchison, L. Bayesian low-rank adaptation for large language models. In International Conference on Learning Representations, 2024a

  62. [63]

    On Verbalized Confidence Scores for LLMs

    Yang, D., Tsai, Y.-H. H., and Yamada, M. On verbalized confidence scores for LLMs. arXiv:2412.14737, 2024b