Conformal Language Modeling via Posterior Sampling

Armando Solar-Lezama; Chara Podimata; Nicolas Emmenegger; Theo X. Olausson

arxiv: 2606.03731 · v1 · pith:GWW3E37Bnew · submitted 2026-06-02 · 💻 cs.LG · stat.ML

Pith reviewed 2026-06-28 11:20 UTC · model grok-4.3

keywords: conformal prediction, posterior sampling, hallucination mitigation, language models, risk control, sequential generation, calibration

The pith

Sampling from calibrated LLM posteriors controls hallucination risk with higher utility than post-hoc methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models often hallucinate, and earlier conformal methods reduce this by post-hoc alteration of samples, which can produce incoherent or unlikely outputs under the model. This paper proposes instead to sample from approximations to the LLM posterior conditioned on a calibrated high-scoring region. A calibration procedure is developed specifically for conditional sequential generation to identify that region and meet a target risk level. On biography generation and mathematical problem solving, the approach delivers equivalent statistical guarantees while improving downstream utility.

Core claim

The paper samples from approximations to an LLM posterior whose conditioning event is a calibrated high-scoring region. A calibration procedure tailored to conditional sequential generation identifies this region and achieves target risk control. Because filtering is no longer disconnected from generation, the approach avoids the incoherence that disconnect produces and shifts probability mass toward more useful responses.

What carries the argument

Calibration procedure for conditional sequential generation that identifies the high-scoring region used to condition posterior sampling.
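
As a rough illustration of what conditioning on a calibrated high-scoring region can look like, the Python sketch below uses fixed-budget rejection sampling with an abstention fallback, the variant named in the Figure 8 and Figure 9 captions. It is a sketch under assumptions, not the paper's implementation: `generate`, `score`, `tau`, `budget`, and the toy stand-ins are hypothetical names, and the paper's posterior approximation may differ.

```python
import itertools
from typing import Callable, Optional


def sample_from_region(
    generate: Callable[[str], str],
    score: Callable[[str, str], float],
    prompt: str,
    tau: float,
    budget: int = 16,
) -> Optional[str]:
    """Rejection-sample a response whose score is at least tau.

    Accepted draws follow the base model restricted to {score >= tau}.
    Returns None (an abstention) if the budget is exhausted.
    """
    for _ in range(budget):
        candidate = generate(prompt)
        if score(prompt, candidate) >= tau:
            return candidate
    return None


# Hypothetical stand-ins so the sketch runs without an LLM or a real scorer.
_draws = itertools.cycle(["bad answer", "bad answer", "good answer"])


def toy_generate(prompt: str) -> str:
    return next(_draws)


def toy_score(prompt: str, response: str) -> float:
    return 0.9 if response.startswith("good") else 0.1


accepted = sample_from_region(toy_generate, toy_score, "Who was Ada Lovelace?", tau=0.5)
abstained = sample_from_region(toy_generate, toy_score, "Who was Ada Lovelace?", tau=1.5, budget=5)
```

In this sketch an accepted draw is returned unedited, so it is always a complete sample from the generator rather than a surgically altered one.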

If this is right

Achieves target risk control in conditional sequential generation tasks.
Maintains statistical guarantees matching prior conformal prediction methods.
Delivers higher downstream utility on open-ended biography generation and math problem solving.
Avoids incoherence and inconsistency from post-hoc sample surgery.
Shifts probability mass toward more useful and helpful responses under the model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could be tested on additional sequential tasks such as code generation or multi-turn dialogue.
If posterior approximations improve, the utility gains over post-hoc methods may increase further.
Similar conditioning on calibrated regions might be examined in non-text generation domains that suffer from output errors.

Load-bearing premise

An effective calibration procedure exists for conditional sequential generation that can identify the high-scoring region without introducing incoherence or needing post-hoc adjustments.

What would settle it

The claim would fail if, on a held-out set of conditional generations, the posterior samples missed the stated target risk level or scored lower on utility than post-hoc conformal baselines on the same tasks.
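
The refutation criterion above amounts to two comparisons on held-out data. A minimal Python sketch follows, assuming per-example binary losses (1 for a response containing a hallucination) and per-example utility scores. The function name, argument names, and numbers are hypothetical, and a real check would also account for sampling noise.

```python
from statistics import mean
from typing import Sequence


def claim_is_contradicted(
    ours_losses: Sequence[float],
    ours_utility: Sequence[float],
    baseline_utility: Sequence[float],
    alpha: float,
) -> bool:
    """True if held-out risk exceeds alpha or utility trails the baseline."""
    risk_exceeded = mean(ours_losses) > alpha
    utility_lower = mean(ours_utility) < mean(baseline_utility)
    return risk_exceeded or utility_lower


# Made-up numbers for illustration only.
ok = claim_is_contradicted([0, 0, 0, 1], [4, 5, 4, 5], [3, 4, 3, 4], alpha=0.3)
bad = claim_is_contradicted([1, 1, 0, 1], [4, 5, 4, 5], [3, 4, 3, 4], alpha=0.3)
```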

Figures

Figures reproduced from arXiv: 2606.03731 by Armando Solar-Lezama, Chara Podimata, Nicolas Emmenegger, Theo X. Olausson.

**Figure 2.** LLM-judge utility for ours vs. MH as target factuality varies. Unconditional panels (a, c) include abstentions (judged on FActScore, marked incomplete for MATH); conditional panels (b, d) show results for both methods conditioned on the output not being an abstention.

**Figure 3.** Claim-level proxy metrics for ours vs. MH on FActScore (top) and MATH (bottom).

**Figure 4.** Results of the empirical validity experiment.

**Figure 5.** Breakdown of the FActScore Likert results.

**Figure 6.** Scorer-elicitation ablation on FActScore: confidence-rating elicitation (main text) vs. raw token log-probabilities. Averaged over 10 splits; shaded area shows 95% CI.

**Figure 7.** Model capacity ablation on MATH: GPT-4o vs. GPT-4o-mini scoring, all on Llama-3.1-8B-Instruct generations (in contrast to the larger Llama-3.3-70B-Instruct used elsewhere). Averaged over 10 splits; shaded area shows 95% CI.

**Figure 8.** Fixed-budget rejection-sampling β-ablation on FActScore. (a, b) Target vs. empirical factuality under the worst-case (fully-false) and abstention fallbacks; (c) calibrated $\hat{\tau}$ vs. target (identical for both fallbacks, since calibration is unaffected). Averaged over 10 splits; shaded area shows 95% CI.

**Figure 9.** Proxy metrics under the fixed-budget rejection sampling procedure on FActScore, comparing the worst-case (fully-false) and abstention budget-exhausted fallbacks.

Original abstract

Large Language Models remain plagued by hallucinations. Recent work has sought to tame their prevalence using statistical techniques based on conformal prediction, with both theoretical and empirical success. However, these methods operate in a post-hoc fashion, treating the sampling procedure itself as atomic and then surgically altering samples to remove hallucinated claims. This disconnect between filtering and generation can result in samples that are incoherent, inconsistent, or simply unlikely under the model itself. Moreover, post-hoc surgery is unable to shift probability mass towards more useful and helpful responses. To address these issues, we propose to instead sample from approximations to an LLM posterior, where the conditioning event corresponds to a calibrated, high-scoring region. We develop a calibration procedure tailored to the setting of conditional sequential generation that effectively identifies this region and achieves target risk control. Empirically, we apply our method to case studies focused on open-ended biography generation and mathematical problem solving; compared to prior work, we obtain the same statistical guarantees, with higher downstream utility.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper reframes conformal calibration as a conditioning event for posterior sampling in sequential LLM generation to fix post-hoc incoherence, but the abstract leaves the key theoretical adaptation underspecified.

The paper's main contribution is reframing conformal calibration as a conditioning event for sampling from an approximated LLM posterior in sequential generation. This avoids the incoherence that comes from post-hoc filtering of samples.

It does a good job highlighting why post-hoc methods fall short: they can't shift probability toward useful outputs and can produce inconsistent text. The empirical results on biography generation and math solving claim to match the statistical guarantees of prior work while improving utility, which is a reasonable target.

The weak point is the calibration procedure for conditional sequential generation. The abstract states that it identifies the region and achieves target risk control, but autoregressive models have strong dependencies between tokens. Standard conformal prediction relies on exchangeability, and the description does not make clear how the adaptation preserves marginal coverage over full sequences. Because the method samples from an approximation to the posterior, any bias in that approximation could undermine both the guarantees and the utility gains. The stress test note raises exactly this issue, and without the full derivation it is hard to dismiss.

This paper is for people working on statistical guarantees for language models. A reader focused on conformal prediction applied to generation would find the framing useful to think about. It shows clear engagement with the literature on the limitations of existing approaches.

I think it deserves a serious referee to check whether the theoretical justification for the sequential case holds and whether the experiments properly validate the claims.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes sampling from approximations to an LLM posterior, where the conditioning event is a calibrated high-scoring region identified via a new calibration procedure tailored to conditional sequential generation. This is claimed to achieve target risk control for hallucination mitigation while shifting probability mass toward useful responses, yielding the same statistical guarantees as prior post-hoc conformal methods but with higher downstream utility; empirical results are reported on open-ended biography generation and mathematical problem solving.

Significance. If the calibration procedure preserves marginal coverage guarantees under autoregressive dependencies and the posterior approximation is sufficiently accurate, the approach could improve coherence and utility over post-hoc filtering by integrating control directly into the sampling process rather than altering outputs after generation.

major comments (2)

[Section 3 (Calibration Procedure)] The central claim that the calibration procedure achieves target risk control for conditional sequential generation rests on an adaptation of conformal prediction; however, the manuscript must explicitly derive why marginal coverage over full sequences continues to hold given the strong token-level dependencies in autoregressive models (standard exchangeability assumptions do not apply directly).
[Section 5 (Experiments)] Empirical claims of 'the same statistical guarantees' with higher utility require reporting of the exact coverage rates, risk metrics, and utility measures (e.g., factuality scores or solution correctness) on the biography and math tasks, including comparison to baselines under identical posterior approximation quality; without these, the utility gain cannot be isolated from approximation artifacts.

minor comments (2)

[Section 2] Notation for the high-scoring region and the posterior approximation (e.g., how the conditioning event is formalized) should be introduced earlier and used consistently to improve readability.
The abstract would benefit from naming the specific target risk level (e.g., α, for a target factuality of 1 − α) and the utility metrics used in the case studies.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the two major comments point by point below. Both points identify areas where the manuscript can be clarified and strengthened, and we will revise accordingly.

Referee: [Section 3 (Calibration Procedure)] The central claim that the calibration procedure achieves target risk control for conditional sequential generation rests on an adaptation of conformal prediction; however, the manuscript must explicitly derive why marginal coverage over full sequences continues to hold given the strong token-level dependencies in autoregressive models (standard exchangeability assumptions do not apply directly).

Authors: We agree that an explicit derivation is required. The calibration procedure operates at the sequence level: nonconformity scores are computed on complete generated sequences, and the threshold is chosen so that the indicator of the coverage event is exchangeable across calibration and test sequences drawn from the same data distribution. Token-level dependencies are internal to the generation process and do not affect the marginal coverage guarantee over full sequences. In the revision we will add a self-contained derivation (new subsection in Section 3) that makes this argument precise without invoking token-wise exchangeability. (Revision: yes.)
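
For reference, the standard split-conformal threshold, which relies only on exchangeability of sequence-level scores, can be sketched in Python as follows. This illustrates the generic textbook construction rather than the paper's calibration procedure, which the abstract does not specify; `conformal_threshold` and the example scores are hypothetical.

```python
import math
from typing import Sequence


def conformal_threshold(cal_scores: Sequence[float], alpha: float) -> float:
    """Return the ceil((n + 1) * (1 - alpha))-th smallest calibration score."""
    n = len(cal_scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return sorted(cal_scores)[k - 1]


# One nonconformity score per complete calibration sequence (made-up values).
scores = [0.05, 0.9, 0.2, 0.6, 0.1, 0.4, 0.3]
threshold = conformal_threshold(scores, alpha=0.25)
```

If calibration and test scores are exchangeable, a new score falls at or below this threshold with probability at least 1 − α, regardless of how the tokens within each sequence depend on one another.
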
Referee: [Section 5 (Experiments)] Empirical claims of 'the same statistical guarantees' with higher utility require reporting of the exact coverage rates, risk metrics, and utility measures (e.g., factuality scores or solution correctness) on the biography and math tasks, including comparison to baselines under identical posterior approximation quality; without these, the utility gain cannot be isolated from approximation artifacts.

Authors: We will expand Section 5 with new tables that report the precise empirical coverage frequencies, risk values, and utility metrics (factuality for biographies, solution correctness for math) for every method. All comparisons will be performed under identical posterior approximations (same model, same temperature, same calibration set) so that differences can be attributed to the sampling procedure rather than approximation quality. The revised manuscript will also include the raw numerical values rather than only summary statements. (Revision: yes.)
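
The figure captions report results averaged over 10 splits with a 95% CI. One way such a summary could be computed is sketched below in Python, assuming a normal approximation across splits. The paper's exact interval construction is not stated here, and the function name and numbers are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_with_ci(values: Sequence[float], z: float = 1.96) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a normal-approximation interval."""
    center = mean(values)
    half_width = z * stdev(values) / math.sqrt(len(values))
    return center, center - half_width, center + half_width


# Made-up per-split empirical factuality for a 0.90 target.
per_split = [0.91, 0.89, 0.92, 0.90, 0.93, 0.88, 0.91, 0.90, 0.92, 0.89]
center, low, high = mean_with_ci(per_split)
```

With only 10 splits, a Student-t quantile in place of 1.96 would give a slightly wider interval.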

Circularity Check

0 steps flagged

No significant circularity; derivation extends prior conformal methods independently

The paper proposes a calibration procedure for conditional sequential generation that identifies high-scoring regions and achieves risk control, presented as an extension of existing conformal prediction techniques to address post-hoc filtering limitations. No equations or steps in the abstract reduce the claimed guarantees or utility gains to quantities defined by construction from fitted parameters on the same data, self-citations that are load-bearing, or renamed known results. The central claims rest on the existence and empirical validation of the new procedure rather than tautological redefinitions, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the existence and effectiveness of a new calibration procedure for sequential generation and on the feasibility of posterior approximations. No free parameters or invented entities are identifiable from the abstract; the single axiom is the domain assumption listed below.

axioms (1)

Domain assumption: existence of a calibration procedure for conditional sequential generation that identifies a high-scoring region while achieving target risk control.
The method depends on developing and applying such a procedure to enable the posterior sampling approach.

pith-pipeline@v0.9.1-grok · 5707 in / 1129 out tokens · 22586 ms · 2026-06-28T11:20:13.533342+00:00 · methodology

Reference graph

Works this paper leans on

30 extracted references · 4 linked inside Pith

[1] Learn then test: Calibrating predictive algorithms to achieve risk control. The Annals of Applied Statistics, 2025.
[2] Theoretical foundations of conformal prediction. arXiv preprint arXiv:2411.11824.
[3] Conformal risk control. International Conference on Learning Representations.
[4] Large language model validity via enhanced conformal prediction methods. Advances in Neural Information Processing Systems.
[5] Beyond binary rewards: Training lms to reason about their uncertainty. International Conference on Learning Representations.
[6] Large language models in legal systems: A survey. Humanities and Social Sciences Communications, 2025.
[7] Detommaso, Gianluca; Bertran, Martin; Fogliato, Riccardo; Roth, Aaron. 2024.
[8] HalluHard: A Hard Multi-Turn Hallucination Benchmark. arXiv preprint arXiv:2602.01031.
[9] Conformal prediction under feedback covariate shift for biomolecular design. Proceedings of the National Academy of Sciences, 2022.
[10] Evaluating large language model workflows in clinical decision support for triage and referral and diagnosis. npj Digital Medicine, 2025.
[11] Multicalibration: Calibration for the (computationally-identifiable) masses. International Conference on Machine Learning, 2018.
[12] The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
[13] Syntactic and semantic control of large language models via sequential monte carlo. International Conference on Learning Representations.
[14] Sequential monte carlo steering of large language models using probabilistic programs. arXiv preprint arXiv:2306.03081.
[15] Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874.
[16] Factscore: Fine-grained atomic evaluation of factual precision in long form text generation. Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
[17] Language Models with Conformal Factuality Guarantees. International Conference on Machine Learning, 2024.
[18] Conformal Validity Guarantees Exist for Any Data Distribution (and How to Find Them). Proceedings of the 41st International Conference on Machine Learning.
[19] Conformal language modeling. International Conference on Learning Representations.
[20] Hypothesis testing with e-values. Foundations and Trends, 2025.
[21] Conformal Language Model Reasoning with Coherent Factuality. The Thirteenth International Conference on Learning Representations.
[22] E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing. arXiv preprint arXiv:2512.03109.
[23] Conformal prediction under covariate shift. Advances in Neural Information Processing Systems.
[24] A calibration hierarchy for risk models was defined: from utopia to empirical data. Journal of Clinical Epidemiology, 2016.
[25] Asymptotic statistics. 2000.
[26] Thought calibration: Efficient and confident test-time scaling. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025.
[27] Online Reasoning Calibration: Test-Time Training Enables Generalizable Conformal LLM Reasoning. arXiv preprint arXiv:2604.01170.
[28] Probabilistic Inference in Language Models via Twisted Sequential Monte Carlo. International Conference on Machine Learning, 2024.
[29] Weak convergence and empirical processes. With applications to statistics. 1996.
[30] Transduction with Confidence and Credibility. Proceedings of the Sixteenth International Joint Conference on Artificial Intelligence.
