Textual Bayes: Quantifying Prompt Uncertainty in LLM-Based Systems

· 2025 · cs.LG · arXiv 2506.10060

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

Although large language models (LLMs) are becoming increasingly capable of solving challenging real-world tasks, accurately quantifying their uncertainty remains a critical open problem--one that limits their applicability in high-stakes domains. This challenge is further compounded by the closed-source, black-box nature of many state-of-the-art LLMs. Moreover, LLM-based systems can be highly sensitive to the prompts that bind them together, which often require significant manual tuning (i.e., prompt engineering). In this work, we address these challenges by viewing LLM-based systems through a Bayesian lens. We interpret prompts as textual parameters in a statistical model, allowing us to use a small training dataset to perform Bayesian inference over these prompts. This novel perspective enables principled uncertainty quantification over both the model's textual parameters and its downstream predictions, while also incorporating prior beliefs about these parameters expressed in free-form text. To perform Bayesian inference--a difficult problem even for well-studied data modalities--we introduce Metropolis-Hastings through LLM Proposals (MHLP), a novel Markov chain Monte Carlo (MCMC) algorithm that combines prompt optimization techniques with standard MCMC methods. MHLP is a turnkey modification to existing LLM pipelines, including those that rely exclusively on closed-source models. Empirically, we demonstrate that our method yields improvements in both predictive accuracy and uncertainty quantification (UQ) on a range of LLM benchmarks and UQ tasks. More broadly, our work demonstrates a viable path for incorporating methods from the rich Bayesian literature into the era of LLMs, paving the way for more reliable and calibrated LLM-based systems.
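The abstract describes MHLP only at a high level, so the following is a minimal sketch of the standard Metropolis-Hastings accept/reject step applied to text-valued states, not the paper's algorithm. Everything specific here is an assumption made for illustration: the candidate prompts, the toy posterior weights in WEIGHTS, and the uniform propose function, which merely stands in for whatever LLM-driven proposal the authors use.

```python
import math
import random
from collections import Counter

# Hypothetical candidate prompts (illustrative only, not from the paper).
PROMPTS = ["Answer concisely.", "Think step by step.", "Cite your sources."]

# Toy unnormalized posterior weights (assumed). In a real system this would be
# prior(prompt) * likelihood(training data | prompt).
WEIGHTS = {PROMPTS[0]: 1.0, PROMPTS[1]: 3.0, PROMPTS[2]: 6.0}


def log_target(prompt):
    """Unnormalized log posterior of a prompt under the toy weights."""
    return math.log(WEIGHTS[prompt])


def propose(current, rng):
    """Stand-in proposal: pick one of the other prompts uniformly at random."""
    return rng.choice([p for p in PROMPTS if p != current])


def log_q(new, old):
    """Log proposal density log q(new | old); uniform, hence symmetric, here."""
    return -math.log(len(PROMPTS) - 1)


def metropolis_hastings(init, n_steps, rng):
    """Run a Metropolis-Hastings chain over prompts and return visited states."""
    current = init
    samples = []
    for _ in range(n_steps):
        candidate = propose(current, rng)
        # Log acceptance ratio: target ratio plus reverse/forward proposal ratio.
        log_alpha = (log_target(candidate) - log_target(current)
                     + log_q(current, candidate) - log_q(candidate, current))
        if rng.random() < math.exp(min(0.0, log_alpha)):
            current = candidate
        samples.append(current)
    return samples


def empirical_frequencies(samples):
    """Fraction of chain steps spent at each prompt."""
    counts = Counter(samples)
    return {p: counts[p] / len(samples) for p in PROMPTS}
```

Calling empirical_frequencies(metropolis_hastings(PROMPTS[0], 20000, random.Random(0))) should give frequencies close to the normalized toy weights (0.1, 0.3, 0.6). The proposal-ratio term cancels here because the stand-in proposal is symmetric; it is kept in the code because a text-generating proposal would not in general be symmetric.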

representative citing papers

Calibration, Not Compilation: Detecting and Repairing Misspecified Probabilistic Programs Written by Language Models

cs.LG · 2026-06-30 · unverdicted · novelty 7.0

Bayesian workflow diagnostics outperform unit tests for detecting and repairing statistically misspecified LLM-generated probabilistic programs across benchmarks and real generation tasks.

Ask the Right Comparison: Bias-Aware Bayesian Active Top-$k$ Ranking with LLM Judges

cs.LG · 2026-07-02 · unverdicted · novelty 6.0

A bias-aware Bayesian model with judge-specific covariates and a top-k membership uncertainty acquisition rule recovers accurate top-k rankings from noisy LLM judges using fewer comparisons than naive aggregation or standard active learning.

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

cs.AI · 2025-10-05 · unverdicted · novelty 6.0

A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.

citing papers explorer

Showing 3 of 3 citing papers.

Calibration, Not Compilation: Detecting and Repairing Misspecified Probabilistic Programs Written by Language Models

cs.LG · 2026-06-30 · unverdicted · none · ref 11 · internal anchor

Bayesian workflow diagnostics outperform unit tests for detecting and repairing statistically misspecified LLM-generated probabilistic programs across benchmarks and real generation tasks.

Ask the Right Comparison: Bias-Aware Bayesian Active Top-$k$ Ranking with LLM Judges

cs.LG · 2026-07-02 · unverdicted · none · ref 20 · internal anchor

A bias-aware Bayesian model with judge-specific covariates and a top-k membership uncertainty acquisition rule recovers accurate top-k rankings from noisy LLM judges using fewer comparisons than naive aggregation or standard active learning.

Don't Pass@k: A Bayesian Framework for Large Language Model Evaluation

cs.AI · 2025-10-05 · unverdicted · none · ref 43 · internal anchor

A Dirichlet-prior Bayesian estimator for model success probability replaces Pass@k, delivering faster-converging and more stable rankings with credible intervals on math benchmarks.
