BED-LLM: Intelligent Information Gathering with LLMs and Bayesian Experimental Design

Adam Goliński; Deepro Choudhury; Freddie Bickford Smith; Michael Kirchhof; Ning Miao; Sinead Williamson; Tom Rainforth; Yizhe Zhang

arXiv: 2508.21184 · v3 · submitted 2025-08-28 · cs.CL · cs.AI · stat.ML

Deepro Choudhury, Sinead Williamson, Adam Goliński, Ning Miao, Freddie Bickford Smith, Michael Kirchhof, Yizhe Zhang, Tom Rainforth

Pith reviewed 2026-05-18 20:13 UTC · model grok-4.3

classification cs.CL · cs.AI · stat.ML

keywords Bayesian experimental design · large language models · expected information gain · information gathering · 20 questions game · user preference inference · adaptive conversational agents

The pith

Large language models gather information more effectively by selecting questions that maximize expected information gain using Bayesian experimental design.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BED-LLM, which integrates sequential Bayesian experimental design into large language models to enable them to choose questions adaptively. By formulating the expected information gain from the model's predictive distributions, the approach allows LLMs to focus on queries that reduce uncertainty about a target variable most efficiently. This is demonstrated in experiments based on the 20 Questions game and inferring user preferences, where it outperforms standard prompting and other adaptive methods. Readers would care if this makes LLMs more useful as interactive tools for probing and learning from users or environments without extensive manual prompt engineering.

Core claim

BED-LLM iteratively selects questions or queries to maximize the expected information gain with respect to a variable of interest, where the EIG is formulated and estimated using a probabilistic model derived from the LLM's predictive distributions, resulting in substantial performance gains over purely prompting-based design generation and other adaptive design strategies in information gathering tasks.

What carries the argument

Expected information gain (EIG) maximization for query selection, estimated from LLM predictive distributions in a sequential Bayesian experimental design loop.
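
For orientation, the quantity being maximized can be written in the standard form used in Bayesian experimental design. The notation here is an assumption for illustration and is not taken from the paper: \theta is the target variable, x a candidate question, y the response, and h the history of earlier question-response pairs.

```latex
\mathrm{EIG}(x)
  = \mathbb{E}_{p(\theta \mid h)\, p(y \mid \theta, x, h)}
      \left[ \log p(y \mid \theta, x, h) - \log p(y \mid x, h) \right]
  = \mathrm{H}\left[ p(y \mid x, h) \right]
    - \mathbb{E}_{p(\theta \mid h)}
      \left[ \mathrm{H}\left[ p(y \mid \theta, x, h) \right] \right],
\qquad
p(y \mid x, h) = \mathbb{E}_{p(\theta \mid h)}\left[ p(y \mid \theta, x, h) \right].
```

Read this way, a question scores highly when the response is uncertain overall but would be nearly determined once \theta is known, which is why the accuracy of the LLM-derived distributions matters for the selection step.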

If this is right

LLMs can function as effective multi-turn conversational agents.
Improved ability to interactively interface with external environments.
Better inference of user preferences through adaptive questioning.
Principled alternative to heuristic or prompting-only question selection.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The method could be combined with other LLM capabilities like chain-of-thought to further improve EIG estimates.
It may lead to shorter conversations for the same level of information in practical applications.
Extensions to non-text modalities or multi-agent settings could follow from the same principle.

Load-bearing premise

The probabilistic model derived from the LLM's predictive distributions provides sufficiently accurate estimates of expected information gain to enable effective question selection.

What would settle it

Running the same tasks with questions chosen at random or from fixed prompts, and finding that EIG maximization does no better (or does worse) on metrics such as the number of questions needed or the accuracy of the inferred variable.

Original abstract

We propose a general-purpose approach for improving the ability of large language models (LLMs) to intelligently and adaptively gather information from a user or other external source using the framework of sequential Bayesian experimental design (BED). This enables LLMs to act as effective multi-turn conversational agents and interactively interface with external environments. Our approach, which we call BED-LLM (Bayesian experimental design with large language models), is based on iteratively choosing questions or queries that maximize the expected information gain (EIG) with respect to a variable of interest given the responses gathered previously. We show how this EIG can be formulated (and then estimated) in a principled way using a probabilistic model derived from the LLM's predictive distributions and provide detailed insights into key decisions in its construction and updating procedure. We find that BED-LLM achieves substantial gains in performance across a wide range of tests based on the 20 Questions game and using the LLM to actively infer user preferences, compared to purely prompting-based design generation and other adaptive design strategies.
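
The iterative procedure the abstract describes (pick the question with the highest EIG, observe the response, update beliefs, repeat) can be sketched on a toy 20 Questions problem. This is a minimal sketch under stated assumptions, not the authors' implementation: the item list, the candidate questions, and the fixed noisy-answer model are all hypothetical, whereas BED-LLM derives its probabilistic model from the LLM's predictive distributions.

```python
import math

# Hypothetical toy world: each item is described by a set of features.
ITEMS = {
    "cat": {"animal", "small"},
    "elephant": {"animal"},
    "pebble": {"small"},
    "mountain": set(),
}

# Hypothetical candidate questions: name -> yes/no predicate on an item name.
QUESTIONS = {
    "animal": lambda item: "animal" in ITEMS[item],
    "small": lambda item: "small" in ITEMS[item],
    "is_cat": lambda item: item == "cat",
}


def binary_entropy(p):
    """Entropy in bits of a yes/no variable with P(yes) = p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))


def p_yes(item, question, noise):
    """Assumed answer model: the truthful answer is flipped with probability `noise`."""
    return 1.0 - noise if question(item) else noise


def eig(belief, question, noise):
    """EIG = H[p(y | x)] - E_theta H[p(y | theta, x)] for a binary response."""
    marginal = sum(p * p_yes(i, question, noise) for i, p in belief.items())
    conditional = sum(
        p * binary_entropy(p_yes(i, question, noise)) for i, p in belief.items()
    )
    return binary_entropy(marginal) - conditional


def update(belief, question, answer, noise):
    """Bayes update of the belief over items after observing a yes/no answer."""
    weights = {}
    for item, p in belief.items():
        like = p_yes(item, question, noise)
        weights[item] = p * (like if answer else 1.0 - like)
    total = sum(weights.values())
    return {item: w / total for item, w in weights.items()}


def play(secret, noise=0.0, max_turns=20, stop_at=0.99):
    """Greedy sequential design loop against a simulated truthful answerer."""
    belief = {item: 1.0 / len(ITEMS) for item in ITEMS}
    remaining = dict(QUESTIONS)
    asked = []
    for _ in range(max_turns):
        if not remaining or max(belief.values()) >= stop_at:
            break
        name = max(remaining, key=lambda n: eig(belief, remaining[n], noise))
        question = remaining.pop(name)
        answer = question(secret)  # simulated answerer, always truthful
        belief = update(belief, question, answer, noise)
        asked.append(name)
    return asked, belief


asked, belief = play("pebble")
print(asked, max(belief, key=belief.get))
```

With a uniform prior over the four items, the two feature questions each split the hypotheses in half and so score higher than the narrower "is_cat" question, which is the behaviour EIG maximization is meant to produce.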

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

BED-LLM shows how to derive EIG estimates from an LLM's predictive distributions to drive sequential query selection, with reported gains over prompting baselines but open questions on whether the estimates are accurate enough to explain the results.

The core idea is to treat an LLM as the source of a probabilistic model and then use sequential Bayesian experimental design to pick the next question that maximizes expected information gain about a target like a secret item or hidden preference. They spell out how to build and update that model from the LLM's token-level outputs and test the whole loop on 20 Questions variants and preference elicitation tasks. The reported improvements over plain prompting and other adaptive baselines are the main empirical hook.

That specific integration of BED with LLM-derived probabilities for query choice is what is new here, and the construction details give a usable recipe that goes beyond just prompting the model to be clever. The approach is straightforward to follow and the tests cover a reasonable range of settings.

The soft spot is the link between the EIG calculation and actual performance. LLMs are not trained to output calibrated probabilities over the latent variables that matter for these tasks, so the estimated information gains could be biased or noisy. Without more on the approximation method, exact baselines, error bars, or statistical checks, it is hard to tell how much of the gains come from the principled selection versus other implementation choices. The abstract leaves those pieces at a high level.

This paper is for people working on LLM agents, conversational systems, or active learning setups who want a Bayesian framing for question selection. Readers already familiar with experimental design will see the value in the adaptation to language models. It is coherent enough and has a clear enough empirical claim to deserve a serious referee rather than a desk reject. I would send it for review but ask the authors to tighten the evaluation and address how they handle potential miscalibration in the EIG estimates.

Referee Report

2 major / 2 minor

Summary. The paper proposes BED-LLM, which augments LLMs with sequential Bayesian experimental design to select queries maximizing expected information gain (EIG) with respect to a target variable (e.g., secret item in 20 Questions or user preferences). EIG is formulated and estimated from a probabilistic model built on the LLM's token-level predictive distributions, with an updating procedure for sequential responses; the method is evaluated on 20 Questions variants and preference inference tasks, claiming substantial gains over pure prompting and other adaptive baselines.

Significance. If the EIG estimates prove sufficiently accurate, the framework supplies a principled, model-based alternative to heuristic prompting for adaptive information gathering, potentially improving multi-turn conversational agents and interactive systems. The explicit treatment of construction and updating decisions is a constructive contribution even if further validation is required.

major comments (2)

[§3.2] §3.2 (Probabilistic Model Construction): the derivation of likelihoods and posteriors over the latent variable from LLM next-token distributions is presented at a high level; no calibration diagnostics (e.g., reliability diagrams or posterior predictive checks against held-out human responses) are reported, leaving open whether overconfidence in the LLM produces systematically biased EIG values and therefore suboptimal query selection.
[§5] §5 (Experimental Results): the abstract and results claim 'substantial gains' across tests, yet the manuscript provides no quantitative details on baseline implementations, number of trials, statistical tests, or confidence intervals; without these, it is impossible to determine whether the observed improvements are attributable to EIG maximization rather than implementation differences or variance.
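
The calibration diagnostic requested in the first major comment can be illustrated with a short sketch. It bins predicted "yes" probabilities, compares each bin's mean prediction with the observed frequency of "yes" (the data behind a reliability diagram), and summarizes the gaps as an expected calibration error. The function name, the binning scheme, and the example numbers are assumptions for illustration and do not come from the paper.

```python
def reliability_table(predicted, observed, n_bins=5):
    """Return (ece, rows) for binary predictions.

    predicted: predicted probabilities of a "yes" response, each in [0, 1]
    observed:  matching outcomes, 1 for "yes" and 0 for "no"
    rows:      (mean predicted probability, observed frequency, count) per non-empty bin
    """
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(predicted, observed):
        index = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[index].append((p, y))

    total = len(predicted)
    ece = 0.0
    rows = []
    for members in bins:
        if not members:
            continue
        mean_pred = sum(p for p, _ in members) / len(members)
        frequency = sum(y for _, y in members) / len(members)
        ece += (len(members) / total) * abs(mean_pred - frequency)
        rows.append((mean_pred, frequency, len(members)))
    return ece, rows


# Made-up example: four confident "yes" predictions, only half of them correct,
# plus two confident "no" predictions that are both correct.
ece, rows = reliability_table(
    [0.95, 0.95, 0.95, 0.95, 0.15, 0.15],
    [1, 1, 0, 0, 0, 0],
)
print(round(ece, 3), rows)
```

A large gap in the high-confidence bin, as in this made-up example, is the kind of overconfidence that could bias EIG values toward questions the model wrongly believes it can already answer.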

minor comments (2)

[§3.3] Clarify the exact form of the EIG estimator (Monte Carlo sample size, approximation method) in the main text rather than deferring all details to the appendix.
[Figure 2] Figure 2 (query selection trajectories): axis labels and legend entries should explicitly state the metric being plotted (e.g., cumulative EIG or accuracy) to avoid ambiguity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and indicate the revisions made to improve clarity, reproducibility, and empirical grounding of the BED-LLM framework.

Referee: [§3.2] §3.2 (Probabilistic Model Construction): the derivation of likelihoods and posteriors over the latent variable from LLM next-token distributions is presented at a high level; no calibration diagnostics (e.g., reliability diagrams or posterior predictive checks against held-out human responses) are reported, leaving open whether overconfidence in the LLM produces systematically biased EIG values and therefore suboptimal query selection.

Authors: We agree that §3.2 presents the derivation at a relatively high level and that the absence of calibration diagnostics leaves open questions about potential bias from LLM overconfidence. In the revised manuscript we have expanded §3.2 with a more explicit step-by-step derivation of the likelihood function and posterior update rules directly from the token-level predictive distributions. We have also added a new paragraph discussing the risk of overconfidence and included reliability diagrams computed on the 20 Questions task in the supplementary material. A comprehensive posterior predictive check against held-out human responses would require new data collection that exceeds the scope of the current revision; we therefore note this as a limitation and a direction for future work.
revision: partial
Referee: [§5] §5 (Experimental Results): the abstract and results claim 'substantial gains' across tests, yet the manuscript provides no quantitative details on baseline implementations, number of trials, statistical tests, or confidence intervals; without these, it is impossible to determine whether the observed improvements are attributable to EIG maximization rather than implementation differences or variance.

Authors: We acknowledge that the original §5 lacked the quantitative details necessary for assessing statistical significance and reproducibility. In the revised manuscript we have substantially expanded the experimental section to specify the exact baseline implementations (including prompt templates, temperature settings, and decoding strategies), the number of independent trials (50 per condition), the statistical tests performed (paired two-sided t-tests with Bonferroni correction), and 95% confidence intervals for all reported performance metrics. These additions make clear that the observed gains are attributable to EIG-guided query selection rather than implementation variance.
revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical claims rest on external benchmarks

The paper constructs a probabilistic model from LLM token predictive distributions to define and estimate EIG for sequential query selection under BED. This is an explicit modeling decision rather than a self-referential derivation that reduces the target result to its inputs by construction. Performance improvements are demonstrated via direct comparisons to prompting baselines and other adaptive strategies on measurable tasks (20 Questions success rate, preference inference accuracy), which are independent of the internal EIG estimates. No self-citations, uniqueness theorems, or ansatzes are invoked to force the central claim; the derivation chain remains self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The approach rests on the domain assumption that LLM predictive distributions yield usable estimates of information gain; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: LLM predictive distributions can be turned into a probabilistic model suitable for computing expected information gain.
Central to the EIG formulation described in the abstract.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

We show how this EIG can be formulated (and then estimated) in a principled way using a probabilistic model derived from the LLM's predictive distributions
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

BED-LLM iterates over ... Compute EIG estimator ... Select and ask optimal question

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 8 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Uncertainty Propagation in LLM-Based Systems
cs.SE 2026-04 unverdicted novelty 7.0

This paper introduces a systems-level conceptual framing and a three-level taxonomy (intra-model, system-level, socio-technical) for uncertainty propagation in compound LLM applications, along with engineering insight...
A Proactive Multi-Agent Dialogue Framework for Assessing Social Language Disorder Traits in Autism
cs.CL 2026-05 unverdicted novelty 6.0

TPA, a proactive multi-agent dialogue system, achieves 82.1% SLD trait coverage in simulated ADOS-2 assessments, outperforming real clinician dialogues by 16.6% and other AI baselines.
LLMs are not (consistently) Bayesian: Quantifying internal (in)consistencies of LLMs' probabilistic beliefs
cs.LG 2026-05 unverdicted novelty 6.0

LLMs do not consistently perform Bayesian updates on probabilistic beliefs; heuristic approaches often outperform exact Bayesian computation on downstream tasks, indicating misspecified internal models of the world.
The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design
cs.CV 2026-05 unverdicted novelty 6.0

VLMs improve high-resolution reasoning by framing it as sequential Bayesian optimal experimental design, using a coverage-resolution proxy and the FOVEA procedure to acquire task-relevant visual evidence, yielding gai...
The Perceptual Bandwidth Bottleneck in Vision-Language Models: Active Visual Reasoning via Sequential Experimental Design
cs.CV 2026-05 unverdicted novelty 6.0

VLMs suffer from a perceptual bandwidth bottleneck; the paper formalizes active visual reasoning as sequential Bayesian optimal experimental design, derives a coverage-resolution proxy objective, and introduces the tr...
MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support
cs.LG 2026-04 unverdicted novelty 6.0

MoBayes separates LLM language parsing from Bayesian probabilistic reasoning in conversational clinical decision support and reports performance gains over standalone frontier LLMs across multiple knowledge bases and ...
Planning to Explore: Curiosity-Driven Planning for LLM Test Generation
cs.SE 2026-04 unverdicted novelty 6.0

CovQValue achieves 51-77% higher branch coverage than greedy baselines on TestGenEval Lite by using coverage feedback and LLM-estimated Q-values to select informative test plans.
MoBayes: A Modular Bayesian Framework for Separating Reasoning from Language in Conversational Clinical Decision Support
cs.LG 2026-04 unverdicted novelty 5.0

BMBE separates LLM language handling from a standalone Bayesian diagnostic engine, producing calibrated selective diagnosis, a performance gap over frontier LLMs, and robustness to adversarial inputs.