pith. machine review for the scientific record.

arxiv: 2604.07667 · v1 · submitted 2026-04-09 · 💻 cs.AI · cs.MA · cs.SI

Recognition: 2 theorem links · Lean Theorem

From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 18:32 UTC · model grok-4.3

classification 💻 cs.AI · cs.MA · cs.SI
keywords conformal prediction · multi-agent debate · large language models · prediction sets · coverage guarantee · safe AI decision making · escalation policy · LLM agents

The pith

Applying split conformal prediction to aggregated probabilities from multiple LLM agents in debate creates guaranteed-coverage sets that allow safe autonomous action only on singletons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multi-agent debate among LLMs can be made safer by adding a conformal decision layer after the debate ends. Agents' verbalized probabilities are pooled linearly and then calibrated with split conformal prediction to form sets that include the correct answer with high probability. This enables a policy to act on its own when the set is a singleton and to escalate to a human otherwise. The result is that most cases where agents wrongly agree are intercepted rather than acted upon, while the accuracy on the cases that do trigger action improves through selection. The guarantee holds without requiring any individual agent to be well-calibrated.

Core claim

Verbalized probability distributions from heterogeneous agents are aggregated via a linear opinion pool and calibrated with split conformal prediction, yielding prediction sets with a marginal coverage guarantee: the correct answer is included with probability at least 1 − α. A hierarchical action policy then maps singleton sets to autonomous action and larger sets to human escalation. On benchmarks with three agents, coverage stays close to target, 81.9 percent of wrong-consensus cases are intercepted at α = 0.05, and the remaining singleton cases reach 90.0 to 96.8 percent accuracy.

What carries the argument

The conformal prediction sets obtained after linear pooling of agent probabilities, which enforce marginal coverage and drive the singleton-versus-larger-set decision rule.
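A minimal sketch of this machinery, reconstructed from the abstract rather than from the authors' code. The function names and the exact nonconformity score (one minus the pooled probability of a candidate answer) are our assumptions; the finite-sample quantile correction is the standard split-conformal one.

```python
import numpy as np

def linear_opinion_pool(agent_probs, weights=None):
    """Weighted average of per-agent probability vectors (agents x answers)."""
    agent_probs = np.asarray(agent_probs, dtype=float)
    if weights is None:
        weights = np.full(agent_probs.shape[0], 1.0 / agent_probs.shape[0])
    pooled = np.tensordot(weights, agent_probs, axes=1)
    return pooled / pooled.sum()  # renormalize: verbalized probs may not sum to 1

def conformal_quantile(cal_pooled, cal_labels, alpha):
    """Split-conformal threshold from a held-out calibration set.
    Nonconformity score = 1 - pooled probability of the true answer."""
    scores = 1.0 - cal_pooled[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    return np.quantile(scores, q_level, method="higher")

def prediction_set(pooled, q_hat):
    """All candidate answers whose nonconformity score is within the threshold."""
    return np.where(1.0 - pooled <= q_hat)[0]
```

Under exchangeability of calibration and test instances, sets built this way contain the correct answer with probability at least 1 − α; a singleton is the signal to act, anything larger is the signal to escalate.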

If this is right

  • 81.9 percent of wrong-consensus cases lead to escalation at α = 0.05 instead of erroneous autonomous action.
  • The accuracy of cases where the system acts autonomously reaches 90.0 to 96.8 percent, up to 22.1 percentage points above simple consensus-based stopping.
  • Empirical coverage of the correct answer stays within 1 to 2 points of the target level across the tested domains.
  • The fraction of cases handled autonomously versus escalated can be adjusted directly by tuning α.
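The last point, the α-tunable act-versus-escalate dial, reduces to a one-branch policy over conformal set sizes. A schematic sketch, with hand-written sets standing in for real conformal output (all names illustrative):

```python
def decide(pred_set):
    """Hierarchical action policy: act autonomously only when the
    conformal set is a singleton; otherwise escalate to a human."""
    if len(pred_set) == 1:
        return ("act", pred_set[0])
    return ("escalate", None)

# Illustrative sets only. Smaller alpha widens sets, shifting decisions
# from "act" toward "escalate" -- that shift is the automation-safety dial.
example_sets = [[2], [0, 3], [1], [0, 1, 4]]
decisions = [decide(s) for s in example_sets]
automation_rate = sum(d[0] == "act" for d in decisions) / len(decisions)
```

The safety property comes entirely from refusing to act on non-singletons, not from any change to the debate itself.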

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same post-hoc calibration approach could be tested on multi-agent systems that use different interaction protocols beyond debate.
  • High-stakes applications might accept a lower automation rate in exchange for the coverage guarantee.
  • Future work could explore whether non-linear pooling methods preserve the conformal guarantees as well as linear ones do.
  • Deployment would require maintaining a calibration set of past debates that remains similar to new inputs.

Load-bearing premise

The past calibration examples and the new debate cases must be similar enough in distribution that the coverage guarantee still holds for the combined probabilities.

What would settle it

Collecting a new set of debate instances, computing the conformal sets, and checking whether the fraction of cases where the correct answer is missing from the set exceeds the chosen alpha.
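Such an audit is a few lines once pooled probabilities for the new debates are in hand. A hedged sketch, assuming the nonconformity score is one minus the pooled probability of the true answer (function name and data are ours, not the paper's):

```python
import numpy as np

def empirical_miscoverage(test_pooled, test_labels, q_hat):
    """Fraction of fresh debate instances whose correct answer falls
    outside the conformal set, i.e. whose nonconformity score
    (1 - pooled probability of the true answer) exceeds q_hat."""
    scores = 1.0 - test_pooled[np.arange(len(test_labels)), test_labels]
    return float((scores > q_hat).mean())

# Hypothetical pooled probabilities and true labels for four new debates.
pooled = np.array([[0.9, 0.05, 0.05],
                   [0.2, 0.7, 0.1],
                   [0.4, 0.5, 0.1],
                   [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 0, 2])
rate = empirical_miscoverage(pooled, labels, q_hat=0.55)
# The guarantee is in doubt if this rate persistently exceeds the chosen alpha.
```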

Figures

Figures reproduced from arXiv: 2604.07667 by Aijing Gao, Bing Zhu, Fangwei Han, Guanghui Wang, Guang Yang, Haochen Xie, Hengzhi Qiu, Jae Oh Woo, Mengdie Flora Wang, Qucy Wei Qiu, Yajing Huang, Ziyuan Li.

Figure 1. The cost of acting without calibrated refusal.
Figure 2. From debate output to deployment decision. Heterogeneous LLM agents debate for…
Figure 3. Coverage maintenance across debate rounds.
Figure 4. Calibrated refusal in action (α = 0.05). In both examples, agents unanimously converge to the wrong answer through social reinforcement. Consensus alone would commit this error to an automated action; conformal prediction keeps the correct answer inside the set and forces escalation to human review.
read the original abstract

Multi-agent debate improves LLM reasoning, yet agreement among agents is not evidence of correctness. When agents converge on a wrong answer through social reinforcement, consensus-based stopping commits that error to an automated action with no recourse. We introduce Conformal Social Choice, a post-hoc decision layer that converts debate outputs into calibrated act-versus-escalate decisions. Verbalized probability distributions from heterogeneous agents are aggregated via a linear opinion pool and calibrated with split conformal prediction, yielding prediction sets with a marginal coverage guarantee: the correct answer is included with probability ${\geq}\,1{-}\alpha$, without assumptions on individual model calibration. A hierarchical action policy maps singleton sets to autonomous action and larger sets to human escalation. On eight MMLU-Pro domains with three agents (Claude Haiku, DeepSeek-R1, Qwen-3 32B), coverage stays within 1--2 points of the target. The key finding is not that debate becomes more accurate, but that the conformal layer makes its failures actionable: 81.9% of wrong-consensus cases are intercepted at $\alpha{=}0.05$. Because the layer refuses to act on cases where debate is confidently wrong, the remaining conformal singletons reach 90.0--96.8% accuracy (up to 22.1pp above consensus stopping) -- a selection effect, not a reasoning improvement. This safety comes at the cost of automation, but the operating point is user-adjustable via $\alpha$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Conformal Social Choice, a post-hoc decision layer for multi-agent LLM debate. Verbalized probability distributions from heterogeneous agents are aggregated via linear opinion pool and calibrated using split conformal prediction to produce prediction sets with a marginal coverage guarantee (correct answer included with probability ≥1-α, without requiring individual model calibration). A hierarchical policy maps singleton sets to autonomous action and larger sets to human escalation. On eight MMLU-Pro domains with three fixed LLMs, empirical coverage stays within 1-2 points of target; at α=0.05, 81.9% of wrong-consensus cases are intercepted, and remaining singletons achieve 90.0-96.8% accuracy (up to 22.1pp gain over consensus stopping) via selection effects rather than improved reasoning.

Significance. If the coverage guarantee is valid under the experimental conditions, the work offers a practical, adjustable mechanism to mitigate risks of erroneous consensus in multi-agent deliberation by making failures actionable through calibrated escalation. It correctly applies established split conformal prediction theory to aggregated scores and demonstrates that safety gains arise from selective abstention. This could be relevant for high-stakes LLM agent deployment, with the α parameter providing a tunable automation-safety tradeoff.

major comments (1)
  1. [Conformal calibration procedure and experimental setup] The marginal coverage guarantee and the 81.9% wrong-consensus interception rate at α=0.05 rest on exchangeability of calibration and test instances for the linearly pooled scores. The setup uses three heterogeneous LLMs across eight MMLU-Pro domains with debate dynamics; the manuscript does not address whether domain shifts, agent-specific behaviors, or non-random splits violate exchangeability (see abstract claim of guarantee 'without assumptions on individual model calibration' and the conformal calibration description). This is load-bearing for the central safety claims, as violation would render the reported coverage and interception rates without theoretical backing.
minor comments (2)
  1. [Methods] Clarify the exact form of the linear opinion pool aggregation with an explicit equation or pseudocode, including how verbalized probabilities are normalized across agents.
  2. [Experiments] In results tables or figures showing coverage and accuracy, include the target 1-α line and report the number of calibration/test instances per domain to allow assessment of statistical reliability.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of the work's significance and for identifying the need to explicitly address the exchangeability assumption underlying the conformal coverage guarantee. We respond to the major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Conformal calibration procedure and experimental setup] The marginal coverage guarantee and the 81.9% wrong-consensus interception rate at α=0.05 rest on exchangeability of calibration and test instances for the linearly pooled scores. The setup uses three heterogeneous LLMs across eight MMLU-Pro domains with debate dynamics; the manuscript does not address whether domain shifts, agent-specific behaviors, or non-random splits violate exchangeability (see abstract claim of guarantee 'without assumptions on individual model calibration' and the conformal calibration description). This is load-bearing for the central safety claims, as violation would render the reported coverage and interception rates without theoretical backing.

    Authors: We agree that the marginal coverage guarantee of split conformal prediction requires exchangeability between calibration and test instances with respect to the linearly pooled scores. The abstract phrase 'without assumptions on individual model calibration' specifically means that the base LLMs need not output calibrated probabilities; the linear opinion pool produces an aggregated score that is calibrated post-hoc by the conformal procedure. However, the manuscript does not explicitly discuss the exchangeability assumption for these aggregated scores or potential violations arising from domain shifts across MMLU-Pro domains, heterogeneous agent behaviors, or debate dynamics. In the revised version we will add a dedicated paragraph in the Methods section clarifying that (i) the guarantee is marginal and holds under exchangeability of the pooled scores, (ii) calibration and test splits are formed by random partitioning of instances (we will state this explicitly), and (iii) while domain heterogeneity could in principle affect exchangeability, the observed coverage remains within 1–2 points of the nominal level across all eight domains, providing empirical support for practical validity. We will also report per-domain coverage statistics to allow readers to evaluate robustness. These clarifications address the theoretical foundation without changing the empirical results or core claims. revision: yes

Circularity Check

0 steps flagged

No circularity: coverage guarantee from established split conformal prediction theory

full rationale

The paper aggregates verbalized probabilities from heterogeneous agents via a linear opinion pool and applies split conformal prediction to produce prediction sets with a marginal coverage guarantee. This guarantee follows directly from the standard theorem for split conformal prediction (requiring only exchangeability of calibration and test instances, which the paper states holds), without any reduction to quantities defined, fitted, or renamed within the paper's own equations. The 81.9% interception rate and accuracy improvements on conformal singletons are empirical observations on the MMLU-Pro evaluation, not forced by construction. No self-citation is load-bearing for the core claim, no uniqueness theorem is imported from the authors' prior work, and no ansatz is smuggled via citation. The derivation chain is self-contained against external benchmarks of conformal prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard theoretical assumptions of split conformal prediction and the validity of linear opinion pooling; no new free parameters or invented entities are introduced beyond the user-chosen alpha.

axioms (1)
  • domain assumption The calibration and test instances (debate examples) are exchangeable, as required for split conformal prediction to deliver marginal coverage guarantees.
    Implicit in the use of split conformal prediction; standard assumption for the method to hold.

pith-pipeline@v0.9.0 · 5608 in / 1446 out tokens · 46976 ms · 2026-05-10T18:32:29.839455+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

6 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Debate or vote: Which yields better decisions in multi-agent large language models? arXiv preprint arXiv:2508.17536, 2025

    On the steady state of continuous-time stochastic opinion dynamics with power-law confidence. Journal of Applied Probability, 58(3):746–772. Hyeong Kyu Choi, Xiaojin Zhu, and Sharon Li. 2025. Debate or vote: Which yields better decisions in multi-agent large language models? Preprint, arXiv:2508.17536. Roi Cohen, May Hamri, Mor Geva, and Amir Globerson...

  2. [2]

    Towards Understanding Sycophancy in Language Models

    Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems, volume 33, pages 3581–3591. Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, O...

  3. [3]

    Su, J., Luo, J., Wang, H., and Cheng, L.

    API is enough: Conformal prediction for large language models without logit-access. Preprint, arXiv:2403.01216. Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D. Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tun...

  4. [4]

    But by round 3, q̂ reaches 1.0 and the set size jumps to 9.76

    through rounds 0–2 because even the 99th-percentile calibration example retains some probability on the correct answer. But by round 3, q̂ reaches 1.0 and the set size jumps to 9.76. In contrast, domains like Law (60–66% accuracy) already have q̂ = 1.0 by round 1, as the larger fraction of errors produces extreme nonconformity scores earlier in the de...

  5. [5]

    The rate decreases across rounds: as debate refines social probabilities, the residual uncertainty on wrong answers increases, pushing them out of singleton sets

  6. [6]

    G.4 Net Safety Balance The net effect of conformal prediction on safety is overwhelmingly positive

    The rate is higher at α=0.10 than α=0.05: tighter sets are more likely to collapse to a wrong singleton. G.4 Net Safety Balance The net effect of conformal prediction on safety is overwhelmingly positive. At α=0.05 and round 3, conformal prediction: • Catches480 out of 586 wrong-consensus er- rors (81.9% rejection rate); • Introducesonly 2 wrong singleton...