pith. machine review for the scientific record.

arxiv: 2604.07667 · v1 · submitted 2026-04-09 · 💻 cs.AI · cs.MA · cs.SI

Recognition: 2 theorem links · Lean Theorem

From Debate to Decision: Conformal Social Choice for Safe Multi-Agent Deliberation

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 18:32 UTC · model grok-4.3

classification 💻 cs.AI · cs.MA · cs.SI
keywords conformal prediction · multi-agent debate · large language models · prediction sets · coverage guarantee · safe AI decision making · escalation policy · LLM agents

The pith

Applying split conformal prediction to aggregated probabilities from multiple LLM agents in debate creates guaranteed-coverage sets that allow safe autonomous action only on singletons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that multi-agent debate among LLMs can be made safer by adding a conformal decision layer after the debate ends. Agents' verbalized probabilities are pooled linearly and then calibrated with split conformal prediction to form sets that include the correct answer with high probability. This enables a policy to act on its own when the set is a singleton and to escalate to a human otherwise. The result is that most cases where agents wrongly agree are intercepted rather than acted upon, while the accuracy on the cases that do trigger action improves through selection. The guarantee holds without requiring any individual agent to be well-calibrated.

Core claim

Verbalized probability distributions from heterogeneous agents are aggregated via a linear opinion pool and calibrated with split conformal prediction, yielding prediction sets with a marginal coverage guarantee: the correct answer is included with probability at least 1 − α. A hierarchical action policy then maps singleton sets to autonomous action and larger sets to human escalation. On benchmarks with three agents, coverage stays close to target, 81.9 percent of wrong-consensus cases are intercepted at α = 0.05, and the remaining singleton cases reach 90.0 to 96.8 percent accuracy.

What carries the argument

The conformal prediction sets obtained after linear pooling of agent probabilities, which enforce marginal coverage and drive the singleton-versus-larger-set decision rule.
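A minimal sketch of this machinery, reconstructed from the abstract rather than from the authors' code. The function names and the exact nonconformity score (one minus the pooled probability of a candidate answer) are our assumptions; the finite-sample quantile correction is the standard split-conformal one.

```python
import numpy as np

def linear_opinion_pool(agent_probs, weights=None):
    """Weighted average of per-agent probability vectors (agents x answers)."""
    agent_probs = np.asarray(agent_probs, dtype=float)
    if weights is None:
        weights = np.full(agent_probs.shape[0], 1.0 / agent_probs.shape[0])
    pooled = np.tensordot(weights, agent_probs, axes=1)
    return pooled / pooled.sum()  # renormalize: verbalized probs may not sum to 1

def conformal_quantile(cal_pooled, cal_labels, alpha):
    """Split-conformal threshold from a held-out calibration set.
    Nonconformity score = 1 - pooled probability of the true answer."""
    scores = 1.0 - cal_pooled[np.arange(len(cal_labels)), cal_labels]
    n = len(scores)
    q_level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)  # finite-sample correction
    return np.quantile(scores, q_level, method="higher")

def prediction_set(pooled, q_hat):
    """All candidate answers whose nonconformity score is within the threshold."""
    return np.where(1.0 - pooled <= q_hat)[0]
```

Under exchangeability of calibration and test instances, sets built this way contain the correct answer with probability at least 1 − α; a singleton is the signal to act, anything larger is the signal to escalate.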

If this is right

  • 81.9 percent of wrong-consensus cases lead to escalation at α = 0.05 instead of erroneous autonomous action.
  • The accuracy of cases where the system acts autonomously reaches 90.0 to 96.8 percent, up to 22.1 percentage points above simple consensus-based stopping.
  • Empirical coverage of the correct answer stays within 1 to 2 points of the target level across the tested domains.
  • The fraction of cases handled autonomously versus escalated can be adjusted directly by tuning α.
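The last point, the α-tunable act-versus-escalate dial, reduces to a one-branch policy over conformal set sizes. A schematic sketch, with hand-written sets standing in for real conformal output (all names illustrative):

```python
def decide(pred_set):
    """Hierarchical action policy: act autonomously only when the
    conformal set is a singleton; otherwise escalate to a human."""
    if len(pred_set) == 1:
        return ("act", pred_set[0])
    return ("escalate", None)

# Illustrative sets only. Smaller alpha widens sets, shifting decisions
# from "act" toward "escalate" -- that shift is the automation-safety dial.
example_sets = [[2], [0, 3], [1], [0, 1, 4]]
decisions = [decide(s) for s in example_sets]
automation_rate = sum(d[0] == "act" for d in decisions) / len(decisions)
```

The safety property comes entirely from refusing to act on non-singletons, not from any change to the debate itself.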

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same post-hoc calibration approach could be tested on multi-agent systems that use different interaction protocols beyond debate.
  • High-stakes applications might accept a lower automation rate in exchange for the coverage guarantee.
  • Future work could explore whether non-linear pooling methods preserve the conformal guarantees as well as linear ones do.
  • Deployment would require maintaining a calibration set of past debates that remains similar to new inputs.

Load-bearing premise

The past calibration examples and the new debate cases must be similar enough in distribution that the coverage guarantee still holds for the combined probabilities.

What would settle it

Collecting a new set of debate instances, computing the conformal sets, and checking whether the fraction of cases where the correct answer is missing from the set exceeds the chosen alpha.
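Such an audit is a few lines once pooled probabilities for the new debates are in hand. A hedged sketch, assuming the nonconformity score is one minus the pooled probability of the true answer (function name and data are ours, not the paper's):

```python
import numpy as np

def empirical_miscoverage(test_pooled, test_labels, q_hat):
    """Fraction of fresh debate instances whose correct answer falls
    outside the conformal set, i.e. whose nonconformity score
    (1 - pooled probability of the true answer) exceeds q_hat."""
    scores = 1.0 - test_pooled[np.arange(len(test_labels)), test_labels]
    return float((scores > q_hat).mean())

# Hypothetical pooled probabilities and true labels for four new debates.
pooled = np.array([[0.9, 0.05, 0.05],
                   [0.2, 0.7, 0.1],
                   [0.4, 0.5, 0.1],
                   [0.3, 0.3, 0.4]])
labels = np.array([0, 1, 0, 2])
rate = empirical_miscoverage(pooled, labels, q_hat=0.55)
# The guarantee is in doubt if this rate persistently exceeds the chosen alpha.
```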

Figures

Figures reproduced from arXiv: 2604.07667 by Aijing Gao, Bing Zhu, Fangwei Han, Guanghui Wang, Guang Yang, Haochen Xie, Hengzhi Qiu, Jae Oh Woo, Mengdie Flora Wang, Qucy Wei Qiu, Yajing Huang, Ziyuan Li.

Figure 1. The cost of acting without calibrated refusal.
Figure 2. From debate output to deployment decision. Heterogeneous LLM agents debate for…
Figure 3. Coverage maintenance across debate rounds.
Figure 4. Calibrated refusal in action (α = 0.05). In both examples, agents unanimously converge to the wrong answer through social reinforcement. Consensus alone would commit this error to an automated action; conformal prediction keeps the correct answer inside the set and forces escalation to human review.
read the original abstract

Multi-agent debate improves LLM reasoning, yet agreement among agents is not evidence of correctness. When agents converge on a wrong answer through social reinforcement, consensus-based stopping commits that error to an automated action with no recourse. We introduce Conformal Social Choice, a post-hoc decision layer that converts debate outputs into calibrated act-versus-escalate decisions. Verbalized probability distributions from heterogeneous agents are aggregated via a linear opinion pool and calibrated with split conformal prediction, yielding prediction sets with a marginal coverage guarantee: the correct answer is included with probability ${\geq}\,1{-}\alpha$, without assumptions on individual model calibration. A hierarchical action policy maps singleton sets to autonomous action and larger sets to human escalation. On eight MMLU-Pro domains with three agents (Claude Haiku, DeepSeek-R1, Qwen-3 32B), coverage stays within 1--2 points of the target. The key finding is not that debate becomes more accurate, but that the conformal layer makes its failures actionable: 81.9% of wrong-consensus cases are intercepted at $\alpha{=}0.05$. Because the layer refuses to act on cases where debate is confidently wrong, the remaining conformal singletons reach 90.0--96.8% accuracy (up to 22.1pp above consensus stopping) -- a selection effect, not a reasoning improvement. This safety comes at the cost of automation, but the operating point is user-adjustable via $\alpha$.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces Conformal Social Choice, a post-hoc decision layer for multi-agent LLM debate. Verbalized probability distributions from heterogeneous agents are aggregated via linear opinion pool and calibrated using split conformal prediction to produce prediction sets with a marginal coverage guarantee (correct answer included with probability ≥1-α, without requiring individual model calibration). A hierarchical policy maps singleton sets to autonomous action and larger sets to human escalation. On eight MMLU-Pro domains with three fixed LLMs, empirical coverage stays within 1-2 points of target; at α=0.05, 81.9% of wrong-consensus cases are intercepted, and remaining singletons achieve 90.0-96.8% accuracy (up to 22.1pp gain over consensus stopping) via selection effects rather than improved reasoning.

Significance. If the coverage guarantee is valid under the experimental conditions, the work offers a practical, adjustable mechanism to mitigate risks of erroneous consensus in multi-agent deliberation by making failures actionable through calibrated escalation. It correctly applies established split conformal prediction theory to aggregated scores and demonstrates that safety gains arise from selective abstention. This could be relevant for high-stakes LLM agent deployment, with the α parameter providing a tunable automation-safety tradeoff.

major comments (1)
  1. [Conformal calibration procedure and experimental setup] The marginal coverage guarantee and the 81.9% wrong-consensus interception rate at α=0.05 rest on exchangeability of calibration and test instances for the linearly pooled scores. The setup uses three heterogeneous LLMs across eight MMLU-Pro domains with debate dynamics; the manuscript does not address whether domain shifts, agent-specific behaviors, or non-random splits violate exchangeability (see abstract claim of guarantee 'without assumptions on individual model calibration' and the conformal calibration description). This is load-bearing for the central safety claims, as violation would render the reported coverage and interception rates without theoretical backing.
minor comments (2)
  1. [Methods] Clarify the exact form of the linear opinion pool aggregation with an explicit equation or pseudocode, including how verbalized probabilities are normalized across agents.
  2. [Experiments] In results tables or figures showing coverage and accuracy, include the target 1-α line and report the number of calibration/test instances per domain to allow assessment of statistical reliability.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation of the work's significance and for identifying the need to explicitly address the exchangeability assumption underlying the conformal coverage guarantee. We respond to the major comment below and will revise the manuscript accordingly.

read point-by-point responses
  1. Referee: [Conformal calibration procedure and experimental setup] The marginal coverage guarantee and the 81.9% wrong-consensus interception rate at α=0.05 rest on exchangeability of calibration and test instances for the linearly pooled scores. The setup uses three heterogeneous LLMs across eight MMLU-Pro domains with debate dynamics; the manuscript does not address whether domain shifts, agent-specific behaviors, or non-random splits violate exchangeability (see abstract claim of guarantee 'without assumptions on individual model calibration' and the conformal calibration description). This is load-bearing for the central safety claims, as violation would render the reported coverage and interception rates without theoretical backing.

    Authors: We agree that the marginal coverage guarantee of split conformal prediction requires exchangeability between calibration and test instances with respect to the linearly pooled scores. The abstract phrase 'without assumptions on individual model calibration' specifically means that the base LLMs need not output calibrated probabilities; the linear opinion pool produces an aggregated score that is calibrated post-hoc by the conformal procedure. However, the manuscript does not explicitly discuss the exchangeability assumption for these aggregated scores or potential violations arising from domain shifts across MMLU-Pro domains, heterogeneous agent behaviors, or debate dynamics. In the revised version we will add a dedicated paragraph in the Methods section clarifying that (i) the guarantee is marginal and holds under exchangeability of the pooled scores, (ii) calibration and test splits are formed by random partitioning of instances (we will state this explicitly), and (iii) while domain heterogeneity could in principle affect exchangeability, the observed coverage remains within 1–2 points of the nominal level across all eight domains, providing empirical support for practical validity. We will also report per-domain coverage statistics to allow readers to evaluate robustness. These clarifications address the theoretical foundation without changing the empirical results or core claims. revision: yes

Circularity Check

0 steps flagged

No circularity: coverage guarantee from established split conformal prediction theory

full rationale

The paper aggregates verbalized probabilities from heterogeneous agents via a linear opinion pool and applies split conformal prediction to produce prediction sets with a marginal coverage guarantee. This guarantee follows directly from the standard theorem for split conformal prediction (requiring only exchangeability of calibration and test instances, which the paper states holds), without any reduction to quantities defined, fitted, or renamed within the paper's own equations. The 81.9% interception rate and accuracy improvements on conformal singletons are empirical observations on the MMLU-Pro evaluation, not forced by construction. No self-citation is load-bearing for the core claim, no uniqueness theorem is imported from the authors' prior work, and no ansatz is smuggled via citation. The derivation chain is self-contained against external benchmarks of conformal prediction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the standard theoretical assumptions of split conformal prediction and the validity of linear opinion pooling; no new free parameters or invented entities are introduced beyond the user-chosen alpha.

axioms (1)
  • domain assumption The calibration and test instances (debate examples) are exchangeable, as required for split conformal prediction to deliver marginal coverage guarantees.
    Implicit in the use of split conformal prediction; standard assumption for the method to hold.

pith-pipeline@v0.9.0 · 5608 in / 1446 out tokens · 46976 ms · 2026-05-10T18:32:29.839455+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

6 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Debate or vote: Which yields better decisions in multi-agent large language models? arXiv preprint arXiv:2508.17536, 2025

    On the steady state of continuous-time stochastic opinion dynamics with power-law confidence. Journal of Applied Probability, 58(3):746–772. Hyeong Kyu Choi, Xiaojin Zhu, and Sharon Li. 2025. Debate or vote: Which yields better decisions in multi-agent large language models? Preprint, arXiv:2508.17536. Roi Cohen, May Hamri, Mor Geva, and Amir Globerson...

  2. [2]

    Towards Understanding Sycophancy in Language Models

    Classification with valid and adaptive coverage. In Advances in Neural Information Processing Systems, volume 33, pages 3581–3591. Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, O...

  3. [3]

    Su, J., Luo, J., Wang, H., and Cheng, L.

    API is enough: Conformal prediction for large language models without logit-access. Preprint, arXiv:2403.01216. Katherine Tian, Eric Mitchell, Allan Zhou, Archit Sharma, Rafael Rafailov, Huaxiu Yao, Chelsea Finn, and Christopher D. Manning. 2023. Just ask for calibration: Strategies for eliciting calibrated confidence scores from language models fine-tun...

  4. [4]

    But by round 3, q̂ reaches 1.0 and the set size jumps to 9.76

    through rounds 0–2 because even the 99th-percentile calibration example retains some probability on the correct answer. But by round 3, q̂ reaches 1.0 and the set size jumps to 9.76. In contrast, domains like Law (60–66% accuracy) already have q̂ = 1.0 by round 1, as the larger fraction of errors produces extreme nonconformity scores earlier in the de...

  5. [5]

    The rate decreases across rounds: as debate refines social probabilities, the residual uncertainty on wrong answers increases, pushing them out of singleton sets

  6. [6]

    G.4 Net Safety Balance The net effect of conformal prediction on safety is overwhelmingly positive

    The rate is higher at α=0.10 than α=0.05: tighter sets are more likely to collapse to a wrong singleton. G.4 Net Safety Balance The net effect of conformal prediction on safety is overwhelmingly positive. At α=0.05 and round 3, conformal prediction: • Catches480 out of 586 wrong-consensus er- rors (81.9% rejection rate); • Introducesonly 2 wrong singleton...