pith. machine review for the scientific record.

arxiv: 2605.04236 · v2 · submitted 2026-05-05 · 💻 cs.LG

Recognition: 3 theorem links


Adaptive Consensus in LLM Ensembles via Sequential Evidence Accumulation: Automatic Budget Identification and Calibrated Commit Signals

Roberto Medina

Pith reviewed 2026-05-08 18:56 UTC · model grok-4.3

classification 💻 cs.LG
keywords adaptive stopping · LLM ensemble · consensus detection · deliberation budget · routing partition · reasoning accuracy · sequential evidence accumulation

The pith

DASE adaptive stopping commits LLM ensembles on genuine consensus to produce a routing partition complementary to verbalized confidence.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces DASE as a stopping heuristic for iterative LLM ensemble deliberation. It commits early when persistence and spatial heuristics detect consensus and falls back to global frequency on fragmented evidence. This addresses the problem that ensembles gain accuracy only up to a boundary, after which further deliberation hurts performance. DASE is shown to create a commit-type routing signal that partitions answers differently from single-call verbalized confidence yet matches its accuracy gap on controlled benchmarks. The work further claims that the adaptive stopping decision itself, not the amount of injected context, drives the observed accuracy improvements and automatically identifies effective budgets.

Core claim

DASE produces a commit-type routing partition complementary to verbalized single-call confidence. Adaptive stopping drives accuracy gains rather than injection bandwidth. On a contamination-controlled corpus a 120B ensemble achieves a 24.8 pp routing gap between high- and low-confidence partitions, statistically equivalent to a standard verbalized-confidence baseline at matched coverage. DASE-Spatial at half-width 8 ties the performance of a dense debate baseline while using one-tenth the injection bandwidth and identifies that budget automatically. Injection-based methods show a retrospective accuracy-versus-inference inverted-U on the tested benchmarks.

What carries the argument

DASE (Deliberative Adaptive Stopping Ensemble), a sequential evidence-accumulation heuristic that commits on frequency-based consensus via persistence and spatial (arena half-width) rules with a global-frequency fallback.
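
This page describes DASE-Spatial only through its walls: contact with the right wall (x = +W) fires a consensus commit, and contact with the left wall (x = −W) triggers the global-frequency fallback. A minimal Python sketch of such an accumulator follows. The step rule (move +1 when a round's plurality answer matches the running global plurality, −1 otherwise), the names `run_arena` and `ArenaResult`, and the `budget_exhausted` outcome are assumptions made for illustration; the paper's actual update rule is not reproduced here and may differ.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class ArenaResult:
    """Hypothetical deliberation record returned by the sketch."""
    answer: str
    commit_type: str  # "right_wall", "left_wall", or "budget_exhausted"
    rounds_used: int
    trajectory: list = field(default_factory=list)


def run_arena(rounds, half_width=8):
    """Walk a 1-D accumulator until it touches +half_width or -half_width.

    rounds: iterable of lists, one list of worker answers per round.
    ASSUMPTION: the step rule below is illustrative, not the paper's.
    """
    position = 0
    global_counts = Counter()
    trajectory = []
    for answers in rounds:
        if not answers:
            continue  # skip rounds in which no worker answered
        global_counts.update(answers)
        round_top = Counter(answers).most_common(1)[0][0]
        leader = global_counts.most_common(1)[0][0]
        # Agreement with the running plurality moves right, dissent moves left.
        position += 1 if round_top == leader else -1
        trajectory.append(position)
        if position >= half_width:
            # Right wall: consensus commit on the current leader.
            return ArenaResult(leader, "right_wall", len(trajectory), trajectory)
        if position <= -half_width:
            # Left wall: fall back to the global-frequency winner.
            return ArenaResult(leader, "left_wall", len(trajectory), trajectory)
    if not global_counts:
        raise ValueError("no answers were provided")
    leader = global_counts.most_common(1)[0][0]
    return ArenaResult(leader, "budget_exhausted", len(trajectory), trajectory)


if __name__ == "__main__":
    unanimous = [["42"] * 5] * 3
    print(run_arena(unanimous, half_width=2))
```

In this sketch a smaller `half_width` stops sooner, which is how the arena size doubles as a deliberation budget. The returned `trajectory` is one way to realize a machine-readable deliberation record.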

If this is right

  • Adaptive stopping identifies effective deliberation budgets automatically without manual search.
  • The commit signal from DASE can be combined with verbalized confidence because the two mechanisms disagree on roughly one-quarter of routing decisions.
  • Bandwidth expansion yields negligible gains once stopping is adaptive; the inverted-U pattern in static-injection methods is a direct consequence of crossing the accuracy boundary.
  • Every DASE decision carries a machine-readable deliberation record that can be inspected or audited.
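
The complementarity point above rests on how often the two routing signals disagree. As a hedged illustration, the sketch below computes a disagreement rate and the accuracy of each cell in the resulting 2×2 partition. The function names and the toy data are hypothetical; nothing here reproduces the paper's evaluation code or results.

```python
def disagreement_rate(commit_high, verbal_high):
    """Fraction of items on which the two binary routing signals differ."""
    if len(commit_high) != len(verbal_high) or not commit_high:
        raise ValueError("signals must be non-empty and of equal length")
    differing = sum(c != v for c, v in zip(commit_high, verbal_high))
    return differing / len(commit_high)


def hybrid_cells(commit_high, verbal_high, correct):
    """Map each (commit, verbal) cell to (item count, accuracy)."""
    tallies = {}
    for c, v, ok in zip(commit_high, verbal_high, correct):
        n, hits = tallies.get((c, v), (0, 0))
        tallies[(c, v)] = (n + 1, hits + int(ok))
    return {cell: (n, hits / n) for cell, (n, hits) in tallies.items()}


if __name__ == "__main__":
    # Toy data, invented for illustration only.
    commit = [True, True, False, False]
    verbal = [True, False, False, True]
    correct = [True, True, False, True]
    print(disagreement_rate(commit, verbal))
    print(hybrid_cells(commit, verbal, correct))
```

A hybrid router could, for example, treat only the cell where both signals are high as auto-accept and send the remaining cells to review; whether that helps is an empirical question the page does not settle.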

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach suggests that future ensemble systems could treat stopping as a first-class learned policy rather than a fixed hyper-parameter.
  • The complementary nature of consensus-based and verbalized signals implies that hybrid routers may outperform either alone on the same compute budget.
  • The inverted-U observation raises the possibility that over-deliberation effects appear in other sequential reasoning pipelines and could be measured with similar partition-gap metrics.

Load-bearing premise

Frequency-based consensus detection genuinely reflects answer quality rather than model-specific artifacts or test-set biases.

What would settle it

An experiment on a fresh benchmark in which the observed routing gap between DASE partitions vanishes or in which consensus frequency fails to correlate with correctness.

Figures

Figures reproduced from arXiv: 2605.04236 by Roberto Medina.

Figure 1: DASE-Spatial arena trajectories (W=8, mixed ensemble, pilot N=100). Right-wall contact (x = +8) fires a consensus commit, used directly. Left-wall contact (x = −8) triggers the global-frequency fallback. All problems that do not reach the right wall are flagged for human review. Case A (top left): the ensemble corrects an early wrong plurality and reaches right-wall consensus (truth: 158). Case B (top righ…
Figure 2: Compute-matched comparison at the hesitation-region peak (…
Figure 3: Accuracy vs. inferences, GPQA (N=546). All injection-based baselines peak near 30 inferences and decay thereafter (retrospective observation). DASE-Spatial sits at or above all baselines at every compute budget.
Significance (… + BH-FDR, W=8 reference): SC70 (ns); Debate-Dense R12 (*); Debate-Sparse R12 (**); BoN-V@57 (***); DASE W=4 (**). Full accuracy-vs-inference curves are in Appendix F.
Figure 4: Compute-matched comparison, AIME-300 (N=300). DASE-Spatial W=8 (65.0%, [expl.]) and W=4 (59.3%) are both shown. SC70: 61.7% (ns); Debate-Dense R12: 59.3% (*); Debate-Sparse R12: 59.0% (**); BoN-V@57: 43.0% (***). Injection bandwidth effect: +0.3 pp (ns). Adaptive stopping effect: +6.0 pp.
Figure 5: DASE-Spatial (W=2) vs. Claude Opus 4.6 Standard, AIME 2010–2023 (N=261, contamination-controlled, 3 seeds). (a) Running accuracy: DASE 84.5%, Opus 84.5%, McNemar p=1.000 (ns; 95% CI: ≈ ±5 pp). (b) Commit signal (primary): right-wall 94.7% [92.4%, 97.0%]; left-wall 75.3% [71.6%, 79.0%], 19.3 pp gap. (c) DASE wins 54 (21%), Opus wins 42 (16%), 165 equal (63%). Primary finding: structured confidence partiti…
Figure 6: Latency analysis, AIME-300 (120B ensemble vs. single-call, …
Figure 7: Latency analysis, AIME-300 (70B mixed ensemble, …
Figure 8: SC (Qwen3-80B-A3B only) vs. DASE Neuro (…
Figure 9: SC and IM-SC vs. DASE (mixed ensemble, N=100). DASE-Spatial: 86.0%; held-out N=98: 86.7%.
[Plot: "Accuracy vs. Inferences: Vanilla SC, IM-SC, and DASE Mixed Ensemble (3x Qwen3-70B + 2x Llama3-70B)"; x-axis: Total Inferences; y-axis: Mean Accuracy; series: Vanilla SC (independent workers), IM-SC (pool=5…; labeled points: DASE Heuristic (k=2) 84.0%, DASE-Spatial (W=8) 86.0%.]
Figure 10: Accuracy vs. inferences (mixed ensemble, …
Figure 11: DASE-Spatial reasoning dynamics (W=8, mixed ensemble, N=100).
C Frontier Accuracy Comparisons at the 120B Tier
C.1 Recency-Control Methodology
Year labels were sourced from publicly available AIME corpus metadata. A positional keep-mask was applied uniformly to all DASE and Opus seeds, retaining N=261 problems spanning 2010–2023 (excluding 9 AIME 2024 problems and 30 AIME 2026 problems). The keep-mask fil…
Figure 12: DASE-Spatial (W=2) vs. Opus 4.6 Standard, AIME 2010–2026 (N=300; comparisons potentially inflated by differential 2026 exposure). DASE 85.0%, Opus 82.4%, McNemar p=0.115 (ns). Right-wall 95.2%, left-wall 75.5%, gap 19.7 pp.
Figure 13: Accuracy vs. inferences, AIME-300 (N=300). Debate-Dense and Debate-Sparse are nearly indistinguishable (+0.3 pp bandwidth effect, ns): the full 6.0 pp Debate-to-DASE gap is attributable to adaptive stopping alone. SC rises monotonically; all injection-based baselines decay (retrospective observation).
G Full Ablation Bars and Studies
G.1 GPQA-Extended Ablation Bar
Figure 14: Full GPQA-Extended ablation (N=546, McNemar + BH-FDR). Both W=4 and W=8 achieve 70.0% (statistically equivalent).
Figure 15: Full AIME-300 ablation (N=300, McNemar + BH-FDR). W=8 significantly outperforms W=4 (adj p=0.0042).
G.3 Accumulator Boundary Ablation
[Plot: "DASE-Spatial Arena Size Ablation (N=100)", mean accuracy with significance relative to Wall = 8 (ref). Wall = 2 (14 inferences): 77.0%, * p=0.0117 (adj=0.0234); Wall = 4 (37 inferences): 84.0%, ns p=0.6875 (adj=0.6875); Wall = 8 (58 inferences): 86.0% (ref).]
Figure 16: Arena-size ablation (70B, N=100, W=8 reference). W=2 falls significantly below W=8 (adj p=0.0234); W=4 is equivalent on the pilot corpus (adj p=0.69) but significantly below on AIME-300 (adj p=0.0042).
G.4 Component Ablations
Consensus S1 (67.0%), DASE-3 workers (75.0%), and DASE Neuro (84.0%) demonstrate that sequential evidence accumulation contributes more than raw worker count. The flat-threshold vari…
Figure 17: Component ablation (N=100).
G.5 Ensemble Composition: Mixed vs. Homogeneous
The mixed heuristic (84.0%) outperforms the Qwen3-80B-A3B-only ensemble (77.0%). Llama 3.3-70B’s contribution is adversarial dissent: its responses prevent premature consensus, enabling error correction.
[Plot bar labels: Heuristic (3× Qwen3 + 2× Llama3) (14 inferences); Heuristic (5× Qwen3) (21 inferences); Spatial (3× Qwen3 + 2× Llama3) (58 infe…]
Figure 18: Mixed vs. homogeneous ensemble, DASE Neuro (…
Figure 19: Injection ablation, DASE Neuro.
[Plot: "Spatial: Mix vs Qwen · Injection vs No Injection …", mean accuracy. Bar labels, in source order: Spatial (Mix · Injection) (58 inferences); Spatial (Mix · No Injection) (57 inferences); Spatial (Qwen · Injection) (58 inferences); Spatial (Qwen · No Injection) (57 inferences). Values, in source order: 86.0% ** p=0.0010 (adj=0.0029); 79.0% ns p=0.2188 (adj=0.2266); 75.0% (ref); 70.0% ns p=0.2266 (adj=0.2266).]
Figure 21: Worker quality and injection dynamics. Mixed ensemble: 3…
Figure 22: k-ablation (AIME N=100). All k≥2 plateau. k=2 at ≈14 inferences is recommended.
G.9 Information Asymmetry Control: Round-Matched Analysis
Information-Matched Self-Consistency (IM-SC) injects a frozen round-1 candidate pool into subsequent workers with no multi-round updating and no early stopping.
[Plot: x-axis: Total Inferences (rounds × 5 workers); y-axis: Accuracy (%).]
Figure 23: Round-matched analysis (N=100). DASE-Spatial gains 9 problems (p=0.0039); DASE Neuro gains 26 (p<0.0001); zero regressions in both cases.
Figure 24: AIME, 5×Llama 3.1-8B. IM-SC plateaus at 4.0%, well below DASE-Spatial’s 9.0%.
G.11 No-Consensus Boundary Commit Strategy
Figure 25: Parameter sensitivity sweep on AIME-300 (…
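
Several captions above quote raw and adjusted p-values under the label "McNemar + BH-FDR". The sketch below shows one way such numbers can be computed, assuming an exact two-sided (binomial) McNemar test and the standard Benjamini-Hochberg step-up adjustment; the page does not say which McNemar variant the paper used. The function names are illustrative. The inputs come from the captions: 9 and 26 gains with zero regressions (Figure 23), and the raw p-values 0.0117 and 0.6875 from the arena-size ablation, treated here as a family of two tests (an assumption).

```python
from math import comb


def mcnemar_exact(gains, regressions):
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = gains + regressions
    if n == 0:
        return 1.0  # no discordant pairs, nothing to test
    k = min(gains, regressions)
    # Binomial(n, 0.5) tail at or below the smaller discordant count.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    print(mcnemar_exact(9, 0))    # 9 gains, zero regressions
    print(mcnemar_exact(26, 0))   # 26 gains, zero regressions
    print(bh_adjust([0.0117, 0.6875]))
```

Under these assumptions the outputs agree with the captions: about 0.0039 for 9 gains, well below 0.0001 for 26 gains, and adjusted values of 0.0234 and 0.6875.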
original abstract

Large Language Model ensembles improve reasoning accuracy, but only up to a performance boundary beyond which additional deliberation degrades accuracy. We introduce DASE (Deliberative Adaptive Stopping Ensemble), a stopping heuristic for iterative ensemble deliberation that commits early on genuine consensus and applies a global-frequency fallback on fragmented evidence. We make three contributions. (1) DASE produces a commit-type routing partition that generalises across benchmarks and is complementary to verbalized single-call confidence. On GPQA-Extended (N=546, 70B ensemble), the partition yields a 39.5 pp routing gap (right-wall 81.1% vs. left-wall 41.5%). On AIME 2010-2023 (N=261, 120B ensemble, 3 seeds), right-wall commits reach 98.3% accuracy vs. left-wall 72.8% (25.5 pp gap), statistically equivalent to Opus 4.6 Standard verbalized confidence at matched coverage (25.7 pp gap; bootstrap p=0.873); the two mechanisms disagree on 37% of routing assignments. (2) Adaptive stopping, not injection bandwidth, drives accuracy. On AIME-300, bandwidth accounts for only 0.3 pp (ns). On GPQA-Extended at the 120B tier, sparse injection ($\approx15$ tokens/worker/round) achieves 70.9% with a 30.7 pp routing gap; dense injection ($\approx600$ chars/worker/round) achieves 72.2% but with halved right-wall coverage and a narrower 18.9 pp gap. (3) Injection-based methods exhibit an inverted-U accuracy-vs-inference trajectory; this pattern is hypothesis-generating.
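
The routing gap quoted throughout is the accuracy of right-wall commits minus the accuracy of left-wall fallbacks, in percentage points. A small sketch of that computation from per-problem records follows; the record format, the function name `routing_gap`, and the toy data are assumptions for illustration, not the paper's code or results.

```python
def routing_gap(records):
    """records: iterable of (commit_type, is_correct) pairs.

    commit_type must be "right_wall" or "left_wall" (hypothetical labels).
    """
    totals = {"right_wall": 0, "left_wall": 0}
    hits = {"right_wall": 0, "left_wall": 0}
    for commit_type, is_correct in records:
        if commit_type not in totals:
            raise ValueError(f"unknown commit type: {commit_type!r}")
        totals[commit_type] += 1
        hits[commit_type] += int(is_correct)
    if 0 in totals.values():
        raise ValueError("both partitions must be non-empty")
    right_acc = hits["right_wall"] / totals["right_wall"]
    left_acc = hits["left_wall"] / totals["left_wall"]
    return {
        "right_acc": right_acc,
        "left_acc": left_acc,
        "gap_pp": 100 * (right_acc - left_acc),
        "right_coverage": totals["right_wall"] / sum(totals.values()),
    }


if __name__ == "__main__":
    # Toy records, invented for illustration only.
    toy = (
        [("right_wall", True)] * 4 + [("right_wall", False)]
        + [("left_wall", True)] * 2 + [("left_wall", False)] * 2
    )
    print(routing_gap(toy))
    # Arithmetic on the abstract's AIME figures: 98.3% vs. 72.8%.
    print(round(98.3 - 72.8, 1))
```

Applied to the rounded GPQA-Extended figures in the abstract, the same subtraction gives 81.1 − 41.5 = 39.6 pp rather than the reported 39.5 pp, presumably because the reported gap was computed from unrounded accuracies.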

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces DASE (Deliberative Adaptive Stopping Ensemble), a stopping heuristic for iterative LLM ensemble deliberation that commits early on detected consensus (via persistence or spatial/arena heuristics) or falls back to global-frequency evidence. It evaluates two configurations on contamination-controlled AIME 2010-2023 (N=254, 3 seeds) and GPQA-Extended, claiming DASE yields a commit-type routing partition with a 24.8 pp accuracy gap (97.1% right-wall vs. 73.6% left-wall) that is statistically equivalent to Opus 4.6 verbalized confidence (25.7 pp gap; bootstrap CI [-12.0, +10.3] pp, p=0.873), with 27% disagreement establishing complementarity. It further claims adaptive stopping (not injection bandwidth) drives gains, that DASE-Spatial matches Debate-Dense optimal performance at 1/10th bandwidth while auto-identifying the budget (W=8 outperforms W=4), and that injection methods show a retrospective accuracy-vs-inference inverted-U.

Significance. If the empirical results hold under proper verification, this provides a practical, machine-readable method for automatic budget identification and calibrated commit signals in LLM ensembles, complementary to verbalized single-call confidence. The finding that stopping effects dominate bandwidth (0.3 pp ns on AIME-300; 5.0 pp vs 4.4 pp on GPQA) and the controlled-corpus demonstration of routing utility could inform more efficient extended-thinking architectures. Strengths include the use of a contamination-controlled benchmark, explicit statistical reporting, and the hypothesis-generating inverted-U observation.

major comments (1)
  1. [AIME 2010-2023 evaluation (abstract and results)] The claim that DASE's 24.8 pp routing gap is 'statistically equivalent' to the 25.7 pp gap from Opus 4.6 verbalized confidence (abstract and AIME results) rests on a non-significant difference test (p=0.873) with bootstrap CI on the difference of [-12.0, +10.3] pp. Non-rejection of the null hypothesis of no difference does not establish equivalence; the interval width permits differences of up to ~12 pp that would undermine the 'complementary yet comparable' interpretation and the routing-utility conclusions. This is load-bearing for the central complementarity claim (27% disagreement) and headline performance comparison.
minor comments (1)
  1. [Methods] The exact algorithmic definitions and pseudocode for the persistence heuristic and DASE-Spatial (arena half-width W) should be moved to the main text or a prominent figure for reproducibility, rather than relying solely on appendix references.
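
The major comment's distinction between non-significance and equivalence can be made concrete with an interval-inclusion check: equivalence within a margin δ is supported only when the whole confidence interval for the difference lies inside (−δ, +δ). The sketch below applies that check to the reported interval [-12.0, +10.3] pp. The margins tried are hypothetical, since the page states none, and the confidence level of the reported interval is not given here (a TOST procedure at level α corresponds to a 1 − 2α interval).

```python
def interval_supports_equivalence(ci_low, ci_high, margin):
    """True when the interval lies strictly inside (-margin, +margin)."""
    if margin <= 0:
        raise ValueError("margin must be positive")
    if ci_low > ci_high:
        raise ValueError("ci_low must not exceed ci_high")
    return -margin < ci_low and ci_high < margin


def smallest_supporting_margin(ci_low, ci_high):
    """Margins must exceed this value for the interval to fit inside."""
    return max(abs(ci_low), abs(ci_high))


if __name__ == "__main__":
    ci = (-12.0, 10.3)  # bootstrap CI on the gap difference, in pp
    for margin in (5.0, 10.0, 12.5):  # hypothetical equivalence margins
        print(margin, interval_supports_equivalence(*ci, margin))
    print(smallest_supporting_margin(*ci))
```

With this interval, only margins above 12.0 pp pass the check, which is the sense in which p=0.873 alone does not establish equivalence.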

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The single major comment concerns the statistical framing of our routing-gap comparison, which we address directly below by revising the manuscript wording.

point-by-point responses
  1. Referee: The claim that DASE's 24.8 pp routing gap is 'statistically equivalent' to the 25.7 pp gap from Opus 4.6 verbalized confidence (abstract and AIME results) rests on a non-significant difference test (p=0.873) with bootstrap CI on the difference of [-12.0, +10.3] pp. Non-rejection of the null hypothesis of no difference does not establish equivalence; the interval width permits differences of up to ~12 pp that would undermine the 'complementary yet comparable' interpretation and the routing-utility conclusions. This is load-bearing for the central complementarity claim (27% disagreement) and headline performance comparison.

    Authors: We agree that non-rejection of the null does not establish equivalence and that the CI width of approximately 22 pp leaves room for differences that could qualify the strength of the comparison. In the revised manuscript we will remove the phrase 'statistically equivalent' from the abstract and AIME results section. We will instead report that the difference between the two observed gaps is not statistically significant (p=0.873) with bootstrap CI [-12.0, +10.3] pp, while noting the interval width as a limitation on strong claims of comparability. The primary support for complementarity remains the independent 27% disagreement rate in routing assignments, which does not rely on the gap magnitudes being identical. These changes will be implemented in the next version. revision: yes

Circularity Check

0 steps flagged

No significant circularity; all claims rest on direct empirical measurements

full rationale

The manuscript contains no derivation chain, equations, or ansatzes. All three contributions are stated as empirical outcomes from benchmark runs on a contamination-controlled AIME corpus (N=254) and GPQA-Extended, using explicit accuracy, routing-gap, and bootstrap-CI figures. No parameter is fitted and then relabeled as a prediction; no uniqueness theorem or prior self-citation is invoked to justify the stopping heuristic; and the reported 24.8 pp vs. 25.7 pp comparison is a direct statistical test on observed data rather than a constructed identity. The work is therefore self-contained against external benchmarks and receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that answer frequency reliably signals consensus quality in LLM ensembles, plus benchmark-tuned parameters for the spatial window.

free parameters (1)
  • W (arena half-width) = 8
    Evaluated on AIME-300; W=8 selected as optimal after showing statistically significant improvement over W=4.
axioms (1)
  • domain assumption: Frequency of answers across ensemble iterations indicates genuine consensus quality
    Invoked as the basis for the global-frequency fallback and early-commit decision.
invented entities (1)
  • DASE (no independent evidence)
    purpose: Adaptive stopping heuristic for iterative LLM ensemble deliberation
    Newly proposed method without independent external validation beyond the reported experiments.

pith-pipeline@v0.9.0 · 5668 in / 1381 out tokens · 83715 ms · 2026-05-08T18:56:01.251247+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.