Beyond Consensus: Trace-Level Synthesis in Mixture of Agents

Felix Wyss; Praitayini Kanakaraj; Shreyas Fadnavis

arxiv: 2605.29116 · v1 · pith:AFEVFAXKnew · submitted 2026-05-27 · 💻 cs.AI

Pith reviewed 2026-06-29 11:47 UTC · model grok-4.3

keywords: mixture of agents · reasoning traces · aggregation paradox · LLM synthesis · multi-agent systems · consensus methods · trace complementarity · input perturbations

The pith

An LLM aggregator recovers correct answers from unanimous agent errors by synthesizing full reasoning traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard multi-agent LLM practice compresses outputs to majority votes or summaries, treating agreement as the end point. The paper shows this discards useful information: an aggregator that reads complete reasoning traces can correct mistakes even when every agent agrees on the wrong final answer. Beneficial corrections drawn from minority traces consistently outweigh any errors introduced during synthesis. The authors introduce Self-Consistent Mixture of Agents, which creates trace diversity via semantic-preserving input perturbations and applies anchored refinement to ensure the result never falls below the majority baseline. A single model using this trace-level approach outperforms pools of heterogeneous models on structured reasoning, PhD-level science, competition math, and programming tasks.

Core claim

When multiple LLM agents solve the same problem, an aggregator that reads their complete reasoning traces recovers correct solutions even under unanimous agreement because beneficial corrections from trace-level complementarity consistently outweigh harmful ones. Majority voting reaches a performance ceiling since error correlations stay identical regardless of input perturbations. The gain arises from assembling correct intermediate steps from minority chains that voting discards. These observations motivate always synthesizing from traces rather than gating on consensus.

What carries the argument

The aggregation paradox: the empirical observation that an LLM aggregator produces net-beneficial corrections when synthesizing full reasoning traces, even when all agents agree on the wrong answer.
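
One way to make the paradox precise, offered as a reader's sketch rather than the paper's own statement: let $V$ and $S$ be the events that majority vote and trace-level synthesis, respectively, answer a problem correctly. The symbols $V$, $S$, $R$, and $C$ are notation assumed here for illustration. The accuracy gap then splits exactly into a recovery term minus a corruption term:

```latex
\begin{align*}
\Pr(S) - \Pr(V)
  &= \bigl[\Pr(S \cap V) + \Pr(S \cap \neg V)\bigr]
   - \bigl[\Pr(S \cap V) + \Pr(\neg S \cap V)\bigr] \\
  &= \underbrace{\Pr(S \cap \neg V)}_{R\ \text{(recovery)}}
   - \underbrace{\Pr(\neg S \cap V)}_{C\ \text{(corruption)}} .
\end{align*}
```

Under this reading, "net-beneficial" means $R > C$. The identity holds for any pair of aggregators, so it does not explain why $R$ should exceed $C$; that part of the claim is empirical.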

If this is right

Majority voting has a fixed performance ceiling because error correlations remain unchanged by input perturbations.
Trace-level synthesis assembles correct intermediate steps from minority chains that consensus methods discard.
Anchored refinement supplies provable non-degradation guarantees when always synthesizing instead of gating on agreement.
A single model with perturbation-induced trace variation outperforms heterogeneous model pools on structured reasoning, science, math, and programming tasks.
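
The anchored-refinement idea in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `anchored_refine`, `support`, and the `revise` callback (a stand-in for an LLM call) are hypothetical names, and answers are compared as plain strings.

```python
from collections import Counter
from typing import Callable, List


def anchored_refine(
    answers: List[str],
    revise: Callable[[int, str, str], str],
) -> List[str]:
    """Revise minority answers while the majority answer stays frozen.

    `revise(i, current, anchor)` is a hypothetical stand-in for an LLM call
    that shows agent i the anchor and returns its (possibly changed) answer.
    """
    # Ties resolve to the answer seen first (Counter ordering, Python 3.7+).
    anchor, _ = Counter(answers).most_common(1)[0]
    refined = []
    for i, answer in enumerate(answers):
        if answer == anchor:
            refined.append(answer)  # frozen: majority holders are never re-queried
        else:
            refined.append(revise(i, answer, anchor))
    return refined


def support(answers: List[str], target: str) -> int:
    """Number of agents whose answer equals `target`."""
    return sum(a == target for a in answers)


before = ["B", "A", "B", "C", "B"]
# Toy reviser: agent 1 adopts the anchor, everyone else keeps their answer.
after = anchored_refine(before, lambda i, cur, anchor: anchor if i == 1 else cur)
print(support(before, "B"), support(after, "B"))
```

In this sketch the anchor's support cannot fall, because the agents holding it are never revised. That is a statement about consensus, not correctness: a wrong anchor is preserved just as faithfully, which is presumably why synthesis still runs afterwards.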

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Perturbation-induced trace diversity within one model may substitute for the cost of maintaining multiple distinct models.
The same trace-complementarity principle could apply to iterative agent systems where synthesis feeds back into new trace generation.
If the aggregator reliably filters steps, the approach suggests that error detection along reasoning chains is a learnable capability worth isolating and improving.

Load-bearing premise

The aggregator LLM can reliably detect and retain correct intermediate steps from minority traces while discarding errors.

What would settle it

Measure whether the aggregator's synthesized answers on problems where all agents agree on an incorrect solution show higher accuracy than majority vote, with the rate of beneficial corrections exceeding harmful ones across a held-out benchmark.
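
A minimal sketch of that measurement, assuming per-problem records with the agents' final answers, the synthesized answer, and a gold label. The `Problem` record and `flip_report` helper are hypothetical names, and answer equality is plain string comparison.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Problem:
    agent_answers: List[str]  # final answer of each agent
    synthesized: str          # answer produced by the aggregator
    gold: str                 # reference answer


def majority(answers: List[str]) -> str:
    """Most common answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def flip_report(problems: List[Problem], unanimous_only: bool = False) -> Dict[str, int]:
    """Count outcome flips of synthesis relative to majority vote."""
    beneficial = harmful = same_outcome = 0
    for p in problems:
        if unanimous_only and len(set(p.agent_answers)) != 1:
            continue
        vote_ok = majority(p.agent_answers) == p.gold
        synth_ok = p.synthesized == p.gold
        if synth_ok and not vote_ok:
            beneficial += 1      # synthesis recovered a wrong vote
        elif vote_ok and not synth_ok:
            harmful += 1         # synthesis corrupted a correct vote
        else:
            same_outcome += 1    # both right or both wrong
    return {"beneficial": beneficial, "harmful": harmful, "same_outcome": same_outcome}


problems = [
    Problem(["A", "A", "A"], "B", "B"),  # unanimous and wrong, recovered
    Problem(["A", "A", "A"], "C", "A"),  # unanimous and right, corrupted
    Problem(["A", "B", "A"], "B", "B"),  # contested, recovered
    Problem(["A", "A", "A"], "A", "A"),  # unchanged
]
print(flip_report(problems))
print(flip_report(problems, unanimous_only=True))
```

The claim would be supported if `beneficial` exceeds `harmful` on held-out data, and in particular if `beneficial` is non-zero when `unanimous_only=True` is combined with a filter for wrong votes.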

Figures

Figures reproduced from arXiv: 2605.29116 by Felix Wyss, Praitayini Kanakaraj, Shreyas Fadnavis.

**Figure 1.** SC-MoA on a GPQA-Diamond chemistry problem. Five semantic perturbations of the same question produce structurally different reasoning traces: two identify the correct answer (A) via different proof strategies, while three reach wrong answers through plausible but incomplete arguments. Anchored refinement freezes the majority answer and revises minorities, raising consensus from 2/5 to 5/5. Trace-level synt…

**Figure 2.** SC-MoA pipeline. Phase 1 generates N proposals via semantic perturbations and inner self-consistency. Phase 2 clusters by answer equivalence and refines minority traces with the majority answer frozen. Phase 3 synthesizes all N traces and the consensus record into a single output, with no early exit even at unanimous consensus. Section 3 imposes four design constraints: cheap perturbation suffices for div…

**Figure 3.** Compute–accuracy Pareto frontier. SC-MoA k=1 (N=4, ∼5 calls) exceeds SC(k=10) on GPQA (74.7% vs. 70.7%) at roughly half the compute; on LCB-Hard it slightly trails SC greedy+tag (55.6% vs. 57.3%). N=5, k=2 (11 calls) reaches 73.2% on GPQA and 62.6% on LCB-Hard, extending the frontier. Note: k=1 and k=2 use different N; points are not directly comparable across the two operating points. The aggregation pa…

**Figure 4.** Anchored refinement never degrades consensus (N=5, k=2, 867 problems). (a) Transition matrix: zero degradations. (b) Low-consensus problems benefit most. (c) Net accuracy gain per benchmark. [Plot text captured with the caption, apparently from the next figure: panel (a) "Trace diversity predicts beneficial flips"; axis: trace diversity $\bar{D}_t$ (TF-IDF cosine); groups: Beneficial, Non-flip, Harmful; benchmarks: GPQA-Diamond, BBH, AIME 2022–24, LCB-Hard.]

**Figure 5.** Beneficial flips occur when traces are diverse; recovery dominates corruption on all benchmarks. Mechanistic config: N=4, k=5 (21 calls; §5). (a) Trace diversity $\bar{D}_t$ stratified by outcome: beneficial flips show significantly higher diversity on GPQA, AIME, and LCB-Hard (p<0.01). (b) Synthesis decomposition (Proposition 1): recovery dominates corruption on all four benchmarks, significantly on three (p<0.0…

**Figure 6.** SC-MoA generalizes across models and scales with N (N=4, k=5). (a) Gains across 6 families (8B–480B). (b) MM-MoA degrades with weak models; Multi-SC-MoA filters them. (c) SC-MoA exceeds SC at every N; SC plateaus.
[Body text captured with the caption:] choice: cross-perturbation agreement is 82.0% (Fleiss’ κ=0.681; Appendix X.4); full ablation suite in Appendix X). All propositions rest on idealized assumptions; they are design heuristics, not …

**Figure 7.** Consensus as calibrated confidence (N=5, k=2). (a) Reliability: QA tracks the diagonal (ECE = 0.064–0.154); LCB-Hard is overconfident (ECE = 0.288). (b) Selective prediction: abstaining on low-consensus AIME problems raises accuracy to 100% (AUROC = 0.85).
[Body text captured with the caption:] Calibration as a byproduct. SC-MoA’s consensus structure provides uncertainty quantification as an emergent property, requiring no reward model or …

**Figure 8.** Pipeline heatmap for all five conditions (n=171 each). Cells are colored per-column by normalised rank (green = best, red = worst). Thick borders highlight the two competitive conditions. 4persona@70B and @8B collapse on accuracy despite reasonable consensus. 4persona@20B† ran under severe TPM throttling (0.8 LLM calls/problem); its SC-self-exit and aggregation-free metrics reflect incomplete pipeline exec…

**Figure 9.** Inner sampling budget sweep. Accuracy vs. k (samples per paraphrase) on GPQA-Diamond (n=198) and LCB-Hard (n=171). Both benchmarks peak at k=5; beyond this point, post-peak variation is within noise on GPQA and significantly regresses on LCB-Hard (Δ = −4.7 pp, p=0.008). Error bars: bootstrap 95% CIs.

**Figure 10.** Information ladder on GPQA-Diamond (n=198). Accuracy with Wilson 95% CIs, ordered by increasing information. The dashed line marks majority vote (69.7%). Re-solve conditions (red) fall well below MV; trace-level conditions (green) cluster above it. SC-MoA synthesis significantly outperforms all voting and selection baselines at p < 0.03. The pairwise judge matches accuracy but at nearly 2× compute (21 vs.…

**Figure 11.** Consensus-stratified information ladder. On unanimous problems (left, n=175), all conditions cluster near MV except re-solve, which collapses. On contested problems (right, n=23), trace conditions nearly triple MV accuracy. [Plot text captured with the caption, apparently from the next figure: conditions Q-only (agg), Q-only (CoT), Answers bare, Answers+meta, Traces bare, Traces+meta, Traces evo, Full; axis: flip count (vs. majority vote); bar labels in order 0.3×, 0.4×, 2.2×, 1.0×, 2.2×, 3.0×, 5.5×, 2.7× …]

**Figure 12.** Beneficial-to-harmful flip ratio across the information ladder. Re-solve conditions (< 1×) are corruption-dominated; trace conditions (2–5.5×) are recovery-dominated.
[Body text captured with the caption:] may differ. Across 198 problems, unanimous agreement (all 5 paraphrases yield the same answer) occurs on 56.4% of problems, with 50.9% all-correct. The 56.4% unanimity rate is a lower bound on stability: after aggregation, the pipeline absor…

**Figure 13.** Architectural comparison of six multi-agent aggregation strategies. Self-Consistency votes over samples from a single prompt. MoA layers heterogeneous proposers with debate. Self-MoA applies MoA-style aggregation to single-model samples. TextGrad iteratively refines via text gradients. GoA structures agent interaction as a scored graph. SC-MoA (highlighted, rightmost) uniquely combines perturbation-based …
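
The ECE figures quoted in the Figure 7 caption can be approximated along these lines. This is a hedged sketch: it assumes "confidence" means the share of the N proposals that agree with the majority answer and uses one bin per observed consensus level, neither of which is confirmed by the captions. `consensus_ece` is a hypothetical helper name.

```python
from collections import defaultdict
from typing import Iterable, Tuple


def consensus_ece(records: Iterable[Tuple[float, bool]]) -> float:
    """Expected calibration error with one bin per observed consensus level.

    Each record is (consensus, correct): consensus is the share of proposals
    agreeing with the majority answer (an assumed definition), and correct
    says whether that majority answer matched the gold label.
    """
    bins = defaultdict(list)
    for consensus, correct in records:
        bins[round(consensus, 6)].append(bool(correct))
    total = sum(len(outcomes) for outcomes in bins.values())
    if total == 0:
        raise ValueError("no records")
    ece = 0.0
    for consensus, outcomes in bins.items():
        accuracy = sum(outcomes) / len(outcomes)
        # Weight each bin's |accuracy - confidence| gap by its share of problems.
        ece += (len(outcomes) / total) * abs(accuracy - consensus)
    return ece


# Toy data: unanimous problems are right 80% of the time, 3/5 consensus 60%.
records = [(1.0, True)] * 8 + [(1.0, False)] * 2 + [(0.6, True)] * 3 + [(0.6, False)] * 2
print(round(consensus_ece(records), 3))  # 0.133
```

With N=5 there are only a handful of possible consensus levels, so binning by level avoids choosing an arbitrary bin width.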

Original abstract

When multiple LLM agents solve the same problem, standard practice compresses each agent's reasoning into a majority vote or layered synthesis, treating agreement as the finish line. We show this is unnecessarily lossy: an LLM aggregator that reads complete reasoning traces recovers correct solutions even when agents unanimously agree, with beneficial corrections consistently outweighing harmful ones -- the aggregation paradox. Majority voting has a ceiling that perturbation diversity does not raise (error correlations are identical); the aggregator's gain comes from trace-level complementarity, assembling correct intermediate steps from minority chains that voting discards. These findings motivate Self-Consistent Mixture of Agents, which generates trace diversity through semantic-preserving input perturbations, safeguards the majority via anchored refinement with provable non-degradation guarantees, and always synthesizes -- never gates on consensus. A single model with perturbation-induced trace variation outperforms heterogeneous model pools across structured reasoning, PhD-level science, competition mathematics, and competitive programming. The unit of aggregation should be the reasoning trace, not the answer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper flags a real limit in majority-vote MoA and pushes trace-level synthesis, but the central claims rest on unshown ablations and mechanisms.

The main thing to know is that this paper argues standard mixture-of-agents setups lose information by collapsing to final answers, and that feeding an aggregator the full reasoning traces can recover correct solutions even when every agent lands on the same wrong answer. They call this the aggregation paradox and back it with a Self-Consistent MoA recipe that adds semantic-preserving perturbations for trace diversity plus anchored refinement that supposedly never degrades performance.

What is actually new is the explicit focus on trace complementarity as the source of gains rather than model heterogeneity, plus the claim that a single model plus perturbations beats mixed-model pools on structured reasoning, PhD science, math, and programming tasks. The observation that majority voting has a hard ceiling because error correlations stay the same is a fair point and aligns with existing work on correlated failures.

The soft spots are substantial and central. The abstract supplies no methods section, no ablation that compares trace input against answer-only input, and no data or error analysis showing that beneficial corrections outweigh harmful ones. The assumption that the aggregator reliably keeps correct intermediate steps while discarding errors is stated as an empirical fact but given no verification procedure or mechanism. The stress-test note is on target: without those controls it is impossible to separate trace synthesis from the aggregator simply being a stronger reasoner on its own. The "provable non-degradation guarantees" are mentioned but not derived or tested against the unanimous-wrong case.

This is aimed at people already running multi-agent LLM systems for reasoning. It deserves a serious referee because the idea is testable and the critique of answer-level aggregation is direct, even though the current evidence is thin. Send it to review so the authors can supply the missing ablations and checks.

Referee Report

3 major / 1 minor

Summary. The paper claims that majority-vote or consensus-based aggregation in LLM mixtures is lossy because it discards reasoning traces; an LLM aggregator reading full traces can recover correct solutions even under unanimous agent agreement on an incorrect final answer by assembling correct intermediate steps from minority traces (the 'aggregation paradox'). It proposes Self-Consistent Mixture of Agents, which generates trace diversity via semantic-preserving input perturbations, applies anchored refinement with provable non-degradation guarantees, and always performs trace synthesis rather than gating on consensus. A single model with perturbation-induced trace variation is claimed to outperform heterogeneous model pools on structured reasoning, PhD-level science, competition mathematics, and competitive programming.

Significance. If the central claims hold with proper controls, the work could shift multi-agent LLM design from answer-level consensus to trace-level synthesis, offering a route to higher reliability on reasoning tasks without requiring model heterogeneity. The emphasis on provable non-degradation guarantees for anchored refinement and the empirical claim of single-model superiority are potential strengths if supported by rigorous ablations and verification procedures.

major comments (3)

[Abstract / Method] Abstract and method description: the aggregation paradox claim requires that the aggregator performs reliable step-level selection and retention of correct intermediates from minority traces while discarding errors. No formal mechanism, verification procedure for step correctness, or ablation (e.g., full traces vs. answers-only input to the aggregator) is provided to isolate trace complementarity from the aggregator model's independent reasoning capability.
[Abstract] Abstract: the 'provable non-degradation guarantees' for anchored refinement are invoked to safeguard the majority, but the assumptions under which these guarantees hold are not shown to cover the unanimous-wrong case that is central to the aggregation paradox.
[Experiments] Experiments section (implied by claims of outperformance): no error bars, statistical tests, or ablations are described that would demonstrate beneficial corrections consistently outweigh harmful ones or rule out that observed gains derive from the aggregator solving the problem better on its own rather than from trace synthesis.
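
The control these comments ask for can be sketched as follows, under loud assumptions: `build_aggregator_input`, the `Proposal` record, the three condition names, and the prompt wording are all hypothetical and not taken from the paper. The same aggregator is fed the question alone, the agents' final answers, or the full traces, so any accuracy gap between the last two conditions isolates what the traces add.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Proposal:
    trace: str   # full reasoning text produced by one agent
    answer: str  # that agent's final answer


def build_aggregator_input(question: str, proposals: List[Proposal], condition: str) -> str:
    """Assemble the aggregator prompt for one ablation condition."""
    parts = [f"Question:\n{question}"]
    if condition == "question_only":
        pass  # aggregator must solve the problem unaided
    elif condition == "answers_only":
        for i, p in enumerate(proposals, start=1):
            parts.append(f"Agent {i} answer: {p.answer}")
    elif condition == "full_traces":
        for i, p in enumerate(proposals, start=1):
            parts.append(f"Agent {i} reasoning:\n{p.trace}\nAgent {i} answer: {p.answer}")
    else:
        raise ValueError(f"unknown condition: {condition}")
    parts.append("Give the single best final answer.")
    return "\n\n".join(parts)


proposals = [Proposal("step one; step two", "A"), Proposal("other route", "B")]
for condition in ("question_only", "answers_only", "full_traces"):
    print(f"--- {condition} ---")
    print(build_aggregator_input("What is X?", proposals, condition))
```

If `question_only` matches `full_traces` in accuracy, the gains would be attributable to the aggregator's own reasoning rather than to trace complementarity.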

minor comments (1)

[Abstract] The term 'aggregation paradox' is introduced without a concise formal statement or mathematical characterization that distinguishes it from standard ensemble effects.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical and formal support for our claims.

Referee: [Abstract / Method] Abstract and method description: the aggregation paradox claim requires that the aggregator performs reliable step-level selection and retention of correct intermediates from minority traces while discarding errors. No formal mechanism, verification procedure for step correctness, or ablation (e.g., full traces vs. answers-only input to the aggregator) is provided to isolate trace complementarity from the aggregator model's independent reasoning capability.

Authors: We agree that the manuscript would benefit from an explicit ablation comparing aggregator performance on full traces versus answers-only inputs to isolate trace complementarity. The current description relies on the aggregator synthesizing from complete traces as motivated by the aggregation paradox, but we will add this ablation along with further detail on the synthesis prompting procedure in the revised method section.
revision: yes
Referee: [Abstract] Abstract: the 'provable non-degradation guarantees' for anchored refinement are invoked to safeguard the majority, but the assumptions under which these guarantees hold are not shown to cover the unanimous-wrong case that is central to the aggregation paradox.

Authors: The anchored refinement provides non-degradation by construction when the anchor is retained, but we acknowledge that explicit coverage of the unanimous-wrong case is not detailed in the current text. We will add a formal statement of assumptions and an argument demonstrating applicability to this case in the revised manuscript.
revision: yes
Referee: [Experiments] Experiments section (implied by claims of outperformance): no error bars, statistical tests, or ablations are described that would demonstrate beneficial corrections consistently outweigh harmful ones or rule out that observed gains derive from the aggregator solving the problem better on its own rather than from trace synthesis.

Authors: We will revise the experiments section to report error bars across runs, include statistical tests, and add ablations (including aggregator-only baselines) to show that beneficial corrections outweigh harmful ones and that gains stem from trace synthesis rather than independent solving by the aggregator.
revision: yes

Circularity Check

0 steps flagged

No circularity detected; claims rest on empirical observations without self-referential reduction

The abstract and described method present the aggregation paradox and trace-level synthesis as empirical results from experiments across multiple domains, with no equations, fitted parameters, or self-citations provided that would reduce any central claim to its inputs by construction. Mentions of 'provable non-degradation guarantees' and 'anchored refinement' are stated without details showing definitional equivalence or load-bearing self-citation. The derivation chain is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available; no free parameters, axioms, or invented entities can be identified from the provided text.

Reference graph

Works this paper leans on

6 extracted references

[1]

Applying the correlated Hoeffding inequality [Ladha, 1992] with t=N ε yields the stated bound

Therefore $\mathrm{Var}(S) \le \frac{N}{4}\bigl(1 + (N-1)\bar{\rho}\bigr)$. Applying the correlated Hoeffding inequality [Ladha, 1992] with $t = N\varepsilon$ yields the stated bound. When $\bar{\rho} = 0$, this reduces to the classical Condorcet–Hoeffding bound $1 - \exp(-2N\varepsilon^2)$; when $\bar{\rho} \to 1$, the bound becomes vacuous. .2 Hierarchical Marginalization (Proposition 5) This result instantiates classical stratified sampling [...

1992
[2]

independence

Under a faithful clustering function $\Phi$, the consensus fidelity at threshold $c = k_0/N$ satisfies: $F(c; \Phi) \ge 1 - \binom{N}{\lceil Nc \rceil} \left(\frac{1 - p_{\min}}{p_{\min}}\right)^{\lceil Nc \rceil}$. (5) Under an unfaithful $\Phi$, the bound does not hold: observed consensus may be inflated by spurious agreement. When consensus gating is applied at threshold $\theta$, the accuracy loss relative to always-aggregating is bounded by:...
[3]

Google-Proof

In this case Phase 5 (override) provides the safety net: if a′ scores worse, the system reverts. Empirically, the bare-majority edge case accounts for <5% of problems; the override mechanism handles it correctly in all observed instances (Appendix N). .5 Synthesis Advantage Decomposition (Proposition 1) Statement. Let VOTE and SYNTH be two aggregation pro...

2004
[4]

You are an intuitive problem solver. Start with your best guess, then verify it against the constraints. If verification fails, try the next most likely answer

show that MoA is a special case of GoA (GoA Proposition 1), SC-MoA’s phases can replace graph pooling in any GoA-family pipeline. Head-to-head on GPQA-Diamond, GoAMax and SC-MoA solve largely non-overlapping problem subsets (union ceiling: 79.8%), confirming their complementarity. SC-GoA: full composition experiment. To validate this composability claim em...

2023
[5]

2.Surface variation: vary word choice, sentence structure, and notation style

1. Semantic equivalence: each rephrasing must ask the same question and require the same answer.
2. Surface variation: vary word choice, sentence structure, and notation style.
3. Preserve numbers verbatim: all numerical values must appear exactly as in the original
[6]

Google-proof

4. Preserve code verbatim: all code blocks, variable names, and function signatures must be identical.
5. Preserve units: physical units (eV, m/s, kg, etc.) must appear exactly.
6. Produce exactly N rephrasings: output as a JSON list.
Validation. A regex-based check verifies that all protected tokens (numbers, code blocks, and physical units) appear verbatim in eac...

2023
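
The validation step quoted in entries [5] and [6] can be sketched as below. The excerpts say only that a regex-based check confirms numbers, code blocks, and physical units appear verbatim. The patterns, the helper names `protected_tokens` and `keeps_protected_tokens`, and the short unit list are assumptions made for illustration.

```python
import re
from collections import Counter

# Illustrative patterns only; the paper's actual expressions are not shown.
CODE_BLOCK = re.compile(r"`{3}.*?`{3}", re.DOTALL)  # fenced code, kept whole
NUMBER = re.compile(r"\d+(?:\.\d+)?")                # integers and decimals
UNIT = re.compile(r"\b(?:eV|m/s|kg)\b")              # only the units named in the excerpt


def protected_tokens(text: str) -> Counter:
    """Collect code blocks, numbers, and units with their multiplicities."""
    blocks = CODE_BLOCK.findall(text)
    prose = CODE_BLOCK.sub(" ", text)  # avoid re-counting numbers inside code
    return Counter(blocks + NUMBER.findall(prose) + UNIT.findall(prose))


def keeps_protected_tokens(original: str, rephrasing: str) -> bool:
    """True if every protected token of the original survives verbatim."""
    need = protected_tokens(original)
    have = protected_tokens(rephrasing)
    return all(have[token] >= count for token, count in need.items())


original = "A 2.5 kg block slides at 3 m/s. What is its kinetic energy?"
good = "Find the kinetic energy of a block of mass 2.5 kg moving at 3 m/s."
bad = "Find the kinetic energy of a 2.50 kg block moving at three m/s."
print(keeps_protected_tokens(original, good), keeps_protected_tokens(original, bad))
```

A check of this kind guards the surface form only. It cannot confirm that a rephrasing still asks the same question, which the quoted rules list as a separate requirement.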
