Recognition: no theorem link
Collective AI can amplify tiny perturbations into divergent decisions
Pith reviewed 2026-05-15 14:20 UTC · model grok-4.3
The pith
Multi-LLM deliberation amplifies tiny perturbations into divergent decisions
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Iterative multi-LLM deliberation amplifies tiny perturbations into divergent conversational trajectories and different final decisions. In a fully deterministic self-hosted benchmark, exact reruns are identical, yet small meaning-preserving changes to the scenario text still separate over time and often alter the final recommendation. In deployed black-box API systems, nominally identical committee runs likewise remain unstable even at temperature 0. Across 12 policy scenarios, instability in collective AI arises from sensitivity to nearby initial conditions under repeated interaction itself. Committee architecture modulates this instability: role structure, model composition, and feedback memory can each alter the degree of divergence.
What carries the argument
Iterative multi-LLM deliberation, the repeated back-and-forth among models that builds each response on prior outputs and thereby magnifies small initial differences
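The amplification mechanism can be sketched with a deliberately crude stand-in: a one-dimensional chaotic map in place of a committee, where each "round" feeds the previous state through a fixed nonlinear update. The logistic map and its parameter are our choice for illustration and are not the paper's system; the point is only that deterministic repetition plus nonlinearity suffices to blow up a tiny perturbation while exact reruns stay identical.

```python
# Toy illustration (not the paper's system): each "round" applies a
# fixed nonlinear update, mimicking deliberation that builds on prior
# outputs. The logistic map with r = 3.9 is a standard chaotic example.

def deliberate(x0: float, rounds: int = 30, r: float = 3.9) -> list[float]:
    """Iterate x -> r * x * (1 - x) deterministically for `rounds` steps."""
    xs = [x0]
    for _ in range(rounds):
        xs.append(r * xs[-1] * (1 - xs[-1]))
    return xs

base = deliberate(0.4)            # original scenario
perturbed = deliberate(0.400001)  # tiny "meaning-preserving" tweak

# Exact reruns are identical: deterministic execution is not the issue.
assert deliberate(0.4) == base

# Yet the perturbation is amplified round over round.
for k in (1, 10, 20, 30):
    print(f"round {k:2d}: gap = {abs(base[k] - perturbed[k]):.2e}")
```

With these numbers the gap typically grows by several orders of magnitude over 30 rounds, mirroring the paper's qualitative finding: reruns match exactly, while nearby inputs separate over time.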
Load-bearing premise
That the tested meaning-preserving changes to scenario text are small enough and equivalent enough that any resulting divergence must come from the interaction dynamics rather than unmeasured differences in model interpretation.
What would settle it
A controlled deterministic run in which the exact same set of meaning-preserving scenario variants produces identical committee trajectories and final recommendations across repeated trials.
Original abstract
Large language models are increasingly deployed not as single assistants but as committees whose members deliberate and then vote or synthesize a decision. Such systems are often expected to be more robust than individual models. We show that iterative multi-LLM deliberation can instead amplify tiny perturbations into divergent conversational trajectories and different final decisions. In a fully deterministic self-hosted benchmark, exact reruns are identical, yet small meaning-preserving changes to the scenario text still separate over time and often alter the final recommendation. In deployed black-box API systems, nominally identical committee runs likewise remain unstable even at temperature 0, where many users expect near-determinism. Across 12 policy scenarios, these findings indicate that instability in collective AI is not only a consequence of residual platform-side stochasticity, but can arise from sensitivity to nearby initial conditions under repeated interaction itself. Additional deployed experiments show that committee architecture modulates this instability: role structure, model composition, and feedback memory can each alter the degree of divergence. Collective AI therefore faces a stability problem, not only an accuracy problem: deterministic execution alone does not guarantee predictable or auditable deliberative outcomes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that iterative multi-LLM deliberation in collective AI systems can amplify tiny meaning-preserving perturbations in input scenarios into divergent conversational trajectories and different final decisions. This occurs even in fully deterministic self-hosted benchmarks where exact reruns produce identical outputs, and persists in black-box API deployments at temperature 0. Across 12 policy scenarios, the work shows that committee architecture (role structure, model composition, feedback memory) modulates the degree of divergence, indicating that instability arises from interaction dynamics rather than solely from platform stochasticity.
Significance. If the central claim holds after addressing documentation gaps, the result identifies a fundamental stability problem in collective AI that is distinct from accuracy issues. Deterministic execution alone does not ensure predictable or auditable outcomes, with direct implications for high-stakes deployments such as policy synthesis or decision committees. The use of controlled deterministic benchmarks to isolate interaction effects is a methodological strength that allows falsifiable observation of amplification without relying on fitted parameters.
Major comments (3)
- [Experimental Setup] The central claim requires that observed divergence stems from interaction dynamics amplifying perturbations rather than unmeasured interpretive differences. The manuscript describes 'small meaning-preserving changes' across 12 scenarios but provides no explicit construction method, semantic similarity metric, or human validation protocol to confirm equivalence. This is load-bearing for the deterministic self-hosted results, where any residual difference would deterministically produce divergent trajectories without invoking amplification.
- [Deployed Experiments] In the black-box API experiments, instability at temperature 0 is attributed to interaction sensitivity, yet the paper does not detail controls for API-side consistency (e.g., request ordering, caching, or version pinning) beyond temperature. This leaves open whether platform behavior interacts with the perturbations, weakening the separation between stochasticity and deterministic sensitivity.
- [Architecture Modulation Results] The claim that committee architecture modulates instability is supported by additional experiments, but the manuscript should specify quantitative divergence metrics (e.g., trajectory similarity or decision agreement rates) and report statistical significance or variance across the 12 scenarios to substantiate the modulation effects.
Minor comments (1)
- [Abstract] The abstract states that 'exact reruns are identical' but should report the number of reruns performed and the precise determinism guarantees of the self-hosted setup to allow readers to assess the strength of the baseline.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for clarifying our experimental methods and results on instability in collective AI systems. We address each major comment below and have revised the manuscript accordingly to strengthen the documentation and quantitative support for our claims.
Point-by-point responses
-
Referee: [Experimental Setup] The central claim requires that observed divergence stems from interaction dynamics amplifying perturbations rather than unmeasured interpretive differences. The manuscript describes 'small meaning-preserving changes' across 12 scenarios but provides no explicit construction method, semantic similarity metric, or human validation protocol to confirm equivalence. This is load-bearing for the deterministic self-hosted results, where any residual difference would deterministically produce divergent trajectories without invoking amplification.
Authors: We agree that explicit details on perturbation construction are necessary to rule out initial interpretive differences. In the revised manuscript, we have added a new 'Perturbation Generation Protocol' subsection in the Methods. Perturbations were generated via targeted synonym substitutions and minor syntactic rephrasings using a controlled prompt on a separate model, followed by verification with sentence embeddings (all-MiniLM-L6-v2, cosine similarity threshold >0.93) and dual-author manual review confirming semantic equivalence in all cases. This protocol is now fully documented to support the deterministic benchmark claims. revision: yes
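The equivalence filter described in this response can be pinned down in a few lines. The cosine computation below is standard; the embedding vectors are toy stand-ins (a real check would embed both scenario texts with all-MiniLM-L6-v2, as the authors report), and the 0.93 threshold is the one quoted above.

```python
# Sketch of the equivalence filter: accept a perturbed scenario only if
# its embedding stays above a cosine-similarity threshold relative to
# the original. Vectors here are toy stand-ins for real embeddings.
import math

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def is_meaning_preserving(e_orig: list[float], e_pert: list[float],
                          threshold: float = 0.93) -> bool:
    return cosine(e_orig, e_pert) > threshold

# Toy embeddings: a near-identical vector passes, an unrelated one fails.
orig = [0.2, 0.7, 0.1, 0.6]
near = [0.21, 0.69, 0.12, 0.61]
far = [0.9, -0.3, 0.4, -0.8]
print(is_meaning_preserving(orig, near), is_meaning_preserving(orig, far))
```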
-
Referee: [Deployed Experiments] In the black-box API experiments, instability at temperature 0 is attributed to interaction sensitivity, yet the paper does not detail controls for API-side consistency (e.g., request ordering, caching, or version pinning) beyond temperature. This leaves open whether platform behavior interacts with the perturbations, weakening the separation between stochasticity and deterministic sensitivity.
Authors: We have expanded the deployed experiments section to include these controls. All API requests used pinned model versions (e.g., gpt-4-0613), identical payloads without cache directives, and were executed in randomized order across separate sessions on different days. Single-model baseline repetitions confirmed no platform-induced divergence, isolating the effect to committee interactions. These details are now specified in Section 3.2 of the revision. revision: yes
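One such payload-identity control can be sketched as follows, assuming OpenAI-style request payloads: canonicalize each request and hash it, so that "nominally identical" runs are shown to be byte-identical before any divergence is attributed to interaction dynamics rather than the platform. The field names are illustrative, not taken from the paper; the pinned model string `gpt-4-0613` is the example the authors give.

```python
# Sketch of a rerun-identity control: hash a canonical serialization of
# each request payload so identical runs can be proven byte-identical.
# Payload fields below are illustrative OpenAI-style names.
import hashlib
import json

def payload_fingerprint(payload: dict) -> str:
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

run_a = {"model": "gpt-4-0613", "temperature": 0,
         "messages": [{"role": "user", "content": "scenario text"}]}
run_b = {"temperature": 0, "model": "gpt-4-0613",
         "messages": [{"role": "user", "content": "scenario text"}]}

# Key order differs, but the canonical form (and hence the hash) matches.
assert payload_fingerprint(run_a) == payload_fingerprint(run_b)
```

If two committee runs share a fingerprint yet still diverge, the divergence cannot be blamed on the request side, which is the separation the referee asks for.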
-
Referee: [Architecture Modulation Results] The claim that committee architecture modulates instability is supported by additional experiments, but the manuscript should specify quantitative divergence metrics (e.g., trajectory similarity or decision agreement rates) and report statistical significance or variance across the 12 scenarios to substantiate the modulation effects.
Authors: We have augmented the results with explicit quantitative metrics. Trajectory divergence is quantified using normalized Levenshtein distance on full deliberation logs, and decision agreement is reported as the percentage of matching final recommendations. For each architecture variant, we now provide means, standard deviations across the 12 scenarios, and p-values from paired statistical tests (Wilcoxon signed-rank). These are presented in an updated Table 2 and Section 4.3 to substantiate the modulation findings. revision: yes
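The two metrics named here are concrete enough to sketch directly; the stdlib-only implementation below is ours, matching the metric definitions in the response rather than the paper's actual code.

```python
# Sketch of the two divergence metrics: normalized Levenshtein distance
# between deliberation logs, and the fraction of paired runs whose
# final recommendations agree.

def levenshtein(a: str, b: str) -> int:
    """Edit distance via the standard rolling-row dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def normalized_divergence(log_a: str, log_b: str) -> float:
    """Levenshtein distance scaled to [0, 1] by the longer log."""
    if not log_a and not log_b:
        return 0.0
    return levenshtein(log_a, log_b) / max(len(log_a), len(log_b))

def agreement_rate(decisions_a: list[str], decisions_b: list[str]) -> float:
    """Fraction of paired runs with matching final recommendations."""
    matches = sum(x == y for x, y in zip(decisions_a, decisions_b))
    return matches / len(decisions_a)

print(normalized_divergence("approve with conditions", "approve with conditions"))
print(agreement_rate(["approve", "reject", "defer"], ["approve", "defer", "defer"]))
```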
Circularity Check
No circularity: the observed deliberation instability is measured directly from experiments, not derived from the paper's own assumptions or inputs
Full rationale
The paper reports direct experimental results from controlled multi-LLM committee runs across 12 scenarios, comparing exact reruns versus small meaning-preserving text changes. No equations, fitted parameters, self-citations, or derivations are invoked to generate the claimed outcomes; divergence is measured as an observed phenomenon rather than predicted from prior quantities. The central claim therefore rests on external benchmarks (deterministic self-hosted execution and API behavior) and does not reduce to its own inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. https://arxiv.org/abs/2308.08155
- [2] Weize Chen, Yusheng Su, Jingwei Zuo, et al. AgentVerse: Facilitating Multi-Agent Collaboration and Exploring Emergent Behaviors in Agents. https://arxiv.org/abs/2308.10848
- [3] Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative Agents for "Mind" Exploration of Large Language Model Society. https://arxiv.org/abs/2303.17760
- [4] Sirui Hong et al. MetaGPT: Meta Programming for a Multi-Agent Collaborative Framework. https://arxiv.org/abs/2308.00352
- [5] Joon Sung Park, Joseph C. O'Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative Agents: Interactive Simulacra of Human Behavior. In Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology (UIST), pages 1–22. doi:10.1145/3586183.3606763
- [6] Melanie Sclar, Yejin Choi, Yulia Tsvetkov, and Alane Suhr. Quantifying Language Models' Sensitivity to Spurious Features in Prompt Design or: How I Learned to Start Worrying About Prompt Formatting. In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2310.11324
- [7] Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, et al. Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. In Advances in Neural Information Processing Systems (NeurIPS), volume 36.
- [8] Amos Tversky and Daniel Kahneman. The Framing of Decisions and the Psychology of Choice. Science, 211(4481):453–458. doi:10.1126/science.7455683
- [9] Scott E. Page. The Difference: How the Power of Diversity Creates Better Groups, Firms, Schools, and Societies. Princeton University Press.
- [10] Jean-Pierre Eckmann and David Ruelle. Ergodic Theory of Chaos and Strange Attractors. Reviews of Modern Physics, 57(3):617–656. doi:10.1103/RevModPhys.57.617
- [11] Edward Ott. Chaos in Dynamical Systems. Cambridge University Press, 2nd edition.
- [12] Joelle Pineau, Philippe Vincent-Lamarre, Koustuv Sinha, Vincent Larivière, Alina Beygelzimer, Florence d'Alché-Buc, Emily Fox, and Hugo Larochelle. Improving Reproducibility in Machine Learning Research: A Report from the NeurIPS 2019 Reproducibility Program. Journal of Machine Learning Research, 22(164):1–20.