The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size

Bla\v{z} Bertalani\v{c}; Carolina Fortuna

arxiv: 2606.02646 · v1 · pith:WCJCAQAHnew · submitted 2026-05-31 · ⚛️ physics.soc-ph · cs.AI· cs.MA

The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size

Bla\v{z} Bertalani\v{c} , Carolina Fortuna This is my paper

Pith reviewed 2026-06-28 16:02 UTC · model grok-4.3

classification ⚛️ physics.soc-ph cs.AIcs.MA

keywords multi-agent LLMsscaling lawRingelmann effecteffective team sizepeer debateself-correctionhard ceilingMMLU

0 comments

The pith

Multi-agent LLM systems obey a two-parameter scaling law for effective team size that saturates in most configurations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a scaling law R(N) = 1/(1+c(N-1)N^{-β}) for the ratio of effective to nominal agents in LLM teams. This form fits data from 44 model-task-condition combinations with R² exceeding 0.99, distinguishing hard-ceiling, sublinear, and linear regimes based on the value of β. A mean-field result shows that the product of peer count and debate rounds governs the dynamics. On free-form math tasks, dense peer influence reduces answer diversity to a hard ceiling, while correctness redundancy stays capped throughout. Practical results include that 30 agents add no diversity beyond one in certain settings, and only heterogeneous teams can lower the saturation constant c.

Core claim

The central discovery is that the functional form R(N) = N_eff/N = 1/(1+c(N-1)N^{-β}) fits every tested condition at R² > 0.99, with only the parameters (c, β) varying. The exponent β determines the regime: hard-ceiling at 1/c when β=0, sublinear growth when 0<β<1, or linear when β≥1. A mean-field theorem predicts that peer count k and rounds τ enter only through the product kτ. The law holds at answer diversity and correctness redundancy levels, and across peer debate, self-correction, placebo, self-consistency, various models, and communication modes.

What carries the argument

The two-parameter scaling law R(N) = 1/(1 + c(N-1)N^{-β}) that measures the fraction of effective agents N_eff/N and classifies configurations by the regime exponent β.

Load-bearing premise

Peer count k and debate rounds τ influence the system only through their product kτ, and the 44 experimental cells represent the full space of configurations without bias in selection.

What would settle it

Running new experiments with different k and τ where the effect depends on them separately rather than only their product, or finding conditions where the scaling form fits with R² much lower than 0.99.

Figures

Figures reproduced from arXiv: 2606.02646 by Bla\v{z} Bertalani\v{c}, Carolina Fortuna.

**Figure 1.** Figure 1: Effective team size saturates and the regime is set by collaboration mode (Qwen2.5-7B × GSM-Hard). A: Answer-level effective team size Nans eff vs. nominal N. Self-correction and noise placebo grow sublinearly toward Neff ≈ 6 at N=30. Debate flattens at Neff ≈ 1.8. The dashed line marks the independent-voter ideal Neff = N, and the shaded gap is the departure of the observed team from that ideal. B: Fitted… view at source ↗

**Figure 2.** Figure 2: Structural comparison to classical Ringelmann data via the Latané power law. We fit R(N) = Nt−1 to LLM debate and to three published human group datasets, on the same axes. The estimated exponent t quantifies the decline rate: human groups have t ∈ [0.49, 0.90] (gradual decline), LLM debate has t ≤ 0.13 (much steeper). The same form is used for both populations so the gap in estimated t is structural, not … view at source ↗

**Figure 3.** Figure 3: Structural comparison via the Kish design effect. Same datasets as [PITH_FULL_IMAGE:figures/full_fig_p017_3.png] view at source ↗

**Figure 4.** Figure 4: The Ringelmann paradox: individual learning, collective stagnation. Top: Net agentlevel transitions (W→C minus C→W) are positive at all team sizes, on both tasks, under both debate and self-correction. Agents do learn from interaction. Bottom: Effective team size Neff stays close to the no-aggregation lower bound on MMLU-Hard and grows sublinearly on GSM-Hard, in both cases far below the independent-voter… view at source ↗

**Figure 5.** Figure 5: Small-N extrapolation of the Ringelmann scaling law on heterogeneous debate teams across MMLU-Hard, GSM-Hard, and GPQA. Green curves are the Ringelmann form R(N) = 1/(1+c(N−1)N −β ) estimated on small N (training range varies by task), with 95% bootstrap envelopes, extrapolated to the largest N available. The same form continues to fit, with a higher ceiling (1/c) than in the homogeneous setting. Within- v… view at source ↗

**Figure 6.** Figure 6: Small-team runs predict the large-team Ringelmann ceiling across models and tasks. Rows: Qwen2.5-7B (top), Llama-3.1-8B (middle), Ministral-8B (bottom). Columns: MMLU-Hard, GSM-Hard, GPQA. Black points show observed answer-diversity efficiency R(N) = Neff/N with 95% item-level bootstrap intervals. Green curves are the Ringelmann form R(N) = 1/(1+c(N−1)N −β ) estimated only on N ∈ {2, 3, 5} (gray training b… view at source ↗

**Figure 7.** Figure 7: The (c, β) landscape across evaluated configurations. Each point is one Ringelmann estimate across 6 models plus heterogeneous teams, 3 tasks, and 3 conditions (44 estimates total, also reported in [PITH_FULL_IMAGE:figures/full_fig_p031_7.png] view at source ↗

**Figure 8.** Figure 8: kτ collapse stratified by initial item-level agreement (Qwen2.5-7B, GSM-Hard, N=10). Curves are pairwise agreement ρ¯ vs. kτ , one per peer count k. Right panel (high agreement, ρ¯ (0) > 0.7): curves nearly coincide, consistent with the theorem’s high-agreement regime. Left panel (low agreement, ρ¯ (0) < 0.4): visible cross-k separation at small kτ , narrowing as kτ grows. Why the large βent does not viola… view at source ↗

**Figure 9.** Figure 9: Total prompt tokens per item at R3 (Qwen2.5-7B), across three tasks and two conditions. [PITH_FULL_IMAGE:figures/full_fig_p037_9.png] view at source ↗

read the original abstract

Inference-time multi-agent LLM scaling lacks a shared unit: counting nominal agents conflates cost with independent evidence. We derive a two-parameter scaling law $R(N) = N_\text{eff}/N = 1/(1+c(N-1)N^{-\beta})$ where the regime exponent $\beta$ classifies any configuration into one of three asymptotic regimes -- hard-ceiling at $1/c$ ($\beta = 0$), sublinear at $N^\beta/c$ ($0 < \beta < 1$), or linear ($\beta \ge 1$), and a mean-field theorem predicts that peer count $k$ and rounds $\tau$ during agent debate enter the dynamics only through their product $k\tau$. The law applies at two levels: answer diversity and correctness redundancy. Across 44 (model $\times$ task $\times$ condition) cells spanning peer debate, self-correction, random-noise placebo, self-consistency, three open-weight families (Qwen, Llama, Ministral) at scales from 7B to 32B with a frontier API check (Gemini), thinking models, heterogeneous teams, and sparse communication, the functional form fits every condition at $R^2 > 0.99$; only $(c, \beta)$ shifts. On free-form math, dense peer influence collapses the answer-level regime from sublinear into hard-ceiling; correctness-level fits remain hard-ceiling throughout. Three findings have practical implications. \emph{(i)}~Thirty dense debating agents produce no more answer diversity than one on MMLU-Hard. \emph{(ii)}~A noise placebo tracks self-correction on free-form math and at $4\times$ scale, so within homogeneous teams the gain commonly attributed to ``debate'' comes from re-evaluation, not peer content. \emph{(iii)}~A single $N \le 5$ pilot predicts the $N=30$ structural ceiling, and within the configurations tested only architectural diversity (heterogeneous teams) lowers $c$ and escapes the hard-ceiling regime, communication-mode interventions do not.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a usable two-parameter form that fits their 44 LLM team experiments at R²>0.99 and flags heterogeneous teams as the only way out of hard ceilings, but the mean-field claim that k and τ only matter via their product is not shown to survive independent variation.

read the letter

The main point is that they have a compact scaling law R(N) = 1/(1 + c(N-1)N^{-β}) that describes effective team size across debate, self-correction, noise placebos, and heterogeneous setups, with only the two parameters shifting. It classifies regimes by β and reports that thirty dense agents add no answer diversity beyond one on MMLU-Hard while heterogeneous teams alone lower c.

They do the empirical part cleanly: three model families, multiple tasks, frontier check, and the same functional form holding at both answer-diversity and correctness levels. The practical takeaways follow directly from the fits.

The soft spot is the mean-field step that reduces peer count k and rounds τ to their product alone. The stress-test concern lands because the paper does not appear to vary k and τ independently at fixed product; if finite-size or round-wise effects break that reduction, the claimed universality across all 44 cells rests on the particular (k,τ) choices they made rather than the derivation. The high R² values are real but cannot rule that out without the extra test.

The math is elementary and the data collection spans enough conditions to be worth checking. This is for groups already running multi-agent LLM experiments who need a quantitative handle on diminishing returns. It deserves a serious referee because the form is testable and the heterogeneous-team result is actionable, even if the mean-field validation needs tightening.

Referee Report

1 major / 1 minor

Summary. The paper derives a two-parameter scaling law R(N) = N_eff/N = 1/(1 + c(N-1)N^{-β}) for effective team size in multi-agent LLM systems. It classifies configurations into hard-ceiling, sublinear, or linear regimes via the exponent β, and presents a mean-field theorem asserting that peer count k and debate rounds τ enter the dynamics only through the product kτ. The law is reported to fit answer diversity and correctness redundancy across 44 cells (spanning models, tasks, debate, self-correction, heterogeneous teams, and communication modes) at R² > 0.99, with only (c, β) varying; practical claims include that 30 dense agents add no diversity beyond one on MMLU-Hard and that heterogeneity lowers c while communication interventions do not.

Significance. If the central functional form and mean-field reduction hold after verification, the work supplies a compact, falsifiable description of diminishing returns in LLM teams, directly linking observable parameters (k, τ) to measurable effective size. The consistent high-R² fits across open-weight families, frontier models, and free-form vs. multiple-choice tasks, together with the explicit regime classification, would constitute a useful quantitative tool for system design; the observation that a small-N pilot predicts large-N ceilings is particularly actionable.

major comments (1)

[Mean-field theorem and experimental design] The mean-field theorem (stated in the abstract and used to justify universality across the 44 cells) asserts that k and τ contribute only via their product kτ. No experiment is described that holds kτ fixed while varying k and τ independently; without such a test, finite-size effects or round-wise correlations could produce different N_eff for the same product, rendering the claimed two-parameter universality dependent on the particular (k, τ) sampling chosen for the cells rather than a general derivation.

minor comments (1)

[Abstract and methods] The abstract states that the functional form applies at both answer-diversity and correctness-redundancy levels, yet the precise operational definitions of N_eff for each level (e.g., how diversity is quantified, how redundancy is measured) are not restated in the provided excerpt; a short methods paragraph clarifying these quantities would aid reproducibility.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for highlighting this important point on the empirical support for the mean-field theorem. We respond to the major comment below and agree that additional clarification is warranted.

read point-by-point responses

Referee: The mean-field theorem (stated in the abstract and used to justify universality across the 44 cells) asserts that k and τ contribute only via their product kτ. No experiment is described that holds kτ fixed while varying k and τ independently; without such a test, finite-size effects or round-wise correlations could produce different N_eff for the same product, rendering the claimed two-parameter universality dependent on the particular (k, τ) sampling chosen for the cells rather than a general derivation.

Authors: We agree that a direct experimental test holding the product kτ fixed while independently varying k and τ would strengthen the claim. The mean-field theorem is a theoretical derivation from a mean-field approximation of the interaction process, in which the effective peer influence depends on the total number of interactions kτ. Although the 44 cells include a range of (k, τ) pairs and yield consistent high-R² fits, they were not explicitly sampled to hold kτ constant. We will revise the manuscript to (i) clarify that the reduction is a theoretical prediction whose empirical support is indirect, (ii) explicitly note the absence of a matched-product control as a limitation, and (iii) outline a targeted follow-up experiment. This revision does not alter the reported scaling-law fits or regime classifications but improves the interpretation of universality. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation presented as independent of fitted values

full rationale

The paper states it derives the scaling law R(N) from a mean-field theorem reducing k and τ to the product kτ, then reports that the resulting two-parameter form fits all 44 cells at R² > 0.99 with only (c, β) varying. No equation or step is exhibited that reduces the claimed derivation or the mean-field prediction to the experimental fits by construction, nor is any self-citation used as load-bearing justification. The regime classification follows directly from the value of the fitted exponent β rather than presupposing the outcome, and the experimental validation remains external to the derivation step itself. The central claim therefore retains independent content from the data.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

Two free parameters c and β are introduced and fitted per configuration to capture the strength and shape of the saturation effect; the mean-field reduction of k and τ to their product is an imported domain assumption from physics-style modeling.

free parameters (2)

c
Controls the magnitude of the saturation effect and is fitted separately for each model-task-condition cell.
β
Exponent that determines the asymptotic regime and is fitted separately for each model-task-condition cell.

axioms (1)

domain assumption Mean-field theorem: peer count k and rounds τ enter the dynamics only through their product kτ
Used to simplify the multi-agent interaction model into the two-parameter scaling law.

pith-pipeline@v0.9.1-grok · 5936 in / 1453 out tokens · 43225 ms · 2026-06-28T16:02:17.471686+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

59 extracted references · 12 canonical work pages · 7 internal anchors

[1]

Evaluating Large Language Models Trained on Code

Evaluating Large Language Models Trained on Code , author=. arXiv preprint arXiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Recherches sur les moteurs anim

Ringelmann, Maximilien , journal=. Recherches sur les moteurs anim
[3]

1972 , publisher=

Group Process and Productivity , author=. 1972 , publisher=

1972
[4]

American Psychologist , volume=

The psychology of social impact , author=. American Psychologist , volume=
[5]

Journal of Personality and Social Psychology , volume=

Many hands make light the work: The causes and consequences of social loafing , author=. Journal of Personality and Social Psychology , volume=
[6]

Essai sur l'application de l'analyse

de Condorcet, Marquis , year=. Essai sur l'application de l'analyse
[7]

1965 , publisher=

Survey Sampling , author=. 1965 , publisher=

1965
[8]

ICML , year=

Improving Factuality and Reasoning in Language Models through Multiagent Debate , author=. ICML , year=
[9]

EMNLP , year=

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate , author=. EMNLP , year=
[10]

ICLR , year=

Towards Understanding Sycophancy in Language Models , author=. ICLR , year=
[11]

Findings of ACL , year=

Discovering Language Model Behaviors with Model-Written Evaluations , author=. Findings of ACL , year=
[12]

Rethinking the Bounds of

Wang, Qineng and Wang, Zihao and Su, Ying and Tong, Hanghang and Song, Yangqiu , booktitle=. Rethinking the Bounds of
[13]

Yu Xia, Yiran Jenny Shen, Junda Wu, Tong Yu, Sungchul Kim, Ryan A Rossi, Lina Yao, and Ju- lian McAuley

Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate , author=. arXiv preprint arXiv:2509.05396 , year=

work page arXiv
[14]

Proceedings of ACM Conference on AI and Agentic Systems , year=

Decomposing Sycophancy, Fragility, Consensus Collapse and Cost in Homogeneous Multi-Agent LLM Debate , author=. Proceedings of ACM Conference on AI and Agentic Systems , year=
[15]

ICLR 2026 Workshop on AI for Mechanism Design and Strategic Decision Making , year=

Understanding Agent Scaling in LLM-based Multi-Agent Systems via Diversity , author=. ICLR 2026 Workshop on AI for Mechanism Design and Strategic Decision Making , year=

2026
[16]

Wu, Haolun and Li, Zhenkun and Li, Lingyao , journal=. Can
[17]

When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning

When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning , author=. arXiv preprint arXiv:2510.07517 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[18]

NeurIPS , year=

Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models? , author=. NeurIPS , year=
[19]

Revisiting multi-agent debate as test-time scaling: A systematic study of conditional effectiveness.arXiv preprint arXiv:2505.22960,

Revisiting Multi-Agent Debate as Test-Time Scaling: When Does Multi-Agent Help? , author=. arXiv preprint arXiv:2505.22960 , year=

work page arXiv
[20]

ACL , year=

Conformity in Large Language Models , author=. ACL , year=
[21]

Scaling Laws for Neural Language Models

Scaling Laws for Neural Language Models , author=. arXiv preprint arXiv:2001.08361 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2001
[22]

NeurIPS , year=

Training Compute-Optimal Large Language Models , author=. NeurIPS , year=
[23]

ICLR , year=

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. ICLR , year=
[24]

Snell, Charlie and Lee, Jaehoon and Xu, Kelvin and Kumar, Aviral , booktitle=. Scaling
[25]

Are More

Chen, Lingjiao and Davis, Jared Quincy and Hanin, Boris and Bailis, Peter and Stoica, Ion and Zaharia, Matei and Zou, James , booktitle=. Are More
[26]

Reasoning in Token Economies: Budget-Aware Evaluation of

Wang, Junlin and Jain, Siddhartha and Zhang, Dejiao and Ray, Baishakhi and Kumar, Varun and Athiwaratkun, Ben , booktitle=. Reasoning in Token Economies: Budget-Aware Evaluation of
[27]

Consensus is Not Verification: Why Crowd Wisdom Strategies Fail for

Denisov-Blanch, Yegor and Kazdan, Joshua and Chudnovsky, Jessica and Schaeffer, Rylan and others , journal=. Consensus is Not Verification: Why Crowd Wisdom Strategies Fail for
[28]

Conformity and Social Impact on

Bellina, Alessandro and De Marzo, Giordano and Garcia, David , journal=. Conformity and Social Impact on
[29]

InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8140– 8155

Peacemaker or Troublemaker: How Sycophancy Shapes Multi-Agent Debate , author=. arXiv preprint arXiv:2509.23055 , year=

work page arXiv
[30]

Understanding Bias Reinforcement in

Oh, Jihwan and Jeong, Minchan and Ko, Jongwoo and Yun, Se-Young , journal=. Understanding Bias Reinforcement in
[31]

Chan, Chi-Min and Chen, Weize and Su, Yusheng and Yu, Jianxuan and Xue, Wei and Zhang, Shanghang and Fu, Jie and Liu, Zhiyuan , booktitle=. Chat
[32]

Zhang, Hangfan and Cui, Zhiyao and Zhang, Qiaosheng and Hu, Shuyue , journal=. Multi-. 2025 , note=

2025
[33]

Journal of Personality and Social Psychology , volume=

Ringelmann rediscovered: The original article , author=. Journal of Personality and Social Psychology , volume=
[34]

Is out of sight, out of mind?

Chidambaram, Laku and Tung, Lai Lai , journal=. Is out of sight, out of mind?
[35]

PLOS ONE , volume=

An experimental study of team size and performance on a complex task , author=. PLOS ONE , volume=
[36]

Towards a Science of Scaling Agent Systems

Towards a Science of Scaling Agent Systems , author=. arXiv preprint arXiv:2512.08296 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[37]

ICLR , year=

Scaling Large Language Model-based Multi-Agent Collaboration , author=. ICLR , year=
[38]

and Pretorius, Arnu , booktitle=

Smit, Andries Petrus and Grinsztajn, Nathan and Duckworth, Paul and Barrett, Thomas D. and Pretorius, Arnu , booktitle=. Should We Be Going
[39]

arXiv preprint arXiv:2502.00674 , year=

Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? , author=. arXiv preprint arXiv:2502.00674 , year=

work page arXiv
[40]

COLM , year=

Mixture-of-Agents Enhances Large Language Model Capabilities , author=. COLM , year=
[41]

, journal=

Ladha, Krishna K. , journal=. The
[42]

, journal=

Boland, Philip J. , journal=. Majority Systems and the
[43]

Oikos , volume=

Entropy and Diversity , author=. Oikos , volume=
[44]

Journal of the American Statistical Association , volume=

Reaching a Consensus , author=. Journal of the American Statistical Association , volume=
[45]

Irving, Geoffrey and Christiano, Paul and Amodei, Dario , journal=
[46]

Debating with More Persuasive

Khan, Akbir and Hughes, John and Valentine, Dan and Ruis, Laura and Sachan, Kush and Raber, Ansh and Guo, Edward and He, Haotian and Perez, Ethan and Irving, Geoffrey , booktitle=. Debating with More Persuasive
[47]

1972 , publisher=

Victims of Groupthink: A Psychological Study of Foreign-Policy Decisions and Fiascoes , author=. 1972 , publisher=

1972
[48]

Transactions on Machine Learning Research , year=

More Agents Is All You Need , author=. Transactions on Machine Learning Research , year=
[49]

NeurIPS , year=

Self-Refine: Iterative Refinement with Self-Feedback , author=. NeurIPS , year=
[50]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling , author=. arXiv preprint arXiv:2407.21787 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[51]

Findings of ACL , year=

Voting or Consensus? Decision-Making in Multi-Agent Debate , author=. Findings of ACL , year=
[52]

Training Verifiers to Solve Math Word Problems

Training Verifiers to Solve Math Word Problems , author=. arXiv preprint arXiv:2110.14168 , year=

work page internal anchor Pith review Pith/arXiv arXiv
[53]

ICLR , year=

Let's Verify Step by Step , author=. ICLR , year=
[54]

AAAI , year=

Diverse Beam Search for Improved Description of Complex Scenes , author=. AAAI , year=
[55]

Dsdr: Dual-scale diversity regularization for exploration in llm reasoning.arXiv preprint arXiv:2602.19895,

DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning , author=. arXiv preprint arXiv:2602.19895 , year=

work page arXiv
[56]

NeurIPS , year=

Tree of Thoughts: Deliberate Problem Solving with Large Language Models , author=. NeurIPS , year=
[57]

PAL: Program-aided Language Models

Gao, Luyu and Madaan, Aman and Zhou, Shuyan and Alon, Uri and Liu, Pengfei and Yang, Yiming and Callan, Jamie and Neubig, Graham , year=. 2211.10435 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv
[58]

International Conference on Learning Representations (ICLR) , year=

Measuring Massive Multitask Language Understanding , author=. International Conference on Learning Representations (ICLR) , year=
[59]

, journal=

Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. , journal=

[1] [1]

Evaluating Large Language Models Trained on Code

Evaluating Large Language Models Trained on Code , author=. arXiv preprint arXiv:2107.03374 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

Recherches sur les moteurs anim

Ringelmann, Maximilien , journal=. Recherches sur les moteurs anim

[3] [3]

1972 , publisher=

Group Process and Productivity , author=. 1972 , publisher=

1972

[4] [4]

American Psychologist , volume=

The psychology of social impact , author=. American Psychologist , volume=

[5] [5]

Journal of Personality and Social Psychology , volume=

Many hands make light the work: The causes and consequences of social loafing , author=. Journal of Personality and Social Psychology , volume=

[6] [6]

Essai sur l'application de l'analyse

de Condorcet, Marquis , year=. Essai sur l'application de l'analyse

[7] [7]

1965 , publisher=

Survey Sampling , author=. 1965 , publisher=

1965

[8] [8]

ICML , year=

Improving Factuality and Reasoning in Language Models through Multiagent Debate , author=. ICML , year=

[9] [9]

EMNLP , year=

Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate , author=. EMNLP , year=

[10] [10]

ICLR , year=

Towards Understanding Sycophancy in Language Models , author=. ICLR , year=

[11] [11]

Findings of ACL , year=

Discovering Language Model Behaviors with Model-Written Evaluations , author=. Findings of ACL , year=

[12] [12]

Rethinking the Bounds of

Wang, Qineng and Wang, Zihao and Su, Ying and Tong, Hanghang and Song, Yangqiu , booktitle=. Rethinking the Bounds of

[13] [13]

Yu Xia, Yiran Jenny Shen, Junda Wu, Tong Yu, Sungchul Kim, Ryan A Rossi, Lina Yao, and Ju- lian McAuley

Talk Isn't Always Cheap: Understanding Failure Modes in Multi-Agent Debate , author=. arXiv preprint arXiv:2509.05396 , year=

work page arXiv

[14] [14]

Proceedings of ACM Conference on AI and Agentic Systems , year=

Decomposing Sycophancy, Fragility, Consensus Collapse and Cost in Homogeneous Multi-Agent LLM Debate , author=. Proceedings of ACM Conference on AI and Agentic Systems , year=

[15] [15]

ICLR 2026 Workshop on AI for Mechanism Design and Strategic Decision Making , year=

Understanding Agent Scaling in LLM-based Multi-Agent Systems via Diversity , author=. ICLR 2026 Workshop on AI for Mechanism Design and Strategic Decision Making , year=

2026

[16] [16]

Wu, Haolun and Li, Zhenkun and Li, Lingyao , journal=. Can

[17] [17]

When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning

When Identity Skews Debate: Anonymization for Bias-Reduced Multi-Agent Reasoning , author=. arXiv preprint arXiv:2510.07517 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[18] [18]

NeurIPS , year=

Debate or Vote: Which Yields Better Decisions in Multi-Agent Large Language Models? , author=. NeurIPS , year=

[19] [19]

Revisiting multi-agent debate as test-time scaling: A systematic study of conditional effectiveness.arXiv preprint arXiv:2505.22960,

Revisiting Multi-Agent Debate as Test-Time Scaling: When Does Multi-Agent Help? , author=. arXiv preprint arXiv:2505.22960 , year=

work page arXiv

[20] [20]

ACL , year=

Conformity in Large Language Models , author=. ACL , year=

[21] [21]

Scaling Laws for Neural Language Models

Scaling Laws for Neural Language Models , author=. arXiv preprint arXiv:2001.08361 , year=

work page internal anchor Pith review Pith/arXiv arXiv 2001

[22] [22]

NeurIPS , year=

Training Compute-Optimal Large Language Models , author=. NeurIPS , year=

[23] [23]

ICLR , year=

Self-Consistency Improves Chain of Thought Reasoning in Language Models , author=. ICLR , year=

[24] [24]

Snell, Charlie and Lee, Jaehoon and Xu, Kelvin and Kumar, Aviral , booktitle=. Scaling

[25] [25]

Are More

Chen, Lingjiao and Davis, Jared Quincy and Hanin, Boris and Bailis, Peter and Stoica, Ion and Zaharia, Matei and Zou, James , booktitle=. Are More

[26] [26]

Reasoning in Token Economies: Budget-Aware Evaluation of

Wang, Junlin and Jain, Siddhartha and Zhang, Dejiao and Ray, Baishakhi and Kumar, Varun and Athiwaratkun, Ben , booktitle=. Reasoning in Token Economies: Budget-Aware Evaluation of

[27] [27]

Consensus is Not Verification: Why Crowd Wisdom Strategies Fail for

Denisov-Blanch, Yegor and Kazdan, Joshua and Chudnovsky, Jessica and Schaeffer, Rylan and others , journal=. Consensus is Not Verification: Why Crowd Wisdom Strategies Fail for

[28] [28]

Conformity and Social Impact on

Bellina, Alessandro and De Marzo, Giordano and Garcia, David , journal=. Conformity and Social Impact on

[29] [29]

InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8140– 8155

Peacemaker or Troublemaker: How Sycophancy Shapes Multi-Agent Debate , author=. arXiv preprint arXiv:2509.23055 , year=

work page arXiv

[30] [30]

Understanding Bias Reinforcement in

Oh, Jihwan and Jeong, Minchan and Ko, Jongwoo and Yun, Se-Young , journal=. Understanding Bias Reinforcement in

[31] [31]

Chan, Chi-Min and Chen, Weize and Su, Yusheng and Yu, Jianxuan and Xue, Wei and Zhang, Shanghang and Fu, Jie and Liu, Zhiyuan , booktitle=. Chat

[32] [32]

Zhang, Hangfan and Cui, Zhiyao and Zhang, Qiaosheng and Hu, Shuyue , journal=. Multi-. 2025 , note=

2025

[33] [33]

Journal of Personality and Social Psychology , volume=

Ringelmann rediscovered: The original article , author=. Journal of Personality and Social Psychology , volume=

[34] [34]

Is out of sight, out of mind?

Chidambaram, Laku and Tung, Lai Lai , journal=. Is out of sight, out of mind?

[35] [35]

PLOS ONE , volume=

An experimental study of team size and performance on a complex task , author=. PLOS ONE , volume=

[36] [36]

Towards a Science of Scaling Agent Systems

Towards a Science of Scaling Agent Systems , author=. arXiv preprint arXiv:2512.08296 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[37] [37]

ICLR , year=

Scaling Large Language Model-based Multi-Agent Collaboration , author=. ICLR , year=

[38] [38]

and Pretorius, Arnu , booktitle=

Smit, Andries Petrus and Grinsztajn, Nathan and Duckworth, Paul and Barrett, Thomas D. and Pretorius, Arnu , booktitle=. Should We Be Going

[39] [39]

arXiv preprint arXiv:2502.00674 , year=

Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? , author=. arXiv preprint arXiv:2502.00674 , year=

work page arXiv

[40] [40]

COLM , year=

Mixture-of-Agents Enhances Large Language Model Capabilities , author=. COLM , year=

[41] [41]

, journal=

Ladha, Krishna K. , journal=. The

[42] [42]

, journal=

Boland, Philip J. , journal=. Majority Systems and the

[43] [43]

Oikos , volume=

Entropy and Diversity , author=. Oikos , volume=

[44] [44]

Journal of the American Statistical Association , volume=

Reaching a Consensus , author=. Journal of the American Statistical Association , volume=

[45] [45]

Irving, Geoffrey and Christiano, Paul and Amodei, Dario , journal=

[46] [46]

Debating with More Persuasive

Khan, Akbir and Hughes, John and Valentine, Dan and Ruis, Laura and Sachan, Kush and Raber, Ansh and Guo, Edward and He, Haotian and Perez, Ethan and Irving, Geoffrey , booktitle=. Debating with More Persuasive

[47] [47]

1972 , publisher=

Victims of Groupthink: A Psychological Study of Foreign-Policy Decisions and Fiascoes , author=. 1972 , publisher=

1972

[48] [48]

Transactions on Machine Learning Research , year=

More Agents Is All You Need , author=. Transactions on Machine Learning Research , year=

[49] [49]

NeurIPS , year=

Self-Refine: Iterative Refinement with Self-Feedback , author=. NeurIPS , year=

[50] [50]

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling , author=. arXiv preprint arXiv:2407.21787 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[51] [51]

Findings of ACL , year=

Voting or Consensus? Decision-Making in Multi-Agent Debate , author=. Findings of ACL , year=

[52] [52]

Training Verifiers to Solve Math Word Problems

Training Verifiers to Solve Math Word Problems , author=. arXiv preprint arXiv:2110.14168 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[53] [53]

ICLR , year=

Let's Verify Step by Step , author=. ICLR , year=

[54] [54]

AAAI , year=

Diverse Beam Search for Improved Description of Complex Scenes , author=. AAAI , year=

[55] [55]

Dsdr: Dual-scale diversity regularization for exploration in llm reasoning.arXiv preprint arXiv:2602.19895,

DSDR: Dual-Scale Diversity Regularization for Exploration in LLM Reasoning , author=. arXiv preprint arXiv:2602.19895 , year=

work page arXiv

[56] [56]

NeurIPS , year=

Tree of Thoughts: Deliberate Problem Solving with Large Language Models , author=. NeurIPS , year=

[57] [57]

PAL: Program-aided Language Models

Gao, Luyu and Madaan, Aman and Zhou, Shuyan and Alon, Uri and Liu, Pengfei and Yang, Yiming and Callan, Jamie and Neubig, Graham , year=. 2211.10435 , archivePrefix=

work page internal anchor Pith review Pith/arXiv arXiv

[58] [58]

International Conference on Learning Representations (ICLR) , year=

Measuring Massive Multitask Language Understanding , author=. International Conference on Learning Representations (ICLR) , year=

[59] [59]

, journal=

Rein, David and Hou, Betty Li and Stickland, Asa Cooper and Petty, Jackson and Pang, Richard Yuanzhe and Dirani, Julien and Michael, Julian and Bowman, Samuel R. , journal=