
arXiv: 2606.02646 · v1 · pith:WCJCAQAHnew · submitted 2026-05-31 · ⚛️ physics.soc-ph · cs.AI · cs.MA

The Ringelmann Effect in Multi-Agent LLM Systems: A Scaling Law for Effective Team Size

Pith reviewed 2026-06-28 16:02 UTC · model grok-4.3

classification ⚛️ physics.soc-ph · cs.AI · cs.MA
keywords multi-agent LLMs · scaling law · Ringelmann effect · effective team size · peer debate · self-correction · hard ceiling · MMLU

The pith

Multi-agent LLM systems obey a two-parameter scaling law for effective team size that saturates in most configurations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a scaling law R(N) = 1/(1+c(N-1)N^{-β}) for the ratio of effective to nominal agents in LLM teams. This form fits all 44 model-task-condition combinations with R² > 0.99 and separates hard-ceiling, sublinear, and linear regimes by the value of β. A mean-field result predicts that peer count and debate rounds affect the dynamics only through their product. On free-form math tasks, dense peer influence collapses answer diversity from the sublinear regime into a hard ceiling, while correctness-level fits remain hard-ceiling throughout. Practical findings include that thirty densely debating agents add no answer diversity beyond a single agent on MMLU-Hard and that, within the configurations tested, only heterogeneous teams lower the saturation constant c.

Core claim

The central discovery is that the functional form R(N) = N_eff/N = 1/(1+c(N-1)N^{-β}) fits every tested condition at R² > 0.99, with only the parameters (c, β) varying. The exponent β determines the regime: N_eff saturates at a hard ceiling of 1/c when β=0, grows sublinearly as N^β/c when 0<β<1, and grows linearly when β≥1. A mean-field theorem predicts that peer count k and rounds τ enter only through the product kτ. The law holds at both the answer-diversity and the correctness-redundancy level, and across peer debate, self-correction, placebo, self-consistency, various models, and communication modes.
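The three regimes can be read off the large-N limit of the fitted form. The following is a short worked derivation using only the symbols defined above; it is a reader's sketch, not a reproduction of the paper's own proof.

```latex
\begin{align*}
N_{\text{eff}}(N) &= N\,R(N) = \frac{N}{1 + c\,(N-1)\,N^{-\beta}}, \qquad R(1) = 1 \\[4pt]
\beta = 0:\quad & N_{\text{eff}} = \frac{N}{1 + c\,(N-1)} \;\xrightarrow{\,N\to\infty\,}\; \frac{1}{c} \\[4pt]
0 < \beta < 1:\quad & c\,(N-1)\,N^{-\beta} \sim c\,N^{1-\beta} \to \infty
  \;\Rightarrow\; N_{\text{eff}} \sim \frac{N^{\beta}}{c} \\[4pt]
\beta = 1:\quad & c\,(N-1)\,N^{-1} \to c
  \;\Rightarrow\; N_{\text{eff}} \sim \frac{N}{1 + c} \\[4pt]
\beta > 1:\quad & c\,(N-1)\,N^{-\beta} \to 0
  \;\Rightarrow\; N_{\text{eff}} \sim N
\end{align*}
```

In the β = 0 case the ceiling depends on c alone, which is why a lower c is the only way to raise the ceiling within that regime.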

What carries the argument

The two-parameter scaling law R(N) = 1/(1 + c(N-1)N^{-β}) that measures the fraction of effective agents N_eff/N and classifies configurations by the regime exponent β.
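As a minimal sketch of how the law can be evaluated, the snippet below computes R(N) and N_eff and maps β to a regime label. The function names, the tolerance `tol`, and the parameter values in the usage lines are hypothetical illustrations, not values or code from the paper.

```python
def ringelmann_ratio(n: int, c: float, beta: float) -> float:
    """R(N) = 1 / (1 + c (N - 1) N^{-beta}), the fraction of effective agents."""
    if n < 1:
        raise ValueError("team size N must be at least 1")
    return 1.0 / (1.0 + c * (n - 1) * n ** (-beta))


def effective_size(n: int, c: float, beta: float) -> float:
    """N_eff = N * R(N)."""
    return n * ringelmann_ratio(n, c, beta)


def regime(beta: float, tol: float = 1e-9) -> str:
    """Label the asymptotic regime from the exponent beta.

    `tol` is an assumed numerical tolerance for treating a fitted beta as zero;
    negative exponents are outside the three regimes described in the text.
    """
    if beta < -tol:
        raise ValueError("beta < 0 is not covered by the three regimes")
    if beta <= tol:
        return "hard-ceiling"
    if beta < 1.0:
        return "sublinear"
    return "linear"


# Illustrative (made-up) parameters: with beta = 0 and c = 0.5 the ceiling is 1/c = 2.
n_eff_30 = effective_size(30, c=0.5, beta=0.0)
label = regime(0.0)
```

With these illustrative parameters N_eff at N = 30 is 30/15.5, already close to the ceiling 1/c = 2, which is the qualitative behaviour the hard-ceiling regime describes.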

Load-bearing premise

Peer count k and debate rounds τ influence the system only through their product kτ, and the 44 experimental cells represent the full space of configurations without bias in selection.

What would settle it

Running new experiments with different k and τ where the effect depends on them separately rather than only their product, or finding conditions where the scaling form fits with R² much lower than 0.99.
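A matched-product test of this kind could be set up roughly as follows: enumerate (k, τ) pairs that share the same product, measure the outcome for each, and check whether pairs with equal kτ agree. This is only a sketch; the function names, the `tolerance` threshold, and the example measurements are hypothetical, and the measured quantity (for instance pairwise agreement) is left to the experimenter.

```python
def matched_product_pairs(product: int, max_k: int) -> list[tuple[int, int]]:
    """All (k, tau) with k * tau == product and 1 <= k <= max_k.

    max_k is the largest usable peer count (assumed here to be at most N - 1).
    """
    return [(k, product // k) for k in range(1, max_k + 1) if product % k == 0]


def product_only_holds(
    measurements: dict[tuple[int, int], float], tolerance: float
) -> bool:
    """True if, for every value of k * tau, the measured outcomes differ by
    at most `tolerance` across the (k, tau) pairs sharing that product."""
    by_product: dict[int, list[float]] = {}
    for (k, tau), value in measurements.items():
        by_product.setdefault(k * tau, []).append(value)
    return all(max(vals) - min(vals) <= tolerance for vals in by_product.values())


# Candidate design for k * tau = 12 in a team with at most 9 peers per agent.
design = matched_product_pairs(12, max_k=9)

# Made-up outcomes, for illustration only.
consistent = product_only_holds({(2, 6): 0.80, (3, 4): 0.81, (4, 3): 0.79}, 0.05)
```

A result of `False` on real measurements, at a tolerance justified by the bootstrap noise, would count against the product-only reduction.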

Figures

Figures reproduced from arXiv: 2606.02646 by Blaž Bertalanič, Carolina Fortuna.

Figure 1: Effective team size saturates and the regime is set by collaboration mode (Qwen2.5-7B × GSM-Hard). A: Answer-level effective team size N_eff^ans vs. nominal N. Self-correction and noise placebo grow sublinearly toward N_eff ≈ 6 at N=30. Debate flattens at N_eff ≈ 1.8. The dashed line marks the independent-voter ideal N_eff = N, and the shaded gap is the departure of the observed team from that ideal. B: Fitted…
Figure 2: Structural comparison to classical Ringelmann data via the Latané power law. We fit R(N) = N^{t−1} to LLM debate and to three published human group datasets, on the same axes. The estimated exponent t quantifies the decline rate: human groups have t ∈ [0.49, 0.90] (gradual decline), LLM debate has t ≤ 0.13 (much steeper). The same form is used for both populations so the gap in estimated t is structural, not …
Figure 3: Structural comparison via the Kish design effect. Same datasets as…
Figure 4: The Ringelmann paradox: individual learning, collective stagnation. Top: Net agent-level transitions (W→C minus C→W) are positive at all team sizes, on both tasks, under both debate and self-correction. Agents do learn from interaction. Bottom: Effective team size N_eff stays close to the no-aggregation lower bound on MMLU-Hard and grows sublinearly on GSM-Hard, in both cases far below the independent-voter…
Figure 5: Small-N extrapolation of the Ringelmann scaling law on heterogeneous debate teams across MMLU-Hard, GSM-Hard, and GPQA. Green curves are the Ringelmann form R(N) = 1/(1+c(N−1)N^{−β}) estimated on small N (training range varies by task), with 95% bootstrap envelopes, extrapolated to the largest N available. The same form continues to fit, with a higher ceiling (1/c) than in the homogeneous setting. Within- v…
Figure 6: Small-team runs predict the large-team Ringelmann ceiling across models and tasks. Rows: Qwen2.5-7B (top), Llama-3.1-8B (middle), Ministral-8B (bottom). Columns: MMLU-Hard, GSM-Hard, GPQA. Black points show observed answer-diversity efficiency R(N) = N_eff/N with 95% item-level bootstrap intervals. Green curves are the Ringelmann form R(N) = 1/(1+c(N−1)N^{−β}) estimated only on N ∈ {2, 3, 5} (gray training b…
Figure 7: The (c, β) landscape across evaluated configurations. Each point is one Ringelmann estimate across 6 models plus heterogeneous teams, 3 tasks, and 3 conditions (44 estimates total, also reported in…
Figure 8: kτ collapse stratified by initial item-level agreement (Qwen2.5-7B, GSM-Hard, N=10). Curves are pairwise agreement ρ̄ vs. kτ, one per peer count k. Right panel (high agreement, ρ̄(0) > 0.7): curves nearly coincide, consistent with the theorem’s high-agreement regime. Left panel (low agreement, ρ̄(0) < 0.4): visible cross-k separation at small kτ, narrowing as kτ grows. Why the large βent does not viola…
Figure 9: Total prompt tokens per item at R3 (Qwen2.5-7B), across three tasks and two conditions.
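The Figure 2 caption compares populations through the Latané form R(N) = N^{t−1}. One simple way to estimate t from observed efficiencies is least squares in log space with the intercept fixed at zero, since the form forces R(1) = 1. The estimator actually used in the paper is not given in this excerpt, so the approach, the function name, and the synthetic data below are assumptions for illustration.

```python
import math


def fit_latane_exponent(sizes: list[int], ratios: list[float]) -> float:
    """Estimate t in R(N) = N^(t - 1) by regressing log R on log N
    through the origin (assumed estimator, not necessarily the paper's)."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(r) for r in ratios]
    sxx = sum(x * x for x in xs)
    if sxx == 0.0:
        raise ValueError("need at least one team size N > 1")
    slope = sum(x * y for x, y in zip(xs, ys)) / sxx  # slope equals t - 1
    return 1.0 + slope


# Synthetic data generated with t = 0.5, for illustration only.
sizes = [1, 2, 4, 8]
ratios = [n ** (0.5 - 1.0) for n in sizes]
t_hat = fit_latane_exponent(sizes, ratios)
```

A smaller fitted t means per-agent efficiency falls faster with team size, which is the sense in which the caption calls the LLM-debate decline steeper than the human one.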
Original abstract

Inference-time multi-agent LLM scaling lacks a shared unit: counting nominal agents conflates cost with independent evidence. We derive a two-parameter scaling law $R(N) = N_\text{eff}/N = 1/(1+c(N-1)N^{-\beta})$ where the regime exponent $\beta$ classifies any configuration into one of three asymptotic regimes: hard-ceiling at $1/c$ ($\beta = 0$), sublinear at $N^\beta/c$ ($0 < \beta < 1$), or linear ($\beta \ge 1$), and a mean-field theorem predicts that peer count $k$ and rounds $\tau$ during agent debate enter the dynamics only through their product $k\tau$. The law applies at two levels: answer diversity and correctness redundancy. Across 44 (model $\times$ task $\times$ condition) cells spanning peer debate, self-correction, random-noise placebo, self-consistency, three open-weight families (Qwen, Llama, Ministral) at scales from 7B to 32B with a frontier API check (Gemini), thinking models, heterogeneous teams, and sparse communication, the functional form fits every condition at $R^2 > 0.99$; only $(c, \beta)$ shifts. On free-form math, dense peer influence collapses the answer-level regime from sublinear into hard-ceiling; correctness-level fits remain hard-ceiling throughout. Three findings have practical implications. (i) Thirty dense debating agents produce no more answer diversity than one on MMLU-Hard. (ii) A noise placebo tracks self-correction on free-form math and at $4\times$ scale, so within homogeneous teams the gain commonly attributed to "debate" comes from re-evaluation, not peer content. (iii) A single $N \le 5$ pilot predicts the $N=30$ structural ceiling, and within the configurations tested only architectural diversity (heterogeneous teams) lowers $c$ and escapes the hard-ceiling regime, communication-mode interventions do not.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper derives a two-parameter scaling law R(N) = N_eff/N = 1/(1 + c(N-1)N^{-β}) for effective team size in multi-agent LLM systems. It classifies configurations into hard-ceiling, sublinear, or linear regimes via the exponent β, and presents a mean-field theorem asserting that peer count k and debate rounds τ enter the dynamics only through the product kτ. The law is reported to fit answer diversity and correctness redundancy across 44 cells (spanning models, tasks, debate, self-correction, heterogeneous teams, and communication modes) at R² > 0.99, with only (c, β) varying; practical claims include that 30 dense agents add no diversity beyond one on MMLU-Hard and that heterogeneity lowers c while communication interventions do not.

Significance. If the central functional form and mean-field reduction hold after verification, the work supplies a compact, falsifiable description of diminishing returns in LLM teams, directly linking observable parameters (k, τ) to measurable effective size. The consistent high-R² fits across open-weight families, frontier models, and free-form vs. multiple-choice tasks, together with the explicit regime classification, would constitute a useful quantitative tool for system design; the observation that a small-N pilot predicts large-N ceilings is particularly actionable.

major comments (1)
  1. [Mean-field theorem and experimental design] The mean-field theorem (stated in the abstract and used to justify universality across the 44 cells) asserts that k and τ contribute only via their product kτ. No experiment is described that holds kτ fixed while varying k and τ independently; without such a test, finite-size effects or round-wise correlations could produce different N_eff for the same product, rendering the claimed two-parameter universality dependent on the particular (k, τ) sampling chosen for the cells rather than a general derivation.
minor comments (1)
  1. [Abstract and methods] The abstract states that the functional form applies at both answer-diversity and correctness-redundancy levels, yet the precise operational definitions of N_eff for each level (e.g., how diversity is quantified, how redundancy is measured) are not restated in the provided excerpt; a short methods paragraph clarifying these quantities would aid reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for highlighting this important point on the empirical support for the mean-field theorem. We respond to the major comment below and agree that additional clarification is warranted.

Point-by-point responses
  1. Referee: The mean-field theorem (stated in the abstract and used to justify universality across the 44 cells) asserts that k and τ contribute only via their product kτ. No experiment is described that holds kτ fixed while varying k and τ independently; without such a test, finite-size effects or round-wise correlations could produce different N_eff for the same product, rendering the claimed two-parameter universality dependent on the particular (k, τ) sampling chosen for the cells rather than a general derivation.

    Authors: We agree that a direct experimental test holding the product kτ fixed while independently varying k and τ would strengthen the claim. The mean-field theorem is a theoretical derivation from a mean-field approximation of the interaction process, in which the effective peer influence depends on the total number of interactions kτ. Although the 44 cells include a range of (k, τ) pairs and yield consistent high-R² fits, they were not explicitly sampled to hold kτ constant. We will revise the manuscript to (i) clarify that the reduction is a theoretical prediction whose empirical support is indirect, (ii) explicitly note the absence of a matched-product control as a limitation, and (iii) outline a targeted follow-up experiment. This revision does not alter the reported scaling-law fits or regime classifications but improves the interpretation of universality.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation presented as independent of fitted values

Full rationale

The paper states it derives the scaling law R(N) from a mean-field theorem reducing k and τ to the product kτ, then reports that the resulting two-parameter form fits all 44 cells at R² > 0.99 with only (c, β) varying. No equation or step is exhibited that reduces the claimed derivation or the mean-field prediction to the experimental fits by construction, nor is any self-citation used as load-bearing justification. The regime classification follows directly from the value of the fitted exponent β rather than presupposing the outcome, and the experimental validation remains external to the derivation step itself. The central claim therefore retains independent content from the data.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

Two free parameters c and β are introduced and fitted per configuration to capture the strength and shape of the saturation effect; the mean-field reduction of k and τ to their product is an imported domain assumption from physics-style modeling.

free parameters (2)
  • c
    Controls the magnitude of the saturation effect and is fitted separately for each model-task-condition cell.
  • β
    Exponent that determines the asymptotic regime and is fitted separately for each model-task-condition cell.
axioms (1)
  • Mean-field theorem (domain assumption): peer count k and rounds τ enter the dynamics only through their product kτ
    Used to simplify the multi-agent interaction model into the two-parameter scaling law.
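To make the per-cell fitting of c and β concrete, here is a minimal pure-Python sketch. It assumes a grid search over β; for each candidate β it solves for c in closed form using the linearised relation 1/R − 1 = c (N−1) N^{−β}, then scores the candidate by squared error in R. The paper's actual fitting procedure is not described in this excerpt, so this method, the function names, the grid, and the synthetic data are all assumptions.

```python
def ratio(n: int, c: float, beta: float) -> float:
    """R(N) = 1 / (1 + c (N - 1) N^{-beta})."""
    return 1.0 / (1.0 + c * (n - 1) * n ** (-beta))


def fit_c_beta(sizes, ratios, beta_grid):
    """Grid search over beta; c from least squares on 1/R - 1 = c * x
    with x = (N - 1) N^{-beta}. Observed ratios must lie in (0, 1]."""
    best = None
    for beta in beta_grid:
        xs = [(n - 1) * n ** (-beta) for n in sizes]
        ys = [1.0 / r - 1.0 for r in ratios]
        sxx = sum(x * x for x in xs)
        if sxx == 0.0:
            continue  # only N = 1 present: c is not identifiable
        c = sum(x * y for x, y in zip(xs, ys)) / sxx
        sse = sum((r - ratio(n, c, beta)) ** 2 for n, r in zip(sizes, ratios))
        if best is None or sse < best[0]:
            best = (sse, c, beta)
    if best is None:
        raise ValueError("no team size N > 1 in the data")
    return best[1], best[2]


def r_squared(sizes, ratios, c, beta):
    """Coefficient of determination of the fitted curve in R space."""
    mean = sum(ratios) / len(ratios)
    ss_tot = sum((r - mean) ** 2 for r in ratios)
    ss_res = sum((r - ratio(n, c, beta)) ** 2 for n, r in zip(sizes, ratios))
    return 1.0 - ss_res / ss_tot


# Synthetic, noise-free data generated with c = 0.4 and beta = 0.3 (made-up values).
beta_grid = [i / 100 for i in range(151)]
all_sizes = [2, 3, 5, 10, 20, 30]
observed = [ratio(n, 0.4, 0.3) for n in all_sizes]
c_hat, beta_hat = fit_c_beta(all_sizes, observed, beta_grid)

# Pilot-style fit on N in {2, 3, 5}, then extrapolation to N = 30.
c_pilot, beta_pilot = fit_c_beta(all_sizes[:3], observed[:3], beta_grid)
predicted_r30 = ratio(30, c_pilot, beta_pilot)
```

The pilot lines mirror the small-N extrapolation described in the Figure 6 caption. On noise-free synthetic data the recovery is trivially good; with real measurements, two parameters estimated from three team sizes leave little room to detect misfit, so bootstrap intervals of the kind the captions mention would matter.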


