arxiv: 2606.05670 · v1 · pith:DLHTFXGOnew · submitted 2026-06-04 · 💻 cs.AI

Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

Pith reviewed 2026-06-28 01:40 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · multi-agent systems · evaluation framework · benchmark evaluation · accuracy-cost trade-off · single vs multi-agent · protocol alignment · GAIA benchmark

The pith

Controlled tests show most multi-agent LLM workflows trail single-agent baselines on average accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BenchAgent to place single-agent, fixed multi-agent, and evolving multi-agent workflows under one shared benchmark loader, tool access, answer contract, usage accounting, and trajectory logging. On ten reasoning, coding, and tool-use benchmarks run with GPT-4.1, five of six multi-agent systems fall 2.56 to 11.29 points below the matched single-agent anchor on benchmark-balanced average accuracy and sit at worse accuracy-cost points. EvoAgent stays within the one-run Wilson interval of the single-agent result. A separate protocol-aligned external evaluation on GAIA finds that a runtime-generated workflow reaches 66.72 percent overall and 69.23 percent on Level 3, more than 20 points above the strongest fixed multi-agent baseline.

Core claim

Under standardized implementation conditions, at most one of six tested multi-agent systems exceeds the performance of a matched single-agent system on benchmark-balanced average accuracy, with the others trailing by 2.56-11.29 points and showing inferior accuracy-cost trade-offs. On the protocol-aligned external GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72 percent overall and 69.23 percent on Level 3, exceeding the strongest fixed multi-agent baseline by more than 20 points.
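A minimal sketch of how a benchmark-balanced average differs from a pooled one, assuming "benchmark-balanced" means every benchmark contributes equally regardless of size. The function names, benchmark names, and counts below are hypothetical illustrations, not the paper's code or data.

```python
from statistics import mean

# Hypothetical per-benchmark results as (correct, total); not figures from the paper.
single = {"bench_a": (80, 100), "bench_b": (300, 1000)}
mas = {"bench_a": (70, 100), "bench_b": (320, 1000)}


def balanced_average(per_benchmark):
    """Mean of per-benchmark accuracies in percent; each benchmark weighs the same."""
    return 100.0 * mean(correct / total for correct, total in per_benchmark.values())


def pooled_accuracy(per_benchmark):
    """Accuracy over all instances in percent; large benchmarks dominate."""
    correct = sum(c for c, _ in per_benchmark.values())
    total = sum(t for _, t in per_benchmark.values())
    return 100.0 * correct / total


def lift(method, anchor):
    """Balanced-average difference in points relative to the anchor (negative = trails)."""
    return balanced_average(method) - balanced_average(anchor)


print(f"balanced lift: {lift(mas, single):+.2f} points")
print(f"pooled gap:    {pooled_accuracy(mas) - pooled_accuracy(single):+.2f} points")
```

With these made-up counts the multi-agent system trails on the balanced average yet leads on pooled accuracy, which is why the aggregation rule has to be stated alongside any point gap.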

What carries the argument

BenchAgent, the evaluation framework that normalizes single-agent, fixed multi-agent, and evolving multi-agent workflows under identical execution and logging protocols.
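The idea of a shared substrate can be sketched as follows. This is an illustrative toy, not BenchAgent's API: `Context`, `Workflow`, `SingleAgent`, `TwoStepPipeline`, `evaluate`, and the `calc` tool are all hypothetical names, and word counts stand in for real token accounting. Only the workflow object changes between runs; tool access, usage accounting, logging, and scoring are fixed.

```python
from dataclasses import dataclass, field
from typing import Callable, Protocol


@dataclass
class Context:
    """Shared services handed to every workflow (hypothetical)."""
    tools: dict[str, Callable[[str], str]]
    tokens: int = 0
    trajectory: list[str] = field(default_factory=list)

    def call_tool(self, name: str, arg: str) -> str:
        result = self.tools[name](arg)
        # Stand-in for token accounting: count whitespace-separated words.
        self.tokens += len(arg.split()) + len(result.split())
        self.trajectory.append(f"{name}({arg!r}) -> {result!r}")
        return result


class Workflow(Protocol):
    def solve(self, question: str, ctx: Context) -> str: ...


class SingleAgent:
    def solve(self, question: str, ctx: Context) -> str:
        return ctx.call_tool("calc", question)


class TwoStepPipeline:
    """Toy multi-agent flow: a second step re-derives the first step's answer."""
    def solve(self, question: str, ctx: Context) -> str:
        draft = ctx.call_tool("calc", question)
        check = ctx.call_tool("calc", question)
        return draft if draft == check else check


def evaluate(workflow: Workflow, instances, tools) -> dict[str, float]:
    """Same loader, answer contract (exact match), and accounting for every workflow."""
    correct, tokens = 0, 0
    for question, gold in instances:
        ctx = Context(tools=tools)
        correct += workflow.solve(question, ctx).strip() == gold
        tokens += ctx.tokens
    n = len(instances)
    return {"accuracy": correct / n, "mean_tokens": tokens / n}


tools = {"calc": lambda expr: str(sum(int(x) for x in expr.split("+")))}
instances = [("1+2", "3"), ("10+5", "15")]
print(evaluate(SingleAgent(), instances, tools))
print(evaluate(TwoStepPipeline(), instances, tools))
```

In this toy both workflows reach the same accuracy while the two-step pipeline spends twice the accounted usage, the kind of difference a shared accounting layer makes visible.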

If this is right

  • Five of the six multi-agent systems occupy strictly worse accuracy-cost positions than their single-agent anchors.
  • EvoAgent is the sole multi-agent system whose accuracy lies inside the statistical guidance interval of the single-agent baseline.
  • A runtime-generated workflow can exceed fixed multi-agent performance on the GAIA benchmark under protocol-aligned external conditions.
  • Benchmark-balanced average accuracy and cost must both be reported to evaluate whether added agents deliver net benefit.
  • Protocol differences between single-agent and multi-agent substrates must be eliminated before attributing performance gaps to agent count.
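The notion of a "strictly worse accuracy-cost position" can be made concrete with a Pareto dominance check. The sketch below is an assumption about how such a comparison might be coded, with hypothetical system names and numbers rather than values from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Point:
    name: str
    accuracy: float  # percent, higher is better
    cost: float      # e.g. mean tokens per instance, lower is better


def dominates(a: Point, b: Point) -> bool:
    """True if a is at least as good as b on both axes and better on one."""
    no_worse = a.accuracy >= b.accuracy and a.cost <= b.cost
    better = a.accuracy > b.accuracy or a.cost < b.cost
    return no_worse and better


def pareto_front(points: list[Point]) -> list[Point]:
    """Points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical values for illustration only.
points = [
    Point("single_agent", 60.0, 10_000),
    Point("mas_a", 55.0, 40_000),   # lower accuracy, higher cost: dominated
    Point("mas_b", 61.0, 50_000),   # higher accuracy at higher cost: on the front
]
print([p.name for p in pareto_front(points)])
```

A system such as `mas_a` here is dominated by the single-agent point, while `mas_b` is not: it buys a small accuracy gain at a higher cost, so whether it is worthwhile depends on the budget.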

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Optimizing a single agent may be a higher-leverage first step than adding more agents for many reasoning and tool-use tasks.
  • The results raise the possibility that literature gains attributed to multi-agent designs partly reflect uneven optimization effort rather than the presence of multiple agents.
  • Future protocol designs could include explicit single-agent ablation arms as a required control.
  • The GAIA result suggests that dynamic, runtime-generated agent graphs warrant separate study from fixed multi-agent topologies.

Load-bearing premise

The single-agent implementations are configured and optimized to the same standard as the multi-agent ones inside the shared BenchAgent protocol.

What would settle it

Re-run any of the five underperforming multi-agent systems against a single-agent counterpart that receives identical prompt engineering, tool-call budgets, and iteration limits. If the multi-agent system still trails, the result is consistent with the claim; if the gap closes or reverses, the claim is falsified.

Figures

Figures reproduced from arXiv: 2606.05670 by Bing Luo, Huiyu Zheng, Jiaqi Shao, Ruishan Fang, Tao Lin, Yuhang Fu, Zhengtao Zhu.

Figure 1: BenchAgent compares workflow paradigms under a shared evaluation substrate. Benchmark instances enter the same loading, tool-access, accounting, logging, and evaluation interfaces, while the workflow layer varies across single-agent, fixed MAS, dynamic/evolving MAS, and externally evaluated runtime-generated workflows. Within the substrate-internal setting, this design isolates workflow lift from differenc…
Figure 2: Accuracy–cost and accuracy–time trade-offs under SI conditions. Left: benchmark-balanced average accuracy against instance-level average end-to-end token usage. Right: benchmark-balanced average accuracy against instance-level average execution time. Each point represents one workflow, and the dashed line traces the empirical Pareto front under this descriptive aggregation.
Figure 3: Same-instance GAIA contrast: EvoAgent vs. CC-workflow. EvoAgent loses a task-critical constraint during linear decomposition, whereas the retained CC-workflow trace preserves intermediate state and verifies before finalization.
Figure 4: Controlled same-task workflow comparison between a fixed MAS pipeline and a runtime-generated …
Figure 5: Supplementary matched-success GAIA case. Both EvoAgent and Claude Code eventually solve the …
Figure 6: Main Experiment 1 method-level summaries. The panels visualize the summary rows of Table …
Figure 7: Benchmark-by-method accuracy heatmap for Main Experiment 1. Each cell reports benchmark-level …
Figure 8: Workflow lift relative to the Single Agent baseline in Main Experiment 1. Green cells indicate positive …
Figure 9: Per-benchmark method comparison for Main Experiment 1. Each panel shows the pass@1 accuracy …
Figure 10: Per-workflow benchmark-level lift profiles relative to the Single Agent baseline. Points to the right of …
Original abstract

Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol. BenchAgent evaluates these substrate-internal workflows across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, and separately reports a Protocol-Aligned External (PAE) GAIA study of a runtime-generated workflow. Under SI conditions, at most one of six tested MAS exceeds the matched single-agent anchor on benchmark-balanced average accuracy: EvoAgent lies within the Wilson one-run guidance, while the remaining five trail by 2.56-11.29 points and occupy more expensive accuracy-cost trade-offs. On the PAE GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall and 69.23% on Level 3, more than 20 points above the strongest non-Claude baseline, Jarvis, a fixed MAS.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces BenchAgent, a normalized evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under identical benchmark loaders, tool access, answer contracts, usage accounting, and trajectory logging. It reports results across ten reasoning, coding, and tool-use benchmarks with GPT-4.1, plus a Protocol-Aligned External (PAE) GAIA snapshot. The paper claims that under substrate-internal (SI) conditions at most one of six tested MAS (EvoAgent) matches or exceeds the matched single-agent anchor on benchmark-balanced average accuracy, while the other five trail by 2.56–11.29 points and incur higher costs. On the GAIA snapshot, a Claude-Code-style runtime workflow reaches 66.72% overall.

Significance. If the single-agent baselines receive equivalent optimization effort, the controlled comparison supplies direct empirical evidence that simply increasing agent count does not improve accuracy and often worsens the accuracy-cost trade-off. The use of Wilson one-run guidance for statistical reference and the explicit protocol normalization are strengths that make the measurements more reproducible than typical ad-hoc MAS evaluations.

major comments (2)
  1. [§3 (BenchAgent protocol) and §4 (benchmark results)] The central comparative claim (abstract and §4 results) requires that single-agent implementations were tuned to the same hyper-parameter and prompt-engineering standard as the MAS variants inside BenchAgent. The manuscript normalizes loader, tools, logging and accounting but does not report iteration counts, prompt-search effort, or ablation results showing that the single-agent code paths received comparable optimization; without this, the reported 2.56–11.29 point deficits cannot be unambiguously attributed to agent count rather than substrate treatment disparity.
  2. [Table 2 and associated Wilson guidance paragraph] Table 2 (or equivalent benchmark-balanced average table) reports EvoAgent within Wilson guidance while the other five MAS trail; however, the paper does not provide per-benchmark variance or the exact number of runs underlying the Wilson interval, making it impossible to assess whether the “within guidance” statement for EvoAgent is robust or sensitive to run count.
minor comments (2)
  1. [§5 (PAE GAIA study)] The GAIA PAE snapshot is described as “runtime-generated” but the exact generation prompt and stopping criteria are not listed; adding them would improve reproducibility.
  2. [Figure 2] Figure 2 (accuracy-cost scatter) uses different marker styles for single-agent vs MAS but the legend does not explicitly state that all points share the same underlying model (GPT-4.1) and tool set; a one-sentence clarification would remove ambiguity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and for emphasizing the need for unambiguous attribution in controlled comparisons. We address each major comment below and indicate planned revisions where appropriate.

  1. Referee: [§3 (BenchAgent protocol) and §4 (benchmark results)] The central comparative claim (abstract and §4 results) requires that single-agent implementations were tuned to the same hyper-parameter and prompt-engineering standard as the MAS variants inside BenchAgent. The manuscript normalizes loader, tools, logging and accounting but does not report iteration counts, prompt-search effort, or ablation results showing that the single-agent code paths received comparable optimization; without this, the reported 2.56–11.29 point deficits cannot be unambiguously attributed to agent count rather than substrate treatment disparity.

    Authors: We agree that explicit documentation of optimization effort strengthens causal attribution. Under BenchAgent, the single-agent baselines use the canonical prompts and hyper-parameters supplied by each benchmark (or their reference implementations), modified only to satisfy the shared loader, tool schema, answer contract, and logging interface. The six MAS variants were ported from their original papers into the identical substrate, with prompt adjustments limited to the same answer format and tool-calling contract; no additional hyper-parameter sweeps or prompt-search loops were performed on either side beyond what was required for protocol compliance. This design choice keeps the comparison focused on the effect of adding agent coordination layers rather than on differential tuning investment. Nevertheless, we acknowledge that the manuscript does not quantify iteration counts or present an ablation of prompt-engineering effort. In the revision we will add a dedicated paragraph in §3 describing the exact adaptation steps taken for single-agent versus MAS prompts and will state that no further optimization was applied to either class. We will also note this as a limitation and suggest that future work could include matched tuning budgets. Revision: partial.

  2. Referee: [Table 2 and associated Wilson guidance paragraph] Table 2 (or equivalent benchmark-balanced average table) reports EvoAgent within Wilson guidance while the other five MAS trail; however, the paper does not provide per-benchmark variance or the exact number of runs underlying the Wilson interval, making it impossible to assess whether the “within guidance” statement for EvoAgent is robust or sensitive to run count.

    Authors: All results reported in the manuscript, including the single-agent anchor and the six MAS systems, were obtained from exactly one run per system per benchmark; the Wilson one-run guidance is therefore computed from a single run (n = 1) for every entry. The guidance serves as a conservative statistical reference indicating whether an observed difference lies inside the binomial sampling interval expected from a single run. We will revise the manuscript to (i) state the run count explicitly in the caption of Table 2 and in the Wilson paragraph, (ii) move the full per-benchmark accuracy matrix to an appendix so readers can recompute variances or confidence intervals themselves, and (iii) add a short sensitivity note confirming that the “within guidance” classification for EvoAgent holds under the reported single-run protocol. These changes make the statistical claim fully auditable without altering the experimental design. Revision: yes.
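For readers who want to recompute such intervals, the standard Wilson score interval for a binomial proportion can be sketched as below. How the paper builds its "one-run guidance" from this interval is not specified here; treating `n` as the number of scored instances in a single run is an assumption, and the counts used are hypothetical.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1.0 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Hypothetical single run: 50 of 100 instances correct.
low, high = wilson_interval(50, 100)
print(f"accuracy 50.0%, interval [{100 * low:.1f}%, {100 * high:.1f}%]")


def within_guidance(other_accuracy: float, successes: int, n: int) -> bool:
    """True if another system's accuracy (as a fraction) falls inside the anchor's interval."""
    lo, hi = wilson_interval(successes, n)
    return lo <= other_accuracy <= hi
```

The interval narrows as `n` grows, so a "within guidance" verdict depends on how many instances back each accuracy figure, which is why the run and instance counts need to be reported.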

Circularity Check

0 steps flagged

No circularity: direct empirical measurement under controlled protocol


The paper reports benchmark accuracy and cost measurements for single-agent and multi-agent workflows executed under the shared BenchAgent protocol. No equations, fitted parameters, derivations, or self-citation chains appear in the provided text. All central claims are stated as observed outcomes on ten benchmarks plus one PAE GAIA snapshot; they do not reduce to prior fitted values or self-referential definitions. The skeptic concern about optimization parity is a methodological question about experimental controls, not a circularity in any derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on domain assumptions about benchmark representativeness and protocol neutrality rather than new mathematical axioms, fitted parameters, or invented entities.

axioms (1)
  • Domain assumption: The ten reasoning, coding, and tool-use benchmarks plus GAIA are valid and representative measures for evaluating LLM agent performance.
    The comparative claims depend on these benchmarks capturing the relevant capabilities without systematic bias toward single or multi-agent setups.

