Recognition: 2 theorem links
Why Do Multi-Agent LLM Systems Fail?
Pith reviewed 2026-05-12 05:38 UTC · model grok-4.3
The pith
Multi-agent LLM systems fail in 14 distinct modes that cluster into design flaws, agent misalignment, and verification gaps.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Systematic analysis of failure traces across multiple frameworks and models shows that multi-agent LLM systems exhibit 14 unique failure modes, which cluster into the three categories of system design issues, inter-agent misalignment, and task verification, with high consistency among annotators.
What carries the argument
The Multi-Agent System Failure Taxonomy (MAST), a structured classification of 14 failure modes into three categories that enables consistent identification and analysis of why these systems underperform.
If this is right
- Performance gaps on benchmarks can be reduced by redesigning systems to avoid the identified failure modes.
- The automated LLM judge enables efficient labeling of new traces while maintaining reliability close to human levels.
- Releasing the full dataset allows other researchers to test and extend the taxonomy on additional models and tasks.
- The patterns indicate that current multi-agent approaches need more advanced coordination and verification mechanisms.
Where Pith is reading between the lines
- The taxonomy could serve as a quick diagnostic checklist when testing new multi-agent frameworks before full deployment.
- Agents might be trained to monitor their own interactions for signs of the three main failure categories and self-correct.
- Similar clustering approaches could apply to failure analysis in other distributed AI systems beyond language models.
Load-bearing premise
The 150 traces examined by experts are representative of the full range of failures that occur across different models, tasks, and multi-agent frameworks.
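Whether the 150 traces are representative hinges on how they were drawn, which is exactly what the referee report below flags. A minimal sketch of one defensible procedure, proportional stratified sampling over a trace attribute such as framework; the attribute names and corpus counts here are illustrative, not from the paper:

```python
import random
from collections import defaultdict

def stratified_sample(traces, key, n_total, seed=0):
    """Proportional stratified sample of n_total traces over strata
    defined by trace[key] (e.g. framework, model, or task)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for t in traces:
        strata[t[key]].append(t)
    sample = []
    for _, group in sorted(strata.items()):
        # Allocate slots proportionally to stratum size, at least one each.
        k = max(1, round(n_total * len(group) / len(traces)))
        sample.extend(rng.sample(group, min(k, len(group))))
    if len(sample) < n_total:  # top up after rounding shortfall
        chosen = set(map(id, sample))
        rest = [t for t in traces if id(t) not in chosen]
        sample.extend(rng.sample(rest, n_total - len(sample)))
    return sample[:n_total]

# Illustrative corpus: 1600 traces over 7 hypothetical frameworks,
# sampled down to the paper's 150-trace annotation budget.
traces = [{"id": i, "framework": f"fw{i % 7}"} for i in range(1600)]
subset = stratified_sample(traces, "framework", 150)
```

Documenting a procedure of this shape (or whatever was actually used) is what would let readers evaluate the premise directly.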
What would settle it
A fresh collection of multi-agent failure traces in which most cases cannot be assigned to any of the 14 modes or in which independent expert annotators show low agreement on the categories.
read the original abstract
Despite enthusiasm for Multi-Agent LLM Systems (MAS), their performance gains on popular benchmarks are often minimal. This gap highlights a critical need for a principled understanding of why MAS fail. Addressing this question requires systematic identification and analysis of failure patterns. We introduce MAST-Data, a comprehensive dataset of 1600+ annotated traces collected across 7 popular MAS frameworks. MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems. To enable systematic classification of failures for MAST-Data, we build the first Multi-Agent System Failure Taxonomy (MAST). We develop MAST through rigorous analysis of 150 traces, guided closely by expert human annotators and validated by high inter-annotator agreement (kappa = 0.88). This process identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification. To enable scalable annotation, we develop an LLM-as-a-Judge pipeline with high agreement with human annotations. We leverage MAST and MAST-Data to analyze failure patterns across models (GPT4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), demonstrating improvement headrooms from better MAS design. Our analysis provides insights revealing that identified failures require more sophisticated solutions, highlighting a clear roadmap for future research. We publicly release our comprehensive dataset (MAST-Data), the MAST, and our LLM annotator to facilitate widespread research and development in MAS.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces MAST-Data, a dataset of over 1600 annotated failure traces from multi-agent LLM systems (MAS) across 7 frameworks, and the MAST taxonomy of 14 failure modes clustered into three categories (system design issues, inter-agent misalignment, task verification). MAST is constructed bottom-up from expert human analysis of 150 traces (κ=0.88), scaled via an LLM-as-a-Judge pipeline, and used to examine failure patterns across models (GPT-4, Claude 3, Qwen2.5, CodeLlama) and tasks (coding, math, general agent), with public release of the dataset, taxonomy, and annotator to guide future MAS design.
Significance. If the taxonomy is representative and the annotations reliable, the work provides a valuable public resource for systematically understanding MAS failures, which is timely given the minimal benchmark gains often observed. The bottom-up construction with reported agreement metrics, cross-model/task analysis, and open release of data/tools could directly support more robust system design and serve as a foundation for subsequent research.
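The reported κ = 0.88 is Cohen's kappa: observed agreement corrected for the agreement two annotators would reach by chance. A minimal sketch of the computation; the example labels are illustrative, not the paper's annotations:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """kappa = (p_o - p_e) / (1 - p_e): observed agreement p_o,
    corrected for chance agreement p_e derived from each rater's
    marginal label frequencies."""
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    cats = set(labels_a) | set(labels_b)
    p_e = sum(freq_a[c] * freq_b[c] for c in cats) / n**2
    return (p_o - p_e) / (1 - p_e)

# Illustrative annotations over the paper's three categories (not real data).
rater_1 = ["design", "misalign", "verify", "design", "verify", "misalign"]
rater_2 = ["design", "misalign", "verify", "misalign", "verify", "misalign"]
kappa = cohens_kappa(rater_1, rater_2)  # 0.75 on this toy example
```

A κ of 0.88 on a three-category task indicates near-perfect agreement under conventional benchmarks, which is why the report treats the initial 150-trace annotation as reliable.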
major comments (2)
- [Taxonomy construction (Section 3 / Methods, as described in abstract)] The construction of MAST relies exclusively on analysis of 150 traces, yet the manuscript provides no description of the sampling procedure (e.g., random, stratified by framework/model/task, or convenience sampling). This is load-bearing for the central claim that the 14 modes and three-category clustering capture dominant failure dynamics, as non-representative selection could omit high-impact modes or over-represent others, undermining both the taxonomy's validity and the utility of MAST-Data for guiding future systems.
- [LLM-as-a-Judge pipeline and dataset annotation (Section 4)] While high inter-annotator agreement (κ=0.88) is reported for the initial 150 traces, the manuscript does not provide quantitative validation metrics (e.g., per-category agreement, confusion matrices, or error rates) for the LLM-as-a-Judge pipeline on the remaining 1450+ traces. This weakens the reliability of the full dataset's failure distributions and cross-model/task analyses.
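The per-category metrics requested above are straightforward to compute once human and LLM-judge labels are paired per trace. A sketch, with the three MAST category names abbreviated and the labels purely illustrative:

```python
def confusion_matrix(human, judge, categories):
    """m[h][j] counts traces the human labeled h and the judge labeled j."""
    m = {h: {j: 0 for j in categories} for h in categories}
    for h, j in zip(human, judge):
        m[h][j] += 1
    return m

def per_category_recall(m):
    """For each category: fraction of human-labeled traces the judge recovers."""
    out = {}
    for c, row in m.items():
        total = sum(row.values())
        out[c] = row[c] / total if total else 0.0
    return out

cats = ["design", "misalign", "verify"]  # MAST's three categories, abbreviated
human = ["design", "design", "misalign", "verify", "verify", "misalign"]
judge = ["design", "misalign", "misalign", "verify", "verify", "misalign"]
m = confusion_matrix(human, judge, cats)
recall = per_category_recall(m)  # here the judge misses one design trace
```

Reporting this breakdown on a held-out human-labeled subset would show whether the judge's errors concentrate in particular categories, which an aggregate agreement number hides.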
minor comments (2)
- [Introduction / Related Work] The abstract claims MAST-Data is 'the first' such dataset, but the manuscript should include a brief related-work comparison table to substantiate this novelty claim against prior MAS failure analyses.
- [Dataset description (Section 2)] Exact counts and breakdowns (by framework, model, task) for the 1600+ traces and the 14 modes should be reported in a table for transparency, rather than relying on '1600+' and '14 unique modes' phrasing.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point-by-point below. Where the comments identify gaps in the current description, we will revise the manuscript to incorporate the requested details and strengthen the presentation of our methods and validation.
read point-by-point responses
-
Referee: [Taxonomy construction (Section 3 / Methods, as described in abstract)] The construction of MAST relies exclusively on analysis of 150 traces, yet the manuscript provides no description of the sampling procedure (e.g., random, stratified by framework/model/task, or convenience sampling). This is load-bearing for the central claim that the 14 modes and three-category clustering capture dominant failure dynamics, as non-representative selection could omit high-impact modes or over-represent others, undermining both the taxonomy's validity and the utility of MAST-Data for guiding future systems.
Authors: We agree that an explicit description of the sampling procedure for the 150 expert-annotated traces is necessary to support claims about the taxonomy's coverage. The traces were drawn from the full collection to span the seven frameworks, multiple models, and task categories, prioritizing diversity in observed failure behaviors during initial data exploration. We will revise Section 3 to document the exact selection process, including any stratification or inclusion criteria used, so readers can evaluate representativeness directly. revision: yes
-
Referee: [LLM-as-a-Judge pipeline and dataset annotation (Section 4)] While high inter-annotator agreement (κ=0.88) is reported for the initial 150 traces, the manuscript does not provide quantitative validation metrics (e.g., per-category agreement, confusion matrices, or error rates) for the LLM-as-a-Judge pipeline on the remaining 1450+ traces. This weakens the reliability of the full dataset's failure distributions and cross-model/task analyses.
Authors: We appreciate this observation regarding the need for more granular validation of the LLM-as-a-Judge pipeline. The manuscript notes high agreement with human annotations but does not report per-category metrics, confusion matrices, or error rates on the scaled portion of the data. We will add these quantitative validation results in a revised Section 4, including agreement statistics on a held-out validation subset and any error analysis, to better substantiate the reliability of the full MAST-Data distributions and downstream analyses. revision: yes
Circularity Check
No circularity: taxonomy derived bottom-up from annotated traces
full rationale
The paper builds the MAST taxonomy inductively via human analysis of 150 traces, followed by inter-annotator validation (κ = 0.88) and an LLM-as-a-Judge pipeline calibrated to those annotations. No equations, fitted parameters, self-referential definitions, or self-citation chains reduce any claim to its own inputs by construction. The 14-mode classification and three-category clustering are presented as empirical outputs of the trace analysis rather than predictions or first-principles results equivalent to the input data. The absence of a described sampling procedure for the 150 traces affects representativeness but does not constitute circularity under the defined criteria.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The 150 traces analyzed by experts are representative of failure patterns across the 7 MAS frameworks and chosen tasks.
- domain assumption: Expert human annotators can consistently identify and categorize failure modes with high reliability.
invented entities (2)
- Multi-Agent System Failure Taxonomy (MAST) with 14 modes in 3 categories (no independent evidence)
- LLM-as-a-Judge annotation pipeline (no independent evidence)
Lean theorems connected to this paper
- LawOfExistence · defect_zero_iff_one (unclear): "We develop MAST through rigorous analysis of 150 traces... identifies 14 unique modes, clustered into 3 categories: (i) system design issues, (ii) inter-agent misalignment, and (iii) task verification."
- HierarchyEmergence · hierarchy_emergence_forces_phi (unclear): "MAST-Data is the first multi-agent system dataset to outline the failure dynamics in MAS for guiding the development of better future systems."
Forward citations
Cited by 38 Pith papers
-
Hackers or Hallucinators? A Comprehensive Analysis of LLM-Based Automated Penetration Testing
The first SoK on LLM-based AutoPT frameworks provides a six-dimension taxonomy of agent designs and a unified empirical benchmark evaluating 15 frameworks via over 10 billion tokens and 1,500 manually reviewed logs.
-
AgentLens: Revealing The Lucky Pass Problem in SWE-Agent Evaluation
10.7% of passing SWE-agent trajectories are Lucky Passes with chaotic behaviors, and a quality score based on process references changes model rankings across eight backends.
-
AgentForesight: Online Auditing for Early Failure Prediction in Multi-Agent Systems
AgentForesight trains a 7B model to perform online auditing of multi-agent LLM trajectories, detecting early decisive errors and outperforming larger models on custom and external benchmarks.
-
TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
TraceFix repairs LLM-generated multi-agent protocols via TLA+ counterexamples to achieve full verification on all tested tasks and higher completion rates than prompt-only baselines.
-
TeamBench: Evaluating Agent Coordination under Enforced Role Separation
Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.
-
Improving the Efficiency of Language Agent Teams with Adaptive Task Graphs
LATTE coordinates LLM agent teams with an evolving shared task graph, cutting token use, time, and failures while matching or beating accuracy of MetaGPT, leader-worker, and static methods.
-
Inference-Time Budget Control for LLM Search Agents
A VOI-based controller for dual inference budgets improves multi-hop QA performance by prioritizing search actions and selectively finalizing answers.
-
Measuring the Unmeasurable: Markov Chain Reliability for LLM Agents
TraceToChain models LLM agent traces as absorbing DTMCs using automatic clustering and smoothed MLE, with KS and AIC validation, to reconcile pass@k, pass^k, and RDC as projections of a single first-passage success-ti...
-
AgentEval: DAG-Structured Step-Level Evaluation for Agentic Workflows with Error Propagation Tracking
AgentEval evaluates agentic workflows via DAGs with step metrics, a 21-category failure taxonomy, and error propagation tracking, yielding 2.17x higher failure recall than end-to-end methods and strong human agreement.
-
Learning to Interrupt in Language-based Multi-agent Communication
HANDRAISER learns optimal interruption points in multi-agent LLM communication using estimated future reward and cost, achieving 32.2% lower communication cost with comparable or better task results across games, sche...
-
How to Interpret Agent Behavior
ACT*ONOMY is a Grounded-Theory-derived hierarchical taxonomy and open repository that enables systematic comparison and characterization of autonomous agent behavior across trajectories.
-
SWE-Cycle: Benchmarking Code Agents across the Complete Issue Resolution Cycle
SWE-Cycle benchmark shows sharp drops in code agent success rates from isolated tasks to full autonomous issue resolution, highlighting cross-phase dependency issues.
-
Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy
Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.
-
PIVOT: Bridging Planning and Execution in LLM Agents via Trajectory Refinement
PIVOT refines LLM agent trajectories through plan-inspect-evolve-verify stages using environment feedback, yielding up to 94% relative gains in constraint satisfaction and 3-5x token efficiency over prior refinement methods.
-
Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems
Coordination treated as a separable architectural layer in LLM multi-agent systems yields distinguishable Murphy-decomposed performance signatures on prediction-market tasks, with some configurations dominating a cost...
-
Trace-Level Analysis of Information Contamination in Multi-Agent Systems
Agent workflows can diverge substantially from contaminated inputs yet recover correct answers, or stay similar while failing, as measured by trace divergence on GAIA tasks.
-
EPM-RL: Reinforcement Learning for On-Premise Product Mapping in E-Commerce
EPM-RL uses PEFT followed by RL with agent-based rewards from judge models to create a trainable in-house product mapping model that improves on fine-tuning alone and beats API baselines in quality-cost while enabling...
-
VLAA-GUI: Knowing When to Stop, Recover, and Search, A Modular Framework for GUI Automation
VLAA-GUI adds mandatory visual verifiers, multi-tier loop breakers, and on-demand search to GUI agents, reaching 77.5% on OSWorld and 61.0% on WindowsAgentArena with some models exceeding human performance.
-
Do LLMs Need to See Everything? A Benchmark and Study of Failures in LLM-driven Smartphone Automation using Screentext vs. Screenshots
A new benchmark shows LLM smartphone agents achieve comparable success with screen text alone as with screenshots, but both fail often due to UI accessibility and reasoning gaps.
-
Contrastive Attribution in the Wild: An Interpretability Analysis of LLM Failures on Realistic Benchmarks
Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.
-
Do Agent Societies Develop Intellectual Elites? The Hidden Power Laws of Collective Cognition in LLM Multi-Agent Systems
LLM agent societies develop power-law coordination cascades and intellectual elites through an integration bottleneck that grows with system size.
-
Shepherd: A Runtime Substrate Empowering Meta-Agents with a Formalized Execution Trace
Shepherd is a runtime system that formalizes meta-agent operations via typed execution traces, enabling fast forking and demonstrated improvements in agent intervention, optimization, and training on benchmarks.
-
AgentRx: A Benchmark Study of LLM Agents for Multimodal Clinical Prediction Tasks
Single-agent LLM frameworks outperform naive multi-agent systems in multimodal clinical risk prediction tasks and are better calibrated.
-
Is a team only as strong as its weakest link? Quantifying the short-board effect with AI Agents
LLM multi-agent simulations reveal a cumulative product effect from multiple weak links on team performance and identify distinct capability regimes including a Sisyphus predicament.
-
Social Theory Should Be a Structural Prior for Agentic AI: A Formal Framework for Multi-Agent Social Systems
Agentic AI needs social theory as a structural prior, formalized via the MASS dynamical system framework with four priors: strategic heterogeneity, networked-constrained dependence, co-evolution, and distributional in...
-
TRUST: A Framework for Decentralized AI Service v.0.1
TRUST is a decentralized AI auditing framework that decomposes reasoning into HDAGs, maps agent interactions via the DAAN protocol to CIGs, and uses stake-weighted multi-tier consensus to achieve 72.4% accuracy while ...
-
Sovereign Agentic Loops: Decoupling AI Reasoning from Execution in Real-World Systems
Sovereign Agentic Loops decouple LLM reasoning from execution by emitting validated intents through a control plane with obfuscation and evidence chains, blocking 93% of unsafe actions in a cloud prototype while addin...
-
Mesh Memory Protocol: Semantic Infrastructure for Multi-Agent LLM Systems
MMP defines a seven-field CMB schema, role-based SVAF evaluation, content-hash lineage, and remix storage to enable traceable cross-session collaboration among autonomous LLM agents.
-
More Is Different: Toward a Theory of Emergence in AI-Native Software Ecosystems
AI-native software ecosystems exhibit emergent behaviors best explained by complex adaptive systems theory, requiring new ecosystem-level monitoring and seven testable propositions that may extend or replace Lehman's laws.
-
Dive into Claude Code: The Design Space of Today's and Future AI Agent Systems
Claude Code centers on a model-tool while-loop surrounded by permission systems, context compaction, extensibility hooks, subagent delegation, and session storage; the same design questions yield different answers in ...
-
Improving Role Consistency in Multi-Agent Collaboration via Quantitative Role Clarity
A role clarity matrix from softmax-normalized behavior-role similarities is employed as a regularizer to enhance role consistency in multi-agent LLM collaborations.
-
Iterative Audit Convergence in LLM-Managed Multi-Agent Systems: A Case Study in Prompt Engineering Quality Assurance
Nine LLM-agent audit rounds on a 7150-line prompt specification surface found 51 defects with non-monotonic convergence and a post-hoc seven-category taxonomy, showing single-file review misses defect classes.
-
Agentic Microphysics: A Manifesto for Generative AI Safety
The authors introduce agentic microphysics and generative safety to link local agent interactions to population-level risks in agentic AI through a causally explicit framework.
-
Conversations Risk Detection LLMs in Financial Agents via Multi-Stage Generative Rollout
FinSec is a multi-stage detection system for financial LLM dialogues that reaches 90.13% F1 score, cuts attack success rate to 9.09%, and raises AUPRC to 0.9189.
-
Qualixar OS: A Universal Operating System for AI Agent Orchestration
Qualixar OS provides a runtime for multi-agent AI systems with support for 12 topologies, LLM-driven team design, dynamic routing, consensus judging, content attribution, and protocol bridging, achieving 100% accuracy...
-
Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation
Compiled AI generates deterministic code artifacts from LLMs in a one-time compilation step, enabling reliable workflow execution with zero runtime tokens after break-even.
Reference graph
Works this paper leans on
- [1]
-
[2]
Gorilla: Large Language Model Connected with Massive APIs
Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis, 2023. URLhttps://arxiv.org/abs/2305.15334
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[3]
MemGPT: Towards LLMs as Operating Systems
Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems, 2024. URL https: //arxiv.org/abs/2310.08560
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[4]
A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6), March
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6), March
-
[5]
A survey on large language model based autonomous agents,
ISSN 2095-2236. doi: 10.1007/s11704-024-40231-1. URL http://dx.doi.org/10. 1007/s11704-024-40231-1
-
[6]
ChatDev: Communicative Agents for Software Development
Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, and Maosong Sun. Chatdev: Communicative agents for software development.arXiv preprint arXiv:2307.07924, 2023. URL https://arxiv.org/abs/2307.07924
work page internal anchor Pith review arXiv 2023
-
[7]
OpenHands: An Open Platform for AI Software Developers as Generalist Agents
Xingyao Wang, Boxuan Li, Yufan Song, Frank F. Xu, Xiangru Tang, Mingchen Zhuge, Jiayi Pan, Yueqi Song, Bowen Li, Jaskirat Singh, Hoang H. Tran, Fuqiang Li, Ren Ma, Mingzhang Zheng, Bill Qian, Yanjun Shao, Niklas Muennighoff, Yizhe Zhang, Binyuan Hui, Junyang Lin, Robert Brennan, Hao Peng, Heng Ji, and Graham Neubig. Openhands: An open platform for ai soft...
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Juraj Gottweis, Wei-Hung Weng, Alexander Daryin, Tao Tu, Anil Palepu, Petar Sirkovic, Artiom Myaskovsky, Felix Weissenberger, Keran Rong, Ryutaro Tanno, Khaled Saab, Dan Popovici, Jacob Blum, Fan Zhang, Katherine Chou, Avinatan Hassidim, Burak Gokturk, Amin Vahdat, Pushmeet Kohli, Yossi Matias, Andrew Carroll, Kavita Kulkarni, Nenad Tomasev, Yuan Guan, Vi...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[9]
Kyle Swanson, Wesley Wu, Nash L. Bulaong, John E. Pak, and James Zou. The virtual lab: Ai agents design new sars-cov-2 nanobodies with experimental validation.bioRxiv, 2024. doi: 10. 1101/2024.11.11.623004. URL https://www.biorxiv.org/content/early/2024/11/ 12/2024.11.11.623004
work page 2024
-
[10]
Generative Agents: Interactive Simulacra of Human Behavior
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior, 2023. URL https://arxiv.org/abs/2304.03442
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[11]
Openmanus: An open- source framework for building general ai agents
Xinbin Liang, Jinyu Xiang, Zhaoyang Yu, Jiayi Zhang, and Sirui Hong. Openmanus: An open- source framework for building general ai agents. https://github.com/mannaandpoem/ OpenManus, 2025
work page 2025
-
[12]
Adam Fourney, Gagan Bansal, Hussein Mozannar, Cheng Tan, Eduardo Salinas, Friederike Niedtner, Grace Proebsting, Griffin Bassman, Jack Gerrits, Jacob Alber, et al. Magentic-one: A generalist multi-agent system for solving complex tasks.arXiv preprint arXiv:2411.04468, 2024
-
[13]
Junda He, Christoph Treude, and David Lo. Llm-based multi-agent systems for software engineering: Vision and the road ahead, 2024. URL https://arxiv.org/abs/2404.04834
-
[14]
Roco: Dialectic multi-robot collaboration with large language models
Zhao Mandi, Shreeya Jain, and Shuran Song. Roco: Dialectic multi-robot collaboration with large language models, 2023. URLhttps://arxiv.org/abs/2307.04738
-
[15]
arXiv preprint arXiv:2307.02485
Hongxin Zhang, Weihua Du, Jiaming Shan, Qinhong Zhou, Yilun Du, Joshua B. Tenenbaum, Tianmin Shu, and Chuang Gan. Building cooperative embodied agents modularly with large language models, 2024. URLhttps://arxiv.org/abs/2307.02485. 11
-
[16]
Improving Factuality and Reasoning in Language Models through Multiagent Debate
Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate, 2023. URL https: //arxiv.org/abs/2305.14325
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[17]
Generative agents: Interactive simulacra of human behavior
Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. InProceed- ings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023
work page 2023
-
[18]
Large Language Model based Multi-Agents: A Survey of Progress and Challenges
Taicheng Guo, Xiuying Chen, Yaqi Wang, Ruidi Chang, Shichao Pei, Nitesh V Chawla, Olaf Wiest, and Xiangliang Zhang. Large language model based multi-agents: A survey of progress and challenges.arXiv preprint arXiv:2402.01680, 2024
work page internal anchor Pith review arXiv 2024
-
[19]
Agentless: Demystifying LLM-based Software Engineering Agents
Chunqiu Steven Xia, Yinlin Deng, Soren Dunn, and Lingming Zhang. Agentless: Demystifying llm-based software engineering agents, 2024. URL https://arxiv.org/abs/2407.01489
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[20]
Sayash Kapoor, Benedikt Stroebl, Zachary S. Siegel, Nitya Nadgir, and Arvind Narayanan. Ai agents that matter, 2024. URLhttps://arxiv.org/abs/2407.01502
-
[21]
Barney G. Glaser and Anselm L. Strauss.The Discovery of Grounded Theory: Strategies for Qualitative Research. Aldine Publishing Company, 1967
work page 1967
-
[22]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023. URL https://arxiv.org/ abs/2306.05685
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[23]
Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. Agent workflow memory,
- [24]
-
[25]
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines, 2023. URLhttps://arxiv.org/abs/2310.03714
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[26]
Stateflow: Enhancing llm task-solving through state-driven workflows, 2024
Yiran Wu, Tianwei Yue, Shaokun Zhang, Chi Wang, and Qingyun Wu. Stateflow: Enhancing llm task-solving through state-driven workflows, 2024. URLhttps://arxiv.org/abs/2403. 11322
work page 2024
-
[27]
Shanshan Han, Qifan Zhang, Yuhang Yao, Weizhao Jin, Zhaozhuo Xu, and Chaoyang He. Llm multi-agent systems: Challenges and open problems, 2024. URL https://arxiv.org/abs/ 2402.03578
-
[28]
Lewis Hammond, Alan Chan, Jesse Clifton, Jason Hoelscher-Obermaier, Akbir Khan, Euan McLean, Chandler Smith, Wolfram Barfuss, Jakob Foerster, Tom ´aˇs Gaven ˇciak, The Anh Han, Edward Hughes, V ojtˇech Kovaˇr´ık, Jan Kulveit, Joel Z. Leibo, Caspar Oesterheld, Chris- tian Schroeder de Witt, Nisarg Shah, Michael Wellman, Paolo Bova, Theodor Cimpeanu, Carson...
-
[29]
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik R Narasimhan. SWE-bench: Can language models resolve real-world github issues? InThe Twelfth International Conference on Learning Representations, 2024. URL https: //openreview.net/forum?id=VTF8yNQM66
work page 2024
-
[30]
A survey of useful llm evaluation.arXiv preprint arXiv:2406.00936, 2024
Ji-Lun Peng, Sijia Cheng, Egil Diau, Yung-Yu Shih, Po-Heng Chen, Yen-Ting Lin, and Yun- Nung Chen. A survey of useful llm evaluation.arXiv preprint arXiv:2406.00936, 2024. 12
-
[31]
Wei Wang, Dan Zhang, Tao Feng, Boyan Wang, and Jie Tang. Battleagentbench: A benchmark for evaluating cooperation and competition capabilities of language models in multi-agent systems.arXiv preprint arXiv:2408.15971, 2024
-
[32]
Timoth´ee Anne, Noah Syrkis, Meriem Elhosni, Florian Turati, Franck Legendre, Alain Jaquier, and Sebastian Risi. Harnessing language for coordination: A framework and benchmark for llm-driven multi-agent control.arXiv preprint arXiv:2412.11761, 2024
-
[33]
Matteo Bettini, Amanda Prorok, and Vincent Moens. Benchmarl: Benchmarking multi-agent reinforcement learning.Journal of Machine Learning Research, 25(217):1–10, 2024
work page 2024
-
[34]
Qian Long, Zhi Li, Ran Gong, Ying Nian Wu, Demetri Terzopoulos, and Xiaofeng Gao. Teamcraft: A benchmark for multi-modal multi-agent systems in minecraft.arXiv preprint arXiv:2412.05255, 2024
-
[35]
Lu, S., Wang, Y ., Sheng, L., He, L., Zheng, A., and Liang, J
Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. Trustworthy llms: A survey and guideline for evaluating large language models’ alignment.arXiv preprint arXiv:2308.05374, 2023
-
[36]
Yifan Yao, Jinhao Duan, Kaidi Xu, Yuanfang Cai, Zhibo Sun, and Yue Zhang. A survey on large language model (llm) security and privacy: The good, the bad, and the ugly.High-Confidence Computing, page 100211, 2024
work page 2024
-
[37]
URL https://www.anthropic.com/research/ building-effective-agents
Anthropic, Dec 2024. URL https://www.anthropic.com/research/ building-effective-agents
work page 2024
-
[38]
Ion Stoica, Matei Zaharia, Joseph Gonzalez, Ken Goldberg, Hao Zhang, Anastasios Angelopou- los, Shishir G Patil, Lingjiao Chen, Wei-Lin Chiang, and Jared Q Davis. Specifications: The missing link to making the development of llm systems an engineering discipline.arXiv preprint arXiv:2412.05299, 2024
-
[39]
Gagan Bansal, Jennifer Wortman Vaughan, Saleema Amershi, Eric Horvitz, Adam Fourney, Hussein Mozannar, Victor Dibia, and Daniel S. Weld. Challenges in human-agent communication. Technical Report MSR-TR-2024-53, Microsoft, December 2024. URL https://www.microsoft.com/en-us/research/publication/human-agent-interaction-challenges/.
[40]
Ge Bai, Jie Liu, Xingyuan Bu, Yancheng He, Jiaheng Liu, Zhanhui Zhou, Zhuoran Lin, Wenbo Su, Tiezheng Ge, Bo Zheng, and Wanli Ouyang. Mt-bench-101: A fine-grained benchmark for evaluating large language models in multi-turn dialogues. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), page 74...
[41]
Song Da, Zijie Zhou, Zhijie Wang, Yuheng Huang, Shengmai Chen, Bonan Kou, Lei Ma, and Tianyi Zhang. An empirical study of code generation errors made by large language models. In 7th Annual Symposium on Machine Programming, 2023.
[42]
[43]
Will Epperson, Gagan Bansal, Victor Dibia, Adam Fourney, Jack Gerrits, Erkang (Eric) Zhu, and Saleema Amershi. Interactive debugging and steering of multi-agent ai systems. In CHI 2025, April 2025. URL https://arxiv.org/abs/2503.02068.
[44]
Shaokun Zhang, Ming Yin, Jieyu Zhang, Jiale Liu, Zhiguang Han, Jingyang Zhang, Beibin Li, Chi Wang, Huazheng Wang, Yiran Chen, and Qingyun Wu. Which agent causes task failures and when? On automated failure attribution of llm multi-agent systems, 2025. URL https://arxiv.org/abs/2505.00212.
[45]
Claire B Draucker, Donna S Martsolf, Ratchneewan Ross, and Thomas B Rusk. Theoretical sampling and category development in grounded theory. Qualitative Health Research, 17(8):1137–1148, 2007.
[46]
Shahedul Huq Khandkar. Open coding. University of Calgary, 23(2009):2009, 2009.
[47]
[48]
Anthropic. Model context protocol: Introduction. https://modelcontextprotocol.io/introduction, December 2024.
[49]
Rao Surapaneni, Miku Jha, Michael Vakoc, and Todd Segal. A2a: A new era of agent interoperability, April 2025. URL https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/. Google Developers Blog.
[50]
Saaket Agashe, Yue Fan, Anthony Reyna, and Xin Eric Wang. Llm-coordination: Evaluating and analyzing multi-agent coordination abilities in large language models, 2025. URL https://arxiv.org/abs/2310.03903.
[51]
Charles Perrow. Normal Accidents: Living with High-Risk Technologies. Princeton University Press, Princeton, NJ, 1984. ISBN 978-0691004129.
[52]
Karlene H. Roberts. New challenges in organizational research: High reliability organizations. Organization & Environment, 3(2):111–125, 1989. doi: 10.1177/108602668900300202.
[53]
Gene I Rochlin. Reliable organizations: Present research and future directions. Journal of Contingencies and Crisis Management, 4(2), 1996. ISSN 0966-0879.
[54]
Sirui Hong, Xiawu Zheng, Jonathan Chen, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, Liyang Zhou, et al. Metagpt: Meta programming for a multi-agent collaborative framework. arXiv preprint arXiv:2308.00352, 2023.
[55]
Huy Nhat Phan, Tien N Nguyen, Phong X Nguyen, and Nghi DQ Bui. Hyperagent: Generalist software engineering agents to solve coding tasks at scale. arXiv preprint arXiv:2409.16299, 2024.
[56]
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, and Niranjan Balasubramanian. Appworld: A controllable world of apps and people for benchmarking interactive coding agents. arXiv preprint arXiv:2407.18901, 2024.
[57]
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First Conference on Language Modeling, 2024.
[58]
Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. Chatdev: Communicative agents for software development. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15174–15186, 2024.
[59]
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023.
[60]
Jia He, Mukund Rungta, David Koleczek, Arshdeep Sekhon, Franklin X Wang, and Sadid Hasan. Does prompt formatting have any impact on llm performance? arXiv preprint arXiv:2411.10541, 2024.
[61]
Yashar Talebirad and Amirhossein Nadiri. Multi-agent collaboration: Harnessing the power of intelligent llm agents. arXiv preprint arXiv:2306.03314, 2023.
[62]
Chi-Min Chan, Weize Chen, Yusheng Su, Jianxuan Yu, Wei Xue, Shanghang Zhang, Jie Fu, and Zhiyuan Liu. Chateval: Towards better llm-based evaluators through multi-agent debate. arXiv preprint arXiv:2308.07201, 2023.
[63]
Yixuan Weng, Minjun Zhu, Fei Xia, Bin Li, Shizhu He, Shengping Liu, Bin Sun, Kang Liu, and Jun Zhao. Large language models are better reasoners with self-verification. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
[64]
[65]
Anthropic. Building effective agents, 2024. URL https://www.anthropic.com/research/building-effective-agents.
[66]
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. Advances in Neural Information Processing Systems, 36, 2024.
[67]
Fatemeh Haji, Mazal Bethany, Maryam Tabar, Jason Chiang, Anthony Rios, and Peyman Najafirad. Improving llm reasoning with multi-agent tree-of-thought validator agent. arXiv preprint arXiv:2409.11527, 2024.
[68]
Zhenran Xu, Senbao Shi, Baotian Hu, Jindi Yu, Dongfang Li, Min Zhang, and Yuxiang Wu. Towards reasoning in large language models via multi-agent peer review collaboration. arXiv preprint arXiv:2311.08152, 2023.
[69]
Benedikt Stroebl, Sayash Kapoor, and Arvind Narayanan. Inference scaling fLaws: The limits of llm resampling with imperfect verifiers. arXiv preprint arXiv:2411.17501, 2024.
[70]
Lingjiao Chen, Jared Quincy Davis, Boris Hanin, Peter Bailis, Ion Stoica, Matei Zaharia, and James Zou. Are more llm calls all you need? Towards scaling laws of compound inference systems. arXiv preprint arXiv:2403.02419, 2024.
[71]
Kush Jain, Gabriel Synnaeve, and Baptiste Rozière. Testgeneval: A real world unit test generation and test completion benchmark. arXiv preprint arXiv:2410.00752, 2024.
[72]
Baolin Peng, Michel Galley, Pengcheng He, Hao Cheng, Yujia Xie, Yu Hu, Qiuyuan Huang, Lars Liden, Zhou Yu, Weizhu Chen, et al. Check your facts and try again: Improving large language models with external knowledge and automated feedback. arXiv preprint arXiv:2302.12813, 2023.
[73]
Pavan Kapanipathi, Ibrahim Abdelaziz, Srinivas Ravishankar, Salim Roukos, Alexander Gray, Ramon Astudillo, Maria Chang, Cristina Cornelio, Saswati Dana, Achille Fokoue, et al. Question answering over knowledge bases by leveraging semantic parsing and neuro-symbolic reasoning. arXiv preprint arXiv:2012.01707, 2020.
[74]
Xinyi Li, Sai Wang, Siqi Zeng, Yu Wu, and Yi Yang. A survey on llm-based multi-agent systems: Workflow, infrastructure, and challenges. Vicinagearth, 1(1):9, 2024.
[75]
Yaru Niu, Rohan R Paleja, and Matthew C Gombolay. Multi-agent graph-attention communication and teaming. In AAMAS, 2021.
[76]
Jiechuan Jiang and Zongqing Lu. Learning attentional communication for multi-agent cooperation. Advances in Neural Information Processing Systems, 31, 2018.
[77]
Amanpreet Singh, Tushar Jain, and Sainbayar Sukhbaatar. Learning when to communicate at scale in multiagent cooperative and competitive tasks. arXiv preprint arXiv:1812.09755, 2018.
[78]
Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, and Yi Wu. The surprising effectiveness of ppo in cooperative multi-agent games. Advances in Neural Information Processing Systems, 35:24611–24624, 2022.
[79]
Xudong Guo, Daming Shi, Junjie Yu, and Wenhui Fan. Heterogeneous multi-agent reinforcement learning for zero-shot scalable collaboration. arXiv preprint arXiv:2404.03869, 2024.
[80]
Weize Chen, Jiarui Yuan, Chen Qian, Cheng Yang, Zhiyuan Liu, and Maosong Sun. Optima: Optimizing effectiveness and efficiency for llm-based multi-agent system. arXiv preprint arXiv:2410.08115, 2024.