The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
ChatDev: Communicative Agents for Software Development
Canonical reference. 100% of citing Pith papers cite this work as background.
abstract
Software development is a complex task that necessitates cooperation among multiple members with diverse skills. Numerous studies used deep learning to improve specific phases in a waterfall model, such as design, coding, and testing. However, the deep learning model in each phase requires unique designs, leading to technical inconsistencies across various phases, which results in a fragmented and ineffective development process. In this paper, we introduce ChatDev, a chat-powered software development framework in which specialized agents driven by large language models (LLMs) are guided in what to communicate (via chat chain) and how to communicate (via communicative dehallucination). These agents actively contribute to the design, coding, and testing phases through unified language-based communication, with solutions derived from their multi-turn dialogues. We found their utilization of natural language is advantageous for system design, and communicating in programming language proves helpful in debugging. This paradigm demonstrates how linguistic communication facilitates multi-agent collaboration, establishing language as a unifying bridge for autonomous task-solving among LLM agents. The code and data are available at https://github.com/OpenBMB/ChatDev.
citation-polarity summary
polarities
background: 17
representative citing papers
Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
A new six-dimension process taxonomy for AI software development frameworks shows convergence on artifact persistence and human oversight but reveals that no framework covers all dimensions strongly, indicating a depth-portability trade-off.
IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
FineState-Bench and FineState-Metrics show LVLMs achieve only 22.8% average exact-state success in GUI interactions, with visual diagnostic hints improving results by up to 14.9 points.
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
ClawNet digitizes human collaborative relationships into a network of identity-governed AI agents that collaborate on behalf of their owners through a central orchestrator enforcing binding and verification.
NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.
An agent factory combining sub-kernel ILP assembly with multi-agent cross-optimization lets general coding agents deliver mean 8.27x speedups in HLS designs on standard benchmarks.
KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.
Multi-agent LLM systems can be steered via prompt design from mere aggregates to higher-order collectives with identity-linked differentiation and goal-directed complementarity, as measured by partial information decomposition of time-delayed mutual information.
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.
BRANE maps queries to optimal retrieval pipeline configurations using LLM-derived features and per-configuration correctness predictors, improving the cost-quality Pareto frontier on three benchmarks.
Swarm Skills is a portable multi-agent coordination specification with roles, workflows, bounds, and a self-evolution algorithm that distills trajectories using Effectiveness, Utilization, and Freshness scores for zero-adapter portability.
CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
CreativeGame enables iterative HTML5 game generation via mechanic-guided planning, lineage memory, runtime validation, and programmatic rewards to produce inspectable version-to-version mechanic evolution.
CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
PersonaTester uses LLMs guided by three-dimensional personas to replicate crowdworker testing patterns, yielding higher behavioral consistency, variability, and more bug detections than baseline LLM agents.
NOMAD decomposes UML class diagram creation into a multi-agent LLM workflow that outperforms baselines on a Northwind case study and human exercises while introducing a taxonomy of structural, relationship, and semantic errors.
citing papers explorer
Towards Large Reasoning Models: A Survey of Reinforced Reasoning with Large Language Models
The paper surveys reinforced reasoning techniques for LLMs, covering automated data construction, learning-to-reason methods, and test-time scaling as steps toward Large Reasoning Models.