ChatDev: Communicative Agents for Software Development

Cheng Yang; Chen Qian; Dahai Li; Hongzhang Liu; Jiahao Li; Juyuan Xu; Maosong Sun; Nuo Chen; Wei Liu; Weize Chen

arXiv: 2307.07924 · v5 · submitted 2023-07-16 · 💻 cs.SE · cs.CL · cs.MA

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, Juyuan Xu, Dahai Li, Zhiyuan Liu, Maosong Sun

Pith reviewed 2026-05-12 20:28 UTC · model grok-4.3

classification 💻 cs.SE · cs.CL · cs.MA

keywords: multi-agent systems, large language models, software development, communicative agents, chat chain, dehallucination, collaborative AI

The pith

Specialized LLM agents can develop software by communicating through a structured chat process guided to avoid errors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ChatDev, a framework that organizes large language model agents into roles for software development. These agents follow a chat chain to structure their discussions and use communicative dehallucination to maintain accuracy in exchanges. Through repeated conversations, they cover the design, coding, and testing stages, using natural language for planning and code for fixes. This shows that language serves as a common medium for AI agents to solve complex tasks together without separate tools for each step.

Core claim

ChatDev demonstrates that communicative agents driven by large language models can actively participate in design, coding, and testing by deriving solutions from multi-turn dialogues. The use of natural language supports system design while programming language communication aids debugging, unifying the process under language-based collaboration.

What carries the argument

The chat chain guides the content of agent communications while communicative dehallucination directs the manner of communication to prevent hallucinations, enabling coherent multi-agent collaboration on software tasks.
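As a rough illustration of the idea, a chat chain can be pictured as an ordered list of phases, each pairing two roles that exchange a bounded number of turns. The sketch below is hypothetical and is not taken from the ChatDev codebase: the `Phase` fields, the role names, the stub agents standing in for LLM calls, and the rule that only the last reply of a phase is carried into the next one are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical type: an agent maps the dialogue so far to its next reply.
# In a real system this would wrap an LLM call with a role prompt.
Agent = Callable[[List[str]], str]


@dataclass
class Phase:
    name: str           # e.g. "design", "coding", "testing"
    instructor: str     # role that speaks first in each turn
    assistant: str      # role that answers
    max_turns: int = 2  # cap on instructor/assistant exchanges


def make_stub_agent(role: str) -> Agent:
    """Return a placeholder agent that only labels its replies (no LLM involved)."""
    def agent(dialogue: List[str]) -> str:
        return f"[{role}] reply {len(dialogue)}"
    return agent


def run_chat_chain(
    task: str, phases: List[Phase], agents: Dict[str, Agent]
) -> Tuple[str, List[Tuple[str, str, str]]]:
    """Run the phases in order and return (final output, full transcript)."""
    carry = task  # what the next phase starts from
    transcript: List[Tuple[str, str, str]] = []
    for phase in phases:
        dialogue = [carry]  # assumption: a phase sees only the previous conclusion
        for _ in range(phase.max_turns):
            for role in (phase.instructor, phase.assistant):
                reply = agents[role](dialogue)
                dialogue.append(reply)
                transcript.append((phase.name, role, reply))
        carry = dialogue[-1]  # the last assistant reply feeds the next phase
    return carry, transcript
```

The chain decides who talks and in what order; how each agent phrases or verifies its reply is left to the agent function, which is where a dehallucination step would sit.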

If this is right

Development phases become integrated through ongoing agent dialogues rather than sequential handoffs.
Natural language exchanges handle creative aspects like design, while code-based talks resolve technical issues.
The framework allows solutions to emerge directly from agent interactions without external models for each phase.
Language acts as the unifying mechanism for autonomous problem-solving by multiple LLM agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar chat-based coordination could apply to other collaborative tasks such as scientific discovery or project management.
Testing this on larger projects might reveal limits in handling very complex software systems.
Future systems could incorporate more agent roles to cover additional development activities like deployment.

Load-bearing premise

That the guided multi-turn dialogues between these LLM agents will produce functional and correct software without needing outside verification or corrections.

What would settle it

Generating applications with ChatDev for common programming problems and then running them through unit tests and manual inspection to see if they function as intended or contain errors.
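The automated half of such a check could look like the sketch below, which runs each generated program against a handful of input/output cases and reports the fraction that pass. The function names, the `(args, expected)` case format, and the use of in-process `exec` are assumptions made for illustration; untrusted generated code should only ever be executed inside a sandbox.

```python
from typing import Dict, List, Tuple

Case = Tuple[tuple, object]  # (positional arguments, expected return value)


def passes_tests(source: str, func_name: str, cases: List[Case]) -> bool:
    """Execute generated source in a fresh namespace and check every case."""
    namespace: Dict[str, object] = {}
    try:
        exec(source, namespace)  # caution: sandbox this for untrusted code
        func = namespace[func_name]
        return all(func(*args) == expected for args, expected in cases)
    except Exception:
        # Syntax errors, a missing function, and runtime errors all count as failures.
        return False


def pass_rate(candidates: List[str], func_name: str, cases: List[Case]) -> float:
    """Fraction of candidate programs that pass all cases (0.0 for an empty list)."""
    if not candidates:
        return 0.0
    passed = sum(passes_tests(src, func_name, cases) for src in candidates)
    return passed / len(candidates)
```

A pass rate only measures behavior on the supplied cases, so manual inspection remains necessary for anything the cases do not cover.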

Original abstract

Software development is a complex task that necessitates cooperation among multiple members with diverse skills. Numerous studies used deep learning to improve specific phases in a waterfall model, such as design, coding, and testing. However, the deep learning model in each phase requires unique designs, leading to technical inconsistencies across various phases, which results in a fragmented and ineffective development process. In this paper, we introduce ChatDev, a chat-powered software development framework in which specialized agents driven by large language models (LLMs) are guided in what to communicate (via chat chain) and how to communicate (via communicative dehallucination). These agents actively contribute to the design, coding, and testing phases through unified language-based communication, with solutions derived from their multi-turn dialogues. We found their utilization of natural language is advantageous for system design, and communicating in programming language proves helpful in debugging. This paradigm demonstrates how linguistic communication facilitates multi-agent collaboration, establishing language as a unifying bridge for autonomous task-solving among LLM agents. The code and data are available at https://github.com/OpenBMB/ChatDev.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ChatDev sets up LLM agents with a chat chain and dehallucination for the full dev cycle but supplies no numbers showing the setup adds value over simpler prompting. The paper actually builds and releases a working system that coordinates design, coding, and testing through multi-turn language exchanges, with the chat chain deciding what gets discussed at each stage and dehallucination shaping how agents respond to cut down on obvious errors. Using natural language for design and code for debugging is a reasonable observation, and making the code public lets others reproduce the pattern without starting from scratch. That part is concrete and new relative to earlier multi-agent LLM papers that stayed at the level of general collaboration ideas.

The evaluation is the clear gap. The abstract claims advantages from the communication mechanisms but reports no success rates, bug counts, completion times, or direct comparisons to a single LLM or non-communicative baseline. Without those controls it is impossible to tell whether the dialogues are doing real work or simply exposing what the base model already knows. The circularity burden is low since the claims rest on observed runs rather than fitted parameters, yet that does not substitute for controlled evidence.

This paper is aimed at researchers exploring agent architectures for software engineering. A reader who wants to experiment with structured multi-agent prompting can pull useful patterns from the chat chain description and the released code. Someone looking for a method that has been shown to outperform existing approaches will come away empty.

It deserves peer review because the framework is original, the code is available, and the core idea of language as the coordination layer is worth testing properly. I would send it out with a request for quantitative ablations and baseline comparisons.

Referee Report

2 major / 1 minor

Summary. The paper introduces ChatDev, a multi-agent framework for software development in which specialized LLM agents collaborate across design, coding, and testing phases. Agents are guided by a chat chain (specifying what to communicate) and communicative dehallucination (specifying how to communicate), deriving solutions from multi-turn natural-language dialogues. The authors claim that natural language is advantageous for system design while programming-language communication aids debugging, positioning language as a unifying bridge for autonomous task-solving among LLM agents. Code and data are released at https://github.com/OpenBMB/ChatDev.

Significance. If the central effectiveness claims were supported by controlled experiments, the work would be significant for demonstrating how communicative multi-agent LLM systems can unify fragmented software-engineering phases without phase-specific model designs. The public release of code is a clear strength that supports reproducibility and follow-on research.

major comments (2)

[Abstract and evaluation sections] The manuscript asserts qualitative advantages of the chat-chain and communicative-dehallucination mechanisms (abstract and §4) yet reports no quantitative success rates, bug counts, completion times, or error metrics, nor any ablation or baseline comparisons against single-LLM prompting or non-communicative multi-agent setups. This absence directly undermines the central claim that the proposed communication protocols causally improve outcomes.
[Method / Communicative Dehallucination] The description of how 'communicative dehallucination' is implemented and how it differs from standard prompting or self-consistency techniques is insufficient to allow replication or to isolate its contribution (the method section provides only high-level prose).
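The comparisons requested in the first major comment can be laid out as a small grid of configurations. The sketch below is only an illustration of that experimental design: the `Config` fields and helper names are hypothetical, and the outcome lists are meant to be filled with real run results, none of which are supplied here.

```python
from itertools import product
from typing import Dict, List, NamedTuple


class Config(NamedTuple):
    multi_agent: bool      # False = single-LLM prompting baseline
    chat_chain: bool       # structured phase-by-phase dialogue on/off
    dehallucination: bool  # communicative dehallucination on/off


def ablation_grid() -> List[Config]:
    """Single-LLM baseline plus every on/off combination of the two mechanisms."""
    grid = [Config(multi_agent=False, chat_chain=False, dehallucination=False)]
    # Config(True, False, False) is the non-communicative multi-agent baseline.
    grid += [Config(True, cc, dh) for cc, dh in product([False, True], repeat=2)]
    return grid


def completion_rate(outcomes: List[bool]) -> float:
    """Fraction of tasks completed; 0.0 when no runs were recorded."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def summarize(results: Dict[Config, List[bool]]) -> Dict[Config, float]:
    """Map each configuration to its completion rate over the same task set."""
    return {cfg: completion_rate(runs) for cfg, runs in results.items()}
```

Running every configuration on the same task set is what allows a difference in completion rate to be attributed to a single mechanism.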

minor comments (1)

[Introduction / Related Work] The paper introduces two new terms ('chat chain' and 'communicative dehallucination') without a dedicated related-work subsection contrasting them to prior multi-agent LLM frameworks (e.g., AutoGen, MetaGPT).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps us strengthen the presentation of ChatDev. We address the major comments point by point below and commit to revisions that improve clarity and evidence.

Referee: [Abstract and evaluation sections] The manuscript asserts qualitative advantages of the chat-chain and communicative-dehallucination mechanisms (abstract and §4) yet reports no quantitative success rates, bug counts, completion times, or error metrics, nor any ablation or baseline comparisons against single-LLM prompting or non-communicative multi-agent setups. This absence directly undermines the central claim that the proposed communication protocols causally improve outcomes.

Authors: We agree that the current manuscript relies primarily on qualitative demonstrations through case studies and illustrative examples rather than controlled quantitative experiments. While the abstract and evaluation sections highlight observed advantages of natural language for design and programming language for debugging, we acknowledge the absence of success rates, bug counts, completion times, ablations, or baseline comparisons. This limits the strength of causal claims. In the revised manuscript, we will add a dedicated quantitative evaluation section reporting task completion rates on a set of software development tasks, comparisons to single-LLM prompting and non-communicative multi-agent baselines, and ablation studies that isolate the chat chain and communicative dehallucination components.
Revision: yes

Referee: [Method / Communicative Dehallucination] The description of how 'communicative dehallucination' is implemented and how it differs from standard prompting or self-consistency techniques is insufficient to allow replication or to isolate its contribution (the method section provides only high-level prose).

Authors: We thank the referee for identifying this gap in methodological detail. Communicative dehallucination guides agents to perform verification during multi-turn dialogues by cross-referencing prior messages and validating outputs through code execution in later phases. It differs from standard prompting by enforcing an explicit inter-agent verification protocol and from self-consistency by relying on communicative correction rather than independent sampling. In the revised manuscript, we will expand the method section with pseudocode, a step-by-step algorithmic description, and explicit differentiation from related techniques to enable replication and better isolation of its contribution.
Revision: yes
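The "validating outputs through code execution" part of that response can be sketched as a loop in which one agent's code is run and any error text is handed back to a reviser for correction. This follows the simulated rebuttal's wording only and should not be read as the paper's implementation; `execute`, `repair_loop`, and the `reviser` callback (a stand-in for an LLM agent) are hypothetical names, and in-process `exec` is used purely for brevity where a real system would need a sandbox.

```python
import traceback
from typing import Callable, Tuple

# Hypothetical reviser: takes (source, error text) and returns revised source.
Reviser = Callable[[str, str], str]


def execute(source: str) -> Tuple[bool, str]:
    """Run source in an empty namespace; return (succeeded, error text)."""
    try:
        exec(compile(source, "<candidate>", "exec"), {})  # sandbox in real use
        return True, ""
    except Exception:
        return False, traceback.format_exc()


def repair_loop(source: str, reviser: Reviser, max_rounds: int = 3) -> Tuple[str, bool]:
    """Alternate execution and revision until the code runs or rounds run out."""
    for _ in range(max_rounds):
        ok, error_text = execute(source)
        if ok:
            return source, True
        # The error message is the "communication" passed back to the reviser.
        source = reviser(source, error_text)
    ok, _ = execute(source)  # check the final revision once more
    return source, ok
```

Bounding the number of rounds keeps a reviser that never converges from looping indefinitely.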

Circularity Check

0 steps flagged

No circularity: system description with empirical observations, no derivations or fitted predictions.

The paper describes an implemented multi-agent framework (ChatDev) for software development using LLMs, with claims resting on observed behavior from running the system on example tasks. No mathematical derivations, equations, uniqueness theorems, or parameter-fitting steps are present. The central claims concern advantages of natural language communication and programming-language debugging, supported by qualitative examples and implementation details rather than any chain that reduces to self-referential inputs or self-citations. The reader's assessment of score 1.0 aligns with this; the work is self-contained as an engineering demonstration without load-bearing circular elements.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim depends on two newly introduced mechanisms whose effectiveness is asserted without independent prior evidence, plus the domain assumption that current LLMs can sustain coherent multi-turn role-based collaboration.

axioms (1)

Domain assumption: Current LLMs can be prompted into stable specialized roles and can follow structured communication protocols across design, coding, and testing phases without additional fine-tuning.
Invoked when the paper states that agents actively contribute through unified language-based communication.

invented entities (2)

chat chain (no independent evidence)
Purpose: to specify what each agent should communicate at each development stage.
New structuring device introduced to organize dialogues; no independent evidence supplied.
communicative dehallucination (no independent evidence)
Purpose: to steer agents toward reliable communication and reduce fabricated content.
New technique introduced to improve multi-agent reliability; no independent evidence supplied.

Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Why Do Multi-Agent LLM Systems Fail?
cs.AI 2025-03 unverdicted novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
Prompt Infection: LLM-to-LLM Prompt Injection within Multi-Agent Systems
cs.MA 2024-10 unverdicted novelty 8.0

Prompt injection attacks can self-replicate across LLM agents in multi-agent systems, enabling data theft, misinformation, and system disruption while propagating silently.
RTSGameBench: An RTS Benchmark for Strategic Reasoning by Vision-Language Models
cs.AI 2026-06 unverdicted novelty 7.0

RTSGameBench is a new extensible benchmark for VLMs using diverse RTS matchups, diagnostic mini-games targeting individual competencies, and a self-evolving query-to-game generator, with results showing poor VLM perfo...
ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer
cs.SE 2026-06 unverdicted novelty 7.0

ADK Arena evaluates 51 Python ADKs by having an LLM learn each framework's API, write and repair agent code, and run on benchmarks, finding 57% success rate, 5.6x cost variation, no dominant framework, and substitutab...
From Prompt to Process: a Process Taxonomy and Comparative Assessment of Frameworks Supporting AI Software Development Agents
cs.SE 2026-06 conditional novelty 7.0

A new six-dimension process taxonomy for AI software development frameworks shows convergence on artifact persistence and human oversight but reveals that no framework covers all dimensions strongly, indicating a dept...
Inductive Deductive Synthesis: Enabling AI to Generate Formally Verified Systems
cs.AI 2026-05 unverdicted novelty 7.0

IDS is an agentic LLM system that incrementally synthesizes both implementation and proof for distributed key-value stores, succeeding on all 7 specs where prior agents succeeded on only 2.
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
cs.CL 2026-05 unverdicted novelty 7.0

An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
Social Bias in LLM-Generated Code: Benchmark and Mitigation
cs.SE 2026-05 unverdicted novelty 7.0

LLMs show up to 60.58% social bias in generated code; a new Fairness Monitor Agent cuts bias by 65.1% and raises functional correctness from 75.80% to 83.97%.
FineState-Bench: Benchmarking State-Conditioned Grounding for Fine-grained GUI State Setting
cs.CV 2026-04 unverdicted novelty 7.0

FineState-Bench and FineState-Metrics show LVLMs achieve only 22.8% average exact-state success in GUI interactions, with visual diagnostic hints improving results by up to 14.9 points.
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
cs.SE 2026-04 unverdicted novelty 7.0

Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement
cs.RO 2026-04 unverdicted novelty 7.0

PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
cs.CR 2026-04 unverdicted novelty 7.0

A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
ClawNet: Human-Symbiotic Agent Network for Cross-User Autonomous Cooperation
cs.AI 2026-04 unverdicted novelty 7.0

ClawNet digitizes human collaborative relationships into a network of identity-governed AI agents that collaborate on behalf of their owners through a central orchestrator enforcing binding and verification.
Detecting Multi-Agent Collusion Through Multi-Agent Interpretability
cs.AI 2026-04 conditional novelty 7.0

NARCBench and five activation-probing methods detect multi-agent collusion with 0.73-1.00 AUROC across distribution shifts and steganographic tasks by aggregating per-agent signals.
Agent Factories for High Level Synthesis: How Far Can General-Purpose Coding Agents Go in Hardware Optimization?
cs.AI 2026-03 conditional novelty 7.0

An agent factory combining sub-kernel ILP assembly with multi-agent cross-optimization lets general coding agents deliver mean 8.27x speedups in HLS designs on standard benchmarks.
Efficient Remote KV Cache Reuse with GPU-native Video Codec
cs.DC 2026-02 conditional novelty 7.0

KVCodec uses GPU-native video codecs and pipelined fetching to compress and transmit KV caches, delivering up to 3.51x faster TTFT than prior methods while preserving accuracy.
Emergent Coordination in Multi-Agent Language Models
cs.MA 2025-10 unverdicted novelty 7.0

Multi-agent LLM systems can be steered via prompt design from mere aggregates to higher-order collectives with identity-linked differentiation and goal-directed complementarity, as measured by partial information deco...
Automated Design of Agentic Systems
cs.AI 2024-08 conditional novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...
AgentCoder: Multi-Agent-based Code Generation with Iterative Testing and Optimisation
cs.CL 2023-12 accept novelty 7.0

A three-agent loop of code generation, test creation, and execution feedback lifts pass@1 to 96.3% on HumanEval and 91.8% on MBPP for GPT-4 while using roughly half the tokens of prior state-of-the-art.
MESA: Prioritizing Vulnerable Communication Channels for Securing Multi-Agent Systems
cs.CR 2026-06 unverdicted novelty 6.0

MESA ranks MAS communication edges by vulnerability via graph-theoretic metrics and dynamic probes, achieving mean Spearman ρ=+0.60 correlation with empirical per-edge attack success and 3x interception gain when moni...
Iterating Toward Better Search: A Two-Agent Simulation Framework for Evaluating Agentic Search Architectures in E-Commerce
cs.AI 2026-06 unverdicted novelty 6.0

A modular two-agent simulation framework enables controlled comparison of conversational e-commerce responders, showing rolling-window memory outperforms intent extraction and targeted fixes reduce failures by 62%.
Automated Mediator for Human Negotiation: Pre-Mediation via a Structured LLM Pipeline
cs.AI 2026-06 unverdicted novelty 6.0

A structured LLM pipeline for pre-mediation in integrative negotiations performs comparably to human mediators on self-reported outcomes and better on preference inference in controlled experiments.
Natural Language Query to Configuration for Retrieval Agents
cs.AI 2026-05 unverdicted novelty 6.0

BRANE maps queries to optimal retrieval pipeline configurations using LLM-derived features and per-configuration correctness predictors, improving the cost-quality Pareto frontier on three benchmarks.
Swarm Skills: A Portable, Self-Evolving Multi-Agent System Specification for Coordination Engineering
cs.CL 2026-05 unverdicted novelty 6.0

Swarm Skills is a portable multi-agent coordination specification with roles, workflows, bounds, and a self-evolution algorithm that distills trajectories using Effectiveness, Utilization, and Freshness scores for zer...
CTM-AI: A Blueprint for General AI Inspired by a Model of Consciousness
q-bio.NC 2026-04 unverdicted novelty 6.0

CTM-AI combines a formal consciousness model with foundation models to report state-of-the-art results on sarcasm detection, humor, and agentic tool-use benchmarks.
CreativeGame:Toward Mechanic-Aware Creative Game Generation
cs.AI 2026-04 unverdicted novelty 6.0

CreativeGame enables iterative HTML5 game generation via mechanic-guided planning, lineage memory, runtime validation, and programmatic rewards to produce inspectable version-to-version mechanic evolution.
CoopGuard: Stateful Cooperative Agents Safeguarding LLMs Against Evolving Multi-Round Attacks
cs.CR 2026-04 unverdicted novelty 6.0

CoopGuard deploys cooperative agents to track conversation history and counter evolving multi-round attacks on LLMs, achieving a 78.9% reduction in attack success rate on a new 5,200-sample benchmark.
Towards Automated Crowdsourced Testing via Personified-LLM
cs.SE 2026-03 unverdicted novelty 6.0

PersonaTester uses LLMs guided by three-dimensional personas to replicate crowdworker testing patterns, yielding higher behavioral consistency, variability, and more bug detections than baseline LLM agents.
NOMAD: A Multi-Agent LLM System for UML Class Diagram Generation from Natural Language Requirements
cs.SE 2025-11 unverdicted novelty 6.0

NOMAD decomposes UML class diagram creation into a multi-agent LLM workflow that outperforms baselines on a Northwind case study and human exercises while introducing a taxonomy of structural, relationship, and semant...
ARM: Discovering Agentic Reasoning Modules for Generalizable Multi-Agent Systems
cs.AI 2025-10 unverdicted novelty 6.0

ARM evolves specialized reasoning modules from basic CoT via tree search to serve as reusable components in multi-agent systems that generalize across models and domains without per-task re-optimization.
PosterForest: Hierarchical Multi-Agent Collaboration for Scientific Poster Generation
cs.AI 2025-08 unverdicted novelty 6.0

PosterForest uses a Poster Tree intermediate representation and hierarchical multi-agent reasoning to generate coherent scientific posters without training, outperforming prior methods in evaluations.
BlindGuard: Safeguarding LLM-based Multi-Agent Systems under Unknown Attacks
cs.AI 2025-08 unverdicted novelty 6.0

BlindGuard introduces an unsupervised hierarchical agent encoder plus corruption-guided contrastive detector that identifies malicious agents in LLM-based multi-agent systems without any attack labels or prior knowled...
GenoMAS: A Multi-Agent Framework for Scientific Discovery via Code-Driven Gene Expression Analysis
cs.AI 2025-07 unverdicted novelty 6.0

GenoMAS deploys six specialized LLM agents with guided planning to preprocess transcriptomic data and identify genes, reaching 89.13% composite similarity and 60.48% F1 on the GenoTEX benchmark while outperforming pri...
A Dynamic LLM-Powered Agent Network for Task-Oriented Agent Collaboration
cs.CL 2023-10 conditional novelty 6.0

DyLAN automatically selects and dynamically organizes LLM agents for collaboration, outperforming fixed-agent baselines on code generation, reasoning, and decision tasks with up to 25% accuracy gains on some MMLU subjects.
Cognitive Architectures for Language Agents
cs.AI 2023-09 accept novelty 6.0

CoALA is a modular cognitive architecture for language agents that organizes memory components, action spaces for internal and external interaction, and a generalized decision-making loop to support more systematic de...
A Survey on Large Language Model based Autonomous Agents
cs.AI 2023-08 accept novelty 6.0

A survey of LLM-based autonomous agents that proposes a unified framework for their construction and reviews applications in social science, natural science, and engineering along with evaluation methods and future di...
ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate
cs.CL 2023-08 conditional novelty 6.0

Multi-agent debate among LLMs yields more reliable text evaluations than single-agent prompting by simulating collaborative human judgment.
PaperClaw: Harnessing Agents for Autonomous Research and Human-in-the-Loop Refinement
cs.AI 2026-06 unverdicted novelty 5.0

PAPERCLAW is a multi-agent system for end-to-end autonomous research paper generation from literature to output, with human refinement and LLM-judge evaluation showing strong results.
A Technical Taxonomy of LLM Agent Communication Protocols
cs.MA 2026-06 unverdicted novelty 5.0

Creates a five-dimension taxonomy (counterparty, payload, interaction state, discovery mechanism, schema flexibility) from nine protocols and identifies architectural patterns plus convergence trends.
Decision-Aware Memory Cards: Counterfactual-Inspired Context Selection and Compression for Tool-Using LLM Agents
cs.AI 2026-06 unverdicted novelty 5.0

CICL scores and compresses context evidence for LLM agents via action-shift and outcome-uplift metrics, lifting hit@1 from 0.58 to 0.78 on 50 SWE-bench retrieval tasks.
LLM Consortium for Software Design Refinement: A Controlled Experiment on Multi-Agent Collaboration Topologies
cs.SE 2026-05 unverdicted novelty 5.0

A 520-run factorial experiment ranks an adversarial rewrite topology highest and cross-model review second among 12 LLM collaboration structures for software design, with parallel merge performing worst.
CONCAT: Consensus- and Confidence-Driven Ad Hoc Teaming for Efficient LLM-Based Multi-Agent Systems
cs.MA 2026-05 unverdicted novelty 5.0

CONCAT introduces a consensus- and confidence-driven ad hoc teaming method that reduces communication overhead in LLM-based multi-agent systems by up to 50% latency while improving efficiency ratio without any training.
ATOM: Instantiating Budget-Controllable Multi-Agent Collaboration via Nucleus-Electron Hierarchy
cs.MA 2026-05 unverdicted novelty 5.0

ATOM uses a nucleus-electron hierarchy and task-driven RL to generate budget-controllable multi-agent collaboration graphs for LLMs, claiming SOTA performance with up to 30% better token efficiency on six benchmarks.
Reinforced Collaboration in Multi-Agent Flow Networks
cs.LG 2026-05 unverdicted novelty 5.0

MANGO optimizes multi-agent LLM workflows via flow networks, RL, and textual gradients, delivering up to 12.8% higher performance and 47.4% better efficiency while generalizing to new domains.
ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows
cs.AI 2026-05 unverdicted novelty 5.0

ProfiliTable is a profiling-driven multi-agent system that builds semantic context through exploration and closed-loop refinement to produce more reliable tabular data transformations than prior LLM approaches.
ProfiliTable: Profiling-Driven Tabular Data Processing via Agentic Workflows
cs.AI 2026-05 unverdicted novelty 5.0

ProfiliTable is a multi-agent system with profiler, generator, and evaluator components that outperforms baselines on 18 tabular task types via dynamic profiling and closed-loop refinement.
Consistency as a Testable Property: Statistical Methods to Evaluate AI Agent Reliability
cs.AI 2026-05 unverdicted novelty 5.0

A framework with U-statistics and kernel-based metrics quantifies AI agent consistency and robustness, showing trajectory metrics outperform pass@1 rates in diagnosing failures.
Swarm Skills: A Portable, Self-Evolving Multi-Agent System Specification for Coordination Engineering
cs.CL 2026-05 unverdicted novelty 5.0

Swarm Skills is a distributable specification for multi-agent workflows that includes roles, execution bounds, and a self-evolution algorithm to automatically improve coordination strategies.
Agentic AI in the Software Development Lifecycle: Architecture, Empirical Evidence, and the Reshaping of Software Engineering
cs.SE 2026-04 unverdicted novelty 5.0

Agentic AI systems are shifting software engineering from line-level code generation to delegated repository-scale execution under supervision, with SWE-bench performance rising from 1.96% to 78.4% and productivity ga...
ARMove: Learning to Predict Human Mobility through Agentic Reasoning
cs.MA 2026-04 unverdicted novelty 5.0

ARMove is a transferable framework for human mobility prediction that combines agentic LLM reasoning, feature management, and large-small model synergy to outperform baselines on several metrics while improving interp...
Separating Intelligence from Execution: A Workflow Engine for the Model Context Protocol
cs.DC 2026-03 unverdicted novelty 5.0

An MCP-native workflow engine decouples agent reasoning from execution by using declarative blueprints, reducing token cost by over 99% on a 67-step Kubernetes synchronization task.
MARS-SQL: A multi-agent reinforcement learning framework for Text-to-SQL
cs.CL 2025-11 unverdicted novelty 5.0

MARS-SQL trains a multi-agent RL system with ReAct-style interaction and generative validation to produce SQL queries, reaching 77.84% execution accuracy on BIRD dev and 89.75% on Spider test.
CodeWiki: Evaluating AI's Ability to Generate Holistic Documentation for Large-Scale Codebases
cs.SE 2025-10 unverdicted novelty 5.0

CodeWiki presents a unified framework for repository-level documentation across seven languages using hierarchical decomposition, recursive multi-agent processing, and multi-modal synthesis, outperforming DeepWiki by ...
GUARD: Guideline Upholding Test through Adaptive Role-play and Jailbreak Diagnostics for LLMs
cs.CL 2025-08 unverdicted novelty 5.0

GUARD automates generation of guideline-violating questions and jailbreak diagnostics to test LLM compliance with government ethics guidelines, validated empirically on eight models and extended to vision-language models.
AppAgent: Multimodal Agents as Smartphone Users
cs.CV 2023-12 unverdicted novelty 5.0

AppAgent lets large language models operate diverse smartphone apps via visual interactions and learns app usage from exploration or demonstrations.
Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate
cs.CL 2023-05 conditional novelty 5.0

Multi-agent debate with tit-for-tat arguments and a judge LLM improves reasoning by preventing LLMs from locking into incorrect initial solutions.
Verifying Intent and Harm: A Unified Defense Against LLM-Generated Threats
cs.CR 2026-06 unverdicted novelty 4.0

A joint prompt-response verification framework using intent analysts, harm analysts, and a judge improves average F1 to 0.95 and cuts attack success rate to 4.1% across jailbreaks, prompt injection, phishing, cyber ab...
What makes a harness a harness: necessary and sufficient conditions for an agent harness
cs.SE 2026-06 unverdicted novelty 4.0

Proposes and tests a constitutive definition of 'agent harness' via conceptual analysis of literature and six real systems.
SPOQ: Specialist Orchestrated Queuing for Multi-Agent Software Engineering
cs.SE 2026-06 unverdicted novelty 4.0

SPOQ is a multi-agent orchestration approach using wave-based topological dispatch, dual validation gates, and Human-as-an-Agent integration that reports large gains in speed, planning quality, defect reduction, and t...
Towards Cybersecurity SuperIntelligence (CSI): What's the best harness for cybersecurity?
cs.CR 2026-05 unverdicted novelty 4.0

CSI meta-scaffold unifies five LLM agent harnesses; a blackboard multi-agent system solves 19/33 cybench challenges (57.6%) versus 15/33 for the best single scaffold.