AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Original abstract
AutoGen is an open-source framework that allows developers to build LLM applications via multiple agents that can converse with each other to accomplish tasks. AutoGen agents are customizable, conversable, and can operate in various modes that employ combinations of LLMs, human inputs, and tools. Using AutoGen, developers can also flexibly define agent interaction behaviors. Both natural language and computer code can be used to program flexible conversation patterns for different applications. AutoGen serves as a generic infrastructure to build diverse applications of various complexities and LLM capacities. Empirical studies demonstrate the effectiveness of the framework in many example applications, with domains ranging from mathematics, coding, question answering, operations research, online decision-making, entertainment, etc.
This paper has not been read by Pith yet.
Forward citations
Cited by 60 Pith papers
-
Revisable by Design: A Theory of Streaming LLM Agent Execution
LLM agents achieve greater flexibility during execution by classifying actions via a reversibility taxonomy and using an Earliest-Conflict Rollback algorithm that matches full-restart quality while wasting far less co...
-
Why Do Multi-Agent LLM Systems Fail?
The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.
-
Attacks and Mitigations for Distributed Governance of Agentic AI under Byzantine Adversaries
Identifies concrete attacks from a malicious Provider on SAGA and proposes SAGA-BFT, SAGA-MON, SAGA-AUD, and SAGA-HYB mitigations offering different security-performance trade-offs.
-
Can a Single Message Paralyze the AI Infrastructure? The Rise of AbO-DDoS Attacks through Targeted Mobius Injection
Mobius Injection exploits semantic closure in LLM agents to enable single-message AbO-DDoS attacks achieving up to 51x call amplification and 229x latency inflation.
-
TourMart: A Parametric Audit Instrument for Commission Steering in LLM Travel Agents
TourMart quantifies commission steering in LLM travel agents via paired counterfactual prompts, reporting 3.5-7.7 percentage point increases in steered recommendations for tested models.
-
TimeClaw: A Time-Series AI Agent with Exploratory Execution Learning
TimeClaw is an exploratory execution learning system that turns multiple valid tool-use paths into hierarchical distilled experience for improved time-series reasoning without test-time adaptation.
-
Debugging the Debuggers: Failure-Anchored Structured Recovery for Software Engineering Agents
PROBE structures runtime telemetry into diagnoses and evidence-grounded guidance, raising recovery rates by 12.45 points over baselines on 257 unresolved software repair and AIOps cases.
-
TraceFix: Repairing Agent Coordination Protocols with TLA+ Counterexamples
TraceFix repairs LLM-generated multi-agent protocols via TLA+ counterexamples to achieve full verification on all tested tasks and higher completion rates than prompt-only baselines.
-
MASPrism: Lightweight Failure Attribution for Multi-Agent Systems Using Prefill-Stage Signals
MASPrism attributes failures in LLM multi-agent executions by extracting token-level negative log-likelihood and attention weights from a small model's prefill pass, then ranking candidates with a second prefill, achi...
-
The Moltbook Files: A Harmless Slopocalypse or Humanity's Last Experiment
An AI-agent social platform generated mostly neutral content whose use in fine-tuning reduced model truthfulness comparably to human Reddit data, suggesting limited unique harm but flagging tail risks like secret leaks.
-
EditRefiner: A Human-Aligned Agentic Framework for Image Editing Refinement
EditRefiner uses a perception-reasoning-action-evaluation agent loop and the EditFHF-15K human feedback dataset to refine text-guided image edits more accurately than prior methods.
-
When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory
A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
-
TeamBench: Evaluating Agent Coordination under Enforced Role Separation
Enforcing role separation in agent teams reveals that prompt-only setups hide coordination failures, with verifiers approving 49% of failing work and teams sometimes harming performance when solo agents already succeed.
-
MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents
MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
-
QASecClaw: A Multi-Agent LLM Approach for False Positive Reduction in Static Application Security Testing
A multi-agent LLM system cuts false positives in static application security testing by 88.6% on the OWASP Benchmark while dropping recall by only 3.1%.
-
Catching the Infection Before It Spreads: Foresight-Guided Defense in Multi-Agent Systems
A foresight-based local purification method using multi-persona simulations and recursive diagnosis reduces infectious jailbreak spread in multi-agent systems from over 95% to below 5.47% while matching benign perform...
-
Can Current Agents Close the Discovery-to-Application Gap? A Case Study in Minecraft
Current AI agents achieve only 26% success on SciCrafter's redstone tasks requiring causal discovery and application, indicating the discovery-to-application loop remains challenging with shifting bottlenecks.
-
Incisor: Ex Ante Cloud Instance Selection for HPC Jobs
Incisor uses program analysis and frontier LLMs to select working AWS EC2 instances ex ante for 100% of first-time HPC runs of C/C++/Fortran and Python codes, cutting runtime 54% and costs 44% versus an expert-constra...
-
PhysCodeBench: Benchmarking Physics-Aware Symbolic Simulation of 3D Scenes via Self-Corrective Multi-Agent Refinement
PhysCodeBench benchmark and SMRF multi-agent framework enable better AI generation of physically accurate 3D simulation code, boosting performance by 31 points over baselines.
-
A Systematic Survey of Security Threats and Defenses in LLM-Based AI Agents: A Layered Attack Surface Framework
A new 7x4 taxonomy organizes agentic AI security threats by architectural layer and persistence timescale, revealing under-explored upper layers and missing defenses after surveying 116 papers.
-
Dr.Sai: An agentic AI for real-world physics analysis at BESIII
Dr.Sai autonomously executed full physics analysis pipelines on real BESIII data to re-measure ten J/psi decay branching fractions, matching established benchmarks without any manual coding.
-
Synthesizing Multi-Agent Harnesses for Vulnerability Discovery
AgentFlow uses a typed graph DSL covering roles, prompts, tools, topology and protocol plus a runtime-signal feedback loop to optimize multi-agent harnesses, reaching 84.3% on TerminalBench-2 and discovering ten new z...
-
ClawCoin: An Agentic AI-Native Cryptocurrency for Decentralized Agent Economies
ClawCoin is a compute-cost-indexed token with oracle, vault, and settlement layers that stabilizes multi-agent workflows under cost shocks better than fiat baselines in simulator tests.
-
Provable Coordination for LLM Agents via Message Sequence Charts
A message sequence chart language for LLM agents enables provable deadlock-free coordination by projecting global specifications to local programs independent of LLM nondeterminism.
-
SAT: Sequential Agent Tuning for Coordinator Free Plug and Play Multi-LLM Training with Monotonic Improvement Guarantees
SAT trains multi-LLM teams with sequential block updates to deliver monotonic gains and plug-and-play model swaps that provably improve performance bounds.
-
Credo: Declarative Control of LLM Pipelines via Beliefs and Policies
Credo proposes representing LLM agent state as beliefs and regulating pipeline behavior with declarative policies stored in a database for adaptive, auditable control.
-
Towards Personalizing Secure Programming Education with LLM-Injected Vulnerabilities
LLM agents inject CWEs into student-authored code to generate personalized security examples; in a 71-student deployment, participants rated them more relevant than textbook cases but quantitative differences remained...
-
The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents
A parallel Cognitive Companion architecture reduces repetition in LLM agents by 52-62% on loop-prone tasks using LLM monitoring with 11% overhead or zero-overhead probes on hidden states, with benefits depending on task type.
-
SemiFA: An Agentic Multi-Modal Framework for Autonomous Semiconductor Failure Analysis Report Generation
SemiFA is a four-agent LangGraph pipeline that combines DINOv2 and LLaVA image analysis with SECS/GEM telemetry and vector retrieval to produce complete FA reports in 48 seconds.
-
MPAC: A Multi-Principal Agent Coordination Protocol for Interoperable Multi-Agent Collaboration
MPAC defines a multi-principal agent coordination protocol across Session, Intent, Operation, Conflict, and Governance layers, with 21 message types and state machines, delivering 95% lower coordination overhead in a ...
-
An Agentic Evaluation Architecture for Historical Bias Detection in Educational Textbooks
An agentic architecture with multimodal screening, a five-agent jury, meta-synthesis, and source attribution protocol detects biases in Romanian history textbooks more accurately than zero-shot baselines, achieving 83...
-
Strategic Persuasion with Trait-Conditioned Multi-Agent Systems for Iterative Legal Argumentation
Multi-agent LLM simulations with trait-conditioned agents and a reinforcement-learning orchestrator show heterogeneous teams and dynamic trait selection outperform static configurations in simulated legal argumentation.
-
Springdrift: An Auditable Persistent Runtime for LLM Agents with Case-Based Memory, Normative Safety, and Ambient Self-Perception
Springdrift provides an auditable persistent runtime for long-lived LLM agents with case-based memory, normative safety gating, and ambient self-perception, shown in a 23-day single-instance deployment where the agent...
-
Architecture Without Architects: How AI Coding Agents Shape Software Architecture
AI coding agents perform vibe architecting by making prompt-driven architectural choices that produce structurally different systems for identical tasks.
-
AlphaEvolve: A coding agent for scientific and algorithmic discovery
AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, ...
-
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
τ-bench shows state-of-the-art agents like GPT-4o succeed on under 50% of tool-using, rule-following tasks and are inconsistent across repeated trials.
-
GAIA: a benchmark for General AI Assistants
GAIA benchmark shows humans at 92% accuracy on simple real-world questions far outperform current AI systems at 15%, proposing this gap as a key milestone for general AI.
-
OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research
OpenAaaS is a hierarchical agent-as-a-service system that enables secure multi-agent collaboration for materials informatics by moving code to data rather than data to code.
-
CHAL: Council of Hierarchical Agentic Language
CHAL is a multi-agent dialectic system that performs structured belief optimization over defeasible domains using Bayesian-inspired graph representations and configurable meta-cognitive value system hyperparameters.
-
Predictive Maps of Multi-Agent Reasoning: A Successor-Representation Spectrum for LLM Communication Topologies
Spectral features of the successor representation matrix for multi-agent LLM communication topologies predict robustness to perturbations, consensus formation, and error accumulation, with an extension to account for ...
-
SkillEvolver: Skill Learning as a Meta-Skill
A meta-skill authors and refines prose-and-code skills for agents by learning from post-deployment failures with an overfit audit, achieving 56.8% accuracy on SkillsBench tasks versus 43.6% for human-curated skills.
-
CellDX AI Autopilot: Agent-Guided Training and Deployment of Pathology Classifiers
CellDX AI Autopilot lets users train pathology classifiers via AI agent skills on a large pre-extracted whole-slide image dataset with automated hyperparameter tuning that claims over 30x cost reduction.
-
A Communication-Theoretic Framework for LLM Agents: Cost-Aware Adaptive Reliability
LLM reliability techniques are unified as communication channel operators, with a new cost-aware router achieving superior quality-cost tradeoffs on hard tasks.
-
OPT-BENCH: Evaluating the Iterative Self-Optimization of LLM Agents in Large-Scale Search Spaces
OPT-BENCH and OPT-Agent evaluate LLM self-optimization in large search spaces, showing stronger models improve via feedback but stay constrained by base capacity and below human performance.
-
AgentSlimming: Towards Efficient and Cost-Aware Multi-Agent Systems
AgentSlimming compresses graph-structured multi-agent systems by estimating agent importance and removing or replacing low-value agents, cutting token costs by up to 78.9% with negligible performance loss.
-
Unsafe by Flow: Uncovering Bidirectional Data-Flow Risks in MCP Ecosystem
MCP-BiFlow detects 93.8% of known bidirectional data-flow vulnerabilities in MCP servers and identifies 118 confirmed issues across 87 real-world servers from a scan of 15,452 repositories.
-
SARC: A Governance-by-Architecture Framework for Agentic AI Systems
SARC compiles constraint specifications into Pre-Action Gate, Action-Time Monitor, Post-Action Auditor, and Escalation Router components, achieving zero hard violations and 89.5% fewer soft overages than policy-as-cod...
-
MAGIQ: A Post-Quantum Multi-Agentic AI Governance System with Provable Security
MAGIQ introduces a post-quantum secure system for policy definition, enforcement, and accountability in multi-agent AI using novel cryptographic protocols and UC framework proofs.
-
BioMedArena: An Open-source Toolkit for Building and Evaluating Biomedical Deep Research Agents
BioMedArena releases a standardized toolkit with 147 biomedical benchmarks, 75 tools, and six harnesses that achieve SOTA results on eight tasks with a +15.03 percentage point average lift.
-
Agentic Coding Needs Proactivity, Not Just Autonomy
Coding agents require a three-level proactivity taxonomy (Reactive, Scheduled, Situation Aware) evaluated by insight policy quality using Insight Decision Quality, Context Grounding Score, and Learning Lift.
-
Hierarchical Visual Agent: Managing Contexts in Joint Image-Text Space for Advanced Chart Reasoning
HierVA improves multi-step chart question answering by having a high-level manager maintain key joint contexts while specialized workers perform targeted reasoning with visual zoom-in.
-
Redefining AI Red Teaming in the Agentic Era: From Weeks to Hours
An agentic red teaming system automates creation of adversarial testing workflows from natural language goals, unifying ML and generative AI attacks and achieving 85% success rate on Meta Llama Scout with no custom hu...
-
Robust Agent Compensation (RAC): Teaching AI Agents to Compensate
RAC adds a log-based safety net to AI agents via framework extensions, delivering 1.5-8X better latency and token use than LLM-based recovery on complex problems in τ-bench and REALM-Bench.
-
Coordination as an Architectural Layer for LLM-Based Multi-Agent Systems
Coordination treated as a separable architectural layer in LLM multi-agent systems yields distinguishable Murphy-decomposed performance signatures on prediction-market tasks, with some configurations dominating a cost...
-
Self-Adaptive Multi-Agent LLM-Based Security Pattern Selection for IoT Systems
ASPO combines multi-agent LLM proposals with deterministic enforcement in a MAPE-K loop to select conflict-free, resource-feasible security patterns for IoT, delivering 100% safety invariants and 21-23% tail latency/e...
-
MarketBench: Evaluating AI Agents as Market Participants
LLMs show poor calibration in predicting task success and token use on software engineering benchmarks, causing market auctions to underperform compared to perfect information scenarios, with limited improvement from ...
-
ClawTrace: Cost-Aware Tracing for LLM Agent Skill Distillation
ClawTrace enables cost-aware LLM agent skill distillation by tracing per-step costs and generating preserve, prune, and repair patches, with ablations showing reduced regressions and prune rules transferring to cut co...
-
MindTrellis: Co-Creating Knowledge Structures with AI through Interactive Visual Exploration
MindTrellis enables users and AI to co-create evolving knowledge graphs, outperforming retrieval-only tools in expert-rated content coverage, structural quality, and reduced cognitive load during a study of 12 partici...
-
No Test Cases, No Problem: Distillation-Driven Code Generation for Scientific Workflows
MOSAIC generates executable scientific code without I/O test cases by combining student-teacher distillation with a consolidated context window to reduce hallucinations across subproblems.
Reference graph
Works this paper leans on
-
[1]
https://arxiv.org/abs/2304.07590, arXiv:2304.07590
Association for Computational Linguistics. Yihong Dong, Xue Jiang, Zhi Jin, and Ge Li. Self-collaboration code generation via ChatGPT. arXiv preprint arXiv:2304.07590, 2023. Yilun Du, Shuang Li, Antonio Torralba, Joshua B. Tenenbaum, and Igor Mordatch. Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2...
-
[2]
Consider using built-in agents first. For example, AssistantAgent is pre-configured to be backed by GPT-4, with a carefully designed system message for generic problem-solving via code. The UserProxyAgent is configured to solicit human inputs and perform tool execution. Many problems can be solved by simply combining these two agents. When customizing age...
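The two-agent pattern described here can be sketched with a plain-Python mock. This is not the AutoGen API: the class names, the `generate_reply` hook, and the canned replies all stand in for an LLM backend and a real code executor.

```python
# Plain-Python mock of the AssistantAgent/UserProxyAgent pattern described above.
# NOT the AutoGen API: canned replies stand in for an LLM and a code executor.

class MockAssistant:
    """Stands in for an LLM-backed AssistantAgent: proposes code, then stops."""
    name = "assistant"

    def __init__(self):
        self._replies = iter(["EXECUTE: print(2 + 2)", "TERMINATE"])

    def generate_reply(self, message):
        return next(self._replies)

class MockUserProxy:
    """Stands in for a UserProxyAgent that 'executes' proposed code."""
    name = "user_proxy"

    def generate_reply(self, message):
        if message.startswith("EXECUTE:"):
            return "exitcode: 0, output: 4"  # pretend we ran the proposed code
        return ""

def initiate_chat(proxy, assistant, message, max_turns=6):
    """Alternate messages between the two agents until TERMINATE or max_turns."""
    transcript = [(proxy.name, message)]
    receiver, other = assistant, proxy
    for _ in range(max_turns):
        reply = receiver.generate_reply(message)
        if reply == "TERMINATE":
            break
        transcript.append((receiver.name, reply))
        receiver, other = other, receiver
        message = reply
    return transcript

log = initiate_chat(MockUserProxy(), MockAssistant(),
                    "Compute 2 + 2 and report the result.")
```

In AutoGen itself, this loop is driven by calling `initiate_chat` on the user proxy, with the proxy actually executing any code blocks it receives and feeding the output back.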
-
[3]
Start with a simple conversation topology. Consider using the two-agent chat or the group chat setup first, as they can often be extended with the least code. Note that the two-agent chat can be easily extended to involve more than two agents by using LLM-consumable functions in a dynamic way
-
[4]
Try to reuse built-in reply methods based on LLM, tool, or human before implementing a custom reply method because they can often be reused to achieve the goal in a simple way (e.g., the built-in agent GroupChatManager’s reply method reuses the built-in LLM-based reply function when selecting the next speaker, ref. A5 in Section 3)
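The reuse advice above amounts to an ordered dispatch over reply methods: each is tried in turn and either claims the message or passes. A toy sketch (the function names are illustrative, not AutoGen's registration API):

```python
# Sketch of trying built-in reply methods in order before a custom one.
# Hypothetical names; AutoGen's analogous mechanism registers reply
# functions on a conversable agent.

def tool_reply(message):
    """Handles only messages that invoke the (toy) registered 'add' tool."""
    if message.startswith("CALL add "):
        a, b = map(int, message.split()[2:4])
        return True, f"tool result: {a + b}"
    return False, None

def llm_reply(message):
    """Catch-all stand-in for an LLM-based reply."""
    return True, f"llm answer to: {message!r}"

def generate_reply(message, reply_funcs=(tool_reply, llm_reply)):
    """Try each reply method in order; the first that claims the message wins."""
    for func in reply_funcs:
        handled, reply = func(message)
        if handled:
            return reply
    return None
```

A speaker-selection step like the GroupChatManager's (ref. A5) would simply be another reply function placed in such a list.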
-
[5]
When developing a new application with UserProxyAgent, start with humans always in the loop, i.e., human_input_mode='ALWAYS', even if the target operation mode is more autonomous. This helps evaluate the effectiveness of AssistantAgent, tuning the prompt, discovering corner cases, and debugging. Once confident with small-scale success, consider sett...
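The human-in-the-loop switch can be sketched as a single branch in the proxy's reply step. The mode strings "ALWAYS", "TERMINATE", and "NEVER" are AutoGen's; everything else here is illustrative.

```python
# Toy switch over human input modes ("ALWAYS", "TERMINATE", "NEVER").
# `ask_human` stands in for reading from the console.

def proxy_reply(message, human_input_mode="ALWAYS", ask_human=input):
    """Return a human-provided reply in ALWAYS mode, otherwise auto-reply."""
    if human_input_mode == "ALWAYS":
        # Empty input falls back to an automatic reply, mirroring the idea of
        # loosening oversight once small-scale runs look correct.
        return ask_human(f"Reply to {message!r} (empty = auto): ") or "AUTO-OK"
    if human_input_mode == "TERMINATE" and message == "TERMINATE":
        return ask_human("Agent wants to stop; your reply: ")
    return "AUTO-OK"

# Simulated humans so the example is self-contained.
assert proxy_reply("draft plan", "ALWAYS", lambda _: "looks good") == "looks good"
assert proxy_reply("draft plan", "ALWAYS", lambda _: "") == "AUTO-OK"
assert proxy_reply("draft plan", "NEVER") == "AUTO-OK"
```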
-
[6]
Despite the numerous advantages of AutoGen agents, there could be cases/scenarios where other libraries/packages could help. For example: (1) For (sub)tasks that do not have requirements for back-and-forth troubleshooting, multi-agent interaction, etc., a unidirectional (no back-and-forth message exchange) pipeline can also be orchestrated with LangChain...
-
[7]
Input the problem: Find the equation of the plane which bisects the angle between the planes 3x − 6y + 2z + 5 = 0 and 4x − 12y + 3z − 3 = 0 , and which contains the point (−5, −1, −5). Enter your answer in the form Ax + By + Cz + D = 0, where A, B, C, D are integers such that A > 0 and gcd(|A|, |B|, |C|, |D|) = 1
-
[8]
The response from the system does not solve the problem correctly. We then give a hint to the model: Your idea is not correct. Let’s solve this together. Suppose P = ( x, y, z) is a point that lies on a plane that bisects the angle, the distance from P to the two planes is the same. Please set up this equation first
-
[9]
We expect the system to give the correct distance equation. Since the equation involves an absolute sign that is hard to solve, we would give the next hint: Consider the two cases to remove the abs sign and get two possible solutions
-
[10]
If the system returns the two possible solutions and doesn’t continue to the next step, we give the last hint: Use point (-5,-1,-5) to determine which is correct and give the final answer
-
[11]
Final answer is 11x+6y+5z+86=0. We observed that AutoGen consistently solved the problem across all three trials. ChatGPT+Code Interpreter and ChatGPT+Plugin managed to solve the problem in two out of three trials, while AutoGPT failed to solve it in all three attempts. In its unsuccessful attempt, ChatGPT+Code Interpreter failed to adhere to human hin...
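The reported answer checks out directly: a point on the bisecting plane is equidistant from both given planes, so the candidates are 13(3x - 6y + 2z + 5) = ±7(4x - 12y + 3z - 3) (13 and 7 are the lengths of the two normal vectors), and the point (-5, -1, -5) selects the sign. A short verification:

```python
from math import gcd

# Verify the final answer 11x + 6y + 5z + 86 = 0 reported above.
p1, p2 = (3, -6, 2, 5), (4, -12, 3, -3)   # coefficients of the two given planes
point = (-5, -1, -5)                       # must lie on the bisecting plane
n1, n2 = 7, 13                             # lengths of the two normal vectors

def evaluate(plane, pt):
    a, b, c, d = plane
    x, y, z = pt
    return a * x + b * y + c * z + d

answer = None
for sign in (1, -1):
    # Removing the absolute value gives two candidates: n2*p1 - sign*n1*p2 = 0.
    cand = tuple(n2 * u - sign * n1 * v for u, v in zip(p1, p2))
    if evaluate(cand, point) == 0:         # the given point picks the sign
        g = gcd(gcd(abs(cand[0]), abs(cand[1])), gcd(abs(cand[2]), abs(cand[3])))
        flip = 1 if cand[0] > 0 else -1    # enforce A > 0
        answer = tuple(flip * u // g for u in cand)

print(answer)  # (11, 6, 5, 86), i.e. 11x + 6y + 5z + 86 = 0
```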
-
[12]
Question and Contexts
-
[13]
Satisfied Answers or Terminate
Terminate, feedbacks, or `Update Context`; 4. Satisfied Answers or Terminate
-
[14]
Satisfied Answers or `Update Context`. Figure 7: Overview of Retrieval-Augmented Chat, which involves two agents: a Retrieval-Augmented User Proxy and a Retrieval-Augmented Assistant. Given a set of documents, the Retrieval-Augmented User Proxy first automatically processes the documents: splits, chunks, and stores them in a vector database. Then for ...
-
[15]
The Retrieval-Augmented User Proxy retrieves document chunks based on embedding similarity, and sends them along with the question to the Retrieval-Augmented Assistant.
-
[16]
The Retrieval-Augmented Assistant employs an LLM to generate code or text as answers based on the question and context provided. If the LLM is unable to produce a satisfactory response, it is instructed to reply with “Update Context” to the Retrieval-Augmented User Proxy
-
[17]
If a response includes code blocks, the Retrieval-Augmented User Proxy executes the code and sends the output as feedback. If there are no code blocks or instructions to update the context, it terminates the conversation. Otherwise, it updates the context and forwards the question along with the new context to the Retrieval-Augmented Assistant. Note that ...
-
[18]
If the Retrieval-Augmented Assistant receives “Update Context”, it requests the next most similar chunks of documents as new context from the Retrieval-Augmented User Proxy. Otherwise, it generates new code or text based on the feedback and chat history. If the LLM fails to generate an answer, it replies with “Update Context” again. This process can be re...
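The retrieve, answer, and "Update Context" loop in [14]-[18] can be sketched in plain Python. This is a toy stand-in: word overlap replaces embedding similarity against a vector database, and a keyword check replaces the assistant's LLM call.

```python
# Toy sketch of the Retrieval-Augmented Chat loop described above.
corpus = [
    "AutoGen supports group chat between multiple agents.",
    "Conversable agents can execute code blocks.",
    "Retrieval-augmented chat splits documents into chunks.",
]

def similarity(query, chunk):
    """Crude stand-in for embedding similarity: word overlap."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def assistant_reply(question, context):
    """Stand-in for the Retrieval-Augmented Assistant's LLM call."""
    if any(w in context.lower() for w in question.lower().split()):
        return f"Answer based on: {context}"
    return "Update Context"            # the protocol's escape hatch

def retrieve_chat(question, k=1, max_updates=3):
    ranked = sorted(corpus, key=lambda ch: -similarity(question, ch))
    for i in range(max_updates):
        # Each round fetches the next-most-similar chunks as new context.
        context = ranked[i * k : (i + 1) * k]
        if not context:
            break
        reply = assistant_reply(question, " ".join(context))
        if reply != "Update Context":
            return reply
    return "No satisfactory answer found."

print(retrieve_chat("how does group chat work?"))
```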
-
[19]
What if we prohibit shipping from supplier 1 to roastery 2?
is an open-source Python library designed for efficient AutoML and tuning. It was open-sourced in December 2020, and is included in the training data of GPT-4. However, the question necessitates the use of Spark-related APIs, which were added in December 2022 and are not encompassed in the GPT-4 training data. Consequently, the original GPT-4 model is ...
-
[20]
Broadcast (Alice, Bob, User Proxy)
-
[21]
Select a Speaker (Alice, Bob, User Proxy; selected: Bob); 2. Ask the Speaker to Respond (Manager; Response). Figure 12: A5: Dynamic Group Chat: Overview of how AutoGen enables dynamic group chats to solve tasks. The Manager agent, which is an instance of the GroupChatManager class, performs the following three steps: select a single speaker (in this case Bob), ask the sp...
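The manager's three steps (select a speaker, ask it to respond, broadcast the response) can be mocked with scripted agents. Speaker selection here is round-robin for simplicity; AutoGen's GroupChatManager instead picks the next speaker with an LLM-based reply.

```python
# Sketch of the GroupChatManager loop from Figure 12, with scripted agents
# standing in for LLM-backed ones.

class ScriptedAgent:
    def __init__(self, name, lines):
        self.name = name
        self._lines = iter(lines)
        self.inbox = []

    def respond(self):
        return next(self._lines, "TERMINATE")

    def receive(self, sender, message):
        self.inbox.append((sender, message))

def run_group_chat(agents, max_round=10):
    transcript = []
    for rnd in range(max_round):
        speaker = agents[rnd % len(agents)]   # 1. select a single speaker
        message = speaker.respond()           # 2. ask the speaker to respond
        if message == "TERMINATE":
            break
        transcript.append((speaker.name, message))
        for agent in agents:                  # 3. broadcast the response
            if agent is not speaker:
                agent.receive(speaker.name, message)
    return transcript

alice = ScriptedAgent("Alice", ["I propose a plan."])
bob = ScriptedAgent("Bob", ["Plan looks good."])
log = run_group_chat([alice, bob])
```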
-
[22]
What if the roasting cost is increased by 5% because of the potential salary increase?
The negative side shows a better understanding of the simplification process. Table 13: Application A3. ChatGPT+Code Interpreter for OptiGuide. A sample question "What if the roasting cost is increased by 5% because of the potential salary increase?" is asked. Action: ChatGPT+Code Interpreter. Prompt Writer / Customer: open Web browser. For the source...
-
[23]
Simplify and rationalize the denominator for the expression √225/√45 × √200/√125. 2. Simplify and rationalize the denominator for the expression √289/√361 × √100/√72. ... Until 10. Adding new tasks to task storage: 'task name': 'Simplify and rationalize the denominator for the expression \frac{\sqrt{225}}{\sqrt{45}} \times \frac{\sqrt{200}}{\sqrt{125}}', 'taskid': 2, 'task name': ...
-
[25]
Click the button with xpath "//button[@id='subbtn2']". Current task: Click button ONE, then click button TWO. plan:
AssistantAgent to Executor agent:
-
[27]
Click the button with xpath "//button[@id='subbtn2']".
Executor agent to AssistantAgent: Below is the HTML code of the webpage where the agent should solve a task.
<div id="wrap" data-wob_ref="2" data-wob_eps="e0">
<div id="query">Click button ONE, then click but...
-
[29]
Click the button with xpath "//button[@id='subbtn2']". We have a history of instructions that have been already executed by the autonomous agent so far. No instruction has been executed yet. Based on the plan and the history of instructions executed so far, the first instruction should be '
A...
-
[30]
-
[31]
Click the button with xpath "//button[@id='subbtn2']". We have a history of instructions that have been already executed by the autonomous agent so far. 1: clickxpath //button[@id='subbtn'] Based on the plan and the history of instructions executed so far, the next proper instruction should be '...