Aegis: Automated Error Generation and Attribution for Multi-Agent Systems

Fanqi Kong; Guibin Zhang; Huaxiao Yin; Ruijie Zhang; Song-Chun Zhu; Xiaofei Zhang; Xiaoyuan Zhang; Xue Feng; Zhaowei Zhang; Ziang Chen

arxiv: 2509.14295 · v6 · submitted 2025-09-17 · 💻 cs.RO · cs.MA


Pith reviewed 2026-05-18 16:57 UTC · model grok-4.3


keywords: multi-agent systems · error attribution · automated data generation · LLM-based error injection · faulty agent identification · supervised fine-tuning · contrastive learning · reinforcement learning


The pith

An automated framework generates error datasets so smaller LLMs can attribute faults in multi-agent systems competitively with much larger models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Aegis as a way to overcome the shortage of labeled error data in multi-agent systems by using an LLM manipulator to insert realistic faults into already successful execution traces. This produces thousands of paired positive and negative examples with annotations for which agent failed and in what way. The resulting dataset supports training under supervised fine-tuning, reinforcement learning, and contrastive learning. Experiments show that models trained this way improve at error attribution, and some smaller fine-tuned models reach or exceed the performance of proprietary models many times their size.

Core claim

Aegis constructs a dataset of 9,533 trajectories by having an LLM-based manipulator adaptively inject context-aware errors into successful multi-agent execution traces, creating annotated examples of faulty agents and error modes across diverse architectures and domains. The structured positive-negative pairs then train models through supervised fine-tuning, reinforcement learning, and contrastive learning, yielding consistent gains in error attribution accuracy.

What carries the argument

The LLM-based manipulator, which adaptively injects context-aware errors into successful trajectories to produce labeled faulty-agent and error-mode data.
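
The injection step can be pictured with a minimal sketch. Everything here is an illustrative assumption rather than the paper's implementation: the `Turn` and `Trajectory` types, the `inject_error` helper, and the `toy_manipulator` stand-in for the LLM call are hypothetical names, and the error code `FM-1.1` is used only as a sample label. The sketch corrupts one turn of a successful trajectory and records which agent was made faulty and how; it does not model any re-execution of later turns.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Turn:
    agent: str
    content: str


@dataclass
class Trajectory:
    turns: List[Turn]
    # (agent_name, error_mode) labels; empty for a clean, successful run.
    faults: List[Tuple[str, str]] = field(default_factory=list)


# Hypothetical manipulator signature: given a turn and an error mode,
# return corrupted content. A real manipulator would be an LLM call.
Manipulator = Callable[[Turn, str], str]


def inject_error(clean: Trajectory, turn_index: int, error_mode: str,
                 manipulate: Manipulator) -> Trajectory:
    """Return a faulty copy of `clean` with one corrupted turn and its label."""
    if not 0 <= turn_index < len(clean.turns):
        raise IndexError("turn_index out of range")
    target = clean.turns[turn_index]
    corrupted = Turn(target.agent, manipulate(target, error_mode))
    turns = clean.turns[:turn_index] + [corrupted] + clean.turns[turn_index + 1:]
    return Trajectory(turns, faults=[(target.agent, error_mode)])


def toy_manipulator(turn: Turn, error_mode: str) -> str:
    # Placeholder corruption so the sketch runs without an LLM.
    return f"[{error_mode}] {turn.content.upper()}"


clean = Trajectory([Turn("planner", "split the task"),
                    Turn("coder", "write the function")])
faulty = inject_error(clean, 1, "FM-1.1", toy_manipulator)
pair = (clean, faulty)  # one positive-negative training pair
```

Because the clean trajectory is left unmodified, each injection yields a matched pair that differs only in the corrupted turn, which is what makes the labels cheap to produce.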

If this is right

Models trained on the generated data identify faulty agents and error modes more reliably than prior approaches.
Smaller fine-tuned models reach performance levels competitive with or better than proprietary models an order of magnitude larger.
Error attribution becomes feasible at scale without relying on costly manual annotation of trajectories.
Multiple learning paradigms can be applied to the same dataset to improve robustness and interpretability of the attribution process.
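
To make the last point concrete, here is a hedged sketch of how one labeled pair could feed three different training signals. None of these functions come from the paper: `sft_example`, `attribution_reward` (a Jaccard overlap), and `pairwise_margin_loss` (a simple stand-in for a contrastive objective) are hypothetical and chosen only because they are easy to state.

```python
from typing import Set, Tuple

Label = Tuple[str, str]  # (agent_name, error_mode)


def sft_example(trajectory_text: str, gold: Set[Label]) -> Tuple[str, str]:
    """Supervised view: map the trajectory text to a target label string."""
    target = "; ".join(f"{agent}: {mode}" for agent, mode in sorted(gold))
    return trajectory_text, target or "no faults"


def attribution_reward(predicted: Set[Label], gold: Set[Label]) -> float:
    """RL view: Jaccard overlap between predicted and gold labels."""
    if not predicted and not gold:
        return 1.0  # correctly reporting a clean trajectory
    return len(predicted & gold) / len(predicted | gold)


def pairwise_margin_loss(score_clean: float, score_faulty: float,
                         margin: float = 0.5) -> float:
    """Contrastive-style view: the faulty trajectory should score higher
    on 'contains an error' than its clean twin by at least `margin`."""
    return max(0.0, margin - (score_faulty - score_clean))
```

The three functions consume the same labels, which is the property the list above relies on: no extra annotation is needed to switch paradigms.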

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same injection approach could be adapted to generate training data for debugging single-agent or non-LLM-based systems.
Deployed multi-agent applications might use the trained attribution models for ongoing monitoring and automatic fault reporting.
Testing the dataset on entirely new task domains would reveal how well the generated errors generalize beyond the original training distribution.

Load-bearing premise

The errors created by the LLM manipulator accurately reflect the failure modes that occur in real multi-agent system runs instead of introducing artificial patterns or biases.

What would settle it

A direct test in which models trained on the generated dataset are evaluated on multi-agent executions whose errors were independently identified and labeled by human experts, checking whether attribution accuracy drops sharply compared with the paper's reported results.
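
A minimal sketch of that comparison, assuming exact-match scoring over sets of (agent, error mode) labels; the metric choice and the function names `exact_match_accuracy` and `relative_drop` are assumptions for illustration, not the paper's evaluation protocol.

```python
from typing import FrozenSet, Sequence, Tuple

Label = Tuple[str, str]          # (agent_name, error_mode)
Attribution = FrozenSet[Label]   # all faults attributed to one trajectory


def exact_match_accuracy(predictions: Sequence[Attribution],
                         gold: Sequence[Attribution]) -> float:
    """Fraction of trajectories whose predicted label set equals the gold set."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty evaluation set")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


def relative_drop(synthetic_acc: float, human_acc: float) -> float:
    """Share of synthetic-split accuracy lost on the human-labeled split."""
    if synthetic_acc <= 0:
        raise ValueError("synthetic accuracy must be positive")
    return (synthetic_acc - human_acc) / synthetic_acc
```

A large relative drop would suggest the model has learned the manipulator's style; a small one would support the claim that the injected errors transfer to real failures.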

Figures

Figures reproduced from arXiv: 2509.14295 by Fanqi Kong, Guibin Zhang, Huaxiao Yin, Ruijie Zhang, Song-Chun Zhu, Xiaofei Zhang, Xiaoyuan Zhang, Xue Feng, Zhaowei Zhang, Ziang Chen.

**Figure 1.** An overview of the Aegis framework. The Data Construction pipeline (left) automatically generates a dataset of labeled failures by taking successful multi-agent trajectories and applying controlled, context-aware error injections via an adaptive manipulator. The resulting dataset’s structure enables three distinct Learning Methods (right) for the error attribution task.

**Figure 2.** Performance (average score) of different models on …

**Figure 3.** Analysis of GRPO and SFT training, showing (a) GRPO reward curves, (b) the influence …

**Figure 4.** A t-SNE visualization of the learned embedding space at the trajectory (bag) and individual turn levels. We ablate the DCL model to examine its core design. Both semantic guidance provided by the text-based prototypes (w/o intent) and compositional consistency (w/o consistency) prove critical, with the latter causing near-collapse in Pair-level accuracy. This confirms that a strong semantic prior and log…

**Figure 5.** A case study of Who&When in a financial planning task.

Original abstract

Large language model based multi-agent systems (MAS) have unlocked significant advancements in tackling complex problems, but their increasing capability introduces a structural fragility that makes them difficult to debug. A key obstacle to improving their reliability is the severe scarcity of large-scale, diverse datasets for error attribution, as existing resources rely on costly and unscalable manual annotation. To address this bottleneck, we introduce Aegis, a novel framework for Automated error generation and attribution for multi-agent systems. Aegis constructs a large dataset of 9,533 trajectories with annotated faulty agents and error modes, covering diverse MAS architectures and task domains. This is achieved using a LLM-based manipulator that can adaptively inject context-aware errors into successful execution trajectories. Leveraging fine-grained labels and the structured arrangement of positive-negative sample pairs, Aegis supports three different learning paradigms: Supervised Fine-Tuning, Reinforcement Learning, and Contrastive Learning. We develop learning methods for each paradigm. Comprehensive experiments show that trained models consistently achieve substantial improvements in error attribution. Notably, several of our fine-tuned LLMs demonstrate performance competitive with or superior to proprietary models an order of magnitude larger, validating our automated data generation framework as a crucial resource for developing more robust and interpretable multi-agent systems. Our project website is available at https://kfq20.github.io/Aegis-Website/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Aegis gives a workable way to scale up error-labeled data for multi-agent LLM systems, but the gains rest on an unverified assumption that the injected faults behave like real ones.


Aegis shows how to turn successful multi-agent runs into a large set of error-annotated examples by using an LLM to insert context-aware faults. With 9,533 trajectories, the authors train models for error attribution under three learning paradigms, and some of the resulting models match or beat much larger proprietary ones on the paper's test data. The new piece is the automated generation step itself, plus the way it supports supervised fine-tuning, reinforcement learning, and contrastive learning from the same labeled pairs. That is a direct response to the lack of scalable data for debugging these systems in robotics and automation.

The experiments claim consistent gains, but the abstract gives no specific metrics or baseline tables, so the size of the improvement is not clear from what is shown. More importantly, there is no comparison between the injected error types and the errors that actually occur in human-run or real-world multi-agent executions. Without that comparison, the models may be learning to spot the manipulator's particular style rather than general failure modes.

This paper is for groups working on reliable multi-agent LLM applications who need more training data for attribution tasks. A reader looking for a concrete method to bootstrap error datasets will get value from the pipeline description and the released resources. It deserves a serious referee because the problem is timely and the approach is straightforward, even if the validation against real failures will need to be strengthened.

Referee Report

2 major / 2 minor

Summary. The paper presents Aegis, a framework for automated generation of error-annotated trajectories in LLM-based multi-agent systems (MAS). It uses an LLM manipulator to inject context-aware errors into successful execution traces, producing a dataset of 9,533 trajectories with labels for faulty agents and error modes across diverse architectures and domains. The dataset enables training via supervised fine-tuning, reinforcement learning, and contrastive learning for error attribution. Experiments claim consistent improvements, with several fine-tuned models performing competitively with or better than proprietary models an order of magnitude larger.

Significance. If the synthetic error modes faithfully capture the statistics and causal structure of real MAS failures, the automated pipeline would provide a scalable alternative to manual annotation and could accelerate development of more reliable and interpretable multi-agent systems. The use of positive-negative sample pairs and support for multiple learning paradigms is a constructive contribution to the data-scarcity problem in MAS debugging.

major comments (2)

[Abstract / Experiments] Abstract and experimental evaluation: the headline claim that fine-tuned models are competitive with or superior to much larger proprietary models rests on the unverified assumption that LLM-injected errors match the distribution of real MAS failure modes. No quantitative validation (e.g., KL divergence on error-type histograms, or accuracy drop when tested on held-out human-annotated real failures) is reported, leaving open the possibility that reported gains exploit manipulator-specific artifacts rather than genuine attribution capability.
[Data Generation / Methods] Data-generation pipeline (presumably §3–4): the manipulator is described as “adaptively inject[ing] context-aware errors,” yet no ablation or sensitivity analysis is provided on the injection parameters or prompt templates. This makes it difficult to assess reproducibility or to determine how much of the downstream performance depends on the specific manipulator configuration.
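
The histogram check suggested in the first major comment could look like the sketch below. It is an assumption-laden illustration, not something the paper reports: the label lists, the add-one smoothing (`alpha`), and the helper names are all hypothetical. It estimates an error-type distribution for synthetic and real failures over their shared support and computes KL(synthetic ‖ real) in nats.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence


def error_type_distribution(labels: Sequence[str], support: Sequence[str],
                            alpha: float = 1.0) -> Dict[str, float]:
    """Smoothed relative frequency of each error type in `support`.

    Assumes every label in `labels` also appears in `support`.
    """
    counts = Counter(labels)
    total = len(labels) + alpha * len(support)
    return {t: (counts[t] + alpha) / total for t in support}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in nats; smoothing keeps q strictly positive."""
    return sum(p[t] * math.log(p[t] / q[t]) for t in p if p[t] > 0)


def histogram_gap(synthetic: List[str], real: List[str]) -> float:
    """Divergence between synthetic and real error-type histograms."""
    support = sorted(set(synthetic) | set(real))
    p = error_type_distribution(synthetic, support)
    q = error_type_distribution(real, support)
    return kl_divergence(p, q)
```

A value near zero would mean the manipulator reproduces the real mix of error types; it would say nothing about whether individual injected errors look realistic, so it complements rather than replaces the held-out human-annotated test.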

minor comments (2)

[Abstract] The abstract states “substantial improvements” and “comprehensive experiments” but supplies no numerical metrics, baseline names, or statistical details; these should be added to the abstract or a results table for immediate readability.
[Dataset Construction] Clarify the exact definition of “error mode” and “faulty agent” labels used in the 9,533 trajectories; a short taxonomy or example annotation would help readers understand the granularity of the supervision signal.
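
As an illustration of what such an example annotation might look like, the sketch below validates one record. The JSON shape (a `faulty_agents` list of `agent_name` / `error_type` entries) and the `FM-x.y` code pattern are assumptions made for this example; the authoritative schema and taxonomy would have to come from the paper itself.

```python
import json
import re
from typing import List, Set, Tuple

# Assumed code pattern for illustration, e.g. "FM-1.1".
ERROR_CODE = re.compile(r"FM-\d+\.\d+")


def parse_annotation(raw: str, known_agents: Set[str]) -> List[Tuple[str, str]]:
    """Parse one annotation and return its (agent_name, error_type) labels.

    Raises ValueError when the record does not match the assumed schema.
    """
    data = json.loads(raw)
    entries = data.get("faulty_agents") if isinstance(data, dict) else None
    if not isinstance(entries, list):
        raise ValueError("'faulty_agents' must be a list")
    labels = []
    for entry in entries:
        agent = entry.get("agent_name")
        code = entry.get("error_type")
        if agent not in known_agents:
            raise ValueError(f"unknown agent: {agent!r}")
        if not isinstance(code, str) or not ERROR_CODE.fullmatch(code):
            raise ValueError(f"malformed error code: {code!r}")
        labels.append((agent, code))
    return labels


example = '{"faulty_agents": [{"agent_name": "coder", "error_type": "FM-1.1"}]}'
labels = parse_annotation(example, {"planner", "coder"})
```

An empty `faulty_agents` list would represent a clean trajectory under this assumed schema, which keeps positive and negative samples in one format.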

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major comment point by point below, with clarifications on our approach and indications of revisions to strengthen the manuscript.


Referee: [Abstract / Experiments] Abstract and experimental evaluation: the headline claim that fine-tuned models are competitive with or superior to much larger proprietary models rests on the unverified assumption that LLM-injected errors match the distribution of real MAS failure modes. No quantitative validation (e.g., KL divergence on error-type histograms, or accuracy drop when tested on held-out human-annotated real failures) is reported, leaving open the possibility that reported gains exploit manipulator-specific artifacts rather than genuine attribution capability.

Authors: We acknowledge that direct quantitative validation of the synthetic error distribution against real-world MAS failures would provide additional support for the headline performance claims. Aegis was developed specifically to overcome the scarcity of large-scale human-annotated error data, which is the central motivation of the work. Our experiments instead evaluate the practical utility of the generated dataset by measuring improvements in error attribution across SFT, RL, and contrastive learning, with several models matching or exceeding much larger proprietary systems. To address the concern, the revised manuscript includes expanded discussion of the manipulator's adaptive, context-aware design and qualitative comparisons of injected errors to documented failure modes from prior MAS literature. A full KL-divergence analysis or evaluation on held-out human-annotated failures is not feasible at present because no sufficiently large public real-failure dataset exists; we view this as an important direction for future work once such resources become available.
Revision: partial
Referee: [Data Generation / Methods] Data-generation pipeline (presumably §3–4): the manipulator is described as “adaptively inject[ing] context-aware errors,” yet no ablation or sensitivity analysis is provided on the injection parameters or prompt templates. This makes it difficult to assess reproducibility or to determine how much of the downstream performance depends on the specific manipulator configuration.

Authors: We agree that sensitivity analysis on the manipulator parameters would improve reproducibility and help isolate the contribution of specific design choices. In the revised manuscript we have added an ablation study examining variations in error-injection severity, context-adaptation strength, and prompt phrasing. The results indicate that downstream attribution accuracy remains stable across these configurations. We have also included the exact prompt templates in the appendix to facilitate replication.
Revision: yes

Circularity Check

0 steps flagged

No circularity: data generation and model training are independent empirical steps


The paper constructs a dataset by applying an LLM manipulator to successful trajectories to produce error-annotated examples, then trains separate attribution models (via SFT, RL, or contrastive learning) and reports empirical gains on held-out synthetic splits. No derivation reduces by construction to its inputs: the generation process is not defined in terms of the downstream model outputs, the performance numbers are not statistically forced by the fitting procedure itself, and no self-citation or uniqueness theorem is used as load-bearing justification. The central claims remain externally falsifiable on the generated test distribution even if real-world error statistics are unverified.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review is abstract-only; full details on training hyperparameters and error realism validation are unavailable. The central claim rests on the domain assumption that LLM-injected errors are sufficiently representative of real MAS failures.

free parameters (1)

LLM manipulator injection parameters
Parameters controlling where, how, and which errors are injected are not specified in the abstract but are required for the data generation step.

axioms (1)

Domain assumption: LLM manipulator produces context-aware errors representative of real multi-agent failures
Invoked in the description of the error injection process; if false, the generated dataset would not support generalizable error attribution models.

pith-pipeline@v0.9.0 · 5800 in / 1343 out tokens · 94076 ms · 2026-05-18T16:57:28.700417+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: Aegis constructs a large dataset of 9,533 trajectories with annotated faulty agents and error modes... using a LLM-based manipulator that can adaptively inject context-aware errors into successful execution trajectories.

IndisputableMonolith/Cost/FunctionalEquation.lean · J_uniquely_calibrated_via_higher_derivative · unclear
Paper passage: We develop learning methods for each paradigm... Comprehensive experiments show that trained models consistently achieve substantial improvements in error attribution.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems
cs.CL 2026-05 unverdicted novelty 6.0

VerifyMAS improves failure attribution in LLM multi-agent systems via hypothesis verification on full trajectories, error taxonomy-based data construction, and fine-tuned verifier models, outperforming prior direct-pr...
