Aegis: Automated Error Generation and Attribution for Multi-Agent Systems
Pith reviewed 2026-05-18 16:57 UTC · model grok-4.3
The pith
An automated framework generates error datasets so smaller LLMs can attribute faults in multi-agent systems competitively with much larger models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Aegis constructs a dataset of 9,533 trajectories by having an LLM-based manipulator adaptively inject context-aware errors into successful multi-agent execution traces, creating annotated examples of faulty agents and error modes across diverse architectures and domains. The structured positive-negative pairs then train models through supervised fine-tuning, reinforcement learning, and contrastive learning, yielding consistent gains in error attribution accuracy.
What carries the argument
The LLM-based manipulator, which adaptively injects context-aware errors into successful trajectories to produce labeled faulty-agent and error-mode data.
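The pairing idea can be sketched in a few lines. This is a minimal illustration, not the paper's pipeline: `Step`, `LabeledTrajectory`, `make_pair`, and `toy_manipulator` are hypothetical names, the error-mode string is a placeholder, and a deterministic string edit stands in for the LLM-based manipulator.

```python
from dataclasses import dataclass, replace
from typing import Callable, Tuple


@dataclass(frozen=True)
class Step:
    agent: str    # agent that produced this message
    content: str  # message text


@dataclass(frozen=True)
class LabeledTrajectory:
    steps: Tuple[Step, ...]
    # (agent_name, error_mode) pairs; empty for a successful run
    faulty_agents: Tuple[Tuple[str, str], ...]


def make_pair(
    clean: Tuple[Step, ...],
    index: int,
    error_mode: str,
    corrupt: Callable[[str], str],
) -> Tuple[LabeledTrajectory, LabeledTrajectory]:
    """Return (positive, negative): the clean run and a copy with one corrupted step."""
    target = clean[index]
    corrupted = replace(target, content=corrupt(target.content))
    faulty = clean[:index] + (corrupted,) + clean[index + 1:]
    positive = LabeledTrajectory(clean, ())
    negative = LabeledTrajectory(faulty, ((target.agent, error_mode),))
    return positive, negative


def toy_manipulator(text: str) -> str:
    # Assumption: stand-in for the LLM manipulator, which would rewrite the
    # step using the surrounding context instead of a fixed substitution.
    return text.replace("42", "24")


clean_run = (
    Step("planner", "Compute 6 * 7."),
    Step("solver", "6 * 7 = 42."),
    Step("checker", "Verified: 42."),
)
positive, negative = make_pair(clean_run, 1, "hypothetical-mode", toy_manipulator)
print(negative.steps[1].content, negative.faulty_agents)
```

The sketch leaves steps after the injection point unchanged. A real pipeline would have to decide whether downstream steps are re-executed once an error is injected, and the summary above does not say which choice the paper makes.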
If this is right
- Models trained on the generated data identify faulty agents and error modes more reliably than prior approaches.
- Smaller fine-tuned models reach performance levels competitive with or better than proprietary models an order of magnitude larger.
- Error attribution becomes feasible at scale without relying on costly manual annotation of trajectories.
- Multiple learning paradigms can be applied to the same dataset to improve robustness and interpretability of the attribution process.
Where Pith is reading between the lines
- The same injection approach could be adapted to generate training data for debugging single-agent or non-LLM-based systems.
- Deployed multi-agent applications might use the trained attribution models for ongoing monitoring and automatic fault reporting.
- Testing the dataset on entirely new task domains would reveal how well the generated errors generalize beyond the original training distribution.
Load-bearing premise
The errors created by the LLM manipulator accurately reflect the failure modes that occur in real multi-agent system runs instead of introducing artificial patterns or biases.
What would settle it
A direct test in which models trained on the generated dataset are evaluated on multi-agent executions whose errors were independently identified and labeled by human experts, checking whether attribution accuracy drops sharply compared with the paper's reported results.
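Such a test reduces to comparing one model's attribution accuracy on two test sets. The sketch below assumes an exact-match metric over sets of (agent, error mode) labels; the paper's actual metric is not stated on this page, and the function names and toy labels are hypothetical.

```python
from typing import FrozenSet, List, Tuple

# One prediction or gold label: a set of (agent_name, error_mode) pairs.
Label = FrozenSet[Tuple[str, str]]


def exact_match_accuracy(predictions: List[Label], gold: List[Label]) -> float:
    """Fraction of trajectories whose predicted label set equals the gold set."""
    if len(predictions) != len(gold):
        raise ValueError("predictions and gold must have the same length")
    if not gold:
        raise ValueError("cannot score an empty test set")
    hits = sum(p == g for p, g in zip(predictions, gold))
    return hits / len(gold)


def accuracy_drop(synthetic_accuracy: float, human_accuracy: float) -> float:
    """Positive values mean the model does worse on human-labeled failures."""
    return synthetic_accuracy - human_accuracy


# Toy placeholder labels, not results from the paper.
gold_synthetic = [frozenset({("solver", "FM-1.1")}), frozenset()]
pred_synthetic = [frozenset({("solver", "FM-1.1")}), frozenset()]
gold_human = [frozenset({("planner", "FM-3.1")}), frozenset({("solver", "FM-1.1")})]
pred_human = [frozenset({("planner", "FM-3.1")}), frozenset({("checker", "FM-1.1")})]

drop = accuracy_drop(
    exact_match_accuracy(pred_synthetic, gold_synthetic),
    exact_match_accuracy(pred_human, gold_human),
)
print(drop)
```

A large positive drop would suggest that the model had learned manipulator-specific patterns rather than general attribution.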
Original abstract
Large language model based multi-agent systems (MAS) have unlocked significant advancements in tackling complex problems, but their increasing capability introduces a structural fragility that makes them difficult to debug. A key obstacle to improving their reliability is the severe scarcity of large-scale, diverse datasets for error attribution, as existing resources rely on costly and unscalable manual annotation. To address this bottleneck, we introduce Aegis, a novel framework for Automated error generation and attribution for multi-agent systems. Aegis constructs a large dataset of 9,533 trajectories with annotated faulty agents and error modes, covering diverse MAS architectures and task domains. This is achieved using a LLM-based manipulator that can adaptively inject context-aware errors into successful execution trajectories. Leveraging fine-grained labels and the structured arrangement of positive-negative sample pairs, Aegis supports three different learning paradigms: Supervised Fine-Tuning, Reinforcement Learning, and Contrastive Learning. We develop learning methods for each paradigm. Comprehensive experiments show that trained models consistently achieve substantial improvements in error attribution. Notably, several of our fine-tuned LLMs demonstrate performance competitive with or superior to proprietary models an order of magnitude larger, validating our automated data generation framework as a crucial resource for developing more robust and interpretable multi-agent systems. Our project website is available at https://kfq20.github.io/Aegis-Website/.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Aegis, a framework for automated generation of error-annotated trajectories in LLM-based multi-agent systems (MAS). It uses an LLM manipulator to inject context-aware errors into successful execution traces, producing a dataset of 9,533 trajectories with labels for faulty agents and error modes across diverse architectures and domains. The dataset enables training via supervised fine-tuning, reinforcement learning, and contrastive learning for error attribution. Experiments claim consistent improvements, with several fine-tuned models performing competitively with or better than proprietary models an order of magnitude larger.
Significance. If the synthetic error modes faithfully capture the statistics and causal structure of real MAS failures, the automated pipeline would provide a scalable alternative to manual annotation and could accelerate development of more reliable and interpretable multi-agent systems. The use of positive-negative sample pairs and support for multiple learning paradigms is a constructive contribution to the data-scarcity problem in MAS debugging.
major comments (2)
- [Abstract / Experiments] Abstract and experimental evaluation: the headline claim that fine-tuned models are competitive with or superior to much larger proprietary models rests on the unverified assumption that LLM-injected errors match the distribution of real MAS failure modes. No quantitative validation (e.g., KL divergence on error-type histograms, or accuracy drop when tested on held-out human-annotated real failures) is reported, leaving open the possibility that reported gains exploit manipulator-specific artifacts rather than genuine attribution capability.
- [Data Generation / Methods] Data-generation pipeline (presumably §3–4): the manipulator is described as “adaptively inject[ing] context-aware errors,” yet no ablation or sensitivity analysis is provided on the injection parameters or prompt templates. This makes it difficult to assess reproducibility or to determine how much of the downstream performance depends on the specific manipulator configuration.
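The histogram comparison suggested in the first major comment could look like the following sketch. It assumes error-mode labels are plain strings and applies add-alpha smoothing so that modes missing from one sample do not produce infinite divergence. The function names, the smoothing choice, and the toy label lists are assumptions for illustration, not data from the paper.

```python
import math
from collections import Counter
from typing import Dict, Iterable, List


def smoothed_distribution(
    labels: Iterable[str], support: List[str], alpha: float = 1.0
) -> Dict[str, float]:
    """Add-alpha smoothed frequency of each error mode in `support`."""
    counts = Counter(labels)
    total = sum(counts[mode] for mode in support) + alpha * len(support)
    return {mode: (counts[mode] + alpha) / total for mode in support}


def kl_divergence(p: Dict[str, float], q: Dict[str, float]) -> float:
    """KL(p || q) in nats; both distributions must share the same support."""
    return sum(p[mode] * math.log(p[mode] / q[mode]) for mode in p)


# Toy placeholder labels; real inputs would be human-annotated failures
# and the manipulator's injected error modes.
real_modes = ["FM-1.1", "FM-1.1", "FM-3.1"]
synthetic_modes = ["FM-1.1", "FM-3.1", "FM-3.1"]

support = sorted(set(real_modes) | set(synthetic_modes))
p_real = smoothed_distribution(real_modes, support)
q_synthetic = smoothed_distribution(synthetic_modes, support)
print(kl_divergence(p_real, q_synthetic))
```

KL divergence is asymmetric. Taking the real distribution as `p` penalizes failure modes that occur in practice but are rare in the generated data, which is the direction the referee's concern points to.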
minor comments (2)
- [Abstract] The abstract states “substantial improvements” and “comprehensive experiments” but supplies no numerical metrics, baseline names, or statistical details; these should be added to the abstract or a results table for immediate readability.
- [Dataset Construction] Clarify the exact definition of “error mode” and “faulty agent” labels used in the 9,533 trajectories; a short taxonomy or example annotation would help readers understand the granularity of the supervision signal.
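As an illustration of the label granularity at issue, the sketch below parses and validates one annotation. It assumes labels follow the JSON shape that appears in the paper's prompt templates, `{"faulty_agents": [{"agent_name": ..., "error_type": "FM-X.X"}]}`. The regular expression for error codes and the function name `parse_attribution` are assumptions, since the full taxonomy is not reproduced on this page.

```python
import json
import re
from typing import List, Tuple

# Assumed code pattern, inferred from examples such as "FM-1.1" and "FM-3.2".
ERROR_CODE = re.compile(r"FM-\d+\.\d+")


def parse_attribution(raw: str) -> List[Tuple[str, str]]:
    """Parse a JSON attribution into (agent_name, error_type) pairs."""
    data = json.loads(raw)
    entries = data["faulty_agents"]
    if not isinstance(entries, list):
        raise ValueError('"faulty_agents" must be a list')
    pairs = []
    for entry in entries:
        name, code = entry["agent_name"], entry["error_type"]
        if not ERROR_CODE.fullmatch(code):
            raise ValueError(f"unexpected error code: {code!r}")
        pairs.append((name, code))
    return pairs


example = '{"faulty_agents": [{"agent_name": "solver", "error_type": "FM-1.1"}]}'
print(parse_attribution(example))
print(parse_attribution('{"faulty_agents": []}'))  # a run with no errors
```

An empty list encodes a trajectory with no faulty agent, so a single schema covers both halves of a positive-negative pair.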
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments. We address each major comment point by point below, with clarifications on our approach and indications of revisions to strengthen the manuscript.
Point-by-point responses
- Referee: [Abstract / Experiments] Abstract and experimental evaluation: the headline claim that fine-tuned models are competitive with or superior to much larger proprietary models rests on the unverified assumption that LLM-injected errors match the distribution of real MAS failure modes. No quantitative validation (e.g., KL divergence on error-type histograms, or accuracy drop when tested on held-out human-annotated real failures) is reported, leaving open the possibility that reported gains exploit manipulator-specific artifacts rather than genuine attribution capability.
  Authors: We acknowledge that direct quantitative validation of the synthetic error distribution against real-world MAS failures would provide additional support for the headline performance claims. Aegis was developed specifically to overcome the scarcity of large-scale human-annotated error data, which is the central motivation of the work. Our experiments instead evaluate the practical utility of the generated dataset by measuring improvements in error attribution across SFT, RL, and contrastive learning, with several models matching or exceeding much larger proprietary systems. To address the concern, the revised manuscript includes expanded discussion of the manipulator's adaptive, context-aware design and qualitative comparisons of injected errors to documented failure modes from prior MAS literature. A full KL-divergence analysis or evaluation on held-out human-annotated failures is not feasible at present because no sufficiently large public real-failure dataset exists; we view this as an important direction for future work once such resources become available.
  Revision: partial
- Referee: [Data Generation / Methods] Data-generation pipeline (presumably §3–4): the manipulator is described as “adaptively inject[ing] context-aware errors,” yet no ablation or sensitivity analysis is provided on the injection parameters or prompt templates. This makes it difficult to assess reproducibility or to determine how much of the downstream performance depends on the specific manipulator configuration.
  Authors: We agree that sensitivity analysis on the manipulator parameters would improve reproducibility and help isolate the contribution of specific design choices. In the revised manuscript we have added an ablation study examining variations in error-injection severity, context-adaptation strength, and prompt phrasing. The results indicate that downstream attribution accuracy remains stable across these configurations. We have also included the exact prompt templates in the appendix to facilitate replication.
  Revision: yes
Circularity Check
No circularity: data generation and model training are independent empirical steps
The paper constructs a dataset by applying an LLM manipulator to successful trajectories to produce error-annotated examples, then trains separate attribution models (via SFT, RL, or contrastive learning) and reports empirical gains on held-out synthetic splits. No derivation reduces by construction to its inputs: the generation process is not defined in terms of the downstream model outputs, the performance numbers are not statistically forced by the fitting procedure itself, and no self-citation or uniqueness theorem is used as load-bearing justification. The central claims remain externally falsifiable on the generated test distribution even if real-world error statistics are unverified.
Axiom & Free-Parameter Ledger
free parameters (1)
- LLM manipulator injection parameters
axioms (1)
- domain assumption: the LLM manipulator produces context-aware errors representative of real multi-agent failures
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: Aegis constructs a large dataset of 9,533 trajectories with annotated faulty agents and error modes... using a LLM-based manipulator that can adaptively inject context-aware errors into successful execution trajectories.
- IndisputableMonolith/Cost/FunctionalEquation.lean · J_uniquely_calibrated_via_higher_derivative
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: We develop learning methods for each paradigm... Comprehensive experiments show that trained models consistently achieve substantial improvements in error attribution.
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- VerifyMAS: Hypothesis Verification for Failure Attribution in LLM Multi-Agent Systems
  VerifyMAS improves failure attribution in LLM multi-agent systems via hypothesis verification on full trajectories, error taxonomy-based data construction, and fine-tuned verifier models, outperforming prior direct-pr...