Agent System Operations: Categorization, Challenges, and Future Directions

Changhua Pei; Dan Pei; David Lo; Fei Sun; Gaogang Xie; Hang Cui; Haotian Si; Jingjing Li; Quan Zhou; Yintong Huo

arxiv: 2606.01581 · v1 · pith:BB2X3MF2new · submitted 2026-06-01 · 💻 cs.MA

Agent System Operations: Categorization, Challenges, and Future Directions

Zexin Wang , Changhua Pei , Yuanhao Liu , Jingjing Li , Yintong Huo , Quan Zhou , Haotian Si , Hang Cui

show 5 more authors

Zihan Liu Gaogang Xie Fei Sun Dan Pei David Lo

This is my paper

Pith reviewed 2026-06-28 12:12 UTC · model grok-4.3

classification 💻 cs.MA

keywords agent systemsLLM-based agentsanomaly categorizationoperations frameworkmonitoringanomaly detectionroot cause localizationsystem maintenance

0 comments

The pith

Agent systems need the AgentOps framework to categorize anomalies as intra-agent or inter-agent and manage them through four operational stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys the operations of LLM-based agent systems, which often encounter anomalies that cause instability. It systematically defines these anomalies by splitting them into those occurring inside a single agent and those arising between multiple agents. It then presents AgentOps as an operational framework built around four stages. A sympathetic reader would care because agent systems are seeing growing industrial use yet lack established maintenance practices, leaving them vulnerable to failures. The work fills this gap by providing definitions and structure where prior research has been limited.

Core claim

The authors categorize anomalies within agent systems into intra-agent anomalies, which occur within individual agents, and inter-agent anomalies, which emerge from interactions among agents. They introduce the AgentOps framework to structure operations around four stages: monitoring to observe system states, anomaly detection to identify deviations, root cause localization to determine origins of issues, and resolution to apply fixes. This framework is positioned as a comprehensive approach to support stable and secure operation of agent systems.

What carries the argument

The AgentOps framework, which organizes agent system operations into the four stages of monitoring, anomaly detection, root cause localization, and resolution.

If this is right

The anomaly categorization supplies a shared vocabulary that future studies can use to classify and compare issues across different agent implementations.
Each of the four AgentOps stages can be developed independently, allowing targeted tools for monitoring or root cause analysis to be built and evaluated.
Industrial deployments of agent systems can adopt the stages sequentially to reduce instability without requiring entirely new infrastructure.
The framework highlights specific challenges in each stage, directing research attention toward gaps such as effective resolution methods for inter-agent issues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The intra- versus inter-agent split could be tested by measuring whether resolution success rates differ when failures are isolated to one agent versus spread across several.
AgentOps might integrate with existing software reliability practices by mapping its stages onto established DevOps pipelines for hybrid systems.
Quantifying the relative prevalence of intra-agent versus inter-agent anomalies in real deployments would help prioritize which stage of the framework receives the most tooling effort.

Load-bearing premise

Current research on the operations of agent systems is sparse, creating an urgent need for a survey that defines anomalies and establishes the AgentOps framework.

What would settle it

Empirical data from deployed agent systems showing that observed failures fall outside the intra-agent and inter-agent categories or cannot be addressed by the four proposed stages would challenge the framework's coverage.

Figures

Figures reproduced from arXiv: 2606.01581 by Changhua Pei, Dan Pei, David Lo, Fei Sun, Gaogang Xie, Hang Cui, Haotian Si, Jingjing Li, Quan Zhou, Yintong Huo, Yuanhao Liu, Zexin Wang, Zihan Liu.

**Figure 1.** Figure 1: Workflow of agent systems. Execution Stage Taxonomy Intra- Agent Inter- Agent Pre-execution Execution Post-execution Orchestration Anomalies Reasoning Anomalies Action Anomalies Memory Anomalies Task Specification Anomalies Security Anomalies Termination Anomalies Communication Anomalies [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗

**Figure 2.** Figure 2: Definition and taxonomy of anomalies in agent systems. clarify when anomalies may occur, but they do not specify where such anomalies originate within an agent system. Since agent systems can be broadly classified into single-agent and multi-agent systems, anomalies may originate either from the internal workflow of an individual agent or from interactions among multiple agents. This is analogous to tradit… view at source ↗

**Figure 3.** Figure 3: Components of agent systems. DeepSeek-R1 [1], as well as prompting strategies such as CoT [23], Reflexion [24], Self-Consistency [25], CoK [26], and StepBack [27]. Despite these advances, reasoning anomalies remain prevalent, among which hallucination is the most representative. The definition of hallucination has been continuously refined in the literature. Rawte et al. [28] define hallucinations as unr… view at source ↗

**Figure 4.** Figure 4: Examples of traditional operations in agent systems. Traditional IT Infrastructure Cloud-naive Applications AI/ML Systems Agent Systems Evolution of Operational Objects Manual Operation Automation Observability Platform Intelligence Autonomy Evolution of Operational Technologies SRE • SLO/SLA • Error budgets • Monitoring & Alerting AIOps • Monitoring • Anomaly detection • Root cause analysis • Resolution M… view at source ↗

**Figure 6.** Figure 6: Comparison of traditional system operations and agent system [PITH_FULL_IMAGE:figures/full_fig_p005_6.png] view at source ↗

**Figure 7.** Figure 7: Illustration of AgentOps. Frontend Checkout/API Recommend/API Payment/API ProductCatalog/API Checkout/API: {“item”: “#AX5”, ... ...} Recommend/API: {“user”: “uuid220”, ... ...} ProductCatalog/API: {“category”: “shoe”, ... ...} Payment/API: {“item”: “#AX5”, ... ...} Frontend 1.1s 2.1s 1.5s 1.6s 2.0s Timeline of trace. main-agent food-agent tour-agent search_food search_place food-agent tour-agent search_pla… view at source ↗

**Figure 8.** Figure 8: Comparison of trace data in microservice system and agent system. [PITH_FULL_IMAGE:figures/full_fig_p007_8.png] view at source ↗

**Figure 11.** Figure 11: Localization accuracy under different context lengths. agent system’s execution. Accordingly, AgentDebug integrates taxonomy-aware analysis at each step of the agent system to jointly assess system behavior. Upon detecting a failure, the agent replans its trajectory, enabling timely resolution. C. Future Directions Currently, attribution methods can be broadly divided into two categories: LLM-based method… view at source ↗

read the original abstract

As the reasoning capabilities of Large Language Models (LLMs) continue to advance, LLM-based agent systems offer advantages in flexibility and interpretability over traditional systems, garnering increasing attention. However, despite the widespread research interest and industrial application of agent systems, these systems, like their traditional counterparts, frequently encounter anomalies. These anomalies lead to instability and insecurity, hindering their further development. Therefore, a comprehensive and systematic approach to the operation and maintenance of agent systems is urgently needed. Unfortunately, current research on the operations of agent systems is sparse. To address this gap, we have undertaken a survey on agent system operations with the aim of establishing a clear framework for the field, defining the challenges, and facilitating further development. Specifically, this paper begins by systematically defining anomalies within agent systems, categorizing them into intra-agent anomalies and inter-agent anomalies. Next, we introduce a novel and comprehensive operational framework for agent systems, dubbed Agent System Operations (AgentOps). We provide detailed definitions and explanations of its four key stages: monitoring, anomaly detection, root cause localization, and resolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This survey organizes anomaly types and a four-stage ops framework for LLM agents but leans on an unproven claim that the area is sparse.

read the letter

The main thing to know is that this is a survey paper that splits anomalies in agent systems into intra-agent (inside one agent) and inter-agent (between agents) categories, then lays out an AgentOps framework covering monitoring, anomaly detection, root cause localization, and resolution.

It does a clean job of applying standard operations concepts to LLM-based multi-agent setups and gives readable definitions for each stage. That kind of map can be useful for engineers who need to keep these systems stable in practice, and the intra/inter split is a reasonable way to group the issues.

The soft spot is the load-bearing premise that current research on agent system operations is sparse. The abstract repeats this to justify the work, but the claim is not obviously true given existing literature on LLM monitoring, hallucination detection, multi-agent coordination failures, and general anomaly handling. If the full paper does not include a thorough literature search that actually shows the gap, the urgency and novelty of the framework weaken. The four stages themselves are standard IT ops steps with new names attached, so the contribution sits mostly in the organization rather than new techniques.

No experiments or formal proofs appear, which fits a survey but means the value rests entirely on coverage and insight. The paper is aimed at applied researchers and practitioners deploying agent systems who want a structured overview. Readers looking for original methods or exhaustive analysis will find less. It deserves peer review because the topic is relevant and a solid survey could help organize the space, even if the gap argument needs more evidence.

Referee Report

1 major / 0 minor

Summary. The paper surveys operations for LLM-based agent systems. It defines anomalies as intra-agent or inter-agent, then proposes a new AgentOps framework whose four stages are monitoring, anomaly detection, root cause localization, and resolution. The work positions itself as filling an urgent gap created by sparse existing research on these operational topics.

Significance. If the sparsity premise holds and the categorization proves comprehensive, the survey could provide a useful organizing lens for reliability work on multi-agent LLM systems. The four-stage breakdown mirrors established DevOps practices while highlighting agent-specific issues such as inter-agent coordination failures.

major comments (1)

[Abstract / Introduction] Abstract and Introduction: The claim that 'current research on the operations of agent systems is sparse' is presented as the load-bearing motivation for both the survey and the novel AgentOps framework. No systematic literature review or citation count is supplied to substantiate the sparsity assertion. Without this demonstration, the asserted gap, urgency, and novelty of the intra-/inter-agent split and four-stage framework cannot be evaluated.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need to substantiate the sparsity claim that motivates our survey. We address this point directly below and outline revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract / Introduction] Abstract and Introduction: The claim that 'current research on the operations of agent systems is sparse' is presented as the load-bearing motivation for both the survey and the novel AgentOps framework. No systematic literature review or citation count is supplied to substantiate the sparsity assertion. Without this demonstration, the asserted gap, urgency, and novelty of the intra-/inter-agent split and four-stage framework cannot be evaluated.

Authors: We agree that the sparsity assertion is central to the paper's motivation and that a more explicit demonstration would strengthen the work. Our claim derives from a broad review of the literature on LLM-based agents, where the overwhelming majority of publications focus on capability development, prompting techniques, and architectural designs rather than operational concerns such as monitoring, anomaly detection, root-cause analysis, and resolution. However, the manuscript does not include a quantified citation analysis or formal systematic literature review protocol. In the revised version we will add a dedicated subsection (likely in Section 2 or a new Appendix) that documents our search methodology, including databases queried, keywords employed, time window considered, and the approximate ratio of operation-focused papers to the total body of agent research. This addition will allow readers to evaluate the gap claim directly while preserving the intra-/inter-agent categorization and four-stage AgentOps framework as the paper's primary contributions. revision: yes

Circularity Check

0 steps flagged

No circularity: survey categorization with independent literature premise

full rationale

The paper is a survey that defines anomaly categories (intra-/inter-agent) and introduces the AgentOps framework (monitoring, anomaly detection, root cause localization, resolution) as an organizational structure. No equations, fitted parameters, predictions, or derivations exist. The premise that 'current research on the operations of agent systems is sparse' is an external claim about the literature, not a self-referential reduction or self-citation chain that forces the framework. The central contribution is the categorization itself, which stands as an independent synthesis rather than a renaming or definitional loop. This matches the default expectation for non-circular survey papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, the work relies on definitions and categorizations from prior literature without introducing new fitted parameters, axioms, or entities. No free parameters, axioms, or invented entities are identified from the abstract.

pith-pipeline@v0.9.1-grok · 5754 in / 1091 out tokens · 20889 ms · 2026-06-28T12:12:18.158605+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

126 extracted references · 34 canonical work pages · 13 internal anchors

[1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[2]

Anthropic

A. Anthropic. (2024) The claude 3 model family: Opus, sonnet, haiku. claude-3 model card

2024
[3]

Econagent: large language model-empowered agents for simulating macroeconomic activities,

N. Li, C. Gao, M. Li, Y . Li, and Q. Liao, “Econagent: large language model-empowered agents for simulating macroeconomic activities,” 2024

2024
[4]

Toolllm: Facilitating large language models to master 16000+ real-world apis,

Y . Qin, S. Liang, Y . Ye, K. Zhu, L. Yan, Y . Lu, Y . Lin, X. Cong, X. Tang, B. Qianet al., “Toolllm: Facilitating large language models to master 16000+ real-world apis,” inThe Twelfth International Con- ference on Learning Representations, 2024

2024
[5]

Flow-of-action: Sop enhanced llm-based multi- agent system for root cause analysis,

C. Pei, Z. Wang, F. Liu, Z. Li, Y . Liu, X. He, R. Kang, T. Zhang, J. Chen, J. Liet al., “Flow-of-action: Sop enhanced llm-based multi- agent system for root cause analysis,” inCompanion Proceedings of the ACM on Web Conference 2025, 2025, pp. 422–431

2025
[6]

Recommender ai agent: Integrating large language models for interactive recommen- dations,

X. Huang, J. Lian, Y . Lei, J. Yao, D. Lian, and X. Xie, “Recommender ai agent: Integrating large language models for interactive recommen- dations,”ACM Transactions on Information Systems, vol. 43, no. 4, pp. 1–33, 2025

2025
[8]

Shielda: Structured handling of exceptions in llm-driven agentic workflows,

J. Zhou, J. Chen, Q. Lu, D. Zhao, and L. Zhu, “Shielda: Structured handling of exceptions in llm-driven agentic workflows,”arXiv preprint arXiv:2508.07935, 2025

work page arXiv 2025
[9]

(2025) SWE-bench: A benchmark for evaluating software engineering agents

SWE-bench Team. (2025) SWE-bench: A benchmark for evaluating software engineering agents. Leaderboard tracking AI agent performance on software engineering tasks. [Online]. Available: https://www.swebench.com

2025
[10]

(2025) Llamatrace — hosted phoenix: Llm tracing & evaluation platform

Arize AI, Inc. (2025) Llamatrace — hosted phoenix: Llm tracing & evaluation platform. [Online]. Available: https://phoenix.arize.com/ll amatrace/

2025
[11]

Automatic failure attribution and critical step prediction method for multi-agent systems based on causal inference,

G. Ma, J. Zhu, H. Guo, W. Shi, J. Shen, J. Liu, and Y . Liang, “Automatic failure attribution and critical step prediction method for multi-agent systems based on causal inference,”arXiv preprint arXiv:2509.08682, 2025

work page arXiv 2025
[12]

Agent-pro: Learning to evolve via policy-level reflection and optimization,

W. Zhang, K. Tang, H. Wu, M. Wang, Y . Shen, G. Hou, Z. Tan, P. Li, Y . Zhuang, and W. Lu, “Agent-pro: Learning to evolve via policy-level reflection and optimization,”arXiv preprint arXiv:2402.17574, 2024

work page arXiv 2024
[13]

Trial and error: Exploration-based trajectory optimization for llm agents,

Y . Song, D. Yin, X. Yue, J. Huang, S. Li, and B. Y . Lin, “Trial and error: Exploration-based trajectory optimization for llm agents,”arXiv preprint arXiv:2403.02502, 2024

work page arXiv 2024
[14]

Agent AI: Surveying the Horizons of Multimodal Interaction

Z. Durante, Q. Huang, N. Wake, R. Gong, J. S. Park, B. Sarkaret al., “Agent ai: Surveying the horizons of multimodal interaction,”arXiv preprint arXiv:2401.03568, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Hallucination de- tection in foundation models for decision-making: A flexible definition and review of the state of the art,

N. Chakraborty, M. Ornik, and K. Driggs-Campbell, “Hallucination de- tection in foundation models for decision-making: A flexible definition and review of the state of the art,”ACM Computing Surveys, 2025

2025
[16]

Ai agents under threat: A survey of key security challenges and future pathways,

Z. Deng, Y . Guo, C. Han, W. Ma, J. Xiong, S. Wen, and Y . Xiang, “Ai agents under threat: A survey of key security challenges and future pathways,”ACM Computing Surveys, vol. 57, no. 7, pp. 1–36, 2025

2025
[17]

Towards trustworthy gui agents: A survey,

Y . Shi, W. Yu, W. Yao, W. Chen, and N. Liu, “Towards trustworthy gui agents: A survey,”arXiv preprint arXiv:2503.23434, 2025

work page arXiv 2025
[18]

Which agent causes task failures and when? on automated failure attribution of LLM multi-agent systems,

S. Zhang, M. Yin, J. Zhang, J. Liu, Z. Han, J. Zhang, B. Li, C. Wang, H. Wang, Y . Chenet al., “Which agent causes task failures and when? on automated failure attribution of LLM multi-agent systems,” inForty- second International Conference on Machine Learning, 2025

2025
[19]

React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023

2023
[20]

Finetuned language models are zero-shot learners,

J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned language models are zero-shot learners,” inInternational Conference on Learning Representations, 2022

2022
[21]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

2022
[22]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han, “Search-r1: Training llms to reason and leverage search engines with reinforcement learning,”arXiv preprint arXiv:2503.09516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[23]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

2022
[24]

Reflexion: Language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 8634– 8652, 2023

2023
[25]

Self-consistency improves chain of thought rea- soning in language models,

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowd- hery, and D. Zhou, “Self-consistency improves chain of thought rea- soning in language models,” inThe Eleventh International Conference on Learning Representations, 2023

2023
[26]

Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources,

X. Li, R. Zhao, Y . K. Chia, B. Ding, S. Joty, S. Poria, and L. Bing, “Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources,” inThe Twelfth In- ternational Conference on Learning Representations, 2024

2024
[27]

Take a step back: Evoking reasoning via abstraction in large language models,

H. S. Zheng, S. Mishra, X. Chen, H.-T. Cheng, E. H. Chi, Q. V . Le, and D. Zhou, “Take a step back: Evoking reasoning via abstraction in large language models,” inThe Twelfth International Conference on Learning Representations, 2024

2024
[28]

A Survey of Hallucination in Large Foundation Models

V . Rawte, A. Sheth, and A. Das, “A survey of hallucination in large foundation models,”arXiv preprint arXiv:2309.05922, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[29]

Peer review of gpt-4 technical report and systems card,

J. Gallifant, A. Fiske, Y . A. Levites Strekalova, J. S. Osorio-Valencia, R. Parke, R. Mwavu, N. Martinez, J. W. Gichoya, M. Ghassemi, D. Demner-Fushmanet al., “Peer review of gpt-4 technical report and systems card,”PLOS digital health, vol. 3, no. 1, p. e0000417, 2024

2024
[30]

Alignment for honesty,

Y . Yang, E. Chern, X. Qiu, G. Neubig, and P. Liu, “Alignment for honesty,”Advances in Neural Information Processing Systems, vol. 37, pp. 63 565–63 598, 2024

2024
[31]

Function calling in large language models: Industrial practices, challenges, and future directions,

M. Wang, Y . Zhang, C. Peng, Y . Chen, W. Zhou, J. Gu, C. Zhuang, R. Guo, B. Yu, W. Wanget al., “Function calling in large language models: Industrial practices, challenges, and future directions,” 2025

2025
[32]

The dark side of function calling: Pathways to jailbreaking large language models,

Z. Wu, H. Gao, J. He, and P. Wang, “The dark side of function calling: Pathways to jailbreaking large language models,” inProceedings of the 31st International Conference on Computational Linguistics. Associ- ation for Computational Linguistics, Jan. 2024, pp. 584–592

2024
[33]

(2025) Ai-infra-guard

Tencent. (2025) Ai-infra-guard. [Online]. Available: https://github.c om/Tencent/AI-Infra-Guard

2025
[34]

Lost in the middle: How language models use long con- texts,

N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long con- texts,”Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024

2024
[35]

Unable to forget: Proactive lnterference reveals working memory limits in llms beyond context length,

C. Wang and J. V . Sun, “Unable to forget: Proactive lnterference reveals working memory limits in llms beyond context length,” inICML 2025 Workshop on Long-Context Foundation Models, 2025

2025
[36]

Qe-rag: A robust retrieval-augmented generation benchmark for query entry errors,

K. Zhang, Z. Sun, W. Yu, X. Zang, K. Zheng, Y . Song, H. Li, and J. Xu, “Qe-rag: A robust retrieval-augmented generation benchmark for query entry errors,”arXiv preprint arXiv:2504.04062, 2025. IEEE TRANSACTIONS OF SOFTW ARE ENGINEERING, VOL. 00, NO. 0, AUGUST 0000 15

work page arXiv 2025
[37]

Astute rag: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models,

F. Wang, X. Wan, R. Sun, J. Chen, and S. ¨O. Arık, “Astute rag: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models,”arXiv preprint arXiv:2410.07176, 2024

work page arXiv 2024
[38]

Redeep: Detecting hallucination in retrieval-augmented gener- ation via mechanistic interpretability,

Z. Sun, X. Zang, K. Zheng, Y . Song, J. Xu, X. Zhang, W. Yu, and H. Li, “Redeep: Detecting hallucination in retrieval-augmented gener- ation via mechanistic interpretability,” inThe Thirteenth International Conference on Learning Representations, 2025

2025
[39]

Benchmarking large language models in retrieval-augmented generation,

J. Chen, H. Lin, X. Han, and L. Sun, “Benchmarking large language models in retrieval-augmented generation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17 754– 17 762

2024
[40]

(2025) Emergent behavior in multi-agent systems: How com- plex behaviors arise from simple agent interactions

Sanjeev. (2025) Emergent behavior in multi-agent systems: How com- plex behaviors arise from simple agent interactions

2025
[41]

Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents,

H. Zhang, J. Huang, K. Mei, Y . Yao, Z. Wang, C. Zhan, H. Wang, and Y . Zhang, “Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents,” inInternational Conference on Learning Representations, vol. 2025, 2025, pp. 35 331–35 366

2025
[42]

Why do multiagent systems fail?

M. Z. Pan, M. Cemri, L. A. Agrawal, S. Yang, B. Chopra, R. Tiwari, K. Keutzer, A. Parameswaran, K. Ramchandran, D. Kleinet al., “Why do multiagent systems fail?” inICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025

2025
[43]

Emergence in multi-agent systems: A safety perspective,

P. Altmann, J. Sch ¨onberger, S. Illium, M. Zorn, F. Ritz, T. Haider, S. Burton, and T. Gabor, “Emergence in multi-agent systems: A safety perspective,” inInternational Symposium on Leveraging Applications of Formal Methods. Springer, 2024, pp. 104–120

2024
[45]

Modeling exception management in multi-agent systems

E. Platonet al., “Modeling exception management in multi-agent systems.” Ph.D. dissertation, Citeseer, 2007

2007
[46]

P. D. OG. (2025) Building high-quality ai agent systems: Best practices

2025
[47]

Agentfm: Role-aware failure management for distributed databases with llm- driven multi-agents,

L. Zhang, Y . Zhai, T. Jia, X. Huang, C. Duan, and Y . Li, “Agentfm: Role-aware failure management for distributed databases with llm- driven multi-agents,” inProceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, 2025, p. 947–958

2025
[48]

A survey on llm- based multi-agent systems: workflow, infrastructure, and challenges,

X. Li, S. Wang, S. Zeng, Y . Wu, and Y . Yang, “A survey on llm- based multi-agent systems: workflow, infrastructure, and challenges,” Vicinagearth, vol. 1, no. 1, p. 9, 2024

2024
[49]

Autogen: Enabling next-gen llm applications via multi-agent conversations,

Q. Wu, G. Bansal, J. Zhang, Y . Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liuet al., “Autogen: Enabling next-gen llm applications via multi-agent conversations,” inFirst conference on language modeling, 2024

2024
[50]

Camel: Communicative agents for

G. Li, H. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, “Camel: Communicative agents for” mind” exploration of large language model society,”Advances in neural information processing systems, vol. 36, pp. 51 991–52 008, 2023

2023
[51]

Bronsdon

C. Bronsdon. (2025) Real-time anomaly detection for multi-agent ai systems

2025
[52]

Cut the crap: An economical communication pipeline for llm-based multi-agent systems,

G. Zhang, Y . Yue, Z. Li, S. Yun, G. Wan, K. Wang, D. Cheng, J. X. Yu, and T. Chen, “Cut the crap: An economical communication pipeline for llm-based multi-agent systems,” inThe Thirteenth International Conference on Learning Representations, 2025

2025
[53]

Taxonomy of failure mode in agentic ai systems,

Microsoft, “Taxonomy of failure mode in agentic ai systems,” 2025

2025
[54]

Smurfs: Multi-agent system using context-efficient dfsdt for tool planning,

J. Chen, J. Liang, and B. Wang, “Smurfs: Multi-agent system using context-efficient dfsdt for tool planning,” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 3281–3298

2025
[55]

Redel: A toolkit for llm- powered recursive multi-agent systems,

A. Zhu, L. Dugan, and C. Callison-Burch, “Redel: A toolkit for llm- powered recursive multi-agent systems,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2024, pp. 162–171

2024
[56]

’neural howlround’in large language models: a self- reinforcing bias phenomenon, and a dynamic attenuation solution,

S. Drake, “’neural howlround’in large language models: a self- reinforcing bias phenomenon, and a dynamic attenuation solution,” arXiv preprint arXiv:2504.07992, 2025

work page arXiv 2025
[57]

Devops: a definition and perceived adoption impediments,

J. Smeds, K. Nybom, and I. Porres, “Devops: a definition and perceived adoption impediments,” inInternational conference on agile software development. Springer, 2015, pp. 166–177

2015
[58]

Ai for it operations (aiops) on cloud platforms: Reviews, opportunities and challenges,

Q. Cheng, D. Sahoo, A. Saha, W. Yang, C. Liu, G. Woo, M. Singh, S. Saverese, and S. C. Hoi, “Ai for it operations (aiops) on cloud platforms: Reviews, opportunities and challenges,”arXiv preprint arXiv:2304.04661, 2023

work page arXiv 2023
[59]

What is agentic operations (agenticops)?

Cisco, “What is agentic operations (agenticops)?” https://www.cisco. com/site/us/en/learn/topics/artificial-intelligence/what-is-agentic-opera tions-agenticops.html, 2026, accessed: 2026-05-15

2026
[60]

Opentelemetry: A cloud native observability framework,

O. Authors, “Opentelemetry: A cloud native observability framework,” https://opentelemetry.io, 2019

2019
[62]

Agentops: Enabling observability of llm agents,

——, “Agentops: Enabling observability of llm agents,” no. arXiv:2411.05285, Nov. 2024, arXiv:2411.05285 [cs.AI]. [Online]. Available: http://arxiv.org/abs/2411.05285

work page arXiv 2024
[63]

Architecting agentops needs change,

S. Biswas, H. Bhatt, and K. Vaidhyanathan, “Architecting agentops needs change,” no. arXiv:2601.06456, Jan. 2026, arXiv:2601.06456 [cs.SE]. [Online]. Available: http://arxiv.org/abs/2601.06456

work page arXiv 2026
[64]

Taming uncertainty via automation: Observing, analyzing, and optimizing agentic ai systems,

D. Moshkovich and S. Zeltyn, “Taming uncertainty via automation: Observing, analyzing, and optimizing agentic ai systems,” no. arXiv:2507.11277, Nov. 2025, arXiv:2507.11277 [cs.AI]. [Online]. Available: http://arxiv.org/abs/2507.11277

work page arXiv 2025
[65]

Langdb: Llm-enhanced database exploration,

langdb, “Langdb: Llm-enhanced database exploration,” 2025. [Online]. Available: https://github.com/langdb/ai-gateway

2025
[66]

Langfuse: Open-source llm tracing and observability,

Langfuse, “Langfuse: Open-source llm tracing and observability,”
[67]

Available: https://github.com/langfuse/langfuse

[Online]. Available: https://github.com/langfuse/langfuse
[68]

MLflow for GenAI: Build Production-Ready AI Applications,

MLflow Project, a Series of LF Projects, LLC, “MLflow for GenAI: Build Production-Ready AI Applications,” https://mlflow.org/genai, 2025

2025
[69]

Helicone: Llm observability platform,

Helicone, “Helicone: Llm observability platform,” 2025. [Online]. Available: https://github.com/Helicone/helicone

2025
[70]

Langwatch,

LangWatch, “Langwatch,” 2025. [Online]. Available: https://github.c om/langwatch

2025
[71]

(2025) Openllmetry: Open-source observability for your llm application

Traceloop. (2025) Openllmetry: Open-source observability for your llm application. [Online]. Available: https://github.com/traceloop/ope nllmetry

2025
[72]

(2025) Arize phoenix: Open-source llm tracing & evaluation platform

Arize AI, Inc. (2025) Arize phoenix: Open-source llm tracing & evaluation platform. [Online]. Available: https://phoenix.arize.com/

2025
[73]

(2025) Literal ai: Rag llm evaluation & observability platform

Literal AI, Inc. (2025) Literal ai: Rag llm evaluation & observability platform. [Online]. Available: https://www.literalai.com/

2025
[74]

(2025) Opik — open-source llm evaluation platform

Comet ML, Inc. (2025) Opik — open-source llm evaluation platform. [Online]. Available: https://www.comet.com/site/products/opik/

2025
[75]

(2025) Openinference: Opentelemetry instrumentation for ai observability

Arize AI, Inc. (2025) Openinference: Opentelemetry instrumentation for ai observability. [Online]. Available: https://github.com/Arize-ai/ openinference

2025
[76]

(2025) Trulens: Open-source llm evaluation & observability platform

TruEra Inc. (2025) Trulens: Open-source llm evaluation & observability platform. [Online]. Available: https://www.trulens.org/

2025
[77]

(2025) Honeyhive: Ai observability and evaluation platform

HoneyHive AI, Inc. (2025) Honeyhive: Ai observability and evaluation platform. [Online]. Available: https://www.honeyhive.ai/

2025
[78]

(2025) Promptlayer: Platform for prompt engineering, management, evaluation, and llm observability

Magniv, Inc. (2025) Promptlayer: Platform for prompt engineering, management, evaluation, and llm observability. [Online]. Available: https://www.promptlayer.com/

2025
[79]

(2025) Agentops: Developer platform for ai agent observability

AgentOps.ai. (2025) Agentops: Developer platform for ai agent observability. [Online]. Available: https://www.agentops.ai/

2025
[80]

(2024) Deepeval: The llm evaluation framework

confident-ai. (2024) Deepeval: The llm evaluation framework. [Online]. Available: https://github.com/confident-ai/deepeval

2024
[81]

LangSmith: The Agent Engineering Platform,

LangChain, “LangSmith: The Agent Engineering Platform,” 2026. [Online]. Available: https://www.langchain.com/langsmith-platform

2026
[82]

Mlcapsule: Guarded offline deployment of machine learning as a service,

L. Hanzlik, Y . Zhang, K. Grosse, A. Salem, M. Augustin, M. Backes, and M. Fritz, “Mlcapsule: Guarded offline deployment of machine learning as a service,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 3300–3309

2021
[83]

Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection- allocation,

Q. Huang, X. Dong, P. Zhang, B. Wang, C. He, J. Wang, D. Lin, W. Zhang, and N. Yu, “Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection- allocation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 418–13 427

2024

Showing first 80 references.

[1] [1]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[2] [2]

Anthropic

A. Anthropic. (2024) The claude 3 model family: Opus, sonnet, haiku. claude-3 model card

2024

[3] [3]

Econagent: large language model-empowered agents for simulating macroeconomic activities,

N. Li, C. Gao, M. Li, Y . Li, and Q. Liao, “Econagent: large language model-empowered agents for simulating macroeconomic activities,” 2024

2024

[4] [4]

Toolllm: Facilitating large language models to master 16000+ real-world apis,

Y . Qin, S. Liang, Y . Ye, K. Zhu, L. Yan, Y . Lu, Y . Lin, X. Cong, X. Tang, B. Qianet al., “Toolllm: Facilitating large language models to master 16000+ real-world apis,” inThe Twelfth International Con- ference on Learning Representations, 2024

2024

[5] [5]

Flow-of-action: Sop enhanced llm-based multi- agent system for root cause analysis,

C. Pei, Z. Wang, F. Liu, Z. Li, Y . Liu, X. He, R. Kang, T. Zhang, J. Chen, J. Liet al., “Flow-of-action: Sop enhanced llm-based multi- agent system for root cause analysis,” inCompanion Proceedings of the ACM on Web Conference 2025, 2025, pp. 422–431

2025

[6] [6]

Recommender ai agent: Integrating large language models for interactive recommen- dations,

X. Huang, J. Lian, Y . Lei, J. Yao, D. Lian, and X. Xie, “Recommender ai agent: Integrating large language models for interactive recommen- dations,”ACM Transactions on Information Systems, vol. 43, no. 4, pp. 1–33, 2025

2025

[7] [8]

Shielda: Structured handling of exceptions in llm-driven agentic workflows,

J. Zhou, J. Chen, Q. Lu, D. Zhao, and L. Zhu, “Shielda: Structured handling of exceptions in llm-driven agentic workflows,”arXiv preprint arXiv:2508.07935, 2025

work page arXiv 2025

[8] [9]

(2025) SWE-bench: A benchmark for evaluating software engineering agents

SWE-bench Team. (2025) SWE-bench: A benchmark for evaluating software engineering agents. Leaderboard tracking AI agent performance on software engineering tasks. [Online]. Available: https://www.swebench.com

2025

[9] [10]

(2025) Llamatrace — hosted phoenix: Llm tracing & evaluation platform

Arize AI, Inc. (2025) Llamatrace — hosted phoenix: Llm tracing & evaluation platform. [Online]. Available: https://phoenix.arize.com/ll amatrace/

2025

[10] [11]

Automatic failure attribution and critical step prediction method for multi-agent systems based on causal inference,

G. Ma, J. Zhu, H. Guo, W. Shi, J. Shen, J. Liu, and Y . Liang, “Automatic failure attribution and critical step prediction method for multi-agent systems based on causal inference,”arXiv preprint arXiv:2509.08682, 2025

work page arXiv 2025

[11] [12]

Agent-pro: Learning to evolve via policy-level reflection and optimization,

W. Zhang, K. Tang, H. Wu, M. Wang, Y . Shen, G. Hou, Z. Tan, P. Li, Y . Zhuang, and W. Lu, “Agent-pro: Learning to evolve via policy-level reflection and optimization,”arXiv preprint arXiv:2402.17574, 2024

work page arXiv 2024

[12] [13]

Trial and error: Exploration-based trajectory optimization for llm agents,

Y . Song, D. Yin, X. Yue, J. Huang, S. Li, and B. Y . Lin, “Trial and error: Exploration-based trajectory optimization for llm agents,”arXiv preprint arXiv:2403.02502, 2024

work page arXiv 2024

[13] [14]

Agent AI: Surveying the Horizons of Multimodal Interaction

Z. Durante, Q. Huang, N. Wake, R. Gong, J. S. Park, B. Sarkaret al., “Agent ai: Surveying the horizons of multimodal interaction,”arXiv preprint arXiv:2401.03568, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[14] [15]

Hallucination de- tection in foundation models for decision-making: A flexible definition and review of the state of the art,

N. Chakraborty, M. Ornik, and K. Driggs-Campbell, “Hallucination de- tection in foundation models for decision-making: A flexible definition and review of the state of the art,”ACM Computing Surveys, 2025

2025

[15] [16]

Ai agents under threat: A survey of key security challenges and future pathways,

Z. Deng, Y . Guo, C. Han, W. Ma, J. Xiong, S. Wen, and Y . Xiang, “Ai agents under threat: A survey of key security challenges and future pathways,”ACM Computing Surveys, vol. 57, no. 7, pp. 1–36, 2025

2025

[16] [17]

Towards trustworthy gui agents: A survey,

Y . Shi, W. Yu, W. Yao, W. Chen, and N. Liu, “Towards trustworthy gui agents: A survey,”arXiv preprint arXiv:2503.23434, 2025

work page arXiv 2025

[17] [18]

Which agent causes task failures and when? on automated failure attribution of LLM multi-agent systems,

S. Zhang, M. Yin, J. Zhang, J. Liu, Z. Han, J. Zhang, B. Li, C. Wang, H. Wang, Y . Chenet al., “Which agent causes task failures and when? on automated failure attribution of LLM multi-agent systems,” inForty- second International Conference on Machine Learning, 2025

2025

[18] [19]

React: Synergizing reasoning and acting in language models,

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023

2023

[19] [20]

Finetuned language models are zero-shot learners,

J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned language models are zero-shot learners,” inInternational Conference on Learning Representations, 2022

2022

[20] [21]

Training language models to follow instructions with human feedback,

L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

2022

[21] [22]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han, “Search-r1: Training llms to reason and leverage search engines with reinforcement learning,”arXiv preprint arXiv:2503.09516, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[22] [23]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

2022

[23] [24]

Reflexion: Language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 8634– 8652, 2023

2023

[24] [25]

Self-consistency improves chain of thought rea- soning in language models,

X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowd- hery, and D. Zhou, “Self-consistency improves chain of thought rea- soning in language models,” inThe Eleventh International Conference on Learning Representations, 2023

2023

[25] [26]

Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources,

X. Li, R. Zhao, Y . K. Chia, B. Ding, S. Joty, S. Poria, and L. Bing, “Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources,” inThe Twelfth In- ternational Conference on Learning Representations, 2024

2024

[26] [27]

Take a step back: Evoking reasoning via abstraction in large language models,

H. S. Zheng, S. Mishra, X. Chen, H.-T. Cheng, E. H. Chi, Q. V . Le, and D. Zhou, “Take a step back: Evoking reasoning via abstraction in large language models,” inThe Twelfth International Conference on Learning Representations, 2024

2024

[27] [28]

A Survey of Hallucination in Large Foundation Models

V . Rawte, A. Sheth, and A. Das, “A survey of hallucination in large foundation models,”arXiv preprint arXiv:2309.05922, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[28] [29]

Peer review of gpt-4 technical report and systems card,

J. Gallifant, A. Fiske, Y . A. Levites Strekalova, J. S. Osorio-Valencia, R. Parke, R. Mwavu, N. Martinez, J. W. Gichoya, M. Ghassemi, D. Demner-Fushmanet al., “Peer review of gpt-4 technical report and systems card,”PLOS digital health, vol. 3, no. 1, p. e0000417, 2024

2024

[29] [30]

Alignment for honesty,

Y . Yang, E. Chern, X. Qiu, G. Neubig, and P. Liu, “Alignment for honesty,”Advances in Neural Information Processing Systems, vol. 37, pp. 63 565–63 598, 2024

2024

[30] [31]

Function calling in large language models: Industrial practices, challenges, and future directions,

M. Wang, Y . Zhang, C. Peng, Y . Chen, W. Zhou, J. Gu, C. Zhuang, R. Guo, B. Yu, W. Wanget al., “Function calling in large language models: Industrial practices, challenges, and future directions,” 2025

2025

[31] [32]

The dark side of function calling: Pathways to jailbreaking large language models,

Z. Wu, H. Gao, J. He, and P. Wang, “The dark side of function calling: Pathways to jailbreaking large language models,” inProceedings of the 31st International Conference on Computational Linguistics. Associ- ation for Computational Linguistics, Jan. 2024, pp. 584–592

2024

[32] [33]

(2025) Ai-infra-guard

Tencent. (2025) Ai-infra-guard. [Online]. Available: https://github.c om/Tencent/AI-Infra-Guard

2025

[33] [34]

Lost in the middle: How language models use long con- texts,

N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long con- texts,”Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024

2024

[34] [35]

Unable to forget: Proactive lnterference reveals working memory limits in llms beyond context length,

C. Wang and J. V . Sun, “Unable to forget: Proactive lnterference reveals working memory limits in llms beyond context length,” inICML 2025 Workshop on Long-Context Foundation Models, 2025

2025

[35] [36]

Qe-rag: A robust retrieval-augmented generation benchmark for query entry errors,

K. Zhang, Z. Sun, W. Yu, X. Zang, K. Zheng, Y . Song, H. Li, and J. Xu, “Qe-rag: A robust retrieval-augmented generation benchmark for query entry errors,”arXiv preprint arXiv:2504.04062, 2025. IEEE TRANSACTIONS OF SOFTW ARE ENGINEERING, VOL. 00, NO. 0, AUGUST 0000 15

work page arXiv 2025

[36] [37]

Astute rag: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models,

F. Wang, X. Wan, R. Sun, J. Chen, and S. ¨O. Arık, “Astute rag: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models,”arXiv preprint arXiv:2410.07176, 2024

work page arXiv 2024

[37] [38]

Redeep: Detecting hallucination in retrieval-augmented gener- ation via mechanistic interpretability,

Z. Sun, X. Zang, K. Zheng, Y . Song, J. Xu, X. Zhang, W. Yu, and H. Li, “Redeep: Detecting hallucination in retrieval-augmented gener- ation via mechanistic interpretability,” inThe Thirteenth International Conference on Learning Representations, 2025

2025

[38] [39]

Benchmarking large language models in retrieval-augmented generation,

J. Chen, H. Lin, X. Han, and L. Sun, “Benchmarking large language models in retrieval-augmented generation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17 754– 17 762

2024

[39] [40]

(2025) Emergent behavior in multi-agent systems: How com- plex behaviors arise from simple agent interactions

Sanjeev. (2025) Emergent behavior in multi-agent systems: How com- plex behaviors arise from simple agent interactions

2025

[40] [41]

Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents,

H. Zhang, J. Huang, K. Mei, Y . Yao, Z. Wang, C. Zhan, H. Wang, and Y . Zhang, “Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents,” inInternational Conference on Learning Representations, vol. 2025, 2025, pp. 35 331–35 366

2025

[41] [42]

Why do multiagent systems fail?

M. Z. Pan, M. Cemri, L. A. Agrawal, S. Yang, B. Chopra, R. Tiwari, K. Keutzer, A. Parameswaran, K. Ramchandran, D. Kleinet al., “Why do multiagent systems fail?” inICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025

2025

[42] [43]

Emergence in multi-agent systems: A safety perspective,

P. Altmann, J. Sch ¨onberger, S. Illium, M. Zorn, F. Ritz, T. Haider, S. Burton, and T. Gabor, “Emergence in multi-agent systems: A safety perspective,” inInternational Symposium on Leveraging Applications of Formal Methods. Springer, 2024, pp. 104–120

2024

[43] [45]

Modeling exception management in multi-agent systems

E. Platonet al., “Modeling exception management in multi-agent systems.” Ph.D. dissertation, Citeseer, 2007

2007

[44] [46]

P. D. OG. (2025) Building high-quality ai agent systems: Best practices

2025

[45] [47]

Agentfm: Role-aware failure management for distributed databases with llm- driven multi-agents,

L. Zhang, Y . Zhai, T. Jia, X. Huang, C. Duan, and Y . Li, “Agentfm: Role-aware failure management for distributed databases with llm- driven multi-agents,” inProceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, 2025, p. 947–958

2025

[46] [48]

A survey on llm- based multi-agent systems: workflow, infrastructure, and challenges,

X. Li, S. Wang, S. Zeng, Y . Wu, and Y . Yang, “A survey on llm- based multi-agent systems: workflow, infrastructure, and challenges,” Vicinagearth, vol. 1, no. 1, p. 9, 2024

2024

[47] [49]

Autogen: Enabling next-gen llm applications via multi-agent conversations,

Q. Wu, G. Bansal, J. Zhang, Y . Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liuet al., “Autogen: Enabling next-gen llm applications via multi-agent conversations,” inFirst conference on language modeling, 2024

2024

[48] [50]

Camel: Communicative agents for

G. Li, H. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, “Camel: Communicative agents for” mind” exploration of large language model society,”Advances in neural information processing systems, vol. 36, pp. 51 991–52 008, 2023

2023

[49] [51]

Bronsdon

C. Bronsdon. (2025) Real-time anomaly detection for multi-agent ai systems

2025

[50] [52]

Cut the crap: An economical communication pipeline for llm-based multi-agent systems,

G. Zhang, Y . Yue, Z. Li, S. Yun, G. Wan, K. Wang, D. Cheng, J. X. Yu, and T. Chen, “Cut the crap: An economical communication pipeline for llm-based multi-agent systems,” inThe Thirteenth International Conference on Learning Representations, 2025

2025

[51] [53]

Taxonomy of failure mode in agentic ai systems,

Microsoft, “Taxonomy of failure mode in agentic ai systems,” 2025

2025

[52] [54]

Smurfs: Multi-agent system using context-efficient dfsdt for tool planning,

J. Chen, J. Liang, and B. Wang, “Smurfs: Multi-agent system using context-efficient dfsdt for tool planning,” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 3281–3298

2025

[53] [55]

Redel: A toolkit for llm- powered recursive multi-agent systems,

A. Zhu, L. Dugan, and C. Callison-Burch, “Redel: A toolkit for llm- powered recursive multi-agent systems,” inProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2024, pp. 162–171

2024

[54] [56]

’neural howlround’in large language models: a self- reinforcing bias phenomenon, and a dynamic attenuation solution,

S. Drake, “’neural howlround’in large language models: a self- reinforcing bias phenomenon, and a dynamic attenuation solution,” arXiv preprint arXiv:2504.07992, 2025

work page arXiv 2025

[55] [57]

Devops: a definition and perceived adoption impediments,

J. Smeds, K. Nybom, and I. Porres, “Devops: a definition and perceived adoption impediments,” inInternational conference on agile software development. Springer, 2015, pp. 166–177

2015

[56] [58]

Ai for it operations (aiops) on cloud platforms: Reviews, opportunities and challenges,

Q. Cheng, D. Sahoo, A. Saha, W. Yang, C. Liu, G. Woo, M. Singh, S. Saverese, and S. C. Hoi, “Ai for it operations (aiops) on cloud platforms: Reviews, opportunities and challenges,”arXiv preprint arXiv:2304.04661, 2023

work page arXiv 2023

[57] [59]

What is agentic operations (agenticops)?

Cisco, “What is agentic operations (agenticops)?” https://www.cisco. com/site/us/en/learn/topics/artificial-intelligence/what-is-agentic-opera tions-agenticops.html, 2026, accessed: 2026-05-15

2026

[58] [60]

Opentelemetry: A cloud native observability framework,

O. Authors, “Opentelemetry: A cloud native observability framework,” https://opentelemetry.io, 2019

2019

[59] [62]

Agentops: Enabling observability of llm agents,

——, “Agentops: Enabling observability of llm agents,” no. arXiv:2411.05285, Nov. 2024, arXiv:2411.05285 [cs.AI]. [Online]. Available: http://arxiv.org/abs/2411.05285

work page arXiv 2024

[60] [63]

Architecting agentops needs change,

S. Biswas, H. Bhatt, and K. Vaidhyanathan, “Architecting agentops needs change,” no. arXiv:2601.06456, Jan. 2026, arXiv:2601.06456 [cs.SE]. [Online]. Available: http://arxiv.org/abs/2601.06456

work page arXiv 2026

[61] [64]

Taming uncertainty via automation: Observing, analyzing, and optimizing agentic ai systems,

D. Moshkovich and S. Zeltyn, “Taming uncertainty via automation: Observing, analyzing, and optimizing agentic ai systems,” no. arXiv:2507.11277, Nov. 2025, arXiv:2507.11277 [cs.AI]. [Online]. Available: http://arxiv.org/abs/2507.11277

work page arXiv 2025

[62] [65]

Langdb: Llm-enhanced database exploration,

langdb, “Langdb: Llm-enhanced database exploration,” 2025. [Online]. Available: https://github.com/langdb/ai-gateway

2025

[63] [66]

Langfuse: Open-source llm tracing and observability,

Langfuse, “Langfuse: Open-source llm tracing and observability,”

[64] [67]

Available: https://github.com/langfuse/langfuse

[Online]. Available: https://github.com/langfuse/langfuse

[65] [68]

MLflow for GenAI: Build Production-Ready AI Applications,

MLflow Project, a Series of LF Projects, LLC, “MLflow for GenAI: Build Production-Ready AI Applications,” https://mlflow.org/genai, 2025

2025

[66] [69]

Helicone: Llm observability platform,

Helicone, “Helicone: Llm observability platform,” 2025. [Online]. Available: https://github.com/Helicone/helicone

2025

[67] [70]

Langwatch,

LangWatch, “Langwatch,” 2025. [Online]. Available: https://github.c om/langwatch

2025

[68] [71]

(2025) Openllmetry: Open-source observability for your llm application

Traceloop. (2025) Openllmetry: Open-source observability for your llm application. [Online]. Available: https://github.com/traceloop/ope nllmetry

2025

[69] [72]

(2025) Arize phoenix: Open-source llm tracing & evaluation platform

Arize AI, Inc. (2025) Arize phoenix: Open-source llm tracing & evaluation platform. [Online]. Available: https://phoenix.arize.com/

2025

[70] [73]

(2025) Literal ai: Rag llm evaluation & observability platform

Literal AI, Inc. (2025) Literal ai: Rag llm evaluation & observability platform. [Online]. Available: https://www.literalai.com/

2025

[71] [74]

(2025) Opik — open-source llm evaluation platform

Comet ML, Inc. (2025) Opik — open-source llm evaluation platform. [Online]. Available: https://www.comet.com/site/products/opik/

2025

[72] [75]

(2025) Openinference: Opentelemetry instrumentation for ai observability

Arize AI, Inc. (2025) Openinference: Opentelemetry instrumentation for ai observability. [Online]. Available: https://github.com/Arize-ai/ openinference

2025

[73] [76]

(2025) Trulens: Open-source llm evaluation & observability platform

TruEra Inc. (2025) Trulens: Open-source llm evaluation & observability platform. [Online]. Available: https://www.trulens.org/

2025

[74] [77]

(2025) Honeyhive: Ai observability and evaluation platform

HoneyHive AI, Inc. (2025) Honeyhive: Ai observability and evaluation platform. [Online]. Available: https://www.honeyhive.ai/

2025

[75] [78]

(2025) Promptlayer: Platform for prompt engineering, management, evaluation, and llm observability

Magniv, Inc. (2025) Promptlayer: Platform for prompt engineering, management, evaluation, and llm observability. [Online]. Available: https://www.promptlayer.com/

2025

[76] [79]

(2025) Agentops: Developer platform for ai agent observability

AgentOps.ai. (2025) Agentops: Developer platform for ai agent observability. [Online]. Available: https://www.agentops.ai/

2025

[77] [80]

(2024) Deepeval: The llm evaluation framework

confident-ai. (2024) Deepeval: The llm evaluation framework. [Online]. Available: https://github.com/confident-ai/deepeval

2024

[78] [81]

LangSmith: The Agent Engineering Platform,

LangChain, “LangSmith: The Agent Engineering Platform,” 2026. [Online]. Available: https://www.langchain.com/langsmith-platform

2026

[79] [82]

Mlcapsule: Guarded offline deployment of machine learning as a service,

L. Hanzlik, Y . Zhang, K. Grosse, A. Salem, M. Augustin, M. Backes, and M. Fritz, “Mlcapsule: Guarded offline deployment of machine learning as a service,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 3300–3309

2021

[80] [83]

Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection- allocation,

Q. Huang, X. Dong, P. Zhang, B. Wang, C. He, J. Wang, D. Lin, W. Zhang, and N. Yu, “Opera: Alleviating hallucination in multi- modal large language models via over-trust penalty and retrospection- allocation,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 418–13 427

2024