
arxiv: 2606.01581 · v1 · pith:BB2X3MF2new · submitted 2026-06-01 · 💻 cs.MA

Agent System Operations: Categorization, Challenges, and Future Directions

Pith reviewed 2026-06-28 12:12 UTC · model grok-4.3

classification 💻 cs.MA
keywords agent systems · LLM-based agents · anomaly categorization · operations framework · monitoring · anomaly detection · root cause localization · system maintenance

The pith

Agent systems need the AgentOps framework to categorize anomalies as intra-agent or inter-agent and manage them through four operational stages.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys the operations of LLM-based agent systems, which often encounter anomalies that cause instability. It systematically defines these anomalies by splitting them into those occurring inside a single agent and those arising between multiple agents. It then presents AgentOps as an operational framework built around four stages. A sympathetic reader would care because agent systems are seeing growing industrial use yet lack established maintenance practices, leaving them vulnerable to failures. The work fills this gap by providing definitions and structure where prior research has been limited.

Core claim

The authors categorize anomalies within agent systems into intra-agent anomalies, which occur within individual agents, and inter-agent anomalies, which emerge from interactions among agents. They introduce the AgentOps framework to structure operations around four stages: monitoring to observe system states, anomaly detection to identify deviations, root cause localization to determine origins of issues, and resolution to apply fixes. This framework is positioned as a comprehensive approach to support stable and secure operation of agent systems.
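The two-way split can be made concrete with a small data model. The sketch below is illustrative only: the `Anomaly` record, its `agents_involved` field, and the rule in `classify_scope` (one implicated agent means intra-agent, several means inter-agent) are assumptions made for this example, not criteria taken from the paper.

```python
from __future__ import annotations

from dataclasses import dataclass
from enum import Enum


class Scope(Enum):
    """Where an anomaly originates, following the paper's two-way split."""
    INTRA_AGENT = "intra-agent"  # occurs within an individual agent
    INTER_AGENT = "inter-agent"  # emerges from interactions among agents


@dataclass(frozen=True)
class Anomaly:
    description: str
    agents_involved: tuple[str, ...]  # hypothetical field: implicated agent names


def classify_scope(anomaly: Anomaly) -> Scope:
    """Toy rule (an assumption): one distinct agent -> intra, more -> inter."""
    distinct = set(anomaly.agents_involved)
    if not distinct:
        raise ValueError("an anomaly must implicate at least one agent")
    return Scope.INTRA_AGENT if len(distinct) == 1 else Scope.INTER_AGENT
```

A real classifier would likely need richer evidence than a list of agent names, since a failure that surfaces in one agent may still have been caused by a message from another.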

What carries the argument

The AgentOps framework, which organizes agent system operations into the four stages of monitoring, anomaly detection, root cause localization, and resolution.
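The four stages can be read as a pipeline in which each stage consumes the previous stage's output. The following is a minimal sketch under stated assumptions: the function names, the event fields (`agent`, `latency_s`), the fixed latency threshold, and the "most flagged events" heuristic are all hypothetical stand-ins, not the paper's methods.

```python
from collections import Counter
from typing import Optional


def monitor(raw_events: list[dict]) -> list[dict]:
    """Stage 1 (monitoring): keep only well-formed observations."""
    return [e for e in raw_events if "agent" in e and "latency_s" in e]


def detect(events: list[dict], max_latency_s: float) -> list[dict]:
    """Stage 2 (anomaly detection): flag events above a fixed threshold (assumed)."""
    return [e for e in events if e["latency_s"] > max_latency_s]


def localize(anomalies: list[dict]) -> Optional[str]:
    """Stage 3 (root cause localization): blame the agent flagged most often (toy heuristic)."""
    if not anomalies:
        return None
    return Counter(e["agent"] for e in anomalies).most_common(1)[0][0]


def resolve(agent: Optional[str]) -> str:
    """Stage 4 (resolution): pick a fix; restart is a placeholder action."""
    return "no action" if agent is None else f"restart {agent}"


def run_agentops(raw_events: list[dict], max_latency_s: float = 2.0) -> str:
    """Chain the four stages in order."""
    return resolve(localize(detect(monitor(raw_events), max_latency_s)))
```

Keeping each stage as a separate function reflects the point that the stages can be developed and evaluated independently; any one of them could be swapped for a learned or LLM-based component without touching the others.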

If this is right

  • The anomaly categorization supplies a shared vocabulary that future studies can use to classify and compare issues across different agent implementations.
  • Each of the four AgentOps stages can be developed independently, allowing targeted tools for monitoring or root cause analysis to be built and evaluated.
  • Industrial deployments of agent systems can adopt the stages sequentially to reduce instability without requiring entirely new infrastructure.
  • The framework highlights specific challenges in each stage, directing research attention toward gaps such as effective resolution methods for inter-agent issues.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The intra- versus inter-agent split could be tested by measuring whether resolution success rates differ when failures are isolated to one agent versus spread across several.
  • AgentOps might integrate with existing software reliability practices by mapping its stages onto established DevOps pipelines for hybrid systems.
  • Quantifying the relative prevalence of intra-agent versus inter-agent anomalies in real deployments would help prioritize which stage of the framework receives the most tooling effort.
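The first and third extensions above both reduce to simple counts over labelled incidents. A minimal sketch, assuming a hypothetical incident log of `(scope, resolved)` pairs; the scope labels and the log format are illustrative and do not come from the paper.

```python
from collections import defaultdict
from typing import Iterable


def summarize_by_scope(incidents: Iterable[tuple[str, bool]]) -> dict[str, dict[str, float]]:
    """Return prevalence and resolution success rate per anomaly scope."""
    totals: dict[str, int] = defaultdict(int)
    resolved: dict[str, int] = defaultdict(int)
    for scope, was_resolved in incidents:
        totals[scope] += 1
        resolved[scope] += int(was_resolved)
    n = sum(totals.values())
    return {
        scope: {
            "prevalence": totals[scope] / n,                   # share of all incidents
            "resolution_rate": resolved[scope] / totals[scope],  # share fixed within this scope
        }
        for scope in totals
    }
```

Comparing `resolution_rate` across scopes addresses the first question, and comparing `prevalence` addresses the third. On a real log, a confidence interval would be needed before reading much into any difference.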

Load-bearing premise

Current research on the operations of agent systems is sparse, creating an urgent need for a survey that defines anomalies and establishes the AgentOps framework.

What would settle it

Empirical data from deployed agent systems showing that observed failures fall outside the intra-agent and inter-agent categories or cannot be addressed by the four proposed stages would challenge the framework's coverage.

Figures

Figures reproduced from arXiv: 2606.01581 by Changhua Pei, Dan Pei, David Lo, Fei Sun, Gaogang Xie, Hang Cui, Haotian Si, Jingjing Li, Quan Zhou, Yintong Huo, Yuanhao Liu, Zexin Wang, Zihan Liu.

Figure 1: Workflow of agent systems. [Diagram labels captured with this figure: execution stages (pre-execution, execution, post-execution); taxonomy (intra-agent, inter-agent); orchestration, reasoning, action, memory, task specification, security, termination, and communication anomalies.]
Figure 2: Definition and taxonomy of anomalies in agent systems.
Accompanying text: … clarify when anomalies may occur, but they do not specify where such anomalies originate within an agent system. Since agent systems can be broadly classified into single-agent and multi-agent systems, anomalies may originate either from the internal workflow of an individual agent or from interactions among multiple agents. This is analogous to tradit…
Figure 3: Components of agent systems.
Accompanying text: … DeepSeek-R1 [1], as well as prompting strategies such as CoT [23], Reflexion [24], Self-Consistency [25], CoK [26], and StepBack [27]. Despite these advances, reasoning anomalies remain prevalent, among which hallucination is the most representative. The definition of hallucination has been continuously refined in the literature. Rawte et al. [28] define hallucinations as unr…
Figure 4: Examples of traditional operations in agent systems. [Diagram labels captured with this figure: evolution of operational objects (traditional IT infrastructure, cloud-native applications, AI/ML systems, agent systems); evolution of operational technologies (manual operation, automation, observability platform, intelligence, autonomy); SRE (SLO/SLA, error budgets, monitoring and alerting); AIOps (monitoring, anomaly detection, root cause analysis, resolution); …]
Figure 6: Comparison of traditional system operations and agent system
Figure 7: Illustration of AgentOps. [Diagram labels captured with this figure: a microservice trace (Frontend, Checkout/API, Recommend/API, Payment/API, ProductCatalog/API, with JSON payloads and timings of 1.1s, 2.1s, 1.5s, 1.6s, 2.0s on the trace timeline) and an agent trace (main-agent, food-agent, tour-agent, search_food, search_place); …]
Figure 8: Comparison of trace data in microservice system and agent system.
Figure 11: Localization accuracy under different context lengths.
Accompanying text: … agent system’s execution. Accordingly, AgentDebug integrates taxonomy-aware analysis at each step of the agent system to jointly assess system behavior. Upon detecting a failure, the agent replans its trajectory, enabling timely resolution. C. Future Directions. Currently, attribution methods can be broadly divided into two categories: LLM-based method…
Original abstract

As the reasoning capabilities of Large Language Models (LLMs) continue to advance, LLM-based agent systems offer advantages in flexibility and interpretability over traditional systems, garnering increasing attention. However, despite the widespread research interest and industrial application of agent systems, these systems, like their traditional counterparts, frequently encounter anomalies. These anomalies lead to instability and insecurity, hindering their further development. Therefore, a comprehensive and systematic approach to the operation and maintenance of agent systems is urgently needed. Unfortunately, current research on the operations of agent systems is sparse. To address this gap, we have undertaken a survey on agent system operations with the aim of establishing a clear framework for the field, defining the challenges, and facilitating further development. Specifically, this paper begins by systematically defining anomalies within agent systems, categorizing them into intra-agent anomalies and inter-agent anomalies. Next, we introduce a novel and comprehensive operational framework for agent systems, dubbed Agent System Operations (AgentOps). We provide detailed definitions and explanations of its four key stages: monitoring, anomaly detection, root cause localization, and resolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper surveys operations for LLM-based agent systems. It defines anomalies as intra-agent or inter-agent, then proposes a new AgentOps framework whose four stages are monitoring, anomaly detection, root cause localization, and resolution. The work positions itself as filling an urgent gap created by sparse existing research on these operational topics.

Significance. If the sparsity premise holds and the categorization proves comprehensive, the survey could provide a useful organizing lens for reliability work on multi-agent LLM systems. The four-stage breakdown mirrors the established AIOps pipeline (monitoring, anomaly detection, root cause analysis, resolution) while highlighting agent-specific issues such as inter-agent coordination failures.

major comments (1)
  1. [Abstract / Introduction] The claim that 'current research on the operations of agent systems is sparse' is presented as the load-bearing motivation for both the survey and the novel AgentOps framework. No systematic literature review or citation count is supplied to substantiate it. Without this demonstration, the asserted gap, urgency, and novelty of the intra-/inter-agent split and four-stage framework cannot be evaluated.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback highlighting the need to substantiate the sparsity claim that motivates our survey. We address this point directly below and outline revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract / Introduction] The claim that 'current research on the operations of agent systems is sparse' is presented as the load-bearing motivation for both the survey and the novel AgentOps framework. No systematic literature review or citation count is supplied to substantiate it. Without this demonstration, the asserted gap, urgency, and novelty of the intra-/inter-agent split and four-stage framework cannot be evaluated.

    Authors: We agree that the sparsity assertion is central to the paper's motivation and that a more explicit demonstration would strengthen the work. Our claim derives from a broad review of the literature on LLM-based agents, where the overwhelming majority of publications focus on capability development, prompting techniques, and architectural designs rather than operational concerns such as monitoring, anomaly detection, root-cause analysis, and resolution. However, the manuscript does not include a quantified citation analysis or formal systematic literature review protocol. In the revised version we will add a dedicated subsection (likely in Section 2 or a new Appendix) that documents our search methodology, including databases queried, keywords employed, time window considered, and the approximate ratio of operation-focused papers to the total body of agent research. This addition will allow readers to evaluate the gap claim directly while preserving the intra-/inter-agent categorization and four-stage AgentOps framework as the paper's primary contributions.

    Revision: yes
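The promised "ratio of operation-focused papers" could be approximated with a keyword screen over paper titles. This is only a sketch of one possible protocol: the keyword list and the function `ops_ratio` are assumptions for illustration, and the rebuttal does not say how the authors will actually compute the ratio.

```python
from typing import Iterable

# Hypothetical keyword list, loosely based on the four AgentOps stage names.
OPS_KEYWORDS = ("monitoring", "anomaly detection", "root cause", "resolution")


def ops_ratio(titles: Iterable[str], keywords: tuple[str, ...] = OPS_KEYWORDS) -> float:
    """Fraction of titles containing at least one operations keyword."""
    titles = list(titles)
    if not titles:
        return 0.0
    hits = sum(any(k in title.lower() for k in keywords) for title in titles)
    return hits / len(titles)
```

A title-only screen will miss operations papers that use other vocabulary (for example "observability" or "failure attribution"), so a credible protocol would also need a documented keyword list and manual spot checks.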

Circularity Check

0 steps flagged

No circularity: survey categorization with independent literature premise

Full rationale

The paper is a survey that defines anomaly categories (intra-/inter-agent) and introduces the AgentOps framework (monitoring, anomaly detection, root cause localization, resolution) as an organizational structure. No equations, fitted parameters, predictions, or derivations exist. The premise that 'current research on the operations of agent systems is sparse' is an external claim about the literature, not a self-referential reduction or self-citation chain that forces the framework. The central contribution is the categorization itself, which stands as an independent synthesis rather than a renaming or definitional loop. This matches the default expectation for non-circular survey papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, the work relies on definitions and categorizations from prior literature without introducing new fitted parameters, axioms, or entities. No free parameters, axioms, or invented entities are identified from the abstract.

pith-pipeline@v0.9.1-grok · 5754 in / 1091 out tokens · 20889 ms · 2026-06-28T12:12:18.158605+00:00 · methodology


Reference graph

Works this paper leans on

126 extracted references · 34 canonical work pages · 13 internal anchors

  1. [1]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, P. Wang, X. Biet al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,”arXiv preprint arXiv:2501.12948, 2025

  2. [2]

    Anthropic

    A. Anthropic. (2024) The claude 3 model family: Opus, sonnet, haiku. claude-3 model card

  3. [3]

    Econagent: large language model-empowered agents for simulating macroeconomic activities,

    N. Li, C. Gao, M. Li, Y . Li, and Q. Liao, “Econagent: large language model-empowered agents for simulating macroeconomic activities,” 2024

  4. [4]

    Toolllm: Facilitating large language models to master 16000+ real-world apis,

    Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian et al., “Toolllm: Facilitating large language models to master 16000+ real-world apis,” in The Twelfth International Conference on Learning Representations, 2024

  5. [5]

    Flow-of-action: Sop enhanced llm-based multi-agent system for root cause analysis,

    C. Pei, Z. Wang, F. Liu, Z. Li, Y. Liu, X. He, R. Kang, T. Zhang, J. Chen, J. Li et al., “Flow-of-action: Sop enhanced llm-based multi-agent system for root cause analysis,” in Companion Proceedings of the ACM on Web Conference 2025, 2025, pp. 422–431

  6. [6]

    Recommender ai agent: Integrating large language models for interactive recommendations,

    X. Huang, J. Lian, Y. Lei, J. Yao, D. Lian, and X. Xie, “Recommender ai agent: Integrating large language models for interactive recommendations,” ACM Transactions on Information Systems, vol. 43, no. 4, pp. 1–33, 2025

  7. [8]

    Shielda: Structured handling of exceptions in llm-driven agentic workflows,

    J. Zhou, J. Chen, Q. Lu, D. Zhao, and L. Zhu, “Shielda: Structured handling of exceptions in llm-driven agentic workflows,”arXiv preprint arXiv:2508.07935, 2025

  8. [9]

    (2025) SWE-bench: A benchmark for evaluating software engineering agents

    SWE-bench Team. (2025) SWE-bench: A benchmark for evaluating software engineering agents. Leaderboard tracking AI agent performance on software engineering tasks. [Online]. Available: https://www.swebench.com

  9. [10]

    (2025) Llamatrace — hosted phoenix: Llm tracing & evaluation platform

    Arize AI, Inc. (2025) Llamatrace — hosted phoenix: Llm tracing & evaluation platform. [Online]. Available: https://phoenix.arize.com/llamatrace/

  10. [11]

    Automatic failure attribution and critical step prediction method for multi-agent systems based on causal inference,

    G. Ma, J. Zhu, H. Guo, W. Shi, J. Shen, J. Liu, and Y . Liang, “Automatic failure attribution and critical step prediction method for multi-agent systems based on causal inference,”arXiv preprint arXiv:2509.08682, 2025

  11. [12]

    Agent-pro: Learning to evolve via policy-level reflection and optimization,

    W. Zhang, K. Tang, H. Wu, M. Wang, Y . Shen, G. Hou, Z. Tan, P. Li, Y . Zhuang, and W. Lu, “Agent-pro: Learning to evolve via policy-level reflection and optimization,”arXiv preprint arXiv:2402.17574, 2024

  12. [13]

    Trial and error: Exploration-based trajectory optimization for llm agents,

    Y . Song, D. Yin, X. Yue, J. Huang, S. Li, and B. Y . Lin, “Trial and error: Exploration-based trajectory optimization for llm agents,”arXiv preprint arXiv:2403.02502, 2024

  13. [14]

    Agent AI: Surveying the Horizons of Multimodal Interaction

    Z. Durante, Q. Huang, N. Wake, R. Gong, J. S. Park, B. Sarkaret al., “Agent ai: Surveying the horizons of multimodal interaction,”arXiv preprint arXiv:2401.03568, 2024

  14. [15]

    Hallucination detection in foundation models for decision-making: A flexible definition and review of the state of the art,

    N. Chakraborty, M. Ornik, and K. Driggs-Campbell, “Hallucination detection in foundation models for decision-making: A flexible definition and review of the state of the art,” ACM Computing Surveys, 2025

  15. [16]

    Ai agents under threat: A survey of key security challenges and future pathways,

    Z. Deng, Y . Guo, C. Han, W. Ma, J. Xiong, S. Wen, and Y . Xiang, “Ai agents under threat: A survey of key security challenges and future pathways,”ACM Computing Surveys, vol. 57, no. 7, pp. 1–36, 2025

  16. [17]

    Towards trustworthy gui agents: A survey,

    Y . Shi, W. Yu, W. Yao, W. Chen, and N. Liu, “Towards trustworthy gui agents: A survey,”arXiv preprint arXiv:2503.23434, 2025

  17. [18]

    Which agent causes task failures and when? on automated failure attribution of LLM multi-agent systems,

    S. Zhang, M. Yin, J. Zhang, J. Liu, Z. Han, J. Zhang, B. Li, C. Wang, H. Wang, Y. Chen et al., “Which agent causes task failures and when? on automated failure attribution of LLM multi-agent systems,” in Forty-second International Conference on Machine Learning, 2025

  18. [19]

    React: Synergizing reasoning and acting in language models,

    S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y . Cao, “React: Synergizing reasoning and acting in language models,” in International Conference on Learning Representations (ICLR), 2023

  19. [20]

    Finetuned language models are zero-shot learners,

    J. Wei, M. Bosma, V . Y . Zhao, K. Guu, A. W. Yu, B. Lester, N. Du, A. M. Dai, and Q. V . Le, “Finetuned language models are zero-shot learners,” inInternational Conference on Learning Representations, 2022

  20. [21]

    Training language models to follow instructions with human feedback,

    L. Ouyang, J. Wu, X. Jiang, D. Almeida, C. Wainwright, P. Mishkin, C. Zhang, S. Agarwal, K. Slama, A. Rayet al., “Training language models to follow instructions with human feedback,”Advances in neural information processing systems, vol. 35, pp. 27 730–27 744, 2022

  21. [22]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    B. Jin, H. Zeng, Z. Yue, J. Yoon, S. Arik, D. Wang, H. Zamani, and J. Han, “Search-r1: Training llms to reason and leverage search engines with reinforcement learning,”arXiv preprint arXiv:2503.09516, 2025

  22. [23]

    Chain-of-thought prompting elicits reasoning in large language models,

    J. Wei, X. Wang, D. Schuurmans, M. Bosma, F. Xia, E. Chi, Q. V . Le, D. Zhouet al., “Chain-of-thought prompting elicits reasoning in large language models,”Advances in neural information processing systems, vol. 35, pp. 24 824–24 837, 2022

  23. [24]

    Reflexion: Language agents with verbal reinforcement learning,

    N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Reflexion: Language agents with verbal reinforcement learning,” Advances in Neural Information Processing Systems, vol. 36, pp. 8634– 8652, 2023

  24. [25]

    Self-consistency improves chain of thought reasoning in language models,

    X. Wang, J. Wei, D. Schuurmans, Q. Le, E. Chi, S. Narang, A. Chowdhery, and D. Zhou, “Self-consistency improves chain of thought reasoning in language models,” in The Eleventh International Conference on Learning Representations, 2023

  25. [26]

    Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources,

    X. Li, R. Zhao, Y. K. Chia, B. Ding, S. Joty, S. Poria, and L. Bing, “Chain-of-knowledge: Grounding large language models via dynamic knowledge adapting over heterogeneous sources,” in The Twelfth International Conference on Learning Representations, 2024

  26. [27]

    Take a step back: Evoking reasoning via abstraction in large language models,

    H. S. Zheng, S. Mishra, X. Chen, H.-T. Cheng, E. H. Chi, Q. V . Le, and D. Zhou, “Take a step back: Evoking reasoning via abstraction in large language models,” inThe Twelfth International Conference on Learning Representations, 2024

  27. [28]

    A Survey of Hallucination in Large Foundation Models

    V . Rawte, A. Sheth, and A. Das, “A survey of hallucination in large foundation models,”arXiv preprint arXiv:2309.05922, 2023

  28. [29]

    Peer review of gpt-4 technical report and systems card,

    J. Gallifant, A. Fiske, Y . A. Levites Strekalova, J. S. Osorio-Valencia, R. Parke, R. Mwavu, N. Martinez, J. W. Gichoya, M. Ghassemi, D. Demner-Fushmanet al., “Peer review of gpt-4 technical report and systems card,”PLOS digital health, vol. 3, no. 1, p. e0000417, 2024

  29. [30]

    Alignment for honesty,

    Y . Yang, E. Chern, X. Qiu, G. Neubig, and P. Liu, “Alignment for honesty,”Advances in Neural Information Processing Systems, vol. 37, pp. 63 565–63 598, 2024

  30. [31]

    Function calling in large language models: Industrial practices, challenges, and future directions,

    M. Wang, Y . Zhang, C. Peng, Y . Chen, W. Zhou, J. Gu, C. Zhuang, R. Guo, B. Yu, W. Wanget al., “Function calling in large language models: Industrial practices, challenges, and future directions,” 2025

  31. [32]

    The dark side of function calling: Pathways to jailbreaking large language models,

    Z. Wu, H. Gao, J. He, and P. Wang, “The dark side of function calling: Pathways to jailbreaking large language models,” in Proceedings of the 31st International Conference on Computational Linguistics. Association for Computational Linguistics, Jan. 2024, pp. 584–592

  32. [33]

    (2025) Ai-infra-guard

    Tencent. (2025) Ai-infra-guard. [Online]. Available: https://github.com/Tencent/AI-Infra-Guard

  33. [34]

    Lost in the middle: How language models use long contexts,

    N. F. Liu, K. Lin, J. Hewitt, A. Paranjape, M. Bevilacqua, F. Petroni, and P. Liang, “Lost in the middle: How language models use long contexts,” Transactions of the Association for Computational Linguistics, vol. 12, pp. 157–173, 2024

  34. [35]

    Unable to forget: Proactive interference reveals working memory limits in llms beyond context length,

    C. Wang and J. V. Sun, “Unable to forget: Proactive interference reveals working memory limits in llms beyond context length,” in ICML 2025 Workshop on Long-Context Foundation Models, 2025

  35. [36]

    Qe-rag: A robust retrieval-augmented generation benchmark for query entry errors,

    K. Zhang, Z. Sun, W. Yu, X. Zang, K. Zheng, Y. Song, H. Li, and J. Xu, “Qe-rag: A robust retrieval-augmented generation benchmark for query entry errors,” arXiv preprint arXiv:2504.04062, 2025

  36. [37]

    Astute rag: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models,

    F. Wang, X. Wan, R. Sun, J. Chen, and S. Ö. Arık, “Astute rag: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models,” arXiv preprint arXiv:2410.07176, 2024

  37. [38]

    Redeep: Detecting hallucination in retrieval-augmented generation via mechanistic interpretability,

    Z. Sun, X. Zang, K. Zheng, Y. Song, J. Xu, X. Zhang, W. Yu, and H. Li, “Redeep: Detecting hallucination in retrieval-augmented generation via mechanistic interpretability,” in The Thirteenth International Conference on Learning Representations, 2025

  38. [39]

    Benchmarking large language models in retrieval-augmented generation,

    J. Chen, H. Lin, X. Han, and L. Sun, “Benchmarking large language models in retrieval-augmented generation,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 17 754– 17 762

  39. [40]

    (2025) Emergent behavior in multi-agent systems: How complex behaviors arise from simple agent interactions

    Sanjeev. (2025) Emergent behavior in multi-agent systems: How complex behaviors arise from simple agent interactions

  40. [41]

    Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents,

    H. Zhang, J. Huang, K. Mei, Y . Yao, Z. Wang, C. Zhan, H. Wang, and Y . Zhang, “Agent security bench (asb): Formalizing and benchmarking attacks and defenses in llm-based agents,” inInternational Conference on Learning Representations, vol. 2025, 2025, pp. 35 331–35 366

  41. [42]

    Why do multiagent systems fail?

    M. Z. Pan, M. Cemri, L. A. Agrawal, S. Yang, B. Chopra, R. Tiwari, K. Keutzer, A. Parameswaran, K. Ramchandran, D. Kleinet al., “Why do multiagent systems fail?” inICLR 2025 Workshop on Building Trust in Language Models and Applications, 2025

  42. [43]

    Emergence in multi-agent systems: A safety perspective,

    P. Altmann, J. Schönberger, S. Illium, M. Zorn, F. Ritz, T. Haider, S. Burton, and T. Gabor, “Emergence in multi-agent systems: A safety perspective,” in International Symposium on Leveraging Applications of Formal Methods. Springer, 2024, pp. 104–120

  43. [45]

    Modeling exception management in multi-agent systems

    E. Platonet al., “Modeling exception management in multi-agent systems.” Ph.D. dissertation, Citeseer, 2007

  44. [46]

    P. D. OG. (2025) Building high-quality ai agent systems: Best practices

  45. [47]

    Agentfm: Role-aware failure management for distributed databases with llm-driven multi-agents,

    L. Zhang, Y. Zhai, T. Jia, X. Huang, C. Duan, and Y. Li, “Agentfm: Role-aware failure management for distributed databases with llm-driven multi-agents,” in Proceedings of the 33rd ACM International Conference on the Foundations of Software Engineering, 2025, pp. 947–958

  46. [48]

    A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges,

    X. Li, S. Wang, S. Zeng, Y. Wu, and Y. Yang, “A survey on llm-based multi-agent systems: workflow, infrastructure, and challenges,” Vicinagearth, vol. 1, no. 1, p. 9, 2024

  47. [49]

    Autogen: Enabling next-gen llm applications via multi-agent conversations,

    Q. Wu, G. Bansal, J. Zhang, Y . Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liuet al., “Autogen: Enabling next-gen llm applications via multi-agent conversations,” inFirst conference on language modeling, 2024

  48. [50]

    Camel: Communicative agents for “mind” exploration of large language model society,

    G. Li, H. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem, “Camel: Communicative agents for “mind” exploration of large language model society,” Advances in neural information processing systems, vol. 36, pp. 51 991–52 008, 2023

  49. [51]

    Bronsdon

    C. Bronsdon. (2025) Real-time anomaly detection for multi-agent ai systems

  50. [52]

    Cut the crap: An economical communication pipeline for llm-based multi-agent systems,

    G. Zhang, Y . Yue, Z. Li, S. Yun, G. Wan, K. Wang, D. Cheng, J. X. Yu, and T. Chen, “Cut the crap: An economical communication pipeline for llm-based multi-agent systems,” inThe Thirteenth International Conference on Learning Representations, 2025

  51. [53]

    Taxonomy of failure mode in agentic ai systems,

    Microsoft, “Taxonomy of failure mode in agentic ai systems,” 2025

  52. [54]

    Smurfs: Multi-agent system using context-efficient dfsdt for tool planning,

    J. Chen, J. Liang, and B. Wang, “Smurfs: Multi-agent system using context-efficient dfsdt for tool planning,” inProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2025, pp. 3281–3298

  53. [55]

    Redel: A toolkit for llm-powered recursive multi-agent systems,

    A. Zhu, L. Dugan, and C. Callison-Burch, “Redel: A toolkit for llm-powered recursive multi-agent systems,” in Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, 2024, pp. 162–171

  54. [56]

    ‘Neural howlround’ in large language models: a self-reinforcing bias phenomenon, and a dynamic attenuation solution,

    S. Drake, “‘Neural howlround’ in large language models: a self-reinforcing bias phenomenon, and a dynamic attenuation solution,” arXiv preprint arXiv:2504.07992, 2025

  55. [57]

    Devops: a definition and perceived adoption impediments,

    J. Smeds, K. Nybom, and I. Porres, “Devops: a definition and perceived adoption impediments,” inInternational conference on agile software development. Springer, 2015, pp. 166–177

  56. [58]

    Ai for it operations (aiops) on cloud platforms: Reviews, opportunities and challenges,

    Q. Cheng, D. Sahoo, A. Saha, W. Yang, C. Liu, G. Woo, M. Singh, S. Saverese, and S. C. Hoi, “Ai for it operations (aiops) on cloud platforms: Reviews, opportunities and challenges,”arXiv preprint arXiv:2304.04661, 2023

  57. [59]

    What is agentic operations (agenticops)?

    Cisco, “What is agentic operations (agenticops)?” https://www.cisco.com/site/us/en/learn/topics/artificial-intelligence/what-is-agentic-operations-agenticops.html, 2026, accessed: 2026-05-15

  58. [60]

    Opentelemetry: A cloud native observability framework,

    O. Authors, “Opentelemetry: A cloud native observability framework,” https://opentelemetry.io, 2019

  59. [62]

    Agentops: Enabling observability of llm agents,

    ——, “Agentops: Enabling observability of llm agents,” no. arXiv:2411.05285, Nov. 2024, arXiv:2411.05285 [cs.AI]. [Online]. Available: http://arxiv.org/abs/2411.05285

  60. [63]

    Architecting agentops needs change,

    S. Biswas, H. Bhatt, and K. Vaidhyanathan, “Architecting agentops needs change,” no. arXiv:2601.06456, Jan. 2026, arXiv:2601.06456 [cs.SE]. [Online]. Available: http://arxiv.org/abs/2601.06456

  61. [64]

    Taming uncertainty via automation: Observing, analyzing, and optimizing agentic ai systems,

    D. Moshkovich and S. Zeltyn, “Taming uncertainty via automation: Observing, analyzing, and optimizing agentic ai systems,” no. arXiv:2507.11277, Nov. 2025, arXiv:2507.11277 [cs.AI]. [Online]. Available: http://arxiv.org/abs/2507.11277

  62. [65]

    Langdb: Llm-enhanced database exploration,

    langdb, “Langdb: Llm-enhanced database exploration,” 2025. [Online]. Available: https://github.com/langdb/ai-gateway

  63. [66]

    Langfuse: Open-source llm tracing and observability,

    Langfuse, “Langfuse: Open-source llm tracing and observability,”

  64. [67]

    Available: https://github.com/langfuse/langfuse

    [Online]. Available: https://github.com/langfuse/langfuse

  65. [68]

    MLflow for GenAI: Build Production-Ready AI Applications,

    MLflow Project, a Series of LF Projects, LLC, “MLflow for GenAI: Build Production-Ready AI Applications,” https://mlflow.org/genai, 2025

  66. [69]

    Helicone: Llm observability platform,

    Helicone, “Helicone: Llm observability platform,” 2025. [Online]. Available: https://github.com/Helicone/helicone

  67. [70]

    Langwatch,

    LangWatch, “Langwatch,” 2025. [Online]. Available: https://github.com/langwatch

  68. [71]

    (2025) Openllmetry: Open-source observability for your llm application

    Traceloop. (2025) Openllmetry: Open-source observability for your llm application. [Online]. Available: https://github.com/traceloop/openllmetry

  69. [72]

    (2025) Arize phoenix: Open-source llm tracing & evaluation platform

    Arize AI, Inc. (2025) Arize phoenix: Open-source llm tracing & evaluation platform. [Online]. Available: https://phoenix.arize.com/

  70. [73]

    (2025) Literal ai: Rag llm evaluation & observability platform

    Literal AI, Inc. (2025) Literal ai: Rag llm evaluation & observability platform. [Online]. Available: https://www.literalai.com/

  71. [74]

    (2025) Opik — open-source llm evaluation platform

    Comet ML, Inc. (2025) Opik — open-source llm evaluation platform. [Online]. Available: https://www.comet.com/site/products/opik/

  72. [75]

    (2025) Openinference: Opentelemetry instrumentation for ai observability

    Arize AI, Inc. (2025) Openinference: Opentelemetry instrumentation for ai observability. [Online]. Available: https://github.com/Arize-ai/openinference

  73. [76]

    (2025) Trulens: Open-source llm evaluation & observability platform

    TruEra Inc. (2025) Trulens: Open-source llm evaluation & observability platform. [Online]. Available: https://www.trulens.org/

  74. [77]

    (2025) Honeyhive: Ai observability and evaluation platform

    HoneyHive AI, Inc. (2025) Honeyhive: Ai observability and evaluation platform. [Online]. Available: https://www.honeyhive.ai/

  75. [78]

    (2025) Promptlayer: Platform for prompt engineering, management, evaluation, and llm observability

    Magniv, Inc. (2025) Promptlayer: Platform for prompt engineering, management, evaluation, and llm observability. [Online]. Available: https://www.promptlayer.com/

  76. [79]

    (2025) Agentops: Developer platform for ai agent observability

    AgentOps.ai. (2025) Agentops: Developer platform for ai agent observability. [Online]. Available: https://www.agentops.ai/

  77. [80]

    (2024) Deepeval: The llm evaluation framework

    confident-ai. (2024) Deepeval: The llm evaluation framework. [Online]. Available: https://github.com/confident-ai/deepeval

  78. [81]

    LangSmith: The Agent Engineering Platform,

    LangChain, “LangSmith: The Agent Engineering Platform,” 2026. [Online]. Available: https://www.langchain.com/langsmith-platform

  79. [82]

    Mlcapsule: Guarded offline deployment of machine learning as a service,

    L. Hanzlik, Y . Zhang, K. Grosse, A. Salem, M. Augustin, M. Backes, and M. Fritz, “Mlcapsule: Guarded offline deployment of machine learning as a service,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 3300–3309

  80. [83]

    Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation,

    Q. Huang, X. Dong, P. Zhang, B. Wang, C. He, J. Wang, D. Lin, W. Zhang, and N. Yu, “Opera: Alleviating hallucination in multi-modal large language models via over-trust penalty and retrospection-allocation,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 13 418–13 427

Showing first 80 references.