Large Language Models for Agentic NetOps and AIOps: Architectures, Evaluation, and Safety
Pith reviewed 2026-05-14 19:50 UTC · model grok-4.3
The pith
Operational reliability in LLM-based NetOps and AIOps comes from the machinery around the model rather than the model itself.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that a consistent pattern appears across work on telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing. This pattern can be organized around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts. These contracts define what an agent may observe, propose, and execute, and they specify the checks that must pass before any action is allowed. Operational reliability does not come chiefly from the model itself. It depends on the machinery around the model, including permissions, policies, and rollback options. Evaluation should therefore move beyond static question answering to workflow-centred measures such as trace quality, bounded tool use, sandbox replay, and canary trials with rollback-aware scoring.
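The observe/propose/execute split of an assurance contract can be made concrete in a few lines. This is a minimal illustrative sketch, not code from the surveyed systems: all names (`AssuranceContract`, `gate`, the action strings, the change-freeze check) are hypothetical.

```python
from dataclasses import dataclass, field

change_freeze = False  # hypothetical operational state consulted by a preflight check

@dataclass
class AssuranceContract:
    may_observe: set[str]                 # telemetry sources the agent can read
    may_propose: set[str]                 # action types it may suggest to a human
    may_execute: set[str]                 # action types it may run directly
    preflight_checks: list = field(default_factory=list)  # callables: action -> bool

    def gate(self, action: str) -> str:
        """Classify an action as execute, propose-only, or deny."""
        if action in self.may_execute and all(chk(action) for chk in self.preflight_checks):
            return "execute"
        if action in self.may_propose:
            return "propose"              # surfaced for human approval
        return "deny"

contract = AssuranceContract(
    may_observe={"metrics", "logs"},
    may_propose={"config_change", "restart_service"},
    may_execute={"restart_service"},
    preflight_checks=[lambda a: not change_freeze],
)
print(contract.gate("restart_service"))   # execute
print(contract.gate("config_change"))     # propose
print(contract.gate("drop_table"))        # deny
```

The point of the sketch is that the reliability-relevant logic (permissions and checks) lives entirely outside the model: the LLM only ever emits candidate actions that pass through `gate`.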
What carries the argument
The hierarchy of autonomy, tool scope, evidence traces, and assurance contracts, which structures agent workflows from evidence gathering to action while enforcing operational checks and constraints.
If this is right
- Agentic systems require workflow-centred evaluation that includes trace quality, bounded tool use, safe proposal generation, and replay in sandboxed environments with rollback-aware scoring.
- Progress depends on treating autonomy as a constrained operational control problem whose outputs must be reliable, auditable, and securely deployable.
- Security, privacy, and governance risks become acute when agents sit close to operational control surfaces.
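One way to see what rollback-aware scoring adds over plain success/failure metrics is a toy scorer for sandboxed replay runs. The weights and signature below are hypothetical, chosen only to illustrate the shape of such a rubric; they are not drawn from the paper.

```python
# Hypothetical rollback-aware scorer for sandboxed replay: full credit only for
# a clean resolution, partial credit when a bad change is applied but rolled
# back, zero when it is left in place or the tool budget is exceeded.
def rollback_aware_score(resolved: bool, bad_change_applied: bool,
                         rolled_back: bool, tool_calls: int,
                         tool_budget: int = 10) -> float:
    if tool_calls > tool_budget:            # bounded tool use: over budget fails
        return 0.0
    if resolved and not bad_change_applied:
        return 1.0
    if bad_change_applied:
        return 0.3 if rolled_back else 0.0  # safe recovery beats silent damage
    return 0.1                              # no harm done, but nothing fixed

runs = [
    dict(resolved=True,  bad_change_applied=False, rolled_back=False, tool_calls=4),
    dict(resolved=False, bad_change_applied=True,  rolled_back=True,  tool_calls=6),
    dict(resolved=False, bad_change_applied=True,  rolled_back=False, tool_calls=3),
]
print([rollback_aware_score(**r) for r in runs])  # [1.0, 0.3, 0.0]
```

Static QA would score the second and third runs identically; a rollback-aware rubric separates them, which is the distinction the bullet points above are arguing for.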
Where Pith is reading between the lines
- The hierarchy could be tested for fit in other agentic settings such as robotic control or financial trading systems.
- Formal verification techniques could be added to strengthen assurance contracts.
- Empirical studies could quantify how much each layer of the hierarchy contributes to measured reliability.
Load-bearing premise
That the reviewed tasks exhibit a consistent pattern that can be usefully organized around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts.
What would settle it
A detailed literature scan that finds no shared structure across the listed tasks or shows reliability deriving primarily from the LLM without the additional machinery of contracts and checks.
read the original abstract
Large language models are increasingly being used to support network operations (NetOps) and artificial intelligence for IT operations (AIOps), including incident investigation, root-cause analysis, configuration synthesis, and limited self-healing. In both NetOps and AIOps, this shift is changing how tasks are managed. Agent-based operations work as workflows, from gathering evidence to taking action, following permissions, policies, and checks, and providing rollback options when necessary. This is crucial because operational decisions can have instant impacts. To make the argument concrete, we organise the relevant literature around the hierarchy of autonomy, tool scope, evidence traces, and assurance contracts. These contracts define what an agent may observe, propose, and execute. They also define the checks that must pass before any action is allowed. A consistent pattern appears across work on telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing. Operational reliability does not come chiefly from the model itself. It depends on the machinery around the model. We also argue that evaluation should go beyond static question answering. Agentic NetOps and AIOps systems require workflow-centred evaluation, including trace quality, bounded tool use, safe proposal generation, replay in sandboxed environments, and canary trials with rollback-aware scoring. Without these measures, a system may appear robust yet remain too fragile. Finally, we examine security, privacy, and governance risks that become acute when agents sit close to operational control surfaces. Taken together, the survey concludes that progress in intelligent NetOps and AIOps will depend on treating autonomy as a constrained operational control problem, whose outputs must be reliable, auditable, and securely deployable.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys the application of large language models to agentic NetOps and AIOps tasks including telemetry query recommendation, diagnosis, root-cause analysis, configuration synthesis, change planning, and limited self-healing. It organizes the literature around a proposed hierarchy of autonomy, tool scope, evidence traces, and assurance contracts, arguing that operational reliability derives primarily from the surrounding machinery and constraints rather than the LLM itself. The work further advocates shifting evaluation from static QA to workflow-centered metrics (trace quality, bounded tool use, sandbox replay, canary trials) and examines security, privacy, and governance risks when agents approach operational control surfaces.
Significance. If the claimed consistent pattern across domains holds with explicit per-work mappings, the survey offers a useful organizing framework for designing constrained, auditable agentic systems in high-stakes operations. The emphasis on workflow evaluation and assurance contracts addresses a genuine gap between model capabilities and deployable reliability; the safety discussion is timely given the proximity to live control planes.
major comments (2)
- [Abstract and §3] Abstract and §3 (Literature Organization): The central claim that 'a consistent pattern appears across' the six domains and that 'operational reliability does not come chiefly from the model itself' is asserted via high-level summaries but lacks explicit per-paper mappings to the four elements (autonomy hierarchy, tool scope, evidence traces, assurance contracts) that isolate the machinery as the decisive factor. Without tables or structured breakdowns showing specific failures of model-only approaches versus successes attributable to contracts/traces, the pattern remains framed rather than evidenced.
- [§4] §4 (Evaluation): The call for workflow-centred evaluation (trace quality, bounded tool use, sandbox replay, rollback-aware scoring) is well-motivated but the section provides no concrete metrics, scoring rubrics, or example evaluation protocols drawn from the surveyed works; this leaves the recommendation at the level of desiderata rather than actionable guidance that could be adopted by the community.
minor comments (2)
- [Abstract and §1] The abstract and introduction use 'agentic' and 'workflows' without an early formal definition or diagram; a small taxonomy figure would clarify the hierarchy before the literature sections.
- [§5] Several citations in the safety section (§5) appear to be recent preprints; adding a note on the recency and potential volatility of those sources would help readers assess the stability of the risk claims.
Simulated Author's Rebuttal
Thank you for the constructive comments. We address the major points below and will incorporate revisions to provide more explicit mappings and concrete evaluation guidance.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (Literature Organization): The central claim that 'a consistent pattern appears across' the six domains and that 'operational reliability does not come chiefly from the model itself' is asserted via high-level summaries but lacks explicit per-paper mappings to the four elements (autonomy hierarchy, tool scope, evidence traces, assurance contracts) that isolate the machinery as the decisive factor. Without tables or structured breakdowns showing specific failures of model-only approaches versus successes attributable to contracts/traces, the pattern remains framed rather than evidenced.
Authors: We appreciate this observation. Although the manuscript structures the discussion around the four elements with illustrative examples from the literature, we agree that a structured table would better evidence the consistent pattern. In the revised manuscript, we will introduce a summary table in Section 3 that explicitly maps each referenced work to the autonomy hierarchy, tool scope, evidence traces, and assurance contracts. This will highlight cases where model-only approaches fail and where the surrounding machinery provides the reliability, thereby strengthening the central claim. revision: yes
-
Referee: [§4] §4 (Evaluation): The call for workflow-centred evaluation (trace quality, bounded tool use, sandbox replay, rollback-aware scoring) is well-motivated but the section provides no concrete metrics, scoring rubrics, or example evaluation protocols drawn from the surveyed works; this leaves the recommendation at the level of desiderata rather than actionable guidance that could be adopted by the community.
Authors: We concur that the evaluation section would benefit from greater specificity. We will revise §4 to include concrete metrics and rubrics extracted from the surveyed papers, such as trace quality scores used in root-cause analysis studies, examples of bounded tool use from configuration works, and sandbox replay protocols from self-healing literature. Additionally, we will outline an example workflow evaluation protocol that the community could adopt, moving the recommendations from desiderata to actionable guidance. revision: yes
Circularity Check
No circularity: survey organizes literature without derivations or self-referential reductions
full rationale
The paper is a literature survey that proposes an organizational hierarchy (autonomy, tool scope, evidence traces, assurance contracts) to frame existing work on NetOps/AIOps tasks. No equations, fitted parameters, predictions, or derivations appear anywhere in the text. The central observation that reliability depends on surrounding machinery is presented as a pattern distilled from cited external literature rather than derived from the authors' prior results or by construction from the survey's own inputs. Self-citations, if present, are not load-bearing for any claim; the paper contains no uniqueness theorems, ansatzes, or renamings that reduce to self-reference. The derivation chain is therefore self-contained as a descriptive re-organization with no internal circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Agent-based operations work as workflows with permissions, policies, checks, and rollback options
Reference graph
Works this paper leans on
-
[1]
A general approach to network configuration analysis,
A. Fogel, S. Fung, L. Pedrosa, M. Walraed-Sullivan, R. Govindan, R. Mahajan, and T. Millstein, “A general approach to network configuration analysis,” in Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’15). USENIX Association, 2015, pp. 469–483
2015
-
[2]
Header space analysis: Static checking for networks,
P. Kazemian, G. Varghese, and N. McKeown, “Header space analysis: Static checking for networks,” in Proceedings of the 9th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2012
2012
-
[3]
VeriFlow: Verifying Network-Wide invariants in real time,
A. Khurshid, X. Zou, W. Zhou, M. Caesar, and P. B. Godfrey, “VeriFlow: Verifying Network-Wide invariants in real time,” in 10th USENIX Symposium on Networked Systems Design and Implementation (NSDI 13). USENIX Association, Apr. 2013, pp. 15–27
2013
-
[4]
Site Reliability Engineering: How Google Runs Production Systems,
N. R. Murphy, B. Beyer, C. Jones, and J. Petoff, Site Reliability Engineering: How Google Runs Production Systems. O’Reilly Media, 2016
2016
-
[5]
Accelerate: The Science of Lean Software and DevOps,
N. Forsgren, J. Humble, and G. Kim, Accelerate: The Science of Lean Software and DevOps: Building and Scaling High Performing Technology Organizations. IT Revolution Press, 2018
2018
-
[6]
The vision of autonomic computing,
J. Kephart and D. Chess, “The vision of autonomic computing,” Computer, vol. 36, no. 1, pp. 41–50, 2003
2003
-
[7]
Monitorassistant: Simplifying cloud service monitoring via large language models,
Z. Yu, M. Ma, C. Zhang, S. Qin, Y. Kang, C. Bansal, S. Rajmohan, Y. Dang, C. Pei, D. Pei, Q. Lin, and D. Zhang, “Monitorassistant: Simplifying cloud service monitoring via large language models,” in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering. ACM, 2024, pp. 38–49
2024
-
[8]
Xpert: Empowering incident management with query recommendations via large language models,
Y. Jiang, C. Zhang, S. He, Z. Yang, M. Ma, S. Qin, Y. Kang, Y. Dang, S. Rajmohan, Q. Lin, and D. Zhang, “Xpert: Empowering incident management with query recommendations via large language models,” in Proceedings of the IEEE/ACM 46th International Conference on Software Engineering, ser. ICSE ’24. Association for Computing Machinery, 2024
2024
-
[9]
Netassistant: Dialogue based network diagnosis in data center networks,
H. Wang, A. Abhashkumar, C. Lin, T. Zhang, X. Gu, N. Ma, C. Wu, S. Liu, W. Zhou, Y. Dong, W. Jiang, and Y. Wang, “Netassistant: Dialogue based network diagnosis in data center networks,” in Proceedings of the 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI ’24). USENIX Association, 2024
2024
-
[10]
Automatic root cause analysis via large language models for cloud incidents,
Y. Chen, H. Xie, M. Ma, Y. Kang, X. Gao, L. Shi, Y. Cao, X. Gao, H. Fan, M. Wen, J. Zeng, S. Ghosh, X. Zhang, C. Zhang, Q. Lin, S. Rajmohan, D. Zhang, and T. Xu, “Automatic root cause analysis via large language models for cloud incidents,” in Proceedings of the 19th European Conference on Computer Systems (EuroSys ’24). ACM, 2024, pp. 674–688
2024
-
[11]
Abstractions for network update,
M. Reitblatt, N. Foster, J. Rexford, C. Schlesinger, and D. Walker, “Abstractions for network update,” in Proceedings of the ACM SIGCOMM 2012 Conference. ACM, 2012, pp. 323–334
2012
-
[12]
Retrieval-augmented generation for knowledge-intensive NLP tasks,
P. Lewis, E. Perez, A. Piktus, F. Petroni, V. Karpukhin, N. Goyal, H. Küttler, M. Lewis, W. Yih, T. Rocktäschel, S. Riedel, and D. Kiela, “Retrieval-augmented generation for knowledge-intensive NLP tasks,” in Advances in Neural Information Processing Systems (NeurIPS 2020), 2020
2020
-
[13]
Toolformer: language models can teach themselves to use tools,
T. Schick, J. Dwivedi-Yu, R. Dessi, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom, “Toolformer: language models can teach themselves to use tools,” in Proceedings of the 37th International Conference on Neural Information Processing Systems, ser. NIPS ’23. Red Hook, NY, USA: Curran Associates Inc., 2023
2023
-
[14]
ReAct: Synergizing reasoning and acting in language models,
S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. R. Narasimhan, and Y. Cao, “ReAct: Synergizing reasoning and acting in language models,” in Proceedings of the International Conference on Learning Representations, ser. ICLR 2023, 2023. [Online]. Available: https://openreview.net/forum?id=WE_vluYUL-X
2023
-
[15]
A survey on intent-based networking,
A. Leivadeas and M. Falkner, “A survey on intent-based networking,” IEEE Communications Surveys & Tutorials, vol. 25, no. 1, pp. 625–655, 2023
2023
-
[16]
Intent-based networking for the enterprise: a modern network architecture,
M. Falkner and J. Apostolopoulos, “Intent-based networking for the enterprise: a modern network architecture,” Communications of the ACM, vol. 65, no. 11, pp. 108–117, 2022
2022
-
[17]
NetComplete: Practical network-wide configuration synthesis with autocompletion,
A. El-Hassany, P. Tsankov, L. Vanbever, and M. T. Vechev, “NetComplete: Practical network-wide configuration synthesis with autocompletion,” in Proceedings of the 15th USENIX Symposium on Networked Systems Design and Implementation (NSDI), 2018, pp. 579–594
2018
-
[18]
A survey of aiops methods for failure management,
P. Notaro, J. Cardoso, and M. Gerndt, “A survey of aiops methods for failure management,” ACM Trans. Intell. Syst. Technol., vol. 12, no. 6, 2021
2021
-
[19]
Webarena: A realistic web environment for building autonomous agents,
S. Zhou, F. F. Xu, H. Zhu, X. Zhou, R. Lo, A. Sridhar, X. Cheng, T. Ou, Y. Bisk, D. Fried, U. Alon, and G. Neubig, “Webarena: A realistic web environment for building autonomous agents,” in The Twelfth International Conference on Learning Representations, 2024
2024
-
[20]
API-bank: A comprehensive benchmark for tool-augmented LLMs,
M. Li, Y. Zhao, B. Yu, F. Song, H. Li, H. Yu, Z. Li, F. Huang, and Y. Li, “API-bank: A comprehensive benchmark for tool-augmented LLMs,” in Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, H. Bouamor, J. Pino, and K. Bali, Eds. Singapore: Association for Computational Linguistics, Dec. 2023, pp. 3102–3116
2023
-
[21]
Gorilla: Large language model connected with massive apis,
S. G. Patil, T. Zhang, X. Wang, and J. E. Gonzalez, “Gorilla: Large language model connected with massive apis,” in Advances in Neural Information Processing Systems (NeurIPS 2024), vol. 37, 2024, pp. 126544–126565
2024
-
[22]
Safely and automatically updating in-network acl configurations with intent language,
B. Tian, X. Zhang, E. Zhai, H. H. Liu, Q. Ye, C. Wang, X. Wu, Z. Ji, Y. Sang, M. Zhang et al., “Safely and automatically updating in-network acl configurations with intent language,” in Proceedings of the ACM Special Interest Group on Data Communication, 2019, pp. 214–226
2019
-
[23]
SWE-bench: Can language models resolve real-world github issues?
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan, “SWE-bench: Can language models resolve real-world github issues?” in The Twelfth International Conference on Learning Representations, 2024
2024
-
[24]
Understanding BGP misconfiguration,
R. Mahajan, D. Wetherall, and T. Anderson, “Understanding BGP misconfiguration,” in Proceedings of the ACM SIGCOMM 2002 Conference, 2002, pp. 3–16
2002
-
[25]
Openflow: enabling innovation in campus networks,
N. McKeown, T. Anderson, H. Balakrishnan, G. Parulkar, L. Peterson, J. Rexford, S. Shenker, and J. Turner, “Openflow: enabling innovation in campus networks,” SIGCOMM Comput. Commun. Rev., vol. 38, no. 2, pp. 69–74, 2008
2008
-
[26]
Consistent updates for software-defined networks: change you can believe in!
M. Reitblatt, N. Foster, J. Rexford, and D. Walker, “Consistent updates for software-defined networks: change you can believe in!” in Proceedings of the 10th ACM Workshop on Hot Topics in Networks, ser. HotNets-X, 2011, pp. 1–6
2011
-
[27]
G-rca: A generic root cause analysis platform for service quality management in large ip networks,
H. Yan, L. Breslau, Z. Ge, D. Massey, D. Pei, and J. Yates, “G-rca: A generic root cause analysis platform for service quality management in large ip networks,” IEEE/ACM Transactions on Networking, vol. 20, no. 6, pp. 1734–1747, 2012
2012
-
[28]
Mining causality of network events in log data,
S. Kobayashi, K. Otomo, K. Fukuda, and H. Esaki, “Mining causality of network events in log data,” IEEE Transactions on Network and Service Management, vol. 15, no. 1, pp. 53–67, 2018
2018
-
[29]
Causal analysis of network logs with layered protocols and topology knowledge,
S. Kobayashi, K. Otomo, and K. Fukuda, “Causal analysis of network logs with layered protocols and topology knowledge,” in 2019 15th International Conference on Network and Service Management (CNSM), 2019, pp. 1–9
2019
-
[30]
A general approach to network configuration verification,
R. Beckett, A. Gupta, R. Mahajan, and D. Walker, “A general approach to network configuration verification,” in Proceedings of the ACM SIGCOMM 2017 Conference, 2017, pp. 155–168
2017
-
[31]
Checking beliefs in dynamic networks,
N. P. Lopes, N. Bjørner, P. Godefroid, K. Jayaraman, and G. Varghese, “Checking beliefs in dynamic networks,” in Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’15), 2015, pp. 499–512
2015
-
[32]
Lessons from the evolution of the batfish configuration analysis tool,
M. Brown, A. Fogel, D. Halperin, V. Heorhiadi, R. Mahajan, and T. Millstein, “Lessons from the evolution of the batfish configuration analysis tool,” in Proceedings of the ACM SIGCOMM 2023 Conference, 2023, pp. 122–135
2023
-
[33]
Pinpoint: Problem determination in large, dynamic internet services,
M. Y. Chen, E. Kiciman, E. Fratkin, A. Fox, and E. A. Brewer, “Pinpoint: Problem determination in large, dynamic internet services,” in Proceedings of the International Conference on Dependable Systems and Networks (DSN). IEEE, 2002, pp. 595–604
2002
-
[34]
Orca: Differential bug localization in large-scale services,
R. Bhagwan, R. Kumar, C. S. Maddila, and A. A. Philip, “Orca: Differential bug localization in large-scale services,” in Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’18), 2018, pp. 493–509
2018
-
[35]
Towards intelligent incident management: why we need it and how we make it,
Z. Chen, Y. Kang, L. Li, X. Zhang, H. Zhang, H. Xu, Y. Zhou, L. Yang, J. Sun, Z. Xu et al., “Towards intelligent incident management: why we need it and how we make it,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2020, pp. 1487–1497
2020
-
[36]
How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems,
J. Jiang, W. Lu, J. Chen, Q. Lin, P. Zhao, Y. Kang, H. Zhang, Y. Xiong, F. Gao, Z. Xu et al., “How to mitigate the incident? an effective troubleshooting guide recommendation technique for online service systems,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineeri...
2020
-
[37]
Enjoy your observability: An industrial survey of microservice tracing and analysis,
B. Li, X. Peng, Q. Xiang, H. Wang, T. Xie, J. Sun, and X. Liu, “Enjoy your observability: An industrial survey of microservice tracing and analysis,” Empirical Software Engineering, vol. 27, no. 1, p. 25, 2022
2022
-
[38]
MRCA: Metric-level root cause analysis for microservices via multi-modal data,
Y. Wang, Z. Zhu, Q. Fu, Y. Ma, and P. He, “MRCA: Metric-level root cause analysis for microservices via multi-modal data,” in Proceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering (ASE ’24), 2024, pp. 1057–1068
2024
-
[39]
Hemirca: Fine-grained root cause analysis for microservices with heterogeneous data sources,
Z. Zhu, C. Lee, X. Tang, and P. He, “Hemirca: Fine-grained root cause analysis for microservices with heterogeneous data sources,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 8, 2024
2024
-
[40]
X-trace: A pervasive network tracing framework,
R. Fonseca, G. Porter, R. H. Katz, S. Shenker, and I. Stoica, “X-trace: A pervasive network tracing framework,” in Proceedings of the 4th USENIX Symposium on Networked Systems Design and Implementation (NSDI ’07), 2007, pp. 271–284
2007
-
[41]
Canopy: An end-to-end performance tracing and analysis system,
J. Kaldor, J. Mace, M. Bejda, E. Gao, W. Kuropatwa, J. O’Neill, K. W. Ong, B. Schaller, P. Shan, B. Viscomi, V. Venkataraman, K. Veeraraghavan, and Y. J. Song, “Canopy: An end-to-end performance tracing and analysis system,” in Proceedings of the 26th ACM Symposium on Operating Systems Principles (SOSP ’17), 2017, pp. 34–50
2017
-
[42]
An empirical study of policy as code: Adoption, purpose, and maintenance,
R. Opdebeeck, M. Alfadel, A. Rahman, Y. Kashiwa, J. F. Ferreira, R. G. Kula, and C. D. Roover, “An empirical study of policy as code: Adoption, purpose, and maintenance,” in Proceedings of the 23rd International Conference on Mining Software Repositories (MSR 2026), 2026
2026
-
[43]
Automated infrastructure as code program testing,
D. Sokolowski, D. Spielmann, and G. Salvaneschi, “Automated infrastructure as code program testing,” IEEE Transactions on Software Engineering, vol. 50, no. 6, pp. 1585–1599, 2024
2024
-
[44]
Change management in physical network lifecycle automation,
M. Al-Fares, V. Beauregard, K. Grant, A. Griffith, J. Hasan, C. Huang, Q. Leng, J. Li, A. Lin, Z. Liu, A. Mansy, B. Martinusen, N. Mehta, J. C. Mogul, A. Narver, A. Nigham, M. Obenberger, S. Smith, K. Steinkraus, S. Sun, E. Thiele, and A. Vahdat, “Change management in physical network lifecycle automation,” in 2023 USENIX Annual Technical Conference (USEN...
2023
-
[45]
Learning from lessons learned: Preliminary findings from a study of learning from failure,
J. Sillito and M. Pope, “Learning from lessons learned: Preliminary findings from a study of learning from failure,” in Proceedings of the 2024 IEEE/ACM 17th International Conference on Cooperative and Human Aspects of Software Engineering, 2024, pp. 97–102
2024
-
[46]
Drain: An online log parsing approach with fixed depth tree,
P. He, J. Zhu, Z. Zheng, and M. R. Lyu, “Drain: An online log parsing approach with fixed depth tree,” in 2017 IEEE International Conference on Web Services (ICWS), 2017, pp. 33–40
2017
-
[47]
Deeplog: Anomaly detection and diagnosis from system logs through deep learning,
M. Du, F. Li, G. Zheng, and V. Srikumar, “Deeplog: Anomaly detection and diagnosis from system logs through deep learning,” in Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 2017, pp. 1285–1298
2017
-
[48]
Loghub: A large collection of system log datasets for ai-driven log analytics,
J. Zhu, S. He, P. He, J. Liu, and M. R. Lyu, “Loghub: A large collection of system log datasets for ai-driven log analytics,” in 2023 IEEE 34th International Symposium on Software Reliability Engineering (ISSRE), 2023, pp. 355–366
2023
-
[49]
Opentelemetry specification,
OpenTelemetry Authors, “Opentelemetry specification,” Cloud Native Computing Foundation (CNCF), 2024, accessed: 2026-02-02
2024
-
[50]
An empirical study on change-induced incidents of online service systems,
Y. Wu, B. Chai, Y. Li, B. Liu, J. Li, Y. Yang, and W. Jiang, “An empirical study on change-induced incidents of online service systems,” in Proceedings of the IEEE/ACM 45th International Conference on Software Engineering: Software Engineering in Practice (ICSE-SEIP), 2023, pp. 234–245
2023
-
[51]
Identifying linked incidents in large-scale online service systems,
Y. Chen, X. Yang, H. Dong, X. He, H. Zhang, Q. Lin, J. Chen, P. Zhao, Y. Kang, F. Gao, Z. Xu, and D. Zhang, “Identifying linked incidents in large-scale online service systems,” in Proceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2020, pp. 304–314
2020
-
[52]
Llexus: an ai agent system for incident management,
P. Las-Casas, A. G. Kumbhare, R. Fonseca, and S. Agarwal, “Llexus: an ai agent system for incident management,” SIGOPS Oper. Syst. Rev., vol. 58, no. 1, 2024
2024
-
[53]
Tool learning with foundation models,
Y. Qin, S. Hu, Y. Lin, W. Chen, N. Ding, G. Cui, Z. Zeng, X. Zhou, Y. Huang, C. Xiao, C. Han, Y. R. Fung, Y. Su, H. Wang, C. Qian, R. Tian, K. Zhu, S. Liang, X. Shen, B. Xu, Z. Zhang, Y. Ye, B. Li, Z. Tang, J. Yi, Y. Zhu, Z. Dai, L. Yan, X. Cong, Y. Lu, W. Zhao, Y. Huang, J. Yan, X. Han, X. Sun, D. Li, J. Phang, C. Yang, T. Wu, H. Ji, G. Li, Z. L...
2024
-
[54]
ToolLLM: Facilitating large language models to master 16000+ real-world APIs,
Y. Qin, S. Liang, Y. Ye, K. Zhu, L. Yan, Y. Lu, Y. Lin, X. Cong, X. Tang, B. Qian, S. Zhao, L. Hong, R. Tian, R. Xie, J. Zhou, M. Gerstein, D. Li, Z. Liu, and M. Sun, “ToolLLM: Facilitating large language models to master 16000+ real-world APIs,” in The Twelfth International Conference on Learning Representations, 2024
2024
-
[55]
Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation,
J. Humble and D. Farley, Continuous Delivery: Reliable Software Releases through Build, Test, and Deployment Automation. Addison-Wesley Professional, 2010
2010
-
[56]
Risk based planning of network changes in evolving data centers,
O. Alipourfard, J. Gao, J. Koenig, C. Harshaw, A. Vahdat, and M. Yu, “Risk based planning of network changes in evolving data centers,” in Proceedings of the 27th ACM Symposium on Operating Systems Principles, ser. SOSP ’19. New York, NY, USA: Association for Computing Machinery, 2019, pp. 414–429
2019
-
[57]
Artificial intelligence risk management framework: Generative artificial intelligence profile,
C. Autio, R. Schwartz, J. Dunietz, S. Jain, M. Stanley, E. Tabassi, P. Hall, and K. Roberts, “Artificial intelligence risk management framework: Generative artificial intelligence profile,” Tech. Rep., 2024
2024
-
[58]
Creating characteristically auditable agentic ai systems,
C. C. Phiri, “Creating characteristically auditable agentic ai systems,” in Proceedings of Intelligent Robotics FAIR 2025 (IntRob ’25), 2025, pp. 1–14
2025
-
[59]
What do llms need to synthesize correct router configurations?
R. Mondal, A. Tang, R. Beckett, T. Millstein, and G. Varghese, “What do llms need to synthesize correct router configurations?” in Proceedings of the 22nd ACM Workshop on Hot Topics in Networks (HotNets ’23). Association for Computing Machinery, 2023, pp. 189–195
2023
-
[60]
Meshagent: Enabling reliable network management with large language models,
Y. Zhou, K. Hsieh, S. K. Mani, S. Kandula, and Z. Liu, “Meshagent: Enabling reliable network management with large language models,” Proc. ACM Meas. Anal. Comput. Syst., vol. 9, no. 3, Dec. 2025
2025
-
[61]
Artificial intelligence risk management framework (ai rmf 1.0),
E. Tabassi, “Artificial intelligence risk management framework (ai rmf 1.0),” Tech. Rep., 2023
2023
-
[62]
Dapper, a large-scale distributed systems tracing infrastructure,
B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, “Dapper, a large-scale distributed systems tracing infrastructure,” Google, Inc., Tech. Rep., 2010, technical report (widely circulated)
2010
-
[63]
Pivot tracing: Dynamic causal monitoring for distributed systems,
J. Mace, R. Roelke, and R. Fonseca, “Pivot tracing: Dynamic causal monitoring for distributed systems,” in Proceedings of the 25th Symposium on Operating Systems Principles (SOSP ’15). Association for Computing Machinery, 2015, pp. 378–393
2015
-
[64]
Real time network policy checking using header space analysis,
P. Kazemian, M. Chang, H. Zeng, G. Varghese, N. McKeown, and S. Whyte, “Real time network policy checking using header space analysis,” in Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation, 2013, pp. 99–112
2013
-
[65]
Accuracy, scalability, coverage: A practical configuration verifier on a global wan,
F. Ye, D. Yu, E. Zhai, H. H. Liu, B. Tian, Q. Ye, C. Wang, X. Wu, T. Guo, C. Jin, D. She, Q. Ma, B. Cheng, H. Xu, M. Zhang, Z. Wang, and R. Fonseca, “Accuracy, scalability, coverage: A practical configuration verifier on a global wan,” in Proceedings of the ACM SIGCOMM 2020 Conference, 2020, pp. 599–614
2020
-
[66]
Itbench: evaluating ai agents across diverse real-world it automation tasks,
S. Jha, R. Arora, Y. Watanabe, T. Yanagawa, Y. Chen, J. Clark, B. Bhavya, M. Verma, H. Kumar, H. Kitahara, N. Zheutlin, S. Takano, D. Pathak, F. George, X. Wu, B. O. Turkkan, G. Vanloo, M. Nidd, T. Dai, O. Chatterjee, P. Gupta, S. Samanta, P. Aggarwal, R. Lee, J.-w. Ahn, D. Kar, A. Paradkar, Y. Deng, P. Moogi, P. Mohapatra, N. Abe, C. Narayanaswami, T....
2025
-
[67]
Agentbench: Evaluating llms as agents,
X. Liu, H. Yu, H. Zhang, Y. Xu, X. Lei, H. Lai, Y. Gu, H. Ding, K. Men, K. Yang, S. Zhang, X. Deng, A. Zeng, Z. Du, C. Zhang, S. Shen, T. Zhang, Y. Su, H. Sun, M. Huang, Y. Dong, and J. Tang, “Agentbench: Evaluating llms as agents,” in ICLR 2024, 2024
2024
-
[68] S. Yao, N. Shinn, P. Razavi, and K. R. Narasimhan, "τ-bench: A benchmark for tool-agent-user interaction in real-world domains," in International Conference on Learning Representations (ICLR 2025), 2025. [Online]. Available: https://openreview.net/forum?id=roNSXZpUDN
[69] I. McCormack, "Preserving integrity in the age of generative AI," National Cyber Security Centre (NCSC), Jan. 2025. [Online]. Available: https://www.ncsc.gov.uk/blog-post/preserving-integrity-in-age-generative-ai
[70] K. Greshake, S. Abdelnabi, S. Mishra, C. Endres, T. Holz, and M. Fritz, "Not what you've signed up for: Compromising real-world LLM-integrated applications with indirect prompt injection," in Proceedings of the 2023 Workshop on Artificial Intelligence and Security (AISec '23). Association for Computing Machinery, 2023, pp. 79–90.
[71] S. Chen, J. Piet, C. Sitawarin, and D. Wagner, "StruQ: Defending against prompt injection with structured queries," in 34th USENIX Security Symposium (USENIX Security 25), 2025, pp. 2383–2400.
[72] D. Pasquini, E. M. Kornaropoulos, G. Ateniese, O. Akgul, A. Theocharis, and P. Efstathopoulos, "When AIOps become 'AI oops': Subverting LLM-driven IT operations via telemetry manipulation," arXiv:2508.06394, 2025.
[73] J. Sillito and E. Kutomi, "Failures and fixes: A study of software system incident response," in 2020 IEEE International Conference on Software Maintenance and Evolution (ICSME). IEEE, 2020, pp. 185–195.
[74] D. G. Widder, L. Dabbish, J. D. Herbsleb, A. Holloway, and S. Davidoff, "Trust in collaborative automation in high stakes software engineering work: A case study at NASA," in Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems. Association for Computing Machinery, 2021.
[75] Z. Wang, Z. Liu, Y. Zhang, A. Zhong, J. Wang, F. Yin, L. Fan, L. Wu, and Q. Wen, "RCAgent: Cloud root cause analysis by autonomous agents with tool-augmented large language models," in Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM), 2024, pp. 4966–4974.
[76] J. T. Gu, X. Sun, W. Zhang, Y. Jiang, C. Wang, M. Vaziri, O. Legunsen, and T. Xu, "Acto: Automatic end-to-end testing for operation correctness of cloud system management," in Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP). ACM, 2023, pp. 96–112.
[77] B. Grubic, Y. Wang, T. Petrochko, R. Yaniv, B. Jones, D. Callies, M. Clarke-Lauer, D. Kelley, S. Demetriou, K. Yu, and C. Tang, "Conveyor: One-Tool-Fits-All continuous software deployment at Meta," in 17th USENIX Symposium on Operating Systems Design and Implementation (OSDI 23). USENIX Association, 2023, pp. 325–342.
[78] A. Nazari, Y. Zhang, M. Raghothaman, and H. Chen, "Localized explanations for automatically synthesized network configurations," in Proceedings of the 23rd ACM Workshop on Hot Topics in Networks (HotNets), 2024, pp. 52–59.
[79] X. Liu, P. Zhang, A. Abhashkumar, J. Chen, and W. Jiang, "Automatic configuration repair," in Proceedings of the 23rd ACM Workshop on Hot Topics in Networks (HotNets), 2024, pp. 213–220.
[80] Y. Lu, H. Li, X. Cong, Z. Zhang, Y. Wu, Y. Lin, Z. Liu, F. Liu, and M. Sun, "Learning to generate structured output with schema reinforcement learning," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025, pp. 4905–4918.