UModel: An Agent-Ready Observability Data Modeling Method at Scale

Changhua Pei, Zheyuan Li, Zexin Wang, Hang Cui, Xiaohui Nie, Qi Zhou, Fang Situ, Cheng Zhang, Xin Zhang, Xidao Wen, Gaogang Xie, Jingjing Li, Dan Pei

arxiv: 2606.04799 · v1 · pith:PIB5PHO6new · submitted 2026-06-03 · 💻 cs.SE

Pith reviewed 2026-06-28 05:22 UTC · model grok-4.3

classification 💻 cs.SE

keywords: observability modeling · root cause analysis · ontological framework · semantic graphs · LLM agents · AIOps · data silos · system topology

The pith

UModel turns fragmented observability data into objects linked by semantic graphs so agents can trace root causes across system topologies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing data-centric observability frameworks leave LLM agents unable to form the relationships needed for automated root cause analysis because of silos and missing semantics. UModel addresses this by building a virtual layer that treats telemetry, entities, and knowledge as standardized objects connected through semantic graphs, plus a pipeline query interface that lets agents explore topologies autonomously. The authors report an 8 percent gain in localization precision after re-modeling one public challenge dataset and note that the same approach has run at production scale for over a year. A sympathetic reader would care because better data organization could directly raise the success rate of agent-driven diagnosis without changing the agents themselves.

Core claim

UModel constructs a virtual ontological layer where heterogeneous telemetry, entities, and expert knowledge are standardized as objects and interconnected via semantic graphs; a companion pipeline-based query interface called U-SPL then lets agents autonomously explore topologies and correlate multimodal data, producing an 8 percent increase in root cause localization precision on the AIOps 2025 Challenge dataset.

What carries the argument

The object-centric ontological layer that converts telemetry and entities into standardized objects joined by semantic graphs, together with the U-SPL query interface that supports autonomous topology exploration.
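
To make the idea of "objects joined by semantic graphs" concrete, here is a minimal Python sketch. It is not UModel's actual schema: the class names `Entity` and `ObjectGraph`, the field names, the `calls` relation, and the example services are all assumptions chosen for illustration. It shows entities as standardized objects, typed links between them, telemetry attached to entities, and a bounded breadth-first walk of the kind an agent might use to explore a topology.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A standardized object such as a service or database (hypothetical fields)."""
    entity_id: str
    kind: str
    description: str = ""


@dataclass
class ObjectGraph:
    """Entities joined by typed links, plus telemetry names attached to entities."""
    entities: dict = field(default_factory=dict)   # entity_id -> Entity
    links: dict = field(default_factory=dict)      # entity_id -> [(relation, target_id)]
    telemetry: dict = field(default_factory=dict)  # entity_id -> [telemetry names]

    def add_entity(self, entity):
        self.entities[entity.entity_id] = entity

    def link(self, source_id, relation, target_id):
        self.links.setdefault(source_id, []).append((relation, target_id))

    def attach(self, entity_id, telemetry_name):
        self.telemetry.setdefault(entity_id, []).append(telemetry_name)

    def reachable(self, start_id, max_hops=2):
        """Breadth-first walk over links; returns entity ids in visit order."""
        seen = [start_id]
        queue = deque([(start_id, 0)])
        while queue:
            current, hops = queue.popleft()
            if hops == max_hops:
                continue  # do not expand beyond the hop budget
            for _relation, target in self.links.get(current, []):
                if target not in seen:
                    seen.append(target)
                    queue.append((target, hops + 1))
        return seen


def build_example():
    """Two illustrative entities: a service that calls a database."""
    graph = ObjectGraph()
    graph.add_entity(Entity("checkout", "service", "Handles order checkout"))
    graph.add_entity(Entity("orders-db", "database", "Stores orders"))
    graph.link("checkout", "calls", "orders-db")
    graph.attach("checkout", "service_metrics")
    graph.attach("orders-db", "db_logs")
    return graph
```

Starting from `checkout`, `build_example().reachable("checkout")` visits the service and then the database it calls. From there, the telemetry attached to each visited entity can be looked up without source-specific glue code.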

If this is right

Agents gain the ability to traverse system topologies and correlate data without custom integration code for each data source.
Downstream RCA accuracy rises when the same agents operate on data organized under the unified object model.
The modeling supports production workloads at millions of operations per second with sub-second query responses.
Heterogeneous observability sources become interchangeable once expressed as objects in the semantic graph.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same object-graph approach could be applied to other agent tasks that require correlating logs, metrics, and traces across distributed systems.
Standardized semantic graphs might reduce the engineering cost of onboarding new observability tools or migrating between vendors.
If the modeling proves portable, it could serve as a shared substrate for multiple independent RCA agents rather than each maintaining its own data mappings.

Load-bearing premise

The measured 8 percent precision gain arises from the shift to object-centric modeling and semantic graphs rather than from differences in agents, experimental setup, or other unstated factors.

What would settle it

Re-executing the root cause localization task on the identical AIOps 2025 Challenge dataset using the original non-UModel data organization while holding the agents and all other experimental conditions fixed, then observing whether the precision remains unchanged.
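
One way such a controlled comparison could be scored is sketched below in Python. It assumes each incident yields a single correct/incorrect outcome per run, which may differ from the challenge's official precision metric. The function names and the choice of an exact McNemar test on paired outcomes are illustrative assumptions, not the paper's protocol.

```python
from math import comb


def precision(hits):
    """Fraction of cases whose predicted root cause was correct."""
    return sum(hits) / len(hits)


def exact_mcnemar_p(baseline_hits, remodeled_hits):
    """Two-sided exact McNemar test on paired per-case outcomes.

    Only discordant cases (correct in exactly one of the two runs) carry
    information about whether the data organization changed the result.
    """
    if len(baseline_hits) != len(remodeled_hits):
        raise ValueError("both runs must cover the same cases")
    pairs = list(zip(baseline_hits, remodeled_hits))
    only_baseline = sum(1 for b, r in pairs if b and not r)
    only_remodeled = sum(1 for b, r in pairs if r and not b)
    n = only_baseline + only_remodeled
    if n == 0:
        return 1.0  # the two runs agree on every case
    k = min(only_baseline, only_remodeled)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

With the agent, prompts, and incidents held fixed, a small p-value alongside a higher `precision` for the re-modeled run would support attributing the change to the data organization. A large p-value would suggest the observed delta is within what chance disagreement could produce.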

Figures

Figures reproduced from arXiv: 2606.04799 by Changhua Pei, Cheng Zhang, Dan Pei, Fang Situ, Gaogang Xie, Hang Cui, Jingjing Li, Qi Zhou, Xiaohui Nie, Xidao Wen, Xin Zhang, Zexin Wang, Zheyuan Li.

**Figure 1.** Overview of UModel. Labels recovered from the diagram: Service, call, DB; Service Logs, Service Metrics, Service Traces, Call Metrics, DB Logs, DB Traces, DB Metrics; EntitySet, EntitySetLink, TelemetryData, DataLink, Storage, StorageLink, Explorer, ExplorerLink; Service Explorer, DB Explorer.

**Figure 2.** An example of UModel.
Adjacent paper text: • Semantic & Knowledge Attachment: Through a standardized schema design, we attach high-level semantics (natural language descriptions), expert knowledge (diagnosis rules), and tools (remediation scripts) directly to these entities. This transforms the system from a “database of numbers” into an environment that the agent can reason about logically. 2) Pillar 2: Unified Interface an…

**Figure 3.** Comparison of query complexity between U-SPL and traditional methods. The top right depicts a concise U-SPL statement for retrieving golden

**Figure 4.** Drill-down in metricset explorer.
Adjacent paper text: … expose U-SPL directly to the RCA agent and let the LLM generate queries. However, this often leads to semantically incorrect queries, such as applying a service_id filter to a node-level metric that only accepts node_id. Moreover, raw query results may contain thousands of time-series points, which exceed the LLM context budget and degrade diagnosis reliability. To address…

**Figure 5.** Agent call chains of UModel and traditional data model.

**Figure 6.** Agent call chain of UModel.
Adjacent paper text: When performing RCA using traditional data models, a variety of anomaly detection tools (such as trace anomaly detection, log anomaly detection, and time-series anomaly detection) are typically invoked. These tools collectively generate a large volume of error signals across multiple services. However, the resulting information is often noisy, unstructured, and lacks prioritizati…

**Figure 7.** Visualized model construction.
Adjacent paper text (Appendix A, Visualized Model Building): Constructing a data model from large volumes of raw multimodal data is labor-intensive. Therefore, ease of construction is a prerequisite for the rapid adoption of an observable data model. Through continuous iteration, UModel has streamlined the construction process into a small number of intuitive steps: 1) Identification of entities a…
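
The text accompanying Figure 4 names two failure modes when an LLM writes queries directly: filters that do not match a metric's dimensions, and results too long for the context window. The paper's own remedy is cut off in this excerpt, so the Python sketch below is only a generic illustration of guarding against both. The catalog `METRIC_DIMENSIONS`, the metric names, and both function names are hypothetical.

```python
# Hypothetical catalog: metric name -> filter keys that metric accepts.
METRIC_DIMENSIONS = {
    "node_cpu_usage": {"node_id"},
    "service_latency_ms": {"service_id"},
}


def invalid_filters(metric, filters):
    """Return the filter keys the metric does not accept, sorted for stable output."""
    if metric not in METRIC_DIMENSIONS:
        raise KeyError(f"unknown metric: {metric}")
    return sorted(set(filters) - METRIC_DIMENSIONS[metric])


def summarize(points, max_points=5):
    """Reduce a long series to count, min, max, mean, and an evenly strided sample."""
    if not points:
        return {"count": 0, "sample": []}
    stride = max(1, -(-len(points) // max_points))  # ceiling division
    return {
        "count": len(points),
        "min": min(points),
        "max": max(points),
        "mean": sum(points) / len(points),
        "sample": points[::stride],
    }
```

A query that applies `service_id` to `node_cpu_usage` would be rejected before execution because `invalid_filters` returns a non-empty list. `summarize` keeps the payload handed back to the agent at a bounded size regardless of how many points the query returned.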

Abstract

When networked system failures occur, automatically performing Root Cause Analysis (RCA) using observability data is critical for ensuring networked system reliability. Recently, LLM-based agents have shown promise for automating this diagnosis process through advanced reasoning and autonomous exploration. However, existing observability frameworks remain archaic, characterized by fragmented data silos, incompatible schemas, and insufficient semantic metadata, preventing agents from establishing the complex relationships required for effective RCA. To address these challenges, we present UModel, a unified ontological framework that shifts observability from data-centric to object-centric modeling. UModel constructs a virtual ontological layer where heterogeneous telemetry, entities, and expert knowledge are standardized as objects and interconnected via semantic graphs. In addition, we introduce U-SPL, a pipeline-based query interface that enables agents to autonomously explore system topologies and correlate multimodal data. By re-modeling the "AIOps 2025 Challenge" dataset using UModel, the precision of root cause localization improved by 8%, demonstrating that enhanced data organization can significantly increase the accuracy of downstream tasks. UModel provides a scalable modeling framework that, in its deployment at Alibaba Cloud for more than one year, has served tens of thousands of users, sustained millions of operations per second, and delivered sub-second query latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

UModel reframes observability data as semantic objects for agents and reports an 8% RCA lift plus Alibaba deployment, but the gain lacks controls to tie it to the modeling change.

The main things here are a shift to object-centric modeling with semantic graphs to make telemetry usable by LLM agents for root cause analysis, plus a new pipeline query interface called U-SPL. They also claim an 8% precision improvement on the AIOps 2025 Challenge dataset after re-modeling and note a year-long deployment at Alibaba handling high load with low latency.

What the paper does is lay out a concrete way to standardize heterogeneous data and expert knowledge into interconnected objects, which directly targets the fragmentation that blocks agent reasoning. The deployment claim shows the framework runs at production scale, which gives it some grounding beyond the lab.

The soft spot is the 8% result. The description supplies no baseline method, no confirmation that the agent and prompts were held fixed, no raw scores, and no ablation on how the object model changed the input topology or features. Without those, the delta cannot be confidently linked to UModel rather than other factors in the evaluation. The deployment metrics do not close that gap.

This is for researchers working on LLM agents for automated diagnosis in networked systems or AIOps pipelines. Someone looking for practical data organization ideas would get value from the ontological layer and query design. The work shows clear engagement with the problem and has enough of a deployed system to deserve a serious referee, mainly to examine the experimental details around the reported gain.

Referee Report

2 major / 0 minor

Summary. The paper introduces UModel, a unified object-centric ontological framework for modeling heterogeneous observability data (telemetry, entities, expert knowledge) as interconnected semantic graphs, along with the U-SPL pipeline-based query interface. The central empirical claim is that re-modeling the AIOps 2025 Challenge dataset with UModel yields an 8% improvement in root cause localization precision for LLM-based RCA agents; the work also reports production deployment at Alibaba Cloud serving tens of thousands of users at millions of operations per second with sub-second latency.

Significance. If the 8% precision gain can be isolated to the object-centric modeling and semantic graphs with fixed agent behavior, the approach could meaningfully improve the effectiveness of LLM agents on RCA tasks by addressing data fragmentation and lack of semantics in existing observability systems. The reported production metrics at Alibaba provide evidence of scalability, which would strengthen the practical contribution if the experimental attribution holds.

major comments (2)

[Abstract] The claim that re-modeling the AIOps 2025 Challenge dataset with UModel improved root cause localization precision by 8% comes with no baseline RCA method, no statement that the LLM agent and prompts were held fixed, no raw before/after scores, no statistical test, and no description of how the re-modeling altered input features or topology. This prevents attribution of the delta to UModel's ontological modeling rather than to confounding changes in data ingestion or evaluation protocol.
[Evaluation] The experimental evaluation (referenced in the abstract) provides no ablation studies, controls, or exclusion criteria that would isolate the contribution of the semantic graphs and object-centric modeling from other factors such as query interface changes or dataset preprocessing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater clarity on experimental attribution and controls. We address each major comment below and will revise the manuscript to strengthen the presentation of results.

Referee: [Abstract] The claim that re-modeling the AIOps 2025 Challenge dataset with UModel improved root cause localization precision by 8% comes with no baseline RCA method, no statement that the LLM agent and prompts were held fixed, no raw before/after scores, no statistical test, and no description of how the re-modeling altered input features or topology. This prevents attribution of the delta to UModel's ontological modeling rather than to confounding changes in data ingestion or evaluation protocol.

Authors: We agree the abstract is insufficiently detailed for proper attribution. In the revision we will expand it to name the baseline RCA method, state explicitly that the LLM agent and prompts remained fixed, report the raw before/after precision scores, include the statistical test performed, and describe the precise changes introduced by the object-centric re-modeling (addition of semantic graphs and standardized entity objects) while confirming that data ingestion and evaluation protocols were unchanged.
Revision: yes

Referee: [Evaluation] The experimental evaluation (referenced in the abstract) provides no ablation studies, controls, or exclusion criteria that would isolate the contribution of the semantic graphs and object-centric modeling from other factors such as query interface changes or dataset preprocessing.

Authors: We acknowledge that the current evaluation section does not contain the requested ablations or controls. In the revised manuscript we will add ablation experiments that isolate the effect of the semantic graphs and object-centric modeling while holding the U-SPL query interface and preprocessing steps fixed, include explicit controls for dataset changes, and state the exclusion criteria applied. These additions will allow readers to attribute the observed 8% gain more directly to the ontological modeling.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claim on external dataset with no fitted parameters or self-referential derivations.

The paper's central claim is an 8% precision improvement on the external AIOps 2025 Challenge dataset after applying UModel. No equations, parameter fits, or predictions are described that reduce to inputs by construction. No self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. The modeling method and query interface are presented as independent contributions, with the result framed as an external evaluation rather than a derived quantity equivalent to its inputs. This is the common case of a self-contained empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

Abstract provides no explicit free parameters or external axioms; the central modeling premise is treated as a domain assumption.

axioms (1)

Domain assumption: Heterogeneous telemetry, entities, and expert knowledge can be standardized as objects interconnected via semantic graphs to support more effective RCA by agents.
This premise is invoked to justify the shift to object-centric modeling.

invented entities (2)

UModel (no independent evidence)
Purpose: unified ontological framework converting observability data to semantic object graphs.
New modeling construct introduced by the paper.

U-SPL (no independent evidence)
Purpose: pipeline-based query interface enabling autonomous agent exploration of system topologies.
New interface introduced by the paper.

pith-pipeline@v0.9.1-grok · 5791 in / 1203 out tokens · 34449 ms · 2026-06-28T05:22:22.137866+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

A Multi-Dataset Benchmark for Evaluating LLM Agents in Microservice Failure Diagnosis
cs.SE · 2026-06 · unverdicted · novelty 7.0

Introduces AIOps2025 and RCA100 datasets for evaluating LLM agents on microservice failure diagnosis via localization, identification, and reasoning-grounded-in-evidence dimensions.

Reference graph

Works this paper leans on

74 extracted references · 9 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Replit: Online ide & code editor for every language,

I. Replit, “Replit: Online ide & code editor for every language,” https: //replit.com/, Replit, Inc., 2024

2024
[2]

Google antigravity: Next-generation agentic development platform,

G. LLC, “Google antigravity: Next-generation agentic development platform,” https://antigravity.google/, Google LLC, 2025

2025
[3]

Gartner market guide for aiops: Es- sential reading for itops and sre,

IBM, “Gartner market guide for aiops: Es- sential reading for itops and sre,” Feb
[4]

Available: https://www.ibm.com/think/insights/ gartner-market-guide-for-aiops-essential-reading-for-itops-and-sre

[Online]. Available: https://www.ibm.com/think/insights/ gartner-market-guide-for-aiops-essential-reading-for-itops-and-sre
[5]

Aiops maturity model: From automation to autonomous it operations,

Gartner, “Aiops maturity model: From automation to autonomous it operations,” 2025. [Online]. Available: https://www.gartner.com/en/ information-technology/glossary/aiops-artificial-intelligence-operations

2025
[6]

A joint study of the challenges, opportunities, and roadmap of mlops and aiops: A systematic survey,

J. Diaz-De-Arcaya, A. I. Torre-Bastida, G. Z ´arate, R. Mi ˜n´on, and A. Almeida, “A joint study of the challenges, opportunities, and roadmap of mlops and aiops: A systematic survey,”ACM Computing Surveys, vol. 56, no. 4, pp. 1–30, 2023

2023
[7]

A survey of aiops for failure management in the era of large language models,

L. Zhang, T. Jia, M. Jia, Y . Wu, A. Liu, Y . Yang, Z. Wu, X. Hu, P. S. Yu, and Y . Li, “A survey of aiops for failure management in the era of large language models,”arXiv preprint arXiv:2406.11213, 2024

work page arXiv 2024
[8]

Unsupervised anomaly detection via variational auto- encoder for seasonal kpis in web applications,

H. Xu, W. Chen, N. Zhao, Z. Li, J. Bu, Z. Li, Y . Liu, Y . Zhao, D. Pei, Y . Fenget al., “Unsupervised anomaly detection via variational auto- encoder for seasonal kpis in web applications,” inProceedings of the 2018 world wide web conference, 2018, pp. 187–196

2018
[9]

Drain: An online log parsing approach with fixed depth tree,

P. He, J. Zhu, S. He, J. Li, and M. R. Lyu, “Drain: An online log parsing approach with fixed depth tree,” inProceedings of ICWS, 2017

2017
[10]

From point-wise to group-wise: A fast and accurate microservice trace anomaly detection approach,

Z. Xie, C. Pei, W. Li, H. Jiang, L. Su, J. Li, G. Xie, and D. Pei, “From point-wise to group-wise: A fast and accurate microservice trace anomaly detection approach,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 1739–1749

2023
[11]

Kan-ad: Time series anomaly detection with kolmogorov-arnold networks,

Q. Zhou, C. Pei, F. Sun, H. Jing, Z. Gao, H. Zhang, G. Xie, D. Pei, and J. Li, “Kan-ad: Time series anomaly detection with kolmogorov-arnold networks,” inProceedings of the 42nd International Conference on Machine Learning (ICML), 2025. [Online]. Available: https://openreview.net/forum?id=LWQ4zu9SdQ

2025
[12]

Actionable and interpretable fault localization for recurring failures in online service systems,

Z. Li, N. Zhao, M. Li, X. Lu, L. Wang, D. Chang, X. Nie, L. Cao, W. Zhang, K. Suiet al., “Actionable and interpretable fault localization for recurring failures in online service systems,” inProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2022, pp. 996– 1008

2022
[13]

Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey,

J. Soldani and A. Brogi, “Anomaly detection and failure root cause analysis in (micro) service-based cloud applications: A survey,”ACM Computing Surveys (CSUR), vol. 55, no. 3, pp. 1–39, 2022

2022
[14]

Failure diagnosis in microservice systems: A comprehensive survey and analysis,

S. Zhang, S. Xia, W. Fan, B. Shi, X. Xiong, Z. Zhong, M. Ma, Y . Sun, and D. Pei, “Failure diagnosis in microservice systems: A comprehensive survey and analysis,”ACM Transactions on Software Engineering and Methodology, vol. 35, no. 1, pp. 1–55, 2025

2025
[15]

A survey on failure analysis and fault injection in ai systems,

G. Yu, G. Tan, H. Huang, Z. Zhang, P. Chen, R. Natella, Z. Zheng, and M. R. Lyu, “A survey on failure analysis and fault injection in ai systems,”ACM Transactions on Software Engineering and Methodology, vol. 35, no. 1, pp. 1–42, 2026

2026
[16]

Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data,

G. Yu, P. Chen, Y . Li, H. Chen, X. Li, and Z. Zheng, “Nezha: Interpretable fine-grained root causes analysis for microservices on multi-modal observability data,” inProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering, 2023, pp. 553–565

2023
[17]

Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models,

Z. Wang, Z. Liu, Y . Zhang, A. Zhong, J. Wang, F. Yin, L. Fan, L. Wu, and Q. Wen, “Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models,” inProceedings of the 33rd ACM International Conference on Information and Knowledge Management, 2024, pp. 4966–4974

2024
[18]

Flow-of-action: Sop enhanced llm-based multi- agent system for root cause analysis,

C. Pei, Z. Wang, F. Liu, Z. Li, Y . Liu, X. He, R. Kang, T. Zhang, J. Chen, J. Liet al., “Flow-of-action: Sop enhanced llm-based multi- agent system for root cause analysis,” inCompanion Proceedings of the ACM on Web Conference 2025, 2025, pp. 422–431

2025
[19]

Prometheus: Monitoring system and time series database,

Prometheus Team, “Prometheus: Monitoring system and time series database,” https://prometheus.io/, 2012, accessed: 2025-08-21

2012
[20]

Elasticsearch: Open Source Distributed RESTful Search and Analytics Engine,

Elastic NV, “Elasticsearch: Open Source Distributed RESTful Search and Analytics Engine,” https://www.elastic.co/elasticsearch, 2010, ac- cessed: 2025-08-21

2010
[21]

Aiopsarena: Scenario-oriented evaluation and leaderboard for aiops algorithms in microservices,

Y . Sun, J. Wang, Z. Li, X. Nie, M. Ma, S. Zhang, Y . Ji, L. Zhang, W. Long, H. Chenet al., “Aiopsarena: Scenario-oriented evaluation and leaderboard for aiops algorithms in microservices,” in2025 IEEE Inter- national Conference on Software Analysis, Evolution and Reengineering (SANER). IEEE, 2025, pp. 809–813

2025
[22]

The unix™ programming envi- ronment,

B. W. Kernighan and J. R. Mashey, “The unix™ programming envi- ronment,”Software: Practice and Experience, vol. 9, no. 1, pp. 1–15, 1979

1979
[23]

Promassistant: Leveraging large language models for text-to-promql,

C. Zhang, B. Zhang, D. Yang, X. Peng, M. Chen, S. Xie, G. Chen, W. Bi, and W. Li, “Promassistant: Leveraging large language models for text-to-promql,”arXiv preprint arXiv:2503.03114, 2025

work page arXiv 2025
[24]

Cypher: An evolving query language for property graphs,

N. Francis, A. Green, P. Guagliardo, L. Libkin, T. Lindaaker, V . Marsault, S. Plantikow, M. Rydberg, P. Selmer, and A. Taylor, “Cypher: An evolving query language for property graphs,” inProceed- ings of the 2018 international conference on management of data, 2018, pp. 1433–1445

2018
[25]

Model Context Protocol (MCP): Landscape, Security Threats, and Future Research Directions

X. Hou, Y . Zhao, S. Wang, and H. Wang, “Model context protocol (mcp): Landscape, security threats, and future research directions,”arXiv preprint arXiv:2503.23278, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

ReAct: Synergizing Reasoning and Acting in Language Models

S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhanet al., “React: Synergizing reasoning and acting in language models,” arXiv preprint, 2023. [Online]. Available: https://arxiv.org/abs/2210.03629

work page internal anchor Pith review Pith/arXiv arXiv 2023
[27]

OpenClaw: Personal AI Assistant,

“OpenClaw: Personal AI Assistant,” https://github.com/openclaw/ openclaw
[28]

Chateval: Towards better llm-based evaluators through multi- agent debate,

C.-M. Chan, W. Chen, Y . Su, J. Yu, W. Xue, S. Zhang, J. Fu, and Z. Liu, “Chateval: Towards better llm-based evaluators through multi- agent debate,” inInternational conference on learning representations, vol. 2024, 2024, pp. 9079–9093

2024
[29]

Re- flexion: Language agents with verbal reinforcement learning,

N. Shinn, F. Cassano, A. Gopinath, K. Narasimhan, and S. Yao, “Re- flexion: Language agents with verbal reinforcement learning,”Advances in neural information processing systems, vol. 36, pp. 8634–8652, 2023

2023
[30]

Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models,

L. Wang, W. Xu, Y . Lan, Z. Hu, Y . Lan, R. K.-W. Lee, and E.-P. Lim, “Plan-and-solve prompting: Improving zero-shot chain-of-thought reasoning by large language models,” inProceedings of the 61st annual meeting of the association for computational linguistics (volume 1: long papers), 2023, pp. 2609–2634

2023
[31]

Rd-probe: Scalable monitoring with sufficient coverage in complex datacenter networks,

R. Ding, X. Liu, S. Yang, Q. Huang, B. Xie, R. Sun, Z. Zhang, and B. Cui, “Rd-probe: Scalable monitoring with sufficient coverage in complex datacenter networks,” inProceedings of the ACM SIGCOMM 2024 Conference, 2024, pp. 258–273

2024
[32]

µmon: Empowering microsecond-level network monitoring with wavelets,

H. Zheng, C. Huang, X. Han, J. Zheng, X. Wang, C. Tian, W. Dou, and G. Chen, “µmon: Empowering microsecond-level network monitoring with wavelets,” inProceedings of the ACM SIGCOMM 2024 Conference, 2024, pp. 274–290

2024
[33]

Eagle: Toward scalable and near-optimal network- wide sketch deployment in network measurement,

X. Chen, Q. Xiao, H. Liu, Q. Huang, D. Zhang, X. Liu, L. Hu, H. Zhou, C. Wu, and K. Ren, “Eagle: Toward scalable and near-optimal network- wide sketch deployment in network measurement,” inProceedings of the ACM SIGCOMM 2024 Conference, 2024, pp. 291–310

2024
[34]

Ipd: Detecting traffic ingress points at isps,

S. Mehner, H. Reelfs, I. Poese, and O. Hohlfeld, “Ipd: Detecting traffic ingress points at isps,” inProceedings of the ACM SIGCOMM 2024 Conference, 2024, pp. 778–793

2024
[35]

Achieving high-speed and robust encrypted traffic anomaly detection with programmable switches,

H. Zhang, G. Liu, X. Shi, Y . Li, D. He, J. Wang, Z. Wang, Y . Zhu, K. Ruan, W. Caoet al., “Achieving high-speed and robust encrypted traffic anomaly detection with programmable switches,” inProceedings of the ACM SIGCOMM 2025 Conference, 2025, pp. 1254–1256

2025
[36]

The syslog protocol,

R. Gerhards, “The syslog protocol,” RFC 5424, 2009. [Online]. Available: https://www.rfc-editor.org/rfc/rfc5424

2009
[37]

The bsd packet filter: A new architecture for user-level packet capture,

S. McCanne and V . Jacobson, “The bsd packet filter: A new architecture for user-level packet capture,” inProceedings of the USENIX Winter Conference, 1993. [Online]. Available: https://www.tcpdump.org/papers/ bpf-usenix93.pdf

1993
[38]

The ebpf runtime in the linux kernel,

T. Alabiet al., “The ebpf runtime in the linux kernel,” arXiv preprint,
[39]

Available: https://arxiv.org/abs/2410.00026

[Online]. Available: https://arxiv.org/abs/2410.00026

work page arXiv
[40]

Cisco systems netflow services export version 9,

B. Claise, “Cisco systems netflow services export version 9,” RFC 3954, 2004. [Online]. Available: https://www.rfc-editor.org/rfc/rfc3954

2004
[41]

Inmon corporation’s sflow: A method for monitoring traffic in switched and routed networks,

P. Phaal, S. Panchen, and N. McKee, “Inmon corporation’s sflow: A method for monitoring traffic in switched and routed networks,” RFC 3176, 2001. [Online]. Available: https://www.rfc-editor.org/rfc/rfc3176

2001
[42]

Hawkeye: Diagnosing rdma network performance anomalies with pfc provenance,

S. Wang, M. Zhang, X. Li, Q. Peng, H. Yu, Z. Wang, M. Xu, X. Hu, J. Yang, and X. Shi, “Hawkeye: Diagnosing rdma network performance anomalies with pfc provenance,” inProceedings of the ACM SIGCOMM 2025 Conference, 2025, pp. 481–495

2025
[43]

P4: Programming protocol-independent packet processors,

P. Bosshart, G. Gibb, H. Kim, G. Varghese, N. McKeown, M. Izzard, F. Mujica, and M. Horowitz, “P4: Programming protocol-independent packet processors,”ACM SIGCOMM Computer Communication Review, vol. 44, no. 3, pp. 87–95, 2014

2014
[44]

Robust anomaly detection for multivariate time series through stochastic recurrent neural network,

Y . Su, Y . Zhao, C. Niu, R. Liu, W. Sun, and D. Pei, “Robust anomaly detection for multivariate time series through stochastic recurrent neural network,” inProceedings of KDD, 2019

2019
[45]

Revisiting vae for unsupervised time series anomaly detection: A frequency perspective,

Z. Wang, C. Pei, M. Ma, X. Wang, Z. Li, D. Pei, S. Rajmohan, D. Zhang, Q. Lin, H. Zhanget al., “Revisiting vae for unsupervised time series anomaly detection: A frequency perspective,” inProceedings of the ACM web conference 2024, 2024, pp. 3096–3105

2024
[46]

Anomaly transformer: Time series anomaly detection with association discrepancy,

J. Xu, H. Wu, J. Wang, and M. Long, “Anomaly transformer: Time series anomaly detection with association discrepancy,” inInternational Conference on Learning Representations (ICLR), 2022. [Online]. Available: https://openreview.net/forum?id=LzQQ89U1qm

2022
[47]

Tshape: Rescuing machine learning models from complex shapelet anomalies,

H. Cui, J. Li, H. Si, Q. Zhou, C. Pei, G. Xie, and D. Pei, “Tshape: Rescuing machine learning models from complex shapelet anomalies,” in2025 IEEE 36th International Symposium on Software Reliability Engineering Workshops (ISSREW). IEEE, 2025, pp. 9–14

2025
[48]

Deeplog: Anomaly detection and diagnosis from system logs through deep learning,

M. Du, F. Li, G. Zheng, and V . Srikumar, “Deeplog: Anomaly detection and diagnosis from system logs through deep learning,” inProceedings of CCS, 2017

2017
[49]

Logbert: Log anomaly detection via bert,

H. Guo, S. Yuan, J. Wuet al., “Logbert: Log anomaly detection via bert,” inProceedings of IJCNN, 2022

2022
[50]

Microrca: Root cause localization of performance issues in microservices,

L. Wu, J. Tordsson, E. Elmroth, and O. Kao, “Microrca: Root cause localization of performance issues in microservices,” inIEEE/IFIP Network Operations and Management Symposium (NOMS), 2020. [Online]. Available: https://github.com/elastisys/MicroRCA

2020
[51]

Global, passive detection of connection tampering,

R. Sundara Raman, L.-H. Merino, K. Bock, M. Fayed, D. Levin, N. Sullivan, and L. Valenta, “Global, passive detection of connection tampering,” inProceedings of the ACM SIGCOMM 2023 Conference, 2023, pp. 622–636

2023
[52]

Localizing failure root causes in a microservice through causality inference,

Y. Meng, S. Zhang, Y. Sun, R. Zhang, Z. Hu, Y. Zhang, C. Jia, Z. Wang, and D. Pei, “Localizing failure root causes in a microservice through causality inference,” in Proceedings of IEEE/ACM IWQoS, 2020. [Online]. Available: https://nkcs.iops.ai/wp-content/uploads/2020/07/paper-IWQOS2020-MicroCause.pdf

2020
[54]

Towards llm-based failure localization in production-scale networks,

C. Wang, X. Zhang, R. Lu, X. Lin, X. Zeng, X. Zhang, Z. An, G. Wu, J. Gao, C. Tian et al., “Towards llm-based failure localization in production-scale networks,” in Proceedings of the ACM SIGCOMM 2025 Conference, 2025, pp. 496–511

2025
[55]

Microhecl: High-efficient root cause localization in large-scale microservice systems,

D. Liu, C. He, X. Peng, F. Lin, C. Zhang, S. Gong, Z. Li, J. Ou, and Z. Wu, “Microhecl: High-efficient root cause localization in large-scale microservice systems,” in ICSE-SEIP, 2021. [Online]. Available: https://arxiv.org/abs/2103.01782

2021
[56]

Skeletonhunter: Diagnosing and localizing network failures in containerized large model training,

W. Liu, K. Qian, Z. Li, T. Xu, Y. Liu, W. Wang, Y. Zhang, J. Li, S. Zhu, X. Li et al., “Skeletonhunter: Diagnosing and localizing network failures in containerized large model training,” in Proceedings of the ACM SIGCOMM 2025 Conference, 2025, pp. 527–540

2025
[57]

Skynet: Analyzing alert flooding from severe network failures in large cloud infrastructures,

B. Yang, H. Hu, Y. Li, Y. Li, X. Tang, B. Tian, G. Wu, J. Xu, X. Zhang, F. Chen et al., “Skynet: Analyzing alert flooding from severe network failures in large cloud infrastructures,” in Proceedings of the ACM SIGCOMM 2025 Conference, 2025, pp. 512–526

2025
[58]

Robust failure diagnosis of microservice system through multimodal data,

S. Zhang, P. Jin, Z. Lin, Y. Sun, B. Zhang, S. Xia, Z. Li, Z. Zhong, M. Ma, W. Jin, D. Zhang, Z. Zhu, and D. Pei, “Robust failure diagnosis of microservice system through multimodal data,” arXiv preprint, 2023. [Online]. Available: https://arxiv.org/abs/2302.10512

2023
[59]

Anomaly detection from system tracing data using multimodal deep learning,

S. Nedelkoski, J. Cardoso, and O. Kao, “Anomaly detection from system tracing data using multimodal deep learning,” in IEEE International Conference on Cloud Computing (CLOUD), 2019

2019
[60]

Dest: An unsupervised decoupled spatio-temporal framework for microservice incident management,

X. Nie, H. Cui, C. Pei, H. Si, K. Xiang, J. Li, Y. Li, G. Xie, and D. Pei, “Dest: An unsupervised decoupled spatio-temporal framework for microservice incident management,” in 2025 IEEE 36th International Symposium on Software Reliability Engineering (ISSRE). IEEE, 2025, pp. 335–346

2025
[61]

Toolformer: Language models can teach themselves to use tools,

T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu et al., “Toolformer: Language models can teach themselves to use tools,” arXiv preprint, 2023. [Online]. Available: https://arxiv.org/abs/2302.04761

2023
[63]

A survey of aiops in the era of large language models,

L. Zhang, T. Jia, M. Jia, Y. Wu, A. Liu, Y. Yang, Z. Wu, X. Hu, P. S. Yu, and Y. Li, “A survey of aiops in the era of large language models,” arXiv preprint, 2025. [Online]. Available: https://arxiv.org/abs/2507.12472

2025

[Figure: entities (Database, Kubernetes Node, Kubernetes Pod, Service) and their telemetry (Service Traces; Service, Node, and Pod Logs; Service, Node, and Pod Metrics).]

[Table: manual query steps compared with a single U-SPL query]

Q1 (Data)
Manual steps: ... Query {pod2} {metric} data; Query {pod3} {metric} data; Aggregate and return.
U-SPL: .entity_set(with(domain='aiops', name='aiops.pod', ids=[id1, id2,...])) | entity-call get_metric('aiops', 'aiops.metric.pod', '{metric}', 'range', '', aggregate=true)
Effect: Aggregates multi-pod metric values in one query and returns unified results.

Q2 (Knowledge)
Manual steps: 1) Get ownership (node/service); 2) Collect depth from trace/logs; 3) Merge into a unified dep graph/table; 4) Expand to 4 hops (BFS / multi-join); 5) Summarize and filter related entities.
U-SPL: .topo | graph-call cypher(`MATCH (s:`aiops@aiops.pod` {__entity_id__: 'id'})-[e]-(d)-[f]-(g)-[h]-(j)-[k]-(l) RETURN s, d, g, j, l`)
Effect: 1-query 4-hop subgraph extraction; no manual multi-join/BFS.

Q3 (Data+Knowledge)
Manual steps: 1) Resolve the Pod's node; 2) List all Pods on that node; 3) Query {metric} for these Pods; ...
U-SPL: .topo | graph-call cypher(MATCH (s:aiops@aiops.pod {__entity_id__: 'id'})-[e]-(d)-[f]-(g) RETURN g.__entity_id__ AS entity_ids), followed by .entity_set with(domain='aiops', name='aiops.node', ids=['entity_ids']) | entity-call get_metric('aiops', 'aiops.metric.node', '{metric}', 'range', '', aggregate=false)
Effect: Topology query first finds peer Pods on the same node, then batch metric retrieval is executed in one flow without manual joins.

... must query each Pod's metric data individually and perform aggr...
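For comparison, the manual "Expand to 4 hops (BFS / multi-join)" step listed for Q2 can be sketched in Python as a breadth-first search over an adjacency map. This is only an illustration of the baseline that the single graph query is said to replace: the function name `k_hop_neighbors`, the entity names, and the edge list are hypothetical and do not come from the paper or from any U-SPL API. Note also that this sketch collects every entity within four hops, whereas the fixed-length Cypher pattern matches paths of exactly four relationships.

```python
from collections import deque


def k_hop_neighbors(adjacency, start, max_hops=4):
    """Return {entity: hop_distance} for every entity within max_hops of start."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        # Do not expand past the hop limit.
        if distance[node] == max_hops:
            continue
        for neighbor in adjacency.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    return distance


# Hypothetical topology edges, e.g. merged from ownership records and traces.
edges = [
    ("pod-a", "node-1"),
    ("node-1", "pod-b"),
    ("pod-b", "svc-x"),
    ("svc-x", "db-1"),
    ("db-1", "node-2"),
]

# Build an undirected adjacency map, since the pattern -[e]- ignores direction.
adjacency = {}
for u, v in edges:
    adjacency.setdefault(u, set()).add(v)
    adjacency.setdefault(v, set()).add(u)

reachable = k_hop_neighbors(adjacency, "pod-a", max_hops=4)
```

In this toy graph `node-2` sits five hops from `pod-a`, so it is left out of `reachable`. In the manual workflow the adjacency map itself has to be assembled from several sources first, which is the multi-join cost the table attributes to the baseline.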
