arxiv: 2606.04799 · v1 · pith:PIB5PHO6new · submitted 2026-06-03 · 💻 cs.SE

UModel: An Agent-Ready Observability Data Modeling Method at Scale

Pith reviewed 2026-06-28 05:22 UTC · model grok-4.3

classification 💻 cs.SE
keywords observability modeling · root cause analysis · ontological framework · semantic graphs · LLM agents · AIOps · data silos · system topology

The pith

UModel turns fragmented observability data into objects linked by semantic graphs so agents can trace root causes across system topologies.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing data-centric observability frameworks leave LLM agents unable to form the relationships needed for automated root cause analysis because of silos and missing semantics. UModel addresses this by building a virtual layer that treats telemetry, entities, and knowledge as standardized objects connected through semantic graphs, plus a pipeline query interface that lets agents explore topologies autonomously. The authors report an 8 percent gain in localization precision after re-modeling one public challenge dataset and note that the same approach has run at production scale for over a year. A sympathetic reader would care because better data organization could directly raise the success rate of agent-driven diagnosis without changing the agents themselves.

Core claim

UModel constructs a virtual ontological layer where heterogeneous telemetry, entities, and expert knowledge are standardized as objects and interconnected via semantic graphs; a companion pipeline-based query interface called U-SPL then lets agents autonomously explore topologies and correlate multimodal data, producing an 8 percent increase in root cause localization precision on the AIOps 2025 Challenge dataset.

What carries the argument

The object-centric ontological layer that converts telemetry and entities into standardized objects joined by semantic graphs, together with the U-SPL query interface that supports autonomous topology exploration.
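To make the idea of objects joined by a semantic graph concrete, here is a minimal Python sketch. It is not UModel's schema or API: the `Entity` and `ObjectGraph` classes, the relation names, and the dataset names are all hypothetical, chosen only to show how an agent could walk a topology and collect the telemetry attached to nearby entities.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Entity:
    """A standardized object such as a service, pod, or node (hypothetical shape)."""
    kind: str
    entity_id: str


@dataclass
class ObjectGraph:
    """Entities joined by typed, undirected links, with telemetry attached per entity."""
    links: dict = field(default_factory=dict)      # Entity -> list of (relation, Entity)
    telemetry: dict = field(default_factory=dict)  # Entity -> list of dataset names

    def link(self, src, relation, dst):
        self.links.setdefault(src, []).append((relation, dst))
        self.links.setdefault(dst, []).append((relation, src))

    def attach(self, entity, dataset):
        self.telemetry.setdefault(entity, []).append(dataset)

    def neighborhood(self, start, max_hops):
        """Breadth-first walk returning each reachable entity with its hop distance."""
        seen = {start: 0}
        queue = deque([start])
        while queue:
            current = queue.popleft()
            if seen[current] == max_hops:
                continue  # do not expand beyond the hop limit
            for _, nxt in self.links.get(current, []):
                if nxt not in seen:
                    seen[nxt] = seen[current] + 1
                    queue.append(nxt)
        return seen


# Hypothetical topology: a service running on a pod, scheduled on a node, calling a DB.
svc = Entity("service", "checkout")
pod = Entity("pod", "checkout-0")
node = Entity("node", "node-a")
db = Entity("db", "orders")

graph = ObjectGraph()
graph.link(svc, "runs_on", pod)
graph.link(pod, "scheduled_on", node)
graph.link(svc, "calls", db)
graph.attach(node, "node_metrics")
graph.attach(db, "db_logs")

reachable = graph.neighborhood(svc, max_hops=2)
datasets = sorted(d for e in reachable for d in graph.telemetry.get(e, []))
print(datasets)  # ['db_logs', 'node_metrics']
```

The point of the sketch is that telemetry is found by following links from an entity rather than by knowing in advance which store holds which signal.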

If this is right

  • Agents gain the ability to traverse system topologies and correlate data without custom integration code for each data source.
  • Downstream RCA accuracy rises when the same agents operate on data organized under the unified object model.
  • The modeling supports production workloads at millions of operations per second with sub-second query responses.
  • Heterogeneous observability sources become interchangeable once expressed as objects in the semantic graph.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same object-graph approach could be applied to other agent tasks that require correlating logs, metrics, and traces across distributed systems.
  • Standardized semantic graphs might reduce the engineering cost of onboarding new observability tools or migrating between vendors.
  • If the modeling proves portable, it could serve as a shared substrate for multiple independent RCA agents rather than each maintaining its own data mappings.

Load-bearing premise

The measured 8 percent precision gain arises from the shift to object-centric modeling and semantic graphs rather than from differences in agents, experimental setup, or other unstated factors.

What would settle it

Re-executing the root cause localization task on the identical AIOps 2025 Challenge dataset using the original non-UModel data organization while holding the agents and all other experimental conditions fixed, then observing whether the precision remains unchanged.
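A controlled rerun of this kind yields paired outcomes: the same cases, the same agent, two data organizations. The sketch below shows one way such pairs could be scored. The exact McNemar test is an assumption made here for illustration, not a method the paper reports using, and the predictions are made-up toy values, not results from the AIOps 2025 Challenge dataset.

```python
from math import comb


def precision(predictions, truths):
    """Fraction of cases whose top-1 predicted root cause matches the label."""
    hits = sum(p == t for p, t in zip(predictions, truths))
    return hits / len(truths)


def mcnemar_exact(baseline_ok, treated_ok):
    """Two-sided exact McNemar p-value computed over the discordant pairs."""
    b = sum(x and not y for x, y in zip(baseline_ok, treated_ok))  # only baseline correct
    c = sum(y and not x for x, y in zip(baseline_ok, treated_ok))  # only treated correct
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy labels and predictions (hypothetical): same agent, two data organizations.
truths = list("abcdefgh")
baseline = list("abcdxxxx")  # original data organization
treated = list("abcdefxx")   # re-modeled data organization

baseline_ok = [p == t for p, t in zip(baseline, truths)]
treated_ok = [p == t for p, t in zip(treated, truths)]

print(precision(baseline, truths))             # 0.5
print(precision(treated, truths))              # 0.75
print(mcnemar_exact(baseline_ok, treated_ok))  # 0.5
```

With only two discordant cases the toy gap is not statistically distinguishable from noise, which is why raw before/after scores and a significance test matter for a single-digit gain.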

Figures

Figures reproduced from arXiv: 2606.04799 by Changhua Pei, Cheng Zhang, Dan Pei, Fang Situ, Gaogang Xie, Hang Cui, Jingjing Li, Qi Zhou, Xiaohui Nie, Xidao Wen, Xin Zhang, Zexin Wang, Zheyuan Li.

Figure 1: Overview of UModel. Diagram labels: Service, call, DB; Service Logs, Service Metrics, Service Traces, Call Metrics, DB Logs, DB Traces, DB Metrics; EntitySet, EntitySetLink, TelemetryData, DataLink, Storage, StorageLink, Explorer, ExplorerLink; Service Explorer, DB Explorer.
Figure 2: An example of UModel.
    Adjacent text from the paper: • Semantic & Knowledge Attachment: Through a standardized schema design, we attach high-level semantics (natural language descriptions), expert knowledge (diagnosis rules), and tools (remediation scripts) directly to these entities. This transforms the system from a “database of numbers” into an environment that the agent can reason about logically. 2) Pillar 2: Unified Interface an…
Figure 3: Comparison of query complexity between U-SPL and traditional methods. The top right depicts a concise U-SPL statement for retrieving golden…
Figure 4: Drill-down in metricset explorer.
    Adjacent text from the paper: … expose U-SPL directly to the RCA agent and let the LLM generate queries. However, this often leads to semantically incorrect queries, such as applying a service_id filter to a node-level metric that only accepts node_id. Moreover, raw query results may contain thousands of time-series points, which exceed the LLM context budget and degrade diagnosis reliability. To address…
Figure 5: Agent call chains of UModel and traditional data model.
Figure 6: Agent call chain of UModel.
    Adjacent text from the paper: When performing RCA using traditional data models, a variety of anomaly detection tools (such as trace anomaly detection, log anomaly detection, and time-series anomaly detection) are typically invoked. These tools collectively generate a large volume of error signals across multiple services. However, the resulting information is often noisy, unstructured, and lacks prioritizati…
Figure 7: Visualized model construction.
    Adjacent text from the paper: APPENDIX A. Visualized Model Building. Constructing a data model from large volumes of raw multimodal data is labor-intensive. Therefore, ease of construction is a prerequisite for the rapid adoption of an observable data model. Through continuous iteration, UModel has streamlined the construction process into a small number of intuitive steps: 1) Identification of entities a…
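The text extracted alongside Figure 4 names two failure modes when an LLM writes queries directly: filters that a metric does not accept, and results too large for the context window. The following Python sketch shows one way a tool layer could guard against both. The metric names, the `METRIC_FILTERS` table, and both functions are hypothetical; the paper's own remedy is cut off in the extraction, so this is an illustration of the problem, not a reconstruction of the authors' solution.

```python
# Hypothetical schema: which filter keys each metric accepts.
METRIC_FILTERS = {
    "node_cpu_usage": {"node_id"},
    "service_latency": {"service_id"},
}


def validate_filters(metric, filters):
    """Raise ValueError if any filter key is not accepted by the metric."""
    allowed = METRIC_FILTERS[metric]
    bad = sorted(set(filters) - allowed)
    if bad:
        raise ValueError(f"{metric} does not accept {bad}; allowed: {sorted(allowed)}")


def downsample(points, max_points):
    """Average consecutive buckets so that at most max_points values remain."""
    if len(points) <= max_points:
        return list(points)
    size = -(-len(points) // max_points)  # ceiling division gives the bucket width
    buckets = [points[i:i + size] for i in range(0, len(points), size)]
    return [sum(bucket) / len(bucket) for bucket in buckets]


# A node-level metric queried with a service-level filter is rejected up front.
try:
    validate_filters("node_cpu_usage", {"service_id": "checkout"})
except ValueError as err:
    print(err)

# Ten raw points are reduced to three bucket averages before reaching the agent.
print(downsample(list(range(10)), max_points=3))  # [1.5, 5.5, 8.5]
```

Rejecting the query before execution gives the agent an error message it can act on, and bucket averaging bounds the result size regardless of the time range requested.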
Original abstract

When networked system failures occur, automatically performing Root Cause Analysis (RCA) using observability data is critical for ensuring networked system reliability. Recently, LLM-based agents have shown promise for automating this diagnosis process through advanced reasoning and autonomous exploration. However, existing observability frameworks remain archaic, characterized by fragmented data silos, incompatible schemas, and insufficient semantic metadata, preventing agents from establishing the complex relationships required for effective RCA. To address these challenges, we present UModel, a unified ontological framework that shifts observability from data-centric to object-centric modeling. UModel constructs a virtual ontological layer where heterogeneous telemetry, entities, and expert knowledge are standardized as objects and interconnected via semantic graphs. In addition, we introduce U-SPL, a pipeline-based query interface that enables agents to autonomously explore system topologies and correlate multimodal data. By re-modeling the "AIOps 2025 Challenge" dataset using UModel, the precision of root cause localization improved by 8%, demonstrating that enhanced data organization can significantly increase the accuracy of downstream tasks. UModel provides a scalable modeling framework that, in its deployment at Alibaba Cloud for more than one year, has served tens of thousands of users, sustained millions of operations per second, and delivered sub-second query latency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces UModel, a unified object-centric ontological framework for modeling heterogeneous observability data (telemetry, entities, expert knowledge) as interconnected semantic graphs, along with the U-SPL pipeline-based query interface. The central empirical claim is that re-modeling the AIOps 2025 Challenge dataset with UModel yields an 8% improvement in root cause localization precision for LLM-based RCA agents; the work also reports production deployment at Alibaba Cloud serving tens of thousands of users at millions of operations per second with sub-second latency.

Significance. If the 8% precision gain can be isolated to the object-centric modeling and semantic graphs with fixed agent behavior, the approach could meaningfully improve the effectiveness of LLM agents on RCA tasks by addressing data fragmentation and lack of semantics in existing observability systems. The reported production metrics at Alibaba provide evidence of scalability, which would strengthen the practical contribution if the experimental attribution holds.

major comments (2)
  1. [Abstract] The claim that re-modeling the AIOps 2025 Challenge dataset with UModel improved root cause localization precision by 8% supplies no baseline RCA method, no statement that the LLM agent/prompts were held fixed, no raw before/after scores, no statistical test, and no description of how the re-modeling altered input features or topology. This prevents attribution of the delta to UModel's ontological modeling versus confounding changes in data ingestion or evaluation protocol.
  2. [Evaluation] The experimental evaluation (referenced in the abstract) provides no ablation studies, controls, or exclusion criteria that would isolate the contribution of the semantic graphs and object-centric modeling from other factors such as query interface changes or dataset preprocessing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for greater clarity on experimental attribution and controls. We address each major comment below and will revise the manuscript to strengthen the presentation of results.

Point-by-point responses
  1. Referee: [Abstract] The claim that re-modeling the AIOps 2025 Challenge dataset with UModel improved root cause localization precision by 8% supplies no baseline RCA method, no statement that the LLM agent/prompts were held fixed, no raw before/after scores, no statistical test, and no description of how the re-modeling altered input features or topology. This prevents attribution of the delta to UModel's ontological modeling versus confounding changes in data ingestion or evaluation protocol.

    Authors: We agree the abstract is insufficiently detailed for proper attribution. In the revision we will expand it to name the baseline RCA method, state explicitly that the LLM agent and prompts remained fixed, report the raw before/after precision scores, include the statistical test performed, and describe the precise changes introduced by the object-centric re-modeling (addition of semantic graphs and standardized entity objects) while confirming that data ingestion and evaluation protocols were unchanged.
    Revision: yes

  2. Referee: [Evaluation] The experimental evaluation (referenced in the abstract) provides no ablation studies, controls, or exclusion criteria that would isolate the contribution of the semantic graphs and object-centric modeling from other factors such as query interface changes or dataset preprocessing.

    Authors: We acknowledge that the current evaluation section does not contain the requested ablations or controls. In the revised manuscript we will add ablation experiments that isolate the effect of the semantic graphs and object-centric modeling while holding the U-SPL query interface and preprocessing steps fixed, include explicit controls for dataset changes, and state the exclusion criteria applied. These additions will allow readers to attribute the observed 8% gain more directly to the ontological modeling.
    Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claim on external dataset with no fitted parameters or self-referential derivations.

Full rationale

The paper's central claim is an 8% precision improvement on the external AIOps 2025 Challenge dataset after applying UModel. No equations, parameter fits, or predictions are described that reduce to inputs by construction. No self-citations are invoked as load-bearing for uniqueness theorems or ansatzes. The modeling method and query interface are presented as independent contributions, with the result framed as an external evaluation rather than a derived quantity equivalent to its inputs. This is the common case of a self-contained empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Abstract provides no explicit free parameters or external axioms; the central modeling premise is treated as a domain assumption.

axioms (1)
  • domain assumption: Heterogeneous telemetry, entities, and expert knowledge can be standardized as objects interconnected via semantic graphs to support more effective RCA by agents.
    This premise is invoked to justify the shift to object-centric modeling.
invented entities (2)
  • UModel (no independent evidence)
    purpose: Unified ontological framework converting observability data to semantic object graphs
    New modeling construct introduced by the paper.
  • U-SPL (no independent evidence)
    purpose: Pipeline-based query interface enabling autonomous agent exploration of system topologies
    New interface introduced by the paper.

pith-pipeline@v0.9.1-grok · 5791 in / 1203 out tokens · 34449 ms · 2026-06-28T05:22:22.137866+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. A Multi-Dataset Benchmark for Evaluating LLM Agents in Microservice Failure Diagnosis

    cs.SE · 2026-06 · unverdicted · novelty 7.0

    Introduces AIOps2025 and RCA100 datasets for evaluating LLM agents on microservice failure diagnosis via localization, identification, and reasoning-grounded-in-evidence dimensions.


Query comparison (fragment of a table in the paper, partially recovered)

  First row (label truncated)
    Traditional steps: … Query {pod2} {metric} data; Query {pod3} {metric} data; Aggregate and return.
    U-SPL: .entity set(with(domain='aiops', name='aiops.pod', ids=[id1, id2,...])) | entity-call get metric('aiops', 'aiops.metric.pod', '{metric}', ' range', '', aggregate=true)
    Effect: Aggregates multi-pod metric values in one query and returns unified results.

  Q2 (Knowledge)
    Traditional steps: 1) Get ownership (node/service); Collect depth from trace/logs; Merge into a unified dep graph/table; Expand to 4 hops (BFS / multi-join); Summarize and filter related entities.
    U-SPL: .topo | graph-call cypher(` MATCH (s:`aiops@aiops.pod` { entity id: 'id'})-[e]-(d)-[f]-(g)-[h]-(j)-[k]-(l) RETURN s, d, g, j, l`)
    Effect: 1-query 4-hop subgraph extraction; no manual multi-join/BFS.

  Q3 (Data+Knowledge)
    Traditional steps: 1) Resolve the Pod's node; List all Pods on that node; Query {metric} for these Pods; …
    U-SPL (two fragments): topo | graph-call cypher( MATCH (s:aiops@aiops.pod { entity id: 'id'})-[e]-(d)-[f]-(g) RETURN g. entity id AS entity ids )} … entity set with(domain='aiops', name='aiops.node', ids=[' entity ids']) | entity-call get metric('aiops', 'aiops.metric.node', '{metric}', ' range', '', aggregate=false)
    Effect: Topology query first finds peer Pods on the same node, then batch metric retrieval is executed in one flow without manual joins.

  Trailing fragment: … must query each Pod's metric data individually and perform aggr...
    entity set with(domain=’aiops ’, name=’aiops.node’, ids=[’ entity ids’])|entity−call get metric(’aiops’,’aiops. metric.node’,’{metric}’,’ range’,’’,aggregate=false) Topology query first finds peer Pods on the same node, then batch metric retrieval is executed in one flow without manual joins. must query each Pod’s metric data individually and perform aggr...