pith. machine review for the scientific record.

arXiv: 2604.06448 · v2 · submitted 2026-04-07 · cs.LG · cs.AI · cs.MM · eess.IV


From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures


Pith reviewed 2026-05-10 19:47 UTC · model grok-4.3

classification cs.LG · cs.AI · cs.MM · eess.IV
keywords anomaly detection · microservice architectures · graph embeddings · graph neural networks · load testing · cosine similarity · synthetic evaluation · service dependency graphs

The pith

Graph embeddings from service call graphs identify microservice behaviors unique to live events versus load tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that unsupervised node embeddings learned from directed weighted service graphs can flag services that act differently during real traffic spikes than during simulated load tests. A sympathetic reader would care because load tests may overlook production-specific issues that lead to incidents, leaving gaps in system validation. The method constructs minute-level graphs of service interactions and uses a graph convolutional autoencoder to produce embeddings, then measures cosine similarity between those from load test periods and live event periods. Services with low similarity are marked as anomalous, and the approach correctly surfaces documented incident-related services while introducing a synthetic anomaly injection method for testing.

Core claim

The central claim is that node-level embeddings from a GCN-GAE applied to minute-resolution directed weighted service graphs enable anomaly detection by comparing cosine similarity scores between embeddings computed on load test data and live event data, thereby identifying under-represented services whose behaviors are missed by load tests alone.
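The comparison step in this claim can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes one embedding vector per service for each condition, and the service names, the `flag_services` helper, and the 0.5 threshold are all hypothetical (the paper's threshold is not stated here).

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a zero vector as maximally dissimilar
    return dot / (norm_u * norm_v)


def flag_services(load_test_emb, event_emb, threshold=0.5):
    """Return {service: score} for services whose similarity falls below threshold.

    The threshold value is an assumption made for this sketch.
    """
    flagged = {}
    for service, event_vec in event_emb.items():
        test_vec = load_test_emb.get(service)
        if test_vec is None:
            continue  # service never seen in the load test; handled elsewhere
        score = cosine_similarity(test_vec, event_vec)
        if score < threshold:
            flagged[service] = score
    return flagged


# Hypothetical two-dimensional embeddings for two made-up services.
load_test_emb = {"playback": [1.0, 0.0], "catalog": [0.6, 0.8]}
event_emb = {"playback": [0.0, 1.0], "catalog": [0.6, 0.8]}
flagged = flag_services(load_test_emb, event_emb)
```

In this toy input only `playback` is flagged, because its event embedding is orthogonal to its load-test embedding while `catalog` is unchanged.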

What carries the argument

The GCN-GAE that learns unsupervised structural node embeddings from the service call graph, with cosine similarity serving as the metric to quantify behavioral divergence between test and live conditions.
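For orientation, the standard (non-variational) graph autoencoder of Kipf and Welling can be written as below, followed by the per-service cosine score. This is the textbook form for an undirected graph with adjacency matrix $A$ and node features $X$; how the paper adapts it to directed, weighted graphs is not described here, so the symbols $W_0$, $W_1$, $z_i^{\text{test}}$, $z_i^{\text{event}}$, and $s_i$ are notation assumed for this sketch.

```latex
\begin{aligned}
\tilde{A} &= A + I, \qquad \tilde{D}_{ii} = \sum_j \tilde{A}_{ij} \\
Z &= \tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}\,
     \mathrm{ReLU}\!\left(\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2} X W_0\right) W_1 \\
\hat{A} &= \sigma\!\left(Z Z^{\top}\right) \\
s_i &= \frac{z_i^{\text{test}} \cdot z_i^{\text{event}}}
            {\lVert z_i^{\text{test}} \rVert \, \lVert z_i^{\text{event}} \rVert}
\end{aligned}
```

Here row $i$ of $Z$ is the embedding of service $i$, $\hat{A}$ is the reconstructed adjacency used for the training loss, and a low $s_i$ marks service $i$ as behaving differently between the two conditions.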

If this is right

  • Load testing procedures can be augmented by incorporating live event embeddings to cover a wider range of service behaviors.
  • Anomalies can be detected early enough to allow intervention before full incidents develop.
  • The synthetic anomaly injection framework supplies a repeatable method for measuring precision and recall of graph embedding detectors under controlled conditions.
  • The same embedding comparison technique extends naturally to monitoring other distributed microservice systems.
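The precision, recall, and false-positive-rate figures referred to above follow from a confusion matrix over injected versus flagged services. A minimal sketch of that arithmetic, using toy counts that are not taken from the paper:

```python
def detection_metrics(tp, fp, fn, tn):
    """Compute precision, recall, and false positive rate from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "false_positive_rate": false_positive_rate,
    }


# Toy counts for illustration only: 5 injected anomalies among 100 services.
metrics = detection_metrics(tp=3, fp=1, fn=2, tn=94)
```

Because normal services vastly outnumber anomalous ones in a service graph, a detector can show a very low false positive rate and high precision while still missing many injected anomalies, which is the pattern the abstract reports.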

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Combining the embedding approach with additional signals such as latency or error rates could raise recall without sacrificing the observed precision.
  • Minute-level graph resolution suggests the method could support near-real-time alerting if embedding computation is optimized for streaming data.
  • The conservative propagation assumptions in the synthetic framework likely produce lower recall bounds than would be seen with actual incident data.

Load-bearing premise

Unsupervised embeddings from the GCN-GAE combined with cosine similarity will reliably separate real-event service behaviors from those seen in load tests.

What would settle it

Checking whether services flagged by low cosine similarity during a documented live incident match the services identified as root causes in the corresponding incident reports.
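Such a check reduces to a set comparison between the detector's output and the incident report. The sketch below assumes both are available as lists of service names; the function name and the example services are hypothetical.

```python
def compare_with_incident_report(flagged, documented):
    """Split services into confirmed, missed, and unexplained groups."""
    flagged, documented = set(flagged), set(documented)
    return {
        "confirmed": sorted(flagged & documented),    # flagged and in the report
        "missed": sorted(documented - flagged),       # in the report, not flagged
        "unexplained": sorted(flagged - documented),  # flagged, not in the report
    }


# Hypothetical service names for illustration.
result = compare_with_incident_report(
    flagged=["auth", "playback"],
    documented=["playback", "cdn-router"],
)
```

The `unexplained` group is not necessarily a set of false positives: an incident report may simply omit services that behaved abnormally without causing customer impact.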

Figures

Figures reproduced from arXiv: 2604.06448 by Elliott Nash, Mayur Kurup, Pranesh Vyas, Srinidhi Madabhushi, Swathi Vaidyanathan, Yegor Silyutin.

Figure 1. End-to-end pipeline for training and evaluating [...]
Figure 2. PCA over service embeddings showing natural [...]
Figure 3. A brief dip in cosine similarity for a service over a [...]
Figure 4. A shift in cosine similarity after a code deployment [...]
original abstract

Prime Video regularly conducts load tests to simulate the viewer traffic spikes seen during live events such as Thursday Night Football as well as video-on-demand (VOD) events such as Rings of Power. While these stress tests validate system capacity, they can sometimes miss service behaviors unique to real event traffic. We present a graph-based anomaly detection system that identifies under-represented services using unsupervised node-level graph embeddings. Built on a GCN-GAE, our approach learns structural representations from directed, weighted service graphs at minute-level resolution and flags anomalies based on cosine similarity between load test and event embeddings. The system identifies incident-related services that are documented and demonstrates early detection capability. We also introduce a preliminary synthetic anomaly injection framework for controlled evaluation that shows promising precision (96%) and low false positive rate (0.08%), though recall (58%) remains limited under conservative propagation assumptions. This framework demonstrates practical utility within Prime Video while also surfacing methodological lessons and directions, providing a foundation for broader application across microservice ecosystems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents a graph-based anomaly detection system for microservice architectures that learns unsupervised node embeddings via a GCN-GAE on minute-level directed, weighted service call graphs. Embeddings from load-test and live-event traffic are compared via cosine similarity to flag services exhibiting behaviors unique to real events. The paper claims this approach identifies documented incident-related services and enables early detection. A synthetic anomaly injection framework is introduced for controlled evaluation, reporting 96% precision, 0.08% false-positive rate, and 58% recall under conservative propagation assumptions.

Significance. If the real-incident identification claim were supported by quantitative metrics, the work would provide a practical unsupervised method for distinguishing load-test from live-event behaviors in large-scale streaming microservices, potentially improving reliability engineering. The synthetic evaluation framework is a constructive contribution that enables reproducible testing of graph-based anomaly detectors on service topologies.

major comments (2)
  1. Abstract and evaluation sections: The claim that the system 'identifies incident-related services that are documented' and 'demonstrates early detection capability' on real events lacks any quantitative support (no incident counts, no precision/recall/F1 on production data, no list of flagged vs. documented services, and no baseline comparisons). All reported metrics derive exclusively from the synthetic framework; this is load-bearing for the central practical-utility assertion.
  2. Method and evaluation: The assumption that GCN-GAE embeddings plus cosine similarity reliably isolate behaviors 'unique to real event traffic' versus load tests is not accompanied by ablations, comparisons to simpler graph features (e.g., degree or weight statistics), or analysis of embedding stability across the minute-level graphs.
minor comments (2)
  1. Synthetic framework: Provide explicit details on the anomaly injection procedure, including how 'conservative propagation assumptions' are operationalized on the directed weighted graphs and how injected anomalies relate to observed real-incident patterns.
  2. Notation and reproducibility: Define the precise anomaly-score formula, the cosine-similarity threshold, and the embedding dimensionality; include pseudocode or a small worked example of the embedding-to-flag pipeline.
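To make the first minor comment concrete, one possible way to operationalize injection with limited propagation is sketched below. This is purely hypothetical and is not the paper's procedure, which is not described here: it assumes the graph is a dict mapping `(caller, callee)` to a call count, scales the outgoing edges of a chosen service by a made-up `factor`, and labels only services within `hops` downstream steps as anomalous.

```python
def inject_anomaly(edges, target, factor=3.0, hops=1):
    """Return (perturbed_edges, affected_services) without mutating `edges`.

    edges:  dict mapping (caller, callee) -> call count for one minute.
    target: service whose outgoing traffic is scaled by `factor`.
    hops:   how many downstream steps the perturbation propagates
            (hops=1 is the most conservative setting in this sketch).
    """
    perturbed = dict(edges)
    affected = {target}
    frontier = {target}
    for _ in range(hops):
        next_frontier = set()
        for (caller, callee), weight in edges.items():
            if caller in frontier:
                perturbed[(caller, callee)] = weight * factor
                if callee not in affected:
                    next_frontier.add(callee)
        affected |= next_frontier
        frontier = next_frontier
    return perturbed, affected


# Hypothetical three-edge call chain.
edges = {
    ("gateway", "playback"): 100,
    ("playback", "drm"): 40,
    ("drm", "keystore"): 10,
}
perturbed, affected = inject_anomaly(edges, target="playback")
```

With `hops=1`, only `playback` and its direct callee `drm` count as ground-truth anomalies, so a detector that also reacts at `keystore` would be scored as producing a false positive. Stating this choice explicitly is what the referee's request amounts to.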

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below and indicate the planned revisions.

point-by-point responses
  1. Referee: Abstract and evaluation sections: The claim that the system 'identifies incident-related services that are documented' and 'demonstrates early detection capability' on real events lacks any quantitative support (no incident counts, no precision/recall/F1 on production data, no list of flagged vs. documented services, and no baseline comparisons). All reported metrics derive exclusively from the synthetic framework; this is load-bearing for the central practical-utility assertion.

    Authors: We agree with the referee that the real-event claims in the abstract and evaluation lack quantitative backing. Our statements are derived from qualitative observations in specific production incidents where the flagged services aligned with documented issues. However, we did not perform systematic counting or metric computation on real data. In the revised version, we will revise the abstract to accurately reflect the scope of our claims, add a limitations subsection on real-world evaluation challenges, and provide more context on the case studies without revealing sensitive details. revision: yes

  2. Referee: Method and evaluation: The assumption that GCN-GAE embeddings plus cosine similarity reliably isolate behaviors 'unique to real event traffic' versus load tests is not accompanied by ablations, comparisons to simpler graph features (e.g., degree or weight statistics), or analysis of embedding stability across the minute-level graphs.

    Authors: We appreciate this suggestion for strengthening the methodological validation. We will add ablations in the evaluation section comparing the GCN-GAE approach to simpler baselines using graph features such as in/out-degree, total call weights, and average edge weights. Additionally, we will include an analysis of embedding stability by computing cosine similarities across consecutive minute graphs from the same load test or event. These additions will help demonstrate the necessity of the learned embeddings. revision: yes
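    The two additions proposed in this response are simple to prototype. The sketch below is an assumption-laden illustration, not the authors' code: it takes one minute's graph as a dict mapping `(caller, callee)` to a weight, computes the named baseline features per service, and measures stability as the mean cosine similarity between a service's embeddings in consecutive minutes. All function and service names are hypothetical.

```python
import math
from collections import defaultdict


def baseline_features(edges):
    """Per-service in/out-degree, total call weight, and average edge weight."""
    feats = defaultdict(
        lambda: {"in_degree": 0, "out_degree": 0, "total_weight": 0.0}
    )
    for (caller, callee), weight in edges.items():
        feats[caller]["out_degree"] += 1
        feats[caller]["total_weight"] += weight
        feats[callee]["in_degree"] += 1
        feats[callee]["total_weight"] += weight
    for f in feats.values():
        # Every service present touches at least one edge, so degree >= 1.
        degree = f["in_degree"] + f["out_degree"]
        f["avg_edge_weight"] = f["total_weight"] / degree
    return dict(feats)


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_stability(vectors_in_time_order):
    """Mean cosine similarity between consecutive-minute embeddings."""
    pairs = zip(vectors_in_time_order, vectors_in_time_order[1:])
    sims = [cosine(a, b) for a, b in pairs]
    return sum(sims) / len(sims) if sims else None


# Hypothetical one-minute graph and a three-minute embedding history.
feats = baseline_features({("gateway", "playback"): 100.0, ("playback", "drm"): 40.0})
stability = embedding_stability([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

    If within-condition stability is already low, a low cross-condition similarity says little, so the stability figure serves as the noise floor against which any flagging threshold should be read.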

standing simulated objections not resolved
  • Exhaustive quantitative metrics (e.g., precision/recall on all documented incidents) and a complete list of flagged services from production data will not be provided, owing to confidentiality requirements and the practical difficulty of obtaining complete ground-truth labels for all services in large-scale microservice architectures.

Circularity Check

0 steps flagged

No significant circularity; empirical unsupervised method with independent synthetic evaluation

full rationale

The paper describes an unsupervised GCN-GAE embedding pipeline on directed weighted service graphs, followed by cosine similarity for anomaly flagging between load-test and event embeddings. No equations, parameters, or derivations are shown that reduce the anomaly score or incident identification to a fitted value by construction. The synthetic anomaly injection framework is presented as a separate controlled evaluation tool (with its own precision/recall numbers) and does not feed back into or define the core embedding method. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain is therefore self-contained as an applied empirical procedure rather than a tautology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard domain assumptions about graph modeling of microservices and the ability of unsupervised embeddings to surface structural differences; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (2)
  • domain assumption: Microservice interactions can be represented as directed, weighted graphs at minute-level resolution.
    Invoked when constructing the input graphs from service call data.
  • domain assumption: Node embeddings learned by GCN-GAE capture the structural features needed to distinguish load-test from live-event behavior.
    Core premise underlying the anomaly flagging via cosine similarity.
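
The first axiom can be illustrated with a short sketch of graph construction. It assumes call records arrive as `(epoch_seconds, caller, callee)` tuples and that the edge weight is the call count within the minute; both the record layout and the choice of weight are assumptions, since the paper's exact definitions are not given here.

```python
from collections import defaultdict


def build_minute_graphs(call_records):
    """Group call records into one directed, weighted edge dict per minute.

    Returns {minute_index: {(caller, callee): call_count}}.
    """
    graphs = defaultdict(lambda: defaultdict(int))
    for ts, caller, callee in call_records:
        minute = int(ts // 60)  # bucket by whole minutes since the epoch
        graphs[minute][(caller, callee)] += 1
    return {minute: dict(edges) for minute, edges in graphs.items()}


# Hypothetical records: three calls in minute 0, one in minute 1.
records = [
    (0, "gateway", "playback"),
    (30, "gateway", "playback"),
    (59, "playback", "drm"),
    (61, "gateway", "playback"),
]
graphs = build_minute_graphs(records)
```

Direction is preserved because `(caller, callee)` and `(callee, caller)` are distinct keys, which matters when a service's inbound and outbound traffic diverge.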

pith-pipeline@v0.9.0 · 5508 in / 1377 out tokens · 85788 ms · 2026-05-10T19:47:03.772798+00:00 · methodology

