From Load Tests to Live Streams: Graph Embedding-Based Anomaly Detection in Microservice Architectures
Pith reviewed 2026-05-10 19:47 UTC · model grok-4.3
The pith
Graph embeddings from service call graphs identify microservice behaviors unique to live events versus load tests.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that node-level embeddings from a GCN-GAE applied to minute-resolution directed weighted service graphs enable anomaly detection by comparing cosine similarity scores between embeddings computed on load test data and live event data, thereby identifying under-represented services whose behaviors are missed by load tests alone.
What carries the argument
A GCN-GAE (a graph convolutional network used as a graph autoencoder) learns unsupervised structural node embeddings from the service call graph, and cosine similarity quantifies behavioral divergence between test and live conditions.
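As a minimal sketch of the comparison step, assuming each service already has one embedding vector per condition, the snippet below scores services by cosine similarity and flags those that fall under a cut-off. The service names, toy vectors, function names, and the `threshold` value are illustrative assumptions; the paper's actual threshold and embedding dimensionality are not stated here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        # Assumption: an all-zero embedding is treated as dissimilar.
        return 0.0
    return dot / (norm_u * norm_v)


def flag_divergent_services(load_test_emb, event_emb, threshold=0.5):
    """Return {service: score} for services whose similarity is below threshold.

    `threshold` is a hypothetical cut-off, not a value from the paper.
    """
    flagged = {}
    for service, event_vec in event_emb.items():
        if service not in load_test_emb:
            continue  # Assumption: only compare services seen in both runs.
        score = cosine_similarity(load_test_emb[service], event_vec)
        if score < threshold:
            flagged[service] = score
    return flagged


# Hypothetical two-dimensional embeddings for two made-up services.
load_test_emb = {"playback": [1.0, 0.0], "catalog": [0.6, 0.8]}
event_emb = {"playback": [0.9, 0.1], "catalog": [-0.8, 0.6]}
flagged = flag_divergent_services(load_test_emb, event_emb)
```

Here only the made-up `catalog` service is flagged, because its event embedding is orthogonal to its load-test embedding, while `playback` stays close to its load-test direction.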
If this is right
- Load testing procedures can be augmented by incorporating live event embeddings to cover a wider range of service behaviors.
- Anomalies can be detected early enough to allow intervention before full incidents develop.
- The synthetic anomaly injection framework supplies a repeatable method for measuring precision and recall of graph embedding detectors under controlled conditions.
- The same embedding comparison technique extends naturally to monitoring other distributed microservice systems.
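The precision, recall, and false-positive rate mentioned above can be computed from a flagged set and an injected (ground-truth) set. The following is a sketch under the assumption that evaluation is done per service; the function name, the service identifiers, and the toy counts are hypothetical and do not reproduce the paper's results.

```python
def detection_metrics(flagged, anomalous, all_services):
    """Compute precision, recall and false-positive rate over a set of services."""
    flagged = set(flagged)
    anomalous = set(anomalous)
    all_services = set(all_services)

    tp = len(flagged & anomalous)                 # flagged and truly anomalous
    fp = len(flagged - anomalous)                 # flagged but normal
    fn = len(anomalous - flagged)                 # anomalous but missed
    tn = len(all_services - flagged - anomalous)  # normal and not flagged

    return {
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "false_positive_rate": fp / (fp + tn) if fp + tn else 0.0,
    }


# Toy example: 10 services, 4 with injected anomalies, 3 flagged by a detector.
all_services = [f"svc-{i}" for i in range(10)]
injected = ["svc-0", "svc-1", "svc-2", "svc-3"]
flagged = ["svc-0", "svc-1", "svc-7"]
metrics = detection_metrics(flagged, injected, all_services)
```

With these toy inputs the detector has two true positives, one false positive, two misses, and five true negatives.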
Where Pith is reading between the lines
- Combining the embedding approach with additional signals such as latency or error rates could raise recall without sacrificing the observed precision.
- Minute-level graph resolution suggests the method could support near-real-time alerting if embedding computation is optimized for streaming data.
- The conservative propagation assumptions in the synthetic framework likely produce lower recall bounds than would be seen with actual incident data.
Load-bearing premise
Unsupervised embeddings from the GCN-GAE combined with cosine similarity will reliably separate real-event service behaviors from those seen in load tests.
What would settle it
Checking whether services flagged by low cosine similarity during a documented live incident match the services identified as root causes in the corresponding incident reports.
Original abstract
Prime Video regularly conducts load tests to simulate the viewer traffic spikes seen during live events such as Thursday Night Football as well as video-on-demand (VOD) events such as Rings of Power. While these stress tests validate system capacity, they can sometimes miss service behaviors unique to real event traffic. We present a graph-based anomaly detection system that identifies under-represented services using unsupervised node-level graph embeddings. Built on a GCN-GAE, our approach learns structural representations from directed, weighted service graphs at minute-level resolution and flags anomalies based on cosine similarity between load test and event embeddings. The system identifies incident-related services that are documented and demonstrates early detection capability. We also introduce a preliminary synthetic anomaly injection framework for controlled evaluation that shows promising precision (96%) and low false positive rate (0.08%), though recall (58%) remains limited under conservative propagation assumptions. This framework demonstrates practical utility within Prime Video while also surfacing methodological lessons and directions, providing a foundation for broader application across microservice ecosystems.
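To make the "GCN-GAE" in the abstract concrete, here is a rough sketch of a single forward pass: one graph-convolution layer as the encoder and an inner-product decoder, following the general graph autoencoder recipe. Everything specific is an assumption for illustration: the row normalisation of A + I (chosen here because the graph is directed), the one-hot node features, the fixed untrained weights, and the three-node toy graph. The paper's actual architecture, features, and training procedure are not described in this text, and the training loop is omitted.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize_with_self_loops(adj):
    """Row-normalise A + I so each node averages over itself and its out-neighbours."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    return [[value / sum(row) for value in row] for row in a_hat]


def gcn_encode(adj, features, weights):
    """One graph-convolution layer: ReLU(A_norm @ X @ W)."""
    propagated = matmul(normalize_with_self_loops(adj), features)
    return [[max(0.0, value) for value in row] for row in matmul(propagated, weights)]


def decode(embeddings):
    """Inner-product decoder: sigmoid(Z @ Z^T) as edge probabilities."""
    transposed = [list(col) for col in zip(*embeddings)]
    scores = matmul(embeddings, transposed)
    return [[1.0 / (1.0 + math.exp(-s)) for s in row] for row in scores]


# Toy directed, weighted call graph: node 0 -> node 1 (weight 2), node 1 -> node 2 (weight 1).
adj = [[0.0, 2.0, 0.0],
       [0.0, 0.0, 1.0],
       [0.0, 0.0, 0.0]]
features = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]  # one-hot node features
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.6]]                # untrained, fixed for the demo

z = gcn_encode(adj, features, weights)  # one 2-dimensional embedding per node
reconstruction = decode(z)              # 3 x 3 matrix of edge probabilities
```

In a trained autoencoder the weights would be fitted so that the reconstruction matches the observed graph; the per-node rows of `z` are what a downstream cosine comparison would consume.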
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a graph-based anomaly detection system for microservice architectures that learns unsupervised node embeddings via a GCN-GAE on minute-level directed, weighted service call graphs. Embeddings from load-test and live-event traffic are compared via cosine similarity to flag services exhibiting behaviors unique to real events. The paper claims this approach identifies documented incident-related services and enables early detection. A synthetic anomaly injection framework is introduced for controlled evaluation, reporting 96% precision, 0.08% false-positive rate, and 58% recall under conservative propagation assumptions.
Significance. If the real-incident identification claim were supported by quantitative metrics, the work would provide a practical unsupervised method for distinguishing load-test from live-event behaviors in large-scale streaming microservices, potentially improving reliability engineering. The synthetic evaluation framework is a constructive contribution that enables reproducible testing of graph-based anomaly detectors on service topologies.
major comments (2)
- Abstract and evaluation sections: The claim that the system 'identifies incident-related services that are documented' and 'demonstrates early detection capability' on real events lacks any quantitative support (no incident counts, no precision/recall/F1 on production data, no list of flagged vs. documented services, and no baseline comparisons). All reported metrics derive exclusively from the synthetic framework; this is load-bearing for the central practical-utility assertion.
- Method and evaluation: The assumption that GCN-GAE embeddings plus cosine similarity reliably isolate behaviors 'unique to real event traffic' versus load tests is not accompanied by ablations, comparisons to simpler graph features (e.g., degree or weight statistics), or analysis of embedding stability across the minute-level graphs.
minor comments (2)
- Synthetic framework: Provide explicit details on the anomaly injection procedure, including how 'conservative propagation assumptions' are operationalized on the directed weighted graphs and how injected anomalies relate to observed real-incident patterns.
- Notation and reproducibility: Define the precise anomaly-score formula, the cosine-similarity threshold, and the embedding dimensionality; include pseudocode or a small worked example of the embedding-to-flag pipeline.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments on our manuscript. We address each major comment below and indicate the planned revisions.
- Referee: Abstract and evaluation sections: The claim that the system 'identifies incident-related services that are documented' and 'demonstrates early detection capability' on real events lacks any quantitative support (no incident counts, no precision/recall/F1 on production data, no list of flagged vs. documented services, and no baseline comparisons). All reported metrics derive exclusively from the synthetic framework; this is load-bearing for the central practical-utility assertion.
  Authors: We agree that the real-event claims in the abstract and evaluation lack quantitative backing. They derive from qualitative observations of specific production incidents in which the flagged services aligned with documented issues; we did not systematically count incidents or compute metrics on real data. In the revision, we will rewrite the abstract to reflect the actual scope of our claims, add a limitations subsection on real-world evaluation challenges, and give more context on the case studies without revealing sensitive details.
  Revision: yes
- Referee: Method and evaluation: The assumption that GCN-GAE embeddings plus cosine similarity reliably isolate behaviors 'unique to real event traffic' versus load tests is not accompanied by ablations, comparisons to simpler graph features (e.g., degree or weight statistics), or analysis of embedding stability across the minute-level graphs.
  Authors: We appreciate this suggestion for strengthening the methodological validation. We will add ablations in the evaluation section comparing the GCN-GAE approach to simpler baselines using graph features such as in/out-degree, total call weights, and average edge weights. Additionally, we will include an analysis of embedding stability by computing cosine similarities across consecutive minute graphs from the same load test or event. These additions will help demonstrate the necessity of the learned embeddings.
  Revision: yes
- Providing exhaustive quantitative metrics (e.g., precision/recall on all documented incidents) or a complete list of flagged services from production data, due to confidentiality requirements and the practical difficulty of obtaining complete ground-truth labels for all services in large-scale microservice architectures.
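The simpler baselines named in the rebuttal above (in/out-degree, total call weight, average edge weight) could look roughly like the sketch below. It assumes one graph is given as a list of unique `(caller, callee, weight)` edges; the function name, the dictionary keys, and the example services are hypothetical and not taken from the paper.

```python
from collections import defaultdict


def baseline_features(edges):
    """Per-service degree and weight statistics for one directed, weighted graph.

    `edges` is an iterable of (caller, callee, weight) tuples, assumed to contain
    each (caller, callee) pair at most once.
    """
    stats = defaultdict(
        lambda: {"in_degree": 0, "out_degree": 0, "in_weight": 0.0, "out_weight": 0.0}
    )
    for caller, callee, weight in edges:
        stats[caller]["out_degree"] += 1
        stats[caller]["out_weight"] += weight
        stats[callee]["in_degree"] += 1
        stats[callee]["in_weight"] += weight

    for service_stats in stats.values():
        # Every service here touches at least one edge, so the divisor is non-zero.
        edge_count = service_stats["in_degree"] + service_stats["out_degree"]
        total_weight = service_stats["in_weight"] + service_stats["out_weight"]
        service_stats["avg_edge_weight"] = total_weight / edge_count
    return dict(stats)


# Hypothetical one-minute graph with made-up services and call volumes.
edges = [
    ("gateway", "playback", 120.0),
    ("gateway", "catalog", 30.0),
    ("playback", "catalog", 10.0),
]
features = baseline_features(edges)
```

Comparing such hand-built feature vectors between load-test and event graphs would give a reference point for judging how much the learned embeddings add.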
Circularity Check
No significant circularity; empirical unsupervised method with independent synthetic evaluation
The paper describes an unsupervised GCN-GAE embedding pipeline on directed weighted service graphs, followed by cosine similarity for anomaly flagging between load-test and event embeddings. No equations, parameters, or derivations are shown that reduce the anomaly score or incident identification to a fitted value by construction. The synthetic anomaly injection framework is presented as a separate controlled evaluation tool (with its own precision/recall numbers) and does not feed back into or define the core embedding method. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The derivation chain is therefore self-contained as an applied empirical procedure rather than a tautology.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: Microservice interactions can be represented as directed, weighted graphs at minute-level resolution
- domain assumption: Node embeddings learned by GCN-GAE capture the structural features needed to distinguish load-test from live-event behavior
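The first assumption can be illustrated with a small sketch that buckets call records into one directed, weighted graph per minute. It assumes, for illustration only, that records are `(epoch_seconds, caller, callee)` tuples and that the edge weight is the call count within the minute; the paper's actual data source and weight definition are not given in this text, and the function and service names are hypothetical.

```python
from collections import defaultdict


def build_minute_graphs(call_records):
    """Group call records into per-minute directed, weighted edge maps.

    Returns {minute_start_epoch: {(caller, callee): call_count}}.
    """
    graphs = defaultdict(lambda: defaultdict(int))
    for timestamp, caller, callee in call_records:
        minute_start = int(timestamp) // 60 * 60  # floor to the start of the minute
        graphs[minute_start][(caller, callee)] += 1
    return {minute: dict(edge_counts) for minute, edge_counts in graphs.items()}


# Made-up call records spanning two minutes.
records = [
    (0, "gateway", "playback"),
    (15, "gateway", "playback"),
    (59, "gateway", "catalog"),
    (60, "gateway", "playback"),
]
graphs = build_minute_graphs(records)
```

Each per-minute edge map is directed, since `(caller, callee)` is ordered, and weighted by call volume.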