A Topology-Aware, Memory-Centric Architecture that Separates Root-Cause Derivation from Root-Cause Explanation

Momil Seedat

arxiv: 2606.20758 · v1 · pith:WFHU5SWInew · submitted 2026-06-18 · 💻 cs.SE · cs.AI

Pith reviewed 2026-06-26 16:49 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords root cause analysis, microservices, operational memory, dependency graphs, LLM, cascading failures, observability, anomaly detection

The pith

Operational memory of dependencies and failures enables deterministic root cause derivation separate from LLM explanation

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

When faults cascade through microservice dependencies, alerts arrive in floods and context is lost. The paper claims the solution is not better detectors or bigger models but operational memory that persistently records normal behavior, service dependencies, and past failures. This memory supports computing root causes directly from the dependency graph and the sequence of alert triggers. The LLM is then used only to explain the already-derived cause and suggest actions. The approach is shown in a prototype that handles eight failure scenarios on an e-commerce benchmark.

Core claim

OPS CORTEX organizes operational memory into four tiers to keep track of system topology and history. Root cause is derived deterministically by traversing the learned dependency graph according to the order in which thresholds were crossed. The language model receives only the assembled evidence to produce explanations, confirmations, and recommendations rather than performing the derivation itself.
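A minimal sketch of what such a derivation could look like, under stated assumptions. The text above does not give the paper's actual traversal rule, so the rule used here (blame the earliest-alerting service whose direct dependencies did not alert at or before it) is an illustrative stand-in. The names `derive_root_cause`, `depends_on`, `first_crossing`, and all service names are hypothetical.

```python
from typing import Dict, List, Optional


def derive_root_cause(
    depends_on: Dict[str, List[str]],
    first_crossing: Dict[str, float],
) -> Optional[str]:
    """Pick the alerting service that cannot be blamed on a dependency.

    depends_on maps a service to the services it calls (assumed schema).
    first_crossing maps each alerting service to the time its threshold
    was first crossed (assumed schema).
    """
    candidates = []
    for service, crossed_at in first_crossing.items():
        # A service counts as "explained" if one of its direct dependencies
        # crossed its own threshold at or before the same moment.
        explained = any(
            dep in first_crossing and first_crossing[dep] <= crossed_at
            for dep in depends_on.get(service, [])
        )
        if not explained:
            candidates.append((crossed_at, service))
    # Earliest unexplained alert wins; ties break on service name, so the
    # same inputs always produce the same answer.
    return min(candidates)[1] if candidates else None


# Hypothetical e-commerce topology and alert timestamps (seconds).
depends_on = {
    "frontend": ["checkout"],
    "checkout": ["payments", "inventory"],
    "payments": ["db"],
    "inventory": ["db"],
}
first_crossing = {"db": 10.0, "payments": 12.5, "checkout": 14.0, "frontend": 15.2}

root = derive_root_cause(depends_on, first_crossing)
print(root)
```

The function reads only the graph and the timestamps, which is what makes the step deterministic. It also shows the limit of the approach: a fault in a component that neither appears in the graph nor raises its own alert cannot be returned.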

What carries the argument

Four-tier operational memory structure that stores dependency relations and failure history to support deterministic computation of root causes from graphs and temporal alert sequences

Load-bearing premise

The combination of a learned dependency graph and the timing of threshold crossings provides enough information to identify root causes without false correlations or overlooked context

What would settle it

A test case in which the order of alerts leads the method to an incorrect root cause while the true cause is a service not captured in the dependency graph

Figures

Figures reproduced from arXiv: 2606.20758 by Momil Seedat.

**Figure 1.** The four memory tiers, ordered by lifetime. Solid arrows are the live write path; dashed arrows are the permanent store reconstituting the volatile tiers on restart, so the system is never blind.

…erogeneous signals as open challenges. Our design can be read as a deliberate combination of two families the survey treats separately (time-series baselining for detection and graph reasoning for localization),…

**Figure 2.** The pipeline. Three always-on agents observe the system; the diagnosis path is deliberately split into a deterministic derivation stage and a generative explanation stage.

4.1 Agents and cold start. A monitoring agent polls metrics and writes hot state and an alert stream to Tier 1, accepting any service label it sees so that no allow-list is required. A topology agent rebuilds the live graph from ho…
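The split between a deterministic derivation stage and a generative explanation stage can be sketched as follows. This is an illustration under assumptions, not the prototype's code: the `Evidence` schema, `build_prompt`, `explain`, and `stub_llm` are hypothetical names, and the language model is replaced by a plain callable so the example runs without any external service.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Evidence:
    """Output of the deterministic stage (hypothetical schema)."""
    root_cause: str
    propagation_path: List[str]
    alert_order: List[str]


def build_prompt(evidence: Evidence) -> str:
    # The model is told the root cause; it is not asked to find it.
    return (
        "The root cause below was already derived; do not change it.\n"
        f"Root cause: {evidence.root_cause}\n"
        f"Propagation path: {' -> '.join(evidence.propagation_path)}\n"
        f"Alert order: {', '.join(evidence.alert_order)}\n"
        "Explain the failure, flag inconsistencies, and recommend actions."
    )


def explain(evidence: Evidence, llm: Callable[[str], str]) -> Dict[str, str]:
    # The reported root cause is copied from the evidence and is never
    # parsed out of the model's reply.
    return {
        "root_cause": evidence.root_cause,
        "explanation": llm(build_prompt(evidence)),
    }


def stub_llm(prompt: str) -> str:
    # Stand-in for a real model call.
    return f"(stub) received {len(prompt.splitlines())} lines of evidence"


evidence = Evidence(
    root_cause="db",
    propagation_path=["db", "payments", "checkout", "frontend"],
    alert_order=["db", "payments", "checkout", "frontend"],
)
report = explain(evidence, stub_llm)
print(report["root_cause"])
print(report["explanation"])
```

Keeping the root cause out of the model's return path means a hallucinated reply can degrade the explanation but cannot change the diagnosis.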

Original abstract

Modern microservice deployments fail in ways that are easy to detect and hard to explain. When a fault propagates along service dependencies, alerts fire in floods, dashboards multiply, and the scarce resource, an engineer who understands how the services relate, is consumed reconstructing context that the monitoring stack discarded. We argue that the missing ingredient in autonomous operations is not a better anomaly detector or a larger language model, but operational memory: a persistent, structured representation of how a system normally behaves, how its parts depend on one another, and how it has failed before. We present OPS CORTEX, a working multi-agent prototype that organizes this memory into four tiers and uses it to separate two tasks the field usually conflates: deriving a root cause and explaining it. Root cause is computed deterministically from a learned dependency graph and the temporal ordering of threshold crossings; a large language model (LLM) is then asked only to explain, confirm, and recommend, using evidence the system has already assembled. We motivate the design with two documented production cascading failures, review representative literature on observability, anomaly detection, graph-based localization, and LLM-assisted diagnosis, and show how each architectural choice maps directly to a failure mode those incidents exhibit. The prototype is validated on an instrumented e-commerce benchmark with eight injectable failure scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper's core idea is a four-tier memory structure that keeps root-cause derivation deterministic from a learned graph and threshold orderings, handing only explanation to the LLM.

The punchline is that OPS CORTEX tries to fix alert fatigue by building persistent operational memory and strictly separating deterministic root-cause computation from LLM-based explanation. That separation is the concrete design move, and the four-tier organization (normal behavior, dependencies, failure history, and assembled evidence) is presented as the missing piece rather than bigger models or better detectors.

What the paper does is lay out the architecture clearly, motivate it with two real cascading failures, map each choice to those failure modes, and position it against graph-based localization and LLM diagnosis work. It also ships a working multi-agent prototype and runs it on an e-commerce benchmark with eight injectable scenarios. That is more than most position papers deliver.

The soft spot is exactly the one the stress-test flags: the deterministic step assumes a learned dependency graph plus temporal ordering of threshold crossings is enough to identify root cause without false correlations. The abstract gives no mechanism for pruning non-causal edges or handling ordering ambiguity, and no quantitative results or error analysis appear in the provided text. If those edges stay in, the separation claim collapses before the LLM is even called. The validation is mentioned but not described with numbers or failure cases, so we cannot yet tell whether the deterministic part works in practice.

This is for people building observability systems in microservices who want a practical pattern rather than another end-to-end LLM agent. It deserves a serious referee because the architectural separation is a clear, falsifiable proposal even if the current evidence is thin.

Referee Report

2 major / 1 minor

Summary. The paper claims that the key missing component for autonomous operations in microservices is not better anomaly detection or larger LLMs but 'operational memory' organized into four tiers; OPS CORTEX is a multi-agent prototype that learns a dependency graph, computes root causes deterministically from that graph plus the temporal order of threshold crossings, and invokes an LLM only for subsequent explanation, confirmation, and recommendations. The design is motivated by two real production cascading failures, positioned against observability and graph-based localization literature, and validated on an instrumented e-commerce benchmark with eight injectable failure scenarios.

Significance. If the deterministic derivation step proves reliable on production data, the four-tier memory architecture and explicit separation of derivation from explanation would constitute a substantive advance over current LLM-centric diagnosis systems, directly addressing context loss in alert floods and providing a persistent, structured representation of normal behavior and past failures. The prototype implementation and explicit mapping of each architectural choice to documented failure modes are concrete strengths.

major comments (2)

[Abstract] Abstract: The central claim that 'Root cause is computed deterministically from a learned dependency graph and the temporal ordering of threshold crossings' provides no mechanism (conditional independence tests, intervention data, or do-calculus) to prune non-causal edges that observational learning routinely encodes (shared resources, background load, metric coupling). This is load-bearing for the separation of derivation from explanation; without it the deterministic step can return incorrect roots before the LLM is ever invoked.
[Validation] Validation section (e-commerce benchmark): The manuscript reports eight injectable scenarios but supplies no quantitative results, precision/recall figures, or comparison against baselines for the deterministic root-cause step itself. This omission prevents assessment of whether the learned graph plus temporal ordering actually succeeds in the presence of the spurious correlations the skeptic note identifies.

minor comments (1)

[Abstract] The four-tier memory organization is introduced in the abstract but never given explicit tier definitions or data structures in the provided text; a table or diagram would clarify how each tier supports the deterministic step.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need to strengthen the causal grounding of the deterministic derivation step and to provide quantitative validation metrics. We address each major comment below, indicating planned revisions where appropriate.

Referee: [Abstract] Abstract: The central claim that 'Root cause is computed deterministically from a learned dependency graph and the temporal ordering of threshold crossings' provides no mechanism (conditional independence tests, intervention data, or do-calculus) to prune non-causal edges that observational learning routinely encodes (shared resources, background load, metric coupling). This is load-bearing for the separation of derivation from explanation; without it the deterministic step can return incorrect roots before the LLM is ever invoked.

Authors: The manuscript does not apply formal causal discovery methods such as conditional independence tests or do-calculus. The dependency graph is constructed from observed service call patterns and topology, which in microservice systems typically encode direct invocation dependencies rather than purely statistical associations. Temporal ordering of threshold crossings is then used to trace propagation along these edges. We agree this leaves open the possibility of residual non-causal edges from shared resources or metric coupling. In revision we will (1) explicitly describe the graph-learning procedure, (2) add a dedicated limitations subsection discussing observational confounding, and (3) note that the subsequent LLM stage operates on the assembled evidence and can surface inconsistencies. These changes preserve the core separation claim while acknowledging its observational basis.
revision: partial

Referee: [Validation] Validation section (e-commerce benchmark): The manuscript reports eight injectable scenarios but supplies no quantitative results, precision/recall figures, or comparison against baselines for the deterministic root-cause step itself. This omission prevents assessment of whether the learned graph plus temporal ordering actually succeeds in the presence of the spurious correlations the skeptic note identifies.

Authors: The current text describes the eight failure scenarios and maps each architectural choice to documented failure modes but does not report numerical performance for the deterministic root-cause step. We accept this gap. The revised manuscript will include precision, recall, and F1 scores for root-cause identification across all scenarios, together with comparisons against two baselines: (a) alert-correlation without the learned graph and (b) graph traversal without temporal ordering. These additions will allow direct evaluation of the deterministic component against the spurious-correlation concern.
revision: yes
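The scoring promised in the second response could be computed along these lines. This is a sketch under assumptions: it supposes one true root and at most one predicted root per scenario, and micro-averages over scenarios. The function name `score_root_causes`, the scenario labels, and the services are hypothetical placeholders, and the toy inputs are not results from the paper.

```python
from typing import Dict, Optional, Tuple


def score_root_causes(
    truth: Dict[str, str],
    predicted: Dict[str, Optional[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall, and F1 over scenarios."""
    tp = fp = fn = 0
    for scenario, true_root in truth.items():
        guess = predicted.get(scenario)
        if guess == true_root:
            tp += 1
        else:
            fn += 1  # the true root was missed
            if guess is not None:
                fp += 1  # a wrong service was blamed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Placeholder inputs for illustration only; not measurements.
truth = {"s1": "db", "s2": "cache", "s3": "payments", "s4": "inventory"}
predicted = {"s1": "db", "s2": "cache", "s3": "checkout", "s4": None}

precision, recall, f1 = score_root_causes(truth, predicted)
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

Running the same function on the outputs of the two proposed baselines would give the side-by-side comparison the referee asks for.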

Circularity Check

0 steps flagged

No significant circularity; derivation presented as independent

The abstract and provided excerpts describe root-cause derivation as a deterministic computation from a learned dependency graph plus temporal ordering of threshold crossings, with the LLM used only for subsequent explanation. No equations, self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations are visible in the text. The separation of derivation from explanation is stated directly without reducing one to the other by construction. This is the most common honest finding for papers whose central claim remains externally falsifiable.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The design rests on domain assumptions about dependency graphs and timing being adequate for root-cause derivation; no free parameters or invented physical entities are described.

axioms (2)

domain assumption: Microservice dependency graphs can be learned accurately enough to support deterministic root-cause computation.
Invoked when stating that root cause is computed from the learned graph.
domain assumption: Temporal ordering of threshold crossings reliably identifies the originating fault.
Central premise of the deterministic derivation step.
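The first axiom can be made concrete with a small sketch of graph learning from observed calls. It assumes, following the authors' rebuttal, that the graph comes from service call patterns. The record format, the `min_count` noise threshold, and the name `learn_dependency_graph` are hypothetical and do not describe the prototype's actual procedure.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def learn_dependency_graph(
    calls: Iterable[Tuple[str, str]],
    min_count: int = 2,
) -> Dict[str, List[str]]:
    """Build caller -> callees edges from (caller, callee) call records.

    Edges seen fewer than min_count times are treated as noise and dropped.
    """
    counts = Counter(calls)
    graph: Dict[str, List[str]] = {}
    # Sorting keeps the edge order stable across runs.
    for (caller, callee), seen in sorted(counts.items()):
        if seen >= min_count and caller != callee:
            graph.setdefault(caller, []).append(callee)
    return graph


# Hypothetical call records, e.g. extracted from trace spans.
calls = (
    [("frontend", "checkout")] * 3
    + [("checkout", "payments")] * 2
    + [("checkout", "db")]  # seen once, below the threshold
)

graph = learn_dependency_graph(calls)
print(graph)
```

A graph built this way records invocation edges only. Coupling through a shared host or a noisy neighbor leaves no call record, so it produces no edge, which is one way the axiom can fail.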

invented entities (1)

Four-tier operational memory (no independent evidence)
purpose: Persistent structured store of normal behavior, dependencies, and past failures.
New architectural component introduced to enable the separation of derivation and explanation.

Reference graph

Works this paper leans on

9 extracted references · 1 canonical work pages

[1] Slack Engineering, “Slack’s Outage on January 4th 2021,” Slack Engineering Blog, Feb. 4 · 2021

[2] https://slack.engineering/slacks-outage-on-january-4th-2021/ · 2021

[3] Slack Engineering, “Slack’s Incident on 2-22-22,” Slack Engineering Blog, Apr. 2022. https://slack.engineering/slacks-incident-on-2-22-22/

[4] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag, “Dapper, a Large-Scale Distributed Systems Tracing Infrastructure,” Technical Report, Google, Inc., 2010

[5] F. T. Liu, K. M. Ting, and Z.-H. Zhou, “Isolation Forest,” in Proc. 8th IEEE Int. Conf. on Data Mining (ICDM), 2008, pp. 413–422

[6] L. Wu, J. Tordsson, E. Elmroth, and O. Kao, “MicroRCA: Root Cause Localization of Performance Issues in Microservices,” in Proc. IEEE/IFIP Network Operations and Management Symp. (NOMS), 2020, pp. 1–9

[7] T. Ahmed, S. Ghosh, C. Bansal, T. Zimmermann, X. Zhang, and S. Rajmohan, “Recommending Root-Cause and Mitigation Steps for Cloud Incidents using Large Language Models,” in Proc. 45th IEEE/ACM Int. Conf. on Software Engineering (ICSE), 2023, pp. 1737–1749

[8] T. Wang and G. Qi, “A Comprehensive Survey on Root Cause Analysis in (Micro) Services: Methodologies, Challenges, and Trends,” arXiv preprint arXiv:2408.00803, 2024

[9] M. Seedat, “OpsCortex: Operational Memory for Self-Diagnosing Microservice Systems,” source code repository, 2025. https://github.com/momil-seedat/ops-cortex
