pith. machine review for the scientific record.

arxiv: 2605.04530 · v1 · submitted 2026-05-06 · 💻 cs.NI · cs.AI

Recognition: unknown

SADE: Symptom-Aware Diagnostic Escalation for LLM-Based Network Troubleshooting

Arunan Sivanathan, Kosta Dekic, Kuan-Hao Tseng, Niruth Bogahawatta, Suranga Seneviratne, Yasod Ginige

Pith reviewed 2026-05-08 16:53 UTC · model grok-4.3

classification 💻 cs.NI cs.AI
keywords LLM agents · network troubleshooting · diagnostic policy · root cause analysis · Cisco methodology · NIKA benchmark · agentic workflows

The pith

SADE encodes the Cisco troubleshooting methodology as an explicit policy that separates evidence acquisition from hypothesis commitment in LLM network agents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that LLM agents for network troubleshooting underperform because they rely on free-form deliberation instead of the disciplined layer-by-layer approach used by human engineers. SADE implements this methodology through a phase-gated workflow and a library of fault-family skills. On held-out data from the NIKA benchmark covering eleven unseen scenarios, this structured policy yields substantial gains in root-cause identification. A controlled comparison shows that much of the improvement comes from the policy rather than the underlying model.

Core claim

SADE pairs a phase-gated diagnostic workflow, which separates evidence acquisition from hypothesis commitment, with a routed library of fault-family skills and high-yield diagnostic helpers. On a held-out 523-incident set of the public NIKA benchmark covering eleven unseen scenarios, SADE improves root-cause F1 by 37 percentage points over a ReAct + GPT-5 baseline; a model-controlled comparison against the same Claude Sonnet backend without the SADE policy attributes 22 of those points to the diagnostic policy alone.

What carries the argument

The symptom-aware diagnostic escalation policy that enforces a classical Cisco-style phase-gated workflow separating evidence collection from hypothesis commitment.
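The paper's own implementation is not reproduced here, but the core idea of a phase-gated policy can be sketched: the agent is barred from committing a root-cause hypothesis until the evidence-collection gate for the current layer is satisfied, and an unexplained symptom escalates it to the next layer. All names, phases, and gate conditions below are illustrative assumptions, not SADE's actual code.

```python
from dataclasses import dataclass, field
from typing import Optional

# Illustrative phases of a Cisco-style layer-by-layer escalation.
PHASES = ["physical", "data_link", "network", "transport", "application"]

@dataclass
class DiagnosticState:
    phase_idx: int = 0
    evidence: dict = field(default_factory=dict)  # phase -> {probe: output}
    hypothesis: Optional[str] = None

def evidence_complete(state, required):
    """Gate: every required probe for the current phase has produced output."""
    phase = PHASES[state.phase_idx]
    return all(p in state.evidence.get(phase, {}) for p in required[phase])

def step(state, required, run_probe, classify):
    """One policy step. While the gate is open, the agent may only choose
    probes (evidence acquisition); only once the gate closes may it either
    commit a hypothesis or escalate to the next layer."""
    phase = PHASES[state.phase_idx]
    if not evidence_complete(state, required):
        probe = next(p for p in required[phase]
                     if p not in state.evidence.get(phase, {}))
        state.evidence.setdefault(phase, {})[probe] = run_probe(phase, probe)
        return state
    if classify(phase, state.evidence[phase]) == "fault_here":  # e.g. an LLM call
        state.hypothesis = f"root cause at {phase} layer"
    else:
        state.phase_idx += 1  # symptom not explained at this layer: escalate
    return state
```

The point of the structure is what it forbids: unlike a free-form ReAct loop, the model cannot jump to a conclusion while the current layer's evidence is incomplete.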

If this is right

  • Root-cause localization performance increases significantly on public network troubleshooting benchmarks for previously unseen scenarios.
  • The diagnostic policy itself, independent of model choice, accounts for a substantial share of the observed gains.
  • LLM agents can benefit from encoding established human engineering methodologies rather than relying solely on free-form reasoning.
  • Structured workflows may help close the gap between current LLM performance and practical deployment thresholds in network operations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar explicit policies could improve LLM performance in other technical domains that have established diagnostic protocols, such as medical or mechanical troubleshooting.
  • Future benchmarks should include more diverse real-world incidents to test if the gains hold beyond the NIKA dataset.
  • Integrating SADE with other LLM techniques like chain-of-thought or tool use might yield further improvements in complex network environments.

Load-bearing premise

That the NIKA benchmark's held-out incidents and eleven unseen scenarios are representative of real-world network troubleshooting, and that the gains do not stem from unstated implementation details or benchmark tuning.

What would settle it

Running SADE and the baseline on a fresh set of incidents drawn from actual production environments outside the NIKA benchmark: a comparable improvement would support the claim, while no significant gain, or a reversal, would undercut it.

Figures

Figures reproduced from arXiv: 2605.04530 by Arunan Sivanathan, Kosta Dekic, Kuan-Hao Tseng, Niruth Bogahawatta, Suranga Seneviratne, Yasod Ginige.

Figure 1: SADE system overview. An LLM agent with specialized …
Figure 2: SADE framework. Actions that interact with the Kathará …
Figure 3: Example workflow for troubleshooting an OSPF error.
Figure 5: Sensitivity to topology size. (a) Mean overall judge score …
Figure 6: SADE library examples: a fault-family skill book (a) …
original abstract

Large language model (LLM) agents are increasingly applied to network troubleshooting, but root-cause localization on public benchmarks remains well below practical deployment thresholds. We argue this is because existing agents do not encode the disciplined, layer-by-layer methodology that human network engineers use, and instead rely on free-form deliberation that conflates evidence acquisition with hypothesis commitment. We present SADE (Symptom-Aware Diagnostic Escalation), an agent that encodes the classical Cisco troubleshooting methodology as an explicit policy. SADE pairs a phase-gated diagnostic workflow, which separates evidence acquisition from hypothesis commitment, with a routed library of fault-family skills and high-yield diagnostic helpers. On a held-out 523 incident set of the public NIKA benchmark covering eleven unseen scenarios, SADE improves root-cause F1 by 37 percentage points over a ReAct + GPT-5 baseline; a model-controlled comparison against the same Claude Sonnet backend without the SADE policy attributes 22 of those points to the diagnostic policy alone, showing that the gain is not a side-effect of the model upgrade.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript presents SADE, an LLM agent for network troubleshooting that encodes Cisco's classical methodology as an explicit policy consisting of a phase-gated diagnostic workflow and a routed library of fault-family skills. On a held-out 523-incident subset of the NIKA benchmark spanning eleven unseen scenarios, SADE reports a 37 percentage point improvement in root-cause F1 score over a ReAct + GPT-5 baseline. A controlled comparison using the same Claude Sonnet model attributes 22 of those points to the SADE policy.

Significance. Should the attribution of gains to the explicit policy hold after verification of implementation details, the work would provide concrete evidence that structured, methodology-driven agent designs can substantially outperform free-form reasoning approaches in domain-specific technical tasks like network root-cause analysis. This has potential implications for agent architectures in other engineering and diagnostic domains.

major comments (2)
  1. [Abstract] Abstract (and §4 evaluation): The model-controlled comparison claims to isolate the effect of the SADE policy by using 'the same Claude Sonnet backend without the SADE policy.' However, it is not specified whether this baseline retains the diagnostic helpers, skill library, and routing logic present in SADE. If the baseline is a standard ReAct loop with reduced tooling, the 22pp F1 gain cannot be unambiguously attributed to the phase-gated workflow and symptom-aware escalation alone.
  2. [§4] §4 (results): The paper reports numerical gains on the 523-incident held-out set without accompanying error bars, confidence intervals, or details on the number of runs or variance across seeds. Given the stochastic nature of LLM agents, this makes it difficult to assess the reliability of the 37pp and 22pp improvements.
minor comments (1)
  1. [Abstract] Abstract: The abstract mentions 'eleven unseen scenarios' but provides no details on how these scenarios were selected, their diversity, or explicit checks for data leakage from the training portion of NIKA.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important areas for clarification and statistical rigor. We address each major comment below and will update the manuscript accordingly.

point-by-point responses
  1. Referee: [Abstract] Abstract (and §4 evaluation): The model-controlled comparison claims to isolate the effect of the SADE policy by using 'the same Claude Sonnet backend without the SADE policy.' However, it is not specified whether this baseline retains the diagnostic helpers, skill library, and routing logic present in SADE. If the baseline is a standard ReAct loop with reduced tooling, the 22pp F1 gain cannot be unambiguously attributed to the phase-gated workflow and symptom-aware escalation alone.

    Authors: We agree the description in the abstract and §4 was insufficiently precise and could lead to this ambiguity. The model-controlled baseline uses the identical Claude Sonnet backend together with the full set of diagnostic helpers, skill library, and routing logic. The sole difference is replacement of the phase-gated workflow and symptom-aware escalation with a standard ReAct deliberation loop. This isolates the contribution of the SADE policy. We will revise the abstract and §4 to state this configuration explicitly. revision: yes
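The ablation design the rebuttal describes, where both arms share the backend, helpers, skill library, and routing and differ only in the deliberation policy, can be made concrete with a hypothetical configuration sketch (field names and values are assumptions for illustration, not the authors' configuration):

```python
from dataclasses import dataclass, replace, asdict

@dataclass(frozen=True)
class AgentConfig:
    backend: str          # LLM backend, shared by both arms
    helpers: bool         # high-yield diagnostic helper scripts
    skill_library: bool   # routed fault-family skills
    policy: str           # "sade" (phase-gated) vs. "react" (free-form)

sade = AgentConfig(backend="claude-sonnet", helpers=True,
                   skill_library=True, policy="sade")

# Model-controlled baseline: identical tooling and backend, with only the
# deliberation policy swapped, so any score delta isolates the policy.
baseline = replace(sade, policy="react")

changed = {k for k in asdict(sade) if asdict(sade)[k] != asdict(baseline)[k]}
```

Under this design, `changed` contains only `policy`, which is exactly the single-variable condition the referee asked the authors to confirm.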

  2. Referee: [§4] §4 (results): The paper reports numerical gains on the 523-incident held-out set without accompanying error bars, confidence intervals, or details on the number of runs or variance across seeds. Given the stochastic nature of LLM agents, this makes it difficult to assess the reliability of the 37pp and 22pp improvements.

    Authors: We acknowledge that variance reporting is essential for LLM-agent evaluations. The original results were obtained from single runs per configuration owing to benchmark scale and API costs. We have now executed additional independent runs across multiple seeds and will add error bars (standard deviation), confidence intervals, and run counts to the revised §4 tables and text. revision: yes
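One inexpensive way to report uncertainty on the already-scored incidents, independent of the extra seeded runs the authors promise, is a percentile bootstrap over per-incident outcomes. A sketch with synthetic numbers (not the paper's data; the helper name is invented):

```python
import random

def bootstrap_ci(per_incident, n_boot=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of a
    per-incident metric (e.g. a paired SADE-minus-baseline score delta)."""
    rng = random.Random(seed)
    n = len(per_incident)
    means = sorted(sum(rng.choices(per_incident, k=n)) / n
                   for _ in range(n_boot))
    return means[int(n_boot * alpha / 2)], means[int(n_boot * (1 - alpha / 2)) - 1]
```

Resampling incidents with replacement captures across-incident variance from a single run per configuration; it does not capture run-to-run LLM stochasticity, which is what the additional seeds address.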

Circularity Check

0 steps flagged

No circularity: empirical gains grounded in held-out benchmark and controlled ablation

full rationale

The paper's central claim is an empirical performance improvement (37pp F1 over ReAct baseline, 22pp attributed to policy via model-controlled comparison) on a public held-out NIKA benchmark covering eleven unseen scenarios. No derivation chain, equations, or first-principles results are presented that reduce the claimed gains to self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The evaluation uses external data and an ablation against the same backend, satisfying the criteria for a self-contained result without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the unproven assumption that the classical Cisco methodology is optimal and generalizable, plus the representativeness of the NIKA benchmark; no free parameters or invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption The classical Cisco troubleshooting methodology provides an effective and general layer-by-layer diagnostic process that can be directly encoded as an LLM policy.
    The paper builds the entire SADE workflow on this premise without additional validation in the abstract.

pith-pipeline@v0.9.0 · 5510 in / 1170 out tokens · 37871 ms · 2026-05-08T16:53:06.713919+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

33 extracted references · 3 canonical work pages · 2 internal anchors

  1. [1] Amazon Web Services, "Summary of the Amazon DynamoDB service disruption in the Northern Virginia (US-EAST-1) region," AWS post-event summary, https://aws.amazon.com/message/101925/, 2025.
  2. [2] A. Angi, A. Sacco, and G. Marchetto, "Llnet: An intent-driven approach to instructing softwarized network devices using a small language model," IEEE Transactions on Network and Service Management, vol. 22, no. 4, pp. 3403–3418, 2025.
  3. [3] Anthropic, "Agent skills," https://docs.claude.com/en/docs/agents-and-tools/agent-skills, 2025.
  4. [4] Anthropic, PBC, "Claude Sonnet 4.6," https://www.anthropic.com/news/claude-sonnet-4-6, 2025.
  5. [5] G. O. Boateng, H. Sami, A. Alagha, H. Elmekki, A. Hammoud, R. Mizouni, A. Mourad, H. Otrok, J. Bentahar, S. Muhaidat, C. Talhi, Z. Dziong, and M. Guizani, "A survey on large language models for communication, network, and service management: Application insights, challenges, and future directions," IEEE Communications Surveys & Tutorials, vol. 28, pp. 52…
  6. [6] G. Bonofiglio, V. Iovinella, G. Lospoto, and G. Di Battista, "Kathará: A container-based framework for implementing network function virtualization and software defined networks," in NOMS 2018 - 2018 IEEE/IFIP Network Operations and Management Symposium, 2018, pp. 1–9.
  7. [7] Y. Chen, H. Xie, M. Ma, Y. Kang, X. Gao, L. Shi, Y. Cao, X. Gao, H. Fan, M. Wen et al., "Automatic root cause analysis via large language models for cloud incidents," in Proceedings of the Nineteenth European Conference on Computer Systems, 2024, pp. 674–688.
  8. [8] D. Donadel, F. Marchiori, L. Pajola, and M. Conti, "Can llms understand computer networks? towards a virtual system administrator," in 2024 IEEE 49th Conference on Local Computer Networks (LCN), 2024, pp. 1–10.
  9. [9] I. ETSI, "Zero touch network & service management (zsm) standards," ETSI, Tech. Rep., 2018.
  10. [10] A. Fogel, S. Fung, L. Pedrosa, M. Walraed-Sullivan, R. Govindan, R. Mahajan, and T. Millstein, "A general approach to network configuration analysis," in 12th USENIX Symposium on Networked Systems Design and Implementation (NSDI 15), 2015, pp. 469–483.
  11. [11] Z. Guo, F. Li, J. Shen, T. Xie, S. Jiang, and X. Wang, "Configreco: Network configuration recommendation with graph neural networks," IEEE Network, vol. 38, no. 1, pp. 7–14, 2023.
  12. [12] Z. Guo, F. Li, T. Xie, X. Wang, and J. Cao, "Netgenius: Routing configuration recommendation based on graph neural network," IEEE Transactions on Networking, 2025.
  13. [13] Y. Huang, H. Du, X. Zhang, D. Niyato, J. Kang, Z. Xiong, S. Wang, and T. Huang, "Large language models for networking: Applications, enabling techniques, and challenges," IEEE Network, vol. 39, no. 1, pp. 235–242, 2024.
  14. [14] B. Ifland, R. Krief, A. Zilberman, E. Duani, M. Ohana, A. Murillo, O. Manor, O. Lavi, K. Hikichi, A. Shabtai et al., "Genet: A multimodal llm-based co-pilot for network topology and configuration," in 2025 IEEE 45th International Conference on Distributed Computing Systems Workshops (ICDCSW). IEEE, 2025, pp. 117–122.
  15. [15] S. Janardhan, "Update about the October 4th outage," Engineering at Meta blog, https://engineering.fb.com/2021/10/04/networking-traffic/outage/, 2021.
  16. [16] O. G. Lira, O. M. Caicedo, and N. L. S. da Fonseca, "Large language models for zero touch network configuration management," IEEE Communications Magazine, vol. 63, no. 7, pp. 146–153, 2025.
  17. [17] A. Mekrache and A. Ksentini, "Llm-enabled intent-driven service configuration for next generation networks," in 2024 IEEE 10th International Conference on Network Softwarization (NetSoft), 2024, pp. 253–257.
  18. [18] S. Minhas, R. Jaswal, A. Sharma, and S. Singla, "Revolutionizing networking: A comprehensive overview of intent-based networking," in 2024 International Conference on Emerging Innovations and Advanced Computing (INNOCOMP), 2024, pp. 463–468.
  19. [19] Model Context Protocol Project, "Model context protocol specification," https://modelcontextprotocol.io, 2024.
  20. [20] OpenAI, "Introducing GPT-5," https://openai.com/index/introducing-gpt-5/, 2025.
  21. [21] S. Piroti, A. Chawla, and T. Zanouda, "Mobile network configuration recommendation using deep generative graph neural network," IEEE Networking Letters, vol. 6, no. 3, pp. 179–182, 2024.
  22. [22] A. Ranjbar, "Troubleshooting methods for Cisco IP networks," https://www.ciscopress.com/articles/article.asp?p=2273070, Jan. 2015, sample chapter from Troubleshooting and Maintaining Cisco IP Networks (TSHOOT) Foundation Learning Guide (CCNP TSHOOT 300-135).
  23. [23] D. Roy, X. Zhang, R. Bhave, C. Bansal, P. Las-Casas, R. Fonseca, and S. Rajmohan, "Exploring llm-based agents for root cause analysis," in Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering, 2024, pp. 208–219.
  24. [24] C. Wang, M. Scazzariello, A. Farshin, S. Ferlin, D. Kostić, and M. Chiesa, "Netconfeval: Can llms facilitate network configuration?" Proceedings of the ACM on Networking, vol. 2, no. CoNEXT2, pp. 1–25, 2024.
  25. [25] C. Wang, X. Zhang, R. Lu, X. Lin, X. Zeng, X. Zhang, Z. An, G. Wu, J. Gao, C. Tian et al., "Towards llm-based failure localization in production-scale networks," in Proceedings of the ACM SIGCOMM 2025 Conference, 2025, pp. 496–511.
  26. [26] G. Wang, Y. Xie, Y. Jiang, A. Mandlekar, C. Xiao, Y. Zhu, L. Fan, and A. Anandkumar, "Voyager: An open-ended embodied agent with large language models," arXiv preprint arXiv:2305.16291, 2023.
  27. [27] H. Wang, A. Abhashkumar, C. Lin, T. Zhang, X. Gu, N. Ma, C. Wu, S. Liu, W. Zhou, Y. Dong et al., "NetAssistant: Dialogue based network diagnosis in data center networks," in 21st USENIX Symposium on Networked Systems Design and Implementation (NSDI 24), 2024, pp. 2011–2024.
  28. [28] Z. Wang, Z. Liu, Y. Zhang, A. Zhong, J. Wang, F. Yin, L. Fan, L. Wu, and Q. Wen, "Rcagent: Cloud root cause analysis by autonomous agents with tool-augmented large language models," in Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 2024, pp. 4966–4974.
  29. [29] Z. Wang, S. Lin, G. Yan, S. Ghorbani, M. Yu, J. Zhou, N. Hu, L. Baruah, S. Peters, S. Kamath, J. Yang, and Y. Zhang, "Intent-driven network management with multi-agent LLMs: The Confucius framework," in Proc. ACM SIGCOMM, 2025.
  30. [30] Z. Wang, A. Cornacchia, A. Sacco, F. Galante, M. Canini, and D. Jiang, "A network arena for benchmarking ai agents on network troubleshooting," arXiv preprint arXiv:2512.16381, 2025.
  31. [31] D. Wu, X. Wang, Y. Qiao, Z. Wang, J. Jiang, S. Cui, and F. Wang, "Netllm: Adapting large language models for networking," in Proceedings of the ACM SIGCOMM 2024 Conference, 2024, pp. 661–678.
  32. [32] Xona Partners Inc., "Assessment of Rogers networks for resiliency and reliability following the 8 July 2022 outage," Canadian Radio-television and Telecommunications Commission, Independent assessment report BC92-130/1-2024E-PDF, Nov. 2024, https://crtc.gc.ca/eng/publications/reports/xonarp2023.htm.
  33. [33] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao, "ReAct: Synergizing reasoning and acting in language models," arXiv preprint arXiv:2210.03629, 2022.