SADE: Symptom-Aware Diagnostic Escalation for LLM-Based Network Troubleshooting
Pith reviewed 2026-05-08 16:53 UTC · model grok-4.3
The pith
SADE encodes the Cisco troubleshooting methodology as an explicit policy that separates evidence acquisition from hypothesis commitment in LLM network agents.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SADE pairs a phase-gated diagnostic workflow, which separates evidence acquisition from hypothesis commitment, with a routed library of fault-family skills and high-yield diagnostic helpers. On a held-out set of 523 incidents from the public NIKA benchmark, covering eleven unseen scenarios, SADE improves root-cause F1 by 37 percentage points over a ReAct + GPT-5 baseline. A model-controlled comparison against the same Claude Sonnet backend without the SADE policy attributes 22 of those points to the diagnostic policy alone.
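The two reported gaps imply a simple decomposition, sketched below. The symbols are introduced here for illustration only, and the 15-point residual is derived from the two reported numbers rather than stated in the source. It assumes both gaps were measured on the same held-out set.

```latex
\begin{aligned}
\Delta_{\text{total}}    &= F1_{\text{SADE}} - F1_{\text{ReAct+GPT-5}} = 37\ \text{pp} \\
\Delta_{\text{policy}}   &= F1_{\text{SADE}} - F1_{\text{Sonnet, no policy}} = 22\ \text{pp} \\
\Delta_{\text{residual}} &= \Delta_{\text{total}} - \Delta_{\text{policy}}
                          = F1_{\text{Sonnet, no policy}} - F1_{\text{ReAct+GPT-5}} = 15\ \text{pp}
\end{aligned}
```

The residual covers everything that differs between the ReAct + GPT-5 baseline and the policy-free Claude Sonnet configuration, so it should not be read as a pure model effect unless those two setups share identical tooling.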
What carries the argument
The symptom-aware diagnostic escalation policy that enforces a classical Cisco-style phase-gated workflow separating evidence collection from hypothesis commitment.
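A minimal sketch of what a phase gate of this kind could look like, assuming a simple evidence-count threshold. The class names, the `min_evidence` parameter, and the gate condition are hypothetical illustrations, not the paper's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import List, Optional


class Phase(Enum):
    GATHER = auto()  # evidence acquisition only
    COMMIT = auto()  # hypothesis commitment is permitted


@dataclass
class Diagnosis:
    """Hypothetical phase-gated diagnostic state (not the paper's code)."""
    min_evidence: int = 3  # assumed gate threshold, for illustration
    phase: Phase = Phase.GATHER
    evidence: List[str] = field(default_factory=list)
    root_cause: Optional[str] = None

    def observe(self, finding: str) -> None:
        """Record a finding and open the gate once enough evidence exists."""
        self.evidence.append(finding)
        if len(self.evidence) >= self.min_evidence:
            self.phase = Phase.COMMIT

    def commit(self, hypothesis: str) -> None:
        """Commit to a root cause; refused while the gate is still closed."""
        if self.phase is not Phase.COMMIT:
            raise PermissionError("cannot commit before the evidence gate opens")
        self.root_cause = hypothesis
```

The point of the design is that an early guess is rejected by construction: `commit` raises until `observe` has been called often enough, which keeps hypothesis commitment separate from evidence acquisition.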
If this is right
- Root-cause F1 on the public NIKA benchmark rises by 37 percentage points over a ReAct + GPT-5 baseline on previously unseen scenarios.
- The diagnostic policy itself, independent of model choice, accounts for 22 of those 37 points.
- LLM agents can benefit from encoding established human engineering methodologies rather than relying solely on free-form reasoning.
- Structured workflows may help close the gap between current LLM performance and practical deployment thresholds in network operations.
Where Pith is reading between the lines
- Similar explicit policies could improve LLM performance in other technical domains that have established diagnostic protocols, such as medical or mechanical troubleshooting.
- Future benchmarks should include more diverse real-world incidents to test if the gains hold beyond the NIKA dataset.
- Integrating SADE with other LLM techniques like chain-of-thought or tool use might yield further improvements in complex network environments.
Load-bearing premise
The NIKA benchmark's held-out incidents and eleven unseen scenarios are representative of real-world network troubleshooting, and the gains do not stem from unstated implementation details or benchmark tuning.
What would settle it
Running SADE and the baseline on a new set of network incidents drawn from actual production environments outside the NIKA benchmark and finding no significant improvement or a reversal.
Original abstract
Large language model (LLM) agents are increasingly applied to network troubleshooting, but root-cause localization on public benchmarks remains well below practical deployment thresholds. We argue this is because existing agents do not encode the disciplined, layer-by-layer methodology that human network engineers use, and instead rely on free-form deliberation that conflates evidence acquisition with hypothesis commitment. We present SADE (Symptom-Aware Diagnostic Escalation), an agent that encodes the classical Cisco troubleshooting methodology as an explicit policy. SADE pairs a phase-gated diagnostic workflow, which separates evidence acquisition from hypothesis commitment, with a routed library of fault-family skills and high-yield diagnostic helpers. On a held-out 523 incident set of the public NIKA benchmark covering eleven unseen scenarios, SADE improves root-cause F1 by 37 percentage points over a ReAct + GPT-5 baseline; a model-controlled comparison against the same Claude Sonnet backend without the SADE policy attributes 22 of those points to the diagnostic policy alone, showing that the gain is not a side-effect of the model upgrade.
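The "routed library of fault-family skills" can be pictured as a lookup from observed symptoms to skills that are loaded on demand. The sketch below is an assumption-laden illustration: only the name `ospf-fault-skill` appears in the source, while the other skill names, the keyword index, and the substring matching are hypothetical.

```python
from typing import Dict, List

# Hypothetical fault index: symptom keyword -> skill name.
FAULT_INDEX: Dict[str, str] = {
    "ospf": "ospf-fault-skill",  # skill name mentioned in the paper's appendix
    "bgp": "bgp-fault-skill",    # hypothetical
    "dns": "dns-fault-skill",    # hypothetical
}


def route_skills(symptoms: List[str], index: Dict[str, str] = FAULT_INDEX) -> List[str]:
    """Return the skills whose keyword occurs in any symptom, without duplicates."""
    selected: List[str] = []
    for symptom in symptoms:
        text = symptom.lower()
        for keyword, skill in index.items():
            if keyword in text and skill not in selected:
                selected.append(skill)
    return selected
```

Naive substring matching is used only to keep the example short. A real router would more plausibly classify symptoms before selecting which skills to load.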
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents SADE, an LLM agent for network troubleshooting that encodes Cisco's classical methodology as an explicit policy consisting of a phase-gated diagnostic workflow and a routed library of fault-family skills. On a held-out 523-incident subset of the NIKA benchmark spanning eleven unseen scenarios, SADE reports a 37 percentage point improvement in root-cause F1 score over a ReAct + GPT-5 baseline. A controlled comparison using the same Claude Sonnet model attributes 22 of those points to the SADE policy.
Significance. Should the attribution of gains to the explicit policy hold after verification of implementation details, the work would provide concrete evidence that structured, methodology-driven agent designs can substantially outperform free-form reasoning approaches in domain-specific technical tasks like network root-cause analysis. This has potential implications for agent architectures in other engineering and diagnostic domains.
major comments (2)
- [Abstract] Abstract (and §4 evaluation): The model-controlled comparison claims to isolate the effect of the SADE policy by using 'the same Claude Sonnet backend without the SADE policy.' However, it is not specified whether this baseline retains the diagnostic helpers, skill library, and routing logic present in SADE. If the baseline is a standard ReAct loop with reduced tooling, the 22pp F1 gain cannot be unambiguously attributed to the phase-gated workflow and symptom-aware escalation alone.
- [§4] §4 (results): The paper reports numerical gains on the 523-incident held-out set without accompanying error bars, confidence intervals, or details on the number of runs or variance across seeds. Given the stochastic nature of LLM agents, this makes it difficult to assess the reliability of the 37pp and 22pp improvements.
minor comments (1)
- [Abstract] Abstract: The abstract mentions 'eleven unseen scenarios' but provides no details on how these scenarios were selected, their diversity, or explicit checks for data leakage from the training portion of NIKA.
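A leakage check of the kind requested here can be as simple as intersecting scenario identifiers across splits. The sketch below assumes each incident is a dictionary with a `scenario` field. That schema is hypothetical and is not taken from the NIKA benchmark.

```python
from typing import Dict, Iterable, List


def scenario_overlap(train: Iterable[Dict[str, str]],
                     held_out: Iterable[Dict[str, str]]) -> List[str]:
    """Return scenario identifiers present in both splits, sorted."""
    train_ids = {incident["scenario"] for incident in train}
    held_ids = {incident["scenario"] for incident in held_out}
    return sorted(train_ids & held_ids)


# Toy records for illustration only; not real benchmark data.
train_split = [{"scenario": "ospf-area-mismatch"}, {"scenario": "dns-misconfig"}]
held_out_split = [{"scenario": "bgp-route-leak"}, {"scenario": "dns-misconfig"}]
leaked = scenario_overlap(train_split, held_out_split)
```

An empty result rules out identifier-level overlap only. Near-duplicate topologies or fault injections under different names would need a separate similarity check.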
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. The comments highlight important areas for clarification and statistical rigor. We address each major comment below and will update the manuscript accordingly.
-
Referee: [Abstract] Abstract (and §4 evaluation): The model-controlled comparison claims to isolate the effect of the SADE policy by using 'the same Claude Sonnet backend without the SADE policy.' However, it is not specified whether this baseline retains the diagnostic helpers, skill library, and routing logic present in SADE. If the baseline is a standard ReAct loop with reduced tooling, the 22pp F1 gain cannot be unambiguously attributed to the phase-gated workflow and symptom-aware escalation alone.
Authors: We agree the description in the abstract and §4 was insufficiently precise and could lead to this ambiguity. The model-controlled baseline uses the identical Claude Sonnet backend together with the full set of diagnostic helpers, skill library, and routing logic. The sole difference is replacement of the phase-gated workflow and symptom-aware escalation with a standard ReAct deliberation loop. This isolates the contribution of the SADE policy. We will revise the abstract and §4 to state this configuration explicitly.
Revision: yes
-
Referee: [§4] §4 (results): The paper reports numerical gains on the 523-incident held-out set without accompanying error bars, confidence intervals, or details on the number of runs or variance across seeds. Given the stochastic nature of LLM agents, this makes it difficult to assess the reliability of the 37pp and 22pp improvements.
Authors: We acknowledge that variance reporting is essential for LLM-agent evaluations. The original results were obtained from single runs per configuration owing to benchmark scale and API costs. We have now executed additional independent runs across multiple seeds and will add error bars (standard deviation), confidence intervals, and run counts to the revised §4 tables and text.
Revision: yes
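The variance reporting discussed above could be computed along the following lines. This is a sketch under stated assumptions: the per-seed F1 values are synthetic placeholders rather than results from the paper, and the percentile bootstrap is one possible choice of interval, not necessarily the one the authors will use.

```python
import random
import statistics
from typing import List, Tuple


def summarize_runs(f1_scores: List[float], n_boot: int = 2000,
                   alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, Tuple[float, float]]:
    """Return mean, sample standard deviation, and a percentile bootstrap CI."""
    rng = random.Random(seed)  # fixed seed so the summary is reproducible
    mean = statistics.mean(f1_scores)
    std = statistics.stdev(f1_scores)
    # Resample the per-seed scores with replacement and collect the means.
    boot_means = sorted(
        statistics.mean(rng.choices(f1_scores, k=len(f1_scores)))
        for _ in range(n_boot)
    )
    lo = boot_means[int((alpha / 2) * n_boot)]
    hi = boot_means[int((1 - alpha / 2) * n_boot) - 1]
    return mean, std, (lo, hi)


# Synthetic per-seed F1 values, for illustration only.
placeholder_f1 = [0.60, 0.62, 0.58, 0.61, 0.59]
mean_f1, std_f1, (ci_lo, ci_hi) = summarize_runs(placeholder_f1)
```

With only a handful of seeds the bootstrap interval is crude. Resampling over the 523 incidents within each run, or a paired comparison between configurations on the same incidents, would give a tighter picture of whether the 37pp and 22pp gaps are reliable.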
Circularity Check
No circularity: empirical gains grounded in held-out benchmark and controlled ablation
The paper's central claim is an empirical performance improvement (37pp F1 over ReAct baseline, 22pp attributed to policy via model-controlled comparison) on a public held-out NIKA benchmark covering eleven unseen scenarios. No derivation chain, equations, or first-principles results are presented that reduce the claimed gains to self-definitions, fitted inputs renamed as predictions, or load-bearing self-citations. The evaluation uses external data and an ablation against the same backend, satisfying the criteria for a self-contained result without circular reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The classical Cisco troubleshooting methodology provides an effective and general layer-by-layer diagnostic process that can be directly encoded as an LLM policy.
Appendix A: Skill and Helper-Script Examples
Two examples illustrate the SADE library (Figure 6): a fault-family skill book (ospf-fault-skill) loaded on demand when the Fault Index routes a...