pith. machine review for the scientific record.

arxiv: 2510.10074 · v2 · submitted 2025-10-11 · 💻 cs.AI

Recognition: unknown

StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis

classification 💻 cs.AI
keywords execution, tsgs, stepfly, guide, incident, stage, troubleshooting, achieves

Effective incident management in large-scale IT systems relies on troubleshooting guides (TSGs), but their manual execution is slow and error-prone. While recent advances in LLMs offer promise for automating incident management tasks, existing LLM-based solutions lack specialized support for several key challenges, including managing TSG quality issues, interpreting complex control flow, handling data-intensive queries, and exploiting execution parallelism. We first conducted an empirical study on 92 real-world TSGs, and, guided by our findings, we present StepFly, a novel end-to-end agentic framework for troubleshooting guide automation. Our approach features a three-stage workflow: the first stage provides a comprehensive guide together with a tool, TSG Mentor, to assist site reliability engineers (SREs) in improving TSG quality; the second stage performs offline preprocessing using LLMs to extract structured execution directed acyclic graphs (DAGs) from unstructured TSGs and to create dedicated Query Preparation Plugins (QPPs); and the third stage executes online using a DAG-guided scheduler-executor framework with a memory system to ensure correct workflow and support parallel execution of independent steps. Our empirical evaluation on a collection of real-world TSGs and incidents demonstrates that StepFly achieves a ~94% success rate on GPT-4.1, outperforming baselines with less time and token consumption. Furthermore, it achieves a remarkable execution time reduction of 32.9% to 70.4% for parallelizable TSGs. Our code and sample data are publicly available at https://github.com/microsoft/StepFly.
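
The third stage described in the abstract, a DAG-guided scheduler that runs independent steps in parallel and shares results through a memory, can be illustrated with a minimal sketch. This is not StepFly's implementation (see the repository linked above for that); the names `run_dag`, `make_step`, and the step names are hypothetical, and a plain dictionary stands in for the memory system.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Set

# Hypothetical step type: receives the shared memory, returns a result string.
Step = Callable[[Dict[str, str]], Awaitable[str]]


async def run_dag(steps: Dict[str, Step], deps: Dict[str, Set[str]]) -> Dict[str, str]:
    """Run each step as soon as all of its prerequisites have finished."""
    memory: Dict[str, str] = {}            # results of finished steps
    done: Set[str] = set()
    waiting: Set[str] = set(steps)         # steps not yet scheduled
    running: Dict[asyncio.Task, str] = {}  # in-flight task -> step name

    while len(done) < len(steps):
        # Schedule every step whose prerequisites are all complete.
        for name in sorted(waiting):
            if deps.get(name, set()) <= done:
                waiting.remove(name)
                running[asyncio.create_task(steps[name](memory))] = name
        if not running:
            raise ValueError("cycle or unknown dependency in the step graph")
        # Independent steps run concurrently; wake when any one finishes.
        finished, _ = await asyncio.wait(set(running), return_when=asyncio.FIRST_COMPLETED)
        for task in finished:
            name = running.pop(task)
            memory[name] = task.result()
            done.add(name)
    return memory


def make_step(name: str, delay: float) -> Step:
    """Build a toy step that records which results it could see when it started."""
    async def step(memory: Dict[str, str]) -> str:
        seen = sorted(memory)
        await asyncio.sleep(delay)  # stands in for a query or tool call
        return f"{name} saw {seen}"
    return step


if __name__ == "__main__":
    steps = {
        "cpu": make_step("cpu", 0.02),
        "disk": make_step("disk", 0.01),
        "summary": make_step("summary", 0.0),
    }
    deps = {"summary": {"cpu", "disk"}}  # summary waits for both checks
    print(asyncio.run(run_dag(steps, deps)))
```

In this toy graph the `cpu` and `disk` steps have no dependencies, so they are scheduled together, while `summary` starts only after both have written their results to the shared dictionary.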

This paper has not been read by Pith yet.


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

    cs.AI 2026-05 unverdicted novelty 7.0

    SREGym is a modular, open-source live benchmark with 90 high-fidelity SRE failure scenarios built on real cloud stacks for evaluating AI agents on diagnosis and mitigation tasks.

  2. SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

    cs.AI 2026-05 unverdicted novelty 5.0

    SREGym supplies 90 high-fidelity SRE tasks in a live environment to measure how well frontier AI agents handle diverse faults, noises, and complex failure modes such as metastable and correlated failures.

  3. ActionNex: A Virtual Outage Manager for Cloud Computing

    cs.AI 2026-04 unverdicted novelty 4.0

    ActionNex is an agentic system for cloud outage management that compresses multimodal signals into critical events, uses hierarchical memory for reasoning, and recommends actions with 71.4% precision on real Azure outages.