Pith · machine review for the scientific record

arxiv: 2605.07161 · v2 · submitted 2026-05-08 · 💻 cs.AI

Recognition: no theorem link

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:44 UTC · model grok-4.3

classification 💻 cs.AI
keywords site reliability engineering · AI agents · benchmark · fault injection · system failures · cloud-native · agent evaluation · failure simulation

The pith

SREGym supplies 90 realistic SRE problems in a live cloud environment to measure how AI agents handle injected faults and complex failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SREGym to move beyond oversimplified SRE benchmarks by building a live system on real cloud-native stacks. It injects a wide range of faults across layers, adds ambient noise, and reproduces complex failure modes including metastable and correlated failures. The framework is modular, so new injectors and stacks can be added, and it currently ships with 90 challenging problems. When frontier agents are run on these problems, their success rates differ by as much as 40 percent depending on the failure type.
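As a concrete illustration of what one of the 90 problems might contain, here is a minimal sketch of a problem specification in Python, with hypothetical names (ProblemSpec, FaultSpec, OracleSpec) loosely inferred from the component labels in Figure 2; the actual SREGym schema may differ.

```python
# A sketch of an SREGym-style problem spec; all class and field names here
# are assumptions inferred from Figure 2, not the framework's real schema.
from dataclasses import dataclass, field


@dataclass
class FaultSpec:
    layer: str          # e.g. "hardware", "os", "application", "config"
    kind: str           # e.g. "misconfig", "code_bug", "bad_op"
    target: str         # service or component to inject into
    params: dict = field(default_factory=dict)


@dataclass
class OracleSpec:
    root_cause: str     # ground-truth diagnosis the agent must reach
    healthy_check: str  # probe that must pass after mitigation


@dataclass
class ProblemSpec:
    system: str         # which stack to deploy
    fault: FaultSpec
    oracle: OracleSpec


# Example: a misconfigured timeout in a microservice stack.
problem = ProblemSpec(
    system="TrainTicket on Kubernetes",
    fault=FaultSpec(layer="config", kind="misconfig",
                    target="ts-order-service",
                    params={"key": "db.timeout", "value": "1ms"}),
    oracle=OracleSpec(root_cause="db timeout misconfiguration",
                      healthy_check="kubectl rollout status deploy/ts-order-service"),
)
```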

Core claim

SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. It models the complexity of production environments by simulating a wide range of faults at different layers, various ambient noises, and diverse failure modes such as metastable failures and correlated failures. Architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks, SREGym currently includes 90 realistic, challenging SRE problems. Evaluation shows frontier agents vary significantly in addressing different kinds of failures, with up to 40 percent differences in end-to-end results.

What carries the argument

SREGym, a modular framework that orchestrates fault and noise injectors across real cloud-native stacks to generate high-fidelity SRE scenarios.
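A minimal sketch of that orchestration pattern, assuming hypothetical names (Injector, REGISTRY, run_problem) that are not SREGym's actual API: each fault or noise source implements a small inject/revert interface, and the framework composes them against a live target before handing the system to the agent.

```python
# Sketch of a modular injector registry; names and behavior are assumptions,
# not SREGym's real interfaces.
from abc import ABC, abstractmethod


class Injector(ABC):
    @abstractmethod
    def inject(self, target: str) -> None: ...

    @abstractmethod
    def revert(self, target: str) -> None: ...


class CPUStressInjector(Injector):
    def inject(self, target: str) -> None:
        print(f"start CPU stress on {target}")  # would shell out in practice
    def revert(self, target: str) -> None:
        print(f"stop CPU stress on {target}")


REGISTRY: dict[str, type[Injector]] = {"cpu_stress": CPUStressInjector}


def run_problem(fault_kind: str, target: str, noises: list[Injector]) -> None:
    """Deploy the fault plus ambient noise, let the agent work on the live
    system, then clean up so each trial starts from the same state."""
    fault = REGISTRY[fault_kind]()
    for n in noises:
        n.inject(target)   # ambient noise coexists with the real fault
    fault.inject(target)
    try:
        pass               # agent diagnoses and mitigates here
    finally:
        fault.revert(target)
        for n in noises:
            n.revert(target)
```

Under this reading, registering one new class in the registry is the entire cost of adding a fault type or stack, which is what the extensibility claim amounts to.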

If this is right

  • Agents display up to 40 percent variation in end-to-end success depending on the failure category.
  • The modular injector design permits straightforward addition of new faults, noises, or system stacks.
  • The benchmark distinguishes agent strengths on layer-specific faults from those on correlated or metastable failures.
  • SREGym supplies an open, maintainable platform for repeated evaluation of SRE agents as models improve.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If agent rankings on SREGym align with real incidents, the benchmark could become a standard filter before deploying agents into production.
  • Specialized training loops could target the failure modes where current agents show the largest gaps.
  • Extending the framework to additional cloud providers or on-premise stacks would test how general the observed performance differences remain.

Load-bearing premise

The simulated faults, ambient noises, and failure modes in the benchmark accurately reflect the complexity and diversity of real production system failures.

What would settle it

A direct comparison of the same agents' success rates on SREGym scenarios versus their performance on logged, real-world production incidents would show whether the simulated environments produce matching capability rankings.
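A sketch of how that comparison could be scored, using Spearman rank correlation over per-agent success rates; the numbers below are illustrative placeholders, not data from the paper.

```python
# Do agent rankings on SREGym match rankings on real incidents?
from scipy.stats import spearmanr

agents = ["agent-a", "agent-b", "agent-c", "agent-d"]
sregym_success = [0.62, 0.41, 0.55, 0.30]      # benchmark success rates
production_success = [0.58, 0.45, 0.50, 0.28]  # rates on logged incidents

rho, p_value = spearmanr(sregym_success, production_success)
print(f"Spearman rho={rho:.2f}, p={p_value:.3f}")
# A high rho would suggest the simulated scenarios rank agents the way real
# incidents do; a low rho would undercut the fidelity premise above.
```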

Figures

Figures reproduced from arXiv: 2605.07161 by Hans-Arno Jacobsen, Jackson Clark, Lily Gniedziejko, Saad Mohammad Rafid Pial, Tianyin Xu, Yifang Tian, Yiming Su, Yinfang Chen.

Figure 1
Figure 1. An SRE problem in SREGym and the trajectories of two agents when addressing it. view at source ↗
Figure 2
Figure 2. An overview of SREGym in terms of its components. view at source ↗
Figure 3
Figure 3. Implementing a problem in SREGym (noises are injected by the framework). view at source ↗
Figure 4
Figure 4. SRE problems that create scenarios with different failure modes in SREGym. view at source ↗
Figure 5
Figure 5. The average number of tool calls per SRE problem. view at source ↗
Figure 6
Figure 6. Mean total tokens per run versus end-to-end success rate. view at source ↗
Figure 7
Figure 7. Per-problem end-to-end success rate with no noise injected. view at source ↗
Figure 8
Figure 8. Per-problem end-to-end success rate under noisy conditions. view at source ↗
read the original abstract

AI agents are increasingly used to diagnose and mitigate failures in production systems, known as agentic Site Reliability Engineering (SRE). Current SRE benchmarks are limited to oversimplistic SRE tasks and are unfortunately hard to extend due to bespoke designs. We present SREGym, a high-fidelity benchmark for SRE agents. SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. SREGym models the complexity of production environments by simulating (1) a wide range of faults at different layers, (2) various ambient noises, and (3) diverse failure modes such as metastable failures and correlated failures. SREGym is architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks. SREGym currently includes 90 realistic, challenging SRE problems. We use SREGym to evaluate frontier agents and show that their capabilities varies significantly in addressing different kinds of failures, with up to 40% differences in end-to-end results. SREGym is actively maintained as an open-source project and has been used by researchers and practitioners.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SREGym, a live benchmark for AI SRE agents consisting of a modular, extensible framework built atop real-world cloud-native stacks. It simulates 90 high-fidelity failure scenarios via fault and noise injectors, modeling multi-layer faults, ambient noises, metastable failures, and correlated failures. Frontier agents are evaluated on these scenarios, with results showing performance variations of up to 40% across different failure kinds. The benchmark is presented as open-source and actively maintained.

Significance. If the simulated scenarios prove to be high-fidelity representations of production SRE incidents, SREGym could fill an important gap by supplying a standardized, extensible, and challenging testbed that surpasses the oversimplistic tasks in prior SRE benchmarks. The modular orchestrator design and reported agent performance gaps would provide concrete guidance for improving agentic systems in operations. The open-source release is a clear strength that supports reproducibility and community extension.

major comments (2)
  1. [Abstract] Abstract and benchmark design section: The central claim that SREGym supplies 'high-fidelity' and 'realistic' failure scenarios (including metastable and correlated failures) is load-bearing for interpreting the 40% performance gaps, yet no quantitative validation is supplied—no statistical comparison of injected fault propagation or recovery dynamics against real production incident traces, no expert panel fidelity ratings, and no correlation analysis between SREGym scores and actual SRE task outcomes.
  2. [Evaluation] Evaluation/results section: The reported up to 40% differences in end-to-end agent results are presented without accompanying details on the precise success metrics, number of trials per scenario, handling of stochastic agent behavior, statistical significance tests, or controls for environmental variability in the live system, rendering the magnitude and reliability of the gaps difficult to assess.
minor comments (1)
  1. [Abstract] Abstract: grammatical error in 'their capabilities varies significantly' (should be 'vary').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment point by point below and describe the revisions we will incorporate to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract] Abstract and benchmark design section: The central claim that SREGym supplies 'high-fidelity' and 'realistic' failure scenarios (including metastable and correlated failures) is load-bearing for interpreting the 40% performance gaps, yet no quantitative validation is supplied—no statistical comparison of injected fault propagation or recovery dynamics against real production incident traces, no expert panel fidelity ratings, and no correlation analysis between SREGym scores and actual SRE task outcomes.

    Authors: We agree that stronger grounding for the high-fidelity claim would improve the manuscript. The scenarios are constructed by modeling documented production SRE failure modes (multi-layer faults, ambient noise, metastable states, and correlations) using fault injectors on live cloud-native stacks drawn from standard industry practices and incident reports. We will revise the benchmark design section to provide explicit citations to the sources of these failure patterns, a more detailed description of the injector implementation, and an explicit discussion of the current limitations regarding quantitative validation against production traces. A full statistical correlation study or expert panel rating is outside the scope of this initial benchmark paper but is noted as planned future work. revision: partial

  2. Referee: [Evaluation] Evaluation/results section: The reported up to 40% differences in end-to-end agent results are presented without accompanying details on the precise success metrics, number of trials per scenario, handling of stochastic agent behavior, statistical significance tests, or controls for environmental variability in the live system, rendering the magnitude and reliability of the gaps difficult to assess.

    Authors: We thank the referee for highlighting this omission. The revised evaluation section will define the precise success metrics (diagnosis accuracy, mitigation success rate, and end-to-end task completion), state that each scenario was run for 10 independent trials to address stochastic agent behavior, report statistical significance testing (paired t-tests and ANOVA for the observed gaps), and describe environmental controls including full system resets and isolation between trials to minimize live-system variability. These additions will make the reported performance differences more interpretable and reproducible. revision: yes
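For concreteness, a sketch of the paired significance test the rebuttal promises, over per-problem success fractions from 10 trials; the data here are synthetic placeholders, not the paper's results.

```python
# Paired t-test over per-problem success rates for two agents.
# All numbers are simulated for illustration only.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n_problems, n_trials = 90, 10

# success_a[i] = fraction of the 10 trials in which agent A solved problem i
success_a = rng.binomial(n_trials, 0.55, n_problems) / n_trials
success_b = rng.binomial(n_trials, 0.35, n_problems) / n_trials

t_stat, p_value = ttest_rel(success_a, success_b)  # paired over problems
print(f"mean gap = {np.mean(success_a - success_b):.2f}, "
      f"t = {t_stat:.2f}, p = {p_value:.3g}")
```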

Circularity Check

0 steps flagged

No circularity detected in SREGym benchmark description

full rationale

The paper introduces SREGym as an explicitly constructed modular benchmark with described fault/noise injectors and a fixed set of 90 problems, then reports direct evaluation outcomes on external frontier agents. No derivation chain, equations, fitted parameters, or self-citations appear in the provided text that would reduce the core claims (problem count, performance gaps) to inputs by construction. The presentation is self-contained as a new artifact plus external testing results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5532 in / 1094 out tokens · 45283 ms · 2026-05-14T21:44:20.410285+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · 3 internal anchors

  1. [1] Chaos Mesh: A powerful chaos engineering platform for Kubernetes. https://chaos-mesh.org
  2. [2] Claude Code by Anthropic. https://code.claude.com
  3. [3] ConfigMaps. https://kubernetes.io/docs/concepts/configuration/configmap
  4. [4] Deployment. https://kubernetes.io/docs/concepts/workloads/controllers/deployment
  5. [5] Codex: Cloud coding agent. https://chatgpt.com/codex
  6. [6] Helm - The package manager for Kubernetes. https://helm.sh
  7. [7] Jaeger: open source, distributed tracing platform. https://www.jaegertracing.io
  8. [8] Loki - a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus. https://github.com/grafana/loki
  9. [9] Pods. https://kubernetes.io/docs/concepts/workloads/pods
  10. [10] Open source metrics and monitoring for your systems and services. https://prometheus.io
  11. [11] TierZero - Agents that handle the incidents, alerts, and internal questions that fragment your team's day. https://www.tierzero.ai/
  12. [12] stress-ng - a tool to load and stress a computer system. https://wiki.ubuntu.com/Kernel/Reference/stress-ng, 2013
  13. [13] ChaosBlade: An Easy to Use and Powerful Chaos Engineering Toolkit. https://github.com/chaosblade-io/chaosblade, 2019
  14. [14] Quantifying GitHub Copilot's Impact in the Enterprise with Accenture. https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-in-the-enterprise-with-accenture/, May 2024
  15. [15] Otel-Demo - A microservice-based distributed system intended to illustrate the implementation of OpenTelemetry in a near real-world environment. https://github.com/open-telemetry/opentelemetry-demo, 2024
  16. [16] AWS DevOps Agent. https://aws.amazon.com/devops-agent/, 2025
  17. [17] Azure SRE Agent. https://azure.microsoft.com/en-us/products/sre-agent, 2025
  18. [18] Ciroos - Reduce toil, investigate incidents faster, and drive autonomous operations. https://ciroos.ai/, 2025
  19. [19] dm-dust - A Linux kernel module which can be used to simulate the bad blocks behavior on a physical disk. https://docs.kernel.org/admin-guide/device-mapper/dm-dust.html, 2025
  20. [20] 2025 Stack Overflow Developer Survey. https://survey.stackoverflow.co/2025/, 2025
  21. [21] Amazon tightens code controls after outages, including one caused by AI. https://www.businessinsider.com/amazon-tightens-code-controls-after-outages-including-one-ai-2026-3, Mar. 2026
  22. [22] Demystifying Evals for AI Agents. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents, 2026
  23. [23] 43% of AI-generated code changes need debugging in production, survey finds. https://venturebeat.com/technology/43-of-ai-generated-code-changes-need-debugging-in-production-survey-finds, 2026
  24. [24] SRE-skills-bench: LLM Benchmark for SRE Tasks. https://github.com/Rootly-AI-Labs/SRE-skills-bench, 2026
  25. [25] A. Alquraan, H. Takruri, M. Alfatafta, and S. Al-Kiswany. An Analysis of Network-Partitioning Failures in Cloud Systems. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI'18), Oct. 2018
  26. [26] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing (TDSC), 1(1):1–23, Jan. 2004
  27. [27] L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler. An Analysis of Latent Sector Errors in Disk Drives. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'07), June 2007
  28. [28] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for Request Extraction and Workload Modelling. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI'04), Dec. 2004
  29. [29] N. Bronson, A. Aghayev, A. Charapko, and T. Zhu. Metastable Failures in Distributed Systems. In Proceedings of the 18th Workshop on Hot Topics in Operating Systems (HotOS'21), June 2021
  30. [30] Y. Chen, H. Xie, M. Ma, Y. Kang, X. Gao, L. Shi, Y. Cao, X. Gao, H. Fan, M. Wen, J. Zeng, S. Ghosh, X. Zhang, C. Zhang, Q. Lin, S. Rajmohan, D. Zhang, and T. Xu. Automatic Root Cause Analysis via Large Language Models for Cloud Incidents. In Proceedings of the 19th European Conference on Computer Systems (EuroSys'24), Apr. 2024
  31. [31] Y. Chen, J. Pan, J. Clark, Y. Su, N. Zheutlin, B. Bhavya, R. Arora, Y. Deng, S. Jha, and T. Xu. Stratus: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds. In Proceedings of the 39th Annual Conference on Neural Information Processing Systems (NeurIPS'25), Dec. 2025
  32. [32] Y. Chen, M. Shetty, G. Somashekar, M. Ma, Y. Simmhan, J. Mace, C. Bansal, R. Wang, and S. Rajmohan. AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds. In Proceedings of the 8th Conference on Machine Learning and Systems (MLSys'25), May 2025
  33. [33] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An Empirical Study of Operating Systems Errors. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP'01), Oct. 2001
  34. [34] M. Chow, Y. Wang, W. Wang, A. Hailu, R. Bopardikar, B. Zhang, J. Qu, D. Meisner, S. Sonawane, Y. Zhang, R. Paim, M. Ward, I. Huang, M. McNally, D. Hodges, Z. Farkas, C. Gocmen, E. Huang, and C. Tang. ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre-Production Testing. In Proceedings of the 18th USENIX Symposium on Operating...
  35. [35] S. Y. Chu, J. W. Kim, and M. Y. Yi. Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI'25), Apr. 2025
  36. [36] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in Globally Distributed Storage Systems. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI'10), Oct. 2010
  37. [37] Y. Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno, J. Hu, B. Ritchken, B. Jackson, K. Hu, M. Pancholi, Y. He, B. Clancy, C. Colen, F. Wen, C. Leung, S. Wang, L. Zaruvinsky, M. Espinosa, R. Lin, Z. Liu, J. Padilla, and C. Delimitrou. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud &...
  38. [38] J. T. Gu, X. Sun, W. Zhang, Y. Jiang, C. Wang, M. Vaziri, O. Legunsen, and T. Xu. Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP'23), Oct. 2023
  39. [39] J. T. Gu, Z. Tang, Y. Su, B. A. Stoica, X. Sun, W. X. Zheng, Y. Zhang, A. Rahman, C. Wang, and T. Xu. Who Watches the Watchers? On the Reliability of Softwarizing Cloud Application Management. In Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI'26), May 2026
  40. [40] H. S. Gunawi, C. Rubio-González, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and B. Liblit. EIO: Error Handling is Occasionally Correct. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST'08), Feb. 2008
  41. [41] H. S. Gunawi, M. Hao, R. O. Suminto, A. Laksono, A. D. Satria, J. Adityatama, and K. J. Eliazar. Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages. In Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC'16), Oct. 2016
  42. [42] H. S. Gunawi, R. O. Suminto, R. Sears, C. Golliher, S. Sundararaman, X. Lin, T. Emami, W. Sheng, N. Bidokhti, C. McCaffrey, G. Grider, P. M. Fields, K. Harms, R. B. Ross, A. Jacobson, R. Ricci, K. Webb, P. Alvaro, H. B. Runesha, M. Hao, and H. Li. Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems. In Proceedings of t...
  43. [43] S. Han, X. Hu, H. Huang, M. Jiang, and Y. Zhao. ADBench: Anomaly Detection Benchmark. In Proceedings of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS'22), Nov. 2022
  44. [44] P. H. Hochschild, P. Turner, J. C. Mogul, R. Govindaraju, P. Ranganathan, D. E. Culler, and A. Vahdat. Cores that don't count. In Proceedings of the ACM SIGOPS 21st Workshop on Hot Topics in Operating Systems (HotOS'21), June 2021
  45. [45] L. Huang, M. Magnusson, A. B. Muralikrishna, S. Estyak, R. Isaacs, A. Aghayev, T. Zhu, and A. Charapko. Metastable Failures in the Wild. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI'22), July 2022
  46. [46] P. Huang, C. Guo, L. Zhou, J. R. Lorch, Y. Dang, M. Chintalapati, and R. Yao. Gray Failure: The Achilles' Heel of Cloud-Scale Systems. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems (HotOS'17), pages 150–155, May 2017
  47. [47] R. Isaacs, P. Alvaro, R. Majumdar, K.-K. Muniswamy-Reddy, M. Salamati, and S. Soudjani. Analyzing Metastable Failures. In Proceedings of the ACM SIGOPS 20th Workshop on Hot Topics in Operating Systems (HotOS'25), May 2025
  48. [48] V. Jacob, F. Song, A. Stiegler, B. Rad, Y. Diao, and N. Tatbul. Exathlon: a Benchmark for Explainable Anomaly Detection Over Time Series. Proceedings of the VLDB Endowment (VLDB'21), 14(11):2613–2626, July 2021
  49. [49] S. Jha, S. Cui, S. Banerjee, T. Xu, J. Enos, M. Showerman, Z. T. Kalbarczyk, and R. K. Iyer. Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems. In Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis (SC'20), Nov. 2020
  50. [50] S. Jha, R. R. Arora, Y. Watanabe, T. Yanagawa, Y. Chen, J. Clark, B. Bhavya, M. Verma, H. Kumar, H. Kitahara, N. Zheutlin, S. Takano, D. Pathak, F. George, X. Wu, B. O. Turkkan, G. Vanloo, M. Nidd, T. Dai, O. Chatterjee, P. Gupta, S. Samanta, P. Aggarwal, R. Lee, J. wook Ahn, D. Kar, A. Paradkar, Y. Deng, P. Moogi, P. Mohapatra, N. Abe, C. Narayanaswam...
  51. [51] J. Kaldor, J. Mace, M. Bejda, E. Gao, W. Kuropatwa, J. O'Neill, K. W. Ong, B. Schaller, P. Shan, B. Viscomi, V. Venkataraman, K. Veeraraghavan, and Y. J. Song. Canopy: An End-to-End Performance Tracing and Analysis System. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP'17), Oct. 2017
  52. [52] J.-C. Laprie. Dependable Computing: Concepts, Limits, Challenges. In Proceedings of the 25th IEEE International Symposium on Fault-Tolerant Computing (FTCS'95), June 1995
  53. [53] Y. Lee, J. Kim, J. Kim, H. Cho, J. Kang, P. Kang, and N. Kim. CheckEval: A Reliable LLM-as-a-Judge Framework for Evaluating Text Generation Using Checklists. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP'25), Nov. 2025
  54. [54] H. Liu, S. Lu, M. Musuvathi, and S. Nath. What Bugs Cause Production Cloud Incidents? In Proceedings of the 17th Workshop on Hot Topics in Operating Systems (HotOS'19), May 2019
  55. [55] Y. Liu, C. Pei, L. Xu, B. Chen, M. Sun, Z. Zhang, Y. Sun, S. Zhang, K. Wang, H. Zhang, J. Li, G. Xie, X. Wen, X. Nie, M. Ma, and D. Pei. OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models. arXiv:2310.07637, June 2025
  56. [56] D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis. Heracles: Improving Resource Efficiency at Scale. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA'15), June 2015
  57. [57] D. Loker. AI vs Human Code Gen Report: AI Code Creates 1.7x More Issues. https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report, 2026
  58. [58] Y. Luo, J. Jiang, J. Feng, L. Tao, Q. Zhang, X. Wen, Y. Sun, S. Zhang, and D. Pei. From Observability Data to Diagnosis: An Evolving Multi-agent System for Incident Management in Cloud Systems. arXiv:2510.24145, Nov. 2025
  59. [59] J. Mao, L. Li, Y. Gao, Z. Peng, S. He, C. Zhang, S. Qin, S. Khalid, Q. Lin, S. Rajmohan, et al. StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis. arXiv:2510.10074, Apr. 2026
  60. [60] M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, J. Shen, G. Ye, H. Lin, J. Poulos, M. Wang, M. Nezhurina, J. Jitsev, D. Lu, O. M. Mastromichalakis, Z. Xu, Z. Chen, Y. Liu, R. Zhang, L. L. Chen, A. Kashyap, J.-L. Uslu, J. Li, J. Wu, M. Yan, S. Bian, V. Sharma, K. Sun, S. Dillmann, A. Ana...
  61. [61] J. J. Meza, T. Gowda, A. Eid, T. Ijaware, D. Chernyshev, Y. Yu, M. N. Uddin, R. Das, C. Nachiappan, S. Tran, S. Shi, T. Luo, D. K. Hong, S. Panneerselvam, H. Ragas, S. Manavski, W. Wang, and F. Richard. Defcon: Preventing Overload with Graceful Feature Degradation. In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementat...
  62. [62] N. Palix, G. Thomas, S. Saha, C. Calvès, J. Lawall, and G. Muller. Faults in Linux: Ten Years Later. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'11), Mar. 2011
  63. [63] T. Patel and D. Tiwari. CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA'20), Feb. 2020
  64. [64] C. Pei, Z. Wang, F. Liu, Z. Li, Y. Liu, X. He, R. Kang, T. Zhang, J. Chen, J. Li, G. Xie, and D. Pei. Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis. In Companion Proceedings of the ACM on Web Conference 2025 (WWW'25), May 2025
  65. [65] T. Pelkonen, S. Franklin, J. Teller, P. Cavallaro, Q. Huang, J. Meza, and K. Veeraraghavan. Gorilla: a Fast, Scalable, In-memory Time Series Database. Proceedings of the VLDB Endowment (VLDB'15), 8(12):1816–1827, Aug. 2015
  66. [66] B. Schroeder, S. Damouras, and P. Gill. Understanding Latent Sector Errors and How to Protect against Them. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST'10), Feb. 2010
  67. [67] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report dapper-2010-1, Google, Inc., Apr. 2010
  68. [68] X. Sun, R. Cheng, J. Chen, E. Ang, O. Legunsen, and T. Xu. Testing Configuration Changes in Context to Prevent Production Failures. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI'20), Nov. 2020
  69. [69] X. Sun, W. Luo, J. T. Gu, A. Ganesan, R. Alagappan, M. Gasch, L. Suresh, and T. Xu. Automatic Reliability Testing for Cluster Management Controllers. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI'22), July 2022
  70. [70] P. Tang, S. Tang, H. Pu, Z. Miao, and Z. Wang. MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents. arXiv:2509.15635, Sept. 2025
  71. [71] Y. Tian, Y. Liu, Z. Chong, Z. Huang, and H.-A. Jacobsen. GALA: Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis? arXiv:2508.12472, Aug. 2025
  72. [72] B. Treynor, M. Dahlin, V. Rau, and B. Beyer. The Calculus of Service Availability. Communications of the ACM (CACM), 60(9):42–47, Aug. 2017
  73. [73] K. Veeraraghavan, J. Meza, S. Michelson, S. Panneerselvam, A. Gyori, D. Chou, S. Margulis, D. Obenshain, S. Padmanabha, A. Shah, Y. J. Song, and T. Xu. Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (O...
  74. [74] H. Wang, Q. Mang, A. Cheung, K. Sen, and D. Song. How We Broke Top AI Agent Benchmarks: And What Comes Next. https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/, 2026
  75. [75] Y. Wang, G. Yu, H. Huang, Z. Wang, Y. Huang, P. Chen, and M. R. Lyu. Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems. arXiv:2603.00468, Feb. 2026
  76. [76] Z. Wang, Z. Liu, Y. Zhang, A. Zhong, J. Wang, F. Yin, L. Fan, L. Wu, and Q. Wen. RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM'24), Oct. 2024
  77. [77] J. Xu, Q. Zhang, Z. Zhong, S. He, C. Zhang, Q. Lin, D. Pei, P. He, D. Zhang, and Q. Zhang. OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures? In Proceedings of the 13th International Conference on Learning Representations (ICLR'25), Jan. 2025
  78. [78] T. Xu, J. Zhang, P. Huang, J. Zheng, T. Sheng, D. Yuan, Y. Zhou, and S. Pasupathy. Do Not Blame Users for Misconfigurations. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP'13), Nov. 2013
  79. [79] T. Xu, X. Jin, P. Huang, Y. Zhou, S. Lu, L. Jin, and S. Pasupathy. Early Detection of Configuration Errors to Reduce Failure Damage. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI'16), Nov. 2016
  80. [80] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR'23), May 2023
Showing first 80 references.