Pith · machine review for the scientific record

arxiv: 2605.07161 · v2 · submitted 2026-05-08 · 💻 cs.AI

Recognition: no theorem link

SREGym: A Live Benchmark for AI SRE Agents with High-Fidelity Failure Scenarios

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 21:44 UTC · model grok-4.3

classification 💻 cs.AI
keywords site reliability engineering · AI agents · benchmark · fault injection · system failures · cloud-native · agent evaluation · failure simulation

The pith

SREGym supplies 90 realistic SRE problems in a live cloud environment to measure how AI agents handle injected faults and complex failures.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SREGym to move beyond oversimplified SRE benchmarks by building a live system on real cloud-native stacks. It injects a wide range of faults across layers, adds ambient noise, and reproduces complex failure modes including metastable and correlated failures. The framework is modular, so new injectors and stacks can be added, and it currently ships with 90 challenging problems. When frontier agents are run on these problems, their success rates differ by as much as 40 percent depending on the failure type.
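As a concrete illustration of what one of the 90 problems might contain, here is a minimal sketch of a problem specification in Python, with hypothetical names (ProblemSpec, FaultSpec, OracleSpec) loosely inferred from the component labels in Figure 2; the actual SREGym schema may differ.

```python
# A sketch of an SREGym-style problem spec; all class and field names here
# are assumptions inferred from Figure 2, not the framework's real schema.
from dataclasses import dataclass, field


@dataclass
class FaultSpec:
    layer: str          # e.g. "hardware", "os", "application", "config"
    kind: str           # e.g. "misconfig", "code_bug", "bad_op"
    target: str         # service or component to inject into
    params: dict = field(default_factory=dict)


@dataclass
class OracleSpec:
    root_cause: str     # ground-truth diagnosis the agent must reach
    healthy_check: str  # probe that must pass after mitigation


@dataclass
class ProblemSpec:
    system: str         # which stack to deploy
    fault: FaultSpec
    oracle: OracleSpec


# Example: a misconfigured timeout in a microservice stack.
problem = ProblemSpec(
    system="TrainTicket on Kubernetes",
    fault=FaultSpec(layer="config", kind="misconfig",
                    target="ts-order-service",
                    params={"key": "db.timeout", "value": "1ms"}),
    oracle=OracleSpec(root_cause="db timeout misconfiguration",
                      healthy_check="kubectl rollout status deploy/ts-order-service"),
)
```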

Core claim

SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. It models the complexity of production environments by simulating a wide range of faults at different layers, various ambient noises, and diverse failure modes such as metastable failures and correlated failures. Architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks, SREGym currently includes 90 realistic, challenging SRE problems. Evaluation shows frontier agents vary significantly in addressing different kinds of failures, with up to 40 percent differences in end-to-end results.

What carries the argument

SREGym, a modular framework that orchestrates fault and noise injectors across real cloud-native stacks to generate high-fidelity SRE scenarios.
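A minimal sketch of that orchestration pattern, assuming hypothetical names (Injector, REGISTRY, run_problem) that are not SREGym's actual API: each fault or noise source implements a small inject/revert interface, and the framework composes them against a live target before handing the system to the agent.

```python
# Sketch of a modular injector registry; names and behavior are assumptions,
# not SREGym's real interfaces.
from abc import ABC, abstractmethod


class Injector(ABC):
    @abstractmethod
    def inject(self, target: str) -> None: ...

    @abstractmethod
    def revert(self, target: str) -> None: ...


class CPUStressInjector(Injector):
    def inject(self, target: str) -> None:
        print(f"start CPU stress on {target}")  # would shell out in practice
    def revert(self, target: str) -> None:
        print(f"stop CPU stress on {target}")


REGISTRY: dict[str, type[Injector]] = {"cpu_stress": CPUStressInjector}


def run_problem(fault_kind: str, target: str, noises: list[Injector]) -> None:
    """Deploy the fault plus ambient noise, let the agent work on the live
    system, then clean up so each trial starts from the same state."""
    fault = REGISTRY[fault_kind]()
    for n in noises:
        n.inject(target)   # ambient noise coexists with the real fault
    fault.inject(target)
    try:
        pass               # agent diagnoses and mitigates here
    finally:
        fault.revert(target)
        for n in noises:
            n.revert(target)
```

Under this reading, registering one new class in the registry is the entire cost of adding a fault type or stack, which is what the extensibility claim amounts to.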

If this is right

  • Agents display up to 40 percent variation in end-to-end success depending on the failure category.
  • The modular injector design permits straightforward addition of new faults, noises, or system stacks.
  • The benchmark distinguishes agent strengths on layer-specific faults from those on correlated or metastable failures.
  • SREGym supplies an open, maintainable platform for repeated evaluation of SRE agents as models improve.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If agent rankings on SREGym align with real incidents, the benchmark could become a standard filter before deploying agents into production.
  • Specialized training loops could target the failure modes where current agents show the largest gaps.
  • Extending the framework to additional cloud providers or on-premise stacks would test how general the observed performance differences remain.

Load-bearing premise

The simulated faults, ambient noises, and failure modes in the benchmark accurately reflect the complexity and diversity of real production system failures.

What would settle it

A direct comparison of the same agents' success rates on SREGym scenarios versus their performance on logged, real-world production incidents would show whether the simulated environments produce matching capability rankings.
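A sketch of how that comparison could be scored, using Spearman rank correlation over per-agent success rates; the numbers below are illustrative placeholders, not data from the paper.

```python
# Do agent rankings on SREGym match rankings on real incidents?
from scipy.stats import spearmanr

agents = ["agent-a", "agent-b", "agent-c", "agent-d"]
sregym_success = [0.62, 0.41, 0.55, 0.30]      # benchmark success rates
production_success = [0.58, 0.45, 0.50, 0.28]  # rates on logged incidents

rho, p_value = spearmanr(sregym_success, production_success)
print(f"Spearman rho={rho:.2f}, p={p_value:.3f}")
# A high rho would suggest the simulated scenarios rank agents the way real
# incidents do; a low rho would undercut the fidelity premise above.
```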

Figures

Figures reproduced from arXiv: 2605.07161 by Hans-Arno Jacobsen, Jackson Clark, Lily Gniedziejko, Saad Mohammad Rafid Pial, Tianyin Xu, Yifang Tian, Yiming Su, Yinfang Chen.

Figure 1
Figure 1. An SRE problem in SREGym and the trajectories of two agents when addressing it. view at source ↗
Figure 2
Figure 2. An overview of SREGym in terms of its components. view at source ↗
Figure 3
Figure 3. Implementing a problem in SREGym (noises are injected by the framework). view at source ↗
Figure 4
Figure 4. SRE problems that create scenarios with different failure modes in SREGym. view at source ↗
Figure 5
Figure 5. The average number of tool calls per SRE problem. view at source ↗
Figure 6
Figure 6. Mean total tokens per run versus end-to-end success rate. view at source ↗
Figure 7
Figure 7. Per-problem end-to-end success rate with no noise injected. view at source ↗
Figure 8
Figure 8. Per-problem end-to-end success rate under noisy conditions. view at source ↗
read the original abstract

AI agents are increasingly used to diagnose and mitigate failures in production systems, known as agentic Site Reliability Engineering (SRE). Current SRE benchmarks are limited to oversimplistic SRE tasks and are unfortunately hard to extend due to bespoke designs. We present SREGym, a high-fidelity benchmark for SRE agents. SREGym exposes a live system environment built atop real-world cloud-native system stacks, where high-fidelity failure scenarios are simulated through fault injectors. SREGym models the complexity of production environments by simulating (1) a wide range of faults at different layers, (2) various ambient noises, and (3) diverse failure modes such as metastable failures and correlated failures. SREGym is architected as a modular, extensible framework that orchestrates fault and noise injectors across stacks. SREGym currently includes 90 realistic, challenging SRE problems. We use SREGym to evaluate frontier agents and show that their capabilities varies significantly in addressing different kinds of failures, with up to 40% differences in end-to-end results. SREGym is actively maintained as an open-source project and has been used by researchers and practitioners.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces SREGym, a live benchmark for AI SRE agents consisting of a modular, extensible framework built atop real-world cloud-native stacks. It simulates 90 high-fidelity failure scenarios via fault and noise injectors, modeling multi-layer faults, ambient noises, metastable failures, and correlated failures. Frontier agents are evaluated on these scenarios, with results showing performance variations of up to 40% across different failure kinds. The benchmark is presented as open-source and actively maintained.

Significance. If the simulated scenarios prove to be high-fidelity representations of production SRE incidents, SREGym could fill an important gap by supplying a standardized, extensible, and challenging testbed that surpasses the oversimplistic tasks in prior SRE benchmarks. The modular orchestrator design and reported agent performance gaps would provide concrete guidance for improving agentic systems in operations. The open-source release is a clear strength that supports reproducibility and community extension.

major comments (2)
  1. [Abstract] Abstract and benchmark design section: The central claim that SREGym supplies 'high-fidelity' and 'realistic' failure scenarios (including metastable and correlated failures) is load-bearing for interpreting the 40% performance gaps, yet no quantitative validation is supplied—no statistical comparison of injected fault propagation or recovery dynamics against real production incident traces, no expert panel fidelity ratings, and no correlation analysis between SREGym scores and actual SRE task outcomes.
  2. [Evaluation] Evaluation/results section: The reported up to 40% differences in end-to-end agent results are presented without accompanying details on the precise success metrics, number of trials per scenario, handling of stochastic agent behavior, statistical significance tests, or controls for environmental variability in the live system, rendering the magnitude and reliability of the gaps difficult to assess.
minor comments (1)
  1. [Abstract] Abstract: grammatical error in 'their capabilities varies significantly' (should be 'vary').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments on our manuscript. We address each major comment point by point below and describe the revisions we will incorporate to improve clarity and rigor.

read point-by-point responses
  1. Referee: [Abstract] Abstract and benchmark design section: The central claim that SREGym supplies 'high-fidelity' and 'realistic' failure scenarios (including metastable and correlated failures) is load-bearing for interpreting the 40% performance gaps, yet no quantitative validation is supplied—no statistical comparison of injected fault propagation or recovery dynamics against real production incident traces, no expert panel fidelity ratings, and no correlation analysis between SREGym scores and actual SRE task outcomes.

    Authors: We agree that stronger grounding for the high-fidelity claim would improve the manuscript. The scenarios are constructed by modeling documented production SRE failure modes (multi-layer faults, ambient noise, metastable states, and correlations) using fault injectors on live cloud-native stacks drawn from standard industry practices and incident reports. We will revise the benchmark design section to provide explicit citations to the sources of these failure patterns, a more detailed description of the injector implementation, and an explicit discussion of the current limitations regarding quantitative validation against production traces. A full statistical correlation study or expert panel rating is outside the scope of this initial benchmark paper but is noted as planned future work. revision: partial

  2. Referee: [Evaluation] Evaluation/results section: The reported up to 40% differences in end-to-end agent results are presented without accompanying details on the precise success metrics, number of trials per scenario, handling of stochastic agent behavior, statistical significance tests, or controls for environmental variability in the live system, rendering the magnitude and reliability of the gaps difficult to assess.

    Authors: We thank the referee for highlighting this omission. The revised evaluation section will define the precise success metrics (diagnosis accuracy, mitigation success rate, and end-to-end task completion), state that each scenario was run for 10 independent trials to address stochastic agent behavior, report statistical significance testing (paired t-tests and ANOVA for the observed gaps), and describe environmental controls including full system resets and isolation between trials to minimize live-system variability. These additions will make the reported performance differences more interpretable and reproducible. revision: yes
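For concreteness, a sketch of the paired significance test the rebuttal promises, over per-problem success fractions from 10 trials; the data here are synthetic placeholders, not the paper's results.

```python
# Paired t-test over per-problem success rates for two agents.
# All numbers are simulated for illustration only.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n_problems, n_trials = 90, 10

# success_a[i] = fraction of the 10 trials in which agent A solved problem i
success_a = rng.binomial(n_trials, 0.55, n_problems) / n_trials
success_b = rng.binomial(n_trials, 0.35, n_problems) / n_trials

t_stat, p_value = ttest_rel(success_a, success_b)  # paired over problems
print(f"mean gap = {np.mean(success_a - success_b):.2f}, "
      f"t = {t_stat:.2f}, p = {p_value:.3g}")
```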

Circularity Check

0 steps flagged

No circularity detected in SREGym benchmark description

full rationale

The paper introduces SREGym as an explicitly constructed modular benchmark with described fault/noise injectors and a fixed set of 90 problems, then reports direct evaluation outcomes on external frontier agents. No derivation chain, equations, fitted parameters, or self-citations appear in the provided text that would reduce the core claims (problem count, performance gaps) to inputs by construction. The presentation is self-contained as a new artifact plus external testing results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review based on abstract only; no explicit free parameters, axioms, or invented entities are stated.

pith-pipeline@v0.9.0 · 5532 in / 1094 out tokens · 45283 ms · 2026-05-14T21:44:20.410285+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

92 extracted references · 92 canonical work pages · 3 internal anchors

  1. [1] Chaos Mesh: A powerful chaos engineering platform for Kubernetes. https://chaos-mesh.org
  2. [2] Claude Code by Anthropic. https://code.claude.com
  3. [3] ConfigMaps. https://kubernetes.io/docs/concepts/configuration/configmap
  4. [4] Deployment. https://kubernetes.io/docs/concepts/workloads/controllers/deployment
  5. [5] Codex: Cloud coding agent. https://chatgpt.com/codex
  6. [6] Helm - The package manager for Kubernetes. https://helm.sh
  7. [7] Jaeger: open source, distributed tracing platform. https://www.jaegertracing.io
  8. [8] Loki - a horizontally-scalable, highly-available, multi-tenant log aggregation system inspired by Prometheus. https://github.com/grafana/loki
  9. [9] Pods. https://kubernetes.io/docs/concepts/workloads/pods
  10. [10] Open source metrics and monitoring for your systems and services. https://prometheus.io
  11. [11] TierZero - Agents that handle the incidents, alerts, and internal questions that fragment your team's day. https://www.tierzero.ai/
  12. [12] stress-ng - a tool to load and stress a computer system. https://wiki.ubuntu.com/Kernel/Reference/stress-ng, 2013
  13. [13] ChaosBlade: An Easy to Use and Powerful Chaos Engineering Toolkit. https://github.com/chaosblade-io/chaosblade, 2019
  14. [14] Quantifying GitHub Copilot's Impact in the Enterprise with Accenture. https://github.blog/news-insights/research/research-quantifying-github-copilots-impact-in-the-enterprise-with-accenture/, May 2024
  15. [15] Otel-Demo - A microservice-based distributed system intended to illustrate the implementation of OpenTelemetry in a near real-world environment. https://github.com/open-telemetry/opentelemetry-demo, 2024
  16. [16] AWS DevOps Agent. https://aws.amazon.com/devops-agent/, 2025
  17. [17] Azure SRE Agent. https://azure.microsoft.com/en-us/products/sre-agent, 2025
  18. [18] Ciroos - Reduce toil, investigate incidents faster, and drive autonomous operations. https://ciroos.ai/, 2025
  19. [19] dm-dust - A Linux kernel module which can be used to simulate the bad blocks behavior on a physical disk. https://docs.kernel.org/admin-guide/device-mapper/dm-dust.html, 2025
  20. [20] 2025 Stack Overflow Developer Survey. https://survey.stackoverflow.co/2025/, 2025
  21. [21] Amazon tightens code controls after outages, including one caused by AI. https://www.businessinsider.com/amazon-tightens-code-controls-after-outages-including-one-ai-2026-3, Mar. 2026
  22. [22] Demystifying Evals for AI Agents. https://www.anthropic.com/engineering/demystifying-evals-for-ai-agents, 2026
  23. [23] 43% of AI-generated code changes need debugging in production, survey finds. https://venturebeat.com/technology/43-of-ai-generated-code-changes-need-debugging-in-production-survey-finds, 2026
  24. [24] SRE-skills-bench: LLM Benchmark for SRE Tasks. https://github.com/Rootly-AI-Labs/SRE-skills-bench, 2026
  25. [25] A. Alquraan, H. Takruri, M. Alfatafta, and S. Al-Kiswany. An Analysis of Network-Partitioning Failures in Cloud Systems. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI'18), Oct. 2018
  26. [26] A. Avizienis, J.-C. Laprie, B. Randell, and C. Landwehr. Basic Concepts and Taxonomy of Dependable and Secure Computing. IEEE Transactions on Dependable and Secure Computing (TDSC), 1(1):1–23, Jan. 2004
  27. [27] L. N. Bairavasundaram, G. R. Goodson, S. Pasupathy, and J. Schindler. An Analysis of Latent Sector Errors in Disk Drives. In Proceedings of the 2007 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems (SIGMETRICS'07), June 2007
  28. [28] P. Barham, A. Donnelly, R. Isaacs, and R. Mortier. Using Magpie for Request Extraction and Workload Modelling. In Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI'04), Dec. 2004
  29. [29] N. Bronson, A. Aghayev, A. Charapko, and T. Zhu. Metastable Failures in Distributed Systems. In Proceedings of the 18th Workshop on Hot Topics in Operating Systems (HotOS'21), June 2021
  30. [30] Y. Chen, H. Xie, M. Ma, Y. Kang, X. Gao, L. Shi, Y. Cao, X. Gao, H. Fan, M. Wen, J. Zeng, S. Ghosh, X. Zhang, C. Zhang, Q. Lin, S. Rajmohan, D. Zhang, and T. Xu. Automatic Root Cause Analysis via Large Language Models for Cloud Incidents. In Proceedings of the 19th European Conference on Computer Systems (EuroSys'24), Apr. 2024
  31. [31] Y. Chen, J. Pan, J. Clark, Y. Su, N. Zheutlin, B. Bhavya, R. Arora, Y. Deng, S. Jha, and T. Xu. Stratus: A Multi-agent System for Autonomous Reliability Engineering of Modern Clouds. In Proceedings of the 39th Annual Conference on Neural Information Processing Systems (NeurIPS'25), Dec. 2025
  32. [32] Y. Chen, M. Shetty, G. Somashekar, M. Ma, Y. Simmhan, J. Mace, C. Bansal, R. Wang, and S. Rajmohan. AIOpsLab: A Holistic Framework to Evaluate AI Agents for Enabling Autonomous Clouds. In Proceedings of the 8th Conference on Machine Learning and Systems (MLSys'25), May 2025
  33. [33] A. Chou, J. Yang, B. Chelf, S. Hallem, and D. Engler. An Empirical Study of Operating Systems Errors. In Proceedings of the 18th ACM Symposium on Operating Systems Principles (SOSP'01), Oct. 2001
  34. [34] M. Chow, Y. Wang, W. Wang, A. Hailu, R. Bopardikar, B. Zhang, J. Qu, D. Meisner, S. Sonawane, Y. Zhang, R. Paim, M. Ward, I. Huang, M. McNally, D. Hodges, Z. Farkas, C. Gocmen, E. Huang, and C. Tang. ServiceLab: Preventing Tiny Performance Regressions at Hyperscale through Pre-Production Testing. In Proceedings of the 18th USENIX Symposium on Operating...
  35. [35] S. Y. Chu, J. W. Kim, and M. Y. Yi. Think Together and Work Better: Combining Humans' and LLMs' Think-Aloud Outcomes for Effective Text Evaluation. In Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems (CHI'25), Apr. 2025
  36. [36] D. Ford, F. Labelle, F. I. Popovici, M. Stokely, V.-A. Truong, L. Barroso, C. Grimes, and S. Quinlan. Availability in Globally Distributed Storage Systems. In Proceedings of the 9th USENIX Symposium on Operating Systems Design and Implementation (OSDI'10), Oct. 2010
  37. [37] Y. Gan, Y. Zhang, D. Cheng, A. Shetty, P. Rathi, N. Katarki, A. Bruno, J. Hu, B. Ritchken, B. Jackson, K. Hu, M. Pancholi, Y. He, B. Clancy, C. Colen, F. Wen, C. Leung, S. Wang, L. Zaruvinsky, M. Espinosa, R. Lin, Z. Liu, J. Padilla, and C. Delimitrou. An Open-Source Benchmark Suite for Microservices and Their Hardware-Software Implications for Cloud &...
  38. [38] J. T. Gu, X. Sun, W. Zhang, Y. Jiang, C. Wang, M. Vaziri, O. Legunsen, and T. Xu. Acto: Automatic End-to-End Testing for Operation Correctness of Cloud System Management. In Proceedings of the 29th ACM Symposium on Operating Systems Principles (SOSP'23), Oct. 2023
  39. [39] J. T. Gu, Z. Tang, Y. Su, B. A. Stoica, X. Sun, W. X. Zheng, Y. Zhang, A. Rahman, C. Wang, and T. Xu. Who Watches the Watchers? On the Reliability of Softwarizing Cloud Application Management. In Proceedings of the 23rd USENIX Symposium on Networked Systems Design and Implementation (NSDI'26), May 2026
  40. [40] H. S. Gunawi, C. Rubio-González, A. C. Arpaci-Dusseau, R. H. Arpaci-Dusseau, and B. Liblit. EIO: Error Handling is Occasionally Correct. In Proceedings of the 6th USENIX Conference on File and Storage Technologies (FAST'08), Feb. 2008
  41. [41] H. S. Gunawi, M. Hao, R. O. Suminto, A. Laksono, A. D. Satria, J. Adityatama, and K. J. Eliazar. Why Does the Cloud Stop Computing? Lessons from Hundreds of Service Outages. In Proceedings of the 7th ACM Symposium on Cloud Computing (SoCC'16), Oct. 2016
  42. [42] H. S. Gunawi, R. O. Suminto, R. Sears, C. Golliher, S. Sundararaman, X. Lin, T. Emami, W. Sheng, N. Bidokhti, C. McCaffrey, G. Grider, P. M. Fields, K. Harms, R. B. Ross, A. Jacobson, R. Ricci, K. Webb, P. Alvaro, H. B. Runesha, M. Hao, and H. Li. Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems. In Proceedings of t...
  43. [43] S. Han, X. Hu, H. Huang, M. Jiang, and Y. Zhao. ADBench: Anomaly Detection Benchmark. In Proceedings of the 36th Annual Conference on Neural Information Processing Systems (NeurIPS'22), Nov. 2022
  44. [44] P. H. Hochschild, P. Turner, J. C. Mogul, R. Govindaraju, P. Ranganathan, D. E. Culler, and A. Vahdat. Cores that don't count. In Proceedings of the ACM SIGOPS 21st Workshop on Hot Topics in Operating Systems (HotOS'21), June 2021
  45. [45] L. Huang, M. Magnusson, A. B. Muralikrishna, S. Estyak, R. Isaacs, A. Aghayev, T. Zhu, and A. Charapko. Metastable Failures in the Wild. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI'22), July 2022
  46. [46] P. Huang, C. Guo, L. Zhou, J. R. Lorch, Y. Dang, M. Chintalapati, and R. Yao. Gray Failure: The Achilles' Heel of Cloud-Scale Systems. In Proceedings of the 16th Workshop on Hot Topics in Operating Systems (HotOS'17), pages 150–155, May 2017
  47. [47] R. Isaacs, P. Alvaro, R. Majumdar, K.-K. Muniswamy-Reddy, M. Salamati, and S. Soudjani. Analyzing Metastable Failures. In Proceedings of the ACM SIGOPS 20th Workshop on Hot Topics in Operating Systems (HotOS'25), May 2025
  48. [48] V. Jacob, F. Song, A. Stiegler, B. Rad, Y. Diao, and N. Tatbul. Exathlon: a Benchmark for Explainable Anomaly Detection Over Time Series. Proceedings of the VLDB Endowment (VLDB'21), 14(11):2613–2626, July 2021
  49. [49] S. Jha, S. Cui, S. Banerjee, T. Xu, J. Enos, M. Showerman, Z. T. Kalbarczyk, and R. K. Iyer. Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems. In Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis (SC'20), Nov. 2020
  50. [50] S. Jha, R. R. Arora, Y. Watanabe, T. Yanagawa, Y. Chen, J. Clark, B. Bhavya, M. Verma, H. Kumar, H. Kitahara, N. Zheutlin, S. Takano, D. Pathak, F. George, X. Wu, B. O. Turkkan, G. Vanloo, M. Nidd, T. Dai, O. Chatterjee, P. Gupta, S. Samanta, P. Aggarwal, R. Lee, J. wook Ahn, D. Kar, A. Paradkar, Y. Deng, P. Moogi, P. Mohapatra, N. Abe, C. Narayanaswam...
  51. [51] J. Kaldor, J. Mace, M. Bejda, E. Gao, W. Kuropatwa, J. O'Neill, K. W. Ong, B. Schaller, P. Shan, B. Viscomi, V. Venkataraman, K. Veeraraghavan, and Y. J. Song. Canopy: An End-to-End Performance Tracing and Analysis System. In Proceedings of the 26th Symposium on Operating Systems Principles (SOSP'17), Oct. 2017
  52. [52] J.-C. Laprie. Dependable Computing: Concepts, Limits, Challenges. In Proceedings of the 25th IEEE International Symposium on Fault-Tolerant Computing (FTCS'95), June 1995
  53. [53] Y. Lee, J. Kim, J. Kim, H. Cho, J. Kang, P. Kang, and N. Kim. CheckEval: A Reliable LLM-as-a-Judge Framework for Evaluating Text Generation Using Checklists. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP'25), Nov. 2025
  54. [54] H. Liu, S. Lu, M. Musuvathi, and S. Nath. What Bugs Cause Production Cloud Incidents? In Proceedings of the 17th Workshop on Hot Topics in Operating Systems (HotOS'19), May 2019
  55. [55] Y. Liu, C. Pei, L. Xu, B. Chen, M. Sun, Z. Zhang, Y. Sun, S. Zhang, K. Wang, H. Zhang, J. Li, G. Xie, X. Wen, X. Nie, M. Ma, and D. Pei. OpsEval: A Comprehensive IT Operations Benchmark Suite for Large Language Models. arXiv:2310.07637, June 2025
  56. [56] D. Lo, L. Cheng, R. Govindaraju, P. Ranganathan, and C. Kozyrakis. Heracles: Improving Resource Efficiency at Scale. In Proceedings of the 42nd Annual International Symposium on Computer Architecture (ISCA'15), June 2015
  57. [57] D. Loker. AI vs Human Code Gen Report: AI Code Creates 1.7x More Issues. https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report, 2026
  58. [58] Y. Luo, J. Jiang, J. Feng, L. Tao, Q. Zhang, X. Wen, Y. Sun, S. Zhang, and D. Pei. From Observability Data to Diagnosis: An Evolving Multi-agent System for Incident Management in Cloud Systems. arXiv:2510.24145, Nov. 2025
  59. [59] J. Mao, L. Li, Y. Gao, Z. Peng, S. He, C. Zhang, S. Qin, S. Khalid, Q. Lin, S. Rajmohan, et al. StepFly: Agentic Troubleshooting Guide Automation for Incident Diagnosis. arXiv:2510.10074, Apr. 2026
  60. [60] M. A. Merrill, A. G. Shaw, N. Carlini, B. Li, H. Raj, I. Bercovich, L. Shi, J. Y. Shin, T. Walshe, E. K. Buchanan, J. Shen, G. Ye, H. Lin, J. Poulos, M. Wang, M. Nezhurina, J. Jitsev, D. Lu, O. M. Mastromichalakis, Z. Xu, Z. Chen, Y. Liu, R. Zhang, L. L. Chen, A. Kashyap, J.-L. Uslu, J. Li, J. Wu, M. Yan, S. Bian, V. Sharma, K. Sun, S. Dillmann, A. Ana...
  61. [61] J. J. Meza, T. Gowda, A. Eid, T. Ijaware, D. Chernyshev, Y. Yu, M. N. Uddin, R. Das, C. Nachiappan, S. Tran, S. Shi, T. Luo, D. K. Hong, S. Panneerselvam, H. Ragas, S. Manavski, W. Wang, and F. Richard. Defcon: Preventing Overload with Graceful Feature Degradation. In Proceedings of the 17th USENIX Symposium on Operating Systems Design and Implementat...
  62. [62] N. Palix, G. Thomas, S. Saha, C. Calvès, J. Lawall, and G. Muller. Faults in Linux: Ten Years Later. In Proceedings of the 16th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS'11), Mar. 2011
  63. [63] T. Patel and D. Tiwari. CLITE: Efficient and QoS-Aware Co-Location of Multiple Latency-Critical Jobs for Warehouse Scale Computers. In Proceedings of the 2020 IEEE International Symposium on High Performance Computer Architecture (HPCA'20), Feb. 2020
  64. [64] C. Pei, Z. Wang, F. Liu, Z. Li, Y. Liu, X. He, R. Kang, T. Zhang, J. Chen, J. Li, G. Xie, and D. Pei. Flow-of-Action: SOP Enhanced LLM-Based Multi-Agent System for Root Cause Analysis. In Companion Proceedings of the ACM on Web Conference 2025 (WWW'25), May 2025
  65. [65] T. Pelkonen, S. Franklin, J. Teller, P. Cavallaro, Q. Huang, J. Meza, and K. Veeraraghavan. Gorilla: a Fast, Scalable, In-memory Time Series Database. Proceedings of the VLDB Endowment (VLDB'15), 8(12):1816–1827, Aug. 2015
  66. [66] B. Schroeder, S. Damouras, and P. Gill. Understanding Latent Sector Errors and How to Protect against Them. In Proceedings of the 8th USENIX Conference on File and Storage Technologies (FAST'10), Feb. 2010
  67. [67] B. H. Sigelman, L. A. Barroso, M. Burrows, P. Stephenson, M. Plakal, D. Beaver, S. Jaspan, and C. Shanbhag. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report dapper-2010-1, Google, Inc., Apr. 2010
  68. [68] X. Sun, R. Cheng, J. Chen, E. Ang, O. Legunsen, and T. Xu. Testing Configuration Changes in Context to Prevent Production Failures. In Proceedings of the 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI'20), Nov. 2020
  69. [69] X. Sun, W. Luo, J. T. Gu, A. Ganesan, R. Alagappan, M. Gasch, L. Suresh, and T. Xu. Automatic Reliability Testing for Cluster Management Controllers. In Proceedings of the 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI'22), July 2022
  70. [70] P. Tang, S. Tang, H. Pu, Z. Miao, and Z. Wang. MicroRCA-Agent: Microservice Root Cause Analysis Method Based on Large Language Model Agents. arXiv:2509.15635, Sept. 2025
  71. [71] Y. Tian, Y. Liu, Z. Chong, Z. Huang, and H.-A. Jacobsen. GALA: Can Graph-Augmented Large Language Model Agentic Workflows Elevate Root Cause Analysis? arXiv:2508.12472, Aug. 2025
  72. [72] B. Treynor, M. Dahlin, V. Rau, and B. Beyer. The Calculus of Service Availability. Communications of the ACM (CACM), 60(9):42–47, Aug. 2017
  73. [73] K. Veeraraghavan, J. Meza, S. Michelson, S. Panneerselvam, A. Gyori, D. Chou, S. Margulis, D. Obenshain, S. Padmanabha, A. Shah, Y. J. Song, and T. Xu. Maelstrom: Mitigating Datacenter-level Disasters by Draining Interdependent Traffic Safely and Efficiently. In Proceedings of the 13th USENIX Symposium on Operating Systems Design and Implementation (O...
  74. [74] H. Wang, Q. Mang, A. Cheung, K. Sen, and D. Song. How We Broke Top AI Agent Benchmarks: And What Comes Next. https://rdi.berkeley.edu/blog/trustworthy-benchmarks-cont/, 2026
  75. [75] Y. Wang, G. Yu, H. Huang, Z. Wang, Y. Huang, P. Chen, and M. R. Lyu. Cloud-OpsBench: A Reproducible Benchmark for Agentic Root Cause Analysis in Cloud Systems. arXiv:2603.00468, Feb. 2026
  76. [76] Z. Wang, Z. Liu, Y. Zhang, A. Zhong, J. Wang, F. Yin, L. Fan, L. Wu, and Q. Wen. RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM'24), Oct. 2024
  77. [77] J. Xu, Q. Zhang, Z. Zhong, S. He, C. Zhang, Q. Lin, D. Pei, P. He, D. Zhang, and Q. Zhang. OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures? In Proceedings of the 13th International Conference on Learning Representations (ICLR'25), Jan. 2025
  78. [78] T. Xu, J. Zhang, P. Huang, J. Zheng, T. Sheng, D. Yuan, Y. Zhou, and S. Pasupathy. Do Not Blame Users for Misconfigurations. In Proceedings of the 24th ACM Symposium on Operating Systems Principles (SOSP'13), Nov. 2013
  79. [79] T. Xu, X. Jin, P. Huang, Y. Zhou, S. Lu, L. Jin, and S. Pasupathy. Early Detection of Configuration Errors to Reduce Failure Damage. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI'16), Nov. 2016
  80. [80] S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR'23), May 2023
Showing first 80 references.