pith. machine review for the scientific record.

arxiv: 2605.04431 · v1 · submitted 2026-05-06 · 💻 cs.SE · cs.AI

Towards Robust LLM Post-Training: Automatic Failure Management for Reinforcement Fine-Tuning

Hongyi Liu, Kening Zheng, Liancheng Fang, Lingzhe Zhang, Philip S. Yu, Tong Jia, Xiaosong Huang, Ying Li, Yunpeng Zhai

Pith reviewed 2026-05-08 18:14 UTC · model grok-4.3

classification 💻 cs.SE cs.AI
keywords reinforcement fine-tuning · LLM post-training · failure management · anomaly detection · fault diagnosis · automatic remediation · training robustness · benchmark

The pith

RFT failures during LLM post-training can be automatically detected, diagnosed, and remediated by a unified framework.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper first constructs RFT-FaultBench, a large collection of simulated failure cases spanning multiple fault categories in reinforcement fine-tuning runs. Analysis of these cases shows that failures leave clear traces in the training metrics and carry distinctive identifying patterns. The authors then present RFT-FM, which uses these insights to automatically detect when something is wrong, determine what type of failure occurred, and take corrective action. This matters because RFT currently depends on slow, expert-led manual fixes that hinder reliable scaling. The results indicate the benchmark poses a real challenge while the framework handles the injected failures well.

Core claim

The paper claims that by building RFT-FaultBench and studying its failure data, it is possible to see that RFT failures are observable from training dynamics and distinguishable through their empirical fault fingerprints. This enables RFT-FM, an automatic failure management framework for reinforcement fine-tuning that unifies anomaly detection, failure diagnosis, and auto remediation in a closed loop.

What carries the argument

RFT-FM, a closed-loop automatic failure management system for reinforcement fine-tuning that performs anomaly detection, diagnosis, and remediation based on training dynamics and the fault fingerprints identified in RFT-FaultBench.
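
The paper does not publish RFT-FM's code or interfaces, so the sketch below only illustrates the closed-loop shape the authors describe: watch step-level telemetry, flag a deviation, look up a fault fingerprint, apply a corrective action. Every function name, threshold, and remediation rule is hypothetical; the telemetry dimensions (reward, KL divergence, entropy, response length) are those named in the paper's figures.

import statistics

# Hypothetical closed loop: detect -> diagnose -> remediate over RFT telemetry.
# Illustrative only; not the authors' implementation.

METRICS = ("reward", "kl", "entropy", "response_len")

def detect(window):
    """Return the first metric whose latest value deviates far from the window mean."""
    for m in METRICS:
        series = [step[m] for step in window]
        mu = statistics.mean(series)
        sigma = statistics.pstdev(series) or 1e-8
        if abs(series[-1] - mu) / sigma > 3.0:  # illustrative 3-sigma rule
            return m
    return None

def diagnose(metric):
    """Toy fingerprint lookup: map the deviating metric to a fault family."""
    return {"kl": "policy divergence", "reward": "reward collapse",
            "entropy": "entropy collapse",
            "response_len": "length explosion"}.get(metric, "unknown")

def remediate(fault, config):
    """Invented corrective actions; real remediation would be far richer."""
    if fault == "policy divergence":
        config["kl_coef"] *= 2.0           # tighten the KL penalty
    elif fault == "entropy collapse":
        config["entropy_bonus"] += 0.01    # re-encourage exploration
    return config

def monitor(telemetry_stream, config, horizon=20):
    """Run the closed loop over a stream of per-step telemetry dicts."""
    window = []
    for step in telemetry_stream:
        window = (window + [step])[-horizon:]
        if len(window) == horizon:
            fault_metric = detect(window)
            if fault_metric:
                config = remediate(diagnose(fault_metric), config)
    return config

The 20-step horizon echoes Figure 7, where detection F1 peaks around 18–20 observation steps in the easy setting.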

If this is right

  • Practitioners can reduce dependence on manual inspection during RFT processes.
  • Failures become manageable through automated responses rather than process halts.
  • The benchmark provides a standard way to test and improve failure management tools.
  • RFT processes gain reliability as the framework mitigates issues in real time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Extending this to production LLM training could minimize wasted compute on faulty runs.
  • Similar fingerprinting techniques might help in diagnosing issues in other types of model training beyond RFT.
  • Future work could explore integrating RFT-FM with existing training platforms for seamless adoption.

Load-bearing premise

The fault families and types used to create the benchmark are representative of the failures that naturally occur during real-world reinforcement fine-tuning of large language models.

What would settle it

Observing whether RFT-FM accurately detects and remediates failures in actual RFT training runs that have no pre-injected faults, compared to human expert analysis of the same runs.
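
Concretely, such a comparison could be scored as agreement between the framework's flags and expert labels on the same runs. A minimal sketch, assuming both sides mark anomalous step indices; the function and the example data are hypothetical.

def agreement(fm_flags, expert_flags):
    """fm_flags, expert_flags: sets of step indices each side judges anomalous."""
    tp = len(fm_flags & expert_flags)
    precision = tp / len(fm_flags) if fm_flags else 0.0
    recall = tp / len(expert_flags) if expert_flags else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# e.g. agreement({120, 305, 410}, {118, 120, 305}) -> (0.67, 0.67, 0.67)

Exact-index matching is strict; a real protocol would likely count a detection within a few steps of an expert-labeled onset as a hit.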

Figures

Figures reproduced from arXiv: 2605.04431 by Hongyi Liu, Kening Zheng, Liancheng Fang, Lingzhe Zhang, Philip S. Yu, Tong Jia, Xiaosong Huang, Ying Li, Yunpeng Zhai.

Figure 1. Training Anomalies in Reinforcement Fine-Tuning: From Manual …
Figure 2. Architecture of RFT-Specific Anomaly Injection
Figure 3. Observability of RFT Failures in Training Dynamics
Figure 5. Architecture of RFT-FM. Building on the empirical findings that RFT training failures are both observable and distinguishable in training dynamics, the authors propose RFT-FM, an automatic failure management framework for reinforcement fine-tuning.
Figure 6. Case Study of RFT-FM Remediation (TE-2). Remediation brings the run substantially closer to the normal training regime across multiple telemetry dimensions, including reward, KL divergence, entropy, and response length. Successful mitigation does not simply reduce a scalar anomaly score; it partially restores the underlying trajectory structure toward healt…
Figure 7. Hyperparameter Experiment. Under the easy setting, detection F1 rises steadily from shorter horizons and reaches its strongest performance around 18–20 steps, indicating that longer observation windows provide more complete evidence for identifying structured deviations from healthy RFT dynamics. Under the hard setting, the gains are much smaller after the initial increase, suggesting that subtle anomalies expos…
read the original abstract

Reinforcement fine-tuning (RFT) has become a core paradigm for post-training large language models, yet its training process remains highly fragile. Existing efforts mainly improve reliability at the system level or address specific issues in individual subproblems by modifying RFT algorithms. Despite their effectiveness, they largely overlook the problem of failure management at the training-process level. When training goes wrong, practitioners still rely heavily on expert-driven manual inspection and correction, and automatic failure management for RFT remains largely unexplored. In this paper, we take a first step toward systematic failure management for reinforcement fine-tuning. To understand the empirical structure of RFT failures, we first construct RFT-FaultBench, the first benchmark for fine-grained failures in reinforcement fine-tuning, covering 5 fault families, 16 fault types, 779 training runs, 22,549 train-step records, and 1,457,288 trajectory-level records. Based on this benchmark, we conduct a comprehensive empirical study showing that RFT failures are both observable from training dynamics and distinguishable through their empirical fault fingerprints. Building on these findings, we propose RFT-FM, an automatic failure management framework for reinforcement fine-tuning that unifies anomaly detection, failure diagnosis, and auto remediation in a closed loop. Experimental results show that RFT-FaultBench is neither trivial nor saturated: it exhibits clear anomaly structure while still posing substantial challenges, especially under subtle fault settings. Moreover, RFT-FM shows strong capability in detecting, diagnosing, and mitigating RFT failures.
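
The abstract's claim that failures are "distinguishable through their empirical fault fingerprints" suggests that summary statistics of the training curves separate fault types. Below is a minimal sketch of one way to operationalize that idea, as nearest-centroid matching on invented features; the paper's actual fingerprint construction may differ entirely.

import math

# Hypothetical fingerprint matching: summarize a run's training dynamics
# into a feature vector and assign the nearest fault-type centroid.

def features(run):
    """run: dict mapping metric name -> list of per-step values."""
    vec = []
    for m in ("reward", "kl", "entropy", "response_len"):
        xs = run[m]
        mean = sum(xs) / len(xs)
        slope = (xs[-1] - xs[0]) / max(len(xs) - 1, 1)  # crude trend
        vec += [mean, slope]
    return vec

def diagnose(run, centroids):
    """centroids: fault type -> feature vector averaged over labeled failure runs."""
    f = features(run)
    return min(centroids, key=lambda fault: math.dist(f, centroids[fault]))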

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper introduces RFT-FaultBench, the first benchmark for fine-grained failures in reinforcement fine-tuning covering 5 fault families, 16 fault types, 779 training runs, 22,549 train-step records, and 1,457,288 trajectory-level records. An empirical study shows that RFT failures exhibit observable dynamics and distinguishable fault fingerprints. Building on this, the authors propose RFT-FM, a closed-loop automatic failure management framework unifying anomaly detection, diagnosis, and remediation, claiming that the benchmark exhibits clear anomaly structure and that RFT-FM demonstrates strong capability in detecting, diagnosing, and mitigating RFT failures.

Significance. If the synthetic fault injections prove representative of real RFT failure distributions, the work would be significant as the first systematic treatment of failure management at the training-process level for reinforcement fine-tuning of LLMs. The construction of a large-scale benchmark with trajectory-level records and the demonstration of distinguishable empirical fingerprints provide a concrete foundation for future automated tools, moving beyond ad-hoc manual inspection.

major comments (2)
  1. [Abstract] The claim that 'RFT-FM shows strong capability in detecting, diagnosing, and mitigating RFT failures' is presented without any quantitative metrics, baseline comparisons, or details on how remediation success was measured, making the strength of the empirical results impossible to assess from the reported summary.
  2. [Benchmark construction and experimental evaluation] The central claim that RFT-FM exhibits strong capability 'in practice' rests on the untested premise that the 5 deliberately injected fault families and 16 fault types in the 779 controlled runs produce fingerprints that statistically match the failure modes arising in uncontrolled real-world RFT runs (e.g., reward hacking, noisy-gradient instability, or environment-specific collapse). No evidence is supplied that the benchmark's fault fingerprints align with those from practitioner-collected failed training runs, which is load-bearing for the practical significance of the closed-loop pipeline.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive feedback. The comments highlight important aspects of how we present our empirical claims and the scope of our benchmark. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract] The claim that 'RFT-FM shows strong capability in detecting, diagnosing, and mitigating RFT failures' is presented without any quantitative metrics, baseline comparisons, or details on how remediation success was measured, making the strength of the empirical results impossible to assess from the reported summary.

    Authors: We agree that the abstract would benefit from more concrete quantitative support. The current abstract summarizes the overall findings at a high level. In the revised manuscript we will expand the abstract to report key performance numbers from our experiments, including detection metrics (e.g., F1 scores under different fault settings), diagnosis accuracy across the 16 fault types, and remediation success rates relative to the baselines we evaluated. This will make the strength of the RFT-FM results directly assessable from the abstract. revision: yes

  2. Referee: [Benchmark construction and experimental evaluation] The central claim that RFT-FM exhibits strong capability 'in practice' rests on the untested premise that the 5 deliberately injected fault families and 16 fault types in the 779 controlled runs produce fingerprints that statistically match the failure modes arising in uncontrolled real-world RFT runs (e.g., reward hacking, noisy-gradient instability, or environment-specific collapse). No evidence is supplied that the benchmark's fault fingerprints align with those from practitioner-collected failed training runs, which is load-bearing for the practical significance of the closed-loop pipeline.

    Authors: We acknowledge the importance of this point. The fault families and types in RFT-FaultBench were chosen to reflect failure modes frequently described in the RFT literature (reward hacking, gradient instability, data-quality collapse, etc.). Our empirical analysis of the 779 controlled runs shows that these injected faults produce observable dynamics and distinguishable fingerprints. However, we do not supply direct statistical alignment against a corpus of real-world failed RFT runs, as such large-scale practitioner data is not publicly available and is difficult to collect systematically. We will revise the manuscript to state this limitation explicitly in the benchmark construction and discussion sections and to outline concrete directions for future validation with real-world traces. The current benchmark still offers the first large-scale, trajectory-level resource for studying and automating RFT failure management. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical construction and held-out evaluation are independent

full rationale

The paper's chain consists of independent benchmark construction (injecting 5 fault families into 779 controlled runs to produce RFT-FaultBench), an empirical study of observable dynamics and fingerprints on that data, and evaluation of the proposed RFT-FM closed-loop framework on held-out runs. No equations, fitted parameters, or self-citations reduce the claimed detection, diagnosis, or remediation performance to inputs defined from the same data by construction. The benchmark and framework are built and tested separately, so the evaluation is not circular.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the assumption that injected faults in the benchmark mirror natural RFT failures and that training dynamics contain sufficient signal for diagnosis. No new physical or mathematical axioms are introduced.

axioms (1)
  • domain assumption Reinforcement fine-tuning of LLMs produces observable training dynamics that can be used for anomaly detection.
    Invoked when claiming failures are distinguishable through empirical fault fingerprints.
invented entities (1)
  • RFT-FM framework no independent evidence
    purpose: Unifies anomaly detection, failure diagnosis, and auto remediation in a closed loop.
    New system proposed in the paper; no independent evidence provided beyond the authors' experiments.

pith-pipeline@v0.9.0 · 5599 in / 1162 out tokens · 38341 ms · 2026-05-08T18:14:39.813616+00:00 · methodology
