pith. machine review for the scientific record.

arxiv: 2604.16810 · v1 · submitted 2026-04-18 · 💻 cs.SE

Recognition: unknown

Gleaner: A Semantically-Rich and Efficient Online Sampler for Microservice Diagnostics


Pith reviewed 2026-05-10 07:20 UTC · model grok-4.3

classification 💻 cs.SE
keywords microservices · distributed tracing · tail sampling · root cause analysis · online sampler · trace grouping · bag of edges

The pith

Gleaner shows that microservice traces can be sampled online by representing each trace as a bag of edges with log semantics rather than as a graph, improving root cause analysis accuracy even at low sampling rates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that high-fidelity sampling of distributed traces does not require modeling them as explicit graphs. By representing each trace as a bag of edges augmented with log semantics and using fast set-based operations, Gleaner groups and prioritizes traces efficiently enough for online use. It adds an alarm-driven quota and diversity strategy to focus on anomalous traces for downstream diagnostics. If correct, this would let systems sample aggressively while actually raising the quality of automated root cause analysis above what full unsampled data provides. Readers would care because trace volumes overwhelm diagnostics today and slow graph methods cannot run in real time.

Core claim

Gleaner is an online tail-sampling framework founded on the insight that explicit graph structures are unnecessary for high-fidelity trace grouping. It represents each trace as a bag-of-edges augmented with log semantics and replaces slow graph algorithms with efficient set-based operations. An alarm-driven quota and diversity-preserving strategy prioritize anomalous and rare traces. The approach processes traces at 0.74 ms each, improves trace pattern coverage by up to 128.7 percent and Shannon entropy by up to 32.9 percent over baselines, and at a 1 percent sampling rate raises root cause analysis accuracy by 42 to 107 percent over the next-best sampler while outperforming analysis on the full unsampled dataset.

What carries the argument

The bag-of-edges representation of each trace augmented with log semantics, which enables set-based operations to replace graph algorithms for grouping and prioritization.
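The review does not spell out the representation's internals, so the following Python sketch is one plausible reading, not Gleaner's actual schema: assume spans carry `span_id`, `parent_id`, and `service` fields and that logs have already been parsed into template IDs. Each trace then collapses to a frozen set of (caller, callee, log-template-set) edges, so grouping is a hash lookup rather than a graph-isomorphism check.

```python
from collections import defaultdict

def trace_signature(spans, log_templates):
    """Collapse a trace into a bag of edges: one (caller, callee, templates)
    tuple per parent->child call, where templates are the log template IDs
    seen on the child span. All field names are illustrative assumptions."""
    by_id = {s["span_id"]: s for s in spans}
    edges = set()
    for s in spans:
        parent = by_id.get(s.get("parent_id"))
        if parent is not None:
            templates = frozenset(log_templates.get(s["span_id"], ()))
            edges.add((parent["service"], s["service"], templates))
    return frozenset(edges)  # hashable, so grouping needs no graph algorithms

def group_traces(traces, log_templates):
    """Bucket traces by signature with a single dict lookup per trace."""
    groups = defaultdict(list)
    for trace_id, spans in traces.items():
        groups[trace_signature(spans, log_templates)].append(trace_id)
    return groups
```

Under this reading, two traces with the same call edges and log templates land in one bucket regardless of span IDs or timing, while a trace whose child span emits a new error template forms a new group — exactly the semantic signal a span-centric sampler would miss.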

If this is right

  • Online tail sampling becomes feasible at high throughput without graph-analysis bottlenecks.
  • Trace pattern coverage rises by up to 128.7 percent and Shannon entropy by up to 32.9 percent versus prior samplers.
  • Root cause analysis accuracy at 1 percent sampling exceeds the next-best sampler by 42 to 107 percent.
  • Root cause analysis on the sampled subset outperforms analysis on the entire unsampled dataset.
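The first two bullets name the evaluation metrics. Their textbook definitions fit in a few lines; the paper may define them slightly differently (for example, over weighted patterns), so treat the following as the standard reading rather than the paper's exact protocol:

```python
import math
from collections import Counter

def pattern_coverage(sampled_patterns, all_patterns):
    """Fraction of distinct trace patterns that survive sampling."""
    return len(set(sampled_patterns)) / len(set(all_patterns))

def shannon_entropy(sampled_patterns):
    """Entropy (in bits) of the pattern distribution of the sample;
    higher means less redundancy among the kept traces."""
    counts = Counter(sampled_patterns)
    n = sum(counts.values())
    return -sum((c / n) * math.log2(c / n) for c in counts.values())
```

Under these definitions, a head-biased sampler keeping three copies of the dominant pattern scores coverage 1/3 and entropy 0, while a diversity-preserving sampler keeping one trace per pattern scores coverage 1.0 and entropy log2(3) ≈ 1.58 bits — the direction of the reported gains.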

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same representation might apply to sampling in other large-scale event streams where graph construction is costly.
  • Log semantics can substitute for detailed structural modeling in trace-based diagnostics.
  • Systems could adopt always-on low-rate sampling as a default diagnostic enhancer rather than an optional reduction step.

Load-bearing premise

That representing each trace as a bag-of-edges augmented with log semantics combined with set-based operations is sufficient to achieve high-fidelity grouping and prioritization for root-cause analysis without needing explicit graph structures.
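The prioritization half of this premise — the alarm-driven quota plus diversity preservation — can be sketched as a two-stage budget split. The 50 percent alarm quota and the round-robin pass over groups below are illustrative assumptions; the paper specifies its own policy.

```python
def round_robin(group_list, k):
    """Take up to k traces, one per group per pass, so rare groups are
    represented before any single group is sampled twice."""
    picked, idx = [], 0
    while len(picked) < k:
        progressed = False
        for g in group_list:
            if idx < len(g["traces"]):
                picked.append(g["traces"][idx])
                progressed = True
                if len(picked) == k:
                    return picked
        if not progressed:  # every group exhausted
            return picked
        idx += 1
    return picked

def sample_with_quota(groups, budget, alarm_fraction=0.5):
    """Reserve part of the budget for groups carrying alarms, then spend
    the remainder across normal groups for diversity. `alarm_fraction`
    is a hypothetical knob, not a parameter from the paper."""
    alarmed = [g for g in groups.values() if g["alarmed"]]
    normal = [g for g in groups.values() if not g["alarmed"]]
    picked = round_robin(alarmed, int(budget * alarm_fraction))
    picked += round_robin(normal, budget - len(picked))
    return picked
```

If the alarmed groups cannot fill their quota, the unused budget falls through to the diversity pass, so the sampler never returns fewer traces than the data allows.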

What would settle it

A controlled experiment on additional microservice applications where root cause analysis accuracy on Gleaner's 1 percent sampled data falls below accuracy on the full unsampled dataset or below graph-based samplers would falsify the performance claims.

Figures

Figures reproduced from arXiv:2604.16810 by Aoyang Fang, Pinjia He, Songhan Zhang, and Yifan Yang (The Chinese University of Hong Kong, Shenzhen).

Figure 1: Example trace illustrating the semantic blind spot of span-centric samplers. Despite normal span …
Figure 2: The three-stage pipeline architecture of Gleaner. First, the …
Figure 3: Root span grouping validation on the Train-Ticket benchmark. (a) Sparsity analysis: ratio between request …
Figure 4: Sampling quality evaluation on Dataset A. Left two columns (4.1a–d): coverage and diversity metrics showing Gleaner's consistent superiority across all dimensions. Rightmost column (4.2a–b): anomaly and rarity capture demonstrating Gleaner's dramatic advantage in prioritizing diagnostically relevant traces.
Figure 5: Cross-system evaluation on Dataset B (5 microservice benchmarks). Gleaner consistently outperforms baselines across most systems, with comparable performance to TracePicker on simpler architectures.
Figure 6: Ablation study, Group 1: impact of semantic components (logs, alarms) and structural representation …
Figure 7: Ablation study, Group 2: performance comparison of different sampling strategies, highlighting the …
read the original abstract

Distributed tracing in microservices is critical for diagnostics but generates overwhelming data volumes, necessitating intelligent sampling. To maximize fidelity, state-of-the-art (SOTA) tail-based samplers analyze complete (or even log-enriched) traces by modeling them as graphs. However, this reliance on computationally expensive graph analysis creates a performance bottleneck that prohibits their use in online settings. To this end, we propose Gleaner, an online tail-sampling framework that breaks this trade-off. It is founded on the key insight that explicit graph structures are unnecessary for high-fidelity trace grouping. Instead, Gleaner represents each trace as a "bag-of-edges" augmented with log semantics, replacing slow graph algorithms with highly efficient set-based operations. It also employs an alarm-driven quota and a diversity-preserving strategy to prioritize anomalous and rare traces for downstream Root Cause Analysis (RCA). Experimentally, Gleaner processes traces at 0.74ms each, improving Trace Pattern Coverage by up to 128.7% and Shannon Entropy by up to 32.9% over baselines. At just a 1% sampling rate, Gleaner improves RCA accuracy by 42%-107% over the next-best sampler. Moreover, RCA on Gleaner's sampled data is more accurate than with the entire, unsampled dataset. This result reframes intelligent sampling from a data reduction technique to a powerful signal enhancement paradigm for automated operations.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated authors' rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes Gleaner, an online tail-sampling framework for microservice diagnostics that represents each trace as a bag-of-edges augmented with log semantics and replaces graph algorithms with efficient set-based operations. It incorporates an alarm-driven quota and diversity-preserving strategy to prioritize anomalous and rare traces. The paper reports that Gleaner processes traces at 0.74 ms each, improves trace pattern coverage by up to 128.7% and Shannon entropy by up to 32.9% over baselines, and at a 1% sampling rate improves RCA accuracy by 42-107% over the next-best sampler while also outperforming RCA accuracy on the full unsampled dataset.

Significance. If the results hold, particularly the finding that a 1% intelligently sampled subset yields higher RCA accuracy than the entire trace collection, the work has substantial practical significance for automated operations in microservices. It reframes sampling as signal enhancement rather than data reduction and demonstrates that avoiding explicit graph structures can enable online deployment without sacrificing (and potentially improving) diagnostic fidelity. The concrete performance numbers and the counterintuitive RCA result are strengths that could influence both research and production tracing systems.

major comments (1)
  1. Abstract: The claim that RCA accuracy on Gleaner's 1% sampled data exceeds accuracy on the full unsampled dataset is load-bearing for the signal-enhancement reframing but is not accompanied by a description of the RCA algorithm, the exact accuracy metric (e.g., precision on injected faults or success rate on root-cause labels), or controls confirming that the full-dataset run used identical hyperparameters, feature extraction, and computational treatment as the sampled runs. Without these, the superiority could be an artifact of mismatched experimental conditions rather than evidence that the discarded traces systematically degrade RCA performance.
minor comments (2)
  1. The abstract would be strengthened by briefly naming the datasets, trace collection sizes, and specific baselines used for the coverage, entropy, and RCA experiments.
  2. Notation for 'bag-of-edges' and how log semantics are encoded into the set representation should be defined more explicitly in the methods section to aid reproducibility.

Simulated Authors' Rebuttal

1 response · 0 unresolved

Thank you for your review and for recognizing the potential impact of our work on microservice diagnostics. We respond to the major comment below.

read point-by-point responses
  1. Referee: Abstract: The claim that RCA accuracy on Gleaner's 1% sampled data exceeds accuracy on the full unsampled dataset is load-bearing for the signal-enhancement reframing but is not accompanied by a description of the RCA algorithm, the exact accuracy metric (e.g., precision on injected faults or success rate on root-cause labels), or controls confirming that the full-dataset run used identical hyperparameters, feature extraction, and computational treatment as the sampled runs. Without these, the superiority could be an artifact of mismatched experimental conditions rather than evidence that the discarded traces systematically degrade RCA performance.

    Authors: We appreciate the referee highlighting the need for clearer context around this key result. The details of the RCA algorithm, the exact accuracy metric (success rate on root-cause identification for injected faults), and the experimental controls are provided in the Evaluation section of the manuscript. All runs—full dataset and sampled—used identical hyperparameters, feature extraction, and computational treatment to ensure direct comparability. To address the concern that the abstract lacks sufficient accompanying description, we will revise the abstract to briefly note the RCA evaluation protocol and affirm the use of identical conditions across experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical results independent of internal definitions

full rationale

The paper's claims rest on an empirical evaluation of an online sampler that represents traces as bag-of-edges plus log semantics and applies set operations plus alarm-driven quota/diversity rules. Processing latency (0.74 ms), coverage (+128.7%), entropy (+32.9%), and RCA accuracy gains (42–107% at a 1% rate, plus superiority to the full trace set) are reported as measured outcomes against external baselines and the unsampled collection. No equations, fitted parameters, or self-citations are invoked to derive these quantities from the method itself; the central insight (graph structures are unnecessary) is presented as a design choice justified by runtime measurements rather than by a self-referential theorem or a renaming of prior results. The evidential chain is therefore grounded in external benchmarks rather than internal definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that set operations on a bag-of-edges plus log semantics preserve enough structural information for accurate trace grouping and RCA; no free parameters or new invented entities are explicitly introduced in the abstract.

axioms (1)
  • domain assumption Set-based operations on bag-of-edges representations suffice for high-fidelity trace grouping without explicit graph structures.
    This premise is invoked to justify replacing graph algorithms with faster set operations.

pith-pipeline@v0.9.0 · 5580 in / 1456 out tokens · 55976 ms · 2026-05-10T07:20:15.486939+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

55 extracted references · 34 canonical work pages

  1. [1]

    Chetan Bansal, Sundararajan Renganathan, Ashima Asudani, Olivier Midy, and Mathru Janakiraman. 2020. DeCaf: diagnosing and triaging performance issues in large-scale cloud services. InProceedings of the ACM/IEEE 42nd International Conference on Software Engineering: Software Engineering in Practice(Seoul, South Korea)(ICSE-SEIP ’20). Association for Compu...

  2. [2]

    Ivan Beschastnikh, Perry Liu, Albert Xing, Patty Wang, Yuriy Brun, and Michael D. Ernst. 2020. Visualizing Distributed System Executions.ACM Trans. Softw. Eng. Methodol.29, 2, Article 9 (March 2020), 38 pages. doi:10.1145/3375633

  3. [3]

    Chaos Mesh Authors. 2025. Chaos Mesh: Chaos Engineering Platform for Kubernetes. https://chaos-mesh.org/

  4. [4]

    Laming Chen, Guoxin Zhang, and Hanning Zhou. 2018. Fast greedy MAP inference for determinantal point process to improve recommendation diversity. InProceedings of the 32nd International Conference on Neural Information Processing Systems (NIPS’18). Curran Associates Inc., Red Hook, NY, USA, 5627–5638

  5. [5]

    Yu Chen, Zhi-Ming Xiao, and Fei Teng. 2024. A Root Cause Localization Method Based on Event Call Chains for Microservices. In2024 16th International Conference on Communication Software and Networks (ICCSN). 43–48. doi:10.1109/ICCSN63464.2024.10793332

  6. [6]

    Zhuangbin Chen, Junsong Pu, and Zibin Zheng. 2025. Tracezip: Efficient Distributed Tracing via Trace Compression. Proc. ACM Softw. Eng.2, ISSTA, Article ISSTA019 (June 2025), 23 pages. doi:10.1145/3728888

  7. [7]

    Aoyang Fang, Songhan Zhang, Yifan Yang, Haotong Wu, Junjielong Xu, Xuyang Wang, Rui Wang, Manyi Wang, Qisheng Lu, and Pinjia He. 2025. Rethinking the Evaluation of Microservice RCA with a Fault Propagation-Aware Benchmark. https://arxiv.org/abs/2510.04711v2

  8. [8]

    Yu Gan, Mingyu Liang, Sundar Dev, David Lo, and Christina Delimitrou. 2021. Sage: practical and scalable ML-driven performance debugging in microservices. InProceedings of the 26th ACM International Conference on Architectural Support for Programming Languages and Operating Systems(Virtual, USA)(ASPLOS ’21). Association for Computing Machinery, New York, ...

  9. [9]

    Fei Gao, Ruyue Xin, Xiaocui Li, and Yaqiang Zhang. 2025. Are GNNs Actually Effective for Multimodal Fault Diagnosis in Microservice Systems? . In2025 IEEE International Conference on Web Services (ICWS). IEEE Computer Society, Los Alamitos, CA, USA, 127–129. doi:10.1109/ICWS67624.2025.00025

  10. [10, 11]

    Shenghui Gu, Guoping Rong, Tian Ren, He Zhang, Haifeng Shen, Yongda Yu, Xian Li, Jian Ouyang, and Chunan Chen. 2023. TrinityRCL: Multi-Granular and Code-Level Root Cause Localization Using Multiple Types of Telemetry Data in Microservice Systems. IEEE Transactions on Software Engineering 49, 5 (2023), 3071–3088

  12. [12]

    Xiaofeng Guo, Xin Peng, Hanzhang Wang, Wanxue Li, Huai Jiang, Dan Ding, Tao Xie, and Liangfei Su. 2020. Graph- based trace analysis for microservice architecture understanding and problem diagnosis. InProceedings of the 28th ACM Joint Meeting on European Software Engineering Conference and Symposium on the Foundations of Software Engineering(Virtual Event...

  13. [13]

    Yongqi Han, Qingfeng Du, Ying Huang, Pengsheng Li, Xiaonan Shi, Jiaqi Wu, Pei Fang, Fulong Tian, and Cheng He. 2024. Holistic Root Cause Analysis for Failures in Cloud-Native Systems Through Observability Data.IEEE Transactions on Services Computing17, 6 (2024), 3789–3802. doi:10.1109/TSC.2024.3478759

  14. [14]

    Yongqi Han, Qingfeng Du, Ying Huang, Jiaqi Wu, Fulong Tian, and Cheng He. 2024. The Potential of One-Shot Failure Root Cause Analysis: Collaboration of the Large Language Model and Small Classifier. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering(Sacramento, CA, USA)(ASE ’24). Association for Computing Machine...

  15. [15]

    Pinjia He, Jieming Zhu, Zibin Zheng, and Michael R. Lyu. 2017. Drain: An Online Log Parsing Approach with Fixed Depth Tree. In2017 IEEE International Conference on Web Services (ICWS). 33–40. doi:10.1109/ICWS.2017.13

  16. [16]

    Shilin He, Botao Feng, Liqun Li, Xu Zhang, Yu Kang, Qingwei Lin, Saravan Rajmohan, and Dongmei Zhang. 2023. STEAM: observability-preserving trace sampling. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering. 1750–1761

  17. [17]

    Haiyu Huang, Cheng Chen, Kunyi Chen, Pengfei Chen, Guangba Yu, Zilong He, Yilun Wang, Huxing Zhang, and Qi Zhou. 2025. Mint: Cost-Efficient Tracing with All Requests Collection via Commonality and Variability Analysis. InProceedings of the 30th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 1(...

  18. [18, 19]

    Haiyu Huang, Xiaoyu Zhang, Pengfei Chen, Zilong He, Zhiming Chen, Guangba Yu, Hongyang Chen, and Chen Sun. 2024. TraStrainer: Adaptive Sampling for Distributed Traces with System Runtime State. Proceedings of the ACM on Software Engineering 1, FSE (2024), 473–493

  20. [20]

    Zicheng Huang, Pengfei Chen, Guangba Yu, Hongyang Chen, and Zibin Zheng. 2021. Sieve: Attention-based Sampling of End-to-End Trace Data in Distributed Microservice Systems. In2021 IEEE International Conference on Web Services (ICWS). 436–446. doi:10.1109/ICWS53863.2021.00063

  21. [21]

    Jaeger. 2025. Jaeger: open source, distributed tracing platform. Retrieved 2025-09-06 from https://www.jaegertracing.io/

  22. [22]

    Xinrui Jiang, Yicheng Pan, Meng Ma, and Ping Wang. 2023. Look Deep into the Microservice System Anomaly through Very Sparse Logs. InProceedings of the ACM Web Conference 2023. 2970–2978

  23. [23]

    Pedro Las-Casas, Jonathan Mace, Dorgival Guedes, and Rodrigo Fonseca. 2018. Weighted Sampling of Execution Traces: Capturing More Needles and Less Hay. InProceedings of the ACM Symposium on Cloud Computing (SoCC ’18). Association for Computing Machinery, New York, NY, USA, 326–332. doi:10.1145/3267809.3267841

  24. [24]

    Pedro Las-Casas, Giorgi Papakerashvili, Vaastav Anand, and Jonathan Mace. 2019. Sifter. In Proceedings of the ACM Symposium on Cloud Computing. 312–324. doi:10.1145/3357223.3362736

  25. [25]

    Cheryl Lee, Tianyi Yang, Zhuangbin Chen, Yuxin Su, and Michael R Lyu. 2023. Eadro: An end-to-end troubleshooting framework for microservices on multi-source data. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE). IEEE, 1750–1762

  26. [26]

    Bowen Li, Xin Peng, Qilin Xiang, Hanzhang Wang, Tao Xie, Jun Sun, and Xuanzhe Liu. 2021. Enjoy your observability: an industrial survey of microservice tracing and analysis.Empir Software Eng27 (Nov. 2021), 25. doi:10.1007/s10664- 021-10063-9

  27. [27]

    Ye Li, Jian Tan, Bin Wu, Xiao He, and Feifei Li. 2024. ShapleyIQ: Influence Quantification by Shapley Values for Performance Debugging of Microservices. InProceedings of the 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, Volume 4(Vancouver, BC, Canada)(ASPLOS ’23). Association for Computing Mach...

  28. [28]

    Hongyi Liu, Xiaosong Huang, Mengxi Jia, Tong Jia, Jing Han, Zhonghai Wu, and Ying Li. 2024. UAC-AD: Unsupervised Adversarial Contrastive Learning for Anomaly Detection on Multi-Modal Data in Microservice Systems.IEEE Transactions on Services Computing17, 6 (2024), 3887–3900. doi:10.1109/TSC.2024.3411481

  29. [29]

    OpenTelemetry Community. 2025. OpenTelemetry Concepts - Traces. Retrieved 2025-09-08 from https://opentelemetry.io/docs/concepts/signals/traces/

  30. [30]

    OpenTelemetry Community. 2025. OpenTelemetry Logging. Retrieved 2025-09-06 from https://opentelemetry.io/docs/specs/otel/logs/

  31. [31]

    OpenTelemetry Community. 2025. opentelemetry-specification/oteps/0265-event-vision.md at v1.48.0 · open-telemetry/opentelemetry-specification. Retrieved 2025-09-06 from https://github.com/open-telemetry/opentelemetry-specification/blob/v1.48.0/oteps/0265-event-vision.md

  32. [32]

    OpenTracing Community. 2025. OpenTracing Overview - Spans. Retrieved 2025-09-06 from https://opentracing.io/docs/overview/spans/

  33. [33]

    Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 2010. Dapper, a Large-Scale Distributed Systems Tracing Infrastructure. Technical Report. Google, Inc. http://research.google.com/archive/papers/dapper-2010-1.pdf

  34. [34]

    Chang-Ai Sun, Tao Zeng, Wanqing Zuo, and Huai Liu. 2023. A Trace-Log-Clusterings-Based Fault Localization Approach to Microservice Systems. In2023 IEEE International Conference on Web Services (ICWS). 7–13. doi:10.1109/ ICWS60048.2023.00013 ISSN: 2836-3868

  35. [35]

    Yongqian Sun, Binpeng Shi, Mingyu Mao, Minghua Ma, Sibo Xia, Shenglin Zhang, and Dan Pei. 2024. ART: A Unified Unsupervised Framework for Incident Management in Microservice Systems. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering(Sacramento, CA, USA)(ASE ’24). Association for Computing Machinery, New York, NY...

  36. [36]

    Xiangbo Tian, Shi Ying, Tiangang Li, Mengting Yuan, Ruijin Wang, Yishi Zhao, and Jianga Shang. 2024. iTCRL: Causal-Intervention-Based Trace Contrastive Representation Learning for Microservice Systems.IEEE Transactions on Software Engineering50, 10 (2024), 2583–2601. doi:10.1109/TSE.2024.3446532

  37. [37]

    Hanzhang Wang, Zhengkai Wu, Huai Jiang, Yichao Huang, Jiamu Wang, Selcuk Kopru, and Tao Xie. 2021. Groot: An event-graph-based approach for root cause analysis in industrial settings. In2021 36th IEEE/ACM International Conference on Automated Software Engineering (ASE). IEEE, 419–429

  38. [38]

    Yuqing Wang, Mika V. Mäntylä, Serge Demeyer, Mutlu Beyazıt, Joanna Kisaakye, and Jesse Nyyssölä. 2025. Cross-System Categorization of Abnormal Traces in Microservice-Based Systems via Meta-Learning. Proc. ACM Softw. Eng. 2, FSE (2025), FSE027:576–FSE027:598. doi:10.1145/3715742

  39. [39]

    Yidan Wang, Zhouruixing Zhu, Qiuai Fu, Yuchi Ma, and Pinjia He. 2024. MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering(Sacramento, CA, USA)(ASE ’24). Association for Computing Machinery, New York, NY, USA, 1057–1068. doi:10.1145/3691...

  40. [40]

    Li Wu, Johan Tordsson, Erik Elmroth, and Odej Kao. 2020. MicroRCA: Root Cause Localization of Performance Issues in Microservices. InNOMS 2020 - 2020 IEEE/IFIP Network Operations and Management Symposium. 1–9. doi:10.1109/ NOMS47738.2020.9110353

  41. [41]

    Shuaiyu Xie, Jian Wang, Maodong Li, Peiran Chen, Jifeng Xuan, and Bing Li. 2025. TracePicker: Optimization-Based Trace Sampling for Microservice-Based Systems.Proc. ACM Softw. Eng.2, FSE, Article FSE081 (June 2025), 22 pages. doi:10.1145/3729351

  42. [42]

    Junjielong Xu, Qinan Zhang, Zhiqing Zhong, Shilin He, Chaoyun Zhang, Qingwei Lin, Dan Pei, Pinjia He, Dongmei Zhang, and Qi Zhang. 2025. OpenRCA: Can Large Language Models Locate the Root Cause of Software Failures?. In The Thirteenth International Conference on Learning Representations. https://openreview.net/forum?id=M4qNIzQYpd

  43. [43]

    Yifan Yang. 2025. Gleaner: A Semantically-Rich and Efficient Online Sampler for Microservice Diagnostics. doi:10.5281/zenodo.19637628

  44. [44]

    Yifan Yang. 2026. Gleaner: Implementation and Artifacts. https://github.com/OperationsPAI/Gleaner. Accessed: 2026-04-18

  45. [45]

    Zhenhe Yao, Changhua Pei, Wenxiao Chen, Hanzhang Wang, Liangfei Su, Huai Jiang, Zhe Xie, Xiaohui Nie, and Dan Pei. 2024. Chain-of-Event: Interpretable Root Cause Analysis for Microservices through Automatically Learning Weighted Event Causal Graph. InCompanion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering(...

  46. [46]

    Guangba Yu, Pengfei Chen, Yufeng Li, Hongyang Chen, Xiaoyun Li, and Zibin Zheng. 2023. Nezha: Interpretable Fine-Grained Root Causes Analysis for Microservices on Multi-modal Observability Data. InProceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (San Francisco, CA, USA)(ESE...

  47. [47]

    Chenxi Zhang, Xin Peng, Chaofeng Sha, Ke Zhang, Zhenqing Fu, Xiya Wu, Qingwei Lin, and Dongmei Zhang. 2022. DeepTraLog: Trace-Log Combined Microservice Anomaly Detection through Graph-based Deep Learning. In2022 IEEE/ACM 44th International Conference on Software Engineering (ICSE). 623–634. doi:10.1145/3510003.3510180

  48. [48]

    Chenxi Zhang, Xin Peng, Tong Zhou, Chaofeng Sha, Zhenghui Yan, Yiru Chen, and Hong Yang. 2022. TraceCRL: contrastive representation learning for microservice trace analysis. InProceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering(Singapore, Singapore) (ESEC/FSE 2022). Associatio...

  49. [49]

    Lei Zhang, Zhiqiang Xie, Vaastav Anand, Ymir Vigfusson, and Jonathan Mace. 2023. The Benefit of Hindsight: Tracing Edge-Cases in Distributed Systems. In20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23). USENIX Association, Boston, MA, 321–339. https://www.usenix.org/conference/nsdi23/presentation/zhang-lei

  50. [50]

    Shenglin Zhang, Pengxiang Jin, Zihan Lin, Yongqian Sun, Bicheng Zhang, Sibo Xia, Zhengdan Li, Zhenyu Zhong, Minghua Ma, Wa Jin, Dai Zhang, Zhenyu Zhu, and Dan Pei. 2023. Robust Failure Diagnosis of Microservice System Through Multimodal Data.IEEE Transactions on Services Computing16, 6 (2023), 3851–3864. doi:10.1109/TSC.2023. 3290018

  51. [51]

    Shenglin Zhang, Sibo Xia, Wenzhao Fan, Binpeng Shi, Xiao Xiong, Zhenyu Zhong, Minghua Ma, Yongqian Sun, and Dan Pei. 2025. Failure Diagnosis in Microservice Systems: A Comprehensive Survey and Analysis.ACM Trans. Softw. Eng. Methodol.(Jan. 2025). doi:10.1145/3715005 Just Accepted

  52. [52]

    Wei Zhang, Hongcheng Guo, Jian Yang, Zhoujin Tian, Yi Zhang, Yan Chaoran, Zhoujun Li, Tongliang Li, Xu Shi, Liangfan Zheng, and Bo Zhang. 2024. mABC: Multi-Agent Blockchain-inspired Collaboration for Root Cause Analysis in Micro-Services Architecture. InFindings of the Association for Computational Linguistics: EMNLP 2024, Yaser Al- Onaizan, Mohit Bansal,...

  53. [53]

    Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Wenhai Li, and Dan Ding. 2018. Fault analysis and debugging of microservice systems: Industrial survey, benchmark system, and empirical study.IEEE Transactions on Software Engineering47, 2 (2018), 243–260

  54. [54]

    Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chao Ji, Wenhai Li, and Dan Ding. 2021. Fault Analysis and Debugging of Microservice Systems: Industrial Survey, Benchmark System, and Empirical Study.IEEE Transactions on Software Engineering47 (Feb. 2021), 243–260. doi:10.1109/TSE.2018.2887384 Conference Name: IEEE Transactions on Software Engineering

  55. [55]

    Xiang Zhou, Xin Peng, Tao Xie, Jun Sun, Chenjie Xu, Chao Ji, and Wenyun Zhao. 2018. Benchmarking microservice systems for software engineering research. InProceedings of the 40th International Conference on Software Engineering: Companion Proceeedings, ICSE 2018, Gothenburg, Sweden, May 27 - June 03, 2018, Michel Chaudron, Ivica Crnkovic, Marsha Chechik, ...