REFLEX: Reference-Free Evaluation of Log Summarization via Large Language Model Judgment

Priyanka Mudgal

arXiv: 2511.07458 · v2 · submitted 2025-11-06 · cs.CL · cs.AI · cs.LG · cs.SE

Pith reviewed 2026-05-18 00:21 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.LG · cs.SE

keywords: log summarization · reference-free evaluation · LLM judgment · zero-shot evaluation · ROUGE · BLEU · summary quality assessment

The pith

Large language models serve as zero-shot judges to evaluate log summaries without any reference texts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents REFLEX as a method that directs LLMs to score log summaries on relevance, informativeness, and coherence using only the input log and the candidate summary. Existing metrics like ROUGE and BLEU require gold references that are rarely available for system logs, so they often fail to separate good summaries from poor ones. If the LLM judgments hold up, evaluators can run assessments on any new model output in real deployments where no human-curated reference exists. The work reports that these scores remain stable across datasets and separate competing summarizers more clearly than lexical-overlap baselines. This opens evaluation for practical log summarization pipelines that previously lacked reliable automatic checks.
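To make the lexical-overlap limitation concrete, here is a minimal sketch of a simplified ROUGE-1 F1 (unigram overlap only, no stemming). It is not the official ROUGE package, and the three log-style sentences are hypothetical examples written for illustration, not data from the paper.

```python
from collections import Counter


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1: a simplified stand-in for ROUGE-1."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Clipped overlap: each token counts at most as often as it appears in both.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical reference summary and two hypothetical candidates.
reference = "block replication failed on datanode 3"
paraphrase = "a storage node could not copy the data chunk"  # same meaning, new words
near_copy = "block replication succeeded on datanode 3"      # opposite meaning

paraphrase_score = rouge1_f1(paraphrase, reference)
near_copy_score = rouge1_f1(near_copy, reference)
print(paraphrase_score, round(near_copy_score, 3))
```

In this toy case the faithful paraphrase shares no tokens with the reference, while the candidate that reverses the outcome shares five of six, which is the kind of failure a surface-overlap metric can exhibit.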

Core claim

REFLEX directs an LLM to act as a zero-shot evaluator that rates a log summary along relevance, informativeness, and coherence without ever seeing a reference summary or any human labels, and the resulting scores distinguish model outputs more effectively than ROUGE or BLEU across multiple log summarization datasets.

What carries the argument

LLM zero-shot judgment on explicit quality dimensions, which replaces the need for reference texts by directly comparing the summary to the original log.
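A minimal sketch of how such a reference-free judgment could be wired up is shown here. The prompt wording, the 1 to 5 scale, the JSON reply format, and the `fake_llm` stand-in are all assumptions made for illustration; the page notes that the paper's actual prompt templates and temperature settings are not provided.

```python
import json
from typing import Callable, Dict

DIMENSIONS = ("relevance", "informativeness", "coherence")


def build_prompt(log_text: str, summary: str) -> str:
    """Hypothetical prompt: contains only the log and the candidate summary."""
    dims = ", ".join(DIMENSIONS)
    return (
        "You are evaluating a summary of a system log.\n"
        f"Rate the summary from 1 (poor) to 5 (excellent) on: {dims}.\n"
        'Answer with JSON only, e.g. {"relevance": 3, "informativeness": 3, "coherence": 3}.\n\n'
        f"LOG:\n{log_text}\n\nSUMMARY:\n{summary}\n"
    )


def parse_scores(reply: str) -> Dict[str, int]:
    """Pull one integer score per dimension out of the judge's reply."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object in judge reply")
    data = json.loads(reply[start:end + 1])
    scores = {}
    for dim in DIMENSIONS:
        value = int(data[dim])  # KeyError if the judge skipped a dimension
        if not 1 <= value <= 5:
            raise ValueError(f"{dim} score out of range: {value}")
        scores[dim] = value
    return scores


def judge(log_text: str, summary: str, llm: Callable[[str], str]) -> Dict[str, int]:
    """Score one summary; `llm` is any function mapping a prompt to a reply."""
    return parse_scores(llm(build_prompt(log_text, summary)))


def fake_llm(prompt: str) -> str:
    # Stand-in for a real model call so the sketch runs offline.
    return 'Scores: {"relevance": 4, "informativeness": 3, "coherence": 5}'


scores = judge(
    "081109 203615 INFO dfs.DataNode: Receiving block blk_1",  # hypothetical log line
    "A block was received by a data node.",
    fake_llm,
)
print(scores)
```

Passing the model as a plain callable keeps the sketch independent of any particular LLM client, and strict parsing makes malformed judge replies fail loudly instead of silently producing a score.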

If this is right

Evaluation becomes possible for any log summarizer even when no gold reference summaries have been created.
Fine-grained scores on separate dimensions let developers identify whether a model fails on relevance, informativeness, or coherence.
Repeated runs on the same outputs produce stable rankings, allowing reliable comparison of new summarization methods over time.
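One way the ranking-stability implication could be checked is to compare the summarizer rankings produced by repeated judge runs. The sketch below uses the mean pairwise Kendall tau as the stability measure; that choice, the model names, and the scores are assumptions for illustration, not the paper's protocol or results.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List


def kendall_tau(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Kendall tau-a between two score tables keyed by summarizer name."""
    concordant = discordant = 0
    pairs = list(combinations(sorted(a), 2))
    for x, y in pairs:
        sign = (a[x] - a[y]) * (b[x] - b[y])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # Ties in either run count as neither concordant nor discordant.
    return (concordant - discordant) / len(pairs)


def ranking_stability(runs: List[Dict[str, float]]) -> float:
    """Mean tau over all pairs of runs; 1.0 means identical rankings every time."""
    return mean(kendall_tau(r1, r2) for r1, r2 in combinations(runs, 2))


# Made-up mean judge scores for three hypothetical summarizers over three runs.
runs = [
    {"model_a": 4.2, "model_b": 3.1, "model_c": 2.5},
    {"model_a": 4.0, "model_b": 3.3, "model_c": 2.7},
    {"model_a": 4.1, "model_b": 2.6, "model_c": 2.9},  # b and c swap places
]
stability = ranking_stability(runs)
print(round(stability, 3))
```

A value near 1.0 would support the stability claim for a given judge and dataset, while a single swapped pair, as in the third run, already pulls the average down noticeably.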

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same LLM judgment pattern could be tested on other reference-scarce summarization domains such as code or medical notes.
If the dimension scores prove reliable, they could replace some human annotation steps in benchmarking suites.
Different base LLMs might yield different absolute scores, so cross-model calibration would be needed before comparing results from separate studies.

Load-bearing premise

Large language models can produce accurate and consistent ratings of summary quality dimensions without any reference summaries or task-specific training.

What would settle it

Human raters score the same set of log summaries and the correlation between those scores and REFLEX outputs falls below the correlation shown by ROUGE or BLEU.
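The test described above can be sketched as a comparison of two Spearman rank correlations against human ratings. The numbers below are toy values invented to show the mechanics; they are not measurements from the paper, which reports no human ratings.

```python
from typing import List, Sequence


def ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            result[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return result


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / (vx * vy) ** 0.5


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation computed on ranks."""
    return pearson(ranks(x), ranks(y))


# Toy scores for five summaries (illustrative only).
human = [5, 4, 2, 1, 3]
reflex = [4.5, 4.0, 2.5, 1.0, 3.0]
rouge = [0.30, 0.10, 0.40, 0.20, 0.25]

rho_reflex = spearman(human, reflex)
rho_rouge = spearman(human, rouge)
# The claim would be undermined if rho_reflex came out below rho_rouge.
print(rho_reflex, rho_rouge)
```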

Figures

Figures reproduced from arXiv: 2511.07458 by Priyanka Mudgal.

**Figure 1.** REFLEX uses LLM to generate summaries from logs and evaluates them automatically, without requiring human-written …

**Figure 2.** HDFS block update log messages and provided sum…

**Figure 3.** Comparison of similarity and ROUGE scores across log types for three REFLEX variants.

Original abstract

Evaluating log summarization systems is challenging due to the lack of high-quality reference summaries and the limitations of existing metrics like ROUGE and BLEU, which depend on surface-level lexical overlap. We introduce REFLEX, a reference-free evaluation metric for log summarization based on large language model (LLM) judgment. REFLEX uses LLMs as zero-shot evaluators to assess summary quality along dimensions such as relevance, informativeness, and coherence, without requiring gold-standard references or human annotations. We show that REFLEX produces stable, interpretable, and fine-grained evaluations across multiple log summarization dataset, and more effectively distinguishes model outputs than traditional metrics. REFLEX provides a scalable alternative for evaluating log summaries in real-world settings where reference data is scarce or unavailable.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces REFLEX, a reference-free evaluation metric for log summarization that uses large language models as zero-shot judges to score summaries on dimensions including relevance, informativeness, and coherence. It claims that REFLEX yields stable, interpretable, and fine-grained evaluations across multiple log summarization datasets and distinguishes model outputs more effectively than surface-level metrics such as ROUGE and BLEU, without requiring reference summaries or human annotations.

Significance. If the central claim holds after proper validation, REFLEX would address a practical gap in log summarization evaluation where reference data is scarce. A scalable, reference-free metric grounded in LLM judgment could support real-world deployment. The work would benefit from explicit credit for any reproducible prompting protocols or multi-dataset experiments that demonstrate stability.

major comments (2)

[Abstract and §4] Abstract and §4 (Experiments): The claim of superior discrimination and stability versus ROUGE/BLEU is asserted without reported statistical tests, effect sizes, or controls for post-hoc dataset selection; this undermines the cross-dataset generalization argument.
[§3 and §5] §3 (Methodology) and §5 (Results): The load-bearing assumption that zero-shot LLM scores on log-specific dimensions (temporal ordering, error patterns, terminology) correlate with human expert judgment is not supported by any inter-rater agreement, Spearman, or Pearson coefficients on held-out log summaries; without this, the metric may reflect LLM stylistic preferences rather than quality.

minor comments (2)

[Abstract] Abstract: 'dataset' should be pluralized to 'datasets'.
[§3] The exact LLM prompt templates and temperature settings used for judgment are not provided, hindering reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Referee: [Abstract and §4] Abstract and §4 (Experiments): The claim of superior discrimination and stability versus ROUGE/BLEU is asserted without reported statistical tests, effect sizes, or controls for post-hoc dataset selection; this undermines the cross-dataset generalization argument.

Authors: We agree that formal statistical tests and effect sizes would strengthen the claims of superior discrimination and stability. In the revised manuscript we will add paired statistical comparisons (e.g., Wilcoxon signed-rank tests) between REFLEX and ROUGE/BLEU scores across all datasets, report effect sizes, and explicitly document the dataset selection criteria and inclusion rationale to support the generalization argument.

revision: yes

Referee: [§3 and §5] §3 (Methodology) and §5 (Results): The load-bearing assumption that zero-shot LLM scores on log-specific dimensions (temporal ordering, error patterns, terminology) correlate with human expert judgment is not supported by any inter-rater agreement, Spearman, or Pearson coefficients on held-out log summaries; without this, the metric may reflect LLM stylistic preferences rather than quality.

Authors: We acknowledge that direct quantitative validation against human judgments is absent from the current experiments. The manuscript instead demonstrates REFLEX through cross-dataset stability and differentiation from surface metrics, supported by qualitative case studies in §5. We will revise §5 and the limitations section to explicitly note the lack of human correlation data as a limitation and outline plans for future human validation studies.

revision: partial
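The paired comparison promised in the first response could look roughly like the sketch below: a Wilcoxon signed-rank test with a normal approximation and a rank-biserial effect size, written in plain Python. What gets paired is an assumption here (a per-log score gap between two summarizers under each metric, rescaled to a common range), and the numbers are invented for illustration. For real analyses a vetted statistics library would be preferable, since this version applies no tie or continuity correction.

```python
import math
from typing import Sequence, Tuple


def wilcoxon_signed_rank(x: Sequence[float], y: Sequence[float]) -> Tuple[float, float, float]:
    """Return (W, two-sided p, rank-biserial r) for paired samples x and y."""
    diffs = [a - b for a, b in zip(x, y) if a != b]  # zero differences are dropped
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    # Rank absolute differences, averaging ranks over ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    # Normal approximation to the null distribution of W.
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    z = (w - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))
    effect = (w_plus - w_minus) / (n * (n + 1) / 2)
    return w, p, effect


# Invented per-log score gaps between two summarizers, rescaled to [0, 1].
reflex_gap = [0.60, 0.55, 0.40, 0.70, 0.30, 0.50, 0.65, 0.45]
rouge_gap = [0.35, 0.25, 0.30, 0.30, 0.35, 0.30, 0.30, 0.30]

w, p, effect = wilcoxon_signed_rank(reflex_gap, rouge_gap)
print(w, round(p, 4), round(effect, 3))
```

With only eight pairs the normal approximation is rough, so an exact test would be the safer choice at this sample size.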

standing simulated objections not resolved

Current experiments contain no human expert ratings, so inter-rater agreement or correlation coefficients with held-out log summaries cannot be computed or reported from existing data.

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper proposes REFLEX as a reference-free evaluation method that directly invokes external LLM zero-shot judgments on summary quality dimensions without any internal equations, fitted parameters, or self-referential definitions. No load-bearing steps reduce the claimed stability or discrimination power to quantities defined by the paper's own choices or prior self-citations; the central premise rests on the external capability of LLMs rather than any construction that equates outputs to inputs by design. This is the most common honest finding for a purely empirical proposal of this type.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the unverified reliability of LLM zero-shot judgments for log-specific summaries; no free parameters are explicitly introduced, but the method implicitly assumes LLM consistency across datasets.

axioms (1)

Domain assumption: LLMs can provide stable and accurate zero-shot evaluations of summary quality dimensions without references or fine-tuning.
This premise is required for REFLEX to function as a valid metric and is invoked in the description of the evaluation process.

invented entities (1)

REFLEX metric (no independent evidence)
purpose: Reference-free evaluation of log summaries using LLM judgment
Newly defined evaluation procedure introduced in the paper.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "REFLEX uses LLMs as zero-shot evaluators to assess summary quality along dimensions such as relevance, informativeness, and coherence, without requiring gold-standard references"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We show that REFLEX produces stable, interpretable, and fine-grained evaluations across multiple log summarization datasets"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

