Recognition: no theorem link
A Comparative Study of Semantic Log Representations for Software Log-based Anomaly Detection
Pith reviewed 2026-05-10 18:04 UTC · model grok-4.3
The pith
QTyBERT produces log embeddings that match or beat BERT-based anomaly detection while generating them far faster on CPUs.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
QTyBERT uses SysBE, a lightweight BERT variant with system-specific quantization, to encode log events into embeddings on CPUs, and trains CroSysEh without labels on logs pooled from multiple systems to capture the semantic structure of BERT's embedding space. When these embeddings feed the same deep learning models, anomaly detection effectiveness on the BGL, Thunderbird, and Spirit datasets matches or exceeds that of full BERT embeddings, while log embedding generation time drops close to that of static word embedding methods such as Word2Vec or FastText.
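To give a concrete sense of what quantization buys in a setup like SysBE's, here is a minimal numpy sketch of symmetric post-training int8 quantization of an encoder weight block. SysBE's actual scheme is not detailed in the excerpt; the 4x storage saving and bounded rounding error shown are generic properties of int8 quantization, not the paper's method.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-row post-training quantization to int8.

    Returns int8 codes plus per-row float scales; dequantization is
    codes * scales. This mirrors the storage/compute saving a quantized
    encoder targets, not SysBE's (unspecified) system-specific scheme.
    """
    scales = np.abs(weights).max(axis=1, keepdims=True) / 127.0
    scales[scales == 0] = 1.0  # guard all-zero rows
    codes = np.clip(np.round(weights / scales), -127, 127).astype(np.int8)
    return codes, scales

def dequantize(codes: np.ndarray, scales: np.ndarray) -> np.ndarray:
    return codes.astype(np.float32) * scales

rng = np.random.default_rng(0)
W = rng.normal(size=(768, 768)).astype(np.float32)  # a BERT-sized weight block
codes, scales = quantize_int8(W)
W_hat = dequantize(codes, scales)

print(codes.nbytes / W.nbytes)          # 0.25: int8 is 4x smaller than float32
print(float(np.abs(W - W_hat).max()))   # rounding error bounded by scale / 2
```

Per-row scales keep the rounding error proportional to each row's dynamic range, which is why the reconstruction stays close to the original weights.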
What carries the argument
QTyBERT, which combines system-specific quantization in SysBE with cross-system unsupervised enhancement via CroSysEh to produce usable log embeddings.
If this is right
- DL models can achieve BERT-level log anomaly detection on BGL, Thunderbird, and Spirit without the long embedding generation times that limit BERT in CPU settings.
- Static word embeddings remain the fastest option, but QTyBERT avoids most of their effectiveness shortfall while approaching their generation speed.
- The method supports practical deployment of semantic log analysis in environments where GPU access is limited or latency matters.
- Quantization and multi-system pretraining can be reused to adapt other contextual embedding approaches for log data.
Where Pith is reading between the lines
- Teams could first adopt QTyBERT for broad CPU-based monitoring and switch to full BERT only on subsets where extra accuracy justifies the cost.
- The cross-system training idea might extend to other embedding tasks that need both efficiency and semantic depth, such as code snippet classification.
- If the multi-system training generalizes well, similar lightweight variants could reduce reliance on large pretrained models across software engineering tasks.
Load-bearing premise
Training CroSysEh on unlabeled logs from multiple systems will improve the semantic quality of SysBE embeddings for the target datasets without adding biases or dropping important system-specific details.
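The premise can be made concrete with a stand-in enhancer. CroSysEh's architecture is not specified in the excerpt, so this sketch fits an unsupervised PCA projection on embeddings pooled from several toy "systems" as one plausible way to capture shared semantic structure without labels; `fit_enhancer`, the dimensions, and the toy data are all illustrative assumptions.

```python
import numpy as np

def fit_enhancer(pooled: np.ndarray, k: int):
    """Fit an unsupervised linear map on pooled multi-system embeddings.

    PCA via SVD serves here as a stand-in for CroSysEh: trained with no
    labels on logs pooled across systems, then applied per target system.
    """
    mean = pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(pooled - mean, full_matrices=False)
    return mean, vt[:k]  # top-k directions of the shared embedding space

def enhance(emb: np.ndarray, mean: np.ndarray, components: np.ndarray):
    return (emb - mean) @ components.T

rng = np.random.default_rng(1)
# toy embeddings from three systems sharing a 16-dim latent structure
shared = rng.normal(size=(600, 16)) @ rng.normal(size=(16, 128))
pooled = shared + 0.1 * rng.normal(size=(600, 128))

mean, comps = fit_enhancer(pooled, k=16)
target = pooled[:200]                 # one "system's" logs
z = enhance(target, mean, comps)
print(z.shape)                        # (200, 16): compact shared-space embedding
```

The risk the premise names is visible in this framing: directions dominated by the pooled majority could crowd out target-system detail, which is exactly what an ablation would need to rule out.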
What would settle it
The claim would fail if, on a held-out log dataset, DL models using QTyBERT embeddings produced substantially lower anomaly detection F1 scores than the same models using full BERT embeddings, even while retaining the claimed speed advantage.
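One way to operationalize this test. The 0.05 F1 tolerance is our assumption (the paper only says "comparable to or better"), and the predictions and timings below are hypothetical placeholders, not reported results.

```python
import numpy as np

def f1_score(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    """Binary F1 from hard per-event predictions (anomaly = 1)."""
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def claim_survives(y_true, pred_qty, pred_bert, t_qty, t_bert, gap=0.05):
    """The claim is refuted when QTyBERT keeps its speed edge yet trails
    BERT's F1 by more than `gap` on held-out logs."""
    delta = f1_score(y_true, pred_bert) - f1_score(y_true, pred_qty)
    return not (t_qty < t_bert and delta > gap)

# hypothetical held-out labels and model outputs
y  = np.array([1, 1, 1, 0, 0, 0, 0, 1])
pq = np.array([1, 1, 0, 0, 0, 0, 0, 1])   # QTyBERT misses one anomaly
pb = np.array([1, 1, 1, 0, 0, 0, 0, 1])   # BERT catches all of them

print(claim_survives(y, pq, pb, t_qty=2.0, t_bert=40.0))  # False: refuted here
```

On real data the comparison should of course be repeated across datasets and seeds rather than decided by a single split.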
read the original abstract
Recent deep learning (DL) methods for log anomaly detection increasingly rely on semantic log representation methods that convert the textual content of log events into vector embeddings as input to DL models. However, these DL methods are typically evaluated as end-to-end pipelines, while the impact of different semantic representation methods is not well understood. In this paper, we benchmark widely used semantic log representation methods, including static word embedding methods (Word2Vec, GloVe, and FastText) and the BERT-based contextual embedding method, across diverse DL models for log-event level anomaly detection on three publicly available log datasets: BGL, Thunderbird, and Spirit. We identify an effectiveness-efficiency trade-off under CPU deployment settings: the BERT-based method is more effective, but incurs substantially longer log embedding generation time, limiting its practicality; static word embedding methods are efficient but are generally less effective and may yield insufficient detection performance. Motivated by this finding, we propose QTyBERT, a novel semantic log representation method that better balances this trade-off. QTyBERT uses SysBE, a lightweight BERT variant with system-specific quantization, to efficiently encode log events into vector embeddings on CPUs, and leverages CroSysEh to enhance the semantic expressiveness of these log embeddings. CroSysEh is trained unsupervisedly using unlabeled logs from multiple systems to capture the underlying semantic structure of the BERT model's embedding space. We evaluate QTyBERT against existing semantic log representation methods. Our results show that, for the DL models, using QTyBERT-generated log embeddings achieves detection effectiveness comparable to or better than BERT-generated log embeddings, while bringing log embedding generation time closer to that of static word embedding methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper benchmarks static word embeddings (Word2Vec, GloVe, FastText) and BERT-based contextual embeddings for deep learning models in log-event anomaly detection on the public BGL, Thunderbird, and Spirit datasets. It identifies an effectiveness-efficiency trade-off under CPU settings and proposes QTyBERT, a method using SysBE (a lightweight BERT variant with system-specific quantization) combined with CroSysEh (an unsupervised cross-system enhancer trained on pooled multi-system logs) to achieve detection performance comparable or superior to BERT while approaching the speed of static embeddings.
Significance. If the central claims hold after addressing experimental gaps, the work would be significant for software engineering and systems reliability by clarifying trade-offs in semantic log representations and offering a practical CPU-friendly alternative. Strengths include the systematic comparison across multiple public datasets and DL models, plus the explicit focus on deployment constraints, which could guide more efficient anomaly detection pipelines.
major comments (2)
- [Proposed Method] The central effectiveness claim for QTyBERT (comparable or better than BERT) depends on CroSysEh's ability to enhance semantics via unsupervised multi-system training without dilution or negative transfer of target-system information; however, the manuscript provides no ablation studies, analysis of system-specific vocabulary retention, or tests for bias introduction on BGL/Thunderbird/Spirit, leaving the robustness of the trade-off unverified.
- [Experiments] The experimental evaluation lacks critical details on the precise anomaly detection metrics (e.g., F1, precision-recall), statistical significance tests, hyperparameter selection and tuning procedures, and any safeguards against post-hoc model or threshold selections; these omissions directly affect the reliability of the reported benchmark results and the claimed balance between effectiveness and efficiency.
minor comments (2)
- [Abstract] The abstract and method descriptions introduce SysBE and CroSysEh without clear initial definitions or distinctions from standard BERT components; sharper introductions would improve readability for readers unfamiliar with the variants.
- The paper would benefit from explicit discussion of potential limitations of pooling logs across systems in CroSysEh, even if preliminary results appear positive.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important areas for strengthening the manuscript. We address each major comment below and will incorporate revisions to improve the analysis and experimental details.
read point-by-point responses
Referee: [Proposed Method] The central effectiveness claim for QTyBERT (comparable or better than BERT) depends on CroSysEh's ability to enhance semantics via unsupervised multi-system training without dilution or negative transfer of target-system information; however, the manuscript provides no ablation studies, analysis of system-specific vocabulary retention, or tests for bias introduction on BGL/Thunderbird/Spirit, leaving the robustness of the trade-off unverified.
Authors: We acknowledge that dedicated ablation studies would provide stronger evidence for CroSysEh's role and confirm the lack of negative transfer or bias. In the revised manuscript, we will add ablations comparing SysBE embeddings alone versus full QTyBERT (with CroSysEh) across all three datasets. We will also include analysis of system-specific vocabulary retention (e.g., token overlap and embedding similarity metrics between original and enhanced representations) and bias checks via performance on system-specific anomaly subsets. These additions will directly verify the robustness of the reported effectiveness-efficiency trade-off. revision: yes
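One lightweight form a vocabulary-retention analysis could take. The metric and the toy data are our illustration of the kind of check the rebuttal promises, not a procedure taken from the paper.

```python
import numpy as np

def retention_score(original: np.ndarray, enhanced: np.ndarray) -> float:
    """Mean cosine similarity between each log event's embedding before and
    after enhancement. A score near 1 suggests system-specific information
    survives; a score near 0 suggests it has been washed out."""
    a = original / np.linalg.norm(original, axis=1, keepdims=True)
    b = enhanced / np.linalg.norm(enhanced, axis=1, keepdims=True)
    return float(np.mean(np.sum(a * b, axis=1)))

rng = np.random.default_rng(2)
emb = rng.normal(size=(500, 128))                 # one system's embeddings
mild = emb + 0.05 * rng.normal(size=emb.shape)    # light, structure-preserving change
scrambled = rng.normal(size=emb.shape)            # information destroyed

print(retention_score(emb, mild) > 0.99)          # True: signal retained
print(abs(retention_score(emb, scrambled)) < 0.1) # True: no retention
```

In practice this would be computed per system (BGL, Thunderbird, Spirit) and complemented by the token-overlap analysis the authors mention.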
Referee: [Experiments] The experimental evaluation lacks critical details on the precise anomaly detection metrics (e.g., F1, precision-recall), statistical significance tests, hyperparameter selection and tuning procedures, and any safeguards against post-hoc model or threshold selections; these omissions directly affect the reliability of the reported benchmark results and the claimed balance between effectiveness and efficiency.
Authors: We agree that additional experimental details are necessary for full reproducibility and to substantiate the benchmark claims. The revised Experiments section will explicitly report all metrics (F1, precision, recall, and AUC where applicable), include statistical significance tests (e.g., paired t-tests or McNemar's test with p-values for model comparisons), detail the hyperparameter tuning process (including search ranges, validation strategy, and selection criteria), and describe threshold selection safeguards (e.g., fixed use of validation sets only, with the exact procedure documented to avoid post-hoc bias). These changes will be made without altering the core results. revision: yes
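A self-contained sketch of the exact McNemar's test the rebuttal proposes, applied to paired per-event predictions from two embedding variants run through the same DL model. The toy predictions are hypothetical.

```python
from math import comb

def mcnemar_exact(y_true, pred_a, pred_b) -> float:
    """Exact two-sided McNemar test on paired predictions.

    Only discordant events matter: b = A correct / B wrong,
    c = B correct / A wrong. Small p-values indicate the two variants
    genuinely differ on the same test events.
    """
    b = c = 0
    for t, pa, pb in zip(y_true, pred_a, pred_b):
        if pa == t and pb != t:
            b += 1
        elif pb == t and pa != t:
            c += 1
    n = b + c
    if n == 0:
        return 1.0  # variants agree everywhere they matter
    k = min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(p, 1.0)

# toy check: 8 discordant events, all favoring variant B
y  = [1] * 8 + [0] * 4
pa = [0] * 8 + [0] * 4   # A misses every anomaly B catches
pb = [1] * 8 + [0] * 4
print(mcnemar_exact(y, pa, pb))   # 0.0078125, significant at the 0.05 level
```

The exact binomial form is preferable to the chi-squared approximation when discordant counts are small, which is common on heavily imbalanced anomaly test sets.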
Circularity Check
No circularity: empirical benchmarking on external public datasets
full rationale
The paper performs a comparative empirical study by evaluating static word embeddings and BERT-based methods across DL models on the public BGL, Thunderbird, and Spirit datasets. It identifies a trade-off from these experiments, proposes QTyBERT (SysBE + CroSysEh) motivated by the observed results, and validates the new method via direct performance and timing comparisons against baselines. No equations, fitted parameters, or self-referential definitions are present in the provided text; claims do not reduce to inputs by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked. The evaluation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- Quantization parameters for SysBE
- Training hyperparameters for CroSysEh
axioms (1)
- domain assumption: Log events possess semantic structures that embedding models can capture to improve anomaly detection over raw or static representations.
invented entities (2)
- SysBE: no independent evidence
- CroSysEh: no independent evidence