Sample-Efficient LLM-Based Detection of Malicious Web Server Logs with Forensically Explainable Reasoning
Pith reviewed 2026-06-27 17:59 UTC · model grok-4.3
The pith
A five-step reasoning template lets LLMs detect malicious web server logs at 0.99 F1 using only four examples.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CEF-Log embeds a structured five-step reasoning template in few-shot prompts, allowing LLMs to achieve an F1-score of 0.99 on the CSIC 2010 dataset with only four examples, deliver a tenfold improvement in sample efficiency over other prompting-based methods, and generate traceable explanations suitable for forensic documentation. The approach is also evaluated on the newly introduced ForenWebLog dataset, which contains real-world attacks and multi-step attack sequences.
What carries the argument
The context-enhanced few-shot chain-of-thought prompting strategy that embeds a structured five-step expert investigative template to guide the LLM through log analysis.
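A minimal sketch of how such a prompt might be assembled, in Python. The five step names are hypothetical placeholders, since this page does not list the paper's actual steps; `Shot`, `build_prompt`, and the prompt wording are likewise illustrative assumptions rather than the authors' implementation.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical step names: placeholders only, not the paper's actual template.
STEPS = [
    "Step 1: Parse the request",
    "Step 2: Inspect the parameters",
    "Step 3: Compare against benign behaviour",
    "Step 4: Identify attack indicators",
    "Step 5: State the verdict and evidence",
]


@dataclass
class Shot:
    log: str              # raw log line shown to the model
    reasoning: List[str]  # one worked answer per template step
    label: str            # "malicious" or "benign"


def build_prompt(shots: List[Shot], query_log: str) -> str:
    """Assemble a few-shot chain-of-thought prompt around a fixed template."""
    parts = ["Analyze each web server log using these steps:"]
    parts.extend(STEPS)
    for shot in shots:
        parts.append("")
        parts.append(f"Log: {shot.log}")
        for step, answer in zip(STEPS, shot.reasoning):
            parts.append(f"{step}: {answer}")
        parts.append(f"Label: {shot.label}")
    # The query repeats the structure but leaves the reasoning to the model.
    parts.append("")
    parts.append(f"Log: {query_log}")
    parts.append(f"{STEPS[0]}:")
    return "\n".join(parts)
```

Keeping the step list separate from the shots makes it straightforward to swap or drop the template while leaving the examples untouched.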
If this is right
- Malicious log detection reaches high accuracy with far fewer labeled examples than conventional machine-learning pipelines.
- The generated reasoning steps supply traceable documentation that meets forensic and legal standards.
- Sample efficiency improves by a factor of ten relative to other prompting techniques on the tested dataset.
- The ForenWebLog dataset enables evaluation against realistic multi-step attack sequences.
Where Pith is reading between the lines
- Structured templates of this kind may transfer to other security-analysis tasks that require both accuracy and auditability.
- Explanation traceability could reduce dependence on separate post-hoc interpretability tools when LLMs are used in forensic settings.
- Results may change if the template is applied to server logs drawn from different software stacks or attack distributions.
Load-bearing premise
Embedding the five-step template causes the LLM to acquire general analysis methodology rather than simply memorizing patterns from the four examples.
What would settle it
Running the same four examples through an LLM with a different or absent investigative template and measuring whether detection accuracy falls while the generated explanations lose traceability.
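The comparison described above can be sketched as follows, in Python. `classify` stands in for an LLM call and is an assumption, as are the variant names; the sketch only shows the shots and evaluation set being held fixed while the prompt variant changes.

```python
from typing import Callable, Dict, List


def f1_score(y_true: List[int], y_pred: List[int]) -> float:
    """F1 for the positive (malicious = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def run_ablation(
    prompts: Dict[str, str],
    classify: Callable[[str, str], int],
    logs: List[str],
    labels: List[int],
) -> Dict[str, float]:
    """Score each prompt variant on the same logs and labels.

    `prompts` maps a variant name (e.g. "with_template", "no_template")
    to its prompt prefix; `classify(prompt, log)` is a hypothetical
    stand-in for the LLM call and must return 1 (malicious) or 0 (benign).
    """
    return {
        name: f1_score(labels, [classify(prompt, log) for log in logs])
        for name, prompt in prompts.items()
    }
```

This covers only the detection side of the test; whether the explanations remain traceable would need a separate, likely manual, assessment.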
Original abstract
Forensic analysis of web server logs demands both accurate detection and human-readable explanations that can satisfy legal requirements. We present CEF-Log, a context-enhanced few-shot chain-of-thought prompting strategy for Large Language Models that addresses this dual requirement. CEF-Log embeds expert investigative methodology through a structured five-step reasoning template, enabling the model to learn how to analyze logs rather than what patterns to memorize. Experimental evaluation demonstrates that CEF-Log achieves an F1-score of 0.99 on the CSIC 2010 dataset using only four examples while providing a 10× improvement in sample efficiency compared to other prompting-based methods. We also introduce ForenWebLog, a new dataset that incorporates real-world attacks and multi-step attack sequences for comprehensive evaluation. Qualitative analysis confirms that CEF-Log generates traceable, accurate explanations suitable for forensic documentation, addressing the critical "black-box" limitation of traditional machine learning approaches.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CEF-Log, a context-enhanced few-shot chain-of-thought prompting strategy for LLMs that embeds a structured five-step expert investigative template to detect malicious web server logs while generating human-readable explanations. It reports an F1-score of 0.99 on the CSIC 2010 dataset using only four examples, a 10× sample-efficiency gain over other prompting methods, introduces the ForenWebLog dataset with real-world and multi-step attacks, and claims the outputs are suitable for forensic documentation.
Significance. If the central empirical claims hold after verification, the work would offer a practical advance in forensic log analysis by combining high detection accuracy with minimal labeled examples and traceable reasoning chains, directly addressing the explainability gap in traditional ML detectors. The introduction of ForenWebLog as a new evaluation resource with multi-step attack sequences is a concrete positive contribution that could support future benchmarking.
major comments (3)
- [Abstract] The headline claim that the five-step template enables the model to 'learn how to analyze logs rather than what patterns to memorize' is load-bearing for the sample-efficiency and generalizability assertions, yet the manuscript supplies no ablation that removes or replaces the template while holding the four examples and overall prompt structure fixed; without this comparison the observed F1=0.99 could be explained by memorization of the specific attack patterns in the shots.
- [Abstract] The reported 10× sample-efficiency improvement is stated relative to 'other prompting-based methods,' but the manuscript does not document that those baselines were matched on template length, structure, or number of examples, rendering the quantitative comparison unverifiable from the given information.
- [Abstract] The F1-score of 0.99 is presented without accompanying details on baseline implementations, statistical significance tests, variance across runs, or error analysis; these omissions make it impossible to assess whether the result is robust or merely an artifact of a single prompting configuration.
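As one way to report the run-to-run variance the last comment asks for, per-run F1 scores could be summarized with a mean and sample standard deviation. A minimal sketch in Python; `summarize_runs` is a hypothetical helper, and nothing here reflects results from the paper.

```python
import statistics
from typing import Dict, List


def summarize_runs(f1_scores: List[float]) -> Dict[str, float]:
    """Summarize F1 across repeated runs (e.g. different seeds or shot orders)."""
    if len(f1_scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    return {
        "mean": statistics.mean(f1_scores),
        "stdev": statistics.stdev(f1_scores),  # sample standard deviation
        "min": min(f1_scores),
        "max": max(f1_scores),
    }
```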
minor comments (2)
- [Abstract] The abstract introduces the acronym CEF-Log but does not expand it on first use; a parenthetical definition would improve readability.
- [Abstract] The phrase 'context-enhanced' is used without a concise operational definition or pointer to the precise prompt-engineering mechanism that distinguishes it from standard few-shot CoT.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract and empirical claims. We agree that the points raised require additional experiments and documentation to fully substantiate the central assertions and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract] The headline claim that the five-step template enables the model to 'learn how to analyze logs rather than what patterns to memorize' is load-bearing for the sample-efficiency and generalizability assertions, yet the manuscript supplies no ablation that removes or replaces the template while holding the four examples and overall prompt structure fixed; without this comparison the observed F1=0.99 could be explained by memorization of the specific attack patterns in the shots.
  Authors: We agree that an ablation isolating the five-step template's contribution (while holding the four examples and overall prompt structure fixed) is necessary to support the claim. In the revised manuscript we will add this ablation experiment and report the resulting F1 scores to demonstrate whether performance derives from the expert investigative template or from pattern memorization in the shots.
  revision: yes
- Referee: [Abstract] The reported 10× sample-efficiency improvement is stated relative to 'other prompting-based methods,' but the manuscript does not document that those baselines were matched on template length, structure, or number of examples, rendering the quantitative comparison unverifiable from the given information.
  Authors: We acknowledge that the 10× claim requires explicit documentation that baselines were matched on template length, structure, and example count. We will revise the experimental section to provide full implementation details of each baseline and confirm the matching criteria used, allowing readers to verify the comparison.
  revision: yes
- Referee: [Abstract] The F1-score of 0.99 is presented without accompanying details on baseline implementations, statistical significance tests, variance across runs, or error analysis; these omissions make it impossible to assess whether the result is robust or merely an artifact of a single prompting configuration.
  Authors: We will expand the results section to include complete baseline implementation details, statistical significance testing (e.g., McNemar's test), variance across multiple runs with different seeds, and a detailed error analysis. These additions will allow a rigorous assessment of result robustness.
  revision: yes
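The rebuttal names McNemar's test as an example of the planned significance testing. A minimal sketch of the exact (binomial) two-sided variant in Python follows; the choice of the exact variant and the function name `mcnemar_exact` are assumptions, since the rebuttal does not say which form would be used.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value for two classifiers on the same items.

    b: items method A classified correctly and method B incorrectly.
    c: items method B classified correctly and method A incorrectly.
    Under the null hypothesis the b + c discordant pairs split 50/50.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Probability of a split at least as lopsided as the one observed,
    # in one direction, under Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Only the discordant pairs enter the test, so it compares two methods on the same evaluation logs without being dominated by the items both methods get right.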
Circularity Check
No circularity; empirical prompting method evaluated independently
The paper presents CEF-Log as a prompting strategy whose performance is measured through experiments on CSIC 2010 and ForenWebLog datasets. No equations, fitted parameters, self-citations, or definitional reductions appear in the provided text. The five-step template is an explicit input to the method rather than a derived output, and claims of learning 'how' versus 'what' are framed as empirical observations, not tautological equivalences. The derivation chain is therefore self-contained against external benchmarks.