
arxiv: 2606.08649 · v1 · pith:IU6YLB2Tnew · submitted 2026-06-07 · 💻 cs.CR · cs.AI

Sample-Efficient LLM-Based Detection of Malicious Web Server Logs with Forensically Explainable Reasoning

Pith reviewed 2026-06-27 17:59 UTC · model grok-4.3

keywords: few-shot prompting · chain-of-thought reasoning · web server log forensics · malicious detection · explainable AI · LLM security applications

The pith

A five-step reasoning template lets LLMs detect malicious web server logs at 0.99 F1 using only four examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents CEF-Log, a prompting strategy that places a structured five-step expert investigative template inside few-shot chain-of-thought instructions given to large language models. The goal is to make the model follow general analysis steps when examining web server logs rather than memorize specific attack signatures from the supplied examples. On the CSIC 2010 dataset the method reaches an F1-score of 0.99 with four examples and improves sample efficiency tenfold over other prompting baselines. It also produces step-by-step explanations that can be traced and documented for forensic purposes. A new dataset called ForenWebLog is introduced to test performance on realistic multi-step attacks.

Core claim

CEF-Log embeds a structured five-step reasoning template in few-shot prompts, allowing LLMs to achieve an F1-score of 0.99 on the CSIC 2010 dataset with only four examples, deliver a 10 times improvement in sample efficiency over other prompting methods, and generate traceable explanations suitable for forensic documentation. The approach is evaluated on the newly introduced ForenWebLog dataset that contains real-world attacks and multi-step sequences.
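To pin down the headline metric: F1 is the harmonic mean of precision and recall on the positive (here, malicious) class. The following is a minimal sketch, assuming binary labels where 1 means malicious. The function name and label encoding are illustrative choices, not taken from the paper.

```python
def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive (malicious) class from two equal-length label lists."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision or recall is zero (or undefined), so F1 is 0.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

Because the harmonic mean is pulled toward the smaller of its two inputs, an F1 near 0.99 requires both precision and recall to be close to 1. It says nothing by itself about run-to-run variance.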

What carries the argument

The context-enhanced few-shot chain-of-thought prompting strategy that embeds a structured five-step expert investigative template to guide the LLM through log analysis.

If this is right

  • Malicious log detection reaches high accuracy with far fewer labeled examples than conventional machine-learning pipelines.
  • The generated reasoning steps supply traceable documentation that the paper describes as suitable for forensic reporting, where explanations must satisfy legal requirements.
  • Sample efficiency improves by a factor of ten relative to other prompting techniques on the tested dataset.
  • The ForenWebLog dataset enables evaluation against realistic multi-step attack sequences.
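The material summarized here does not say how the tenfold figure is computed. One plausible reading is the ratio of the number of examples each method needs to reach the same target F1. The sketch below works under that assumption only. The function names are hypothetical and the curves are toy numbers for illustration, not results from the paper.

```python
def shots_to_reach(curve, target_f1):
    """Smallest number of examples whose F1 meets the target, or None.

    `curve` maps number of few-shot examples -> measured F1.
    """
    reached = [k for k, f1 in curve.items() if f1 >= target_f1]
    return min(reached) if reached else None


def efficiency_ratio(baseline_curve, method_curve, target_f1):
    """How many times fewer examples the method needs than the baseline."""
    base = shots_to_reach(baseline_curve, target_f1)
    ours = shots_to_reach(method_curve, target_f1)
    if base is None or ours is None or ours == 0:
        return None  # target never reached, or ratio undefined
    return base / ours


# Toy numbers for illustration only; NOT results from the paper.
toy_baseline = {4: 0.70, 8: 0.80, 16: 0.90, 32: 0.95}
toy_method = {2: 0.85, 4: 0.95, 8: 0.96}
ratio = efficiency_ratio(toy_baseline, toy_method, target_f1=0.95)
```

Under this definition the ratio depends on the chosen target F1 and on which example counts were actually measured, so a reported factor is only interpretable alongside both.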

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Structured templates of this kind may transfer to other security-analysis tasks that require both accuracy and auditability.
  • Explanation traceability could reduce dependence on separate post-hoc interpretability tools when LLMs are used in forensic settings.
  • Results may change if the template is applied to server logs drawn from different software stacks or attack distributions.

Load-bearing premise

Embedding the five-step template causes the LLM to acquire general analysis methodology rather than simply memorizing patterns from the four examples.

What would settle it

Running the same four examples through an LLM with a different or absent investigative template and measuring whether detection accuracy falls while the generated explanations lose traceability.
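The ablation described above can be sketched as a pair of prompts that share the same shots and query and differ only in whether the template is present. The step texts are placeholders, since the paper's actual five steps are not reproduced in this review. `build_prompt` and `ablation_prompts` are hypothetical helpers, and no model call is shown.

```python
# Placeholder step texts: the paper's real five steps are not reproduced here.
PLACEHOLDER_STEPS = [
    f"Step {i}: <investigative step {i} goes here>" for i in range(1, 6)
]


def build_prompt(examples, log_entry, template_steps=None):
    """Assemble a few-shot prompt; template_steps=None gives the ablated variant.

    `examples` is a list of (log_line, reasoning, label) tuples.
    """
    parts = ["Classify the web server log entry as MALICIOUS or BENIGN."]
    if template_steps:
        parts.append("Follow these steps in order:")
        parts.extend(template_steps)
    for log_line, reasoning, label in examples:
        parts.append(f"Log: {log_line}\nReasoning: {reasoning}\nLabel: {label}")
    # The query is left open so the model continues with its own reasoning.
    parts.append(f"Log: {log_entry}\nReasoning:")
    return "\n\n".join(parts)


def ablation_prompts(examples, log_entry):
    """Same shots and query, with and without the template."""
    return {
        "with_template": build_prompt(examples, log_entry, PLACEHOLDER_STEPS),
        "no_template": build_prompt(examples, log_entry, None),
    }
```

The ablated prompt is also shorter, so a length-matched control (for example, the same number of filler instructions) may be needed before attributing any accuracy gap to the template's content.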

Figures

Figures reproduced from arXiv: 2606.08649 by Bernhard Kneip, Hong-Hanh Nguyen-Le, Nhien-An Le-Khac.

Figure 1: Example web server log entry in Combined Log Format containing …
Figure 2: Illustration of unstable web server logs. Training data from Web …
Figure 3: Overview of CEF-Log for forensic web log classification. Few-shot examples paired with a five-step reasoning template guide the LLM through …
Figure 4: Illustration of ForenWebLog Dataset Collection Framework.
Figure 5: Sample efficiency comparison between standard few-shot prompting …
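Figure 1 refers to the Combined Log Format. For orientation, here is a minimal sketch of how one such line splits into fields. The sample line is the standard example from the Apache HTTP Server documentation, not the entry shown in the paper's figure. The regular expression is a simplification that assumes no escaped double quotes inside the quoted fields.

```python
import re

# Combined Log Format: host ident user [time] "request" status size "referer" "agent"
CLF_COMBINED = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_combined(line):
    """Return a dict of fields for one Combined Log Format line, or None."""
    m = CLF_COMBINED.match(line)
    if m is None:
        return None
    fields = m.groupdict()
    fields["status"] = int(fields["status"])
    # A size of "-" means no response body was sent.
    fields["size"] = 0 if fields["size"] == "-" else int(fields["size"])
    return fields


# Standard example line from the Apache HTTP Server documentation.
sample = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
entry = parse_combined(sample)
```

Attack payloads often contain quotes or unusual bytes, so a parser used on real forensic data would need stricter handling of escaping than this sketch provides.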
Original abstract

Forensic analysis of web server logs demands both accurate detection and human-readable explanations that can satisfy legal requirements. We present CEF-Log, a context-enhanced few-shot chain-of-thought prompting strategy for Large Language Models that addresses this dual requirement. CEF-Log embeds expert investigative methodology through a structured five-step reasoning template, enabling the model to learn how to analyze logs rather than what patterns to memorize. Experimental evaluation demonstrates that CEF-Log achieves an F1-score of 0.99 on the CSIC 2010 dataset using only four examples while providing a 10× improvement in sample efficiency compared to other prompting-based methods. We also introduce ForenWebLog, a new dataset that incorporates real-world attacks and multi-step attack sequences for comprehensive evaluation. Qualitative analysis confirms that CEF-Log generates traceable, accurate explanations suitable for forensic documentation, addressing the critical "black-box" limitation of traditional machine learning approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces CEF-Log, a context-enhanced few-shot chain-of-thought prompting strategy for LLMs that embeds a structured five-step expert investigative template to detect malicious web server logs while generating human-readable explanations. It reports an F1-score of 0.99 on the CSIC 2010 dataset using only four examples, a 10× sample-efficiency gain over other prompting methods, introduces the ForenWebLog dataset with real-world and multi-step attacks, and claims the outputs are suitable for forensic documentation.

Significance. If the central empirical claims hold after verification, the work would offer a practical advance in forensic log analysis by combining high detection accuracy with minimal labeled examples and traceable reasoning chains, directly addressing the explainability gap in traditional ML detectors. The introduction of ForenWebLog as a new evaluation resource with multi-step attack sequences is a concrete positive contribution that could support future benchmarking.

major comments (3)
  1. [Abstract] The headline claim that the five-step template enables the model to 'learn how to analyze logs rather than what patterns to memorize' is load-bearing for the sample-efficiency and generalizability assertions, yet the manuscript supplies no ablation that removes or replaces the template while holding the four examples and overall prompt structure fixed; without this comparison the observed F1=0.99 could be explained by memorization of the specific attack patterns in the shots.
  2. [Abstract] The reported 10× sample-efficiency improvement is stated relative to 'other prompting-based methods,' but the manuscript does not document that those baselines were matched on template length, structure, or number of examples, rendering the quantitative comparison unverifiable from the given information.
  3. [Abstract] The F1-score of 0.99 is presented without accompanying details on baseline implementations, statistical significance tests, variance across runs, or error analysis; these omissions make it impossible to assess whether the result is robust or merely an artifact of a single prompting configuration.
minor comments (2)
  1. [Abstract] The abstract introduces the acronym CEF-Log but does not expand it on first use; a parenthetical definition would improve readability.
  2. [Abstract] The phrase 'context-enhanced' is used without a concise operational definition or pointer to the precise prompt-engineering mechanism that distinguishes it from standard few-shot CoT.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and empirical claims. We agree that the points raised require additional experiments and documentation to fully substantiate the central assertions and will revise the manuscript accordingly.

  1. Referee: [Abstract] The headline claim that the five-step template enables the model to 'learn how to analyze logs rather than what patterns to memorize' is load-bearing for the sample-efficiency and generalizability assertions, yet the manuscript supplies no ablation that removes or replaces the template while holding the four examples and overall prompt structure fixed; without this comparison the observed F1=0.99 could be explained by memorization of the specific attack patterns in the shots.

    Authors: We agree that an ablation isolating the five-step template's contribution (while holding the four examples and overall prompt structure fixed) is necessary to support the claim. In the revised manuscript we will add this ablation experiment and report the resulting F1 scores to demonstrate whether performance derives from the expert investigative template or from pattern memorization in the shots. revision: yes

  2. Referee: [Abstract] The reported 10× sample-efficiency improvement is stated relative to 'other prompting-based methods,' but the manuscript does not document that those baselines were matched on template length, structure, or number of examples, rendering the quantitative comparison unverifiable from the given information.

    Authors: We acknowledge that the 10× claim requires explicit documentation that baselines were matched on template length, structure, and example count. We will revise the experimental section to provide full implementation details of each baseline and confirm the matching criteria used, allowing readers to verify the comparison. revision: yes

  3. Referee: [Abstract] The F1-score of 0.99 is presented without accompanying details on baseline implementations, statistical significance tests, variance across runs, or error analysis; these omissions make it impossible to assess whether the result is robust or merely an artifact of a single prompting configuration.

    Authors: We will expand the results section to include complete baseline implementation details, statistical significance testing (e.g., McNemar's test), variance across multiple runs with different seeds, and a detailed error analysis. These additions will allow a rigorous assessment of result robustness. revision: yes
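The rebuttal names McNemar's test as one option for significance testing. As a rough illustration of what that involves, here is a minimal sketch of the exact two-sided form, which compares two classifiers on the same test items by looking only at the items where exactly one of them is correct. The function name and argument layout are illustrative assumptions, not the authors' code.

```python
from math import comb


def mcnemar_exact(y_true, pred_a, pred_b):
    """Two-sided exact McNemar test on paired predictions.

    Returns (b, c, p_value), where b counts items only A classifies correctly
    and c counts items only B classifies correctly.
    """
    if not (len(y_true) == len(pred_a) == len(pred_b)):
        raise ValueError("all inputs must have the same length")
    triples = list(zip(y_true, pred_a, pred_b))
    b = sum(1 for t, pa, pb in triples if pa == t and pb != t)
    c = sum(1 for t, pa, pb in triples if pa != t and pb == t)
    n = b + c
    if n == 0:
        return b, c, 1.0  # the two methods never disagree on correctness
    k = min(b, c)
    # Under the null hypothesis, discordant items split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)
```

The test covers a single paired comparison. Variance across seeds or across different choices of the four shots would still need to be reported separately, for example as a mean and standard deviation of F1 over repeated runs.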

Circularity Check

0 steps flagged

No circularity; empirical prompting method evaluated independently


The paper presents CEF-Log as a prompting strategy whose performance is measured through experiments on CSIC 2010 and ForenWebLog datasets. No equations, fitted parameters, self-citations, or definitional reductions appear in the provided text. The five-step template is an explicit input to the method rather than a derived output, and claims of learning 'how' versus 'what' are framed as empirical observations, not tautological equivalences. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the five-step template is described as drawn from expert methodology but its exact content and independence from the target task are not detailed.


