pith. machine review for the scientific record.

arxiv: 2604.04852 · v1 · submitted 2026-04-06 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links


Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework


Pith reviewed 2026-05-10 18:49 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords prompt engineering · chain-of-thought reasoning · large language models · cybersecurity analysis · DDoS attack detection · structured prompting · explainable AI

The pith

A 16-factor structured prompt framework enhances Chain-of-Thought reasoning integrity in LLMs for cybersecurity tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a structured prompt engineering framework with 16 factors to guide Chain-of-Thought reasoning in large language models for security-sensitive tasks. The framework divides these factors into four dimensions that manage context, ground evidence, structure reasoning, and apply security constraints to prevent common errors like hallucination. Through experiments on detecting DDoS attacks in software-defined network traffic, the approach shows improved accuracy and more reliable explanations across different model sizes. Human raters confirm the consistency of these improvements with high agreement scores.

Core claim

By using a 16-factor prompt structure instead of unstructured prompts, LLMs demonstrate stronger reasoning integrity in analyzing security threats, leading to better detection performance and more interpretable outputs that hold up under human review.

What carries the argument

The 16-factor structured prompt framework, organized into four core dimensions (context control, evidence grounding, reasoning structure, and security constraints), provides explicit controls to maintain reasoning quality.
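As a concrete sketch, the four dimensions can be rendered as a prompt template. The dimension names follow the paper; the 16 factor wordings below are hypothetical stand-ins, since the paper's exact factor list is not reproduced on this page.

```python
# Sketch of a four-dimension, 16-factor structured prompt builder.
# Dimension names follow the paper; each factor string is an illustrative
# stand-in, not the paper's exact wording.
FRAMEWORK = {
    "Context and Scope Control": [
        "Restrict analysis to the provided SDN flow features only",
        "State the analytical role (network security analyst)",
        "Define the detection objective explicitly",
        "Declare the expected output format before reasoning",
    ],
    "Evidence Grounding and Traceability": [
        "Cite the specific feature and value behind each claim",
        "Flag any inference not traceable to the input",
        "Quote thresholds used in comparisons",
        "Separate observation from interpretation",
    ],
    "Reasoning Structure and Cognitive Control": [
        "Reason step by step in numbered stages",
        "State intermediate conclusions before the final verdict",
        "Limit each step to a single inference",
        "Check each step for consistency with the previous one",
    ],
    "Security-Specific Analytical Constraints": [
        "Do not invent hosts, addresses, or events absent from the input",
        "Prefer 'insufficient evidence' over speculation",
        "Map the verdict to a known attack taxonomy",
        "Report confidence alongside the final label",
    ],
}

def build_structured_prompt(task: str, flow_record: str) -> str:
    """Assemble a prompt whose reasoning controls are grouped by dimension."""
    sections = [
        f"## {dimension}\n" + "\n".join(f"- {factor}" for factor in factors)
        for dimension, factors in FRAMEWORK.items()
    ]
    return "\n\n".join(sections) + f"\n\n## Task\n{task}\n\n## Input\n{flow_record}"
```

The point of the structure is that each control is explicit and auditable, rather than folded into a single heuristically worded instruction.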

If this is right

  • Reasoning improvements reach up to 40 percent in smaller models.
  • Accuracy gains hold steady across different model scales.
  • Human evaluations show strong agreement on the enhanced reliability and explainability.
  • The method serves as a lightweight way to make AI-driven security analysis more trustworthy and auditable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be adapted for other domains requiring careful analytical reasoning, such as medical diagnosis or legal review.
  • Future work might identify which subsets of the 16 factors contribute most to the observed benefits.
  • Deploying this prompting method in real-time security monitoring systems could reduce false positives from AI hallucinations.

Load-bearing premise

That the performance improvements stem specifically from the structured 16-factor design rather than from using any carefully worded prompt in the same domain.

What would settle it

Running the same experiments with the 16-factor structure replaced by a comparably detailed unstructured prompt, and measuring whether the gains disappear, would test the necessity of the specific framework.
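A minimal harness for that control experiment might look as follows. `model_fn` is a stand-in for whatever local LLM interface is used; the two templates are assumed to be matched in length and domain detail, differing only in explicit factor structure.

```python
# Hypothetical necessity test: run the same detection task under the
# structured (FW) and a matched unstructured (NoFW) prompt template and
# measure the accuracy gap. A gap near zero would suggest the gains come
# from detailed prompting in general, not the 16-factor structure.

def accuracy(model_fn, prompts, labels):
    """Fraction of prompts for which the model's label matches ground truth."""
    correct = sum(1 for p, y in zip(prompts, labels) if model_fn(p) == y)
    return correct / len(labels)

def necessity_test(model_fn, records, labels, structured, unstructured):
    """Return the accuracy gain attributable to the structured format."""
    acc_fw = accuracy(model_fn, [structured.format(r=r) for r in records], labels)
    acc_nofw = accuracy(model_fn, [unstructured.format(r=r) for r in records], labels)
    return acc_fw - acc_nofw
```

In practice this would be run per model family and paired with the reasoning-quality metrics, not accuracy alone.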

Figures

Figures reproduced from arXiv: 2604.04852 by Aisvarya Adeseye, Antti Hakkala, Jiling Zhou, Jouni Isoaho, Seppo Virtanen.

Figure 1. Three types of CoT prompts: (1) Free CoT Prompt …
Figure 2. The Prompt Engineering Framework for CoT Reasoning.
Figure 3. Experimental Methodology and Prompting Workflow.
Figure 4. Model Size vs. Relative Performance Gain (FW vs. NoFW) in Detection Accuracy and Reasoning Quality.
Figure 5. Pareto frontier analysis across the four reasoning dimensions (Evidence, Faithfulness, Structure, and Taxonomy): the structured framework consistently shifts model performance toward the upper-right region of the Pareto space, indicating simultaneous improvement in both detection accuracy and reasoning metrics.
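The Pareto frontier idea behind Figure 5 can be sketched directly: a configuration is on the frontier if no other configuration is at least as good on both detection accuracy and reasoning score, and strictly better on one. The points in the usage below are illustrative, not the paper's data.

```python
# Minimal Pareto frontier extraction over (accuracy, reasoning_score) pairs.
# A point is dominated if some other point is >= on both coordinates and
# strictly > on at least one.

def pareto_frontier(points):
    """Return the non-dominated (accuracy, reasoning) points."""
    frontier = []
    for p in points:
        dominated = any(
            q != p
            and q[0] >= p[0] and q[1] >= p[1]
            and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier
```

"Shifting toward the upper right" then means the structured-prompt configurations occupy more of this frontier than the unstructured ones.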
Original abstract

Chain-of-Thought (CoT) prompting has been used to enhance the reasoning capability of LLMs. However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation. Alternative approaches, such as model scaling and fine-tuning can be used to help improve performance. These methods are also often costly, computationally intensive, or difficult to audit. In contrast, prompt engineering provides a lightweight, transparent, and controllable mechanism for guiding LLM reasoning. This study proposes a structured prompt engineering framework designed to strengthen CoT reasoning integrity while improving security threat and attack detection reliability in local LLM deployments. The framework includes 16 factors grouped into four core dimensions: (1) Context and Scope Control, (2) Evidence Grounding and Traceability, (3) Reasoning Structure and Cognitive Control, and (4) Security-Specific Analytical Constraints. Rather than optimizing the wording of the prompt heuristically, the framework introduces explicit reasoning controls to mitigate hallucination and prevent reasoning drift, as well as strengthening interpretability in security-sensitive contexts. Using DDoS attack detection in SDN traffic as a case study, multiple model families were evaluated under structured and unstructured prompting conditions. Pareto frontier analysis and ablation experiments demonstrate consistent reasoning improvements (up to 40% in smaller models) and stable accuracy gains across scales. Human evaluation with strong inter-rater agreement (Cohen's k > 0.80) confirms robustness. The results establish structured prompting as an effective and practical approach for reliable and explainable AI-driven cybersecurity analysis.
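The abstract's agreement statistic, Cohen's kappa, is observed rater agreement corrected for the agreement expected by chance. A minimal two-rater implementation, with made-up ratings in the test:

```python
# Cohen's kappa for two raters over the same items. Values above 0.80, as
# reported in the abstract, are conventionally read as strong agreement.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    return (observed - expected) / (1 - expected)
```

The paper's evaluation presumably aggregates such pairwise agreements over the human reasoning-quality ratings.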

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a structured prompt engineering framework with 16 factors grouped into four dimensions (Context and Scope Control, Evidence Grounding and Traceability, Reasoning Structure and Cognitive Control, and Security-Specific Analytical Constraints) to improve Chain-of-Thought reasoning integrity in LLMs for security-sensitive tasks. Using DDoS attack detection in SDN traffic as a case study, it evaluates multiple model families under structured versus unstructured prompting, claiming reasoning improvements up to 40% in smaller models, stable accuracy gains across scales, and robust human evaluation with Cohen's κ > 0.80. Pareto frontier analysis and ablation experiments are invoked to support the framework as a lightweight, transparent alternative to model scaling or fine-tuning for reliable and explainable AI-driven cybersecurity analysis.

Significance. If the reported gains are shown to arise specifically from the 16-factor structure rather than from detailed prompting in general, the work would offer a practical contribution to prompt engineering in high-stakes domains by providing explicit controls against hallucination and reasoning drift. The inclusion of human evaluation with strong inter-rater agreement is a positive element that supports claims of robustness and interpretability.

major comments (3)
  1. [Abstract] The central claims of 'up to 40% in smaller models' reasoning improvements and 'stable accuracy gains across scales' are presented without any details on model sizes, exact baselines, data splits, metrics, or statistical tests. This absence prevents verification of the performance results and attribution to the proposed framework.
  2. [Evaluation] Evaluation section (Pareto frontier analysis and ablation experiments): The manuscript states that these analyses demonstrate consistent improvements, yet provides no description of the ablated components, the unstructured prompting baseline conditions, or quantitative results from the Pareto analysis. Without this, it is impossible to determine whether gains are due to the specific 16-factor, four-dimension structure or to any comparably detailed prompt.
  3. [Case Study] Case study and framework description: The evaluation is confined to a single DDoS-in-SDN scenario. The claim that the 16-factor structure itself strengthens reasoning integrity requires a control condition consisting of a prompt matched in length, domain specificity, and security detail but lacking the explicit four-dimensional grouping and reasoning controls; the current design does not isolate this mechanism.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from explicit statements of the exact models evaluated and the precise definition of 'unstructured prompting conditions' to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the detailed review and constructive feedback on our manuscript. We appreciate the recognition of the human evaluation aspect and the potential contribution to prompt engineering in high-stakes domains. We address each major comment below and will incorporate revisions to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] The central claims of 'up to 40% in smaller models' reasoning improvements and 'stable accuracy gains across scales' are presented without any details on model sizes, exact baselines, data splits, metrics, or statistical tests. This absence prevents verification of the performance results and attribution to the proposed framework.

    Authors: We agree that the abstract lacks specific details on these elements. In the revised version, we will incorporate information on the model sizes evaluated, the exact baselines employed (unstructured prompting), the data splits used, the metrics for reasoning improvements and accuracy, as well as the statistical tests performed. This will enable better verification and attribution of the results to the proposed framework. revision: yes

  2. Referee: [Evaluation] Evaluation section (Pareto frontier analysis and ablation experiments): The manuscript states that these analyses demonstrate consistent improvements, yet provides no description of the ablated components, the unstructured prompting baseline conditions, or quantitative results from the Pareto analysis. Without this, it is impossible to determine whether gains are due to the specific 16-factor, four-dimension structure or to any comparably detailed prompt.

    Authors: We will expand the Evaluation section to provide a full description of the ablated components in the ablation experiments, the conditions for the unstructured prompting baseline, and the quantitative outcomes from the Pareto frontier analysis. These additions will allow readers to assess whether the gains stem from the specific 16-factor structure. revision: yes

  3. Referee: [Case Study] Case study and framework description: The evaluation is confined to a single DDoS-in-SDN scenario. The claim that the 16-factor structure itself strengthens reasoning integrity requires a control condition consisting of a prompt matched in length, domain specificity, and security detail but lacking the explicit four-dimensional grouping and reasoning controls; the current design does not isolate this mechanism.

    Authors: We recognize that the single-scenario case study limits generalizability, and that a matched control prompt is needed to isolate the effect of the structured framework. In the revised manuscript, we will include an additional control condition with a prompt matched for length, domain specificity, and security detail but without the four-dimensional grouping and explicit reasoning controls. Comparative results will be presented to demonstrate the unique contribution of the 16-factor structure. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

full rationale

The paper proposes a 16-factor structured prompting framework and evaluates it empirically on a DDoS-in-SDN case study using ablation experiments, Pareto frontier analysis, and human evaluation (Cohen's κ > 0.80). No equations, derivations, or first-principles claims are present that reduce reported accuracy or reasoning gains to quantities fitted inside the same experiment or to self-citations. The comparison is between structured and unstructured prompting conditions on held-out traffic data, with the central claims resting on observable experimental outcomes rather than tautological redefinitions or imported uniqueness theorems. This is the standard honest finding for an empirical prompt-engineering study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework itself is an author-defined artifact. No numerical parameters are fitted to data in the abstract. The central claim rests on the assumption that explicit reasoning controls reduce hallucination and drift.

axioms (1)
  • domain assumption LLMs can follow explicit structural instructions to reduce reasoning drift and hallucination in security analysis
    Invoked when the authors state that the framework mitigates hallucination and prevents reasoning drift

pith-pipeline@v0.9.0 · 5593 in / 1381 out tokens · 39160 ms · 2026-05-10T18:49:12.107902+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
