pith. machine review for the scientific record.

arxiv: 2604.04852 · v1 · submitted 2026-04-06 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links


Strengthening Human-Centric Chain-of-Thought Reasoning Integrity in LLMs via a Structured Prompt Framework


Pith reviewed 2026-05-10 18:49 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords prompt engineering · chain-of-thought reasoning · large language models · cybersecurity analysis · DDoS attack detection · structured prompting · explainable AI

The pith

A 16-factor structured prompt framework enhances Chain-of-Thought reasoning integrity in LLMs for cybersecurity tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces a structured prompt engineering framework with 16 factors to guide Chain-of-Thought reasoning in large language models for security-sensitive tasks. The framework divides these factors into four dimensions that manage context, ground evidence, structure reasoning, and apply security constraints to prevent common errors like hallucination. Through experiments on detecting DDoS attacks in software-defined network traffic, the approach shows improved accuracy and more reliable explanations across different model sizes. Human raters confirm the consistency of these improvements with high agreement scores.

Core claim

By using a 16-factor prompt structure instead of unstructured prompts, LLMs demonstrate stronger reasoning integrity in analyzing security threats, leading to better detection performance and more interpretable outputs that hold up under human review.

What carries the argument

The 16-factor structured prompt framework, organized into four core dimensions (context control, evidence grounding, reasoning structure, and security constraints), provides explicit controls to maintain reasoning quality.
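As a concrete sketch, the four dimensions can be rendered as a prompt template. The dimension names follow the paper; the 16 factor wordings below are hypothetical stand-ins, since the paper's exact factor list is not reproduced on this page.

```python
# Sketch of a four-dimension, 16-factor structured prompt builder.
# Dimension names follow the paper; each factor string is an illustrative
# stand-in, not the paper's exact wording.
FRAMEWORK = {
    "Context and Scope Control": [
        "Restrict analysis to the provided SDN flow features only",
        "State the analytical role (network security analyst)",
        "Define the detection objective explicitly",
        "Declare the expected output format before reasoning",
    ],
    "Evidence Grounding and Traceability": [
        "Cite the specific feature and value behind each claim",
        "Flag any inference not traceable to the input",
        "Quote thresholds used in comparisons",
        "Separate observation from interpretation",
    ],
    "Reasoning Structure and Cognitive Control": [
        "Reason step by step in numbered stages",
        "State intermediate conclusions before the final verdict",
        "Limit each step to a single inference",
        "Check each step for consistency with the previous one",
    ],
    "Security-Specific Analytical Constraints": [
        "Do not invent hosts, addresses, or events absent from the input",
        "Prefer 'insufficient evidence' over speculation",
        "Map the verdict to a known attack taxonomy",
        "Report confidence alongside the final label",
    ],
}

def build_structured_prompt(task: str, flow_record: str) -> str:
    """Assemble a prompt whose reasoning controls are grouped by dimension."""
    sections = [
        f"## {dimension}\n" + "\n".join(f"- {factor}" for factor in factors)
        for dimension, factors in FRAMEWORK.items()
    ]
    return "\n\n".join(sections) + f"\n\n## Task\n{task}\n\n## Input\n{flow_record}"
```

The point of the structure is that each control is explicit and auditable, rather than folded into a single heuristically worded instruction.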

If this is right

  • Reasoning improvements reach up to 40 percent in smaller models.
  • Accuracy gains hold steady across different model scales.
  • Human evaluations show strong agreement on the enhanced reliability and explainability.
  • The method serves as a lightweight way to make AI-driven security analysis more trustworthy and auditable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The framework could be adapted for other domains requiring careful analytical reasoning, such as medical diagnosis or legal review.
  • Future work might identify which subsets of the 16 factors contribute most to the observed benefits.
  • Deploying this prompting method in real-time security monitoring systems could reduce false positives from AI hallucinations.

Load-bearing premise

That the performance improvements stem specifically from the structured 16-factor design rather than from using any carefully worded prompt in the same domain.

What would settle it

Running the same experiments with the 16-factor structure replaced by a comparably detailed unstructured prompt, and measuring whether the gains disappear, would test the necessity of the specific framework.
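A minimal harness for that control experiment might look as follows. `model_fn` is a stand-in for whatever local LLM interface is used; the two templates are assumed to be matched in length and domain detail, differing only in explicit factor structure.

```python
# Hypothetical necessity test: run the same detection task under the
# structured (FW) and a matched unstructured (NoFW) prompt template and
# measure the accuracy gap. A gap near zero would suggest the gains come
# from detailed prompting in general, not the 16-factor structure.

def accuracy(model_fn, prompts, labels):
    """Fraction of prompts for which the model's label matches ground truth."""
    correct = sum(1 for p, y in zip(prompts, labels) if model_fn(p) == y)
    return correct / len(labels)

def necessity_test(model_fn, records, labels, structured, unstructured):
    """Return the accuracy gain attributable to the structured format."""
    acc_fw = accuracy(model_fn, [structured.format(r=r) for r in records], labels)
    acc_nofw = accuracy(model_fn, [unstructured.format(r=r) for r in records], labels)
    return acc_fw - acc_nofw
```

In practice this would be run per model family and paired with the reasoning-quality metrics, not accuracy alone.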

Figures

Figures reproduced from arXiv: 2604.04852 by Aisvarya Adeseye, Antti Hakkala, Jiling Zhou, Jouni Isoaho, Seppo Virtanen.

Figure 1. Three types of CoT prompts: (1) Free CoT Prompt …
Figure 2. The Prompt Engineering Framework for CoT Reasoning.
Figure 3. Experimental Methodology and Prompting Workflow.
Figure 4. Model Size vs. Relative Performance Gain (FW vs. NoFW) in Detection Accuracy and Reasoning Quality.
Figure 5. Pareto frontier analysis across the four reasoning dimensions (Evidence, Faithfulness, Structure, and Taxonomy): the structured framework consistently shifts model performance toward the upper-right region of the Pareto space, indicating simultaneous improvement in both detection accuracy and reasoning metrics.
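The Pareto frontier idea behind Figure 5 can be sketched directly: a configuration is on the frontier if no other configuration is at least as good on both detection accuracy and reasoning score, and strictly better on one. The points in the usage below are illustrative, not the paper's data.

```python
# Minimal Pareto frontier extraction over (accuracy, reasoning_score) pairs.
# A point is dominated if some other point is >= on both coordinates and
# strictly > on at least one.

def pareto_frontier(points):
    """Return the non-dominated (accuracy, reasoning) points."""
    frontier = []
    for p in points:
        dominated = any(
            q != p
            and q[0] >= p[0] and q[1] >= p[1]
            and (q[0] > p[0] or q[1] > p[1])
            for q in points
        )
        if not dominated:
            frontier.append(p)
    return frontier
```

"Shifting toward the upper right" then means the structured-prompt configurations occupy more of this frontier than the unstructured ones.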
Original abstract

Chain-of-Thought (CoT) prompting has been used to enhance the reasoning capability of LLMs. However, its reliability in security-sensitive analytical tasks remains insufficiently examined, particularly under structured human evaluation. Alternative approaches, such as model scaling and fine-tuning can be used to help improve performance. These methods are also often costly, computationally intensive, or difficult to audit. In contrast, prompt engineering provides a lightweight, transparent, and controllable mechanism for guiding LLM reasoning. This study proposes a structured prompt engineering framework designed to strengthen CoT reasoning integrity while improving security threat and attack detection reliability in local LLM deployments. The framework includes 16 factors grouped into four core dimensions: (1) Context and Scope Control, (2) Evidence Grounding and Traceability, (3) Reasoning Structure and Cognitive Control, and (4) Security-Specific Analytical Constraints. Rather than optimizing the wording of the prompt heuristically, the framework introduces explicit reasoning controls to mitigate hallucination and prevent reasoning drift, as well as strengthening interpretability in security-sensitive contexts. Using DDoS attack detection in SDN traffic as a case study, multiple model families were evaluated under structured and unstructured prompting conditions. Pareto frontier analysis and ablation experiments demonstrate consistent reasoning improvements (up to 40% in smaller models) and stable accuracy gains across scales. Human evaluation with strong inter-rater agreement (Cohen's k > 0.80) confirms robustness. The results establish structured prompting as an effective and practical approach for reliable and explainable AI-driven cybersecurity analysis.
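The abstract's agreement statistic, Cohen's kappa, is observed rater agreement corrected for the agreement expected by chance. A minimal two-rater implementation, with made-up ratings in the test:

```python
# Cohen's kappa for two raters over the same items. Values above 0.80, as
# reported in the abstract, are conventionally read as strong agreement.
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    n = len(rater_a)
    # Observed proportion of items on which the raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(
        (counts_a[c] / n) * (counts_b[c] / n)
        for c in set(counts_a) | set(counts_b)
    )
    return (observed - expected) / (1 - expected)
```

The paper's evaluation presumably aggregates such pairwise agreements over the human reasoning-quality ratings.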

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The manuscript proposes a structured prompt engineering framework with 16 factors grouped into four dimensions (Context and Scope Control, Evidence Grounding and Traceability, Reasoning Structure and Cognitive Control, and Security-Specific Analytical Constraints) to improve Chain-of-Thought reasoning integrity in LLMs for security-sensitive tasks. Using DDoS attack detection in SDN traffic as a case study, it evaluates multiple model families under structured versus unstructured prompting, claiming reasoning improvements up to 40% in smaller models, stable accuracy gains across scales, and robust human evaluation with Cohen's κ > 0.80. Pareto frontier analysis and ablation experiments are invoked to support the framework as a lightweight, transparent alternative to model scaling or fine-tuning for reliable and explainable AI-driven cybersecurity analysis.

Significance. If the reported gains are shown to arise specifically from the 16-factor structure rather than from detailed prompting in general, the work would offer a practical contribution to prompt engineering in high-stakes domains by providing explicit controls against hallucination and reasoning drift. The inclusion of human evaluation with strong inter-rater agreement is a positive element that supports claims of robustness and interpretability.

major comments (3)
  1. [Abstract] The central claims of 'up to 40% in smaller models' reasoning improvements and 'stable accuracy gains across scales' are presented without any details on model sizes, exact baselines, data splits, metrics, or statistical tests. This absence prevents verification of the performance results and attribution to the proposed framework.
  2. [Evaluation] Evaluation section (Pareto frontier analysis and ablation experiments): The manuscript states that these analyses demonstrate consistent improvements, yet provides no description of the ablated components, the unstructured prompting baseline conditions, or quantitative results from the Pareto analysis. Without this, it is impossible to determine whether gains are due to the specific 16-factor, four-dimension structure or to any comparably detailed prompt.
  3. [Case Study] Case study and framework description: The evaluation is confined to a single DDoS-in-SDN scenario. The claim that the 16-factor structure itself strengthens reasoning integrity requires a control condition consisting of a prompt matched in length, domain specificity, and security detail but lacking the explicit four-dimensional grouping and reasoning controls; the current design does not isolate this mechanism.
minor comments (1)
  1. [Abstract] The abstract and introduction would benefit from explicit statements of the exact models evaluated and the precise definition of 'unstructured prompting conditions' to improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the detailed review and constructive feedback on our manuscript. We appreciate the recognition of the human evaluation aspect and the potential contribution to prompt engineering in high-stakes domains. We address each major comment below and will incorporate revisions to strengthen the paper.

Point-by-point responses
  1. Referee: [Abstract] The central claims of 'up to 40% in smaller models' reasoning improvements and 'stable accuracy gains across scales' are presented without any details on model sizes, exact baselines, data splits, metrics, or statistical tests. This absence prevents verification of the performance results and attribution to the proposed framework.

    Authors: We agree that the abstract lacks specific details on these elements. In the revised version, we will incorporate information on the model sizes evaluated, the exact baselines employed (unstructured prompting), the data splits used, the metrics for reasoning improvements and accuracy, as well as the statistical tests performed. This will enable better verification and attribution of the results to the proposed framework. revision: yes

  2. Referee: [Evaluation] Evaluation section (Pareto frontier analysis and ablation experiments): The manuscript states that these analyses demonstrate consistent improvements, yet provides no description of the ablated components, the unstructured prompting baseline conditions, or quantitative results from the Pareto analysis. Without this, it is impossible to determine whether gains are due to the specific 16-factor, four-dimension structure or to any comparably detailed prompt.

    Authors: We will expand the Evaluation section to provide a full description of the ablated components in the ablation experiments, the conditions for the unstructured prompting baseline, and the quantitative outcomes from the Pareto frontier analysis. These additions will allow readers to assess whether the gains stem from the specific 16-factor structure. revision: yes

  3. Referee: [Case Study] Case study and framework description: The evaluation is confined to a single DDoS-in-SDN scenario. The claim that the 16-factor structure itself strengthens reasoning integrity requires a control condition consisting of a prompt matched in length, domain specificity, and security detail but lacking the explicit four-dimensional grouping and reasoning controls; the current design does not isolate this mechanism.

    Authors: We recognize that the single-scenario case study limits generalizability, and that a matched control prompt is needed to isolate the effect of the structured framework. In the revised manuscript, we will include an additional control condition with a prompt matched for length, domain specificity, and security detail but without the four-dimensional grouping and explicit reasoning controls. Comparative results will be presented to demonstrate the unique contribution of the 16-factor structure. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical evaluation is self-contained

full rationale

The paper proposes a 16-factor structured prompting framework and evaluates it empirically on a DDoS-in-SDN case study using ablation experiments, Pareto frontier analysis, and human evaluation (Cohen's κ > 0.80). No equations, derivations, or first-principles claims are present that reduce reported accuracy or reasoning gains to quantities fitted inside the same experiment or to self-citations. The comparison is between structured and unstructured prompting conditions on held-out traffic data, with the central claims resting on observable experimental outcomes rather than tautological redefinitions or imported uniqueness theorems. This is the standard honest finding for an empirical prompt-engineering study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework itself is an author-defined artifact. No numerical parameters are fitted to data in the abstract. The central claim rests on the assumption that explicit reasoning controls reduce hallucination and drift.

axioms (1)
  • domain assumption LLMs can follow explicit structural instructions to reduce reasoning drift and hallucination in security analysis
    Invoked when the authors state that the framework mitigates hallucination and prevents reasoning drift

pith-pipeline@v0.9.0 · 5593 in / 1381 out tokens · 39160 ms · 2026-05-10T18:49:12.107902+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
