Enhancing Reliability in LLM-Based Secure Code Generation

Ahmed Sabbah; David Mohaisen; Mohammad Alkhanafseh; Mohammed F. Kharma

arXiv: 2605.24300 · v1 · pith:CHFO7Z65new · submitted 2026-05-22 · 💻 cs.CR · cs.AI · cs.LG

Mohammed F. Kharma, Mohammad Alkhanafseh, Ahmed Sabbah, David Mohaisen

Pith reviewed 2026-06-30 15:10 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG

keywords LLM code generation · secure coding · prompt engineering · chain-of-thought · vulnerability mitigation · static analysis · CWE

The pith

Embedding mitigation guidance in chain-of-thought prompts cuts security vulnerabilities in LLM code generation by over half.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests several prompting strategies for making large language models generate secure code in C, Java, and Python. It finds that standard methods like zero-shot and basic chain-of-thought often fail to improve security and can even increase risks in certain languages. A new approach called Mitigation-Aware Chain-of-Thought adds specific instructions on avoiding common weaknesses and language-specific safeguards. This method reduces security findings substantially on two datasets and is the only one that works reliably across the tested models and languages. The work also breaks down where vulnerabilities come from, pointing to areas like operating system interactions that still need better handling.

Core claim

The Mitigation-Aware Chain-of-Thought (MA-CoT) framework embeds task-specific CWE mitigation guidance and language-aware safeguards to reduce recurring vulnerabilities. Evaluated across three LLMs, three languages, and four prompting strategies on a 200-task dataset with validation on LLMSecEval, MA-CoT is the only strategy that consistently improves security reliability, reducing total security findings from 92 to 39 on the primary dataset and from 73 to 4 on LLMSecEval, with similar drops in high-severity issues. Zero-shot and CoT are less reliable and may increase vulnerabilities, especially in C. A strict layered attribution shows residual risk concentrates in hardening-oriented patterns (e.g., OS- and toolchain-dependent ones).

What carries the argument

The Mitigation-Aware Chain-of-Thought (MA-CoT) framework that adds task-specific CWE mitigation guidance and language-aware safeguards to standard chain-of-thought prompting.
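To make the idea concrete, here is a minimal sketch of how a chain-of-thought prompt might be extended with CWE mitigation guidance and a language-specific safeguard. The paper's actual templates are not reproduced on this page, so every name below (`MitigationEntry`, `Task`, `LANGUAGE_SAFEGUARDS`, `build_ma_cot_prompt`) and all guidance wording are hypothetical illustrations rather than the authors' implementation.

```python
from dataclasses import dataclass, field


@dataclass
class MitigationEntry:
    """Hypothetical record pairing a CWE with guidance text (not from the paper)."""
    cwe_id: str
    guidance: str


@dataclass
class Task:
    description: str
    language: str
    mitigations: list[MitigationEntry] = field(default_factory=list)


# Illustrative language-aware safeguards; the paper's wording is assumed, not quoted.
LANGUAGE_SAFEGUARDS = {
    "C": "Bound every buffer write and check return values of library calls.",
    "Java": "Use parameterized queries and avoid deserializing untrusted data.",
    "Python": "Avoid eval/exec on input and pass argument lists to subprocess.",
}


def build_cot_prompt(task: Task) -> str:
    """Plain chain-of-thought: the task plus a generic reasoning cue."""
    return (
        f"Write {task.language} code for the following task.\n"
        f"Task: {task.description}\n"
        "Think step by step before writing the final code."
    )


def build_ma_cot_prompt(task: Task) -> str:
    """CoT prompt extended with task-specific CWE guidance and a language safeguard."""
    lines = [build_cot_prompt(task), "", "Mitigation guidance for this task:"]
    for entry in task.mitigations:
        lines.append(f"- {entry.cwe_id}: {entry.guidance}")
    safeguard = LANGUAGE_SAFEGUARDS.get(task.language)
    if safeguard:
        lines += ["", f"Language safeguard ({task.language}): {safeguard}"]
    return "\n".join(lines)


example_task = Task(
    description="Read a username from stdin and look it up in a SQLite table.",
    language="Python",
    mitigations=[MitigationEntry("CWE-89", "Bind user input as query parameters.")],
)
example_prompt = build_ma_cot_prompt(example_task)
```

The sketch keeps the plain CoT prompt as a prefix of the extended one, so the two conditions differ only in the added guidance; whether the paper's templates are structured this way is an assumption.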

Load-bearing premise

Static analysis tools combined with expert validation fully and fairly capture all relevant security vulnerabilities in the generated code without bias toward any prompting method.

What would settle it

Running the same code generation tasks with the four prompting strategies and having blinded security experts count vulnerabilities to see if MA-CoT still shows the reported reduction.

Figures

Figures reproduced from arXiv: 2605.24300 by Ahmed Sabbah, David Mohaisen, Mohammad Alkhanafseh, Mohammed F. Kharma.

**Figure 2.** Vulnerability severity counts across two datasets. Panels are organized by dataset (rows) and language (columns). Within ...

**Figure 3.** Common Weakness Enumeration (CWE) patterns across datasets, languages, LLMs, and prompting methods. Rows ...

Original abstract

Large language models (LLMs) are widely used for code generation, but their security reliability remains inconsistent across languages and prompting strategies. Existing prompt engineering improves functional correctness but rarely ensures consistent security outcomes. We introduce the Mitigation-Aware Chain-of-Thought (MA-CoT) framework, which embeds task-specific CWE mitigation guidance and language-aware safeguards to reduce recurring vulnerabilities in generated code. We evaluate MA-CoT across three LLMs (gpt-5, claude-4.5, gemini-2.5), three programming languages (C, Java, Python), and four prompting strategies (Vanilla, Zero-shot, CoT, MA-CoT) on a 200-task primary dataset, with external validation on LLMSecEval. Using static analysis with expert validation, MA-CoT reduces total security findings from 92 to 39 (57.6%) on the primary dataset and from 73 to 4 (94.5%) on LLMSecEval. High-severity findings (Blocker + Critical) drop from 90 to 39 (56.7%) and from 45 to 2 (95.6%), respectively. Across both datasets, MA-CoT is the only strategy that consistently improves security reliability; Zero-shot and CoT are less reliable and may increase vulnerability, especially in C. We further introduce a strict layered attribution of vulnerability drivers (language-core vs. stack layers) and show that residual risk concentrates in hardening-oriented patterns (e.g., OS- and toolchain-dependent), motivating secure-by-construction primitives alongside prompting.
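The percentages in the abstract are relative reductions, (before − after) / before. The short check below recomputes them from the before/after counts quoted above; the function name and dictionary keys are chosen here for illustration only.

```python
def percent_reduction(before: int, after: int) -> float:
    """Relative reduction from `before` to `after`, expressed as a percentage."""
    if before <= 0:
        raise ValueError("before must be positive")
    return 100.0 * (before - after) / before


# (before, after) pairs quoted in the abstract.
reported = {
    "primary, total findings": (92, 39),
    "LLMSecEval, total findings": (73, 4),
    "primary, Blocker + Critical": (90, 39),
    "LLMSecEval, Blocker + Critical": (45, 2),
}

reductions = {
    name: round(percent_reduction(before, after), 1)
    for name, (before, after) in reported.items()
}
for name, value in reductions.items():
    print(f"{name}: {value}%")
```

Rounded to one decimal place, the four values come out as 57.6, 94.5, 56.7, and 95.6, matching the figures stated in the abstract.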

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MA-CoT produces the largest reported drops in security findings but the evaluation details are too thin to separate real gains from possible measurement artifacts.

The paper introduces a prompting framework that adds CWE-specific mitigation steps and language-aware rules on top of chain-of-thought. It also splits vulnerability sources into language-core versus stack layers. Those two pieces are not standard in the prompting work they cite. The tests cover three LLMs, three languages, a 200-task set, and an external benchmark, which gives a broader picture than single-model studies.

The numbers are the clearest part: total findings fall from 92 to 39 on the main set and from 73 to 4 on LLMSecEval, with similar drops in high-severity cases. Plain CoT and zero-shot sometimes increase issues, especially in C. That pattern is worth noting for anyone using these models in practice.

The weak point is the measurement. Static analysis plus expert review is the only evidence, yet the abstract gives no tool list, no prompt templates, no blinding protocol, and no inter-rater numbers. Static tools miss semantic and context-dependent flaws, and unblinded review can favor one condition. On the information provided, the concern about measurement bias therefore stands.

The work is aimed at people who build or evaluate secure code generation pipelines. Readers who need concrete prompting ideas and multi-model comparisons will get something usable from it. It is solid enough to send for review; the framework is defined and the results are specific, even if the current write-up needs more on controls and reproducibility.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Mitigation-Aware Chain-of-Thought (MA-CoT) framework, which embeds task-specific CWE mitigation guidance and language-aware safeguards into prompting for LLM code generation. It evaluates MA-CoT against Vanilla, Zero-shot, and CoT strategies across three LLMs (gpt-5, claude-4.5, gemini-2.5), three languages (C, Java, Python), and a 200-task primary dataset plus LLMSecEval, claiming via static analysis plus expert validation that MA-CoT is the only strategy that consistently reduces vulnerabilities (92→39 findings, 57.6%; 73→4 findings, 94.5%) while others may increase them, especially in C; it also introduces a layered attribution of vulnerability drivers.

Significance. If the measured reductions hold under rigorous validation, the work offers a concrete prompting technique that improves security outcomes in LLM code generation, together with a useful decomposition of residual risk into language-core versus stack-layer drivers. The reliance on an external benchmark (LLMSecEval) alongside the primary dataset is a positive methodological feature.

major comments (3)

§4 (Evaluation setup): the headline reductions (92→39 and 73→4 total findings; 90→39 and 45→2 high-severity) are presented without error bars, statistical significance tests, or any description of the static-analysis tool versions, rule sets, or prompt-construction protocol; these omissions are load-bearing for the claim that MA-CoT is uniquely reliable.
§4.3 (Expert validation): no inter-rater reliability statistics or blinding protocol is reported for the expert review step; without these, systematic differences in scrutiny across prompting conditions cannot be ruled out and directly affect the central claim that only MA-CoT improves security.
§5 (Results): the conclusion that static analysis plus expert validation provides a complete, unbiased measure of vulnerability reduction rests on an untested assumption that the chosen tools detect all relevant flaw types uniformly; the paper does not address known limitations of static analyzers on semantic or context-dependent issues (e.g., incorrect hardening-API usage).
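For the inter-rater statistic requested in the §4.3 comment, a common choice for two reviewers is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain implementation of the standard formula; the verdict lists are made-up illustration data, not results from the paper, and the two-rater setup is an assumption.

```python
from collections import Counter


def cohens_kappa(rater_a: list[str], rater_b: list[str]) -> float:
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up verdicts on ten static-analysis findings (illustration only).
expert_1 = ["vuln", "vuln", "safe", "vuln", "safe", "safe", "vuln", "safe", "vuln", "safe"]
expert_2 = ["vuln", "safe", "safe", "vuln", "safe", "safe", "vuln", "safe", "vuln", "vuln"]
kappa = cohens_kappa(expert_1, expert_2)
```

In this toy case the raters agree on 8 of 10 items and chance agreement is 0.5, so kappa is about 0.6. Reporting a value like this per prompting condition would show whether scrutiny differed across conditions.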

minor comments (2)

Abstract and §3: model names 'gpt-5, claude-4.5, gemini-2.5' read as placeholders; replace with the actual model identifiers used in the experiments.
§3.2: the 'strict layered attribution' of vulnerability drivers is introduced but lacks an explicit decision procedure or example trace that would allow replication.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on the evaluation methodology. We address each major comment below, indicating where revisions will be made to improve reproducibility and transparency.

Referee: §4 (Evaluation setup): the headline reductions (92→39 and 73→4 total findings; 90→39 and 45→2 high-severity) are presented without error bars, statistical significance tests, or any description of the static-analysis tool versions, rule sets, or prompt-construction protocol; these omissions are load-bearing for the claim that MA-CoT is uniquely reliable.

Authors: We agree these details are essential. In the revision we will expand §4 with: (1) exact tool versions and rule sets (SonarQube 9.9 with OWASP Top 10 and CWE rules; CodeQL with custom security queries); (2) the full prompt-construction protocol and templates for each strategy; (3) a note that all generations used temperature=0 for determinism on the fixed 200-task set. Because the study used single deterministic runs rather than repeated sampling, error bars and significance tests were not computed; we will add an explicit discussion of this design choice and its implications for generalizability.
Revision: partial

Referee: §4.3 (Expert validation): no inter-rater reliability statistics or blinding protocol is reported for the expert review step; without these, systematic differences in scrutiny across prompting conditions cannot be ruled out and directly affect the central claim that only MA-CoT improves security.

Authors: We will revise §4.3 to describe the validation protocol in full, including that the single security expert was blinded to prompting condition during review and that reviews were performed in randomized order. We will also note the absence of inter-rater reliability statistics due to the single-rater design and discuss this as a limitation.
Revision: yes

Referee: §5 (Results): the conclusion that static analysis plus expert validation provides a complete, unbiased measure of vulnerability reduction rests on an untested assumption that the chosen tools detect all relevant flaw types uniformly; the paper does not address known limitations of static analyzers on semantic or context-dependent issues (e.g., incorrect hardening-API usage).

Authors: We agree that static analyzers have well-known limitations on semantic and context-dependent flaws. The revised §5 will explicitly acknowledge these limitations, cite supporting literature, and explain how the subsequent expert validation step was intended to surface issues missed by automation (e.g., incorrect hardening-API usage). We will qualify the completeness claim accordingly.
Revision: yes
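The blinded, randomized-order review described in the response to the §4.3 comment could be set up along the following lines. This is only a sketch under stated assumptions: the sample schema (`code`, `strategy`, `task` fields), the `blind_samples` helper, and the opaque ID format are all hypothetical, and the paper's actual protocol is not described on this page.

```python
import random


def blind_samples(samples: list[dict], seed: int) -> tuple[list[dict], dict[str, dict]]:
    """Return a shuffled review queue without condition labels, plus a sealed key.

    Each input dict is assumed to carry a 'code' field and metadata such as
    'strategy' (hypothetical schema). The reviewer sees only the queue; the key
    maps opaque IDs back to the metadata after review.
    """
    rng = random.Random(seed)
    queue: list[dict] = []
    key: dict[str, dict] = {}
    for sample in samples:
        blind_id = f"S{rng.getrandbits(48):012x}"  # opaque identifier
        queue.append({"id": blind_id, "code": sample["code"]})
        key[blind_id] = {k: v for k, v in sample.items() if k != "code"}
    rng.shuffle(queue)  # randomized review order
    return queue, key


strategies = ["Vanilla", "Zero-shot", "CoT", "MA-CoT"]
samples = [
    {"code": f"/* output for {name} */", "strategy": name, "task": "task-001"}
    for name in strategies
]
queue, key = blind_samples(samples, seed=7)
```

Removing the label does not guarantee blinding on its own, since generated code can carry stylistic cues (for example, comments echoing mitigation guidance) that reveal the prompting condition.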

Circularity Check

0 steps flagged

No circularity: empirical comparison on external benchmarks with no derivations or self-referential fits

The paper is an empirical evaluation of prompting strategies (Vanilla, Zero-shot, CoT, MA-CoT) on fixed datasets using static analysis tools plus expert validation. No equations, parameter fitting, or first-principles derivations are claimed. Results (e.g., security finding reductions) are direct measurements, not quantities defined in terms of themselves or forced by self-citation chains. External benchmarks (primary 200-task set, LLMSecEval) and tools provide independent grounding; the central claim does not reduce to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on empirical measurement via static analysis and expert review rather than new mathematical derivations; it assumes standard properties of LLM prompting and vulnerability detection tools without introducing fitted parameters or new physical entities.

axioms (1)

Domain assumption: Static analysis tools combined with expert validation provide a reliable count of security vulnerabilities in generated code samples.
This underpins the reported reductions from 92 to 39 findings and is invoked when claiming MA-CoT improves security reliability.

pith-pipeline@v0.9.1-grok · 5825 in / 1402 out tokens · 41827 ms · 2026-06-30T15:10:57.247719+00:00 · methodology


2022

[8] [8]

O zsoy I, Ayerdem M, T \

B. Yetistiren, I. Özsoy, M. Ayerdem, and E. Tüzün, “Evaluating the code quality of ai-assisted code generation tools: An empirical study on github copilot, amazon codewhisperer, and chatgpt,”CoRR, vol. abs/2304.10778, 2023

work page arXiv 2023

[9] [9]

Asleep at the keyboard? assessing the security of github copilot’s code contri- butions,

H. Pearce, B. Ahmad, B. Tan, B. Dolan-Gavitt, and R. Karri, “Asleep at the keyboard? assessing the security of github copilot’s code contri- butions,” inIEEE Symp. on Security & Privacy, 2022

2022

[10] [10]

Security weaknesses of copilot-generated code in github projects: An empirical study,

Y . Fu, P. Liang, A. Tahir, Z. Li, M. Shahin, J. Yu, and J. Chen, “Security weaknesses of copilot-generated code in github projects: An empirical study,”CoRR, vol. abs/2310.02059, 2025

work page arXiv 2025

[11] [11]

Lost at c: A user study on the security implications of large language model code assistants,

G. Sandoval, H. Pearce, T. Nys, R. Karri, S. Garg, and B. Dolan-Gavitt, “Lost at c: A user study on the security implications of large language model code assistants,” inProceedings of the 32nd USENIX Security Symposium, 2023, pp. 2205–2222

2023

[12] [12]

Is github’s copilot as bad as humans at introducing vulnerabilities in code?

O. Asare, M. Nagappan, and N. Asokan, “Is github’s copilot as bad as humans at introducing vulnerabilities in code?”Empir. Softw. Eng., vol. 28, no. 6, p. 129, 2023. [Online]. Available: https://doi.org/10.1007/s10664-023-10380-1

work page doi:10.1007/s10664-023-10380-1 2023

[13] [13]

Llms cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks,

S. Ullah, M. Han, S. Pujar, H. Pearce, A. Coskun, and G. Stringhini, “Llms cannot reliably identify and reason about security vulnerabilities (yet?): A comprehensive evaluation, framework, and benchmarks,” in IEEE Symposium on Security and Privacy, 2024

2024

[14] [14]

Chain-of-thought prompting elicits reasoning in large language models,

J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. H. Chi, Q. V . Le, and D. Zhou, “Chain-of-thought prompting elicits reasoning in large language models,” inProceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS), 2022

2022

[15] [15]

Large language models are zero-shot reasoners,

T. Kojima, S. S. Gu, M. Reid, Y . Matsuo, and Y . Iwasawa, “Large language models are zero-shot reasoners,” inAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and ...

2022

[16] [16]

Available: http://papers.nips.cc/paper_files/paper/2022/ hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html

[Online]. Available: http://papers.nips.cc/paper_files/paper/2022/ hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html

2022

[17] [17]

ACM Trans Softw Eng Methodol 34(8):225:1--225:53, doi:10.1145/3722108, ://doi.org/10.1145/3722108

C. Tony, N. E. D. Ferreyra, M. Mutas, S. Dhif, and R. Scandariato, “Prompting techniques for secure code generation: A systematic investigation,”ACM TOSEM, vol. 34, no. 8, pp. 225:1–225:53, 2025. [Online]. Available: https://doi.org/10.1145/3722108

work page doi:10.1145/3722108 2025

[18] [18]

An empirical evaluation of llm-generated code security across prompting methods,

M. Kharma, A. Sabbah, M. AlKhanafseh, M. Hammoudeh, and D. Mo- haisen, “An empirical evaluation of llm-generated code security across prompting methods,” 2026, submitted to Empirical Software Engineer- ing

2026

[19] [19]

Why does the effective context length of llms fall short?

C. An, J. Zhang, M. Zhong, L. Li, S. Gong, Y . Luo, J. Xu, and L. Kong, “Why does the effective context length of llms fall short?”CoRR, vol. abs/2410.18745, 2024

work page arXiv 2024

[20] [20]

Rethinking the evaluation of secure code generation,

S.-C. Dai, J. Xu, and G. Tao, “Rethinking the evaluation of secure code generation,” 2025. [Online]. Available: https://arxiv.org/abs/2503.15554

work page arXiv 2025

[21] [21]

Devaic: A tool for security assessment of ai-generated code,

D. Cotroneo, R. D. Luca, and P. Liguori, “Devaic: A tool for security assessment of ai-generated code,”Inf. Softw. Technol., vol. 177, p. 107572, 2025. [Online]. Available: https://doi.org/10.1016/j.infsof.2024. 107572

work page doi:10.1016/j.infsof.2024 2025

[22] [22]

Give llms a security course: Securing retrieval-augmented code generation via 13 knowledge injection,

B. Lin, S. Wang, Y . Qin, L. Chen, and X. Mao, “Give llms a security course: Securing retrieval-augmented code generation via 13 knowledge injection,” inCCS, C. Huang, J. Chen, S. Shieh, D. Lie, and V . Cortier, Eds. ACM, 2025, pp. 3356–3370. [Online]. Available: https://doi.org/10.1145/3719027.3765049

work page doi:10.1145/3719027.3765049 2025

[23] [23]

Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications,

N. Huynh and B. Lin, “Large language models for code generation: A comprehensive survey of challenges, techniques, evaluation, and applications,”CoRR, vol. abs/2503.01245, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2503.01245

work page doi:10.48550/arxiv.2503.01245 2025

[24] [24]

Using AI assistants in software development: A qualitative study on security practices and concerns,

J. H. Klemmer, S. A. Horstmann, N. Patnaik, C. Ludden, C. B. Jr., C. Powers, F. Massacci, A. Rahman, D. V otipka, H. R. Lipford, A. Rashid, A. Naiakshina, and S. Fahl, “Using AI assistants in software development: A qualitative study on security practices and concerns,” inCCS. ACM, 2024, pp. 2726–2740. [Online]. Available: https://doi.org/10.1145/3658644.3690283

work page doi:10.1145/3658644.3690283 2024

[25] [25]

How secure is ai-generated code: a large-scale comparison of large language models,

N. Tihanyi, T. Bisztray, M. A. Ferrag, R. Jain, and L. C. Cordeiro, “How secure is ai-generated code: a large-scale comparison of large language models,”Empir. Softw. Eng., vol. 30, no. 2, p. 47, 2025. [Online]. Available: https://doi.org/10.1007/s10664-024-10590-1

work page doi:10.1007/s10664-024-10590-1 2025

[26] [26]

A Systematic Survey of Prompt Engineering in Large Language Models: Techniques and Applications

P. Sahoo, A. K. Singh, S. Saha, V . Jain, S. Mondal, and A. Chadha, “A systematic survey of prompt engineering in large language models: Techniques and applications,”CoRR, vol. abs/2402.07927, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.07927

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2402.07927 2024

[27] [27]

arXiv preprint arXiv:2310.14735 , year=

B. Chen, Z. Zhang, N. Langrené, and S. Zhu, “Unleashing the potential of prompt engineering in large language models: a comprehensive review,”CoRR, vol. abs/2310.14735, 2023. [Online]. Available: https://doi.org/10.48550/arXiv.2310.14735

work page doi:10.48550/arxiv.2310.14735 2023

[28] [28]

2024.From Vulnerabilities to Remediation: A Systematic Literature Review of LLMs in Code Security

E. Basic and A. Giaretta, “Large language models and code security: A systematic literature review,”CoRR, vol. abs/2412.15004, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2412.15004

work page doi:10.48550/arxiv.2412.15004 2024

[29] [29]

You still have to study on the security of LLM generated code,

A. Schaad, S. Götz, and D. Binder, “You still have to study on the security of LLM generated code,” inICT Systems Security and Privacy Protection - 40th IFIP International Conference, SEC 2025, Maribor, Slovenia, May 21-23, 2025, Proceedings, Part II, ser. IFIP Advances in Information and Communication Technology, L. N. Zlatolas, K. Rannenberg, T. Welzer,...

work page doi:10.1007/978-3-031-92886-4_8 2025

[30] [30]

Secure Code Generation at Scale with Reflexion

A. Datta, A. Aljohani, and H. Do, “Secure code generation at scale with reflexion,”CoRR, vol. abs/2511.03898, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2511.03898

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2511.03898 2025

[31] [31]

Language models can solve computer tasks,

G. Kim, P. Baldi, and S. McAleer, “Language models can solve computer tasks,” inProceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS), 2023

2023

[32] [32]

Secure-instruct: An automated pipeline for synthesizing instruction-tuning datasets using llms for secure code generation,

J. Li, F. Rabbi, B. Yang, S. Wang, and J. Yang, “Secure-instruct: An automated pipeline for synthesizing instruction-tuning datasets using llms for secure code generation,” 2025. [Online]. Available: https://arxiv.org/abs/2510.07189

work page arXiv 2025

[33] [33]

RESCUE: retrieval augmented secure code generation,

J. Shi and T. Zhang, “RESCUE: retrieval augmented secure code generation,”CoRR, vol. abs/2510.18204, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2510.18204

work page doi:10.48550/arxiv.2510.18204 2025

[34] [34]

Towards secure code generation with llms: A study on common weakness enumeration,

J. Zhao, Y . Sun, C. Huang, C. Liu, Y . Guan, Y . Zeng, and Y . Liu, “Towards secure code generation with llms: A study on common weakness enumeration,”IEEE Transactions on Software Engineering, vol. 51, no. 12, pp. 3507–3523, 2025

2025

[35] [35]

Promsec: Prompt optimization for secure generation of functional source code with large language models (llms),

M. Nazzal, I. Khalil, A. Khreishah, and N. Phan, “Promsec: Prompt optimization for secure generation of functional source code with large language models (llms),” inCCS. ACM, 2024, pp. 2266–2280. [Online]. Available: https://doi.org/10.1145/3658644.3690298

work page doi:10.1145/3658644.3690298 2024

[36] [36]

Chain-of-thought prompting of large language models for discovering and fixing software vulnerabilities,

Y . Nong, M. Aldeen, L. Cheng, H. Hu, F. Chen, and H. Cai, “Chain-of-thought prompting of large language models for discovering and fixing software vulnerabilities,”CoRR, vol. abs/2402.17230, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2402.17230

work page doi:10.48550/arxiv.2402.17230 2024

[37] [37]

Self-planning code generation with large language models,

X. Jiang, Y . Dong, L. Wang, Z. Fang, Q. Shang, G. Li, Z. Jin, and W. Jiao, “Self-planning code generation with large language models,” ACM Trans. Softw. Eng. Methodol., vol. 33, no. 7, pp. 182:1–182:30,

[38] [38]

Available: https://doi.org/10.1145/3672456

[Online]. Available: https://doi.org/10.1145/3672456

work page doi:10.1145/3672456

[39] [39]

IEEE Transactions on Dependable and Secure Computing pp 1--15, doi:10.1109/TDSC.2026.3672745, ://doi.org/10.1109/TDSC.2026.3672745

M. F. Kharma, S. Choi, M. Alkhanafseh, and D. Mohaisen, “Security and quality in llm-generated code: a multi-language, multi-model analysis,”IEEE Transactions on Dependable and Secure Computing, no. 01, pp. 1–15, 2026. [Online]. Available: https: //doi.org/10.1109/TDSC.2026.3672745

work page doi:10.1109/tdsc.2026.3672745 2026

[40] [40]

Llm-csec: Empirical evaluation of security in c/c++ code generated by large language models,

M. U. Shahid, C. M. Ahmed, and R. Ranjan, “Llm-csec: Empirical evaluation of security in c/c++ code generated by large language models,” 2025. [Online]. Available: https://arxiv.org/abs/2511.18966

work page arXiv 2025

[41] [41]

Cweval: Outcome-driven evaluation on functionality and security of LLM code generation,

J. Peng, L. Cui, K. Huang, J. Yang, and B. Ray, “Cweval: Outcome-driven evaluation on functionality and security of LLM code generation,” inIEEE/ACM International Workshop on Large Language Models for Code, LLM4Code@ICSE 2025, Ottawa, ON, Canada, May 3, 2025. IEEE, 2025, pp. 33–40. [Online]. Available: https://doi.org/10.1109/LLM4Code66737.2025.00009

work page doi:10.1109/llm4code66737.2025.00009 2025

[42] [42]

Baxbench: Can llms generate correct and secure backends?

M. Vero, N. Mündler, V . Chibotaru, V . Raychev, M. Baader, N. Jovanovic, J. He, and M. T. Vechev, “Baxbench: Can llms generate correct and secure backends?” inICML. OpenReview.net, 2025. [Online]. Available: https://openreview.net/forum?id=il3KRr4H9u

2025

[43] [43]

SecureVibeBench: Benchmarking Secure Vibe Coding of AI Agents via Reconstructing Vulnerability-Introducing Scenarios

J. Chen, H. Huang, Y . Lyu, J. An, J. Shi, C. Yang, T. Zhang, H. Tian, Y . Li, Z. Li, X. Zhou, X. Hu, and D. Lo, “Secureagentbench: Benchmarking secure code generation under realistic vulnerability scenarios,”CoRR, vol. abs/2509.22097, 2025. [Online]. Available: https://doi.org/10.48550/arXiv.2509.22097

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2509.22097 2025

[44] [44]

Gpt-5 model | openai api,

—, “Gpt-5 model | openai api,” https://platform.openai.com/docs/ models/gpt-5, 09 2025, (Accessed on 2025-09-24)

2025

[45] [45]

What’s new in claude 4.5 - claude doc,

——, “What’s new in claude 4.5 - claude doc,” https://platform.claude. com/docs/en/about-claude/models/whats-new-claude-4-5, 09 2025, (Ac- cessed on 2025-09-24)

2025

[46] [46]

Gemini models | gemini api | google ai for developers,

——, “Gemini models | gemini api | google ai for developers,” https: //ai.google.dev/gemini-api/docs/models, 09 2025, (Accessed on 2025- 09-24)

2025

[47] [47]

Code quality, security & static analysis tool with SonarQube,

——, “Code quality, security & static analysis tool with SonarQube,” https://www.sonarsource.com/products/sonarqube/, 05 2024, (Accessed on 05/12/2024)

2024

[48] [48]

Llmseceval: A dataset of natural language prompts for security evaluations,

C. Tony, M. Mutas, N. E. D. Ferreyra, and R. Scandariato, “Llmseceval: A dataset of natural language prompts for security evaluations,” inMSR. IEEE, 2023, pp. 588–592. [Online]. Available: https://doi.org/10.1109/MSR59073.2023.00084

work page doi:10.1109/msr59073.2023.00084 2023

[49] [49]

Benchmarking prompt engineering techniques for secure code generation with GPT models,

M. Bruni, F. Gabrielli, M. Ghafari, and M. Kropp, “Benchmarking prompt engineering techniques for secure code generation with GPT models,” inForge@ICSE. IEEE, 2025, pp. 93–103. [Online]. Available: https://doi.org/10.1109/Forge66646.2025.00018

work page doi:10.1109/forge66646.2025.00018 2025

[50] [50]

SALLM: security assessment of generated code,

M. L. Siddiq, J. C. da Silva Santos, S. Devareddy, and A. Muller, “SALLM: security assessment of generated code,” inASE Workshops. ACM, 2024, pp. 54–65. [Online]. Available: https://doi.org/10.1145/ 3691621.3694934

work page arXiv 2024

[51] [51]

From solitary directives to interactive encouragement! LLM secure code generation by natural language prompting,

S. Liu, B. Sabir, S. I. Jang, Y . Kansal, Y . Gao, K. Moore, A. Abuadbba, and S. Nepal, “From solitary directives to interactive encouragement! LLM secure code generation by natural language prompting,”CoRR, vol. abs/2410.14321, 2024. [Online]. Available: https://doi.org/10.48550/arXiv.2410.14321 APPENDIXA EXAMPLEENTRIES FROM THEMITIGATION-AWARE DATASET T...

work page doi:10.48550/arxiv.2410.14321 2024