An Empirical Security Evaluation of LLM-Generated Cryptographic Rust Code
Pith reviewed 2026-05-07 13:20 UTC · model grok-4.3
The pith
LLMs generate Rust cryptographic code that compiles successfully only 23.3 percent of the time, and 57 percent of the samples that do compile contain vulnerabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Among the 240 generated Rust samples, only 23.3 percent compiled. CodeQL produced just two false positives on the compiled set, while the authors' rule-based crypto-specific analyzer found vulnerabilities in 57 percent of those samples with zero false positives. Compilation success differed sharply between the two algorithms (34.2 percent for AES-256-GCM versus 12.5 percent for ChaCha20-Poly1305) and was significantly affected by prompt strategy (P = 0.002), with chain-of-thought prompting performing five times worse than zero-shot. All three models exhibited systematic failures, including nonce reuse and API hallucinations.
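The nonce-reuse failure mode is concrete: AES-256-GCM and ChaCha20-Poly1305 both derive a keystream from (key, nonce) and XOR it with the plaintext, so two messages encrypted under the same key and nonce are XORed with the same bytes. The toy Rust sketch below illustrates the leak; its "keystream" is a stand-in xorshift expansion for demonstration, not a real cipher or any code from the paper.

```rust
// Toy illustration of why nonce reuse is catastrophic in CTR-style AEAD
// modes: the same (key, nonce) always yields the same keystream.

fn toy_keystream(key: u64, nonce: u64, len: usize) -> Vec<u8> {
    // Deterministic expansion of (key, nonce) into `len` bytes
    // via xorshift64 steps. NOT cryptographically secure.
    let mut state = key ^ nonce.rotate_left(17);
    (0..len)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            (state & 0xff) as u8
        })
        .collect()
}

fn toy_encrypt(key: u64, nonce: u64, plaintext: &[u8]) -> Vec<u8> {
    toy_keystream(key, nonce, plaintext.len())
        .iter()
        .zip(plaintext)
        .map(|(k, p)| k ^ p)
        .collect()
}

fn main() {
    let key = 0x5eed_5eed_5eed_5eed_u64;
    let nonce = 42; // reused for two messages: the bug the paper flags
    let p1 = b"transfer $100 to alice";
    let p2 = b"transfer $900 to mallo";

    let c1 = toy_encrypt(key, nonce, p1);
    let c2 = toy_encrypt(key, nonce, p2);

    // With a reused nonce the keystreams cancel: c1 XOR c2 == p1 XOR p2,
    // so an attacker who knows either plaintext recovers the other.
    let cx: Vec<u8> = c1.iter().zip(&c2).map(|(a, b)| a ^ b).collect();
    let px: Vec<u8> = p1.iter().zip(p2.iter()).map(|(a, b)| a ^ b).collect();
    assert_eq!(cx, px);
    println!("nonce reuse leaks p1 XOR p2");
}
```

The same algebra holds for the real modes, which is why GCM's and ChaCha20-Poly1305's specifications require a unique nonce per (key, message) pair.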
What carries the argument
The rule-based crypto-specific analyzer that associates detected issues with Common Weakness Enumerations and is applied to compiled LLM-generated code samples.
Load-bearing premise
The custom rule-based crypto-specific analyzer accurately detects all relevant vulnerabilities with zero false positives and no false negatives for the two algorithms tested.
What would settle it
A compiled sample that the custom analyzer flags as vulnerable but independent expert review or formal verification confirms is secure, or a sample the analyzer clears that later proves to contain a vulnerability.
Figures
Original abstract
Developers and organizations are using Large Language Models (LLMs) to generate security-critical code more frequently than ever, including cryptographic solutions for their products. This study presents an empirical evaluation of cryptographic security in 240 Rust code samples for two crypto algorithms (AES-256-GCM and ChaCha20-Poly1305) generated by three LLMs (Gemini 2.5 Pro, GPT-4o, and DeepSeek Coder) using four different prompt strategies. For each successfully compiled code sample, CodeQL static analysis and our rule-based crypto-specific analyzer were used to detect vulnerabilities, which are also associated with Common Weakness Enumeration (CWE). The evaluation results revealed that only 23.3% of the generated code samples were successfully compiled. Among the compiled code, CodeQL produced only two false positives, while our rule-based crypto-specific analyzer identified vulnerabilities in 57% of the compiled samples with zero false positives. This demonstrates that general-purpose analysis tools are insufficient for code validation for the experimented crypto algorithms. The compilation success of the two algorithms varied significantly (AES-256-GCM 34.2% versus ChaCha20-Poly1305 12.5%), showing a gap in code generation capabilities. While model choice had no significant effect on compilation success, prompt strategy significantly influenced outcomes (P = 0.002), with chain-of-thought prompting performing 5 times worse than zero-shot. All three models exhibit systematic failures, including nonce reuse and API hallucinations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical evaluation of 240 Rust code samples for AES-256-GCM and ChaCha20-Poly1305 generated by Gemini 2.5 Pro, GPT-4o, and DeepSeek Coder using four prompt strategies. It reports a 23.3% compilation success rate, with the custom rule-based analyzer detecting vulnerabilities in 57% of compiled samples (zero false positives) versus CodeQL's two false positives, significant differences in compilation by algorithm and prompt (P=0.002, chain-of-thought worst), and systematic failures including nonce reuse and API hallucinations. The central conclusion is that general-purpose static analysis tools are insufficient for validating LLM-generated cryptographic code.
Significance. If the custom analyzer's zero-FP claim holds under independent validation, the work would provide concrete empirical evidence of LLMs' limitations in producing secure crypto implementations and the shortcomings of off-the-shelf tools like CodeQL for domain-specific checks. The compilation-rate gap between algorithms and the prompt-strategy effect (with statistical support) would add actionable insights for AI-assisted secure coding practices.
major comments (2)
- [Methods and Results (analyzer evaluation)] The claim that the rule-based crypto-specific analyzer achieves zero false positives while identifying vulnerabilities in 57% of the ~56 compiled samples (abstract and results) is load-bearing for the comparison to CodeQL and the conclusion that general-purpose tools are insufficient. No external ground-truth validation, blinded expert review, public rule set, or inter-rater agreement is described; the accuracy rests on author judgment alone.
- [Results (statistical analysis)] The reported P=0.002 for prompt-strategy effect on compilation success lacks accompanying details on per-group sample sizes, exact statistical test (e.g., chi-square or Fisher's), multiple-comparison correction, or power analysis. Given the low overall compilation rate (23.3%), this weakens confidence in the claim that prompt strategy is a significant factor.
minor comments (2)
- [Methods] Clarify the exact wording of the four prompt strategies and the sampling procedure for the 240 instances (e.g., temperature, number of generations per prompt/algorithm/model) to allow replication.
- [Abstract] The abstract states CodeQL produced 'only two false positives' but does not specify what the true-positive baseline was or how false positives were determined for the custom analyzer.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our paper. We address the major comments point by point below and indicate the revisions we will make to the manuscript.
Point-by-point responses
Referee: [Methods and Results (analyzer evaluation)] The claim that the rule-based crypto-specific analyzer achieves zero false positives while identifying vulnerabilities in 57% of the ~56 compiled samples (abstract and results) is load-bearing for the comparison to CodeQL and the conclusion that general-purpose tools are insufficient. No external ground-truth validation, blinded expert review, public rule set, or inter-rater agreement is described; the accuracy rests on author judgment alone.
Authors: We agree that the zero false positive rate for the custom analyzer is a key claim and currently relies on our internal verification process. The analyzer consists of deterministic rules derived from standard cryptographic best practices and known vulnerabilities (e.g., nonce reuse in AES-GCM, incorrect key sizes). Each flagged sample was manually inspected by the authors to confirm the presence of the vulnerability and that no false positives occurred in the compiled set. To strengthen this, we will revise the Methods section to provide the full rule set, release the analyzer as open-source code, and include examples of detected issues. While we did not conduct a blinded external review for this study, we believe the transparency will allow independent validation. We will update the abstract and results to reflect this additional detail. revision: yes
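The rebuttal describes the analyzer as deterministic rules derived from cryptographic best practices. As a rough illustration only, one such rule might look like the std-only Rust sketch below, which flags source lines that construct an AEAD nonce from a constant literal; the patterns, struct names, and CWE mapping are illustrative assumptions, not the paper's actual rule set.

```rust
// Sketch of one deterministic rule: flag source text that appears to
// build an AEAD nonce from a hardcoded constant, which enables reuse
// across messages. Patterns and CWE mapping are illustrative only.

struct Finding {
    line: usize,
    cwe: &'static str,
    message: &'static str,
}

fn check_hardcoded_nonce(source: &str) -> Vec<Finding> {
    let mut findings = Vec::new();
    for (i, line) in source.lines().enumerate() {
        let is_nonce_line = line.contains("Nonce::from_slice")
            || line.to_lowercase().contains("nonce");
        // A byte-string or zeroed-array literal on a nonce line suggests
        // a constant nonce rather than a randomly generated one.
        let has_literal = line.contains("b\"")
            || line.contains("[0u8;")
            || line.contains("[0; 12]");
        if is_nonce_line && has_literal {
            findings.push(Finding {
                line: i + 1,
                cwe: "CWE-323",
                message: "possible hardcoded/constant AEAD nonce",
            });
        }
    }
    findings
}

fn main() {
    let sample = r#"
let key = Key::from_slice(&key_bytes);
let nonce = Nonce::from_slice(b"unique nonce"); // constant nonce!
let ct = cipher.encrypt(nonce, plaintext).unwrap();
"#;
    for f in check_hardcoded_nonce(sample) {
        println!("line {}: {} ({})", f.line, f.message, f.cwe);
    }
}
```

A text-matching rule like this is trivially auditable (each pattern can be published and re-run), which is the kind of transparency the rebuttal promises; a production analyzer would more plausibly work on the parsed AST rather than raw lines.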
Referee: [Results (statistical analysis)] The reported P=0.002 for prompt-strategy effect on compilation success lacks accompanying details on per-group sample sizes, exact statistical test (e.g., chi-square or Fisher's), multiple-comparison correction, or power analysis. Given the low overall compilation rate (23.3%), this weakens confidence in the claim that prompt strategy is a significant factor.
Authors: The P-value of 0.002 was calculated using a chi-square test of independence on the 4x2 contingency table (four prompt strategies by compiled/not compiled), with 60 samples per strategy. No multiple-comparison correction was needed as this was the sole test for prompt effects. We will add these details to the Results section, including the contingency table and degrees of freedom. We will also include a post-hoc power analysis to address concerns about the low base rate. We maintain that prompt strategy is a significant factor but will provide the requested statistical transparency. revision: yes
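The test the authors describe is mechanical to reproduce. The sketch below computes the chi-square statistic for a 4x2 contingency table in plain Rust; the per-strategy compilation counts are hypothetical, chosen only to sum to the reported 56/240 compiled samples and to make chain-of-thought five times worse than zero-shot, and are not the paper's actual data.

```rust
// Chi-square test of independence on a (strategies x outcomes) table.
// Columns are [compiled, failed]; rows are prompt strategies (60 each).

fn chi_square_statistic(table: &[[f64; 2]]) -> f64 {
    let row_totals: Vec<f64> = table.iter().map(|r| r[0] + r[1]).collect();
    let col_totals = [
        table.iter().map(|r| r[0]).sum::<f64>(),
        table.iter().map(|r| r[1]).sum::<f64>(),
    ];
    let grand: f64 = row_totals.iter().sum();
    let mut chi2 = 0.0;
    for (i, row) in table.iter().enumerate() {
        for (j, &observed) in row.iter().enumerate() {
            // Expected count under independence of strategy and outcome.
            let expected = row_totals[i] * col_totals[j] / grand;
            chi2 += (observed - expected).powi(2) / expected;
        }
    }
    chi2
}

fn main() {
    // Rows: zero-shot, few-shot, role-based, chain-of-thought.
    // Counts are illustrative, not the paper's data.
    let table = [
        [25.0, 35.0],
        [15.0, 45.0],
        [11.0, 49.0],
        [5.0, 55.0],
    ];
    let chi2 = chi_square_statistic(&table);
    let df = (table.len() - 1) * (2 - 1); // (rows-1)(cols-1) = 3
    println!("chi2 = {chi2:.2}, df = {df}");
    // For df = 3, the critical value at alpha = 0.05 is 7.815; a larger
    // statistic rejects independence of prompt strategy and compilation.
    assert!(chi2 > 7.815);
}
```

Publishing the contingency table alongside the statistic, as the authors propose, lets readers recompute the result exactly this way.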
Circularity Check
No circularity: purely empirical measurement study
Full rationale
The paper reports direct measurements of LLM code generation success rates, compilation outcomes, and vulnerability detections via CodeQL plus a custom rule-based analyzer on AES-256-GCM and ChaCha20-Poly1305 samples. No equations, fitted parameters, predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text or abstract. The 57% vulnerability rate and zero-FP claim for the custom analyzer are presented as empirical observations from applying the rules to the generated corpus, not as outputs derived from the inputs by construction. The study is self-contained against its own experimental pipeline and external benchmarks (CodeQL), with no reduction of claims to tautologies or author-specific priors.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The rule-based crypto-specific analyzer identifies vulnerabilities with zero false positives and no missed issues for AES-256-GCM and ChaCha20-Poly1305.
Reference graph
Works this paper leans on
- [2] Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374
- [3] Austin, J., Odena, A., Nye, M., et al. (2021). Program synthesis with large language models. arXiv:2108.07732
- [4]
- [5] Lee, Y., Diaz, E., Yang, J., Liu, B. (2025). Enhancing concurrency bug detection in Rust programs through LLVM IR based graph visualization. High-Confidence Computing, 100377. https://doi.org/10.1016/j.hcc.2025.100377
- [6] Lee, Y., Boshra, S.J., Yang, J., Cao, Z., Liang, G. (2025). Machine learning-based vulnerability detection in Rust code using LLVM IR and transformer model. Mach. Learn. Knowl. Extr., 7, 79. https://doi.org/10.3390/make7030079
- [7] Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., Karri, R. (2022). Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. In 2022 IEEE Symposium on Security and Privacy (SP), pp. 754-768. IEEE.
- [8] Rogaway, P. (2002). Authenticated-encryption with associated-data. In Proceedings of the 9th ACM Conference on Computer and Communications Security (CCS).
- [9] Joux, A. (2006). Authentication failures in NIST version of GCM. https://csrc.nist.gov/csrc/media/projects/block-cipher-techniques/documents/bcm/joux_comments.pdf
- [10] Dworkin, M. (2007). Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC. NIST Special Publication 800-38D, National Institute of Standards and Technology.
- [11] Nir, Y., Langley, A. (2018). ChaCha20 and Poly1305 for IETF Protocols. RFC 8439, Internet Engineering Task Force. https://datatracker.ietf.org/doc/rfc8439/
- [12] Lazar, D., Chen, H., Wang, X., Zeldovich, N. (2014). Why does cryptographic software fail? A case study and open problems. In Proceedings of the 5th Asia-Pacific Workshop on Systems, pp. 1-7.
- [13] Egele, M., Brumley, D., Fratantonio, Y., Kruegel, C. (2013). An empirical study of cryptographic misuse in Android applications. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 73-84.
- [14] Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824-24837.
- [15] Kang, D., Li, X., Stoica, I., Guestrin, C., Zaharia, M., Hashimoto, T. (2023). Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks. arXiv:2302.05733
- [16] GitHub (2024). CodeQL: The code analysis engine. https://codeql.github.com/
- [17] Perry, N., Srivastava, M., Kumar, D., Boneh, D. (2023). Do users write more insecure code with AI assistants? In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pp. 2785-2799.
- [18] Jonnala, R., Yang, J., Lee, Y., Liang, G., Cao, Z. (2025). Measuring and improving the efficiency of Python code generated by LLMs using CoT prompting and fine-tuning. IEEE Access, 13, 119657-119681. https://doi.org/10.1109/ACCESS.2025.3585742
- [19] Huang, D., et al. (2024). EffiBench: Benchmarking the efficiency of automatically generated code. arXiv:2402.02037
- [20] Hazhirpasand, M., Ghafari, M., Nierstrasz, O. (2023). Challenges of cryptography development in Python. In 2023 IEEE Security and Privacy Workshops (SPW), pp. 328-336. IEEE.
- [21] Rahaman, S., Xiao, Y., Afrose, S., et al. (2019). CryptoGuard: High precision detection of cryptographic vulnerabilities. In ACM CCS.
- [22] Krüger, S., Nadi, S., Reif, M., et al. (2017). CogniCrypt: Supporting developers in using cryptography. In IEEE/ACM ASE.
- [23] Reynolds, L., McDonell, K. (2021). Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1-7.
- [24] Umer, M.M. (2025). Comparative analysis of the code generated by popular large language models (LLMs) for MISRA C++ compliance. IEEE Access, 13. https://doi.org/10.1109/ACCESS.2025.3633086
- [25] CWE - Common Weakness Enumeration (2006). A community-developed list of SW and HW weaknesses that can become vulnerabilities. https://cwe.mitre.org/
- [26] Experimental framework: https://github.com/MohamedSobhy11/An-Empirical-Security-Evaluation-of-LLM-Generated-Cryptographic-Rust-Code.git