An Empirical Security Evaluation of LLM-Generated Cryptographic Rust Code
Pith reviewed 2026-05-07 13:20 UTC · model grok-4.3
The pith
LLMs generate Rust cryptographic code that compiles successfully only 23.3 percent of the time, and 57 percent of the samples that do compile contain vulnerabilities.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Among the 240 generated Rust samples, only 23.3 percent compiled. CodeQL produced just two false positives on the compiled set, while the authors' rule-based crypto-specific analyzer found vulnerabilities in 57 percent of those samples with zero false positives. Compilation success differed sharply between the two algorithms (34.2 percent for AES-256-GCM versus 12.5 percent for ChaCha20-Poly1305) and was significantly affected by prompt strategy (P = 0.002), with chain-of-thought prompting performing five times worse than zero-shot. All three models exhibited systematic failures, including nonce reuse and API hallucinations.
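The nonce-reuse failure mode is concrete: AES-256-GCM and ChaCha20-Poly1305 both derive a keystream from (key, nonce) and XOR it with the plaintext, so two messages encrypted under the same key and nonce are XORed with the same bytes. The toy Rust sketch below illustrates the leak; its "keystream" is a stand-in xorshift expansion for demonstration, not a real cipher or any code from the paper.

```rust
// Toy illustration of why nonce reuse is catastrophic in CTR-style AEAD
// modes: the same (key, nonce) always yields the same keystream.

fn toy_keystream(key: u64, nonce: u64, len: usize) -> Vec<u8> {
    // Deterministic expansion of (key, nonce) into `len` bytes
    // via xorshift64 steps. NOT cryptographically secure.
    let mut state = key ^ nonce.rotate_left(17);
    (0..len)
        .map(|_| {
            state ^= state << 13;
            state ^= state >> 7;
            state ^= state << 17;
            (state & 0xff) as u8
        })
        .collect()
}

fn toy_encrypt(key: u64, nonce: u64, plaintext: &[u8]) -> Vec<u8> {
    toy_keystream(key, nonce, plaintext.len())
        .iter()
        .zip(plaintext)
        .map(|(k, p)| k ^ p)
        .collect()
}

fn main() {
    let key = 0x5eed_5eed_5eed_5eed_u64;
    let nonce = 42; // reused for two messages: the bug the paper flags
    let p1 = b"transfer $100 to alice";
    let p2 = b"transfer $900 to mallo";

    let c1 = toy_encrypt(key, nonce, p1);
    let c2 = toy_encrypt(key, nonce, p2);

    // With a reused nonce the keystreams cancel: c1 XOR c2 == p1 XOR p2,
    // so an attacker who knows either plaintext recovers the other.
    let cx: Vec<u8> = c1.iter().zip(&c2).map(|(a, b)| a ^ b).collect();
    let px: Vec<u8> = p1.iter().zip(p2.iter()).map(|(a, b)| a ^ b).collect();
    assert_eq!(cx, px);
    println!("nonce reuse leaks p1 XOR p2");
}
```

The same algebra holds for the real modes, which is why GCM's and ChaCha20-Poly1305's specifications require a unique nonce per (key, message) pair.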
What carries the argument
The rule-based crypto-specific analyzer that associates detected issues with Common Weakness Enumerations and is applied to compiled LLM-generated code samples.
Load-bearing premise
The custom rule-based crypto-specific analyzer accurately detects all relevant vulnerabilities with zero false positives and no false negatives for the two algorithms tested.
What would settle it
A compiled sample that the custom analyzer flags as vulnerable but independent expert review or formal verification confirms is secure, or a sample the analyzer clears that later proves to contain a vulnerability.
Figures
Original abstract
Developers and organizations are using Large Language Models (LLMs) to generate security-critical code more frequently than ever, including cryptographic solutions for their products. This study presents an empirical evaluation of cryptographic security in 240 Rust code samples for two crypto algorithms (AES-256-GCM and ChaCha20-Poly1305) generated by three LLMs (Gemini 2.5 Pro, GPT-4o, and DeepSeek Coder) using four different prompt strategies. For each successfully compiled code sample, CodeQL static analysis and our rule-based crypto-specific analyzer were used to detect vulnerabilities, which are also associated with Common Weakness Enumeration (CWE). The evaluation results revealed that only 23.3% of the generated code samples were successfully compiled. Among the compiled code, CodeQL produced only two false positives, while our rule-based crypto-specific analyzer identified vulnerabilities in 57% of the compiled samples with zero false positives. This demonstrates that general-purpose analysis tools are insufficient for code validation for the experimented crypto algorithms. The compilation success of the two algorithms varied significantly (AES-256-GCM 34.2% versus ChaCha20-Poly1305 12.5%), showing a gap in code generation capabilities. While model choice had no significant effect on compilation success, prompt strategy significantly influenced outcomes (P = 0.002), with chain-of-thought prompting performing 5 times worse than zero-shot. All three models exhibit systematic failures, including nonce reuse and API hallucinations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts an empirical evaluation of 240 Rust code samples for AES-256-GCM and ChaCha20-Poly1305 generated by Gemini 2.5 Pro, GPT-4o, and DeepSeek Coder using four prompt strategies. It reports a 23.3% compilation success rate, with the custom rule-based analyzer detecting vulnerabilities in 57% of compiled samples (zero false positives) versus CodeQL's two false positives, significant differences in compilation by algorithm and prompt (P=0.002, chain-of-thought worst), and systematic failures including nonce reuse and API hallucinations. The central conclusion is that general-purpose static analysis tools are insufficient for validating LLM-generated cryptographic code.
Significance. If the custom analyzer's zero-FP claim holds under independent validation, the work would provide concrete empirical evidence of LLMs' limitations in producing secure crypto implementations and the shortcomings of off-the-shelf tools like CodeQL for domain-specific checks. The compilation-rate gap between algorithms and the prompt-strategy effect (with statistical support) would add actionable insights for AI-assisted secure coding practices.
major comments (2)
- [Methods and Results (analyzer evaluation)] The claim that the rule-based crypto-specific analyzer achieves zero false positives while identifying vulnerabilities in 57% of the ~56 compiled samples (abstract and results) is load-bearing for the comparison to CodeQL and the conclusion that general-purpose tools are insufficient. No external ground-truth validation, blinded expert review, public rule set, or inter-rater agreement is described; the accuracy rests on author judgment alone.
- [Results (statistical analysis)] The reported P=0.002 for prompt-strategy effect on compilation success lacks accompanying details on per-group sample sizes, exact statistical test (e.g., chi-square or Fisher's), multiple-comparison correction, or power analysis. Given the low overall compilation rate (23.3%), this weakens confidence in the claim that prompt strategy is a significant factor.
minor comments (2)
- [Methods] Clarify the exact wording of the four prompt strategies and the sampling procedure for the 240 instances (e.g., temperature, number of generations per prompt/algorithm/model) to allow replication.
- [Abstract] The abstract states CodeQL produced 'only two false positives' but does not specify what the true-positive baseline was or how false positives were determined for the custom analyzer.
Simulated Author's Rebuttal
We thank the referee for their constructive comments on our paper. We address the major comments point by point below and indicate the revisions we will make to the manuscript.
Point-by-point responses
Referee: [Methods and Results (analyzer evaluation)] The claim that the rule-based crypto-specific analyzer achieves zero false positives while identifying vulnerabilities in 57% of the ~56 compiled samples (abstract and results) is load-bearing for the comparison to CodeQL and the conclusion that general-purpose tools are insufficient. No external ground-truth validation, blinded expert review, public rule set, or inter-rater agreement is described; the accuracy rests on author judgment alone.
Authors: We agree that the zero false positive rate for the custom analyzer is a key claim and currently relies on our internal verification process. The analyzer consists of deterministic rules derived from standard cryptographic best practices and known vulnerabilities (e.g., nonce reuse in AES-GCM, incorrect key sizes). Each flagged sample was manually inspected by the authors to confirm the presence of the vulnerability and that no false positives occurred in the compiled set. To strengthen this, we will revise the Methods section to provide the full rule set, release the analyzer as open-source code, and include examples of detected issues. While we did not conduct a blinded external review for this study, we believe the transparency will allow independent validation. We will update the abstract and results to reflect this additional detail. revision: yes
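The rebuttal describes the analyzer as deterministic rules derived from cryptographic best practices. As a rough illustration only, one such rule might look like the std-only Rust sketch below, which flags source lines that construct an AEAD nonce from a constant literal; the patterns, struct names, and CWE mapping are illustrative assumptions, not the paper's actual rule set.

```rust
// Sketch of one deterministic rule: flag source text that appears to
// build an AEAD nonce from a hardcoded constant, which enables reuse
// across messages. Patterns and CWE mapping are illustrative only.

struct Finding {
    line: usize,
    cwe: &'static str,
    message: &'static str,
}

fn check_hardcoded_nonce(source: &str) -> Vec<Finding> {
    let mut findings = Vec::new();
    for (i, line) in source.lines().enumerate() {
        let is_nonce_line = line.contains("Nonce::from_slice")
            || line.to_lowercase().contains("nonce");
        // A byte-string or zeroed-array literal on a nonce line suggests
        // a constant nonce rather than a randomly generated one.
        let has_literal = line.contains("b\"")
            || line.contains("[0u8;")
            || line.contains("[0; 12]");
        if is_nonce_line && has_literal {
            findings.push(Finding {
                line: i + 1,
                cwe: "CWE-323",
                message: "possible hardcoded/constant AEAD nonce",
            });
        }
    }
    findings
}

fn main() {
    let sample = r#"
let key = Key::from_slice(&key_bytes);
let nonce = Nonce::from_slice(b"unique nonce"); // constant nonce!
let ct = cipher.encrypt(nonce, plaintext).unwrap();
"#;
    for f in check_hardcoded_nonce(sample) {
        println!("line {}: {} ({})", f.line, f.message, f.cwe);
    }
}
```

A text-matching rule like this is trivially auditable (each pattern can be published and re-run), which is the kind of transparency the rebuttal promises; a production analyzer would more plausibly work on the parsed AST rather than raw lines.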
Referee: [Results (statistical analysis)] The reported P=0.002 for prompt-strategy effect on compilation success lacks accompanying details on per-group sample sizes, exact statistical test (e.g., chi-square or Fisher's), multiple-comparison correction, or power analysis. Given the low overall compilation rate (23.3%), this weakens confidence in the claim that prompt strategy is a significant factor.
Authors: The P-value of 0.002 was calculated using a chi-square test of independence on the 4x2 contingency table (four prompt strategies by compiled/not compiled), with 60 samples per strategy. No multiple-comparison correction was needed as this was the sole test for prompt effects. We will add these details to the Results section, including the contingency table and degrees of freedom. We will also include a post-hoc power analysis to address concerns about the low base rate. We maintain that prompt strategy is a significant factor but will provide the requested statistical transparency. revision: yes
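The test the authors describe is mechanical to reproduce. The sketch below computes the chi-square statistic for a 4x2 contingency table in plain Rust; the per-strategy compilation counts are hypothetical, chosen only to sum to the reported 56/240 compiled samples and to make chain-of-thought five times worse than zero-shot, and are not the paper's actual data.

```rust
// Chi-square test of independence on a (strategies x outcomes) table.
// Columns are [compiled, failed]; rows are prompt strategies (60 each).

fn chi_square_statistic(table: &[[f64; 2]]) -> f64 {
    let row_totals: Vec<f64> = table.iter().map(|r| r[0] + r[1]).collect();
    let col_totals = [
        table.iter().map(|r| r[0]).sum::<f64>(),
        table.iter().map(|r| r[1]).sum::<f64>(),
    ];
    let grand: f64 = row_totals.iter().sum();
    let mut chi2 = 0.0;
    for (i, row) in table.iter().enumerate() {
        for (j, &observed) in row.iter().enumerate() {
            // Expected count under independence of strategy and outcome.
            let expected = row_totals[i] * col_totals[j] / grand;
            chi2 += (observed - expected).powi(2) / expected;
        }
    }
    chi2
}

fn main() {
    // Rows: zero-shot, few-shot, role-based, chain-of-thought.
    // Counts are illustrative, not the paper's data.
    let table = [
        [25.0, 35.0],
        [15.0, 45.0],
        [11.0, 49.0],
        [5.0, 55.0],
    ];
    let chi2 = chi_square_statistic(&table);
    let df = (table.len() - 1) * (2 - 1); // (rows-1)(cols-1) = 3
    println!("chi2 = {chi2:.2}, df = {df}");
    // For df = 3, the critical value at alpha = 0.05 is 7.815; a larger
    // statistic rejects independence of prompt strategy and compilation.
    assert!(chi2 > 7.815);
}
```

Publishing the contingency table alongside the statistic, as the authors propose, lets readers recompute the result exactly this way.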
Circularity Check
No circularity: purely empirical measurement study
Full rationale
The paper reports direct measurements of LLM code generation success rates, compilation outcomes, and vulnerability detections via CodeQL plus a custom rule-based analyzer on AES-256-GCM and ChaCha20-Poly1305 samples. No equations, fitted parameters, predictions, self-definitional constructs, or load-bearing self-citations appear in the provided text or abstract. The 57% vulnerability rate and zero-FP claim for the custom analyzer are presented as empirical observations from applying the rules to the generated corpus, not as outputs derived from the inputs by construction. The study is self-contained against its own experimental pipeline and external benchmarks (CodeQL), with no reduction of claims to tautologies or author-specific priors.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The rule-based crypto-specific analyzer identifies vulnerabilities with zero false positives and no missed issues for AES-256-GCM and ChaCha20-Poly1305.
Reference graph
Works this paper leans on
- [2] Chen, M., Tworek, J., Jun, H., et al. (2021). Evaluating large language models trained on code. arXiv:2107.03374
- [3] Austin, J., Odena, A., Nye, M., et al. (2021). Program synthesis with large language models. arXiv:2108.07732
- [4]
- [5] Lee, Y., Diaz, E., Yang, J., Liu, B. (2025). Enhancing concurrency bug detection in Rust programs through LLVM IR based graph visualization. High-Confidence Computing, 100377. https://doi.org/10.1016/j.hcc.2025.100377
- [6] Lee, Y., Boshra, S.J., Yang, J., Cao, Z., Liang, G. (2025). Machine learning-based vulnerability detection in Rust code using LLVM IR and transformer model. Mach. Learn. Knowl. Extr., 7, 79. https://doi.org/10.3390/make7030079
- [7] Pearce, H., Ahmad, B., Tan, B., Dolan-Gavitt, B., Karri, R. (2022). Asleep at the keyboard? Assessing the security of GitHub Copilot's code contributions. In 2022 IEEE Symposium on Security and Privacy (SP), pp. 754-768. IEEE.
- [8] Rogaway, P. (2002). Authenticated-encryption with associated-data. In Proceedings of the 9th ACM Conference on Computer and Communications Security (CCS).
- [9] Joux, A. (2006). Authentication failures in NIST version of GCM. https://csrc.nist.gov/csrc/media/projects/block-cipher-techniques/documents/bcm/joux_comments.pdf
- [10] Dworkin, M. (2007). Recommendation for Block Cipher Modes of Operation: Galois/Counter Mode (GCM) and GMAC. NIST Special Publication 800-38D, National Institute of Standards and Technology.
- [11] Nir, Y., Langley, A. (2018). ChaCha20 and Poly1305 for IETF Protocols. RFC 8439, Internet Engineering Task Force. https://datatracker.ietf.org/doc/rfc8439/
- [12] Lazar, D., Chen, H., Wang, X., Zeldovich, N. (2014). Why does cryptographic software fail? A case study and open problems. In Proceedings of the 5th Asia-Pacific Workshop on Systems, pp. 1-7.
- [13] Egele, M., Brumley, D., Fratantonio, Y., Kruegel, C. (2013). An empirical study of cryptographic misuse in Android applications. In Proceedings of the 2013 ACM SIGSAC Conference on Computer & Communications Security, pp. 73-84.
- [14] Wei, J., Wang, X., Schuurmans, D., et al. (2022). Chain-of-thought prompting elicits reasoning in large language models. Advances in Neural Information Processing Systems, 35, 24824-24837.
- [15] Kang, D., Li, X., Stoica, I., Guestrin, C., Zaharia, M., Hashimoto, T. (2023). Exploiting programmatic behavior of LLMs: Dual-use through standard security attacks. arXiv:2302.05733
- [16] GitHub (2024). CodeQL: The code analysis engine. https://codeql.github.com/
- [17] Perry, N., Srivastava, M., Kumar, D., Boneh, D. (2023). Do users write more insecure code with AI assistants? In Proceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security, pp. 2785-2799.
- [18] Jonnala, R., Yang, J., Lee, Y., Liang, G., Cao, Z. (2025). Measuring and improving the efficiency of Python code generated by LLMs using CoT prompting and fine-tuning. IEEE Access, 13, 119657-119681. https://doi.org/10.1109/ACCESS.2025.3585742
- [19] Huang, D., et al. (2024). EffiBench: Benchmarking the efficiency of automatically generated code. arXiv:2402.02037
- [20] Hazhirpasand, M., Ghafari, M., Nierstrasz, O. (2023). Challenges of cryptography development in Python. In 2023 IEEE Security and Privacy Workshops (SPW), pp. 328-336. IEEE.
- [21] Rahaman, S., Xiao, Y., Afrose, S., et al. (2019). CryptoGuard: High precision detection of cryptographic vulnerabilities. In ACM CCS.
- [22] Krüger, S., Nadi, S., Reif, M., et al. (2017). CogniCrypt: Supporting developers in using cryptography. In IEEE/ACM ASE.
- [23] Reynolds, L., McDonell, K. (2021). Prompt programming for large language models: Beyond the few-shot paradigm. In Extended Abstracts of the 2021 CHI Conference on Human Factors in Computing Systems, pp. 1-7.
- [24] Umer, M.M. (2025). Comparative analysis of the code generated by popular large language models (LLMs) for MISRA C++ compliance. IEEE Access, 13. https://doi.org/10.1109/ACCESS.2025.3633086
- [25] CWE - Common Weakness Enumeration (2006). A community-developed list of SW and HW weaknesses that can become vulnerabilities. https://cwe.mitre.org/
- [26] Experimental framework: https://github.com/MohamedSobhy11/An-Empirical-Security-Evaluation-of-LLM-Generated-Cryptographic-Rust-Code.git