How Secure is Code Generated by ChatGPT?

Anderson R. Avila; Baba Mamadou Camara; Jacob Brunelle; Raphaël Khoury

arxiv: 2304.09655 · v2 · submitted 2023-04-19 · 💻 cs.CR

Pith reviewed 2026-05-24 10:00 UTC · model grok-4.3

classification 💻 cs.CR

keywords ChatGPT · code generation · security vulnerabilities · AI-generated code · prompt engineering · software security · large language models · cybersecurity

The pith

ChatGPT often generates code that remains vulnerable to attacks even when it shows awareness of security risks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests ChatGPT by asking it to write various programs and then checks the output for exploitable weaknesses. The model recognizes many security issues when prompted directly yet still produces code open to common attacks in repeated trials. Researchers also test whether extra instructions can force safer code and consider the ethics of relying on AI for programming tasks. A reader would care because growing use of such tools in software development could introduce hidden risks if the generated code is not independently reviewed. The results point to a gap between the model's stated knowledge and its actual output behavior.

Core claim

ChatGPT is aware of potential vulnerabilities but nonetheless often generates source code that is not robust to certain attacks. The experiment involved prompting the model to create multiple programs, evaluating their security properties through analysis for weaknesses, and testing whether targeted follow-up prompts could improve robustness. The work concludes that while the model can discuss risks, the code it produces frequently lacks sufficient protections against exploitation.

What carries the argument

Prompting ChatGPT to generate programs followed by security evaluation of the resulting source code for robustness against attacks.
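To make "robustness against attacks" concrete, here is a minimal sketch of the kind of gap such an evaluation looks for. It is an illustration only and is not drawn from the paper's test set: the table, the function names `lookup_unsafe` and `lookup_safe`, and the payload are all hypothetical, and SQL injection is chosen simply as a familiar attack class.

```python
import sqlite3


def make_db():
    # Hypothetical in-memory database used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a1"), ("bob", "b2")],
    )
    return conn


def lookup_unsafe(conn, name):
    # Vulnerable pattern: user input is spliced into the SQL text,
    # so a crafted value can change the structure of the query.
    query = "SELECT secret FROM users WHERE name = '" + name + "'"
    return [row[0] for row in conn.execute(query)]


def lookup_safe(conn, name):
    # Hardened pattern: the value is passed as a bound parameter
    # and is never interpreted as SQL.
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


conn = make_db()
payload = "x' OR '1'='1"
leaked = sorted(lookup_unsafe(conn, payload))  # every secret in the table
blocked = lookup_safe(conn, payload)           # no rows match the literal name
```

Both functions behave identically on benign input such as "alice", which is why a program can look correct in casual testing while remaining exploitable.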

If this is right

Code produced by the model requires human review or additional security tools before use in real systems.
Targeted prompts can raise awareness of issues in the generated code but do not guarantee elimination of vulnerabilities.
Widespread adoption of AI code generation without safeguards could expand the number of programs susceptible to known attack patterns.
Ethical discussions around AI code tools must address responsibility for security flaws introduced in the output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This pattern may extend to other large language models, implying that alignment training for security properties needs explicit focus beyond general capability.
A practical extension would be to test whether combining generated code with automated repair tools reduces the observed vulnerabilities.
The results suggest organizations adopting these tools should budget for extra verification steps rather than treating the output as production-ready.

Load-bearing premise

The security evaluation of the generated programs accurately identifies real vulnerabilities without missing any due to incomplete checks.

What would settle it

A follow-up test that applies dynamic attack simulations or independent audits to a large sample of the generated programs and finds zero successful exploits would challenge the claim.
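How large would "a large sample" need to be? As a rough sketch, assume the audited programs are independent draws from the same prompt distribution (an assumption of this illustration, not something the paper states). If n programs yield zero successful exploits, then the largest exploitable fraction p still consistent with that outcome at a given confidence level satisfies (1 - p)^n = 1 - confidence. The function name below is hypothetical.

```python
def upper_bound_zero_failures(n, confidence=0.95):
    """One-sided upper bound on the exploitable fraction when
    n independent samples show zero successful exploits."""
    if n <= 0:
        raise ValueError("n must be a positive sample size")
    alpha = 1.0 - confidence
    # Solve (1 - p) ** n = alpha for p.
    return 1.0 - alpha ** (1.0 / n)


# For moderately large n the bound is close to the "rule of three", 3 / n.
bound_100 = upper_bound_zero_failures(100)
bound_1000 = upper_bound_zero_failures(1000)
```

Under these assumptions, a clean audit of 100 programs still leaves room for an exploitable fraction of roughly 3%, so a small clean sample would not by itself overturn a claim that vulnerabilities are common.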

Figures

Figures reproduced from arXiv: 2304.09655 by Anderson R. Avila, Baba Mamadou Camara, Jacob Brunelle, Raphaël Khoury.

Original abstract

In recent years, large language models have been responsible for great advances in the field of artificial intelligence (AI). ChatGPT in particular, an AI chatbot developed and recently released by OpenAI, has taken the field to the next level. The conversational model is able not only to process human-like text, but also to translate natural language into code. However, the safety of programs generated by ChatGPT should not be overlooked. In this paper, we perform an experiment to address this issue. Specifically, we ask ChatGPT to generate a number of program and evaluate the security of the resulting source code. We further investigate whether ChatGPT can be prodded to improve the security by appropriate prompts, and discuss the ethical aspects of using AI to generate code. Results suggest that ChatGPT is aware of potential vulnerabilities, but nonetheless often generates source code that are not robust to certain attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Early case study flags ChatGPT code security risks but gives no methods, sample size, or evaluation details, so the results can't be checked.

The main contribution is a quick, early experiment that prompted ChatGPT for code, checked the output for security problems, and tested whether follow-up prompts could make the code safer. The authors conclude that the model knows about vulnerabilities but still produces code that is often not robust. That matches what people were starting to notice right after the model launched, and the paper is new in targeting ChatGPT specifically and in trying the improvement prompts. It also raises the ethical point about shipping AI-written code without review. Those are the useful parts: it surfaces a practical issue and shows one way to probe it.

The execution is the problem. The abstract and the rest of the writeup supply no sample size, no list of prompts, no description of the code types or languages, and no account of how security was actually evaluated. The stress-test note is correct on this: the security checks are unspecified, so we have no way to judge false-negative rates or whether the analysis caught real issues or missed context-dependent ones. Without that, the claim that the code is “often” insecure rests on an unexamined pipeline.

This paper is for people who want an early signal on LLM-assisted coding risks and are willing to treat it as a starting point rather than evidence. A reader who needs reproducible data or a solid method will not get value from it. It deserves peer review because the topic is timely and the basic idea is worth pursuing, but only if the authors add the missing experimental details before any publication decision.

Referee Report

2 major / 2 minor

Summary. The paper claims to have performed an experiment in which ChatGPT was prompted to generate programs; the resulting source code was evaluated for security. It further investigates whether targeted prompts can improve security and discusses ethical aspects of AI-generated code. The central result is that ChatGPT appears aware of vulnerabilities yet often produces code that is not robust to certain attacks.

Significance. If the experimental methodology and results were fully documented, the work would supply timely empirical evidence on security risks of LLM-based code generation, a topic of clear interest to the software-security community. The inclusion of an ethics discussion is a constructive element. The current absence of any quantitative data, prompt lists, or evaluation protocol prevents the claim from being assessed.

major comments (2)

[Abstract] Abstract: the claim that ChatGPT 'often generates source code that are not robust to certain attacks' is presented without any sample size, list of prompts, vulnerability taxonomy, evaluation method, or quantitative results, leaving the central empirical claim without visible supporting data.
[Security evaluation] Security evaluation (throughout): the manuscript supplies no description of the concrete analysis pipeline (tools, CWE categories, manual review protocol, or dynamic testing). Static analyzers routinely miss logic errors and context-dependent injection paths; without this information the false-negative rate is unknown and the classification of generated programs as non-robust cannot be verified.
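The false-negative concern can be sketched with a deliberately naive check. The function `flags_inline_sql` below is hypothetical and is not the paper's tooling or any real analyzer; production static analyzers track data flow and are far more capable. It only illustrates how a purely syntactic rule can flag one injectable program and pass an equivalent one.

```python
import ast


def flags_inline_sql(source):
    """Toy check: report execute() calls whose first argument is
    built inline by concatenation or an f-string."""
    for node in ast.walk(ast.parse(source)):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr == "execute"
            and node.args
            and isinstance(node.args[0], (ast.BinOp, ast.JoinedStr))
        ):
            return True
    return False


# Both snippets splice user_id into the SQL text and are equally injectable.
DIRECT = 'cur.execute("SELECT * FROM t WHERE id = " + user_id)\n'
INDIRECT = (
    'q = "SELECT * FROM t WHERE id = " + user_id\n'
    'cur.execute(q)\n'
)

direct_flagged = flags_inline_sql(DIRECT)      # caught by the syntactic rule
indirect_flagged = flags_inline_sql(INDIRECT)  # missed: the query is built one line earlier
```

Without knowing which checks were applied and how they were validated, a reader cannot tell how many cases like the second snippet an evaluation would have passed as secure.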

minor comments (2)

[Abstract] Abstract: grammatical error ('source code that are not robust' should be 'source code that is not robust').
[Abstract] Abstract: 'a number of program' should read 'a number of programs'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key gaps in methodological transparency. We will revise the manuscript to address these issues by expanding the description of the experiment and evaluation process.

Referee: [Abstract] Abstract: the claim that ChatGPT 'often generates source code that are not robust to certain attacks' is presented without any sample size, list of prompts, vulnerability taxonomy, evaluation method, or quantitative results, leaving the central empirical claim without visible supporting data.

Authors: We agree that the abstract lacks supporting details on the experimental scale and results. In the revision we will update the abstract to include the number of programs generated, a summary of the prompt set, the vulnerability taxonomy applied, the evaluation approach, and the main quantitative outcomes. revision: yes

Referee: [Security evaluation] Security evaluation (throughout): the manuscript supplies no description of the concrete analysis pipeline (tools, CWE categories, manual review protocol, or dynamic testing). Static analyzers routinely miss logic errors and context-dependent injection paths; without this information the false-negative rate is unknown and the classification of generated programs as non-robust cannot be verified.

Authors: We concur that the current manuscript does not describe the security analysis pipeline. The revised version will add a methods subsection detailing the static analysis tools, CWE categories, manual review protocol, and any dynamic testing performed, allowing readers to evaluate potential false-negative rates. revision: yes

Circularity Check

0 steps flagged

Purely empirical study with no derivation chain or self-referential elements

The paper describes an experiment in which ChatGPT is prompted to generate source code, followed by security evaluation of the outputs. No equations, fitted parameters, predictions derived from internal quantities, or load-bearing self-citations appear in the provided text or abstract. The central claim rests on external code analysis rather than any quantity defined inside the paper itself, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unstated premise that the chosen prompts and security checks are representative of real-world use and that the observed vulnerabilities are not artifacts of the evaluation method.

axioms (2)

domain assumption: The prompts used in the experiment represent typical developer requests for code generation.
The experiment's validity depends on the prompts being realistic; this is invoked when the authors describe asking ChatGPT to generate programs.

domain assumption: The security analysis method used correctly classifies generated code as vulnerable or secure.
The claim that code is 'not robust to certain attacks' requires this background assumption about the evaluation procedure.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

"Tab, Tab, Bug": Security Pitfalls of Next Edit Suggestions in AI-Integrated IDEs
cs.CR 2026-02 conditional novelty 7.0

NES systems in AI IDEs expand attack surfaces via context poisoning from imperceptible actions and global codebase retrieval, with professional developers largely unaware of the risks.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · cited by 1 Pith paper · 2 internal anchors

[1] J. Hoffmann, S. Borgeaud, A. Mensch, E. Buchatskaya, T. Cai, E. Rutherford, D. d. L. Casas, L. A. Hendricks, J. Welbl, A. Clark et al., “Training compute-optimal large language models,” arXiv preprint arXiv:2203.15556, 2022

[2] L. Floridi and M. Chiriatti, “Gpt-3: Its nature, scope, limits, and consequences,” Minds and Machines, vol. 30, pp. 681–694, 2020

[3] E. A. van Dis, J. Bollen, W. Zuidema, R. van Rooij, and C. L. Bockting, “Chatgpt: five priorities for research,” Nature, vol. 614, no. 7947, pp. 224–226, 2023

[4] “OpenAI Team chatgpt: Optimizing language models for dialogue,” https://openai.com/blog/chatgpt/, accessed: 2023-03-02

[5] K. Jeblick, B. Schachtner, J. Dexl, A. Mittermeier, A. T. Stüber, J. Topalis, T. Weber, P. Wesp, B. Sabel, J. Ricke et al., “Chatgpt makes medicine easy to swallow: An exploratory case study on simplified radiology reports,” arXiv preprint arXiv:2212.14882, 2022

[6] D. Sobania, M. Briesch, C. Hanna, and J. Petke, “An analysis of the automatic bug fixing performance of chatgpt,” arXiv preprint arXiv:2301.08653, 2023

[7] E. Kasneci, K. Seßler, S. Küchemann, M. Bannert, D. Dementieva, F. Fischer, U. Gasser, G. Groh, S. Günnemann, E. Hüllermeier et al., “Chatgpt for good? on opportunities and challenges of large language models for education,” 2023

[8] S. Wang, T. Liu, and L. Tan, “Automatically learning semantic features for defect prediction,” in Proceedings of the 38th International Conference on Software Engineering, 2016, pp. 297–308

[9] E. C. Shin, M. Allamanis, M. Brockschmidt, and A. Polozov, “Program synthesis and semantic parsing with learned code idioms,” Advances in Neural Information Processing Systems, vol. 32, 2019

[10] U. Alon, S. Brody, O. Levy, and E. Yahav, “code2seq: Generating sequences from structured representations of code,” arXiv preprint arXiv:1808.01400, 2018

[11] M. Bruch, M. Monperrus, and M. Mezini, “Learning from examples to improve code completion systems,” in Proceedings of the 7th joint meeting of the European software engineering conference and the ACM SIGSOFT symposium on the foundations of software engineering, 2009, pp. 213–222

[12] J. Viega and M. Messier, Secure programming cookbook for C and C++: recipes for cryptography, authentication, input validation & more. O’Reilly Media, Inc., 2003

[13] J. C. Davis, C. A. Coghlan, F. Servant, and D. Lee, “The impact of regular expression denial of service (redos) in practice: An empirical study at the ecosystem scale,” ser. ESEC/FSE 2018. New York, NY, USA: Association for Computing Machinery, 2018, p. 246–256. [Online]. Available: https://doi.org/10.1145/3236024.3236027

[14] R. C. Seacord, “Java deserialization vulnerabilities and mitigations,” in 2017 IEEE Cybersecurity Development (SecDev), 2017, pp. 6–7

[15] M. Mkhallalati, “A qualitative study of vulnerability-fixing commits,” Ph.D. dissertation, Concordia University, 2019

[16] R. Seacord, Secure Coding in C and C++, ser. SEI series in software engineering. Addison-Wesley, 2013. [Online]. Available: https://books.google.ca/books?id=-KFCMAEACAAJ

[17] M. Nietzel, “More than half of college students believe using chatgpt to complete assignments is cheating,” Forbes, 2023

[18] P. Swire, “A model for when disclosure helps security: What is different about computer and network security?” Journal on Telecommunications and High Technology Law, vol. 3, 2004

[19] G. Ras, N. Xie, M. Van Gerven, and D. Doran, “Explainable deep learning: A field guide for the uninitiated,” Journal of Artificial Intelligence Research, vol. 73, pp. 329–397, 2022

[20] M. Nair, R. Sadhukhan, and D. Mukhopadhyay, “Generating secure hardware using chatgpt resistant to cwes,” Cryptology ePrint Archive, 2023

[21] A. Borji, “A categorical archive of chatgpt failures,” arXiv preprint arXiv:2302.03494, 2023
