
arxiv: 2304.09655 · v2 · submitted 2023-04-19 · 💻 cs.CR

How Secure is Code Generated by ChatGPT?

Pith reviewed 2026-05-24 10:00 UTC · model grok-4.3

classification 💻 cs.CR
keywords ChatGPT · code generation · security vulnerabilities · AI-generated code · prompt engineering · software security · large language models · cybersecurity

The pith

ChatGPT often generates code that remains vulnerable to attacks even when it shows awareness of security risks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests ChatGPT by asking it to write various programs and then checks the output for exploitable weaknesses. The model recognizes many security issues when prompted directly yet still produces code open to common attacks in repeated trials. Researchers also test whether extra instructions can force safer code and consider the ethics of relying on AI for programming tasks. A reader would care because growing use of such tools in software development could introduce hidden risks if the generated code is not independently reviewed. The results point to a gap between the model's stated knowledge and its actual output behavior.

Core claim

ChatGPT is aware of potential vulnerabilities but nonetheless often generates source code that is not robust to certain attacks. The experiment involved prompting the model to create multiple programs, evaluating their security properties through analysis for weaknesses, and testing whether targeted follow-up prompts could improve robustness. The work concludes that while the model can discuss risks, the code it produces frequently lacks sufficient protections against exploitation.
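
To make "not robust to certain attacks" concrete, here is a minimal sketch of one well-known weakness class, SQL injection. It is an illustration only: the page does not say which vulnerabilities the paper found, and the table, function names, and payload below are hypothetical.

```python
import sqlite3


def make_db():
    """Build a throwaway in-memory database (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (name TEXT, secret TEXT)")
    conn.executemany(
        "INSERT INTO users VALUES (?, ?)",
        [("alice", "a-secret"), ("bob", "b-secret")],
    )
    return conn


def lookup_vulnerable(conn, name):
    # User input is spliced into the SQL text, so a crafted name can
    # change the structure of the query itself.
    query = f"SELECT secret FROM users WHERE name = '{name}'"
    return [row[0] for row in conn.execute(query)]


def lookup_parameterized(conn, name):
    # The placeholder keeps the input as data; it is never parsed as SQL.
    query = "SELECT secret FROM users WHERE name = ?"
    return [row[0] for row in conn.execute(query, (name,))]


if __name__ == "__main__":
    conn = make_db()
    payload = "x' OR '1'='1"
    print("vulnerable:   ", lookup_vulnerable(conn, payload))
    print("parameterized:", lookup_parameterized(conn, payload))
```

Both functions behave the same on ordinary input such as "alice". They differ only on adversarial input, which is why such a flaw can survive a casual functional test.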

What carries the argument

Prompting ChatGPT to generate programs followed by security evaluation of the resulting source code for robustness against attacks.

If this is right

  • Code produced by the model requires human review or additional security tools before use in real systems.
  • Targeted prompts can raise awareness of issues in the generated code but do not guarantee elimination of vulnerabilities.
  • Widespread adoption of AI code generation without safeguards could expand the number of programs susceptible to known attack patterns.
  • Ethical discussions around AI code tools must address responsibility for security flaws introduced in the output.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This pattern may extend to other large language models, implying that alignment training for security properties needs explicit focus beyond general capability.
  • A practical extension would be to test whether combining generated code with automated repair tools reduces the observed vulnerabilities.
  • The results suggest organizations adopting these tools should budget for extra verification steps rather than treating the output as production-ready.

Load-bearing premise

The security evaluation of the generated programs accurately identifies real vulnerabilities without missing any due to incomplete checks.

What would settle it

A follow-up test that applies dynamic attack simulations or independent audits to a large sample of the generated programs and finds zero successful exploits would challenge the claim.
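
A dynamic attack simulation of the kind described above could be sketched as follows. This is a toy harness under stated assumptions, not the paper's procedure: the base directory, the two resolver functions, and the payload list are hypothetical, and path traversal is used only as a familiar example. It uses `posixpath` so the behavior is the same on every platform.

```python
import posixpath

BASE_DIR = "/srv/uploads"  # hypothetical directory the program may serve from


def resolve_naive(user_path):
    # Joins and normalizes, but never checks where the result points.
    return posixpath.normpath(posixpath.join(BASE_DIR, user_path))


def resolve_checked(user_path):
    # Same resolution, followed by a containment check against BASE_DIR.
    candidate = posixpath.normpath(posixpath.join(BASE_DIR, user_path))
    if posixpath.commonpath([BASE_DIR, candidate]) != BASE_DIR:
        raise ValueError("path escapes base directory")
    return candidate


# Hypothetical attack inputs; a real audit would use a far larger corpus.
PAYLOADS = ["../../etc/passwd", "/etc/passwd", "docs/../../secret.txt"]


def count_escapes(resolver):
    """Run every payload and count those that resolve outside BASE_DIR."""
    escapes = 0
    for payload in PAYLOADS:
        try:
            resolved = resolver(payload)
        except ValueError:
            continue  # the resolver rejected the input
        if posixpath.commonpath([BASE_DIR, resolved]) != BASE_DIR:
            escapes += 1
    return escapes


if __name__ == "__main__":
    print("naive escapes:  ", count_escapes(resolve_naive))
    print("checked escapes:", count_escapes(resolve_checked))
```

A count of zero here shows only that these payloads failed. The sketch does not resolve symbolic links, and it says nothing about inputs outside the list, which is the same false-negative concern raised under the load-bearing premise.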

Figures

Figures reproduced from arXiv: 2304.09655 by Anderson R. Avila, Baba Mamadou Camara, Jacob Brunelle, Raphaël Khoury.

Figure 1: Code generation by ChatGPT followed by vulnerability check.

Original abstract

In recent years, large language models have been responsible for great advances in the field of artificial intelligence (AI). ChatGPT in particular, an AI chatbot developed and recently released by OpenAI, has taken the field to the next level. The conversational model is able not only to process human-like text, but also to translate natural language into code. However, the safety of programs generated by ChatGPT should not be overlooked. In this paper, we perform an experiment to address this issue. Specifically, we ask ChatGPT to generate a number of program and evaluate the security of the resulting source code. We further investigate whether ChatGPT can be prodded to improve the security by appropriate prompts, and discuss the ethical aspects of using AI to generate code. Results suggest that ChatGPT is aware of potential vulnerabilities, but nonetheless often generates source code that are not robust to certain attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to have performed an experiment in which ChatGPT was prompted to generate programs; the resulting source code was evaluated for security. It further investigates whether targeted prompts can improve security and discusses ethical aspects of AI-generated code. The central result is that ChatGPT appears aware of vulnerabilities yet often produces code that is not robust to certain attacks.

Significance. If the experimental methodology and results were fully documented, the work would supply timely empirical evidence on security risks of LLM-based code generation, a topic of clear interest to the software-security community. The inclusion of an ethics discussion is a constructive element. The current absence of any quantitative data, prompt lists, or evaluation protocol prevents the claim from being assessed.

major comments (2)
  1. [Abstract] Abstract: the claim that ChatGPT 'often generates source code that are not robust to certain attacks' is presented without any sample size, list of prompts, vulnerability taxonomy, evaluation method, or quantitative results, leaving the central empirical claim without visible supporting data.
  2. [Security evaluation] Security evaluation (throughout): the manuscript supplies no description of the concrete analysis pipeline (tools, CWE categories, manual review protocol, or dynamic testing). Static analyzers routinely miss logic errors and context-dependent injection paths; without this information the false-negative rate is unknown and the classification of generated programs as non-robust cannot be verified.
minor comments (2)
  1. [Abstract] Abstract: grammatical error ('source code that are not robust' should be 'source code that is not robust').
  2. [Abstract] Abstract: 'a number of program' should read 'a number of programs'.
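
Major comment 2 notes that static analyzers can miss context-dependent injection paths. The toy checker below illustrates that failure mode. It is a deliberately naive, line-based rule written for this example and does not represent any real analysis tool or the paper's method. The names `INLINE_SQL`, `flag_lines`, `DIRECT`, and `INDIRECT` are hypothetical.

```python
import re

# Toy rule: flag execute(...) calls whose argument is built inline,
# either as an f-string or by + / % formatting of a string literal.
INLINE_SQL = re.compile(r"""execute\(\s*(f["']|["'].*["']\s*[+%])""")


def flag_lines(source):
    """Return 1-based line numbers that match the toy rule."""
    return [
        number
        for number, line in enumerate(source.splitlines(), 1)
        if INLINE_SQL.search(line)
    ]


# The query is built inside the execute() call: the rule sees it.
DIRECT = '''
def find(conn, name):
    return conn.execute(f"SELECT * FROM users WHERE name = '{name}'")
'''

# The same injectable query, but built in a helper: the rule sees nothing.
INDIRECT = '''
def build(name):
    return "SELECT * FROM users WHERE name = '" + name + "'"

def find(conn, name):
    query = build(name)
    return conn.execute(query)
'''

if __name__ == "__main__":
    print("direct:  ", flag_lines(DIRECT))
    print("indirect:", flag_lines(INDIRECT))
```

Both snippets are equally injectable, yet only the first is flagged. Production analyzers track data flow across functions and do better than this rule, but the example shows why a reported absence of findings cannot be interpreted without knowing which checks were run.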

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which identifies key gaps in methodological transparency. We will revise the manuscript to address these issues by expanding the description of the experiment and evaluation process.

  1. Referee: [Abstract] Abstract: the claim that ChatGPT 'often generates source code that are not robust to certain attacks' is presented without any sample size, list of prompts, vulnerability taxonomy, evaluation method, or quantitative results, leaving the central empirical claim without visible supporting data.

    Authors: We agree that the abstract lacks supporting details on the experimental scale and results. In the revision we will update the abstract to include the number of programs generated, a summary of the prompt set, the vulnerability taxonomy applied, the evaluation approach, and the main quantitative outcomes. revision: yes

  2. Referee: [Security evaluation] Security evaluation (throughout): the manuscript supplies no description of the concrete analysis pipeline (tools, CWE categories, manual review protocol, or dynamic testing). Static analyzers routinely miss logic errors and context-dependent injection paths; without this information the false-negative rate is unknown and the classification of generated programs as non-robust cannot be verified.

    Authors: We concur that the current manuscript does not describe the security analysis pipeline. The revised version will add a methods subsection detailing the static analysis tools, CWE categories, manual review protocol, and any dynamic testing performed, allowing readers to evaluate potential false-negative rates. revision: yes

Circularity Check

0 steps flagged

Purely empirical study with no derivation chain or self-referential elements

The paper describes an experiment in which ChatGPT is prompted to generate source code, followed by security evaluation of the outputs. No equations, fitted parameters, predictions derived from internal quantities, or load-bearing self-citations appear in the provided text or abstract. The central claim rests on external code analysis rather than any quantity defined inside the paper itself, satisfying the default expectation of no significant circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the unstated premise that the chosen prompts and security checks are representative of real-world use and that the observed vulnerabilities are not artifacts of the evaluation method.

axioms (2)
  • domain assumption: The prompts used in the experiment represent typical developer requests for code generation.
    The experiment's validity depends on the prompts being realistic; this is invoked when the authors describe asking ChatGPT to generate programs.
  • domain assumption: The security analysis method used correctly classifies generated code as vulnerable or secure.
    The claim that code is 'not robust to certain attacks' requires this background assumption about the evaluation procedure.

pith-pipeline@v0.9.0 · 5684 in / 1198 out tokens · 20901 ms · 2026-05-24T10:00:28.862573+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. "Tab, Tab, Bug": Security Pitfalls of Next Edit Suggestions in AI-Integrated IDEs

    cs.CR · 2026-02 · conditional · novelty 7.0

    NES systems in AI IDEs expand attack surfaces via context poisoning from imperceptible actions and global codebase retrieval, with professional developers largely unaware of the risks.
