pith. machine review for the scientific record.

arxiv: 2604.19965 · v1 · submitted 2026-04-21 · 💻 cs.SE

Recognition: unknown

Insights into Security-Related AI-Generated Pull Requests

Authors on Pith no claims yet

Pith reviewed 2026-05-10 01:43 UTC · model grok-4.3

classification 💻 cs.SE
keywords AI-generated pull requests · security weaknesses · pull request acceptance · software security · AI coding agents · regex inefficiencies · injection flaws · path traversal

The pith

AI-generated security pull requests introduce recurring weaknesses like regex inefficiencies and injection flaws, with many flawed ones still merged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper analyzes more than 33,000 AI-generated pull requests to isolate 675 security-related submissions. It shows that these AI contributions commonly introduce a narrow set of security weaknesses, including regex inefficiencies, injection flaws, and path traversal. Many of the flawed PRs are nevertheless merged into the codebases. Rejections tend to occur for social or process reasons, such as inactivity or missing test coverage, rather than for the security problems themselves. Commit message quality has little bearing on acceptance or review speed for these AI PRs, in contrast to patterns seen in human contributions.

Core claim

The study of 675 security-related AI-generated pull requests from over 33,000 total AI PRs identifies a small set of recurring weaknesses such as regex inefficiencies, injection flaws, and path traversal. Many flawed contributions are still merged, while rejections often arise from social or process factors such as inactivity or missing test coverage. Commit message quality shows limited effect on acceptance or latency unlike in human PRs, and the work extends existing rejection taxonomies with categories unique to AI-generated security contributions.
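
A selection step like this is often bootstrapped with a keyword screen over PR titles and bodies before manual labeling. The paper's actual pipeline is not described in this review (the referee report flags exactly this gap), so the terms, threshold, and helper below are illustrative assumptions only:

```python
import re

# Hypothetical keyword screen for flagging candidate security-related PRs.
# These terms are invented for illustration; the paper's real selection
# criteria and validation steps are not reproduced here.
SECURITY_TERMS = [
    r"\bsecurity\b", r"\bvulnerab\w*", r"\binjection\b", r"\bxss\b",
    r"\bpath traversal\b", r"\bredos\b", r"\bcve-\d{4}-\d+\b", r"\bsanitiz\w*",
]
PATTERN = re.compile("|".join(SECURITY_TERMS), re.IGNORECASE)

def is_security_related(pr_title: str, pr_body: str) -> bool:
    """Flag a PR as a security-related candidate for manual review."""
    return bool(PATTERN.search(pr_title + " " + pr_body))

is_security_related("Fix path traversal in file handler",
                    "Sanitize user-supplied paths.")   # flagged
is_security_related("Bump lodash version",
                    "Routine dependency update.")      # not flagged
```

A screen like this only produces candidates; the false-positive and false-negative rates the referee asks about would come from manually auditing a sample of both flagged and unflagged PRs.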

What carries the argument

Identification of security-related AI PRs and categorization of their recurring weaknesses together with an extended taxonomy of rejection reasons specific to AI security submissions.

If this is right

  • Many flawed security-related contributions from AI agents are merged into software projects.
  • Rejections of AI security PRs are more often tied to inactivity or missing tests than to the security weaknesses themselves.
  • Commit message quality does not strongly influence acceptance or review latency for AI-generated security PRs.
  • Rejection taxonomies for pull requests can be extended with new categories that apply specifically to AI security submissions.
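
The commit-message finding is the kind of claim typically checked by regressing acceptance on a message-quality score and reading off the coefficient; the paper's reference list includes standard logistic-regression texts. A minimal one-variable sketch, with the dataset and quality metric invented for illustration:

```python
import math
import random

random.seed(0)

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset of (message_quality in [0,1], accepted 0/1) pairs with no
# real relationship built in -- a stand-in, not the paper's data.
data = [(random.random(), random.randint(0, 1)) for _ in range(200)]

# One-variable logistic regression fitted by gradient ascent on the
# log-likelihood: gradient of each parameter is sum of (y - p) * x.
w, b = 0.0, 0.0
for _ in range(2000):
    gw = gb = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        gw += (y - p) * x
        gb += (y - p)
    w += 0.05 * gw / len(data)
    b += 0.05 * gb / len(data)

# A coefficient w near zero (relative to its standard error) would echo
# the paper's "limited effect" result for AI PRs.
```

On the paper's framing, the interesting contrast is that the same regression on human PRs in prior work yields a clearly nonzero coefficient.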

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • AI coding agents may need targeted safeguards against the narrow set of weaknesses that recur in security PRs.
  • Project maintainers could benefit from automated detectors tuned to the common AI security flaws before merging.
  • The limited role of commit messages suggests review processes for AI PRs may need different signals than those used for human PRs.
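
A pre-merge detector tuned to the recurring weakness classes could be as simple as a pair of pattern checks run over each diff. The heuristics below are assumptions for illustration only, not the detectors (e.g., Semgrep rules) the paper's references point to:

```python
import re

# Illustrative signals for two weakness classes the study reports as
# recurring in AI PRs. Both patterns are naive and invented here.

# Path-traversal signal: a path built by concatenating raw input into open().
PATH_TRAVERSAL = re.compile(r"open\([^)]*\+\s*\w+")

# ReDoS signal: nested quantifiers such as (a+)+ that can backtrack badly.
NESTED_QUANTIFIER = re.compile(r"\([^)]*[+*]\)[+*]")

def audit_diff(diff_text: str) -> list[str]:
    """Return human-readable findings for a PR diff."""
    findings = []
    if PATH_TRAVERSAL.search(diff_text):
        findings.append("possible path traversal: path built from raw input")
    if NESTED_QUANTIFIER.search(diff_text):
        findings.append("possible ReDoS: nested quantifier in regex")
    return findings
```

Real detectors would work on parsed code rather than raw text, but even shallow checks like these would catch the specific patterns the study says recur.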

Load-bearing premise

The 675 security-related PRs were accurately and consistently identified from the 33,000 AI-generated PRs without significant selection or labeling bias.

What would settle it

An independent review of the AI PR dataset that reveals a substantially different distribution of weaknesses or shows that rejections are driven primarily by security concerns instead of process factors.

Figures

Figures reproduced from arXiv: 2604.19965 by Arifa I. Champa, Asif K. Turzo, Md Fazle Rabbi, Minhaz F. Zibran.

Figure 1. Overview of our data construction process.
Figure 2. Rejection reasons across AI agents.
read the original abstract

Recent years have experienced growing contributions of AI coding agents that assist human developers in various software engineering tasks. However, this growing AI-assisted autonomy raises questions about security and trust. In this paper, we analyze more than 33,000 AI-generated pull requests (PRs) and identify 675 security-related submissions made by agentic AIs. Then we examine the security-related PRs with a focus on recurring security weaknesses, review outcomes and latency, commit message quality, and rejection reasons. The results show that security-related AI PRs introduce a small set of recurring weaknesses such as regex inefficiencies, injection flaws, and path traversal. Many flawed contributions are still merged, while rejections often arise from social or process factors such as inactivity or missing test coverage. The commit message quality of AI PRs has a limited effect on acceptance or latency, in contrast to human PRs reported in previous studies. We also extend existing rejection taxonomies by adding categories that are unique to AI-generated security contributions. These findings offer new insights into the strengths and shortcomings of autonomous coding systems in secure software development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper analyzes more than 33,000 AI-generated pull requests and identifies 675 security-related submissions. It examines these for recurring security weaknesses (regex inefficiencies, injection flaws, path traversal), review outcomes and latency, commit message quality, and rejection reasons. Key findings are that many flawed AI PRs are still merged, rejections often stem from social/process factors like inactivity or missing tests, commit message quality has limited effect on acceptance (unlike human PRs), and existing rejection taxonomies are extended with AI-specific categories.

Significance. If the 675-PRs sample is accurately and representatively identified, the study supplies useful empirical observations on security risks from autonomous AI coding agents in open-source settings. The identification of recurring weakness patterns and the extension of rejection taxonomies could inform both tooling and future research on AI-assisted secure development.

major comments (1)
  1. Abstract: the identification of the 675 security-related PRs from the 33,000 AI-generated PRs is stated without any description of the detection method (e.g., keywords, classifier, manual review), validation procedure, inter-rater reliability, or false-positive/negative rates. Because every subsequent claim—recurring weaknesses, merge rates, rejection reasons, and taxonomy extensions—rests on this subset being a clean sample, the absence of these details is load-bearing for the central empirical contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and the recommendation for major revision. We address the single major comment below and agree that strengthening the abstract will improve the manuscript.

read point-by-point responses
  1. Referee: Abstract: the identification of the 675 security-related PRs from the 33,000 AI-generated PRs is stated without any description of the detection method (e.g., keywords, classifier, manual review), validation procedure, inter-rater reliability, or false-positive/negative rates. Because every subsequent claim—recurring weaknesses, merge rates, rejection reasons, and taxonomy extensions—rests on this subset being a clean sample, the absence of these details is load-bearing for the central empirical contribution.

    Authors: We agree that the abstract would be strengthened by briefly describing the identification process. The full manuscript details this in the methodology section, which outlines the multi-stage approach used to select the 675 security-related PRs from the larger corpus. We will revise the abstract to include a concise summary of the detection method, validation steps, and any reported reliability or error considerations. This revision will be incorporated in the next version of the paper. revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational empirical study

full rationale

The paper performs data collection and qualitative analysis on external GitHub PRs (33k total, 675 labeled security-related). No equations, fitted parameters, predictions, or derivations exist. Identification of the 675 PRs, weakness taxonomy, merge/rejection statistics, and taxonomy extensions are direct observations from the dataset rather than reductions to self-definitions, self-citations, or renamed inputs. No load-bearing self-citation chains or ansatzes are present. The central claims rest on external data and manual/automated labeling whose validity is a separate methodological concern, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical observational study; no free parameters or invented entities; relies on domain assumptions for classifying PRs and security issues.

axioms (1)
  • domain assumption AI-generated PRs can be reliably distinguished from human ones and their security relevance can be accurately assessed through manual or automated review
    The study begins by identifying 675 security-related submissions from 33,000 AI PRs, which requires this classification step.
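
The reliability of this classification step is usually quantified by having two raters label the same sample of PRs and computing Cohen's kappa; the paper's reference list includes both the kappa coefficient and the Landis–Koch agreement scale. A minimal sketch with invented labels:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters over the same items (nominal labels)."""
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independent labeling with each rater's marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented labels for 10 PRs: security-related ("sec") vs not ("other").
a = ["sec", "sec", "other", "sec", "other", "other", "sec", "other", "sec", "other"]
b = ["sec", "sec", "other", "other", "other", "other", "sec", "other", "sec", "other"]
kappa = cohens_kappa(a, b)  # 0.8: substantial agreement on the Landis-Koch scale
```

A reported kappa (with the disagreement-resolution procedure) is exactly the evidence the referee report asks for to back this axiom.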

pith-pipeline@v0.9.0 · 5499 in / 1271 out tokens · 52956 ms · 2026-05-10T01:43:21.053988+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

80 extracted references · 10 canonical work pages · 1 internal anchor

  1. [1]

    Software security analysis in 2030 and beyond: A research roadmap.ACM Transactions on Software Engineering and Methodology, 34(5):1–26, 2025

    Marcel Böhme, Eric Bodden, Tevfik Bultan, Cristian Cadar, Yang Liu, and Giuseppe Scanniello. Software security analysis in 2030 and beyond: A research roadmap. ACM Transactions on Software Engineering and Methodology, 34(5):1–26, 2025

  2. [2]

    Software security in practice: knowledge and motivation.Journal of Cybersecurity, 11(1):tyaf005, 2025

    Hala Assal, Srivathsan G Morkonda, Muhammad Zaid Arif, and Sonia Chiasson. Software security in practice: knowledge and motivation.Journal of Cybersecurity, 11(1):tyaf005, 2025

  3. [3]

    Vulnerabilities and security patches detection in oss: a survey.ACM Computing Surveys, 57(1):1–37, 2024

    Ruyan Lin, Yulong Fu, Wei Yi, Jincheng Yang, Jin Cao, Zhiqiang Dong, Fei Xie, and Hui Li. Vulnerabilities and security patches detection in oss: a survey.ACM Computing Surveys, 57(1):1–37, 2024

  4. [4]

    Code change intention, development artifact, and history vulnerability: Putting them together for vulnerability fix detection by llm

    Xu Yang, Wenhan Zhu, Michael Pacheco, Jiayuan Zhou, Shaowei Wang, Xing Hu, and Kui Liu. Code change intention, development artifact, and history vulnerability: Putting them together for vulnerability fix detection by llm. Proceedings of the ACM on Software Engineering, 2(FSE):489–510, 2025

  5. [5]

    An exploratory study of the pull-based software development model

    Georgios Gousios, Martin Pinzger, and Arie van Deursen. An exploratory study of the pull-based software development model. InProceedings of the 36th international conference on software engineering, pages 345–355, 2014

  6. [6]

    Expectations, outcomes, and challenges of modern code review

    Alberto Bacchelli and Christian Bird. Expectations, outcomes, and challenges of modern code review. In2013 35th International Conference on Software Engineering (ICSE), pages 712–721, 2013

  7. [7]

    How do software developers use chatgpt? an exploratory study on github pull requests

    Moataz Chouchen, Narjes Bessghaier, Mahi Begoug, Ali Ouni, Eman Alomar, and Mohamed Wiem Mkaouer. How do software developers use chatgpt? an exploratory study on github pull requests. InProceedings of the 21st International Conference on Mining Software Repositories, pages 212–216, 2024

  8. [8]

    Generative ai for pull request descriptions: Adoption, impact, and developer interventions.Proceedings of the ACM on Software Engineering, 1(FSE):1043– 1065, 2024

    Tao Xiao, Hideaki Hata, Christoph Treude, and Kenichi Matsumoto. Generative ai for pull request descriptions: Adoption, impact, and developer interventions.Proceedings of the ACM on Software Engineering, 1(FSE):1043– 1065, 2024

  9. [9]

On the use of agentic coding: An empirical study of pull requests on github

    Miku Watanabe, Hao Li, Yutaro Kashiwa, Brittany Reid, Hajimu Iida, and Ahmed E Hassan. On the use of agentic coding: An empirical study of pull requests on github.arXiv preprint arXiv:2509.14745, 2025

  10. [10]

Vibe coding vs. agentic coding: Fundamentals and practical implications of agentic ai

    Ranjan Sapkota, Konstantinos I Roumeliotis, and Manoj Karkee. Vibe coding vs. agentic coding: Fundamentals and practical implications of agentic ai.arXiv preprint arXiv:2505.19443, 2025

  11. [11]

    An empirical study of automation in software security patch management

    Nesara Dissanayake, Asangi Jayatilaka, Mansooreh Zahedi, and Muhammad Ali Babar. An empirical study of automation in software security patch management. InProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–13, 2022

  12. [12]

    Patchtrack: Analyzing chatgpt’s impact on software patch decision-making in pull requests

    Daniel Ogenrwot and John Businge. Patchtrack: Analyzing chatgpt’s impact on software patch decision-making in pull requests. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, page 2480–2481, New York, NY , USA, 2024. ACM

  13. [13]

    Dependabot and security pull requests: large empirical study.Empirical Software Engineering, 29(5):128, 2024

    Hocine Rebatchi, Tégawendé F Bissyandé, and Naouel Moha. Dependabot and security pull requests: large empirical study.Empirical Software Engineering, 29(5):128, 2024

  14. [14]

    On the use of dependabot security pull requests

    Mahmoud Alfadel, Diego Elias Costa, Emad Shihab, and Mouafak Mkhallalati. On the use of dependabot security pull requests. In2021 IEEE/ACM 18th International conference on mining software repositories (MSR), pages 254–265. IEEE, 2021

  15. [15]

    How to get developers to accept security prs faster

    Andrew Stiefel. How to get developers to accept security prs faster. https://www.endorlabs.com/learn/ how-to-get-developers-to-accept-security-prs-faster , February 2025. Endor Labs. Accessed: 2025-10-14

  16. [16]

    The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

    Hao Li, Haoxiang Zhang, and Ahmed E Hassan. The rise of ai teammates in software engineering (se) 3.0: How autonomous coding agents are reshaping software engineering.arXiv preprint arXiv:2507.15003, 2025

  17. [17]

    Why do developers reject refactorings in open-source projects?ACM Transactions on Software Engineering and Methodology (TOSEM), 31(2):1–23, 2021

    Jevgenija Pantiuchina, Bin Lin, Fiorella Zampetti, Massimiliano Di Penta, Michele Lanza, and Gabriele Bavota. Why do developers reject refactorings in open-source projects?ACM Transactions on Software Engineering and Methodology (TOSEM), 31(2):1–23, 2021

  18. [18]

Replication package: "Insights into security-related ai-generated pull requests"

    Md Fazle Rabbi, Asif K. Turzo, Arifa I. Champa, and Minhaz F. Zibran. Replication package: “insights into security-related ai-generated pull requests”.https://doi.org/10.6084/m9.figshare.30421996, 2025

  19. [19]

    An empirical study of retrieval-augmented code generation: Challenges and opportunities.ACM Transactions on Software Engineering and Methodology, 2025

    Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, and Xin Xia. An empirical study of retrieval-augmented code generation: Challenges and opportunities.ACM Transactions on Software Engineering and Methodology, 2025

  20. [20]

    Humanevalcomm: Benchmarking the communication competence of code generation for llms and llm agents.ACM Transactions on Software Engineering and Methodology, 34(7):1–42, 2025

    Jie JW Wu and Fatemeh H Fard. Humanevalcomm: Benchmarking the communication competence of code generation for llms and llm agents.ACM Transactions on Software Engineering and Methodology, 34(7):1–42, 2025

  21. [21]

AI agentic programming: A survey of techniques, challenges, and opportunities

    Huanting Wang, Jingzhi Gong, Huawei Zhang, and Zheng Wang. Ai agentic programming: A survey of techniques, challenges, and opportunities.arXiv preprint arXiv:2508.11126, 2025

  22. [22]

    The impact of generative ai on open-source community engagement

    Karthik Babu Nattamai Kannan and Narayan Ramasubbu. The impact of generative ai on open-source community engagement. 2025

  23. [23]

Agentic ai software engineer: Programming with trust

    Abhik Roychoudhury, Corina Pasareanu, Michael Pradel, and Baishakhi Ray. Agentic ai software engineer: Programming with trust.arXiv preprint arXiv:2502.13767, 2025

  24. [24]

    Investigating and designing for trust in ai-powered code generation tools

    Ruotong Wang, Ruijia Cheng, Denae Ford, and Thomas Zimmermann. Investigating and designing for trust in ai-powered code generation tools. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 1475–1493, 2024

  25. [25]

    Trust, transparency, and adoption in generative ai for software engineering: Insights from twitter discourse.Information and Software Technology, 186:107804, 2025

    Manaal Basha and Gema Rodríguez-Pérez. Trust, transparency, and adoption in generative ai for software engineering: Insights from twitter discourse.Information and Software Technology, 186:107804, 2025

  26. [26]

    Trust in ai: progress, challenges, and future directions.Humanities and Social Sciences Communications, 11(1):1–30, 2024

    Saleh Afroogh, Ali Akbari, Emmie Malone, Mohammadali Kargar, and Hananeh Alambeigi. Trust in ai: progress, challenges, and future directions.Humanities and Social Sciences Communications, 11(1):1–30, 2024

  27. [27]

Fostering developers' trust in generative artificial intelligence

    Kevin M. Storer, Derek DeBellis, Sarah D’Angelo, and Adam Brown. Fostering developers’ trust in generative artificial intelligence. Technical report, DORA Research, March 2025. https://dora.dev/research/ai/ trust-in-ai/. Accessed: 2025-09-30

  28. [28]

    A large-scale empirical study of security patches

    Frank Li and Vern Paxson. A large-scale empirical study of security patches. InProceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 2201–2215, 2017

  29. [29]

    A large-scale analysis of the effectiveness of publicly reported security patches.Computers & Security, 148:104181, 2025

    Seunghoon Woo, Eunjin Choi, and Heejo Lee. A large-scale analysis of the effectiveness of publicly reported security patches.Computers & Security, 148:104181, 2025

  30. [30]

    Software security patch management- a systematic literature review of challenges, approaches, tools and practices.Information and Software Technology, 144:106771, 2022

    Nesara Dissanayake, Asangi Jayatilaka, Mansooreh Zahedi, and M Ali Babar. Software security patch management- a systematic literature review of challenges, approaches, tools and practices.Information and Software Technology, 144:106771, 2022

  31. [31]

    Patchdb: A large-scale security patch dataset

    Xinda Wang, Shu Wang, Pengbin Feng, Kun Sun, and Sushil Jajodia. Patchdb: A large-scale security patch dataset. In2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 149–160. IEEE, 2021

  32. [32]

    Just-in-time detection of silent security patches.ACM Transactions on Software Engineering and Methodology, 2025

    Xunzhu Tang, Kisub Kim, Saad Ezzini, Yewei Song, Haoye Tian, Jacques Klein, and Tegawende Bissyande. Just-in-time detection of silent security patches.ACM Transactions on Software Engineering and Methodology, 2025

  33. [33]

On the feasibility of stealthily introducing vulnerabilities in open-source software via hypocrite commits

    Qiushi Wu and Kangjie Lu. On the feasibility of stealthily introducing vulnerabilities in open-source software via hypocrite commits.Proc. Oakland, 17, 2021

  34. [34]

    Influence of social and technical factors for evaluating contri- bution in github

    Jason Tsay, Laura Dabbish, and James Herbsleb. Influence of social and technical factors for evaluating contri- bution in github. InProceedings of the 36th international conference on Software engineering, pages 356–366, 2014

  35. [35]

    Work practices and challenges in pull-based development: The integrator’s perspective

    Georgios Gousios, Andy Zaidman, Margaret-Anne Storey, and Arie Van Deursen. Work practices and challenges in pull-based development: The integrator’s perspective. In2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1, pages 358–368. IEEE, 2015

  36. [36]

    Pull request decisions explained: An empirical overview.IEEE Transactions on Software Engineering, 49(2):849–871, 2022

    Xunhui Zhang, Yue Yu, Georgios Gousios, and Ayushi Rastogi. Pull request decisions explained: An empirical overview.IEEE Transactions on Software Engineering, 49(2):849–871, 2022

  37. [37]

    Wait for it: Determinants of pull request evaluation latency on github

    Yue Yu, Huaimin Wang, Vladimir Filkov, Premkumar Devanbu, and Bogdan Vasilescu. Wait for it: Determinants of pull request evaluation latency on github. In2015 IEEE/ACM 12th working conference on mining software repositories, pages 367–371. IEEE, 2015

  38. [38]

    Pull request latency explained: An empirical overview.Empirical Software Engineering, 27(6):126, 2022

    Xunhui Zhang, Yue Yu, Tao Wang, Ayushi Rastogi, and Huaimin Wang. Pull request latency explained: An empirical overview.Empirical Software Engineering, 27(6):126, 2022

  39. [39]

    What makes a good commit message? In Proceedings of the 44th International Conference on Software Engineering, pages 2389–2401, 2022

    Yingchen Tian, Yuxia Zhang, Klaas-Jan Stol, Lin Jiang, and Hui Liu. What makes a good commit message? In Proceedings of the 44th International Conference on Software Engineering, pages 2389–2401, 2022

  40. [40]

    Commit message matters: Investigating impact and evolution of commit message quality

    Jiawei Li and Iftekhar Ahmed. Commit message matters: Investigating impact and evolution of commit message quality. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 806–817. IEEE, 2023

  41. [41]

    Optimization is better than generation: Optimizing commit message leveraging human-written commit message.arXiv preprint arXiv:2501.09861, 2025

    Jiawei Li, David Faragó, Christian Petrov, and Iftekhar Ahmed. Optimization is better than generation: Optimizing commit message leveraging human-written commit message.arXiv preprint arXiv:2501.09861, 2025

  42. [42]

    Codex.https://openai.com/codex/, 2025

    OpenAI. Codex.https://openai.com/codex/, 2025. Accessed: 2025-10-15

  43. [43]

    Devin, the ai software engineer, 2025

    Cognition AI. Devin, the ai software engineer, 2025. Available at: https://devin.ai. Accessed: 2025-10-15

  44. [44]

    Github copilot, 2025

    GitHub. Github copilot, 2025. Available at: https://github.com/features/copilot. Accessed: 2025-10-15

  45. [45]

    Cursor, 2025

    Cursor. Cursor, 2025. Available at:https://cursor.com. Accessed: 2025-10-15

  46. [46]

    Claude code, 2025

    Anthropic. Claude code, 2025. Available at: https://www.claude.com/product/claude-code. Accessed: 2025-10-15

  47. [47]

    Text filtering and ranking for security bug report prediction.IEEE Transactions on Software Engineering, 45(6):615–631, 2017

    Fayola Peters, Thein Than Tun, Yijun Yu, and Bashar Nuseibeh. Text filtering and ranking for security bug report prediction.IEEE Transactions on Software Engineering, 45(6):615–631, 2017

  48. [48]

    Why security defects go unnoticed during code reviews? a case-control study of the chromium os project

    Rajshakhar Paul, Asif Kamal Turzo, and Amiangshu Bosu. Why security defects go unnoticed during code reviews? a case-control study of the chromium os project. In2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pages 1373–1385. IEEE, 2021

  49. [49]

    Spi: Automated identification of security patches via commits.ACM Transactions on Software Engineering and Methodology (TOSEM), 31(1):1–27, 2021

    Yaqin Zhou, Jing Kai Siow, Chenyu Wang, Shangqing Liu, and Yang Liu. Spi: Automated identification of security patches via commits.ACM Transactions on Software Engineering and Methodology (TOSEM), 31(1):1–27, 2021

  50. [50]

    Automated identification of security issues from commit messages and bug reports

    Yaqin Zhou and Asankhaya Sharma. Automated identification of security issues from commit messages and bug reports. InProceedings of the 2017 11th joint meeting on foundations of software engineering, pages 914–919, 2017

  51. [51]

    Annotating materials science text: A semi-automated approach for crafting outputs with gemini pro.Integrating Materials and Manufacturing Innovation, 13(2):445– 452, 2024

    Hasan M Sayeed, Trupti Mohanty, and Taylor D Sparks. Annotating materials science text: A semi-automated approach for crafting outputs with gemini pro.Integrating Materials and Manufacturing Innovation, 13(2):445– 452, 2024

  52. [52]

    Automated reddit data annotation with large language models

    Sai Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, et al. Automated reddit data annotation with large language models. In2025 IEEE 13th International Conference on Healthcare Informatics (ICHI), pages 251–260. IEEE, 2025

  53. [53]

    Chatgpt outperforms crowd workers for text-annotation tasks.Proceedings of the National Academy of Sciences, 120(30):e2305016120, 2023

    Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd workers for text-annotation tasks.Proceedings of the National Academy of Sciences, 120(30):e2305016120, 2023

  54. [54]

    Studying the impact of noises in build breakage data.IEEE Transactions on Software Engineering, 47(9):1998–2011, 2019

    Taher Ahmed Ghaleb, Daniel Alencar Da Costa, Ying Zou, and Ahmed E Hassan. Studying the impact of noises in build breakage data.IEEE Transactions on Software Engineering, 47(9):1998–2011, 2019

  55. [55]

    An empirical study of issue-link algorithms: which issue-link algorithms should we use?Empirical Software Engineering, 27(6):136, 2022

    Masanari Kondo, Yutaro Kashiwa, Yasutaka Kamei, and Osamu Mizuno. An empirical study of issue-link algorithms: which issue-link algorithms should we use?Empirical Software Engineering, 27(6):136, 2022

  56. [56]

    The nature of build changes: An empirical study of maven-based build systems.Empirical Software Engineering, 26(3):32, 2021

    Christian Macho, Stefanie Beyer, Shane McIntosh, and Martin Pinzger. The nature of build changes: An empirical study of maven-based build systems.Empirical Software Engineering, 26(3):32, 2021

  57. [57]

    What happens in my code reviews? an investigation on automatically classifying review changes.Empirical Software Engineering, 27(4):89, 2022

    Enrico Fregnan, Fernando Petrulio, Linda Di Geronimo, and Alberto Bacchelli. What happens in my code reviews? an investigation on automatically classifying review changes.Empirical Software Engineering, 27(4):89, 2022

  58. [58]

    Using the confidence interval confidently.Journal of thoracic disease, 9(10):4125, 2017

    Avijit Hazra. Using the confidence interval confidently.Journal of thoracic disease, 9(10):4125, 2017

  59. [59]

    Cohen’s kappa coefficient as a performance measure for feature selection

    Susana M Vieira, Uzay Kaymak, and João MC Sousa. Cohen’s kappa coefficient as a performance measure for feature selection. InInternational conference on fuzzy systems, pages 1–8. IEEE, 2010

  60. [60]

Semgrep

    Semgrep.https://semgrep.dev, 2025. accessed: 2025-09-24

  61. [61]

    Semgrep*: Improving the limited performance of static application security testing (sast) tools

    Gareth Bennett, Tracy Hall, Emily Winter, and Steve Counsell. Semgrep*: Improving the limited performance of static application security testing (sast) tools. InProceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pages 614–623, 2024

  62. [62]

    On detecting and measuring exploitable javascript functions in real-world applications.ACM Transactions on Privacy and Security, 27(1):1–37, 2024

    Maryna Kluban, Mohammad Mannan, and Amr Youssef. On detecting and measuring exploitable javascript functions in real-world applications.ACM Transactions on Privacy and Security, 27(1):1–37, 2024

  63. [63]

    Evaluating c/c++ vulnerability detectability of query-based static application security testing tools.IEEE Transactions on Dependable and Secure Computing, 21(5):4600–4618, 2024

    Zongjie Li, Zhibo Liu, Wai Kin Wong, Pingchuan Ma, and Shuai Wang. Evaluating c/c++ vulnerability detectability of query-based static application security testing tools.IEEE Transactions on Dependable and Secure Computing, 21(5):4600–4618, 2024

  64. [64]

    ♪ with a little help from my (llm) friends: Enhancing static analysis with llms to detect software vulnerabilities

    Amy Munson, Juanita Gomez, and Álvaro A Cárdenas. ♪ with a little help from my (llm) friends: Enhancing static analysis with llms to detect software vulnerabilities. In2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), pages 25–32. IEEE, 2025

  65. [65]

    Effect of technical and social factors on pull request quality for the npm ecosystem

    Tapajit Dey and Audris Mockus. Effect of technical and social factors on pull request quality for the npm ecosystem. InProceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 1–11, 2020

  66. [66]

    Nearest neighbor selection for iteratively knn imputation.Journal of Systems and Software, 85(11):2541–2552, 2012

    Shichao Zhang. Nearest neighbor selection for iteratively knn imputation. Journal of Systems and Software, 85(11):2541–2552, 2012

  67. [67]

    The power of outliers (and why researchers should always check for them)

    Jason W Osborne and Amy Overbay. The power of outliers (and why researchers should always check for them). Practical Assessment, Research, and Evaluation, 9(1), 2004

  68. [68]

    Multivariable modeling strategies

    Frank E Harrell Jr. Multivariable modeling strategies. InRegression modeling strategies: With applications to linear models, logistic and ordinal regression, and survival analysis, pages 63–102. Springer, 2015

  69. [69]

    Linear regression models with logarithmic transformations.London School of Economics, London, 22(1):23–36, 2011

    Kenneth Benoit. Linear regression models with logarithmic transformations.London School of Economics, London, 22(1):23–36, 2011

  70. [70]

    Correlation tests in r: pearson cor, kendall’s tau, and spearman’s rho

    Kingsley Okoye and Samira Hosseini. Correlation tests in r: pearson cor, kendall’s tau, and spearman’s rho. InR programming: Statistical data analysis in research, pages 247–277. Springer, 2024

  71. [71]

    Inferring probability of relevance using the method of logistic regression

    Fredric C Gey. Inferring probability of relevance using the method of logistic regression. InSIGIR’94: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, organised by Dublin City University, pages 222–231. Springer, 1994

  72. [72]

Applied logistic regression

    David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant.Applied logistic regression. John Wiley & Sons, 2013

  73. [73]

    Segress: Software engineering guidelines for reporting secondary studies.IEEE Transactions on Software Engineering, 49(3):1273–1298, 2022

    Barbara Kitchenham, Lech Madeyski, and David Budgen. Segress: Software engineering guidelines for reporting secondary studies.IEEE Transactions on Software Engineering, 49(3):1273–1298, 2022

  74. [74]

    Making sense of card sorting data.Expert Systems, 22(3):89–93, 2005

    Sally Fincher and Josh Tenenberg. Making sense of card sorting data.Expert Systems, 22(3):89–93, 2005

  75. [75]

    The measurement of observer agreement for categorical data.biometrics, pages 159–174, 1977

    J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data.biometrics, pages 159–174, 1977

  76. [76]

Human-written vs. ai-generated code: A large-scale study of defects, vulnerabilities, and complexity

    Domenico Cotroneo, Cristina Improta, and Pietro Liguori. Human-written vs. ai-generated code: A large-scale study of defects, vulnerabilities, and complexity.arXiv preprint arXiv:2508.21634, 2025

  77. [77]

    Is functional correctness enough to evaluate code language models? exploring diversity of generated codes.arXiv preprint arXiv:2408.14504, 2024

    Heejae Chon, Seonghyeon Lee, Jinyoung Yeo, and Dongha Lee. Is functional correctness enough to evaluate code language models? exploring diversity of generated codes.arXiv preprint arXiv:2408.14504, 2024

  78. [78]

    Benchmarks and metrics for evaluations of code generation: A critical review

    Debalina Ghosh Paul, Hong Zhu, and Ian Bayley. Benchmarks and metrics for evaluations of code generation: A critical review. In2024 IEEE International Conference on Artificial Intelligence Testing (AITest), pages 87–94. IEEE, 2024

  79. [79]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36:21558–21572, 2023

  80. [80]

    Security degradation in iterative ai code generation – a systematic analysis of the paradox

    Shivani Shukla, Himanshu Joshi, and Romilla Syed. Security degradation in iterative ai code generation – a systematic analysis of the paradox. arXiv preprint arXiv:2506.11022, 2025