pith. machine review for the scientific record.

arxiv: 2604.19965 · v1 · submitted 2026-04-21 · 💻 cs.SE

Recognition: unknown

Insights into Security-Related AI-Generated Pull Requests

Authors on Pith no claims yet

Pith reviewed 2026-05-10 01:43 UTC · model grok-4.3

classification 💻 cs.SE
keywords AI-generated pull requests · security weaknesses · pull request acceptance · software security · AI coding agents · regex inefficiencies · injection flaws · path traversal

The pith

AI-generated security pull requests introduce recurring weaknesses like regex inefficiencies and injection flaws, with many flawed ones still merged.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper analyzes more than 33,000 AI-generated pull requests to isolate 675 security-related submissions. It shows that these AI contributions commonly introduce a narrow set of security weaknesses, including regex inefficiencies, injection flaws, and path traversal. Many of the flawed PRs are nevertheless merged into the codebases. Rejections tend to occur for social or process reasons, such as inactivity or missing test coverage, rather than for the security problems themselves. Commit message quality has little bearing on acceptance or review speed for these AI PRs, in contrast to patterns seen in human contributions.

Core claim

The study of 675 security-related AI-generated pull requests from over 33,000 total AI PRs identifies a small set of recurring weaknesses such as regex inefficiencies, injection flaws, and path traversal. Many flawed contributions are still merged, while rejections often arise from social or process factors such as inactivity or missing test coverage. Commit message quality shows limited effect on acceptance or latency unlike in human PRs, and the work extends existing rejection taxonomies with categories unique to AI-generated security contributions.
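
A selection step like this is often bootstrapped with a keyword screen over PR titles and bodies before manual labeling. The paper's actual pipeline is not described in this review (the referee report flags exactly this gap), so the terms, threshold, and helper below are illustrative assumptions only:

```python
import re

# Hypothetical keyword screen for flagging candidate security-related PRs.
# These terms are invented for illustration; the paper's real selection
# criteria and validation steps are not reproduced here.
SECURITY_TERMS = [
    r"\bsecurity\b", r"\bvulnerab\w*", r"\binjection\b", r"\bxss\b",
    r"\bpath traversal\b", r"\bredos\b", r"\bcve-\d{4}-\d+\b", r"\bsanitiz\w*",
]
PATTERN = re.compile("|".join(SECURITY_TERMS), re.IGNORECASE)

def is_security_related(pr_title: str, pr_body: str) -> bool:
    """Flag a PR as a security-related candidate for manual review."""
    return bool(PATTERN.search(pr_title + " " + pr_body))

is_security_related("Fix path traversal in file handler",
                    "Sanitize user-supplied paths.")   # flagged
is_security_related("Bump lodash version",
                    "Routine dependency update.")      # not flagged
```

A screen like this only produces candidates; the false-positive and false-negative rates the referee asks about would come from manually auditing a sample of both flagged and unflagged PRs.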

What carries the argument

Identification of security-related AI PRs and categorization of their recurring weaknesses together with an extended taxonomy of rejection reasons specific to AI security submissions.

If this is right

  • Many flawed security-related contributions from AI agents are merged into software projects.
  • Rejections of AI security PRs are more often tied to inactivity or missing tests than to the security weaknesses themselves.
  • Commit message quality does not strongly influence acceptance or review latency for AI-generated security PRs.
  • Rejection taxonomies for pull requests can be extended with new categories that apply specifically to AI security submissions.
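
The commit-message finding is the kind of claim typically checked by regressing acceptance on a message-quality score and reading off the coefficient; the paper's reference list includes standard logistic-regression texts. A minimal one-variable sketch, with the dataset and quality metric invented for illustration:

```python
import math
import random

random.seed(0)

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Toy dataset of (message_quality in [0,1], accepted 0/1) pairs with no
# real relationship built in -- a stand-in, not the paper's data.
data = [(random.random(), random.randint(0, 1)) for _ in range(200)]

# One-variable logistic regression fitted by gradient ascent on the
# log-likelihood: gradient of each parameter is sum of (y - p) * x.
w, b = 0.0, 0.0
for _ in range(2000):
    gw = gb = 0.0
    for x, y in data:
        p = sigmoid(w * x + b)
        gw += (y - p) * x
        gb += (y - p)
    w += 0.05 * gw / len(data)
    b += 0.05 * gb / len(data)

# A coefficient w near zero (relative to its standard error) would echo
# the paper's "limited effect" result for AI PRs.
```

On the paper's framing, the interesting contrast is that the same regression on human PRs in prior work yields a clearly nonzero coefficient.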

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • AI coding agents may need targeted safeguards against the narrow set of weaknesses that recur in security PRs.
  • Project maintainers could benefit from automated detectors tuned to the common AI security flaws before merging.
  • The limited role of commit messages suggests review processes for AI PRs may need different signals than those used for human PRs.
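
A pre-merge detector tuned to the recurring weakness classes could be as simple as a pair of pattern checks run over each diff. The heuristics below are assumptions for illustration only, not the detectors (e.g., Semgrep rules) the paper's references point to:

```python
import re

# Illustrative signals for two weakness classes the study reports as
# recurring in AI PRs. Both patterns are naive and invented here.

# Path-traversal signal: a path built by concatenating raw input into open().
PATH_TRAVERSAL = re.compile(r"open\([^)]*\+\s*\w+")

# ReDoS signal: nested quantifiers such as (a+)+ that can backtrack badly.
NESTED_QUANTIFIER = re.compile(r"\([^)]*[+*]\)[+*]")

def audit_diff(diff_text: str) -> list[str]:
    """Return human-readable findings for a PR diff."""
    findings = []
    if PATH_TRAVERSAL.search(diff_text):
        findings.append("possible path traversal: path built from raw input")
    if NESTED_QUANTIFIER.search(diff_text):
        findings.append("possible ReDoS: nested quantifier in regex")
    return findings
```

Real detectors would work on parsed code rather than raw text, but even shallow checks like these would catch the specific patterns the study says recur.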

Load-bearing premise

The 675 security-related PRs were accurately and consistently identified from the 33,000 AI-generated PRs without significant selection or labeling bias.

What would settle it

An independent review of the AI PR dataset that reveals a substantially different distribution of weaknesses or shows that rejections are driven primarily by security concerns instead of process factors.

Figures

Figures reproduced from arXiv: 2604.19965 by Arifa I. Champa, Asif K. Turzo, Md Fazle Rabbi, Minhaz F. Zibran.

Figure 1. Overview of our data construction process.
Figure 2. Rejection reasons across AI agents.
read the original abstract

Recent years have experienced growing contributions of AI coding agents that assist human developers in various software engineering tasks. However, this growing AI-assisted autonomy raises questions about security and trust. In this paper, we analyze more than 33,000 AI-generated pull requests (PRs) and identify 675 security-related submissions made by agentic AIs. Then we examine the security-related PRs with a focus on recurring security weaknesses, review outcomes and latency, commit message quality, and rejection reasons. The results show that security-related AI PRs introduce a small set of recurring weaknesses such as regex inefficiencies, injection flaws, and path traversal. Many flawed contributions are still merged, while rejections often arise from social or process factors such as inactivity or missing test coverage. The commit message quality of AI PRs has a limited effect on acceptance or latency, in contrast to human PRs reported in previous studies. We also extend existing rejection taxonomies by adding categories that are unique to AI-generated security contributions. These findings offer new insights into the strengths and shortcomings of autonomous coding systems in secure software development.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper analyzes more than 33,000 AI-generated pull requests and identifies 675 security-related submissions. It examines these for recurring security weaknesses (regex inefficiencies, injection flaws, path traversal), review outcomes and latency, commit message quality, and rejection reasons. Key findings are that many flawed AI PRs are still merged, rejections often stem from social/process factors like inactivity or missing tests, commit message quality has limited effect on acceptance (unlike human PRs), and existing rejection taxonomies are extended with AI-specific categories.

Significance. If the 675-PRs sample is accurately and representatively identified, the study supplies useful empirical observations on security risks from autonomous AI coding agents in open-source settings. The identification of recurring weakness patterns and the extension of rejection taxonomies could inform both tooling and future research on AI-assisted secure development.

major comments (1)
  1. Abstract: the identification of the 675 security-related PRs from the 33,000 AI-generated PRs is stated without any description of the detection method (e.g., keywords, classifier, manual review), validation procedure, inter-rater reliability, or false-positive/negative rates. Because every subsequent claim—recurring weaknesses, merge rates, rejection reasons, and taxonomy extensions—rests on this subset being a clean sample, the absence of these details is load-bearing for the central empirical contribution.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback and the recommendation for major revision. We address the single major comment below and agree that strengthening the abstract will improve the manuscript.

read point-by-point responses
  1. Referee: Abstract: the identification of the 675 security-related PRs from the 33,000 AI-generated PRs is stated without any description of the detection method (e.g., keywords, classifier, manual review), validation procedure, inter-rater reliability, or false-positive/negative rates. Because every subsequent claim—recurring weaknesses, merge rates, rejection reasons, and taxonomy extensions—rests on this subset being a clean sample, the absence of these details is load-bearing for the central empirical contribution.

    Authors: We agree that the abstract would be strengthened by briefly describing the identification process. The full manuscript details this in the methodology section, which outlines the multi-stage approach used to select the 675 security-related PRs from the larger corpus. We will revise the abstract to include a concise summary of the detection method, validation steps, and any reported reliability or error considerations. This revision will be incorporated in the next version of the paper. revision: yes

Circularity Check

0 steps flagged

No circularity: purely observational empirical study

full rationale

The paper performs data collection and qualitative analysis on external GitHub PRs (33k total, 675 labeled security-related). No equations, fitted parameters, predictions, or derivations exist. Identification of the 675 PRs, weakness taxonomy, merge/rejection statistics, and taxonomy extensions are direct observations from the dataset rather than reductions to self-definitions, self-citations, or renamed inputs. No load-bearing self-citation chains or ansatzes are present. The central claims rest on external data and manual/automated labeling whose validity is a separate methodological concern, not a circularity issue.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical observational study; no free parameters or invented entities; relies on domain assumptions for classifying PRs and security issues.

axioms (1)
  • domain assumption AI-generated PRs can be reliably distinguished from human ones and their security relevance can be accurately assessed through manual or automated review
    The study begins by identifying 675 security-related submissions from 33,000 AI PRs, which requires this classification step.
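
The reliability of this classification step is usually quantified by having two raters label the same sample of PRs and computing Cohen's kappa; the paper's reference list includes both the kappa coefficient and the Landis–Koch agreement scale. A minimal sketch with invented labels:

```python
from collections import Counter

def cohens_kappa(rater_a: list, rater_b: list) -> float:
    """Cohen's kappa for two raters over the same items (nominal labels)."""
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independent labeling with each rater's marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(rater_a) | set(rater_b)
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    return (observed - expected) / (1 - expected)

# Invented labels for 10 PRs: security-related ("sec") vs not ("other").
a = ["sec", "sec", "other", "sec", "other", "other", "sec", "other", "sec", "other"]
b = ["sec", "sec", "other", "other", "other", "other", "sec", "other", "sec", "other"]
kappa = cohens_kappa(a, b)  # 0.8: substantial agreement on the Landis-Koch scale
```

A reported kappa (with the disagreement-resolution procedure) is exactly the evidence the referee report asks for to back this axiom.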

pith-pipeline@v0.9.0 · 5499 in / 1271 out tokens · 52956 ms · 2026-05-10T01:43:21.053988+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Reference graph

Works this paper leans on

80 extracted references · 10 canonical work pages · 1 internal anchor

  1. [1]

    Software security analysis in 2030 and beyond: A research roadmap.ACM Transactions on Software Engineering and Methodology, 34(5):1–26, 2025

    Marcel Böhme, Eric Bodden, Tevfik Bultan, Cristian Cadar, Yang Liu, and Giuseppe Scanniello. Software security analysis in 2030 and beyond: A research roadmap. ACM Transactions on Software Engineering and Methodology, 34(5):1–26, 2025

  2. [2]

    Software security in practice: knowledge and motivation.Journal of Cybersecurity, 11(1):tyaf005, 2025

    Hala Assal, Srivathsan G Morkonda, Muhammad Zaid Arif, and Sonia Chiasson. Software security in practice: knowledge and motivation.Journal of Cybersecurity, 11(1):tyaf005, 2025

  3. [3]

    Vulnerabilities and security patches detection in oss: a survey.ACM Computing Surveys, 57(1):1–37, 2024

    Ruyan Lin, Yulong Fu, Wei Yi, Jincheng Yang, Jin Cao, Zhiqiang Dong, Fei Xie, and Hui Li. Vulnerabilities and security patches detection in oss: a survey.ACM Computing Surveys, 57(1):1–37, 2024

  4. [4]

    Code change intention, development artifact, and history vulnerability: Putting them together for vulnerability fix detection by llm

    Xu Yang, Wenhan Zhu, Michael Pacheco, Jiayuan Zhou, Shaowei Wang, Xing Hu, and Kui Liu. Code change intention, development artifact, and history vulnerability: Putting them together for vulnerability fix detection by llm. Proceedings of the ACM on Software Engineering, 2(FSE):489–510, 2025

  5. [5]

    An exploratory study of the pull-based software development model

    Georgios Gousios, Martin Pinzger, and Arie van Deursen. An exploratory study of the pull-based software development model. InProceedings of the 36th international conference on software engineering, pages 345–355, 2014

  6. [6]

    Expectations, outcomes, and challenges of modern code review

    Alberto Bacchelli and Christian Bird. Expectations, outcomes, and challenges of modern code review. In2013 35th International Conference on Software Engineering (ICSE), pages 712–721, 2013

  7. [7]

    How do software developers use chatgpt? an exploratory study on github pull requests

    Moataz Chouchen, Narjes Bessghaier, Mahi Begoug, Ali Ouni, Eman Alomar, and Mohamed Wiem Mkaouer. How do software developers use chatgpt? an exploratory study on github pull requests. InProceedings of the 21st International Conference on Mining Software Repositories, pages 212–216, 2024

  8. [8]

    Generative ai for pull request descriptions: Adoption, impact, and developer interventions.Proceedings of the ACM on Software Engineering, 1(FSE):1043– 1065, 2024

    Tao Xiao, Hideaki Hata, Christoph Treude, and Kenichi Matsumoto. Generative ai for pull request descriptions: Adoption, impact, and developer interventions.Proceedings of the ACM on Software Engineering, 1(FSE):1043– 1065, 2024

  9. [9]

On the use of agentic coding: An empirical study of pull requests on github

    Miku Watanabe, Hao Li, Yutaro Kashiwa, Brittany Reid, Hajimu Iida, and Ahmed E Hassan. On the use of agentic coding: An empirical study of pull requests on github.arXiv preprint arXiv:2509.14745, 2025

  10. [10]

Vibe coding vs. agentic coding: Fundamentals and practical implications of agentic ai

    Ranjan Sapkota, Konstantinos I Roumeliotis, and Manoj Karkee. Vibe coding vs. agentic coding: Fundamentals and practical implications of agentic ai.arXiv preprint arXiv:2505.19443, 2025

  11. [11]

    An empirical study of automation in software security patch management

    Nesara Dissanayake, Asangi Jayatilaka, Mansooreh Zahedi, and Muhammad Ali Babar. An empirical study of automation in software security patch management. InProceedings of the 37th IEEE/ACM International Conference on Automated Software Engineering, pages 1–13, 2022

  12. [12]

    Patchtrack: Analyzing chatgpt’s impact on software patch decision-making in pull requests

    Daniel Ogenrwot and John Businge. Patchtrack: Analyzing chatgpt’s impact on software patch decision-making in pull requests. InProceedings of the 39th IEEE/ACM International Conference on Automated Software Engineering, page 2480–2481, New York, NY , USA, 2024. ACM

  13. [13]

    Dependabot and security pull requests: large empirical study.Empirical Software Engineering, 29(5):128, 2024

    Hocine Rebatchi, Tégawendé F Bissyandé, and Naouel Moha. Dependabot and security pull requests: large empirical study.Empirical Software Engineering, 29(5):128, 2024

  14. [14]

    On the use of dependabot security pull requests

    Mahmoud Alfadel, Diego Elias Costa, Emad Shihab, and Mouafak Mkhallalati. On the use of dependabot security pull requests. In2021 IEEE/ACM 18th International conference on mining software repositories (MSR), pages 254–265. IEEE, 2021

  15. [15]

    How to get developers to accept security prs faster

    Andrew Stiefel. How to get developers to accept security prs faster. https://www.endorlabs.com/learn/ how-to-get-developers-to-accept-security-prs-faster , February 2025. Endor Labs. Accessed: 2025-10-14

  16. [16]

    The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering

    Hao Li, Haoxiang Zhang, and Ahmed E Hassan. The rise of ai teammates in software engineering (se) 3.0: How autonomous coding agents are reshaping software engineering.arXiv preprint arXiv:2507.15003, 2025

  17. [17]

    Why do developers reject refactorings in open-source projects?ACM Transactions on Software Engineering and Methodology (TOSEM), 31(2):1–23, 2021

    Jevgenija Pantiuchina, Bin Lin, Fiorella Zampetti, Massimiliano Di Penta, Michele Lanza, and Gabriele Bavota. Why do developers reject refactorings in open-source projects?ACM Transactions on Software Engineering and Methodology (TOSEM), 31(2):1–23, 2021

  18. [18]

Replication package: "Insights into security-related ai-generated pull requests"

    Md Fazle Rabbi, Asif K. Turzo, Arifa I. Champa, and Minhaz F. Zibran. Replication package: “insights into security-related ai-generated pull requests”.https://doi.org/10.6084/m9.figshare.30421996, 2025

  19. [19]

    An empirical study of retrieval-augmented code generation: Challenges and opportunities.ACM Transactions on Software Engineering and Methodology, 2025

    Zezhou Yang, Sirong Chen, Cuiyun Gao, Zhenhao Li, Xing Hu, Kui Liu, and Xin Xia. An empirical study of retrieval-augmented code generation: Challenges and opportunities.ACM Transactions on Software Engineering and Methodology, 2025

  20. [20]

    Humanevalcomm: Benchmarking the communication competence of code generation for llms and llm agents.ACM Transactions on Software Engineering and Methodology, 34(7):1–42, 2025

    Jie JW Wu and Fatemeh H Fard. Humanevalcomm: Benchmarking the communication competence of code generation for llms and llm agents.ACM Transactions on Software Engineering and Methodology, 34(7):1–42, 2025

  21. [21]

AI agentic programming: A survey of techniques, challenges, and opportunities

    Huanting Wang, Jingzhi Gong, Huawei Zhang, and Zheng Wang. Ai agentic programming: A survey of techniques, challenges, and opportunities.arXiv preprint arXiv:2508.11126, 2025

  22. [22]

    The impact of generative ai on open-source community engagement

    Karthik Babu Nattamai Kannan and Narayan Ramasubbu. The impact of generative ai on open-source community engagement. 2025

  23. [23]

Agentic ai software engineer: Programming with trust

    Abhik Roychoudhury, Corina Pasareanu, Michael Pradel, and Baishakhi Ray. Agentic ai software engineer: Programming with trust.arXiv preprint arXiv:2502.13767, 2025

  24. [24]

    Investigating and designing for trust in ai-powered code generation tools

    Ruotong Wang, Ruijia Cheng, Denae Ford, and Thomas Zimmermann. Investigating and designing for trust in ai-powered code generation tools. InProceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency, pages 1475–1493, 2024

  25. [25]

    Trust, transparency, and adoption in generative ai for software engineering: Insights from twitter discourse.Information and Software Technology, 186:107804, 2025

    Manaal Basha and Gema Rodríguez-Pérez. Trust, transparency, and adoption in generative ai for software engineering: Insights from twitter discourse.Information and Software Technology, 186:107804, 2025

  26. [26]

    Trust in ai: progress, challenges, and future directions.Humanities and Social Sciences Communications, 11(1):1–30, 2024

    Saleh Afroogh, Ali Akbari, Emmie Malone, Mohammadali Kargar, and Hananeh Alambeigi. Trust in ai: progress, challenges, and future directions.Humanities and Social Sciences Communications, 11(1):1–30, 2024

  27. [27]

Fostering developers' trust in generative artificial intelligence

    Kevin M. Storer, Derek DeBellis, Sarah D’Angelo, and Adam Brown. Fostering developers’ trust in generative artificial intelligence. Technical report, DORA Research, March 2025. https://dora.dev/research/ai/ trust-in-ai/. Accessed: 2025-09-30

  28. [28]

    A large-scale empirical study of security patches

    Frank Li and Vern Paxson. A large-scale empirical study of security patches. InProceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pages 2201–2215, 2017

  29. [29]

    A large-scale analysis of the effectiveness of publicly reported security patches.Computers & Security, 148:104181, 2025

    Seunghoon Woo, Eunjin Choi, and Heejo Lee. A large-scale analysis of the effectiveness of publicly reported security patches.Computers & Security, 148:104181, 2025

  30. [30]

    Software security patch management- a systematic literature review of challenges, approaches, tools and practices.Information and Software Technology, 144:106771, 2022

    Nesara Dissanayake, Asangi Jayatilaka, Mansooreh Zahedi, and M Ali Babar. Software security patch management- a systematic literature review of challenges, approaches, tools and practices.Information and Software Technology, 144:106771, 2022

  31. [31]

    Patchdb: A large-scale security patch dataset

    Xinda Wang, Shu Wang, Pengbin Feng, Kun Sun, and Sushil Jajodia. Patchdb: A large-scale security patch dataset. In2021 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN), pages 149–160. IEEE, 2021

  32. [32]

    Just-in-time detection of silent security patches.ACM Transactions on Software Engineering and Methodology, 2025

    Xunzhu Tang, Kisub Kim, Saad Ezzini, Yewei Song, Haoye Tian, Jacques Klein, and Tegawende Bissyande. Just-in-time detection of silent security patches.ACM Transactions on Software Engineering and Methodology, 2025

  33. [33]

On the feasibility of stealthily introducing vulnerabilities in open-source software via hypocrite commits

    Qiushi Wu and Kangjie Lu. On the feasibility of stealthily introducing vulnerabilities in open-source software via hypocrite commits.Proc. Oakland, 17, 2021

  34. [34]

    Influence of social and technical factors for evaluating contri- bution in github

    Jason Tsay, Laura Dabbish, and James Herbsleb. Influence of social and technical factors for evaluating contri- bution in github. InProceedings of the 36th international conference on Software engineering, pages 356–366, 2014

  35. [35]

    Work practices and challenges in pull-based development: The integrator’s perspective

    Georgios Gousios, Andy Zaidman, Margaret-Anne Storey, and Arie Van Deursen. Work practices and challenges in pull-based development: The integrator’s perspective. In2015 IEEE/ACM 37th IEEE International Conference on Software Engineering, volume 1, pages 358–368. IEEE, 2015

  36. [36]

    Pull request decisions explained: An empirical overview.IEEE Transactions on Software Engineering, 49(2):849–871, 2022

    Xunhui Zhang, Yue Yu, Georgios Gousios, and Ayushi Rastogi. Pull request decisions explained: An empirical overview.IEEE Transactions on Software Engineering, 49(2):849–871, 2022

  37. [37]

    Wait for it: Determinants of pull request evaluation latency on github

    Yue Yu, Huaimin Wang, Vladimir Filkov, Premkumar Devanbu, and Bogdan Vasilescu. Wait for it: Determinants of pull request evaluation latency on github. In2015 IEEE/ACM 12th working conference on mining software repositories, pages 367–371. IEEE, 2015

  38. [38]

    Pull request latency explained: An empirical overview.Empirical Software Engineering, 27(6):126, 2022

    Xunhui Zhang, Yue Yu, Tao Wang, Ayushi Rastogi, and Huaimin Wang. Pull request latency explained: An empirical overview.Empirical Software Engineering, 27(6):126, 2022

  39. [39]

    What makes a good commit message? In Proceedings of the 44th International Conference on Software Engineering, pages 2389–2401, 2022

    Yingchen Tian, Yuxia Zhang, Klaas-Jan Stol, Lin Jiang, and Hui Liu. What makes a good commit message? In Proceedings of the 44th International Conference on Software Engineering, pages 2389–2401, 2022

  40. [40]

    Commit message matters: Investigating impact and evolution of commit message quality

    Jiawei Li and Iftekhar Ahmed. Commit message matters: Investigating impact and evolution of commit message quality. In2023 IEEE/ACM 45th International Conference on Software Engineering (ICSE), pages 806–817. IEEE, 2023

  41. [41]

    Optimization is better than generation: Optimizing commit message leveraging human-written commit message.arXiv preprint arXiv:2501.09861, 2025

    Jiawei Li, David Faragó, Christian Petrov, and Iftekhar Ahmed. Optimization is better than generation: Optimizing commit message leveraging human-written commit message.arXiv preprint arXiv:2501.09861, 2025

  42. [42]

    Codex.https://openai.com/codex/, 2025

    OpenAI. Codex.https://openai.com/codex/, 2025. Accessed: 2025-10-15

  43. [43]

    Devin, the ai software engineer, 2025

    Cognition AI. Devin, the ai software engineer, 2025. Available at: https://devin.ai. Accessed: 2025-10-15

  44. [44]

    Github copilot, 2025

    GitHub. Github copilot, 2025. Available at: https://github.com/features/copilot. Accessed: 2025-10-15

  45. [45]

    Cursor, 2025

    Cursor. Cursor, 2025. Available at:https://cursor.com. Accessed: 2025-10-15

  46. [46]

    Claude code, 2025

    Anthropic. Claude code, 2025. Available at: https://www.claude.com/product/claude-code. Accessed: 2025-10-15

  47. [47]

    Text filtering and ranking for security bug report prediction.IEEE Transactions on Software Engineering, 45(6):615–631, 2017

    Fayola Peters, Thein Than Tun, Yijun Yu, and Bashar Nuseibeh. Text filtering and ranking for security bug report prediction.IEEE Transactions on Software Engineering, 45(6):615–631, 2017

  48. [48]

    Why security defects go unnoticed during code reviews? a case-control study of the chromium os project

    Rajshakhar Paul, Asif Kamal Turzo, and Amiangshu Bosu. Why security defects go unnoticed during code reviews? a case-control study of the chromium os project. In2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE), pages 1373–1385. IEEE, 2021

  49. [49]

    Spi: Automated identification of security patches via commits.ACM Transactions on Software Engineering and Methodology (TOSEM), 31(1):1–27, 2021

    Yaqin Zhou, Jing Kai Siow, Chenyu Wang, Shangqing Liu, and Yang Liu. Spi: Automated identification of security patches via commits.ACM Transactions on Software Engineering and Methodology (TOSEM), 31(1):1–27, 2021

  50. [50]

    Automated identification of security issues from commit messages and bug reports

    Yaqin Zhou and Asankhaya Sharma. Automated identification of security issues from commit messages and bug reports. InProceedings of the 2017 11th joint meeting on foundations of software engineering, pages 914–919, 2017

  51. [51]

    Annotating materials science text: A semi-automated approach for crafting outputs with gemini pro.Integrating Materials and Manufacturing Innovation, 13(2):445– 452, 2024

    Hasan M Sayeed, Trupti Mohanty, and Taylor D Sparks. Annotating materials science text: A semi-automated approach for crafting outputs with gemini pro.Integrating Materials and Manufacturing Innovation, 13(2):445– 452, 2024

  52. [52]

    Automated reddit data annotation with large language models

    Sai Vuruma, Dezhi Wu, Saborny Sen Gupta, Lucas Aust, Valerie Lookingbill, Wyatt Bellamy, Yang Ren, Erin Kasson, Li-Shiun Chen, Patricia Cavazos-Rehg, et al. Automated reddit data annotation with large language models. In2025 IEEE 13th International Conference on Healthcare Informatics (ICHI), pages 251–260. IEEE, 2025

  53. [53]

    Chatgpt outperforms crowd workers for text-annotation tasks.Proceedings of the National Academy of Sciences, 120(30):e2305016120, 2023

    Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd workers for text-annotation tasks.Proceedings of the National Academy of Sciences, 120(30):e2305016120, 2023

  54. [54]

    Studying the impact of noises in build breakage data.IEEE Transactions on Software Engineering, 47(9):1998–2011, 2019

    Taher Ahmed Ghaleb, Daniel Alencar Da Costa, Ying Zou, and Ahmed E Hassan. Studying the impact of noises in build breakage data.IEEE Transactions on Software Engineering, 47(9):1998–2011, 2019

  55. [55]

    An empirical study of issue-link algorithms: which issue-link algorithms should we use?Empirical Software Engineering, 27(6):136, 2022

    Masanari Kondo, Yutaro Kashiwa, Yasutaka Kamei, and Osamu Mizuno. An empirical study of issue-link algorithms: which issue-link algorithms should we use?Empirical Software Engineering, 27(6):136, 2022

  56. [56]

    The nature of build changes: An empirical study of maven-based build systems.Empirical Software Engineering, 26(3):32, 2021

    Christian Macho, Stefanie Beyer, Shane McIntosh, and Martin Pinzger. The nature of build changes: An empirical study of maven-based build systems.Empirical Software Engineering, 26(3):32, 2021

  57. [57]

    What happens in my code reviews? an investigation on automatically classifying review changes.Empirical Software Engineering, 27(4):89, 2022

    Enrico Fregnan, Fernando Petrulio, Linda Di Geronimo, and Alberto Bacchelli. What happens in my code reviews? an investigation on automatically classifying review changes.Empirical Software Engineering, 27(4):89, 2022

  58. [58]

    Using the confidence interval confidently.Journal of thoracic disease, 9(10):4125, 2017

    Avijit Hazra. Using the confidence interval confidently.Journal of thoracic disease, 9(10):4125, 2017

  59. [59]

    Cohen’s kappa coefficient as a performance measure for feature selection

    Susana M Vieira, Uzay Kaymak, and João MC Sousa. Cohen’s kappa coefficient as a performance measure for feature selection. InInternational conference on fuzzy systems, pages 1–8. IEEE, 2010

  60. [60]

Semgrep

    Semgrep.https://semgrep.dev, 2025. accessed: 2025-09-24

  61. [61]

    Semgrep*: Improving the limited performance of static application security testing (sast) tools

    Gareth Bennett, Tracy Hall, Emily Winter, and Steve Counsell. Semgrep*: Improving the limited performance of static application security testing (sast) tools. InProceedings of the 28th International Conference on Evaluation and Assessment in Software Engineering, pages 614–623, 2024

  62. [62]

    On detecting and measuring exploitable javascript functions in real-world applications.ACM Transactions on Privacy and Security, 27(1):1–37, 2024

    Maryna Kluban, Mohammad Mannan, and Amr Youssef. On detecting and measuring exploitable javascript functions in real-world applications.ACM Transactions on Privacy and Security, 27(1):1–37, 2024

  63. [63]

    Evaluating c/c++ vulnerability detectability of query-based static application security testing tools.IEEE Transactions on Dependable and Secure Computing, 21(5):4600–4618, 2024

    Zongjie Li, Zhibo Liu, Wai Kin Wong, Pingchuan Ma, and Shuai Wang. Evaluating c/c++ vulnerability detectability of query-based static application security testing tools.IEEE Transactions on Dependable and Secure Computing, 21(5):4600–4618, 2024

  64. [64]

    ♪ with a little help from my (llm) friends: Enhancing static analysis with llms to detect software vulnerabilities

    Amy Munson, Juanita Gomez, and Álvaro A Cárdenas. ♪ with a little help from my (llm) friends: Enhancing static analysis with llms to detect software vulnerabilities. In2025 IEEE/ACM International Workshop on Large Language Models for Code (LLM4Code), pages 25–32. IEEE, 2025

  65. [65]

    Effect of technical and social factors on pull request quality for the npm ecosystem

    Tapajit Dey and Audris Mockus. Effect of technical and social factors on pull request quality for the npm ecosystem. InProceedings of the 14th ACM/IEEE International Symposium on Empirical Software Engineering and Measurement (ESEM), pages 1–11, 2020

  66. [66]

    Nearest neighbor selection for iteratively knn imputation.Journal of Systems and Software, 85(11):2541–2552, 2012

    Shichao Zhang. Nearest neighbor selection for iteratively knn imputation. Journal of Systems and Software, 85(11):2541–2552, 2012

  67. [67]

    The power of outliers (and why researchers should always check for them)

    Jason W Osborne and Amy Overbay. The power of outliers (and why researchers should always check for them). Practical Assessment, Research, and Evaluation, 9(1), 2004

  68. [68]

    Multivariable modeling strategies

    Frank E Harrell Jr. Multivariable modeling strategies. InRegression modeling strategies: With applications to linear models, logistic and ordinal regression, and survival analysis, pages 63–102. Springer, 2015

  69. [69]

    Linear regression models with logarithmic transformations.London School of Economics, London, 22(1):23–36, 2011

    Kenneth Benoit. Linear regression models with logarithmic transformations.London School of Economics, London, 22(1):23–36, 2011

  70. [70]

    Correlation tests in r: pearson cor, kendall’s tau, and spearman’s rho

    Kingsley Okoye and Samira Hosseini. Correlation tests in r: pearson cor, kendall’s tau, and spearman’s rho. InR programming: Statistical data analysis in research, pages 247–277. Springer, 2024

  71. [71]

    Inferring probability of relevance using the method of logistic regression

    Fredric C Gey. Inferring probability of relevance using the method of logistic regression. InSIGIR’94: Proceedings of the Seventeenth Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval, organised by Dublin City University, pages 222–231. Springer, 1994

  72. [72]

Applied logistic regression

    David W Hosmer Jr, Stanley Lemeshow, and Rodney X Sturdivant.Applied logistic regression. John Wiley & Sons, 2013

  73. [73]

    Segress: Software engineering guidelines for reporting secondary studies.IEEE Transactions on Software Engineering, 49(3):1273–1298, 2022

    Barbara Kitchenham, Lech Madeyski, and David Budgen. Segress: Software engineering guidelines for reporting secondary studies.IEEE Transactions on Software Engineering, 49(3):1273–1298, 2022

  74. [74]

    Making sense of card sorting data.Expert Systems, 22(3):89–93, 2005

    Sally Fincher and Josh Tenenberg. Making sense of card sorting data.Expert Systems, 22(3):89–93, 2005

  75. [75]

    The measurement of observer agreement for categorical data.biometrics, pages 159–174, 1977

    J Richard Landis and Gary G Koch. The measurement of observer agreement for categorical data.biometrics, pages 159–174, 1977

  76. [76]

Human-written vs. ai-generated code: A large-scale study of defects, vulnerabilities, and complexity

    Domenico Cotroneo, Cristina Improta, and Pietro Liguori. Human-written vs. ai-generated code: A large-scale study of defects, vulnerabilities, and complexity.arXiv preprint arXiv:2508.21634, 2025

  77. [77]

    Is functional correctness enough to evaluate code language models? exploring diversity of generated codes.arXiv preprint arXiv:2408.14504, 2024

    Heejae Chon, Seonghyeon Lee, Jinyoung Yeo, and Dongha Lee. Is functional correctness enough to evaluate code language models? exploring diversity of generated codes.arXiv preprint arXiv:2408.14504, 2024

  78. [78]

    Benchmarks and metrics for evaluations of code generation: A critical review

    Debalina Ghosh Paul, Hong Zhu, and Ian Bayley. Benchmarks and metrics for evaluations of code generation: A critical review. In2024 IEEE International Conference on Artificial Intelligence Testing (AITest), pages 87–94. IEEE, 2024

  79. [79]

    Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation

    Jiawei Liu, Chunqiu Steven Xia, Yuyao Wang, and Lingming Zhang. Is your code generated by chatgpt really correct? rigorous evaluation of large language models for code generation. Advances in Neural Information Processing Systems, 36:21558–21572, 2023

  80. [80]

    Security degradation in iterative ai code generation – a systematic analysis of the paradox

    Shivani Shukla, Himanshu Joshi, and Romilla Syed. Security degradation in iterative ai code generation – a systematic analysis of the paradox. arXiv preprint arXiv:2506.11022, 2025