Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset

Ali Arabat; Mahmoud Abujadallah; Mohammed Sayagh

arxiv: 2606.13468 · v1 · pith:W2NO3RDYnew · submitted 2026-06-11 · 💻 cs.SE · cs.AI


Pith reviewed 2026-06-27 05:59 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords AI coding agents · pull requests · rejection reasons · qualitative study · software fixes · continuous integration · code generation · agentic workflows


The pith

Fourteen reasons drive rejections of AI-generated code fixes

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines why AI coding agents' proposed fixes in pull requests are often rejected by developers. A qualitative study of 306 non-merged PRs from agents including Copilot, Devin, Cursor, and Claude identifies fourteen rejection reasons in four categories. These include incorrect implementations like incomplete or wrong approaches, failures to pass CI pipelines and tests, the agent's inability to generate code or complete tasks, and low priority of the fixes. Understanding these modes matters as 46 percent of such fixes get discarded after consuming review and compute resources. The work indicates that guiding agents with approach hints, constraints, and validation instructions could reduce rejections.

Core claim

Our qualitative findings identify 14 reasons divided into four high-level categories for rejecting AI-agent fixes. We observe that developers can reject fixes due to fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should not be taken, and (3) instructing the agent on how to validate the implementation through CI pipelines and without introducing a breaking change.

What carries the argument

Qualitative categorization of rejection reasons from 306 non-merged pull requests into fourteen reasons and four high-level categories.

If this is right

Guiding agents with hints on the fix approach reduces implementation errors.
Outlining constraints prevents unsuitable approaches.
Instructing on CI validation and avoiding breaks improves acceptance rates.
Prioritizing tasks avoids wasting resources on low-priority fixes.
Addressing these modes is key to integrating AI agents as efficient teammates.
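One way to picture the guidance levels the paper recommends (approach hints, constraints, validation instructions) is as fields of a task brief handed to the agent. The sketch below is illustrative only: `TaskBrief`, its field names, and the example issue are hypothetical and do not correspond to any agent's actual API or to material from the paper.

```python
from dataclasses import dataclass, field


@dataclass
class TaskBrief:
    """Hypothetical container for the three guidance levels plus the issue text."""
    issue: str
    approach_hints: list[str] = field(default_factory=list)
    constraints: list[str] = field(default_factory=list)
    validation_steps: list[str] = field(default_factory=list)

    def render(self) -> str:
        """Render the brief as plain text for an agent's task description."""
        sections = [
            ("Issue", [self.issue]),
            ("Suggested approach", self.approach_hints),
            ("Do not", self.constraints),
            ("Validate before opening the PR", self.validation_steps),
        ]
        lines = []
        for title, items in sections:
            if not items:
                continue  # omit sections the author left empty
            lines.append(f"{title}:")
            lines.extend(f"- {item}" for item in items)
        return "\n".join(lines)


# Made-up example issue, for illustration only.
brief = TaskBrief(
    issue="Date parser rejects ISO 8601 timestamps with a trailing 'Z'",
    approach_hints=["Extend the existing parser rather than adding a new dependency"],
    constraints=["Do not change the public function signatures"],
    validation_steps=["Run the full test suite locally", "Confirm the CI pipeline passes"],
)
print(brief.render())
```

Keeping the three levels as separate fields makes it easy to see which kind of guidance a given task is missing before the agent starts work.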

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Agents could incorporate automatic priority assessment before generating PRs.
The rejection categories could guide development of better prompt engineering techniques for code agents.
Similar analyses on other datasets or agent types could test the generality of the four categories.
Developers might develop review tools that flag PRs matching common rejection patterns.

Load-bearing premise

The sample of 306 non-merged pull requests is representative of rejected AI-agent fixes and the qualitative categorization into 14 reasons accurately captures the failure modes without selection or interpretation bias.

What would settle it

A study of a different or larger sample of rejected AI-agent pull requests that identifies substantially different reasons or categories would challenge the findings.

Figures

Figures reproduced from arXiv: 2606.13468 by Ali Arabat, Mahmoud Abujadallah, Mohammed Sayagh.

Original abstract

AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources that require human reviews, verifications, and running tests and validations for fixes that are merely discarded. Our goal in this paper is to understand the failure modes of AI-agents, an understanding that is crucial for better integrating AI-agents as efficient teammates. In this paper, we conduct a qualitative study on a representative sample of 306 non-merged pull requests created or co-authored by the agents mentioned earlier, followed by a quantitative analysis of the reasons for rejection. Our qualitative findings identify 14 reasons divided into four high-level categories for rejecting AI-agent fixes. We observe that developers can reject fixes due to fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should not be taken, and (3) instructing the agent on how to validate the implementation through CI pipelines and without introducing a breaking change. Our results suggest the need for good prioritization of tasks so that generated fixes do not lead to wasted human review efforts or wasted agent resources (e.g., tokens, compute, or allowed number of requests).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper's qualitative taxonomy of 14 rejection reasons for AI agent PRs holds up because the sampling and coding processes are documented and reliable.


The main takeaway is that this paper gives a usable breakdown of why fixes from Copilot, Devin, Cursor, and Claude get rejected, drawn from the AIDev dataset, and the methods are stronger than the abstract alone suggested.

They started with 659 non-merged PRs, drew a stratified random sample of 306, and had two authors independently code a 20% overlap subset. Cohen's κ came in at 0.81 before disagreements were resolved, and they include the codebook with example quotes. That setup directly handles the usual concerns about selection and interpretation bias in qualitative work.
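For readers less familiar with the agreement statistic, Cohen's κ compares the observed agreement p_o between two coders with the agreement p_e expected by chance: κ = (p_o − p_e) / (1 − p_e). The sketch below computes it from two lists of labels. The function name `cohens_kappa` and the six labels are placeholders invented for illustration; they are not the paper's codebook or data.

```python
from collections import Counter


def cohens_kappa(coder_a, coder_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("label lists must be non-empty and equally long")
    n = len(coder_a)
    # Observed agreement: share of items both coders labeled identically.
    p_o = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    freq_a, freq_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_o - p_e) / (1 - p_e)


# Hypothetical labels for six PRs (illustrative only, not from the paper).
a = ["impl", "ci", "ci", "priority", "impl", "agent"]
b = ["impl", "ci", "impl", "priority", "impl", "agent"]
print(round(cohens_kappa(a, b), 3))
```

Values above roughly 0.8 are conventionally read as strong agreement, which is why the reported 0.81 is treated here as a meaningful safeguard.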

What is new is the concrete list of 14 reasons grouped into four categories: incorrect implementation, CI/test failures, agent unable to complete the task, and low priority. They also run a quantitative pass on the distribution and close with three concrete suggestions for better agent guidance.

The work is straightforward empirical SE. It stays grounded in the data and does not overclaim. The main limitation is that everything is tied to these four agents and this one dataset, so the categories could look different with other tools or projects. That is a normal scope issue rather than a flaw in the execution.

This is worth bringing to a reading group for anyone working on AI coding agents or developer workflows. A reader who wants an evidence-based look at current failure modes will get something concrete from it. The paper deserves a serious referee because the central claims rest on transparent, reproducible qualitative steps rather than hand-waving.

Referee Report

0 major / 3 minor

Summary. The manuscript reports an empirical study of the AIDev dataset, finding that 46.41% of PRs generated by AI coding agents (Copilot, Devin, Cursor, Claude) are rejected. It performs a qualitative analysis on a stratified random sample of 306 non-merged PRs, deriving a taxonomy of 14 rejection reasons grouped into four categories (incorrect implementation, CI/test failures, inability to implement, low priority), followed by quantitative analysis of those reasons and recommendations for guiding agents on approach hints, constraints, and CI validation.

Significance. If the taxonomy holds, the work supplies a concrete, empirically grounded catalog of failure modes that directly explains wasted review and compute resources in agentic code repair. The reported stratified sampling from the full 659 non-merged PRs, provision of a codebook with example quotes, and Cohen's κ = 0.81 on a 20% double-coded subset constitute standard methodological safeguards that increase confidence in the categorization; these strengths make the taxonomy a useful reference for future agent design and evaluation.
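As an illustration of what drawing a stratified random sample of 306 from 659 PRs can look like, the sketch below uses proportional allocation. The stratification variable is not stated on this page, so stratifying by agent is an assumption, and the per-agent counts are hypothetical placeholders chosen only so that they sum to 659.

```python
import random


def proportional_allocation(strata_sizes, sample_size):
    """Split sample_size across strata in proportion to their sizes,
    using largest-remainder rounding so quotas sum exactly to sample_size."""
    total = sum(strata_sizes.values())
    exact = {k: sample_size * v / total for k, v in strata_sizes.items()}
    quotas = {k: int(x) for k, x in exact.items()}  # floor of each exact share
    leftover = sample_size - sum(quotas.values())
    # Give the remaining units to the strata with the largest fractional parts.
    by_remainder = sorted(exact, key=lambda k: exact[k] - quotas[k], reverse=True)
    for k in by_remainder[:leftover]:
        quotas[k] += 1
    return quotas


def stratified_sample(items_by_stratum, sample_size, seed=0):
    """Draw a simple random sample without replacement inside each stratum."""
    rng = random.Random(seed)
    sizes = {k: len(v) for k, v in items_by_stratum.items()}
    quotas = proportional_allocation(sizes, sample_size)
    return {k: rng.sample(items_by_stratum[k], quotas[k]) for k in items_by_stratum}


# HYPOTHETICAL per-agent counts; only the totals 659 and 306 come from the text.
population = {
    "Copilot": [f"copilot-{i}" for i in range(300)],
    "Devin": [f"devin-{i}" for i in range(200)],
    "Cursor": [f"cursor-{i}" for i in range(100)],
    "Claude": [f"claude-{i}" for i in range(59)],
}
sample = stratified_sample(population, 306)
print({agent: len(prs) for agent, prs in sample.items()})
```

Proportional allocation keeps each stratum's share of the sample close to its share of the population; other allocation schemes are possible and the paper may have used a different one.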

minor comments (3)

[§4] §4 (Quantitative Analysis): the text states that the 14 reasons were quantified but does not report per-reason or per-category frequencies or percentages; adding a simple table or bar chart would allow readers to judge which categories dominate and would strengthen the claim that the taxonomy is actionable.
[Abstract] Abstract and §3 (Methodology): the claim of a 'representative sample' is supported by the stratified random sampling description, yet the abstract omits the population size (659) and stratification variable; a one-sentence addition would improve standalone readability without lengthening the abstract.
[§5] §5 (Discussion): the three concrete guidance recommendations (approach hints, constraints, CI validation) are well-motivated by the taxonomy but are presented at a high level; a short paragraph mapping each recommendation to the most frequent rejection categories would make the implications more precise.
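The per-category frequency table requested in the first minor comment could be produced along the following lines. The four category names come from the summary above; the coded rows and PR identifiers are hypothetical placeholders, and no counts from the paper are reproduced here.

```python
from collections import Counter

CATEGORIES = [
    "incorrect implementation",
    "CI/test failures",
    "agent unable to complete the task",
    "low priority",
]


def frequency_table(coded_prs):
    """Return (category, count, percent) rows, most frequent first.

    coded_prs: iterable of (pr_id, category) pairs from the coding step.
    """
    counts = Counter(category for _, category in coded_prs)
    unknown = set(counts) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    total = sum(counts.values())
    rows = [
        (c, counts[c], 100.0 * counts[c] / total if total else 0.0)
        for c in CATEGORIES
    ]
    # Stable sort: ties keep the order in which CATEGORIES lists them.
    return sorted(rows, key=lambda row: row[1], reverse=True)


# HYPOTHETICAL coded sample, for illustration only.
coded = [
    ("pr-1", "incorrect implementation"),
    ("pr-2", "CI/test failures"),
    ("pr-3", "incorrect implementation"),
    ("pr-4", "low priority"),
]
for category, count, percent in frequency_table(coded):
    print(f"{category:<36}{count:>4}{percent:>8.1f}%")
```

The same function can be applied to the 14 individual reasons by swapping the category list for the codebook's reason labels.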

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments were provided in the report, so we have no point-by-point responses. We remain available to address any minor suggestions during revision.

Circularity Check

0 steps flagged

No significant circularity identified


This paper is an empirical qualitative and quantitative study of rejection reasons for AI-agent PRs. It reports sampling from 659 non-merged PRs, stratified random selection of 306, a codebook with example quotes, and Cohen's κ=0.81 on a 20% overlap subset. No mathematical models, equations, fitted parameters, predictions, or derivations exist that could reduce to inputs by construction. Central claims rest on direct data analysis and inter-rater agreement rather than self-citation chains or ansatzes. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the qualitative analysis of the sample accurately captures rejection reasons, which is a domain assumption in empirical software engineering studies. No free parameters or invented entities are introduced.



Reference graph

Works this paper leans on

29 extracted references · 6 canonical work pages

[1]

Pull Request #10408

2025. Pull Request #10408. https://github.com/vercel/turborepo/pull/10408. Accessed on 2025-12-27

2025
[2]

Pull Request #11352

2025. Pull Request #11352. https://github.com/taiga-family/taiga-ui/pull/11352. Accessed on 2025-12-27

2025
[3]

Pull Request #1199

2025. Pull Request #1199. https://github.com/christianhelle/apiclientcodegen/pull/1199. Accessed on 2025-12-27

2025
[4]

Pull Request #12466

2025. Pull Request #12466. https://github.com/Azure/Azure-Sentinel/pull/12466. Accessed on 2025-12-27

2025
[5]

Pull Request #1353

2025. Pull Request #1353. https://github.com/neondatabase/autoscaling/pull/1353. Accessed on 2025-12-27

2025
[6]

Pull Request #149

2025. Pull Request #149. https://github.com/ruvnet/claude-flow/pull/149. Accessed on 2025-12-27. Understanding the Rejection of Fixes Generated by Agentic Pull Requests - Insights from the AIDev Dataset MSR ’26, April 13–14, 2026, Rio de Janeiro, Brazil

2025
[7]

Pull Request #1554

2025. Pull Request #1554. https://github.com/567-labs/instructor/pull/1554. Accessed on 2025-12-27

2025
[8]

Pull Request #219

2025. Pull Request #219. https://github.com/syncfusion/maui-toolkit/pull/219. Accessed on 2025-12-27

2025
[9]

Pull Request #305

2025. Pull Request #305. https://github.com/bespokelabsai/curator/pull/305. Accessed on 2025-12-27

2025
[10]

Pull Request #3113

2025. Pull Request #3113. https://github.com/crewAIInc/crewAI/pull/3113. Accessed on 2025-12-27

2025
[11]

Pull Request #4354

2025. Pull Request #4354. https://github.com/owncast/owncast/pull/4354. Accessed on 2025-12-27

2025
[12]

Pull Request #50357

2025. Pull Request #50357. https://github.com/Azure/azure-sdk-for-net/pull/50357. Accessed on 2025-12-27

2025
[13]

Pull Request #61902

2025. Pull Request #61902. https://github.com/microsoft/TypeScript/pull/61902. Accessed on 2025-12-27

2025
[14]

Pull Request #75

2025. Pull Request #75. https://github.com/rqlite/sql/pull/75. Accessed on 2025-12-27

2025
[15]

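Replication Package for: Understanding the Rejection of Fixes Generated by Agentic Pull Requests - Insights from the AIDev Dataset

2025. Replication Package for: Understanding the Rejection of Fixes Generated by Agentic Pull Requests - Insights from the AIDev Dataset. doi:10.6084/m9.figshare.30964363

work page doi:10.6084/m9.figshare.30964363 2025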
[16]

Agathe Balayn, Mireia Yurrita, Fanny Rancourt, Fabio Casati, and Ujwal Gadiraju
[17]

An Empirical Exploration of Trust Dynamics in LLM Supply Chains. arXiv preprint arXiv:2405.16310 (2024)

arXiv 2024
[18]

Di Chen, Kathryn T. Stolee, and Tim Menzies. 2019. Replication can improve prior results: A github study of pull request acceptance. In 2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC). IEEE, 179–190

2019
[19]

Adding custom instructions for GitHub Copilot

GitHub. 2024. Adding custom instructions for GitHub Copilot. https://docs.github.com/en/copilot/how-tos/configure-custom-instructions/add-repository-instructions GitHub Docs

2024
[20]

Dipin Khati. 2025. Trustworthiness of Large Language Models for Code. In Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engineering - Companion (ICSE Companion). IEEE, Lisbon, Portugal

2025
[21]

Sayedhassan Khatoonabadi, Diego Elias Costa, Rabe Abdalkareem, and Emad Shihab. 2023. On Wasted Contributions: Understanding the Dynamics of Contributor-Abandoned Pull Requests–A Mixed-Methods Study of 10 Large Open-Source Projects. ACM Trans. Softw. Eng. Methodol. 32, 1, Article 15 (Feb. 2023), 39 pages. doi:10.1145/3530785

work page doi:10.1145/3530785 2023
[22]

Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2025. The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering. arXiv:2507.15003 [cs.SE] https://arxiv.org/abs/2507.15003

Pith/arXiv arXiv 2025
[23]

Zhixing Li, Gang Yin, Yue Yu, Tao Wang, and Huaimin Wang. 2017. Detecting Duplicate Pull-requests in GitHub. In Proceedings of the 9th Asia-Pacific Symposium on Internetware (Shanghai, China) (Internetware ’17). Association for Computing Machinery, New York, NY, USA, Article 20, 6 pages. doi:10.1145/3131704.3131725

work page doi:10.1145/3131704.3131725 2017
[24]

Zhixing Li, Yue Yu, Tao Wang, Gang Yin, ShanShan Li, and Huaimin Wang
[25]

Are You Still Working on This? An Empirical Study on Pull Request Abandonment

Are You Still Working on This? An Empirical Study on Pull Request Abandonment. IEEE Transactions on Software Engineering 48, 6 (2022), 2173–2188. doi:10.1109/TSE.2021.3053403

work page doi:10.1109/tse.2021.3053403 2022
[26]

Jevgenija Pantiuchina, Bin Lin, Fiorella Zampetti, Massimiliano Di Penta, Michele Lanza, and Gabriele Bavota. 2021. Why do developers reject refactorings in open-source projects? ACM Transactions on Software Engineering and Methodology (TOSEM) 31, 2 (2021), 1–23

2021
[27]

Qingye Wang, Xin Xia, David Lo, and Shanping Li. 2019. Why is my code change abandoned? Information and Software Technology 110 (2019), 108–120. doi:10.1016/j.infsof.2019.02.007

work page doi:10.1016/j.infsof.2019.02.007 2019
[28]

Miku Watanabe, Hao Li, Yutaro Kashiwa, Brittany Reid, Hajimu Iida, and Ahmed E Hassan. 2025. On the use of agentic coding: An empirical study of pull requests on github. arXiv preprint arXiv:2509.14745 (2025)

arXiv 2025
[29]

Yue Yu, Zhixing Li, Gang Yin, Tao Wang, and Huaimin Wang. 2018. A dataset of duplicate pull-requests in github. In Proceedings of the 15th International Conference on Mining Software Repositories (Gothenburg, Sweden) (MSR ’18). Association for Computing Machinery, New York, NY, USA, 22–25. doi:10.1145/3196398.3196455 Received 30 December 2025; revised 1...

work page doi:10.1145/3196398 2018
