Understanding the Rejection of Fixes Generated by Agentic Pull Requests -- Insights from the AIDev Dataset
Pith reviewed 2026-06-27 05:59 UTC · model grok-4.3
The pith
Fourteen reasons drive rejections of AI-generated code fixes
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Our qualitative findings identify 14 reasons divided into four high-level categories for rejecting AI-agent fixes. We observe that developers can reject fixes due to fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should no
What carries the argument
Qualitative categorization of rejection reasons from 306 non-merged pull requests into fourteen reasons and four high-level categories.
If this is right
- Guiding agents with hints on the fix approach reduces implementation errors.
- Outlining constraints prevents unsuitable approaches.
- Instructing on CI validation and avoiding breaks improves acceptance rates.
- Prioritizing tasks avoids wasting resources on low-priority fixes.
- Addressing these modes is key to integrating AI agents as efficient teammates.
Where Pith is reading between the lines
- Agents could incorporate automatic priority assessment before generating PRs.
- The rejection categories could guide development of better prompt engineering techniques for code agents.
- Similar analyses on other datasets or agent types could test the generality of the four categories.
- Developers might develop review tools that flag PRs matching common rejection patterns.
Load-bearing premise
The sample of 306 non-merged pull requests is representative of rejected AI-agent fixes and the qualitative categorization into 14 reasons accurately captures the failure modes without selection or interpretation bias.
What would settle it
A study of a different or larger sample of rejected AI-agent pull requests that identifies substantially different reasons or categories would challenge the findings.
Figures
read the original abstract
AI coding agents are increasingly used to generate pull requests (PRs) that propose code fixes in software projects. From a first exploration of the AIDev dataset, we find that 46.41\% of the fixes proposed by the agents Copilot, Devin, Cursor, and Claude are rejected. This represents a significant amount of wasted resources that require human reviews, verifications, and running tests and validations for fixes that are merely discarded. Our goal in this paper is to understand the failure modes of AI-agents, an understanding that is crucial for better integrating AI-agents as efficient teammates. In this paper, we conduct a qualitative study on a representative sample of 306 non-merged pull requests created or co-authored by the agents mentioned earlier, followed by a quantitative analysis of the reasons for rejection. Our qualitative findings identify 14 reasons divided into four high-level categories for rejecting AI-agent fixes. We observe that developers can reject fixes due to fixes whose implementation is incorrect (e.g., incomplete, wrong approach), fixes that do not pass the continuous integration (CI) pipelines and fail tests, fixes for which the agent is unable to perform the implementation (e.g., no code generated, sessions lost), and fixes whose priority is low. Our results shed light on the importance of better guiding the model at these levels: (1) proposing hints about the approach to follow for fixing an issue, (2) outlining constraints or limitations regarding the approaches that should not be taken, and (3) instructing the agent on how to validate the implementation through CI pipelines and without introducing a breaking change. Our results suggest the need for good prioritization of tasks so that generated fixes do not lead to wasted human review efforts or wasted agent resources (e.g., tokens, compute, or allowed number of requests).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript reports an empirical study of the AIDev dataset, finding that 46.41% of PRs generated by AI coding agents (Copilot, Devin, Cursor, Claude) are rejected. It performs a qualitative analysis on a stratified random sample of 306 non-merged PRs, deriving a taxonomy of 14 rejection reasons grouped into four categories (incorrect implementation, CI/test failures, inability to implement, low priority), followed by quantitative analysis of those reasons and recommendations for guiding agents on approach hints, constraints, and CI validation.
Significance. If the taxonomy holds, the work supplies a concrete, empirically grounded catalog of failure modes that directly explains wasted review and compute resources in agentic code repair. The reported stratified sampling from the full 659 non-merged PRs, provision of a codebook with example quotes, and Cohen's κ = 0.81 on a 20% double-coded subset constitute standard methodological safeguards that increase confidence in the categorization; these strengths make the taxonomy a useful reference for future agent design and evaluation.
minor comments (3)
- [§4] §4 (Quantitative Analysis): the text states that the 14 reasons were quantified but does not report per-reason or per-category frequencies or percentages; adding a simple table or bar chart would allow readers to judge which categories dominate and would strengthen the claim that the taxonomy is actionable.
- [Abstract] Abstract and §3 (Methodology): the claim of a 'representative sample' is supported by the stratified random sampling description, yet the abstract omits the population size (659) and stratification variable; a one-sentence addition would improve standalone readability without lengthening the abstract.
- [§5] §5 (Discussion): the three concrete guidance recommendations (approach hints, constraints, CI validation) are well-motivated by the taxonomy but are presented at a high level; a short paragraph mapping each recommendation to the most frequent rejection categories would make the implications more precise.
Simulated Author's Rebuttal
We thank the referee for the positive summary, significance assessment, and recommendation of minor revision. No specific major comments were provided in the report, so we have no point-by-point responses. We remain available to address any minor suggestions during revision.
Circularity Check
No significant circularity identified
full rationale
This paper is an empirical qualitative and quantitative study of rejection reasons for AI-agent PRs. It reports sampling from 659 non-merged PRs, stratified random selection of 306, a codebook with example quotes, and Cohen's κ=0.81 on a 20% overlap subset. No mathematical models, equations, fitted parameters, predictions, or derivations exist that could reduce to inputs by construction. Central claims rest on direct data analysis and inter-rater agreement rather than self-citation chains or ansatzes. The derivation chain is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Pull Request #10408
2025. Pull Request #10408. https://github.com/vercel/turborepo/pull/10408. Ac- cessed on 2025-12-27
2025
-
[2]
Pull Request #11352
2025. Pull Request #11352. https://github.com/taiga-family/taiga-ui/pull/11352. Accessed on 2025-12-27
2025
-
[3]
Pull Request #1199
2025. Pull Request #1199. https://github.com/christianhelle/apiclientcodegen/ pull/1199. Accessed on 2025-12-27
2025
-
[4]
Pull Request #12466
2025. Pull Request #12466. https://github.com/Azure/Azure-Sentinel/pull/12466. Accessed on 2025-12-27
2025
-
[5]
Pull Request #1353
2025. Pull Request #1353. https://github.com/neondatabase/autoscaling/pull/1353. Accessed on 2025-12-27
2025
-
[6]
Pull Request #149
2025. Pull Request #149. https://github.com/ruvnet/claude-flow/pull/149. Ac- cessed on 2025-12-27. Understanding the Rejection of Fixes Generated by Agentic Pull Requests - Insights from the AIDev Dataset MSR ’26, April 13–14, 2026, Rio de Janeiro, Brazil
2025
-
[7]
Pull Request #1554
2025. Pull Request #1554. https://github.com/567-labs/instructor/pull/1554. Ac- cessed on 2025-12-27
2025
-
[8]
Pull Request #219
2025. Pull Request #219. https://github.com/syncfusion/maui-toolkit/pull/219. Accessed on 2025-12-27
2025
-
[9]
Pull Request #305
2025. Pull Request #305. https://github.com/bespokelabsai/curator/pull/305. Accessed on 2025-12-27
2025
-
[10]
Pull Request #3113
2025. Pull Request #3113. https://github.com/crewAIInc/crewAI/pull/3113. Ac- cessed on 2025-12-27
2025
-
[11]
Pull Request #4354
2025. Pull Request #4354. https://github.com/owncast/owncast/pull/4354. Ac- cessed on 2025-12-27
2025
-
[12]
Pull Request #50357
2025. Pull Request #50357. https://github.com/Azure/azure-sdk-for-net/pull/ 50357. Accessed on 2025-12-27
2025
-
[13]
Pull Request #61902
2025. Pull Request #61902. https://github.com/microsoft/TypeScript/pull/61902. Accessed on 2025-12-27
2025
-
[14]
Pull Request #75
2025. Pull Request #75. https://github.com/rqlite/sql/pull/75. Accessed on 2025-12-27
2025
-
[15]
2025. Replication Package for: Understanding the Rejection of Fixes Generated by Agentic Pull Requests - Insights from the AIDev Dataset. doi:10.6084/m9. figshare.30964363
work page doi:10.6084/m9 2025
-
[16]
Agathe Balayn, Mireia Yurrita, Fanny Rancourt, Fabio Casati, and Ujwal Gadiraju
-
[17]
An Empirical Exploration of Trust Dynamics in LLM Supply Chains.arXiv preprint arXiv:2405.16310(2024)
arXiv 2024
-
[18]
Di Chen, Kathyrn T Stolee, and Tim Menzies. 2019. Replication can improve prior results: A github study of pull request acceptance. In2019 IEEE/ACM 27th International Conference on Program Comprehension (ICPC). IEEE, 179–190
2019
-
[19]
2024.Adding custom instructions for GitHub Copilot
GitHub. 2024.Adding custom instructions for GitHub Copilot. https: //docs.github.com/en/copilot/how-tos/configure-custom-instructions/add- repository-instructions GitHub Docs
2024
-
[20]
Dipin Khati. 2025. Trustworthiness of Large Language Models for Code. In Proceedings of the 2025 IEEE/ACM 47th International Conference on Software Engi- neering - Companion (ICSE Companion). IEEE, Lisbon, Portugal
2025
-
[21]
Sayedhassan Khatoonabadi, Diego Elias Costa, Rabe Abdalkareem, and Emad Shi- hab. 2023. On Wasted Contributions: Understanding the Dynamics of Contributor- Abandoned Pull Requests–A Mixed-Methods Study of 10 Large Open-Source Projects.ACM Trans. Softw. Eng. Methodol.32, 1, Article 15 (Feb. 2023), 39 pages. doi:10.1145/3530785
-
[22]
Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2025. The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshap- ing Software Engineering. arXiv:2507.15003 [cs.SE] https://arxiv.org/abs/2507. 15003
Pith/arXiv arXiv 2025
-
[23]
Zhixing Li, Gang Yin, Yue Yu, Tao Wang, and Huaimin Wang. 2017. Detecting Du- plicate Pull-requests in GitHub. InProceedings of the 9th Asia-Pacific Symposium on Internetware(Shanghai, China)(Internetware ’17). Association for Computing Machinery, New York, NY, USA, Article 20, 6 pages. doi:10.1145/3131704.3131725
-
[24]
Zhixing Li, Yue Yu, Tao Wang, Gang Yin, ShanShan Li, and Huaimin Wang
-
[25]
Are You Still Working on This? An Empirical Study on Pull Request Abandonment.IEEE Transactions on Software Engineering48, 6 (2022), 2173–2188. doi:10.1109/TSE.2021.3053403
-
[26]
Jevgenija Pantiuchina, Bin Lin, Fiorella Zampetti, Massimiliano Di Penta, Michele Lanza, and Gabriele Bavota. 2021. Why do developers reject refactorings in open- source projects?ACM Transactions on Software Engineering and Methodology (TOSEM)31, 2 (2021), 1–23
2021
-
[27]
Qingye Wang, Xin Xia, David Lo, and Shanping Li. 2019. Why is my code change abandoned?Information and Software Technology110 (2019), 108–120. doi:10.1016/j.infsof.2019.02.007
-
[28]
Miku Watanabe, Hao Li, Yutaro Kashiwa, Brittany Reid, Hajimu Iida, and Ahmed E Hassan. 2025. On the use of agentic coding: An empirical study of pull requests on github.arXiv preprint arXiv:2509.14745(2025)
arXiv 2025
-
[29]
Yue Yu, Zhixing Li, Gang Yin, Tao Wang, and Huaimin Wang. 2018. A dataset of duplicate pull-requests in github. InProceedings of the 15th International Confer- ence on Mining Software Repositories(Gothenburg, Sweden)(MSR ’18). Associa- tion for Computing Machinery, New York, NY, USA, 22–25. doi:10.1145/3196398. 3196455 Received 30 December 2025; revised 1...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.