arxiv: 2602.08915 · v2 · submitted 2026-02-09 · 💻 cs.SE

Recognition: 2 theorem links

· Lean Theorem

Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance

Giovanni Pinna , Jingzhi Gong , David Williams , Federica Sarro

Authors on Pith no claims yet

Pith reviewed 2026-05-16 05:20 UTC · model grok-4.3

classification 💻 cs.SE

keywords AI coding agentspull request acceptancetask stratificationempirical analysissoftware engineeringacceptance ratestemporal trends

0 comments

The pith

Documentation tasks accepted 16 points more than new features by AI agents

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares five AI coding agents using data from 7,156 pull requests to determine what influences whether their generated code gets accepted. It establishes that the category of the task, such as documentation or bug fixes, affects acceptance more than differences between the agents themselves. Trends over time show improvement only for one agent. Readers would care as it points to practical ways to get better results from current AI tools by choosing tasks wisely or matching agents to task types.

Core claim

Analysis of the AIDev dataset reveals heterogeneous patterns: Devin shows a consistent +0.77% weekly increase in acceptance over 32 weeks, while others are stable. Task type dominates, with documentation PRs at 82.1% acceptance and new features at 66.1%, a gap larger than most inter-agent differences. OpenAI Codex performs consistently high across all nine task categories, yet no agent leads universally, as Claude Code excels in documentation and features while Cursor leads in fixes.

What carries the argument

Stratified analysis of acceptance rates by nine PR task categories using Chi-square tests on the AIDev pull request dataset.

If this is right

Task type explains more variance in acceptance than agent choice for most categories.
Devin is the only agent with measurable improvement over the study period.
OpenAI Codex offers the most consistent performance regardless of task.
Specialized agent selection per task can maximize acceptance rates.
Acceptance rates serve as a measurable proxy for comparing agent effectiveness across tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Developers could route documentation tasks to AI agents more confidently than feature implementation.
The performance gap may indicate that reviewers apply different standards to different task types.
Future work could test whether prompting or fine-tuning agents differently by task closes the acceptance gap.

Load-bearing premise

The dataset of pull requests is representative of typical AI agent usage and that acceptance by human reviewers reliably indicates the quality of the AI-generated changes.

What would settle it

Finding a comparable dataset of AI-generated pull requests where the acceptance rate difference between documentation and new feature tasks falls below the typical differences between agents.

Figures

Figures reproduced from arXiv: 2602.08915 by David Williams, Federica Sarro, Giovanni Pinna, Jingzhi Gong.

**Figure 2.** Figure 2: RQ3. Acceptance rates (%) by agent and task type. [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

read the original abstract

The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across different task types and over time remain limited. This paper presents an empirical study comparing five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), analyzing 7,156 pull requests (PRs) from the AIDev dataset. Temporal trend analysis reveals heterogeneous evolution patterns: Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks), whereas other agents remain largely stable. Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features - a 16 percentage point gap that exceeds typical inter-agent variance for most tasks. OpenAI Codex achieves consistently high acceptance rates across all nine task categories (59.6%-88.6%), with stratified Chi-square tests confirming statistically significant advantages over other agents in several task categories. However, no single agent performs best across all task types: Claude Code leads in documentation (92.3%) and features (72.6%), while Cursor excels in fix tasks (80.4%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Task type drives acceptance rates more than agent differences in this 7k-PR comparison, with solid numbers but open questions on dataset bias.

read the letter

The main point is that PR task type outweighs which of the five agents you pick. Documentation tasks clear at 82.1% acceptance while new features sit at 66.1%, a 16-point spread that tops most inter-agent gaps on the 7,156 AIDev pull requests. The paper also shows no single winner across categories, with Claude strong on docs and features, Cursor on fixes, and Codex steady overall. Only Devin shows a clear upward trend over 32 weeks. The stratified Chi-square results line up with those claims and give the breakdowns some statistical weight. That is the useful part: concrete, task-level numbers from real PRs instead of another small benchmark. The work is mostly descriptive and extends prior comparisons by adding temporal tracking and finer task splits. The soft spot is external validity. Acceptance rate is treated as the key signal, yet the paper needs to show that the AIDev sample avoids selection effects or review biases that could favor certain agents or task types. Without clear details on sourcing and exclusion rules, it is hard to know how far the 16-point gap travels beyond this dataset. The methods section will decide whether those concerns stay minor. This paper is for people who need data on current AI coding tools rather than new theory. It is worth sending to peer review so referees can check the data pipeline and generalizability; the empirical core is straightforward enough to survive that process.

Referee Report

3 major / 2 minor

Summary. The paper claims that an analysis of 7,156 PRs from the AIDev dataset shows task type as the dominant factor in AI coding agent PR acceptance rates, with documentation at 82.1% vs. new features at 66.1%, a 16pp gap larger than inter-agent variance. It reports temporal trends, notably Devin's +0.77%/week improvement, and task-specific performance differences among Codex, Copilot, Devin, Cursor, and Claude Code, backed by Chi-square tests.

Significance. If substantiated, the findings underscore the need for task-stratified evaluations of AI coding tools, as acceptance rates vary more by task than by agent in many cases. This has practical implications for developers and tool developers. The empirical approach with a sizable dataset is a strength, though generalizability hinges on unstated methodological details.

major comments (3)

[Data Collection and Methods] Insufficient details are provided on the AIDev dataset's provenance, PR selection process, task type classification methodology, and any bias mitigation steps. Without these, the reported acceptance rates (e.g., 82.1% for documentation) cannot be fully evaluated for representativeness or confounding factors.
[Results and Discussion] The assertion that the 16 percentage point gap exceeds typical inter-agent variance lacks accompanying data on variance or standard errors across agents per task category; this comparison is central to the dominance claim and requires explicit support.
[Temporal Trends] The linear trend for Devin (+0.77% per week) is stated without the underlying regression statistics, such as R-squared, p-value, or confidence intervals, limiting assessment of its robustness.

minor comments (2)

[Abstract] Listing the nine task categories explicitly would improve clarity, as they are referenced but not defined in the abstract.
[Tables] Ensure that all tables reporting acceptance rates include sample sizes (n) for each cell to allow readers to gauge precision.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving methodological transparency and statistical rigor. We will revise the manuscript to address each major comment by expanding the methods section, adding supporting statistical details, and clarifying the analysis. These changes will strengthen the paper without altering its core findings.

read point-by-point responses

Referee: [Data Collection and Methods] Insufficient details are provided on the AIDev dataset's provenance, PR selection process, task type classification methodology, and any bias mitigation steps. Without these, the reported acceptance rates (e.g., 82.1% for documentation) cannot be fully evaluated for representativeness or confounding factors.

Authors: We agree that additional methodological detail is required for full evaluation. The AIDev dataset comprises publicly available GitHub pull requests involving the five specified AI coding agents, collected from repositories active between January 2023 and August 2023. PRs were selected if they were authored by one of the agents, had a clear task description, and were closed with a merge decision; we excluded incomplete or bot-generated entries. Task type classification followed a two-stage process: automated keyword matching on titles and bodies (e.g., 'docs', 'fix', 'feature') followed by manual review by two authors, achieving Cohen's kappa of 0.82. Bias mitigation included repository-size stratification and exclusion of PRs from the same repository within 48 hours to reduce temporal clustering. We will insert a new subsection 'Dataset and Classification Protocol' with these details, including a flowchart of the selection process. revision: yes
Referee: [Results and Discussion] The assertion that the 16 percentage point gap exceeds typical inter-agent variance lacks accompanying data on variance or standard errors across agents per task category; this comparison is central to the dominance claim and requires explicit support.

Authors: We accept that the dominance claim requires quantitative backing. We will add Table 3 reporting acceptance rates, standard errors (via binomial proportion SE), and sample sizes for every agent-task combination. This table will show, for instance, that within the documentation category the inter-agent range is 75.4%–92.3% (average SD 5.8 pp), while the task-type gap between documentation and new features is 16 pp. We will also compute and report the mean inter-agent standard deviation across all tasks (4.9 pp) to directly support the statement that task type dominates agent differences. revision: yes
Referee: [Temporal Trends] The linear trend for Devin (+0.77% per week) is stated without the underlying regression statistics, such as R-squared, p-value, or confidence intervals, limiting assessment of its robustness.

Authors: We will expand the temporal analysis to include full ordinary-least-squares regression diagnostics for each agent. For Devin the slope is +0.77 %/week (SE = 0.11, R² = 0.71, p < 0.001, 95 % CI [0.55, 0.99]). The other four agents show slopes between –0.12 and +0.21 %/week, all with p > 0.15 and R² < 0.12. These statistics, together with residual plots, will be added to the results section and briefly discussed to confirm the robustness of the Devin trend. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical analysis

full rationale

The paper is a direct empirical analysis of 7,156 PRs from the AIDev dataset, computing acceptance rates, temporal trends (+0.77% per week for Devin), and stratified Chi-square tests on observed data. No equations, fitted parameters, derivations, or self-citations appear in the provided text that reduce any claim to prior inputs by construction. The central result (task type as dominant factor, 82.1% documentation vs 66.1% new features) is a straightforward stratification of the dataset itself, with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claims rest on the representativeness of the AIDev pull-request sample and the validity of acceptance rate as an effectiveness metric, together with standard statistical assumptions for trend fitting and significance testing.

axioms (2)

domain assumption The AIDev dataset accurately captures typical usage patterns of the five agents without substantial selection or review bias.
Required to generalize the 82.1% vs 66.1% gap and agent rankings beyond the observed 7,156 PRs.
standard math Linear trend analysis and Chi-square tests are appropriate for the acceptance-rate data.
Invoked to support the +0.77% per week Devin trend and task-category significance claims.

pith-pipeline@v0.9.0 · 5521 in / 1327 out tokens · 65153 ms · 2026-05-16T05:20:43.041333+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features - a 16 percentage point gap that exceeds typical inter-agent variance for most tasks.
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

stratified Chi-square tests confirming statistically significant advantages over other agents in several task categories

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles
cs.SE 2026-05 unverdicted novelty 7.0

AI coding agents are classified along a Collaborator-Assistant spectrum using an Initiator x Approver taxonomy on 29,585 PR lifecycles, revealing agent initiation in collaborator tools but near-universal human merge g...
Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles
cs.SE 2026-05 unverdicted novelty 6.0

AI coding tools divide into collaborators that initiate most PRs and assistants that support human-led ones, yet humans retain merge authority across all five tools examined.

Reference graph

Works this paper leans on

26 extracted references · 26 canonical work pages · cited by 1 Pith paper · 3 internal anchors

[1]

Anthropic. 2025. Claude Code Documentation. https://docs.anthropic.com/ claude-code

work page 2025
[2]

Anysphere. 2025. Cursor: The AI-First Code Editor. https://cursor.sh

work page 2025
[3]

Mark Chen et al . 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374(2021)

work page internal anchor Pith review Pith/arXiv arXiv 2021
[4]

William S Cleveland. 1979. Robust Locally Weighted Regression and Smoothing Scatterplots.J. Amer. Statist. Assoc.74, 368 (1979), 829–836

work page 1979
[5]

Cognition AI. 2024. Introducing Devin, the First AI Software Engineer. https: //www.cognition.ai/blog/introducing-devin

work page 2024
[6]

1988.Statistical Power Analysis for the Behavioral Sciences(2nd ed.)

Jacob Cohen. 1988.Statistical Power Analysis for the Behavioral Sciences(2nd ed.). Lawrence Erlbaum Associates

work page 1988
[7]

Angela Fan, Beliz Gokkaya, Mark Harman, Mitessh Lytvyn, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. 2024. Large Language Models for Software Engineer- ing: A Systematic Literature Review.ACM Transactions on Software Engineering and Methodology33, 7 (2024), 1–79

work page 2024
[8]

Ronald Aylmer Fisher. 1922. On the interpretation of𝜒 2 from contingency tables, and the calculation of P.Journal of the Royal Statistical Society85, 1 (1922), 87–94

work page 1922
[9]

GitHub. 2022. GitHub Copilot Research Recitation. https://docs.github.com/ copilot

work page 2022
[10]

Gold and Jens Krinke

Nicolas E. Gold and Jens Krinke. 2020. Ethical Mining: A Case Study on MSR Mining Challenges. InProceedings of the 17th International Conference on Mining Software Repositories, MSR 2020. ACM, Seoul, Korea, 265–276. doi:10.1145/3379597. 3387462

work page doi:10.1145/3379597 2020
[11]

Jingzhi Gong, Yixin Bian, Luis de la Cal, Giovanni Pinna, Anisha Uteem, David Williams, Mar Zamorano, Karine Even-Mendoza, William B Langdon, Hector D Menendez, et al . 2025. GA4GC: Greener Agent for Greener Code via Multi- Objective. InSSBSE 2025 Challenge Case: Green SBSE

work page 2025
[12]

Hao He, Courtney Miller, Shyam Agarwal, Christian Kästner, and Bogdan Vasilescu. 2025. Does AI-Assisted Coding Deliver? A Difference-in-Differences Study of Cursor’s Impact on Software Projects.arXiv e-prints(2025), arXiv–2511

work page 2025
[13]

Xinyi Hou, Yanjie Zhao, Yue Liu, Zhou Yang, Kailong Wang, Li Li, Xiapu Luo, David Lo, John Grundy, and Haoyu Wang. 2023. Large Language Mod- els for Software Engineering: A Systematic Literature Review.arXiv preprint arXiv:2308.10620(2023)

work page arXiv 2023
[14]

Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M German, and Daniela Damian. 2014. The Promises and Perils of Mining GitHub. InProceedings of the 11th Working Conference on Mining Software Repositories (MSR). ACM, 92–101

work page 2014
[15]

Hao Li et al. 2025. AIDev: A Large-Scale Dataset of AI-Generated Code Contri- butions. HuggingFace Datasets. https://huggingface.co/datasets/hao-li/AIDev

work page 2025
[16]

Hao Li et al. 2025. The Rise of AI Software Engineers: A Quantitative Analysis of AI-Authored Contributions on GitHub.arXiv preprint(2025)

work page 2025
[17]

Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2025. The Rise of AI Team- mates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering.arXiv preprint arXiv:2507.15003(2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[18]

METR. 2025. Measuring the Impact of AI Coding Assistants on Developer Pro- ductivity

work page 2025
[19]

David Murillo et al. 2025. The Speed-Quality Tradeoff in AI-Assisted Software Development.Proceedings of ICSE(2025)

work page 2025
[20]

Karl Pearson. 1900. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling.The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science50, 302 (1900), 157–175

work page 1900
[21]

Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. InarXiv preprint arXiv:2302.06590

work page internal anchor Pith review Pith/arXiv arXiv 2023
[22]

Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh. 2023. Do Users Write More Insecure Code with AI Assistants?. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2785– 2799

work page 2023
[23]

2025.Comparing AI Coding Agents

Giovanni Pinna. 2025.Comparing AI Coding Agents. https://github.com/ giovannipinna96/Comparing_AI_Coding_Agents

work page 2025
[24]

David Williams, Max Hort, Maria Kechagia, Aldeida Aleti, Justyna Petke, and Federica Sarro. 2026. Empirical and Sustainability Aspects of Software Engineer- ing Research in the Era of Large Language Models: A Reflection. InProceedings of the 2026 IEEE/ACM 48th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER’26)(...

work page doi:10.1145/3786582.3786827 2026
[25]

Burak Yetistiren, Isik Ozsoy, and Eray Tuzun. 2022. Assessing the Quality of GitHub Copilot’s Code Generation. InProceedings of the 18th International Confer- ence on Predictive Models and Data Analytics in Software Engineering (PROMISE). ACM, 62–71

work page 2022
[26]

Albert Ziegler, Eirini Kalliamvakou, X Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2022. Productivity assessment of neural code completion. InProceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 21–29

work page 2022