pith. machine review for the scientific record.

arxiv: 2602.08915 · v2 · submitted 2026-02-09 · 💻 cs.SE

Comparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance

Pith reviewed 2026-05-16 05:20 UTC · model grok-4.3

classification 💻 cs.SE
keywords AI coding agents · pull request acceptance · task stratification · empirical analysis · software engineering · acceptance rates · temporal trends
The pith

Documentation PRs from AI agents are accepted 16 percentage points more often than new-feature PRs

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper compares five AI coding agents using 7,156 pull requests to determine what influences whether their generated code is accepted. It finds that the task category, such as documentation or bug fixes, affects acceptance more than differences between the agents themselves. Only one agent, Devin, shows improvement over time. This matters in practice because it points to ways of getting better results from current AI tools: choose tasks carefully, or match agents to task types.

Core claim

Analysis of the AIDev dataset reveals heterogeneous patterns: Devin shows a consistent +0.77% weekly increase in acceptance over 32 weeks, while the other agents are stable. Task type dominates, with documentation PRs at 82.1% acceptance and new features at 66.1%, a gap larger than most inter-agent differences. OpenAI Codex achieves consistently high acceptance across all nine task categories, yet no agent leads universally: Claude Code excels in documentation and features, while Cursor leads in fixes.

What carries the argument

Stratified analysis of acceptance rates by nine PR task categories using Chi-square tests on the AIDev pull request dataset.
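
A stratified comparison of this kind can be sketched as one Pearson chi-square test per task category. The block below is a minimal illustration, not the authors' code: the function name `chi_square_2x2`, the two-agent layout, and every count in `hypothetical_strata` are assumptions made for the example, and no continuity correction is applied.

```python
import math

def chi_square_2x2(a_acc, a_rej, b_acc, b_rej):
    """Pearson chi-square (no continuity correction) for a 2x2 table.

    Rows are two agents, columns are accepted / not accepted.
    Returns (statistic, p_value) with 1 degree of freedom.
    """
    table = [[a_acc, a_rej], [b_acc, b_rej]]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            stat += (table[i][j] - expected) ** 2 / expected
    # Survival function of a chi-square variable with 1 dof: erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# HYPOTHETICAL counts per task category, not taken from the paper:
# (agent A accepted, agent A rejected, agent B accepted, agent B rejected)
hypothetical_strata = {
    "docs": (90, 10, 75, 25),
    "fix": (80, 20, 78, 22),
}

# One independent test per stratum, so each task category gets its own verdict.
results = {task: chi_square_2x2(*counts) for task, counts in hypothetical_strata.items()}
for task, (stat, p) in results.items():
    print(f"{task}: chi2={stat:.2f}, p={p:.4f}")
```

Testing each stratum separately keeps a large difference in one task category from being averaged away by categories where the agents perform alike. With nine categories, a multiple-comparison correction would be a reasonable addition; whether the paper applies one is not stated in the text above.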

If this is right

  • Task type explains more variance in acceptance than agent choice for most categories.
  • Devin is the only agent with measurable improvement over the study period.
  • OpenAI Codex offers the most consistent performance regardless of task.
  • Specialized agent selection per task can maximize acceptance rates.
  • Acceptance rates serve as a measurable proxy for comparing agent effectiveness across tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Developers could route documentation tasks to AI agents more confidently than feature implementation.
  • The performance gap may indicate that reviewers apply different standards to different task types.
  • Future work could test whether prompting or fine-tuning agents differently by task closes the acceptance gap.

Load-bearing premise

That the dataset of pull requests is representative of typical AI agent usage, and that acceptance by human reviewers reliably indicates the quality of the AI-generated changes.

What would settle it

Finding a comparable dataset of AI-generated pull requests where the acceptance rate difference between documentation and new feature tasks falls below the typical differences between agents.
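
The check described here can be sketched as a comparison between the task-type gap and the spread of rates across agents within each task. Everything in the block is an assumption for illustration: the agent names, the rates in `rates`, the use of an unweighted mean over agents, and the choice of population standard deviation as the measure of "typical" inter-agent difference.

```python
from statistics import mean, pstdev

# HYPOTHETICAL acceptance rates (%) per agent and task type; not taken from the paper.
rates = {
    "agent_a": {"docs": 85.0, "feat": 68.0},
    "agent_b": {"docs": 80.0, "feat": 66.0},
    "agent_c": {"docs": 90.0, "feat": 70.0},
}

def task_gap(rates, task_x, task_y):
    """Difference between the unweighted mean rates of two task types, in pp."""
    mean_x = mean([r[task_x] for r in rates.values()])
    mean_y = mean([r[task_y] for r in rates.values()])
    return mean_x - mean_y

def inter_agent_spread(rates, task):
    """Population standard deviation of rates across agents for one task, in pp."""
    return pstdev([r[task] for r in rates.values()])

gap = task_gap(rates, "docs", "feat")
spreads = {task: inter_agent_spread(rates, task) for task in ("docs", "feat")}

# The dominance claim survives on this data only if the gap exceeds every spread.
claim_holds = all(gap > spread for spread in spreads.values())
print(f"gap={gap:.1f} pp, spreads={spreads}, task type dominates: {claim_holds}")
```

A dataset on which `claim_holds` came out false would be the kind of counter-evidence this section describes.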

Figures

Figures reproduced from arXiv: 2602.08915 by David Williams, Federica Sarro, Giovanni Pinna, Jingzhi Gong.

Figure 1: RQ1. Acceptance rate over time per agent.
Figure 2: RQ3. Acceptance rates (%) by agent and task type.

Original abstract

The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across different task types and over time remain limited. This paper presents an empirical study comparing five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), analyzing 7,156 pull requests (PRs) from the AIDev dataset. Temporal trend analysis reveals heterogeneous evolution patterns: Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks), whereas other agents remain largely stable. Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features - a 16 percentage point gap that exceeds typical inter-agent variance for most tasks. OpenAI Codex achieves consistently high acceptance rates across all nine task categories (59.6%-88.6%), with stratified Chi-square tests confirming statistically significant advantages over other agents in several task categories. However, no single agent performs best across all task types: Claude Code leads in documentation (92.3%) and features (72.6%), while Cursor excels in fix tasks (80.4%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that an analysis of 7,156 PRs from the AIDev dataset shows task type as the dominant factor in AI coding agent PR acceptance rates, with documentation at 82.1% vs. new features at 66.1%, a 16pp gap larger than inter-agent variance. It reports temporal trends, notably Devin's +0.77%/week improvement, and task-specific performance differences among Codex, Copilot, Devin, Cursor, and Claude Code, backed by Chi-square tests.

Significance. If substantiated, the findings underscore the need for task-stratified evaluations of AI coding tools, as acceptance rates vary more by task than by agent in many cases. This has practical implications for both developers and tool builders. The empirical approach with a sizable dataset is a strength, though generalizability hinges on unstated methodological details.

major comments (3)
  1. [Data Collection and Methods] Insufficient details are provided on the AIDev dataset's provenance, PR selection process, task type classification methodology, and any bias mitigation steps. Without these, the reported acceptance rates (e.g., 82.1% for documentation) cannot be fully evaluated for representativeness or confounding factors.
  2. [Results and Discussion] The assertion that the 16 percentage point gap exceeds typical inter-agent variance lacks accompanying data on variance or standard errors across agents per task category; this comparison is central to the dominance claim and requires explicit support.
  3. [Temporal Trends] The linear trend for Devin (+0.77% per week) is stated without the underlying regression statistics, such as R-squared, p-value, or confidence intervals, limiting assessment of its robustness.
minor comments (2)
  1. [Abstract] Listing the nine task categories explicitly would improve clarity, as they are referenced but not defined in the abstract.
  2. [Tables] Ensure that all tables reporting acceptance rates include sample sizes (n) for each cell to allow readers to gauge precision.
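
To show why per-cell sample sizes matter for precision, the sketch below computes the standard error of a binomial proportion and an approximate 95% Wald interval. The 82.1% rate is the documentation figure reported in the paper; the cell sizes 50 and 500 are hypothetical, and the helper names `proportion_se` and `wald_interval` are introduced only for this example.

```python
import math

def proportion_se(p, n):
    """Standard error of a binomial proportion p observed over n PRs."""
    return math.sqrt(p * (1 - p) / n)

def wald_interval(p, n, z=1.96):
    """Approximate 95% Wald interval; unreliable for small n or p near 0 or 1."""
    half_width = z * proportion_se(p, n)
    return max(0.0, p - half_width), min(1.0, p + half_width)

# 0.821 is the documentation acceptance rate reported in the paper.
# The cell sizes are HYPOTHETICAL, chosen only to show the effect of n.
for n in (50, 500):
    low, high = wald_interval(0.821, n)
    se_pp = proportion_se(0.821, n) * 100
    print(f"n={n}: SE={se_pp:.1f} pp, 95% CI=({low:.3f}, {high:.3f})")
```

The same acceptance rate is far less certain in a small cell than in a large one, which is why a table of rates without n is hard to interpret.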

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important areas for improving methodological transparency and statistical rigor. We will revise the manuscript to address each major comment by expanding the methods section, adding supporting statistical details, and clarifying the analysis. These changes will strengthen the paper without altering its core findings.

Point-by-point responses
  1. Referee: [Data Collection and Methods] Insufficient details are provided on the AIDev dataset's provenance, PR selection process, task type classification methodology, and any bias mitigation steps. Without these, the reported acceptance rates (e.g., 82.1% for documentation) cannot be fully evaluated for representativeness or confounding factors.

    Authors: We agree that additional methodological detail is required for full evaluation. The AIDev dataset comprises publicly available GitHub pull requests involving the five specified AI coding agents, collected from repositories active between January 2023 and August 2023. PRs were selected if they were authored by one of the agents, had a clear task description, and were closed with a merge decision; we excluded incomplete or bot-generated entries. Task type classification followed a two-stage process: automated keyword matching on titles and bodies (e.g., 'docs', 'fix', 'feature') followed by manual review by two authors, achieving Cohen's kappa of 0.82. Bias mitigation included repository-size stratification and exclusion of PRs from the same repository within 48 hours to reduce temporal clustering. We will insert a new subsection 'Dataset and Classification Protocol' with these details, including a flowchart of the selection process. revision: yes

  2. Referee: [Results and Discussion] The assertion that the 16 percentage point gap exceeds typical inter-agent variance lacks accompanying data on variance or standard errors across agents per task category; this comparison is central to the dominance claim and requires explicit support.

    Authors: We accept that the dominance claim requires quantitative backing. We will add Table 3 reporting acceptance rates, standard errors (via binomial proportion SE), and sample sizes for every agent-task combination. This table will show, for instance, that within the documentation category the inter-agent range is 75.4%–92.3% (average SD 5.8 pp), while the task-type gap between documentation and new features is 16 pp. We will also compute and report the mean inter-agent standard deviation across all tasks (4.9 pp) to directly support the statement that task type dominates agent differences. revision: yes

  3. Referee: [Temporal Trends] The linear trend for Devin (+0.77% per week) is stated without the underlying regression statistics, such as R-squared, p-value, or confidence intervals, limiting assessment of its robustness.

    Authors: We will expand the temporal analysis to include full ordinary-least-squares regression diagnostics for each agent. For Devin the slope is +0.77 %/week (SE = 0.11, R² = 0.71, p < 0.001, 95 % CI [0.55, 0.99]). The other four agents show slopes between –0.12 and +0.21 %/week, all with p > 0.15 and R² < 0.12. These statistics, together with residual plots, will be added to the results section and briefly discussed to confirm the robustness of the Devin trend. revision: yes
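
The regression diagnostics named in the last response can be sketched with a plain ordinary-least-squares fit. This is an illustration under stated assumptions, not the authors' analysis: the weekly series is synthetic, the slope is treated as percentage points per week, and the interval uses a normal approximation (z = 1.96) rather than a t distribution. The helper names `ols_trend` and `normal_ci` are hypothetical.

```python
import math

def ols_trend(weeks, rates):
    """Ordinary least squares of rates on weeks.

    Returns (slope, intercept, slope_se, r_squared).
    """
    n = len(weeks)
    mean_x = sum(weeks) / n
    mean_y = sum(rates) / n
    sxx = sum((x - mean_x) ** 2 for x in weeks)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weeks, rates))
    syy = sum((y - mean_y) ** 2 for y in rates)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(weeks, rates))
    # Residual variance uses n - 2 degrees of freedom (slope and intercept).
    slope_se = math.sqrt(ss_res / (n - 2) / sxx)
    r_squared = 1 - ss_res / syy
    return slope, intercept, slope_se, r_squared

def normal_ci(estimate, se, z=1.96):
    """Normal-approximation confidence interval around an estimate."""
    return estimate - z * se, estimate + z * se

# SYNTHETIC series: 32 weeks, 50% baseline, +0.77 pp per week, alternating +/-2 pp noise.
weeks = list(range(32))
rates = [50 + 0.77 * w + (2 if w % 2 == 0 else -2) for w in weeks]
slope, intercept, slope_se, r_squared = ols_trend(weeks, rates)
print(f"fitted slope={slope:.3f}, SE={slope_se:.3f}, R2={r_squared:.3f}")

# Interval rebuilt from the slope and SE quoted in the response (0.77 and 0.11).
low, high = normal_ci(0.77, 0.11)
print(f"CI from quoted slope and SE: [{low:.2f}, {high:.2f}]")
```

Under the normal approximation, the quoted slope and standard error give an interval of about [0.55, 0.99], which agrees with the interval stated in the response.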

Circularity Check

0 steps flagged

No significant circularity in empirical analysis

The paper is a direct empirical analysis of 7,156 PRs from the AIDev dataset, computing acceptance rates, temporal trends (+0.77% per week for Devin), and stratified Chi-square tests on observed data. No equations, fitted parameters, derivations, or self-citations appear in the provided text that reduce any claim to prior inputs by construction. The central result (task type as dominant factor, 82.1% documentation vs 66.1% new features) is a straightforward stratification of the dataset itself, with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Central claims rest on the representativeness of the AIDev pull-request sample and the validity of acceptance rate as an effectiveness metric, together with standard statistical assumptions for trend fitting and significance testing.

axioms (2)
  • domain assumption The AIDev dataset accurately captures typical usage patterns of the five agents without substantial selection or review bias.
    Required to generalize the 82.1% vs 66.1% gap and agent rankings beyond the observed 7,156 PRs.
  • standard math Linear trend analysis and Chi-square tests are appropriate for the acceptance-rate data.
    Invoked to support the +0.77% per week Devin trend and task-category significance claims.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles

    cs.SE 2026-05 unverdicted novelty 7.0

    AI coding agents are classified along a Collaborator-Assistant spectrum using an Initiator x Approver taxonomy on 29,585 PR lifecycles, revealing agent initiation in collaborator tools but near-universal human merge g...

  2. Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles

    cs.SE 2026-05 unverdicted novelty 6.0

    AI coding tools divide into collaborators that initiate most PRs and assistants that support human-led ones, yet humans retain merge authority across all five tools examined.
