Recognition: 2 theorem links
· Lean TheoremComparing AI Coding Agents: A Task-Stratified Analysis of Pull Request Acceptance
Pith reviewed 2026-05-16 05:20 UTC · model grok-4.3
The pith
Documentation tasks accepted 16 points more than new features by AI agents
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Analysis of the AIDev dataset reveals heterogeneous patterns: Devin shows a consistent +0.77% weekly increase in acceptance over 32 weeks, while others are stable. Task type dominates, with documentation PRs at 82.1% acceptance and new features at 66.1%, a gap larger than most inter-agent differences. OpenAI Codex performs consistently high across all nine task categories, yet no agent leads universally, as Claude Code excels in documentation and features while Cursor leads in fixes.
What carries the argument
Stratified analysis of acceptance rates by nine PR task categories using Chi-square tests on the AIDev pull request dataset.
If this is right
- Task type explains more variance in acceptance than agent choice for most categories.
- Devin is the only agent with measurable improvement over the study period.
- OpenAI Codex offers the most consistent performance regardless of task.
- Specialized agent selection per task can maximize acceptance rates.
- Acceptance rates serve as a measurable proxy for comparing agent effectiveness across tasks.
Where Pith is reading between the lines
- Developers could route documentation tasks to AI agents more confidently than feature implementation.
- The performance gap may indicate that reviewers apply different standards to different task types.
- Future work could test whether prompting or fine-tuning agents differently by task closes the acceptance gap.
Load-bearing premise
The dataset of pull requests is representative of typical AI agent usage and that acceptance by human reviewers reliably indicates the quality of the AI-generated changes.
What would settle it
Finding a comparable dataset of AI-generated pull requests where the acceptance rate difference between documentation and new feature tasks falls below the typical differences between agents.
Figures
read the original abstract
The rapid adoption of AI-powered coding assistants is transforming software development practices, yet systematic comparisons of their effectiveness across different task types and over time remain limited. This paper presents an empirical study comparing five popular agents (OpenAI Codex, GitHub Copilot, Devin, Cursor, and Claude Code), analyzing 7,156 pull requests (PRs) from the AIDev dataset. Temporal trend analysis reveals heterogeneous evolution patterns: Devin exhibits the only consistent positive trend in acceptance rate (+0.77% per week over 32 weeks), whereas other agents remain largely stable. Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features - a 16 percentage point gap that exceeds typical inter-agent variance for most tasks. OpenAI Codex achieves consistently high acceptance rates across all nine task categories (59.6%-88.6%), with stratified Chi-square tests confirming statistically significant advantages over other agents in several task categories. However, no single agent performs best across all task types: Claude Code leads in documentation (92.3%) and features (72.6%), while Cursor excels in fix tasks (80.4%).
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that an analysis of 7,156 PRs from the AIDev dataset shows task type as the dominant factor in AI coding agent PR acceptance rates, with documentation at 82.1% vs. new features at 66.1%, a 16pp gap larger than inter-agent variance. It reports temporal trends, notably Devin's +0.77%/week improvement, and task-specific performance differences among Codex, Copilot, Devin, Cursor, and Claude Code, backed by Chi-square tests.
Significance. If substantiated, the findings underscore the need for task-stratified evaluations of AI coding tools, as acceptance rates vary more by task than by agent in many cases. This has practical implications for developers and tool developers. The empirical approach with a sizable dataset is a strength, though generalizability hinges on unstated methodological details.
major comments (3)
- [Data Collection and Methods] Insufficient details are provided on the AIDev dataset's provenance, PR selection process, task type classification methodology, and any bias mitigation steps. Without these, the reported acceptance rates (e.g., 82.1% for documentation) cannot be fully evaluated for representativeness or confounding factors.
- [Results and Discussion] The assertion that the 16 percentage point gap exceeds typical inter-agent variance lacks accompanying data on variance or standard errors across agents per task category; this comparison is central to the dominance claim and requires explicit support.
- [Temporal Trends] The linear trend for Devin (+0.77% per week) is stated without the underlying regression statistics, such as R-squared, p-value, or confidence intervals, limiting assessment of its robustness.
minor comments (2)
- [Abstract] Listing the nine task categories explicitly would improve clarity, as they are referenced but not defined in the abstract.
- [Tables] Ensure that all tables reporting acceptance rates include sample sizes (n) for each cell to allow readers to gauge precision.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for improving methodological transparency and statistical rigor. We will revise the manuscript to address each major comment by expanding the methods section, adding supporting statistical details, and clarifying the analysis. These changes will strengthen the paper without altering its core findings.
read point-by-point responses
-
Referee: [Data Collection and Methods] Insufficient details are provided on the AIDev dataset's provenance, PR selection process, task type classification methodology, and any bias mitigation steps. Without these, the reported acceptance rates (e.g., 82.1% for documentation) cannot be fully evaluated for representativeness or confounding factors.
Authors: We agree that additional methodological detail is required for full evaluation. The AIDev dataset comprises publicly available GitHub pull requests involving the five specified AI coding agents, collected from repositories active between January 2023 and August 2023. PRs were selected if they were authored by one of the agents, had a clear task description, and were closed with a merge decision; we excluded incomplete or bot-generated entries. Task type classification followed a two-stage process: automated keyword matching on titles and bodies (e.g., 'docs', 'fix', 'feature') followed by manual review by two authors, achieving Cohen's kappa of 0.82. Bias mitigation included repository-size stratification and exclusion of PRs from the same repository within 48 hours to reduce temporal clustering. We will insert a new subsection 'Dataset and Classification Protocol' with these details, including a flowchart of the selection process. revision: yes
-
Referee: [Results and Discussion] The assertion that the 16 percentage point gap exceeds typical inter-agent variance lacks accompanying data on variance or standard errors across agents per task category; this comparison is central to the dominance claim and requires explicit support.
Authors: We accept that the dominance claim requires quantitative backing. We will add Table 3 reporting acceptance rates, standard errors (via binomial proportion SE), and sample sizes for every agent-task combination. This table will show, for instance, that within the documentation category the inter-agent range is 75.4%–92.3% (average SD 5.8 pp), while the task-type gap between documentation and new features is 16 pp. We will also compute and report the mean inter-agent standard deviation across all tasks (4.9 pp) to directly support the statement that task type dominates agent differences. revision: yes
-
Referee: [Temporal Trends] The linear trend for Devin (+0.77% per week) is stated without the underlying regression statistics, such as R-squared, p-value, or confidence intervals, limiting assessment of its robustness.
Authors: We will expand the temporal analysis to include full ordinary-least-squares regression diagnostics for each agent. For Devin the slope is +0.77 %/week (SE = 0.11, R² = 0.71, p < 0.001, 95 % CI [0.55, 0.99]). The other four agents show slopes between –0.12 and +0.21 %/week, all with p > 0.15 and R² < 0.12. These statistics, together with residual plots, will be added to the results section and briefly discussed to confirm the robustness of the Devin trend. revision: yes
Circularity Check
No significant circularity in empirical analysis
full rationale
The paper is a direct empirical analysis of 7,156 PRs from the AIDev dataset, computing acceptance rates, temporal trends (+0.77% per week for Devin), and stratified Chi-square tests on observed data. No equations, fitted parameters, derivations, or self-citations appear in the provided text that reduce any claim to prior inputs by construction. The central result (task type as dominant factor, 82.1% documentation vs 66.1% new features) is a straightforward stratification of the dataset itself, with no load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption The AIDev dataset accurately captures typical usage patterns of the five agents without substantial selection or review bias.
- standard math Linear trend analysis and Chi-square tests are appropriate for the acceptance-rate data.
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
Our analysis suggests that the PR task type is a dominant factor influencing acceptance rates: documentation tasks achieve 82.1% acceptance compared to 66.1% for new features - a 16 percentage point gap that exceeds typical inter-agent variance for most tasks.
-
IndisputableMonolith/Foundation/RealityFromDistinction.leanreality_from_one_distinction unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
stratified Chi-square tests confirming statistically significant advantages over other agents in several task categories
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 2 Pith papers
-
Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles
AI coding agents are classified along a Collaborator-Assistant spectrum using an Initiator x Approver taxonomy on 29,585 PR lifecycles, revealing agent initiation in collaborator tools but near-universal human merge g...
-
Collaborator or Assistant? How AI Coding Agents Partition Work Across Pull Request Lifecycles
AI coding tools divide into collaborators that initiate most PRs and assistants that support human-led ones, yet humans retain merge authority across all five tools examined.
Reference graph
Works this paper leans on
-
[1]
Anthropic. 2025. Claude Code Documentation. https://docs.anthropic.com/ claude-code
work page 2025
-
[2]
Anysphere. 2025. Cursor: The AI-First Code Editor. https://cursor.sh
work page 2025
-
[3]
Mark Chen et al . 2021. Evaluating Large Language Models Trained on Code. arXiv preprint arXiv:2107.03374(2021)
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[4]
William S Cleveland. 1979. Robust Locally Weighted Regression and Smoothing Scatterplots.J. Amer. Statist. Assoc.74, 368 (1979), 829–836
work page 1979
-
[5]
Cognition AI. 2024. Introducing Devin, the First AI Software Engineer. https: //www.cognition.ai/blog/introducing-devin
work page 2024
-
[6]
1988.Statistical Power Analysis for the Behavioral Sciences(2nd ed.)
Jacob Cohen. 1988.Statistical Power Analysis for the Behavioral Sciences(2nd ed.). Lawrence Erlbaum Associates
work page 1988
-
[7]
Angela Fan, Beliz Gokkaya, Mark Harman, Mitessh Lytvyn, Shubho Sengupta, Shin Yoo, and Jie M. Zhang. 2024. Large Language Models for Software Engineer- ing: A Systematic Literature Review.ACM Transactions on Software Engineering and Methodology33, 7 (2024), 1–79
work page 2024
-
[8]
Ronald Aylmer Fisher. 1922. On the interpretation of𝜒 2 from contingency tables, and the calculation of P.Journal of the Royal Statistical Society85, 1 (1922), 87–94
work page 1922
-
[9]
GitHub. 2022. GitHub Copilot Research Recitation. https://docs.github.com/ copilot
work page 2022
-
[10]
Nicolas E. Gold and Jens Krinke. 2020. Ethical Mining: A Case Study on MSR Mining Challenges. InProceedings of the 17th International Conference on Mining Software Repositories, MSR 2020. ACM, Seoul, Korea, 265–276. doi:10.1145/3379597. 3387462
-
[11]
Jingzhi Gong, Yixin Bian, Luis de la Cal, Giovanni Pinna, Anisha Uteem, David Williams, Mar Zamorano, Karine Even-Mendoza, William B Langdon, Hector D Menendez, et al . 2025. GA4GC: Greener Agent for Greener Code via Multi- Objective. InSSBSE 2025 Challenge Case: Green SBSE
work page 2025
-
[12]
Hao He, Courtney Miller, Shyam Agarwal, Christian Kästner, and Bogdan Vasilescu. 2025. Does AI-Assisted Coding Deliver? A Difference-in-Differences Study of Cursor’s Impact on Software Projects.arXiv e-prints(2025), arXiv–2511
work page 2025
- [13]
-
[14]
Eirini Kalliamvakou, Georgios Gousios, Kelly Blincoe, Leif Singer, Daniel M German, and Daniela Damian. 2014. The Promises and Perils of Mining GitHub. InProceedings of the 11th Working Conference on Mining Software Repositories (MSR). ACM, 92–101
work page 2014
-
[15]
Hao Li et al. 2025. AIDev: A Large-Scale Dataset of AI-Generated Code Contri- butions. HuggingFace Datasets. https://huggingface.co/datasets/hao-li/AIDev
work page 2025
-
[16]
Hao Li et al. 2025. The Rise of AI Software Engineers: A Quantitative Analysis of AI-Authored Contributions on GitHub.arXiv preprint(2025)
work page 2025
-
[17]
Hao Li, Haoxiang Zhang, and Ahmed E. Hassan. 2025. The Rise of AI Team- mates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering.arXiv preprint arXiv:2507.15003(2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[18]
METR. 2025. Measuring the Impact of AI Coding Assistants on Developer Pro- ductivity
work page 2025
-
[19]
David Murillo et al. 2025. The Speed-Quality Tradeoff in AI-Assisted Software Development.Proceedings of ICSE(2025)
work page 2025
-
[20]
Karl Pearson. 1900. On the criterion that a given system of deviations from the probable in the case of a correlated system of variables is such that it can be reasonably supposed to have arisen from random sampling.The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science50, 302 (1900), 157–175
work page 1900
-
[21]
Sida Peng, Eirini Kalliamvakou, Peter Cihon, and Mert Demirer. 2023. The Impact of AI on Developer Productivity: Evidence from GitHub Copilot. InarXiv preprint arXiv:2302.06590
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[22]
Neil Perry, Megha Srivastava, Deepak Kumar, and Dan Boneh. 2023. Do Users Write More Insecure Code with AI Assistants?. InProceedings of the 2023 ACM SIGSAC Conference on Computer and Communications Security (CCS). ACM, 2785– 2799
work page 2023
-
[23]
2025.Comparing AI Coding Agents
Giovanni Pinna. 2025.Comparing AI Coding Agents. https://github.com/ giovannipinna96/Comparing_AI_Coding_Agents
work page 2025
-
[24]
David Williams, Max Hort, Maria Kechagia, Aldeida Aleti, Justyna Petke, and Federica Sarro. 2026. Empirical and Sustainability Aspects of Software Engineer- ing Research in the Era of Large Language Models: A Reflection. InProceedings of the 2026 IEEE/ACM 48th International Conference on Software Engineering: New Ideas and Emerging Results (ICSE-NIER’26)(...
-
[25]
Burak Yetistiren, Isik Ozsoy, and Eray Tuzun. 2022. Assessing the Quality of GitHub Copilot’s Code Generation. InProceedings of the 18th International Confer- ence on Predictive Models and Data Analytics in Software Engineering (PROMISE). ACM, 62–71
work page 2022
-
[26]
Albert Ziegler, Eirini Kalliamvakou, X Alice Li, Andrew Rice, Devon Rifkin, Shawn Simister, Ganesh Sittampalam, and Edward Aftandilian. 2022. Productivity assessment of neural code completion. InProceedings of the 6th ACM SIGPLAN International Symposium on Machine Programming. 21–29
work page 2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.