The Rise of AI Teammates in Software Engineering (SE) 3.0: How Autonomous Coding Agents Are Reshaping Software Engineering
Pith reviewed 2026-05-15 17:33 UTC · model grok-4.3
The pith
AIDev supplies the first large-scale dataset of 456,000 real pull requests from five autonomous coding agents, grounding the study of AI teammates in software development.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Autonomous coding agents are now operating at scale in open repositories, and the AIDev dataset of 456,000 pull requests supplies the structured, real-world traces required to study their behavior, including metadata on authorship, review timelines, code changes, and integration outcomes. This resource directly supports research on benchmarking, agent readiness, optimization, collaboration modeling, and AI governance without relying on synthetic benchmarks such as SWE-bench.
What carries the argument
AIDev, the dataset of pull requests with rich metadata on authorship, review timelines, code changes, and integration outcomes that functions as the empirical foundation for analyzing agent behavior in the wild.
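For orientation, a minimal sketch of how such traces might be loaded and summarized per agent is shown below; the file name and the column names (agent, state, created_at, merged_at) are assumptions made for illustration, not AIDev's documented schema, which ships with the repository linked in the abstract.

```python
# Minimal sketch, assuming a flat per-PR export; the file name and the
# column names are illustrative assumptions, not AIDev's documented schema.
import pandas as pd

prs = pd.read_csv("aidev_pull_requests.csv",
                  parse_dates=["created_at", "merged_at"])

# Per-agent volume and share of PRs that were ultimately merged.
summary = (
    prs.groupby("agent")
       .agg(n_prs=("state", "size"),
            merge_rate=("state", lambda s: (s == "merged").mean()))
       .sort_values("n_prs", ascending=False)
)
print(summary)
```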
If this is right
- Agents complete submissions faster than humans yet see lower acceptance rates, revealing a measurable trust gap.
- Individual developers can increase their output rate dramatically when using agents, with some matching years of prior work in days.
- Agent-generated changes register as structurally simpler on standard complexity metrics than comparable human changes.
- The dataset can serve as a living, extensible base for new benchmarks and governance studies in SE 3.0 workflows.
Where Pith is reading between the lines
- The observed speed-acceptance mismatch suggests a need for new review tools that surface agent-specific risk signals.
- Repository maintainers could use AIDev-style traces to set policy thresholds for automated contributions.
- Comparison of agent performance across project sizes or languages becomes feasible for the first time with this scale of data.
Load-bearing premise
The pull requests from the five selected agents and the repositories that expose their activity accurately represent typical in-the-wild agent behavior without major selection bias.
What would settle it
A follow-up collection that samples a wider set of agents or repositories and finds substantially different acceptance rates or code-complexity distributions would indicate that AIDev patterns do not generalize.
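One way such a follow-up comparison could be run is sketched below: a two-proportion z-test on acceptance rates and a Mann-Whitney U test on code-complexity distributions. None of the counts or samples come from the paper; the follow-up collection is hypothetical and all numbers are placeholders.

```python
# Sketch of the settling comparison: AIDev vs. a hypothetical follow-up
# sample drawn from other agents or repositories. Every number below is a
# placeholder, not a result reported by the paper.
import numpy as np
from scipy import stats
from statsmodels.stats.proportion import proportions_ztest

merged = np.array([180_000, 9_500])   # merged PRs: AIDev, follow-up (placeholders)
total = np.array([456_000, 20_000])   # PRs sampled in each collection (placeholders)
z_stat, p_accept = proportions_ztest(merged, total)

rng = np.random.default_rng(0)
cc_aidev = rng.gamma(2.0, 2.0, size=5_000)      # stand-in complexity values
cc_followup = rng.gamma(2.3, 2.0, size=5_000)   # stand-in complexity values
u_stat, p_cc = stats.mannwhitneyu(cc_aidev, cc_followup)

print(f"acceptance rates: z = {z_stat:.2f}, p = {p_accept:.2g}")
print(f"complexity distributions: U = {u_stat:.0f}, p = {p_cc:.2g}")
```

A materially different acceptance rate or a shifted complexity distribution in the follow-up sample would be the signal that AIDev's patterns do not generalize.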
Original abstract
The future of software engineering--SE 3.0--is unfolding with the rise of AI teammates: autonomous, goal-driven systems collaborating with human developers. Among these, autonomous coding agents are especially transformative, now actively initiating, reviewing, and evolving code at scale. This paper introduces AIDev, the first large-scale dataset capturing how such agents operate in the wild. Spanning over 456,000 pull requests by five leading agents--OpenAI Codex, Devin, GitHub Copilot, Cursor, and Claude Code--across 61,000 repositories and 47,000 developers, AIDev provides an unprecedented empirical foundation for studying autonomous teammates in software development. Unlike prior work that has largely theorized the rise of AI-native software engineering, AIDev offers structured, open data to support research in benchmarking, agent readiness, optimization, collaboration modeling, and AI governance. The dataset includes rich metadata on PRs, authorship, review timelines, code changes, and integration outcomes--enabling exploration beyond synthetic benchmarks like SWE-bench. For instance, although agents often outperform humans in speed, their PRs are accepted less frequently, revealing a trust and utility gap. Furthermore, while agents accelerate code submission--one developer submitted as many PRs in three days as they had in three years--these are structurally simpler (via code complexity metrics). We envision AIDev as a living resource: extensible, analyzable, and ready for the SE and AI communities. Grounding SE 3.0 in real-world evidence, AIDev enables a new generation of research into AI-native workflows and supports building the next wave of symbiotic human-AI collaboration. The dataset is publicly available at https://github.com/SAILResearch/AI_Teammates_in_SE3.
Keywords: AI Agent, Agentic AI, Coding Agent, Agentic Coding, Software Engineering Agent
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces AIDev, a dataset of 456,000 pull requests generated by five autonomous coding agents (OpenAI Codex, Devin, GitHub Copilot, Cursor, Claude Code) across 61,000 repositories and 47,000 developers. It positions the release as the first large-scale empirical resource for studying AI teammates in SE 3.0, enabling research on benchmarking, collaboration modeling, and governance, while reporting two headline observations: agents submit PRs faster than humans but with lower acceptance rates, and the changes are structurally simpler by code-complexity metrics.
Significance. If the collection process is fully documented and selection effects are quantified, AIDev would constitute a valuable public resource that moves the field beyond synthetic benchmarks such as SWE-bench. The scale and open availability could support reproducible studies of agent readiness, optimization, and human-AI workflow modeling.
Major comments (2)
- [Data Collection / Methods] Data-collection section: the manuscript supplies no description of how PRs were identified as originating from the five named agents, what attribution heuristics or repository filters were applied, or any validation steps against misattribution. Because the central claim is that the 456k PRs furnish a representative empirical foundation, the absence of these details leaves the representativeness assumption unverified and load-bearing for all downstream uses.
- [Results / Empirical Observations] Results / Observations paragraph: the statements that agents are faster yet less frequently accepted and produce simpler changes are presented without accompanying quantitative metrics (e.g., median time-to-merge, acceptance-rate deltas with confidence intervals, or complexity-measure definitions), statistical controls, or comparison baselines. These observations are used to illustrate the dataset’s utility, yet cannot be evaluated for robustness.
Minor comments (2)
- [Abstract / Introduction] The abstract and introduction repeatedly use the phrase “first large-scale dataset” without citing or contrasting prior public PR corpora that include agent-generated activity; a brief related-work paragraph would clarify novelty.
- [Dataset Description] The GitHub repository link is given but no summary of the exact schema, file formats, or example records is provided in the paper; readers cannot assess usability without downloading the data.
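To make the second minor comment concrete, here is a purely hypothetical sketch of the kind of per-record summary the comment asks the paper to include; every field name is an assumption, not the dataset's actual layout.

```python
# Hypothetical shape of a single AIDev PR record; field names are
# illustrative assumptions, not the released schema.
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class AgentPullRequest:
    pr_number: int                  # PR number within its repository
    repo: str                       # "owner/name"
    agent: Optional[str]            # e.g. "devin"; None for human-authored PRs
    author_login: str               # account that opened the PR
    created_at: datetime
    merged_at: Optional[datetime]   # None if closed without merging
    state: str                      # "merged" | "closed" | "open"
    additions: int                  # lines added
    deletions: int                  # lines removed
    changed_files: int
    review_comments: int            # review comments from humans and bots
```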
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which highlight important areas for improving the clarity and rigor of our manuscript. We agree that additional documentation is needed and will revise the paper accordingly. Our point-by-point responses follow.
Point-by-point responses
- Referee: [Data Collection / Methods] Data-collection section: the manuscript supplies no description of how PRs were identified as originating from the five named agents, what attribution heuristics or repository filters were applied, or any validation steps against misattribution. Because the central claim is that the 456k PRs furnish a representative empirical foundation, the absence of these details leaves the representativeness assumption unverified and load-bearing for all downstream uses.
  Authors: We agree that the manuscript currently provides insufficient detail on the data collection and attribution process. The full pipeline—including heuristics based on commit author metadata, PR titles/descriptions, repository tags, and cross-referencing with known agent activity patterns—is implemented and documented in the public GitHub repository (https://github.com/SAILResearch/AI_Teammates_in_SE3). In the revised manuscript we will add a dedicated subsection to the Methods section that explicitly describes the identification heuristics, applied repository and PR filters, and validation steps (including manual sampling of 500 PRs and error-rate estimates). This addition will directly address concerns about representativeness and allow readers to assess potential selection effects.
  Revision: yes
- Referee: [Results / Empirical Observations] Results / Observations paragraph: the statements that agents are faster yet less frequently accepted and produce simpler changes are presented without accompanying quantitative metrics (e.g., median time-to-merge, acceptance-rate deltas with confidence intervals, or complexity-measure definitions), statistical controls, or comparison baselines. These observations are used to illustrate the dataset’s utility, yet cannot be evaluated for robustness.
  Authors: The referee is correct that the headline observations are stated at a high level without supporting statistics in the current text. These claims are based on analyses performed on the released AIDev dataset, but the manuscript does not report the underlying numbers or controls. In the revision we will introduce a new “Preliminary Empirical Observations” subsection that supplies concrete metrics: median time-to-merge (with interquartile ranges), acceptance rates with 95% confidence intervals and deltas relative to human PRs in the same repositories, explicit definitions of the complexity metrics used (e.g., cyclomatic complexity, change size in LOC), and basic statistical comparisons. Where data permit, we will also note controls for repository and developer characteristics. This will make the illustrative claims evaluable and strengthen the demonstration of the dataset’s utility.
  Revision: yes
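As a rough illustration of the attribution heuristics described in the first response above, a minimal sketch follows. The marker strings and the validation helper are illustrative assumptions, not the filters or sampling procedure actually used to build AIDev.

```python
# Illustrative attribution heuristic; the marker strings are assumptions,
# not the filters actually used in the AIDev pipeline.
from typing import Optional

AGENT_MARKERS = {
    "openai-codex": ("codex",),
    "devin": ("devin",),
    "github-copilot": ("copilot",),
    "cursor": ("cursor",),
    "claude-code": ("claude",),
}

def attribute_agent(author_login: str, pr_title: str, pr_body: str) -> Optional[str]:
    """Return the suspected agent for a PR, or None if it looks human-authored."""
    haystack = f"{author_login} {pr_title} {pr_body}".lower()
    for agent, markers in AGENT_MARKERS.items():
        if any(marker in haystack for marker in markers):
            return agent
    return None

def estimated_precision(manual_labels: list) -> float:
    """Validation step: fraction of a manually checked sample attributed correctly."""
    return sum(manual_labels) / len(manual_labels) if manual_labels else float("nan")
```

The second response promises medians, interquartile ranges, and confidence intervals; a minimal sketch of those computations follows, assuming a flat per-PR table in which an empty agent field marks human-authored PRs. The input file and column names are assumptions, and no figures from the paper are reproduced.

```python
# Sketch of the promised metrics: acceptance rate with a Wilson 95% CI and
# median time-to-merge with IQR, split into agent vs. human PRs. The input
# file and column names are assumptions, not the released schema.
import pandas as pd
from statsmodels.stats.proportion import proportion_confint

prs = pd.read_csv("aidev_pull_requests.csv",
                  parse_dates=["created_at", "merged_at"])
prs["hours_to_merge"] = (prs["merged_at"] - prs["created_at"]).dt.total_seconds() / 3600
prs["cohort"] = prs["agent"].notna().map({True: "agent", False: "human"})

for cohort, group in prs.groupby("cohort"):
    merged = int((group["state"] == "merged").sum())
    lo, hi = proportion_confint(merged, len(group), alpha=0.05, method="wilson")
    q25, q50, q75 = group["hours_to_merge"].quantile([0.25, 0.50, 0.75])
    print(f"{cohort}: acceptance {merged / len(group):.1%} "
          f"(95% CI {lo:.1%} to {hi:.1%}), median time-to-merge {q50:.1f} h "
          f"(IQR {q25:.1f} to {q75:.1f} h)")
```

Wilson intervals are a conservative choice that stays well behaved when per-stratum acceptance counts are small; the repository- and developer-level controls the response mentions would be layered on top of comparisons like this.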
Circularity Check
No circularity: data release paper with no derivations or fitted predictions
Full rationale
The paper introduces the AIDev dataset of 456k PRs from five agents across 61k repositories and provides descriptive observations (e.g., faster submission but lower acceptance rates, structurally simpler changes). No equations, models, or predictions are derived; the contribution is the open data release itself. No self-citations, ansatzes, or uniqueness claims are used to justify core results, and no step reduces by construction to fitted inputs or prior self-referential work. The analysis chain is self-contained as an empirical resource without circular reduction.
Axiom & Free-Parameter Ledger
Not applicable: the contribution is an open data release, so no axioms, derived models, or fitted parameters enter the analysis.
Forward citations
Cited by 21 Pith papers
- Edit, But Verify: An Empirical Audit of Instructed Code-Editing Benchmarks
  The two main benchmarks for LLM instructed code editing over-represent Python, miss common real-world domains and edit types, and have test coverage issues that limit what they measure.
- To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study
  AI-generated code requires less maintenance than human code, with humans handling the majority of changes that are mostly feature extensions rather than bug fixes.
- Do AI Coding Agents Log Like Humans? An Empirical Study
  AI agents modify logging less often than humans in 58.4% of repositories but produce higher log density when they change it; explicit logging instructions are rare (4.7%) and ignored 67% of the time, with humans perfo...
- AgenticFlict: A Large-Scale Dataset of Merge Conflicts in AI Coding Agent Pull Requests on GitHub
  AgenticFlict is a public dataset of 29K+ textual merge conflicts from AI agent PRs, collected via merge simulation on 107K processed PRs and showing a 27.67% conflict rate with variation across agents.
- A Large-Scale Empirical Study of AI-Generated Code in Real-World Repositories
  A large-scale study of real-world repositories finds that AI-generated code differs from human-written code in complexity, structural traits, defect indicators, and commit-level activity patterns.
- Mining Type Constructs Using Patterns in AI-Generated Code
  AI-generated TypeScript code uses the 'any' type 9x more often than human code and employs more advanced type constructs that can ignore checks, but agentic PRs have 1.8x higher acceptance rates.
- To What Extent Does Agent-generated Code Require Maintenance? An Empirical Study
  AI-generated code requires less maintenance than human-written code, mostly involving feature additions by humans rather than bug fixes.
- Hot Fixing in the Wild
  Hot fixes show urgency patterns with reduced collaboration and testing, differing from regular fixes, and human versus AI agents display over 10 distinct repair behaviors in large-scale GitHub data.
- On the Footprints of Reviewer Bots Feedback on Agentic Pull Requests in OSS GitHub Repositories
  Reviewer bots' higher comment volume on AI agent PRs is associated with slower resolutions and poorer average feedback quality, while feedback quality itself has no association with PR outcomes.
- Insights into Security-Related AI-Generated Pull Requests
  AI-generated security pull requests frequently contain a small set of recurring weaknesses, with many flawed ones merged and rejections driven by process factors rather than technical issues.
- ORBIT: Guided Agentic Orchestration for Autonomous C-to-Rust Transpilation
  ORBIT achieves 100% compilation success and 91.7% test success on 24 mostly large programs from CRUST-Bench by using dependency-aware orchestration and iterative verification, outperforming prior static and baseline tools.
- Agentic Business Process Management: A Research Manifesto
  Agentic Business Process Management reframes BPM around autonomous agents that must exhibit framed autonomy, explainability, conversational actionability, and self-modification to keep their actions aligned with organ...
- Characterizing Faults in Agentic AI: A Taxonomy of Types, Symptoms, and Root Causes
  An empirical study of real-world issues yields a taxonomy of 34 fault types, symptoms, and root causes in agentic AI systems, validated by 145 practitioners.
- CoT-Guard: Small Models for Strong Monitoring
  CoT-Guard is a 4B model using SFT and RL that achieves 75% G-mean^2 on hidden objective detection under prompt and code manipulation attacks, outperforming several larger models.
- These Aren't the Reviews You're Looking For: How Humans Review AI-Generated Pull Requests
  AI-generated PRs on GitHub receive fewer human reviews and more AI-mediated interactions than human-authored PRs.
- KISS Sorcar: A Stupidly-Simple General-Purpose and Software Engineering AI Assistant
  KISS Sorcar introduces a simple layered agent framework and VS Code IDE that reaches 62.2% pass rate on Terminal Bench 2.0 by combining ReAct execution, summarization-based continuation, parallel tools, persistent his...
- Scaling Human-AI Coding Collaboration Requires a Governable Consensus Layer
  Agentic Consensus replaces code as the main artifact with a typed property graph world model that maintains commitments and evidence through synchronization operators, shifting evaluation to alignment fidelity and con...
- Beyond Functional Correctness: Design Issues in AI IDE-Generated Large-Scale Projects
  AI IDEs with structured guidance can produce functional large-scale code but frequently introduce design flaws such as duplication, complexity, and principle violations that risk long-term maintainability.
- From Industry Claims to Empirical Reality: An Empirical Study of Code Review Agents in Pull Requests
  Code review agents achieve 45.20% merge rate on PRs versus 68.37% for humans, with 60.2% of agent-only closed PRs showing 0-30% signal quality.
- Beyond the 'Diff': Addressing Agentic Entropy in Agentic Software Development
  Agentic entropy names the systemic drift in AI coding agents away from architectural intent; a new framework using conformity seeding, reasoning monitoring, and causal graph interfaces supplies process-level oversight...
- Fairness in Multi-Agent Systems for Software Engineering: An SDLC-Oriented Rapid Review
  A rapid review of fairness in LLM-enabled multi-agent systems for the software development lifecycle concludes that the field lacks standardized evaluations, broad coverage, and effective governance, leaving it unprep...